Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timed reprocess system sometimes gets stuck #108

Open
joelverhagen opened this issue Aug 20, 2024 · 2 comments
Open

Timed reprocess system sometimes gets stuck #108

joelverhagen opened this issue Aug 20, 2024 · 2 comments

Comments

@joelverhagen
Copy link
Member

The "timed reprocess" system sometimes gets stuck. For this reason, it's not ready to be enabled by default.

@joelverhagen
Copy link
Member Author

2024-08-18

The LoadSymbolPackageArchive driver was stuck in the Working state.
image

It is blocked waiting for enqueue table scan to complete.

if (!await AreTableScanStepsCompleteAsync(taskStateKey.StorageSuffix, taskStateKey.PartitionKey) || !await AreLeafScansCompleteAsync(scan))

All of the leaf scans are complete, but 2 of the enqueue task states are left over. This should not be possible. This indicates that the table scan enqueue succeeded (because all of the leaf scans are done) but somehow the enqueue task states got recreated or failed to be deleted somehow.

The parameters for the leftover task state record are here:

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-",
        "m": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "u": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-\uFFFF"
    }
}
{
    "n": "tps.pkq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "r": "1.0.9"
    }
}

Task state entities (2 left dangling):
image

No leaf scans left:
image

The task state entities are added before their corresponding message (which is desired) by perhaps a transient table or queue error, or slowness cause the problem.

await _taskStateStorageService.AddAsync(
originalMessage.TaskStateKey.StorageSuffix,
originalMessage.TaskStateKey.PartitionKey,
taskStates);
await _enqueuer.EnqueueAsync(tableScanMessages);

The only hint I could from the existing telemetry is that some table batch operations failed with a server timeout. The storage no-op system kicked in, but seemed to be related to other buckets.

Next step: enhance telemetry around message enqueue and table entity changes so we can better understand what happened.

@joelverhagen
Copy link
Member Author

joelverhagen commented Sep 4, 2024

2024-09-04

The LoadSymbolPackageArchive driver is stuck again in the Working state.
image

Similar to last time, there are no leaf scans left but some table scan task records.
Table name: taskstatedqcwugivcvqu5pbqns44odgk3m
image

Index scan started: 2024-09-04T10:39:01.5008431Z
Index scan timestamp (last updated): 2024-09-04T10:59:33.276884Z
First moved to the Working state: 2024-09-04T10:59:33.2711682Z


PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans
RK: step-gkxlxqhuij4mqsfgmkz55lro4qwnvqkuknoro4lb5jgsm6lhcxbq
Timestamp: 2024-09-04T11:00:21.9243814Z
Client request ID: 894af93f-7a44-4ae1-83d7-9db87ea4b382

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-definitelytyped.net",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-\uFFFF"
    }
}

bucketedpackages table:
image

symbolpackagearchives table:
image

This page was processed, because: definux.emeraude.presentation 1.0.10-beta~00 in the symbolpackagearchives table was updated at 2024-09-04T11:13:47.5050676Z (after the catalog index scan began)

Similarly definux.emeraude.resources 1.2.0.13~00 was updated at 2024-09-04T11:13:47.6952307Z


PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans
RK: step-xvnazptqsva5fiv2cxm6oiq6vfqs2u6w5umjyuksclh5fes2epiq
Timestamp: 2024-09-04T11:00:22.2452016Z
Client request ID: fba1936c-ed58-4aa9-9605-f2419e07b753

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-devrating.sqliteclient",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-\uFFFF"
    }
}

bucketedpackages table:
image

symbolpackagearchives table:
image

devteam.testtool.contracts 1.0.108-beta~00 was updated at 2024-09-04T11:13:53.5514588Z

dexih.connections.postgressql 1.1.957~00 was updated at 2024-09-04T11:13:53.7346526Z

Exceptions

Only two records after the catalog index scan began.

Azure.Data.Tables.TableTransactionFailedException while cleaning up leaf items

image

Timestamp: 2024-09-04T11:05:18.304516Z
Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17
Formatted message: [transient] An exception was thrown while processing a batch of 30 messages with schema cls.
Operation ID: 71e7c38c95cc718539c41908d2bfb3c1

This happened inside here (while cleaning up completed leaf scan records):

await UpdateCatalogLeafScansAsync(scanToMessage.Keys.ToList(), _storageService.DeleteAsync);

Azure.Data.Tables.TableTransactionFailedException somewhere in CatalogScanStorageService

The call stack is truncated... not sure why.
image

Timestamp: 2024-09-04T11:05:18.3031638Z
Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17
Operation ID: 71e7c38c95cc718539c41908d2bfb3c1
FormattedMessage: [transient] Batch failed due to HTTP 500, with storage suffix 'dqcwugivcvqu5pbqns44odgk3m', first partition key '08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B636-boost_date_time-vc140', first row key '1.59.0'.

This is a duplicate of the previous.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant