control-plane: design for improved data-plane monitoring and shard restarts #1666

Open
jgraettinger opened this issue Sep 30, 2024 · 0 comments

jgraettinger commented Sep 30, 2024

Today we restart failed shards by rather blunt means: every five minutes, we un-assign any shard that is currently FAILED.

This delay is simultaneously too long for a first, flaky failure of an otherwise healthy task and too short for a task that only ever errors. Instead, we'd like to track a task's failures over time and apply a more graduated back-off if it keeps failing after successive restarts, likely ending with the task being disabled automatically after sustained failure.

We also now have multiple data-planes, and we want a consolidated mechanism for managing shard failures across all data-planes.

@jgraettinger jgraettinger self-assigned this Sep 30, 2024
jgraettinger added a commit that referenced this issue Sep 30, 2024
The runtime invokes a new /notify/shard-failure control-plane API which
is told of shard failures that have occurred within a data-plane.

At the moment, this API verifies the data-plane token and logs the
failure, but takes no further action.

Update taskBase.heartbeatLoop() to perform this notification if the
shard's primary loop exits with a non-cancellation error.

Issue #1666