control-plane: design for improved data-plane monitoring and shard restarts #1666

jgraettinger · 2024-09-30T18:08:01Z

Today, we use a rather blunt means for restarting failed shards (every five minutes, we un-assign any shard which is currently FAILED).

This is simultaneously too long a delay for the first flaky failure of an otherwise healthy task, and too short a delay for a task which only ever errors. Instead, we'd like to track a task's failures over time and apply a more graduated back-off if it continues to fail after successive restarts, likely ultimately disabling the task automatically after sustained failure.

We also now have multiple data-planes, and we want a consolidated mechanism for managing shard failures across all data-planes.

The runtime invokes a new /notify/shard-failure control-plane API to report shard failures that have occurred within a data-plane. At the moment, this API verifies the data-plane token and logs the failure, but takes no further action. Update taskBase.heartbeatLoop() to perform this notification if the shard's primary loop exits with a non-cancellation error. Issue #1666

jgraettinger self-assigned this Sep 30, 2024

kiahna-tucker mentioned this issue Oct 8, 2024

Release the data-plane selector in client workflows estuary/ui#1297

Open

1 task
