Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automations crate and various chores #1696

Merged
merged 5 commits into from
Oct 15, 2024
Merged

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Oct 12, 2024

Description:

Chores:

  • Support NO_COLOR so we can fix Cloud Run agent logs.
  • Truncate overly long Apply action_descriptions that we currently choke on.
  • Tighten schema inference to a maximum depth of 12.
  • Compact all migrations by refreshing from the production DB.

Then, add an automations crate for durable, distributed execution of tasks modeled as coroutines.

#1667
#1697

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

Which disablses ANSI color logging output.
These are frequently so large, they overflow our current 1MB ops log
line limit.
Deeply nested schemas are not useful and, in practice, are always
recursive data-structures.

Fixes #1667
Minor adjustments to account for drift between migrations and the
production DB.
Introduce a new crate `automations` which offers a programming model
for distributed execution of arbitrary, stateful tasks modeled as coroutines.

A supporting migration introduces table `internal.tasks` which tracks
the states of tasks, including task inner states and pending recieved
messages.

Implement a Fibonacci executor which is a load test and test-bed for
automations behaviors, with various tune-able parameters.
@jgraettinger jgraettinger added chore Internal cleanups, refactoring, or improvements change:planned This is a planned change labels Oct 12, 2024
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about the transaction in persist_outcome got me thinking. Supposing we don't use a transaction for sending messages or spawning tasks, would there still be any advantage to doing so by returning the task/message as part of the PollOutcome? We could instead pass in a helper object that tasks could call methods on to spawn children or send messages. I don't want to get too far ahead of myself, though, so hoping we can first talk though error handling

create role marketplace_integration;
create role gatsby_reader;
create role github_action_connector_refresh;
create role wgd_automation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need wgd_automation for something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, it's granted some usages in the fetched / compacted schema, so it needs to exist.


async fn persist_outcome(
outcome: PollOutcome<BoxedRaw>,
txn: &mut sqlx::PgConnection,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started out thinking: txn is a bit misleading. db maybe better?

But then I got to wondering whether we should be using a transaction here. I'm trying to think through whether there's any scenarios that we wouldn't recover gracefully from, if we were to fail midway through this function. It seems like we might create a child task and then fail before updating the task state. So seems like we'd have at-least-once semantics on spawning child tasks, and also on sending messages to other tasks. That seems fine (I think?) for sending messages, but I wonder whether we'd want a stronger guarantee for creating child tasks.

It seems like maybe it could be good to just use a transaction here, or else try to cram all the updates into one giant query with CTEs.

Copy link
Member Author

@jgraettinger jgraettinger Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a transaction -- it's just been downcast to the PgConnection. I did that because in newer versions of sqlx, transaction is not an executor and can't be used with sqlx::query -- though admittedly that downcast coulda been done just as easily in this routine 🤷. I was also setting up for a test case I didn't write, where persist_outcome could be driven, inspected, and rolled back.

We definitely want full transaction semantics around task management!

(ready.task.id, ready.task.type_, ready.task.parent_id);

if let Err(err) = executors::poll_task(ready, heartbeat_timeout).await {
tracing::warn!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we might want to make these errors more visible, perhaps by persisting them on the tasks table? I'm supposing that the idea is we should do error handling within the poll function, and this is really only for errors that we didn't otherwise catch or anticipate, which seems reasonable. Just want to check whether we're on the same page there, that the state ought to expose error message if the task is doing its own error handling.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly -- the place I got to was tasks are free to handled expected errors in any way they like, including sending them as messages or modeling them in task state, but they're not an "error" from this crate's perspective.

@jgraettinger
Copy link
Member Author

As discussed over VC, persisting outcomes is transactional. Anything else to cover before ✔️ ?

Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I tried playing around with how controllers might be implemented as an Executor, and immediately ran into the issue of needing some way to associate a particular catalog_name with a tasks row. I'm supposing that we'll need to add a column to live_specs to relate each row to a tasks row that corresponds to the controller. We'll need that in order to trigger controller runs after specs are published, or to trigger tasks in response to inferred schema updates, shard failures, etc.
The migration from controller_jobs to tasks is going to be a bit tricky also. I'm currently thinking that the first order of business for each of the new controller tasks could be to migrate its state from controller_jobs into the tasks table.
I don't see any blockers to either of those, so I think we're good to merge here.

@jgraettinger
Copy link
Member Author

I'm supposing that we'll need to add a column to live_specs to relate each row to a tasks row that corresponds to the controller.

Yea, i think that's the right pattern. The entity table for a catalog name contains related task IDs.

I'm currently thinking that the first order of business for each of the new controller tasks could be to migrate its state from controller_jobs into the tasks table.

Agreed, I think that's the right pattern.

But it also raises a gap with this implementation which I think needs to be addressed: an Executor must be able to return "side-effects" which commit transactionally with a PollOutcome, such as cleaning up a controller_jobs row, so that we're guaranteed they commit together.

I'm gonna merge but think this needs to be a next step.

@jgraettinger jgraettinger merged commit 0aa1dd6 into master Oct 15, 2024
6 checks passed
@jgraettinger jgraettinger deleted the johnny/automations branch October 15, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change chore Internal cleanups, refactoring, or improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants