
Bug: scheduler has negative "buffer" value #852

Open
sharnoff opened this issue Mar 9, 2024 · 1 comment

sharnoff commented Mar 9, 2024

Environment

Prod (occurred twice recently)

Steps to reproduce

Not yet clear. Here's an example:

{"level":"info","ts":1709922373.111944,"logger":"autoscale-scheduler","caller":"plugin/state.go:1379","msg":"Adding VM pod to node","action":"read cluster state","virtualmachine":{"namespace":"default","name":"compute-falling-cake-a6d84vya"},"pod":{"namespace":"default","name":"compute-falling-cake-a6d84vya-dv647"},"node":"i-0d216a75a106c181d.us-west-2.compute.internal","verdict":{"cpu":"pod = 0.25/0.25 (node 14.25 -> 14.5 / 127.61, 0 -> 4.294967046e+06 buffer)","mem":"pod = 1Gi/1Gi (node 57Gi -> 58Gi / 519497968Ki, 0 -> -1Gi buffer"}}

I think it's entirely caused by faulty logic in (*AutoscaleEnforcer).readClusterState(), but I haven't looked into it thoroughly.

And tbh, it's a little weird that readClusterState has its own implementation of reserve logic, rather than using the shared version that was added in #666.
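
For illustration, here's the shape of clamping that any shared reserve pathway needs so the buffer can't go below zero; the names and types here are illustrative, not the scheduler's actual API:

```go
package main

import "fmt"

// nodeResource is an illustrative stand-in for the scheduler's
// per-node accounting; the real types live in plugin/state.go.
type nodeResource struct {
	Reserved uint32 // amount currently reserved on the node
	Buffer   uint32 // headroom pre-reserved for autoscaling
}

// reserve charges a pod's usage against the buffer, clamping at zero
// rather than letting an unsigned subtraction wrap around.
func (r *nodeResource) reserve(pod uint32) {
	r.Reserved += pod
	if pod >= r.Buffer {
		r.Buffer = 0 // clamp: buffer is never negative
	} else {
		r.Buffer -= pod
	}
}

func main() {
	r := nodeResource{Buffer: 0}
	r.reserve(250)
	fmt.Println(r.Buffer) // 0, not 4294967046
}
```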

Expected result

Any buffer value from adding a VM should be non-negative.

Actual result

The memory "buffer" value was negative (see: -1Gi buffer), and the value for CPU underflowed.

Other logs, links

sharnoff added the t/bug (Issue Type: Bug) and c/autoscaling/scheduler (Component: autoscaling: k8s scheduler) labels on Mar 9, 2024

Omrigan commented Mar 14, 2024

Could this be solved by #840?

sharnoff added a commit that referenced this issue on Apr 15, 2024:

In short, readClusterState is super complicated, separately reimplements
the reserveResources() logic, and may be the source of several
startup-related bugs (probably #671 and #852).

So, given that we *already* have a pathway for updating our internal
state from changes in the cluster (i.e. the watch events), we should
just use that instead.

sharnoff added further commits with the same message that referenced this issue on Apr 16, May 10, May 21, and twice on May 22, 2024.
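
For context, the approach described in that commit message, reusing the watch-event pathway at startup instead of a bespoke readClusterState(), might look roughly like this sketch; all names here are illustrative except readClusterState and reserveResources, which come from the thread:

```go
package main

import "fmt"

// Illustrative event and state types; the real ones live in the
// scheduler plugin and its watch machinery.
type podEvent struct {
	name string
	cpu  uint32 // millicores requested
}

type clusterState struct {
	reservedCPU uint32
}

// handlePodEvent is the single reserve pathway, used both for live
// watch events and for the initial listing at startup.
func (s *clusterState) handlePodEvent(ev podEvent) {
	s.reservedCPU += ev.cpu // stands in for the shared reserveResources() logic
}

func main() {
	s := &clusterState{}
	// Startup: replay the initial pod listing through the same handler,
	// instead of a separate readClusterState() reimplementation.
	for _, ev := range []podEvent{{"compute-a", 250}, {"compute-b", 500}} {
		s.handlePodEvent(ev)
	}
	fmt.Println(s.reservedCPU) // 750
}
```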