stress test - keep increasing the bar #529

Open
dougsland opened this issue Sep 1, 2023 · 0 comments
Labels
backlog This is next up in priority testing This issue adds or improves the testing

dougsland commented Sep 1, 2023

Describe the bug

Following our previous stress test, we have received more feedback from owners, developers, users, and others.
Based on that feedback, let's improve the tools and tests so that reports are generated regularly.

Description:

BlueChi extends D-Bus for multi-node environments. This means some D-Bus load from an external system (bluechi-agent) shows up where bluechi (master) is running. A set of basic stress tests could measure the time until a change on a bluechi-agent becomes visible on the master node. This measurement could be repeated under the variety of conditions listed below.
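A minimal sketch of that measurement, assuming the test harness can supply two hooks: one that triggers a change on a bluechi-agent and one that checks whether the change is visible on the master. Both hooks (`trigger`, `is_visible`) and the default timeout and poll interval are hypothetical placeholders, not part of BlueChi.

```python
import time
from typing import Callable


def measure_propagation(
    trigger: Callable[[], None],
    is_visible: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.001,
) -> float:
    """Return the seconds between triggering a change and first observing it."""
    start = time.monotonic()
    trigger()  # hypothetical hook, e.g. change a unit state on the agent
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if is_visible():  # hypothetical hook, e.g. query the state on the master
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    raise TimeoutError(f"change not visible within {timeout_s} s")
```

Because this version polls, each sample can overstate the delay by up to one poll interval (plus scheduler jitter); a signal-driven observer would be more precise but depends on the actual API used.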

Before we start, I would recommend addressing this issue first so that we can iterate faster: containers/qm#164

  • number of bluechi-agents attached to a master

  • number of state changes per time interval per bluechi-agent

  • various NIC port speeds and loads (try causing congestion with iperf, using close to 100% of the network bandwidth)

  • various CPU/memory/disk/cache utilizations (what happens if CPU load is high? does it mean that state changes are reflected later and if so what delay does it cause?)

  • various base loads on d-bus (simulate a large number of messages being sent and received to measure the system's performance and responsiveness)

  • The same measurements can be done vice versa: what happens if these conditions materialize on a bluechi-agent while the master node is idle?

  • Another test could be fault injection in the network layer, where you introduce random packet loss and measure the error rate (number of erroneous states, mean time until the system recovers from failure)

  • Heartbeat interval N (see https://github.com/containers/bluechi/blob/main/config/agent/agent.conf#L30)
    Each agent emits a small signal on the peer D-Bus to the controller every N ms. This means that even an "idle" system generates a small amount of traffic. There was a bug where a missing or 0 value would lead the agent to spam as many signals as possible. Since the controller does basically nothing on receiving the signal, this should affect only the agent. So what happens if it is set to 1 ms? Does having 100 or 1000 agents spamming a signal every millisecond to the controller really not affect it?

  • Monitors (see https://github.com/containers/bluechi/blob/main/data/org.eclipse.bluechi.Monitor.xml)
    When a monitor with a subscription is registered, the state changes of systemd units are forwarded from the agent to the controller (which forwards them to the monitor). So monitors and subscriptions can be used to increase the traffic as well as the workload on the controller (the agent keeps track of the unit state changes anyway, so its workload changes little).
    For example, starting units in a short period of time on 100 agents while a monitor subscription watching all nodes and all units is active would result in a huge peak load on the controller.
    Side note: The monitor can easily be set up via the BlueChi Python bindings.
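To put the heartbeat question above in numbers, here is a small worked calculation of the aggregate signal rate arriving at the controller. It assumes every agent sends exactly one signal per interval; the 1000 ms row is an illustrative comparison value only, not a claim about the configured default.

```python
def heartbeat_signals_per_second(agents: int, interval_ms: float) -> float:
    """Aggregate heartbeat signals per second reaching the controller."""
    if agents < 0:
        raise ValueError("agents must be non-negative")
    if interval_ms <= 0:
        # A 0 interval has no finite rate; the issue mentions a past bug where
        # this case made the agent send signals as fast as it could.
        raise ValueError("interval_ms must be positive")
    return agents * 1000.0 / interval_ms


# 1000 ms is an illustrative value, not the documented default.
for agents in (100, 1000):
    for interval_ms in (1, 1000):
        rate = heartbeat_signals_per_second(agents, interval_ms)
        print(f"{agents:>5} agents @ {interval_ms:>4} ms -> {rate:,.0f} signals/s")
```

At a 1 ms interval, 100 agents produce 100,000 signals per second and 1000 agents produce 1,000,000, which gives a concrete target rate for the controller-side measurement.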

How was the previous stress test run?

The initial stress test run (steps below) mostly covered the load that agents place on the controller. We still need to work on the items listed above.

Steps:

git clone https://github.com/containers/qm && cd qm/tests/e2e
./tools/remove-containers (removes any previously created containers/images, so the time spent cleaning up the old environment is not counted)
./run-test-e2e --number-of-nodes=500 &> output-500-nodes.txt
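In the spirit of "keep increasing the bar", the last step could be wrapped in a sweep over growing node counts. The sketch below assumes it is run from `qm/tests/e2e` after the cleanup step; only the `./run-test-e2e --number-of-nodes=N` invocation and the `output-N-nodes.txt` naming come from the steps above, while the function names and any node counts other than 500 are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def build_command(nodes: int) -> list[str]:
    """Build the e2e invocation from the steps above for a given node count."""
    return ["./run-test-e2e", f"--number-of-nodes={nodes}"]


def run_and_time(cmd: Sequence[str], log_path: str) -> float:
    """Run cmd, send stdout and stderr to log_path, return wall time in seconds."""
    start = time.monotonic()
    with open(log_path, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
    return time.monotonic() - start


def sweep(
    node_counts: Sequence[int],
    runner: Callable[[Sequence[str], str], float] = run_and_time,
) -> dict[int, float]:
    """Run the test once per node count and collect the elapsed times."""
    results: dict[int, float] = {}
    for nodes in node_counts:
        results[nodes] = runner(build_command(nodes), f"output-{nodes}-nodes.txt")
    return results
```

For example, `sweep([100, 250, 500])` would return a mapping from node count to elapsed seconds. Running `./tools/remove-containers` between iterations keeps cleanup time out of each measurement, as in the manual steps.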
@dougsland dougsland added the bug Something isn't working label Sep 1, 2023
@engelmi engelmi added testing This issue adds or improves the testing and removed bug Something isn't working labels Sep 14, 2023
@mkemel mkemel added the backlog This is next up in priority label Nov 21, 2023

3 participants