Sample-based stream event output for protobuf and dnstap logging #14515

Open
johnhtodd opened this issue Jul 22, 2024 · 2 comments

Comments

@johnhtodd

  • Program: recursor, dnsdist
  • Issue type: Feature request

Short description

I'd like to see whether an option could be added to newFrameStreamTcpLogger, newFrameStreamUnixLogger, protobufServer, and outgoingProtobufServer that would transmit a sampled subset of events rather than 100% of them.

Usecase

We are rapidly reaching the point where our telemetry systems spend more of their time discarding messages than processing the small sample that is left over. We use a sampled ingestion model, but right now the sample rate is applied at ingestion on the telemetry server, so every message has to be transmitted, received, and processed, and then most are immediately thrown away. This is a big waste of resources. If the sampling could be applied on the transmitting side instead of the receiving side, overall resource utilization would be lower. This applies for us to both dnstap and protobuf streams.

Description

I would like to have a number available as an option on all of the dnstap and protobuf outputs in dnsdist and recursor that would set a sample rate. The rate would be expressed as the N of a 1-in-N ratio, so for instance "20" would mean 5%, 2 would mean 50%, 3 would mean 33%, 4 would mean 25%, etc. I suppose this would be applied randomly to each message before transmission, or it could be done even deeper in the code. I have no strong opinion on that, as long as the distribution is as even as possible across time and is not "bursty".
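To make the ratio semantics concrete, here is a minimal Python sketch of per-message 1-in-N sampling. It is only an illustration of the idea, not dnsdist or recursor code; the names `should_sample` and `sample_stream` are hypothetical, and independent random selection per message is an assumption about how this might be done.

```python
import random


def should_sample(rate: int, rng: random.Random) -> bool:
    """Return True for roughly 1 in `rate` calls (rate=20 means about 5%)."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    # randrange(rate) is uniform over 0..rate-1, so 0 comes up 1/rate of the time.
    return rng.randrange(rate) == 0


def sample_stream(messages, rate: int, rng=None):
    """Yield only the messages selected by the 1-in-`rate` sampler."""
    if rng is None:
        rng = random.Random()
    for message in messages:
        if should_sample(rate, rng):
            yield message


if __name__ == "__main__":
    kept = sum(1 for _ in sample_stream(range(100_000), 20))
    print(f"kept {kept} of 100000 messages")
```

Because each message is selected independently, the kept messages are spread across time rather than arriving in bursts, at the cost of the exact kept count varying slightly from run to run.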

Additionally, the sample rate would need to be reflected somewhere in the monitoring statistics, since downstream systems would have to multiply their counts by this figure to estimate the original event volume. It would need to show up somewhere in the Prometheus stats, for instance, in a way that exposes the rate applied to each socket/session that is sending telemetry data. This might necessitate tags/labels that contain the IP:port of the destination (or the socket name) so the entries stay distinct in the statistics set.
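As a rough sketch of what a downstream consumer would do with such a statistic, the Python below scales sampled counts back up per destination. The function name, the dictionary layout, and the example destinations are all hypothetical; no such metric exists in the products today.

```python
def estimate_totals(observed: dict) -> dict:
    """Scale sampled counts up by each destination's 1-in-N sample rate.

    `observed` maps a destination label (e.g. "ip:port" or a socket name)
    to a (sampled_count, rate) pair. The result is an estimate, not an
    exact count, because the sampling is random.
    """
    totals = {}
    for destination, (sampled_count, rate) in observed.items():
        if rate < 1:
            raise ValueError(f"invalid sample rate for {destination}: {rate}")
        totals[destination] = sampled_count * rate
    return totals


if __name__ == "__main__":
    # Hypothetical destinations: one TCP endpoint sampled 1-in-20,
    # one unix socket receiving every message (rate 1).
    stats = {
        "192.0.2.10:6000": (50, 20),
        "/var/run/dnstap.sock": (700, 1),
    }
    print(estimate_totals(stats))
```

Keeping the rate per destination matters because two outputs on the same server could be configured with different rates.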

@omoerbeek
Member

omoerbeek commented Jul 22, 2024

I agree that it would be nice to have built-in, for all products.

Currently, for dnsdist, I think you can use ProbaRule to get the desired effect. This will not report the probability in use unless you add a custom metric.

@johnhtodd
Author

ProbaRule would work, but only for dnsdist; thanks, I had forgotten about that method. I could use SetMetric to build my own reporting for the probability rule. That would solve the issue in the short term, but I suspect a consistent, fast way to do this sampling, built into the creation of the socket itself, would be welcome as a universal method across the product line(s).
