
ZSTDm (special mode of zstd optimized for memory compression, see paper inside) #4137

Open
ahydronous opened this issue Sep 5, 2024 · 3 comments

ahydronous commented Sep 5, 2024

Is your feature request related to a problem? Please describe.
Zstd is currently used for Zram/Zswap compression as the default in a lot of places. However, it has quite a big latency penalty compared to lz4, and many applications care about memory latency.

Describe the solution you'd like
In 2017, researchers at Seoul National University created LZ4m, a specialized version of LZ4 optimized for in-memory data (presented at ICCE 2017). Perhaps a ZSTDm could be created, based on their research?

Here is the paper: http://csl.snu.ac.kr/papers/icce17.pdf (LZ4m: A Fast Compression Algorithm for In-Memory Data)

I tried contacting the researchers for the source, but I got 'e-mail not delivered'.

Here is the difference between LZ4 and LZ4m:
[Figure: side-by-side comparison of LZ4 and LZ4m scanning and encoding]

And a few excerpts from the paper:

In-memory data consists of virtual memory pages that contain the data from the stack and the heap regions of applications. The stack and the heap regions contain constants, variables, pointers, and arrays of basic types, which are usually structured and aligned to the word boundary of the system [6]. Based on this observation, we can expect that data compression can be accelerated without a significant loss of compression opportunity by scanning and matching data at the word granularity. In addition, as the larger granularity requires fewer bits to represent the length and the offset, the substrings (i.e., literals and matches) can be encoded in a more concise form.

LZ4m uses the same scanning window and hash table as the original LZ4. In contrast to the original LZ4 algorithm, LZ4m scans an input stream and finds the match at a 4-byte granularity. If the hash table indicates no prefix match exists, LZ4m advances the window by 4 bytes and repeats identifying the prefix match.
As shown in Figure 2(a), after examining the prefix at position 8, the next position of the window is 12 instead of 9. If a prefix match is found, LZ4m compares subsequent data and finds the longest match at the 4-byte granularity as well. In the example, although there is a 6-byte match starting from position 12, LZ4m only considers the 4-byte match from 12 to 15, and slides the scanning window forward by four bytes, to position 16.
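For illustration, the word-granularity scan the excerpt describes can be sketched in Python. This is a simplified toy, not the actual LZ4m implementation: real LZ4/LZ4m hashes the 4-byte prefix into a fixed-size table and emits literal/match sequences, whereas this sketch uses a plain dict and collects match tuples for clarity.

```python
def find_matches_4byte(data: bytes):
    """Toy LZ4m-style scan: the window advances in 4-byte steps and
    matches are found and extended only in whole 4-byte words."""
    table = {}    # 4-byte prefix -> last position it was seen at
    matches = []  # (scan_pos, earlier_pos, match_len)
    pos = 0
    while pos + 4 <= len(data):
        prefix = data[pos:pos + 4]
        cand = table.get(prefix)
        if cand is not None:
            # Extend only in whole words: a 6-byte match is recorded
            # as a 4-byte match, exactly as in the Figure 2(a) example.
            length = 4
            while (pos + length + 4 <= len(data)
                   and data[cand + length:cand + length + 4]
                       == data[pos + length:pos + length + 4]):
                length += 4
            matches.append((pos, cand, length))
            pos += length
        else:
            table[prefix] = pos
            pos += 4  # no match: advance the window by 4 bytes, not 1
    return matches
```

Because the scan never looks at positions that are not word-aligned, it does roughly a quarter of the hash lookups LZ4 would, at the cost of missing unaligned or sub-word matches.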

We also optimize the encoding scheme to utilize the 4-byte granularity. As the offset is aligned to the 4-byte boundary and the length is a multiple of 4 bytes, the two least significant bits of these fields are always 00 in binary. Thus, every field representing the length and the offset (the literal length and the match length in the token, and the literal length, the match length, and the match offset in the body) can be shortened by 2 bits.
Moreover, as LZ4m targets compressing 4 KB pages at a 4-byte granularity, the field for the match offset requires at most 10 bits. Consequently, we place the two most significant bits of the match offset in the token and put the remaining 8 bits in the body.
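The offset packing described in that excerpt can be shown with a little bit arithmetic. This is a sketch of just the offset field; the full LZ4m token layout in the paper also shortens the length fields by the same 2 bits.

```python
def pack_offset(offset: int) -> tuple[int, int]:
    """Split a 4-byte-aligned match offset (< 4 KB) into the 2 bits
    kept in the token and the 8-bit byte kept in the body."""
    assert offset % 4 == 0 and 0 <= offset < 4096
    word_offset = offset >> 2  # drop the two always-zero bits: 10 bits remain
    return word_offset >> 8, word_offset & 0xFF  # (2 MSBs, low 8 bits)

def unpack_offset(token_bits: int, body_byte: int) -> int:
    """Rebuild the byte offset from the two stored fields."""
    return ((token_bits << 8) | body_byte) << 2
```

So a full 4 KB-range offset costs one body byte plus two spare token bits, instead of the fixed two-byte offset LZ4's block format uses.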

And here are the performance gains; LZ4m also achieves a slightly higher compression ratio:
[Figure: LZ4m vs. LZ4 compression/decompression speed and compression ratio benchmarks]

Perhaps something similar could be achieved for Zstd, with gains in compression/decompression speed and latency.


Cyan4973 commented Oct 3, 2024

LZ4m is a variation of LZ4, designed to be more specifically targeted at compression of 4 KB RAM pages.
Focusing on this one topic makes it possible to hardwire some choices which only make sense for this case.

For example, since only 4 KB are compressed, offsets will never be larger than 4 KB. This limitation can be exploited to recycle some bits, resulting in a shorter representation, hence better compression.

LZ4m also goes straight for compression of 4-byte fields, which is great for many aligned data structures, notably arrays.
And therefore, in those cases, it works better.
But if the page contains other stuff, such as, for example, a simple natural language text, this will not work well at all.
So it's a specialized choice, rather than a general formula.

I could totally imagine these choices being generally better for compression of 4 KB RAM pages (not always, but on average).
Now, by how much, this is less clear, and one should do their own benchmark to get a clearer picture.
But the claimed gains are not large, which makes it more difficult to decide if this variant is worth its maintenance cost.

zstd is on the other side of the picture. It compresses much better than LZ4 (and lz4m). But it's also slower.
We could work on making zstd more efficient for the 4 KB RAM page scenario, and it's likely that a few benefits will be found in the process. Typically a bit better speed, but nothing too dramatic.

But do note that, in the end, we don't have an LZ4m in the Linux kernel.
That's because, even though it presents itself as a variant of LZ4, it is still a different algorithm, which therefore requires its own specific maintenance. Maintenance is a pretty giant lingering cost; it's frequently overlooked, yet it makes the difference between adopted and abandoned technologies.

If we were to consider a ZSTDm variant following the same lines, it would be the same problem: it would result in a specialized incompatible variant, requiring its own adoption effort and maintenance cost. Is it worth it? Who knows, maybe. But to justify that cost, the benefits had better be good, not just marginally better. And at this point, I have no element to suggest it would make such a large difference.

In the meantime, the best we could do is optimize zstd for 4 KB pages.
zstd is already in the kernel, it's already maintained, this will only make the code base a bit more complex, but nothing too dramatic.
Hence it's easier to make it happen.
On the other hand, improvements will just be incremental in nature. There are no more low-hanging fruits at this point.

@Cyan4973 Cyan4973 self-assigned this Oct 3, 2024
@leonpano2006

This means there would be some limitations for such an approach, for example on ARM systems that use 16 KB or 64 KB pages.
And for Transparent Huge Pages, I am not sure whether it would be suitable.
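That limitation can be quantified: with word alignment, a match offset within one page needs log2(page_size / word_size) bits, so LZ4m's 10-bit-offset trick is specific to 4 KB pages. A quick back-of-the-envelope check (my own arithmetic, not from the paper):

```python
import math

def offset_bits(page_size: int, word_size: int = 4) -> int:
    """Bits needed for a word-aligned match offset within one page."""
    return int(math.log2(page_size // word_size))

for size in (4096, 16384, 65536):
    print(f"{size >> 10} KiB page -> {offset_bits(size)} offset bits")
```

At 16 KB or 64 KB pages the offset no longer fits in 10 bits, so the "2 bits in the token, 8 in the body" layout would have to change per page size.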


ahydronous commented Oct 10, 2024

Oh hello Yann, I didn't realize you created zstd as well. You're a beast!

> But do note that, in the end, we don't have an LZ4m in the Linux kernel. That's because, even though it presents itself as a variant of LZ4, it is still a different algorithm, which therefore requires its own specific maintenance. Maintenance is a pretty giant lingering cost; it's frequently overlooked, yet it makes the difference between adopted and abandoned technologies.

Isn't that rather because the researchers never contributed their research?
With how much use/attention zram / zswap are getting, I can't imagine (assuming gains similar to LZ4m's) that a 60-percentage-point increase in compression speed and a 20-percentage-point increase in decompression speed wouldn't be most welcome, even at Facebook's scale.
But I do see that another, semi-unique algorithm means another thing that has to be checked, tested, and maintained constantly.

> We could work on making zstd more efficient for the 4 KB RAM page scenario, and it's likely that a few benefits will be found in the process. Typically a bit better speed, but nothing too dramatic.

Since filesystems these days also mostly work with 4 KiB logical (and physical) sectors, it might help universally.

> In the meantime, the best we could do is optimize zstd for 4 KB pages.

Would it also help to create a small dictionary that contains the foundational constants, variables, pointers, and arrays of basic types of memory?
Even if these weren't fully shared among x64, ARM64 and RISC-V, the dictionary could still be tiny.
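For what it's worth, both zstd and zlib already support preset dictionaries in their generic form (zstd even ships a trainer, `zstd --train`). A stdlib-only sketch of the concept using zlib's `zdict` parameter; the dictionary contents here are made-up placeholders, not real profiled memory patterns, which would have to come from training on actual page dumps:

```python
import zlib

# Hypothetical dictionary seeded with patterns common in process memory
# (runs of zero words, small word-aligned integers). Placeholder values only.
MEM_DICT = b"\x00" * 64 + b"\x01\x00\x00\x00" * 8 + bytes(range(16))

def compress_page(page, zdict=None):
    c = zlib.compressobj(level=6, zdict=zdict) if zdict else zlib.compressobj(level=6)
    return c.compress(page) + c.flush()

def decompress_page(blob, zdict=None):
    # The decompressor must be given the same preset dictionary.
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob)

# Toy 4 KiB "page": a zero-filled region followed by word-aligned integers.
page = b"\x00" * 2048 + b"\x2a\x00\x00\x00" * 512
blob = compress_page(page, MEM_DICT)
print(len(page), "->", len(blob))
```

zstd's C API offers the same mechanism via `ZSTD_compress_usingDict()`; the open question is whether a dictionary of common word patterns would stay small and useful across x64, ARM64, and RISC-V.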

Thinking more on it (and considering my similar issue here: lz4/lz4#1505), and the fact that Google uses Snappy as their zram compressor, perhaps ultimately a dedicated memory-oriented LZ77 compressor would be best. But alas, there is the maintenance cost you already spoke of.

@leonpano2006
As far as I've heard/seen from optimizing gaming systems, THPs are a poor fit for zram / zswap regardless, because they don't get resized when moved into swap, which makes them compress more poorly. They're still an overall performance gain, though.
