Parallel compression for LZ4 #158

Open
ParaN3xus wants to merge 2 commits into nesrak1:main from ParaN3xus:feat-parallel-compression

@ParaN3xus

This PR introduces a parallel compression path for LZ4/LZ4Fast in bundle packing. To preserve compatibility with net35 targets, the parallel path is enabled only on non-net35 builds; net35 builds keep using the original sequential implementation.

The mechanism is straightforward: it takes a batch of consecutive blocks and compresses them concurrently with Parallel.For. The batch size is configurable through Lz4ParallelPackBatchSize and defaults to 32. I chose this default based on benchmarks on my local laptop (Intel(R) Core(TM) i7-13650HX, 32 GB DDR5 memory, Linux 6.6.87.2-microsoft-standard-WSL2). In the benchmarks, I used a byte-counting stream to eliminate the impact of disk I/O.
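
For readers who want a concrete picture of the batching described above, here is a minimal sketch. It is not the code from this PR: `BatchedPackSketch`, `PackBlocks`, `CountingStream`, and the `compress` delegate are hypothetical names, and the delegate stands in for the real LZ4/LZ4Fast block compressor. The `#if NET35` branch assumes the `NET35` symbol that the .NET SDK defines for net35 targets, where `System.Threading.Tasks` (and therefore Parallel.For) is not available.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
#if !NET35
using System.Threading.Tasks;
#endif

// Hypothetical sketch, not the PR's implementation.
public static class BatchedPackSketch
{
    // Compresses `blocks` in batches of `batchSize` and writes the results
    // to `output` in the original block order.
    // `compress` is a stand-in for the real LZ4 block compressor.
    public static void PackBlocks(IList<byte[]> blocks, Func<byte[], byte[]> compress,
                                  Stream output, int batchSize = 32)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");

        for (int start = 0; start < blocks.Count; start += batchSize)
        {
            int batchStart = start;
            int count = Math.Min(batchSize, blocks.Count - batchStart);
            var results = new byte[count][];

#if NET35
            // Sequential fallback: no Parallel.For on net35.
            for (int i = 0; i < count; i++)
                results[i] = compress(blocks[batchStart + i]);
#else
            // Each iteration writes only to its own slot, so no locking is needed.
            Parallel.For(0, count, i =>
            {
                results[i] = compress(blocks[batchStart + i]);
            });
#endif

            // Write sequentially so the output order matches the input order.
            foreach (byte[] compressed in results)
                output.Write(compressed, 0, compressed.Length);
        }
    }
}

// Hypothetical byte-counting sink: discards data and only tracks the total,
// which keeps disk I/O out of a benchmark.
public sealed class CountingStream : Stream
{
    public long BytesWritten { get; private set; }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return BytesWritten; } }
    public override long Position
    {
        get { return BytesWritten; }
        set { throw new NotSupportedException(); }
    }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { BytesWritten += count; }
}
```

In this sketch the compressed buffers of one batch are held in memory until the whole batch finishes, so a larger batch size trades memory for more parallel work per Parallel.For call.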

Benchmark logs:

Data length: 1,142,981,560 bytes
CPU: 20 logical cores

=== LZ4Fast ===
Warmup: 1, Measured: 3
batch=   1 | avg=  3069.20 ms | min=  2996.73 ms | out= 699,313,746 bytes
batch=   2 | avg=  1949.53 ms | min=  1925.55 ms | out= 699,313,746 bytes
batch=   4 | avg=  1374.78 ms | min=  1360.78 ms | out= 699,313,746 bytes
batch=   8 | avg=  1235.22 ms | min=  1231.19 ms | out= 699,313,746 bytes
batch=  16 | avg=  1274.38 ms | min=  1255.26 ms | out= 699,313,746 bytes
batch=  32 | avg=  1000.09 ms | min=   981.35 ms | out= 699,313,746 bytes
batch=  64 | avg=  1015.87 ms | min=   971.02 ms | out= 699,313,746 bytes
batch=  96 | avg=   979.19 ms | min=   963.95 ms | out= 699,313,746 bytes
batch= 128 | avg=  1018.69 ms | min=  1016.70 ms | out= 699,313,746 bytes
batch= 192 | avg=  1127.88 ms | min=  1005.42 ms | out= 699,313,746 bytes
batch= 256 | avg=   962.32 ms | min=   860.11 ms | out= 699,313,746 bytes
batch= 384 | avg=   941.29 ms | min=   894.33 ms | out= 699,313,746 bytes
batch= 512 | avg=   951.07 ms | min=   918.96 ms | out= 699,313,746 bytes
Best for LZ4Fast: batch=384, avg=941.29 ms, min=894.33 ms

=== LZ4 ===
Warmup: 1, Measured: 3
batch=   1 | avg= 23305.73 ms | min= 23240.59 ms | out= 644,689,762 bytes
batch=   2 | avg= 13358.26 ms | min= 13265.65 ms | out= 644,689,762 bytes
batch=   4 | avg=  8143.68 ms | min=  8137.16 ms | out= 644,689,762 bytes
batch=   8 | avg=  6423.89 ms | min=  6346.14 ms | out= 644,689,762 bytes
batch=  16 | avg=  5454.16 ms | min=  5396.60 ms | out= 644,689,762 bytes
batch=  32 | avg=  5064.66 ms | min=  5032.62 ms | out= 644,689,762 bytes
batch=  64 | avg=  5174.30 ms | min=  5108.39 ms | out= 644,689,762 bytes
batch=  96 | avg=  4748.79 ms | min=  4643.74 ms | out= 644,689,762 bytes
batch= 128 | avg=  5115.96 ms | min=  4897.92 ms | out= 644,689,762 bytes
batch= 192 | avg=  5628.60 ms | min=  5033.43 ms | out= 644,689,762 bytes
batch= 256 | avg=  5217.21 ms | min=  4852.48 ms | out= 644,689,762 bytes
batch= 384 | avg=  4272.61 ms | min=  4254.14 ms | out= 644,689,762 bytes
batch= 512 | avg=  4529.81 ms | min=  4433.02 ms | out= 644,689,762 bytes
Best for LZ4: batch=384, avg=4272.61 ms, min=4254.14 ms
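
To put the logged averages in perspective, the following sketch computes speedups relative to batch=1. It assumes that batch=1 is a reasonable proxy for the sequential path, which the logs do not state explicitly; `SpeedupSketch` and its methods are hypothetical names used only for this illustration. The input numbers are copied from the logs above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for reading the benchmark logs; not part of the PR.
public static class SpeedupSketch
{
    // Ratio of baseline time to candidate time; values above 1 mean the candidate is faster.
    public static double Speedup(double baselineMs, double candidateMs)
    {
        return baselineMs / candidateMs;
    }

    // Formats a speedup with two decimals, independent of the machine's locale.
    public static string Format(double speedup)
    {
        return speedup.ToString("F2", CultureInfo.InvariantCulture) + "x";
    }

    public static void Report()
    {
        // Average times in ms, taken from the logs (batch=1 as the baseline).
        Console.WriteLine("LZ4Fast batch=32:  " + Format(Speedup(3069.20, 1000.09)));  // 3.07x
        Console.WriteLine("LZ4Fast batch=384: " + Format(Speedup(3069.20, 941.29)));   // 3.26x
        Console.WriteLine("LZ4     batch=32:  " + Format(Speedup(23305.73, 5064.66))); // 4.60x
        Console.WriteLine("LZ4     batch=384: " + Format(Speedup(23305.73, 4272.61))); // 5.45x
    }
}
```

Under that assumption, the default of 32 already captures most of the measured gain for LZ4Fast (about 3.07x versus 3.26x at batch=384), while LZ4 benefits somewhat more from larger batches (about 4.60x versus 5.45x).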

I know this implementation does not yet extract all of the available performance, but in my observation it already keeps CPU utilization consistently above 80%, which should be sufficient for most cases.
