
Benchmark Results

Jetson AGX Orin 64GB | JetPack 6.2.2 | CUDA 12.6 | MAXN mode | All results roundtrip verified

8,537 MB/s: GPU decompress (nvCOMP LZ4, in-memory)
4,258 MB/s: GPU decompress (nvCOMP LZ4, 10 GB roundtrip)
4.3x: decompress speedup, in-memory vs CPU zstd-1
5.8x: roundtrip decompress speedup vs CPU zstd-1

[Chart: Decompression Throughput (MB/s), in-memory performance]

Roundtrip Results — 10 GB

End-to-end performance including disk I/O.

Method         Processor   Compress MB/s   Decompress MB/s   Ratio   Integrity
nvCOMP LZ4     GPU         517             4,258             1.98x   PASS
zstd-1         CPU         1,094           733               2.00x   PASS
zstd-3         CPU         1,014           741               2.00x   PASS

In-Memory Performance

Raw algorithm throughput without disk I/O bottleneck.

Method         Processor   Compress MB/s   Decompress MB/s   Integrity
nvCOMP LZ4     GPU         705             8,537             PASS
nvCOMP Snappy  GPU         1,615           5,756             PASS
zstd-1         CPU         1,747           2,001             PASS
Note: the GPU crossover point is approximately 10 MB; below that, kernel launch overhead dominates. HammerIO automatically routes small files to the CPU. Real-world compression ratios are typically 1.8x–2.5x on ML datasets and log files.

What these numbers mean

8,537 MB/s means...

A 1 GB model checkpoint restores in ~0.12 seconds in-memory. A forward-deployed AI node that reboots in the field is back to full inference in under a second.
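The arithmetic behind that claim can be checked directly. This is a minimal sketch using only the throughput figures quoted above:

```python
# Restore time = checkpoint size / decompression throughput.
# Throughputs are the in-memory figures from the table above.

def restore_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Time to decompress a payload at a given sustained throughput."""
    return size_mb / throughput_mb_s

gpu = restore_seconds(1024, 8537)   # 1 GB checkpoint, GPU nvCOMP LZ4
cpu = restore_seconds(1024, 2001)   # same checkpoint, CPU zstd-1

print(f"GPU: {gpu:.2f} s, CPU: {cpu:.2f} s, speedup: {cpu / gpu:.1f}x")
# GPU: 0.12 s, CPU: 0.51 s, speedup: 4.3x
```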

GPU crossover at ~10 MB

Below 10 MB, kernel launch overhead dominates. Above that, GPU decompression is 4.3x faster than CPU. Large ML datasets and model weights see the biggest gains.
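The routing rule this implies is easy to sketch. This is illustrative only; the constant and function names are assumptions, not HammerIO's actual API:

```python
# Size-based dispatch: payloads under the crossover go to the CPU codec,
# where kernel-launch overhead would dominate; larger payloads go to the
# GPU path, where that overhead amortizes.
GPU_CROSSOVER_BYTES = 10 * 1024 * 1024  # ~10 MB, per the benchmark above

def pick_backend(payload_size: int) -> str:
    """Choose a decompression backend for a payload of the given size."""
    return "cpu-zstd" if payload_size < GPU_CROSSOVER_BYTES else "gpu-lz4"

print(pick_backend(4 * 1024 * 1024))    # small file  -> cpu-zstd
print(pick_backend(512 * 1024 * 1024))  # checkpoint  -> gpu-lz4
```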

CPU fallback is still fast

CPU zstd-1 reaches 1,747 MB/s compress and 2,001 MB/s decompress in-memory. HammerIO only uses the GPU when the overhead is worth it.

Run it yourself

pip install hammerio
hammer benchmark --1gb
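For a rough feel of what an in-memory CPU benchmark measures, here is a stand-in using Python's stdlib zlib (not zstd, and not the `hammer benchmark` harness itself, so the numbers will differ from the tables above):

```python
import time
import zlib

# Semi-compressible payload: repeated text blocks, like log data (~10 MB).
data = (b"2025-01-01 INFO request served in 12ms\n" * 4000) * 64

t0 = time.perf_counter()
blob = zlib.compress(data, level=1)
t1 = time.perf_counter()
out = zlib.decompress(blob)
t2 = time.perf_counter()

assert out == data  # roundtrip integrity check, as in the tables above
mb = len(data) / 1e6
print(f"compress:   {mb / (t1 - t0):,.0f} MB/s  (ratio {len(data) / len(blob):.1f}x)")
print(f"decompress: {mb / (t2 - t1):,.0f} MB/s")
```

Timing the compress and decompress legs separately, and verifying the roundtrip, mirrors the Integrity column in the result tables.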

View benchmark source on GitHub →