Benchmarks

Below, rare is compared to other common tools on wall-clock (real) time and CPU (user) time.

It's worth noting that in many of these results rare is just as fast in real time, but part of the reason is that it spreads the work across CPU cores (Go is great at parallelization), so its user time can be well above its real time. Take that into account, for better or worse.
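One rough way to read the timings on this page is to divide CPU time (user + sys) by real time: a value well above 1 suggests several cores were busy at once. The helper below is only a sketch; the name `cpu_ratio` is hypothetical and not part of rare, and the inputs are the `rare histo` and `zcat | grep | wc` timings reported further down, converted to seconds.

```shell
# cpu_ratio REAL USER SYS: print (user + sys) / real, all given in seconds.
# Hypothetical helper for illustration; not part of rare.
cpu_ratio() {
  awk -v real="$1" -v user="$2" -v sys="$3" \
    'BEGIN { printf "%.2f\n", (user + sys) / real }'
}

# Timings copied from the results on this page.
cpu_ratio 3.869 13.423 0.191    # rare histo: prints 3.52
cpu_ratio 11.272 16.239 1.989   # zcat | grep | wc: prints 1.62
```

The shell pipeline also lands above 1 because its stages run as separate processes, so the ratio is a hint about parallelism rather than a precise measure.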

All tests were done on ~83MB of gzip'd (1.5GB gunzip'd) nginx logs spread across 10 files.

Each program was run 3 times and the timing of the last run was recorded (to make sure the files were cached equally).
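The run-three-times procedure could be scripted along these lines. This is a minimal sketch, not the script actually used for these results: `warm_up` is a hypothetical helper, and it assumes the caching in question is the operating system's file cache.

```shell
# warm_up N CMD...: run CMD N times, discarding its output, so the input
# files are (presumably) cached before the run that gets timed.
# Hypothetical helper for illustration; not part of rare.
warm_up() {
  warm_up_left=$1
  shift
  while [ "$warm_up_left" -gt 0 ]; do
    "$@" > /dev/null 2>&1
    warm_up_left=$((warm_up_left - 1))
  done
}

# Example: two untimed runs, then the third run is the one that is timed.
# warm_up 2 sh -c 'zcat testdata/* | grep -Poa "\" (\d{3})" | wc -l'
# time sh -c 'zcat testdata/* | grep -Poa "\" (\d{3})" | wc -l'
```

Wrapping the pipeline in `sh -c` lets the whole pipeline be passed around as a single command.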

zcat & grep

$ time zcat testdata/* | grep -Poa '" (\d{3})' | wc -l
8373328

real    0m11.272s
user    0m16.239s
sys     0m1.989s

$ time zcat testdata/* | grep -Poa '" 200' > /dev/null

real    0m5.416s
user    0m4.810s
sys     0m1.185s

I believe the largest holdup here is that zcat passes all the data to grep through a synchronous pipe, whereas rare can process everything in async batches. Using pigz instead didn't yield different results, but on single-file runs they did perform comparably.
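To illustrate the difference, the single pipe can be replaced with one decompress-and-match pipeline per file, with several running at once. This is only a shell approximation of the idea, not how rare is implemented. The function name and the `JOBS` variable are hypothetical, `xargs -0 -P` is a GNU/BSD extension rather than POSIX, and the sketch counts matching lines with a basic regex instead of printing each match.

```shell
# count_status_lines FILE...: count lines containing `" NNN` (a quote, a
# space, three digits) across gzip'd files, decompressing several at once.
# Hypothetical helper for illustration; JOBS sets the parallelism (default 4).
count_status_lines() {
  njobs=${JOBS:-4}
  printf '%s\0' "$@" |
    xargs -0 -n 1 -P "$njobs" sh -c 'gzip -dc "$1" | grep -c "\" [0-9][0-9][0-9]"' _ |
    awk '{ total += $1 } END { print total + 0 }'
}

# Example:
# time count_status_lines testdata/*
```

Each file gets its own `gzip -dc | grep -c` pair (equivalent to zcat here), and only the per-file counts are merged at the end, so no single pipe carries all of the decompressed data.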

Silver Searcher (ag)

Warning

ag version 2.2.0 has a bug where it won't scan all my testdata. I'll hold off on benchmarking until there's a fix.

Old Benchmark (Less data by factor of ~8x)

$ ag --version
ag version 2.2.0

Features:
  +jit +lzma +zlib

$ time ag -z '" (\d{3})' testdata/* | wc -l
1131354

real    0m3.944s
user    0m3.904s
sys     0m0.152s

rare

At no point while scanning the data does rare exceed ~76MB of resident memory.

$ rare -v
rare version 0.1.16, 11ca2bfc4ad35683c59929a74ad023cc762a29ae

$ time rare filter -m '" (\d{3})' -e "{1}" -z testdata/* | wc -l
Matched: 8,373,328 / 8,373,328
8373328

real    0m16.192s
user    0m20.298s
sys     0m20.697s

$ time rare histo -m '" (\d{3})' -e "{1}" -z testdata/*
404                 5,557,374 
200                 2,564,984 
400                 243,282   
405                 5,708     
408                 1,397     
Matched: 8,373,328 / 8,373,328 (Groups: 8)


real    0m3.869s
user    0m13.423s
sys     0m0.191s

pcre2

The PCRE2 version performs approximately the same on a simple regular expression, but begins to shine on more complex regexes.

$ time rare table -z -m "\[(.+?)\].*\" (\d+)" -e "{buckettime {1} year nginx}" -e "{bucket {2} 100}" testdata/*
          2020      2019      
400       2,915,487 2,892,274           
200       1,716,107 848,925             
300       290       245                 
Matched: 8,373,328 / 8,373,328 (R: 3; C: 2)


real    0m31.419s
user    1m40.060s
sys     0m0.657s

$ time rare-pcre table -z -m "\[(.+?)\].*\" (\d+)" -e "{buckettime {1} year nginx}" -e "{bucket {2} 100}" testdata/*
          2020      2019      
400       2,915,487 2,892,274           
200       1,716,107 848,925             
300       290       245                 
Matched: 8,373,328 / 8,373,328 (R: 3; C: 2)


real    0m7.936s
user    0m27.600s
sys     0m0.301s