Make grep 50x faster

Found this neat trick in Brendan Gregg's "Blazing Performance with Flame Graphs" talk:

Switching to LANG=C improved performance by 2000x

In a quick test I immediately got a speed-up of 50.22x.
That is quite an achievement for changing a single environment variable.

real:~# du -sh /var/log/querylog 
148M	/var/log/querylog
real:~# time grep -i e /var/log/querylog > /dev/null 

real	0m12.807s
user	0m12.437s
sys	0m0.068s
real:~# time LANG=C grep -i e /var/log/querylog > /dev/null

real	0m0.255s
user	0m0.196s
sys	0m0.052s

I suspect that the performance gain varies quite a lot depending on the search pattern. Also, please note that this trick is only safe when you know that both the files involved and the search pattern are ASCII-only.
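
One quick way to check that a file really is ASCII-only before relying on this (assuming GNU file and a grep built with PCRE support; adjust for your platform):

$ file --mime-encoding /var/log/querylog                # "us-ascii" means the trick is safe here
$ grep -P -n '[^\x00-\x7F]' /var/log/querylog | head    # or list any lines containing non-ASCII bytes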

(via Standalone Sysadmin)

Comments:

Ever hear of the VFS cache? The second run may simply have benefited from the file already being in memory.


beard |

The main factor is the case insensitive search. Case insensitivity in locales other than C is hard (é == É, and so forth).
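
A quick illustration, assuming the default locale is UTF-8 (exact behaviour may vary between grep builds):

$ echo 'CAFÉ' | grep -i 'café'          # matches: the locale knows É folds to é
$ echo 'CAFÉ' | LANG=C grep -i 'café'   # typically no match: only ASCII is case-folded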


LTCT |

Hi beard, it does speed things up even on cached files, though it's not that impressive (~7 times). A really nice catch!

$ du -h . | tail -n 1
5.9G .

$ time LANG=C grep -ri e . > /dev/null
real 0m0.007s
user 0m0.004s
sys 0m0.003s
$ time LANG=C grep -ri e . > /dev/null
real 0m0.007s
user 0m0.004s
sys 0m0.003s

$ time grep -ri e . > /dev/null
real 0m0.044s
user 0m0.040s
sys 0m0.002s
$ time grep -ri e . > /dev/null
real 0m0.035s
user 0m0.032s
sys 0m0.001s


mosowski |

Imho anything under 1s is way too quick to call a benchmark. System timers are not always great (some seem to have a resolution of around 15 ms) and can otherwise mess up such results. It might be better to create a random file (head -c 1024000 /dev/urandom > /tmp/data), work out how much random data takes about 1-10 seconds to search, and then benchmark that.
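
Something along these lines, with the size tuned so each run takes a few seconds (the byte count and path here are arbitrary; piping through base64 keeps the test data ASCII-only):

$ head -c 750000000 /dev/urandom | base64 > /tmp/data
$ cat /tmp/data > /dev/null                      # warm the page cache once
$ time grep -i e /tmp/data > /dev/null
$ time LANG=C grep -i e /tmp/data > /dev/null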


Lucb1e |

@Lucb1e: So I just created a 500MB /tmp/data file and ran `grep` with `LANG=C`, which did yield faster results. I do agree, though, that anything under 1s is too quick to be considered a benchmark.

Now this is interesting to me. When I ran `LANG=C grep` against my largest /var/log file (which was 20MB), grep was twice as fast. But when I ran it against the 500MB /tmp/data generated from /dev/urandom, the two were roughly even; in fact `LANG=C` was slightly slower. Results below:

`$ time LANG=C grep -i e /tmp/data > /dev/null
real 0m11.761s
user 0m10.364s
sys 0m0.199s`

`$ time grep -i e /tmp/data > /dev/null
real 0m9.865s
user 0m9.757s
sys 0m0.097s`

I'm running: Darwin somemacbook.local 13.0.0 Darwin Kernel Version 13.0.0: Thu Sep 19 22:22:27 PDT 2013; root:xnu-2422.1.72~6/RELEASE_X86_64 x86_64


Guest |

Very interesting. Been running tests and it looks like adding `LANG=C` makes a huge difference.

I added the following to my bashrc:

`alias grep="LANG=C grep"`

Thanks for sharing.


jdorfman |

Run the grep without LANG=C again and post the results. As others pointed out, the file may have been cached. Does it still take 12s?


anon |

I wouldn't advise doing this.

First of all, this speed-up only works for case-insensitive greps. Secondly, a few months down the line, you'll have forgotten you made this alias in your bashrc, and will be wondering why a case-insensitive search for unicode text is returning inaccurate results.

Or, even worse, you won't notice the inaccurate results at all, and will be taking invalid readings from grep at face value.

No, not a good idea.


Aengus Walton |

I just redefined ^ to * and now everything runs faster!


Miklos |

I found that on some larger files it runs about the same as without `LANG=C`. On various larger files it either ran the same, a tad faster, or a tad slower.

This is a good tip though, and I think it starts a conversation and a thought process about other ways to improve our command-line performance.


creativeboulder |

Why not just use ag instead?


Stefan |

Oh, this finding might save a lot of time for people running big-data scripts where grep is part of their ETL or map/reduce process.


Alex Berez |

du -h . | tail -n 1 == du -h --max-depth=0


Thomas Woolford |

du -sh, for heaven's sake.


Lev Shamardin |

Good point. I'm changing the alias to igrep.
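
Something like this, so plain grep keeps its locale-aware behaviour (the name and the -i are just a suggestion):

`alias igrep='LC_ALL=C grep -i'`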


jdorfman |

Well caught, I didn't know about that flag, although I often do use --max-depth=1.


Thomas Woolford |

Hi, I did similar tests a few weeks ago, but comparing grep with perl.

pol@soul ~/grep_vs_perl
$ time grep blah worldcitiespop.txt > /dev/null
real 0m2.573s
user 0m2.486s
sys 0m0.038s

pol@soul ~/grep_vs_perl
$ time grep -i blah worldcitiespop.txt > /dev/null
real 0m4.563s
user 0m4.531s
sys 0m0.031s

pol@soul ~/grep_vs_perl
$ time cat worldcitiespop.txt | perl -e 'while (<>) {print if (m/blah/);}' > /dev/null
real 0m0.978s
user 0m0.922s
sys 0m0.221s

pol@soul ~/grep_vs_perl
$ time cat worldcitiespop.txt | perl -e 'while (<>) {print if (m/blah/i);}' > /dev/null
real 0m1.267s
user 0m1.214s
sys 0m0.221s

worldcitiespop.txt is a 144MB file.
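
For what it's worth, the perl filter can also read the file directly instead of going through cat; that shouldn't change the ranking much:

$ time perl -ne 'print if /blah/i' worldcitiespop.txt > /dev/null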


icepol |

I stumbled across this same speed tip for grep, although using (LC_ALL=C), but it's the same principle. I achieved over a 1400% speed increase using that along with (fgrep) on a 500MB log file.

I did a pretty extensive write-up, including an (strace), of why things sped up so much (grep only has to consider the 128-character ASCII set instead of more than 110,000 Unicode characters). But be careful, because it will affect things such as sorting, so you might not want to bake it into all of your scripts.
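
A rough way to see the difference from the outside, assuming GNU grep on a glibc system (syscall names and paths vary by distro): the first command should show grep opening locale data such as locale-archive, the second should not.

$ strace -e trace=open,openat grep -i e /dev/null 2>&1 | grep locale
$ LC_ALL=C strace -e trace=open,openat grep -i e /dev/null 2>&1 | grep locale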

If links aren't allowed here and you're interested just Google for (LC_ALL=C) and look for my face :)

http://www.inmotionhosting....


Jacob Nicholson |

Use LC_ALL=C before your grep command and it will be faster.


Suryakiran |