A handy feature of regular expressions is alternation: the ability to match one pattern "or" another. Searching a file for either of two strings is easy with a construct like egrep 'root|uucp' /etc/passwd, where the vertical bar ("|") acts as an "or" operator. Perl supports the vertical bar too, so the same match can be written in Perl like this:
bash-4.2$ perl -n -e 'print $_ if /root|uucp/' < /etc/passwd
root:x:0:0:root:/root:/bin/bash
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
Weirdly, though, the "|" construction can be up to 10 times slower than two separate matches performing the same search, as a quick demonstration shows.
First, generate a big file. The following command creates a file with a million lines, each consisting of the line number followed by a fixed sentence.
seq 1 1000000 | awk '{print $1" The quick brown fox jumped over the lazy dog"}' > bigfile
bash-4.2$ head -3 bigfile
1 The quick brown fox jumped over the lazy dog
2 The quick brown fox jumped over the lazy dog
3 The quick brown fox jumped over the lazy dog
bash-4.2$ ls -lh bigfile
-rw-rw-r--. 1 james james 50M Apr 22 18:32 bigfile
Now filter out every line beginning with "1" or "2". First, with a single match using the "|" operator...
bash-4.2$ time cat bigfile | perl -n -e 'print $_ unless /^1|^2/' > out1
real 0m8.395s
user 0m8.166s
sys 0m0.312s
and then with two separate regexps instead:
bash-4.2$ time cat bigfile | perl -n -e 'print $_ unless /^1/ or /^2/;' > out2
real 0m0.844s
user 0m0.715s
sys 0m0.241s
The single match using "|" took roughly 10 times as long (8.4 seconds versus 0.84 seconds of real time) as the two matches joined by a logical "or". The outputs were identical, by the way:
bash-4.2$ cksum out*
485435066 40357989 out1
485435066 40357989 out2
bash-4.2$ wc out1 out2
777777 7777770 40357989 out1
777777 7777770 40357989 out2
1555554 15555540 80715978 total
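The same identity check can be repeated quickly on a smaller file before timing anything. The sketch below also includes a character-class form, /^[12]/, which is not one of the two forms timed above and for which no timing is claimed; it is there only as another spelling of the same filter. The file names smallfile, out_alt, out_two and out_class are hypothetical.

```shell
# Smaller test file: 1000 lines in the same format as bigfile.
seq 1 1000 | awk '{print $1" The quick brown fox jumped over the lazy dog"}' > smallfile

# Single regexp with alternation (the slow form in the timings above).
perl -n -e 'print $_ unless /^1|^2/' < smallfile > out_alt

# Two regexps joined by Perl's logical "or" (the fast form in the timings above).
perl -n -e 'print $_ unless /^1/ or /^2/;' < smallfile > out_two

# Character class: an untimed alternative that avoids alternation altogether.
perl -n -e 'print $_ unless /^[12]/' < smallfile > out_class

# All three checksums and sizes should agree if the filters are equivalent.
cksum out_alt out_two out_class
```

To compare speeds, each perl command can be prefixed with time and pointed at bigfile instead of smallfile.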
Tested so far on Red Hat 4.4, Fedora 16 (Perl 5.014002) and Solaris 10 x86 06/06 (Perl 5.008004). (On Solaris, awk was used in place of the seq command.)
Did I miss something obvious? Is this a bug in Perl? Surely not?
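On a system without seq, the test file can be produced by awk alone. The exact command used on Solaris is not recorded here, so the following is just one possible form, assuming a POSIX-compatible awk. The file name bigfile_awk and the reduced count of 1000 lines are illustrative; raising the loop limit to 1000000 gives a file of the same size as bigfile.

```shell
# Generate numbered lines with awk alone; the BEGIN block runs without reading input.
# bigfile_awk and the limit of 1000 are illustrative, not from the original test.
awk 'BEGIN {
    for (i = 1; i <= 1000; i++)
        print i " The quick brown fox jumped over the lazy dog"
}' > bigfile_awk

# Show the first few lines to confirm the format matches bigfile.
head -3 bigfile_awk
```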