Perl – OR Pattern Match Slow, Use Two Patterns Instead

A handy feature of regular expressions is their ability to “or” match. Searching for two strings in a file is easy with a construct like “egrep ‘root|uucp’ /etc/passwd”. The vertical bar (“|”) acts as an “or” operator. Perl supports the vertical bar too, and the same match could be achieved in Perl thus:

bash-4.2$ perl -n -e 'print $_ if /root|uucp/' < /etc/passwd
root:x:0:0:root:/root:/bin/bash
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
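The contents of /etc/passwd differ from machine to machine, so here is a small sketch of the same match run against a few sample lines. The sample data and the variable names `sample` and `matches` are invented for illustration.

```shell
# Sample passwd-style lines (invented for illustration).
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin'

# Print only the lines that contain "root" or "uucp".
matches=$(printf '%s\n' "$sample" | perl -n -e 'print $_ if /root|uucp/')
printf '%s\n' "$matches"
```

Only the root and uucp lines should come back; the daemon line contains neither string.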

Weirdly though, this construction can be around 10 times slower than two separate matches performing the same search, as a quick demonstration shows.

First, generate a big file. The following command creates a file with a million lines, each line consisting of a line number and the sentence below.

seq 1 1000000 | awk '{print $1" The quick brown fox jumped over the lazy dog"}' > bigfile

bash-4.2$ head -3 bigfile
1 The quick brown fox jumped over the lazy dog
2 The quick brown fox jumped over the lazy dog
3 The quick brown fox jumped over the lazy dog

bash-4.2$ ls -lh bigfile
-rw-rw-r--. 1 james james 50M Apr 22 18:32 bigfile

Now to filter out all those lines beginning with "1" or "2". First with a single match using the "|" or operator...

bash-4.2$ time cat bigfile | perl -n -e 'print $_ unless /^1|^2/' > out1

real 0m8.395s
user 0m8.166s
sys 0m0.312s

and then with 2 separate regexps instead:

bash-4.2$ time cat bigfile | perl -n -e 'print $_ unless /^1/ or /^2/;' > out2

real 0m0.844s
user 0m0.715s
sys 0m0.241s

In this run the single OR match took about 10 times as long (8.4s versus 0.84s of real time) as two matches separated by a logical OR. The outputs were identical, by the way:

bash-4.2$ cksum out*
485435066 40357989 out1
485435066 40357989 out2

bash-4.2$ wc out1 out2
777777 7777770 40357989 out1
777777 7777770 40357989 out2
1555554 15555540 80715978 total
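The line count of 777777 can be cross-checked without Perl at all. The sketch below counts how many of the numbers from 1 to 1000000 begin with 1 or 2 and subtracts that from a million. The variable names `removed` and `kept` are just for illustration.

```shell
# Count the numbers from 1 to 1000000 whose first digit is 1 or 2.
removed=$(seq 1 1000000 | grep -c '^[12]')

# Whatever is left is what the Perl filter should have printed.
kept=$((1000000 - removed))
echo "removed=$removed kept=$kept"
```

The filter drops 222223 lines, which leaves 777777 and agrees with the wc output.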

Tested so far on Red Hat 4.4, Fedora 16 (Perl 5.014002) and Solaris 10 x86 06/06 (Perl 5.008004). On Solaris, awk was used in place of the seq command.
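The exact awk command used on Solaris is not recorded here, so the following is only a guess at an awk-only equivalent of the seq pipeline. It uses printf with %d so the line number is always written as a plain integer.

```shell
# Generate the same million-line file without seq (a sketch, not the original command).
awk 'BEGIN {
    for (i = 1; i <= 1000000; i++)
        printf "%d The quick brown fox jumped over the lazy dog\n", i
}' > bigfile

head -3 bigfile
```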

Did I miss something obvious? Is this a bug in Perl? Surely not?
