Tuesday, November 9, 2010

Counting the occurrences of a substring - the fastest way

SkyHi @ Tuesday, November 09, 2010
So what is the fastest way to count how often a substring occurs inside a string, e.g. how many dashes can be found in “1-4-7-8-37-5-7-8-3-42″? I identified five ways to make Perl say “9″ to this challenge:

   1. my $size = (scalar(@{[$string =~ /-/g]}) + 1);
   2. my $size = scalar(split /-/, $string);
   3. my $size = (($string =~ s/-//g) + 1);
   4. my $size = (($string =~ tr/-//) + 1);
   5. my $size = 1; $size++ while $string =~ /-/g;

I would have assumed that number 1 would be the fastest way as it requires no modifcation of the string. A quick run with “cmpthese” from the Benchmark module revealed that my assumption was wrong:

Rate            match (1)       split (2)     while (5)     repl (3)     tr (4)
match (1)    51106/s          –               -90%         -93%      -94%        -95%
split (2)       495540/s        870%         –              -36%       -45%       -47%
while (5)     771605/s        1410%       56%          –            -15%        -18%
repl (3)       908265/s        1677%       83%       18%           –               -4%
tr (4)          941620/s         1742%        90%      22%          4%                –

The table shows a clear advantage when counting the occurrences via a while-loop (solution 5), count the replaced characters (solution 3) or truncating all occurrences (solution 4). While “tr” was the fastest way during this test run, it wasn’t the best way on other benchmark runs or others systems. If you want to try it yourself you can do so with the attached test-script.

Cluebringer tracking.pm :
my $p = '\<(.*?)\>';
#my $totalrecipient =0;
#$totalrecipient++ while DBQuote($sessionData->{'RecipientData'}) =~ /$p/g;
my $totalrecipient = (scalar(@{[DBQuote($sessionData->{'RecipientData'}) =~ /$p/g]}));


REFERENCES
http://www.chengfu.net/2005/10/count-occurrences-perl/