,

Python – Performance Tests of Regular Expressions

Regular Expressions: To Compile or not to Compile ?

When using regular expressions one often has to decide whether to compile the expressions before applying them.
Below is a simple test I ran and the results.


tee ./regextest.py <<"_EOF_"
import time, re

text1 = '

match me

' regex1= '

(.*?)

' text2 = '

match me

' regex2= '

(.*?)

' text3 = '

match me

' regex3= '

(.*?)

' re_flags = re.IGNORECASE|re.MULTILINE|re.DOTALL|re.UNICODE arr =[] for (step,max_iterations) in [(1,10), (10,100),(100,1000),(1000,10000),(10000,100000),(100000,1000000)]: arr += [num for num in range (step, max_iterations + step, step)] for max in arr: print('-'*50) t1 = time.time() re_compiled = re.compile(regex1+str(max), re_flags) # pre-compiling and storing regex for _ in range(0, max): re_compiled.findall(text1+str(max)+' '+str(_)) t2 = time.time() print("%5i calls in % 8.3f ms - compiled_once"%(max, (t2-t1)*1000.0) ) t3 = time.time() regex2_max = regex2+str(max) # doing this here to avoid costs of string concatation in a loop for _ in range(0, max): re.compile(regex2_max, re_flags).findall(text2+str(max)+' '+str(_)) t4 = time.time() print("%5i calls in % 8.3f ms - compiled_every_call"%(max, (t4-t3)*1000.0) ) t5 = time.time() regex3_max = regex3+str(max)# doing this here to avoid costs of string concatation in a loop for _ in range(0, max): re.findall(regex3_max, text3+str(max)+' '+str(_), re_flags) t6 = time.time() print("%5i calls in % 8.3f ms - uncompiled"%(max, (t6-t5)*1000.0) ) print('-'*50) _EOF_

Results

1-1.000.000 Calls


> python regextest.py

python regextest.py
--------------------------------------------------
    1 calls in    0.194 ms - compiled_once
    1 calls in    0.124 ms - compiled_every_call
    1 calls in    0.124 ms - uncompiled
--------------------------------------------------
    2 calls in    0.116 ms - compiled_once
    2 calls in    0.132 ms - compiled_every_call
    2 calls in    0.173 ms - uncompiled
--------------------------------------------------
    3 calls in    0.117 ms - compiled_once
    3 calls in    0.134 ms - compiled_every_call
    3 calls in    0.133 ms - uncompiled
--------------------------------------------------
    4 calls in    0.129 ms - compiled_once
    4 calls in    0.175 ms - compiled_every_call
    4 calls in    0.128 ms - uncompiled
--------------------------------------------------
    5 calls in    0.122 ms - compiled_once
    5 calls in    0.129 ms - compiled_every_call
    5 calls in    0.135 ms - uncompiled
--------------------------------------------------
    6 calls in    0.133 ms - compiled_once
    6 calls in    0.144 ms - compiled_every_call
    6 calls in    0.139 ms - uncompiled
--------------------------------------------------
    7 calls in    0.126 ms - compiled_once
    7 calls in    0.144 ms - compiled_every_call
    7 calls in    0.146 ms - uncompiled
--------------------------------------------------
    8 calls in    0.141 ms - compiled_once
    8 calls in    0.151 ms - compiled_every_call
    8 calls in    0.146 ms - uncompiled
--------------------------------------------------
    9 calls in    0.136 ms - compiled_once
    9 calls in    0.155 ms - compiled_every_call
    9 calls in    0.143 ms - uncompiled
--------------------------------------------------
   10 calls in    0.139 ms - compiled_once
   10 calls in    0.160 ms - compiled_every_call
   10 calls in    0.161 ms - uncompiled
--------------------------------------------------
   10 calls in    0.029 ms - compiled_once
   10 calls in    0.054 ms - compiled_every_call
   10 calls in    0.042 ms - uncompiled
--------------------------------------------------
   20 calls in    0.155 ms - compiled_once
   20 calls in    0.198 ms - compiled_every_call
   20 calls in    0.228 ms - uncompiled
--------------------------------------------------
   30 calls in    0.175 ms - compiled_once
   30 calls in    0.219 ms - compiled_every_call
   30 calls in    0.223 ms - uncompiled
--------------------------------------------------
   40 calls in    0.195 ms - compiled_once
   40 calls in    0.260 ms - compiled_every_call
   40 calls in    0.257 ms - uncompiled
--------------------------------------------------
   50 calls in    0.214 ms - compiled_once
   50 calls in    0.289 ms - compiled_every_call
   50 calls in    0.294 ms - uncompiled
--------------------------------------------------
   60 calls in    0.253 ms - compiled_once
   60 calls in    0.319 ms - compiled_every_call
   60 calls in    0.327 ms - uncompiled
--------------------------------------------------
   70 calls in    0.259 ms - compiled_once
   70 calls in    0.377 ms - compiled_every_call
   70 calls in    0.368 ms - uncompiled
--------------------------------------------------
   80 calls in    0.274 ms - compiled_once
   80 calls in    0.408 ms - compiled_every_call
   80 calls in    0.408 ms - uncompiled
--------------------------------------------------
   90 calls in    0.298 ms - compiled_once
   90 calls in    0.425 ms - compiled_every_call
   90 calls in    0.445 ms - uncompiled
--------------------------------------------------
  100 calls in    0.323 ms - compiled_once
  100 calls in    0.486 ms - compiled_every_call
  100 calls in    0.670 ms - uncompiled
--------------------------------------------------
  100 calls in    0.211 ms - compiled_once
  100 calls in    0.354 ms - compiled_every_call
  100 calls in    0.360 ms - uncompiled
--------------------------------------------------
  200 calls in    0.535 ms - compiled_once
  200 calls in    0.796 ms - compiled_every_call
  200 calls in    0.849 ms - uncompiled
--------------------------------------------------
  300 calls in    0.741 ms - compiled_once
  300 calls in    1.282 ms - compiled_every_call
  300 calls in    1.123 ms - uncompiled
--------------------------------------------------
  400 calls in    1.067 ms - compiled_once
  400 calls in    1.504 ms - compiled_every_call
  400 calls in    1.596 ms - uncompiled
--------------------------------------------------
  500 calls in    1.123 ms - compiled_once
  500 calls in    1.906 ms - compiled_every_call
  500 calls in    1.853 ms - uncompiled
--------------------------------------------------
  600 calls in    1.295 ms - compiled_once
  600 calls in    2.165 ms - compiled_every_call
  600 calls in    2.265 ms - uncompiled
--------------------------------------------------
  700 calls in    1.524 ms - compiled_once
  700 calls in    2.682 ms - compiled_every_call
  700 calls in    2.670 ms - uncompiled
--------------------------------------------------
  800 calls in    1.714 ms - compiled_once
  800 calls in    2.885 ms - compiled_every_call
  800 calls in    3.064 ms - uncompiled
--------------------------------------------------
  900 calls in    1.926 ms - compiled_once
  900 calls in    3.324 ms - compiled_every_call
  900 calls in    3.186 ms - uncompiled
--------------------------------------------------
 1000 calls in    2.056 ms - compiled_once
 1000 calls in    3.557 ms - compiled_every_call
 1000 calls in    3.983 ms - uncompiled
--------------------------------------------------
 1000 calls in    2.058 ms - compiled_once
 1000 calls in    3.790 ms - compiled_every_call
 1000 calls in    3.671 ms - uncompiled
--------------------------------------------------
 2000 calls in    4.053 ms - compiled_once
 2000 calls in    7.037 ms - compiled_every_call
 2000 calls in    7.269 ms - uncompiled
--------------------------------------------------
 3000 calls in    6.313 ms - compiled_once
 3000 calls in   10.690 ms - compiled_every_call
 3000 calls in   10.786 ms - uncompiled
--------------------------------------------------
 4000 calls in    8.843 ms - compiled_once
 4000 calls in   14.790 ms - compiled_every_call
 4000 calls in   14.555 ms - uncompiled
--------------------------------------------------
 5000 calls in   10.437 ms - compiled_once
 5000 calls in   17.659 ms - compiled_every_call
 5000 calls in   23.852 ms - uncompiled
--------------------------------------------------
 6000 calls in   12.337 ms - compiled_once
 6000 calls in   21.825 ms - compiled_every_call
 6000 calls in   21.183 ms - uncompiled
--------------------------------------------------
 7000 calls in   14.900 ms - compiled_once
 7000 calls in   25.740 ms - compiled_every_call
 7000 calls in   26.754 ms - uncompiled
--------------------------------------------------
 8000 calls in   17.000 ms - compiled_once
 8000 calls in   30.551 ms - compiled_every_call
 8000 calls in   33.780 ms - uncompiled
--------------------------------------------------
 9000 calls in   18.430 ms - compiled_once
 9000 calls in   31.664 ms - compiled_every_call
 9000 calls in   32.703 ms - uncompiled
--------------------------------------------------
10000 calls in   22.263 ms - compiled_once
10000 calls in   37.236 ms - compiled_every_call
10000 calls in   36.782 ms - uncompiled
--------------------------------------------------
10000 calls in   22.298 ms - compiled_once
10000 calls in   36.057 ms - compiled_every_call
10000 calls in   36.947 ms - uncompiled
--------------------------------------------------
20000 calls in   41.064 ms - compiled_once
20000 calls in   72.689 ms - compiled_every_call
20000 calls in   72.712 ms - uncompiled
--------------------------------------------------
30000 calls in   60.894 ms - compiled_once
30000 calls in  108.503 ms - compiled_every_call
30000 calls in  105.168 ms - uncompiled
--------------------------------------------------
40000 calls in   80.844 ms - compiled_once
40000 calls in  138.031 ms - compiled_every_call
40000 calls in  146.786 ms - uncompiled
--------------------------------------------------
50000 calls in  101.330 ms - compiled_once
50000 calls in  177.560 ms - compiled_every_call
50000 calls in  173.693 ms - uncompiled
--------------------------------------------------
60000 calls in  128.290 ms - compiled_once
60000 calls in  213.301 ms - compiled_every_call
60000 calls in  223.361 ms - uncompiled
--------------------------------------------------
70000 calls in  141.226 ms - compiled_once
70000 calls in  248.786 ms - compiled_every_call
70000 calls in  246.285 ms - uncompiled
--------------------------------------------------
80000 calls in  170.583 ms - compiled_once
80000 calls in  284.985 ms - compiled_every_call
80000 calls in  311.732 ms - uncompiled
--------------------------------------------------
90000 calls in  189.461 ms - compiled_once
90000 calls in  329.027 ms - compiled_every_call
90000 calls in  326.422 ms - uncompiled
--------------------------------------------------
100000 calls in  217.344 ms - compiled_once
100000 calls in  351.583 ms - compiled_every_call
100000 calls in  363.048 ms - uncompiled
--------------------------------------------------
100000 calls in  202.079 ms - compiled_once
100000 calls in  356.967 ms - compiled_every_call
100000 calls in  366.751 ms - uncompiled
--------------------------------------------------
200000 calls in  410.131 ms - compiled_once
200000 calls in  707.814 ms - compiled_every_call
200000 calls in  700.848 ms - uncompiled
--------------------------------------------------
300000 calls in  607.796 ms - compiled_once
300000 calls in  1033.031 ms - compiled_every_call
300000 calls in  1073.781 ms - uncompiled
--------------------------------------------------
400000 calls in  806.791 ms - compiled_once
400000 calls in  1417.941 ms - compiled_every_call
400000 calls in  1385.285 ms - uncompiled
--------------------------------------------------
500000 calls in  1007.073 ms - compiled_once
500000 calls in  1705.375 ms - compiled_every_call
500000 calls in  1862.589 ms - uncompiled
--------------------------------------------------
600000 calls in  1208.090 ms - compiled_once
600000 calls in  2125.673 ms - compiled_every_call
600000 calls in  2082.850 ms - uncompiled
--------------------------------------------------
700000 calls in  1429.684 ms - compiled_once
700000 calls in  2401.972 ms - compiled_every_call
700000 calls in  2543.541 ms - uncompiled
--------------------------------------------------
800000 calls in  1662.572 ms - compiled_once
800000 calls in  2849.011 ms - compiled_every_call
800000 calls in  2805.143 ms - uncompiled
--------------------------------------------------
900000 calls in  1940.630 ms - compiled_once
900000 calls in  3150.189 ms - compiled_every_call
900000 calls in  3312.602 ms - uncompiled
--------------------------------------------------
1000000 calls in  2137.428 ms - compiled_once
1000000 calls in  3732.527 ms - compiled_every_call
1000000 calls in  3600.674 ms - uncompiled
--------------------------------------------------

re.py module

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

# ....

_cache = {}
_cache_repl = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 100

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError('Cannot process flags argument with a compiled pattern')
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError, "first argument must be string or compiled pattern"
    try:
        p = sre_compile.compile(pattern, flags)
    except error, v:
        raise error, v # invalid expression
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p


Conclusion

1. Despite several tests, the very first call is always somewhat slower that others.
This could be due to the RE module file includes and initialisation.

2. Reusing pre-compiled regular expressions does indeed improve the execution speed (1.5-2x)
My previous tests showed much bigger differences, but this turned out to be nothing but a test error.
Python string concatenation “got” in the way.

regex2_max = regex2+str(max) <= this is expensive, especially when doing it in a loop the current tests now ensures that the same number of string concatenation are done in on 3 times of tests. 3. Pre-compiling for each call everytime is meaningless because re.match compiles the regular expression before matching any ways. 4. The timing differences between pre-compiled or uncompiled regular expressions is much smaller(1.5-2x), than I even anticipated 5. Based on the numbers above, pre-compilation of regular expressions is an over-optimization UNLESS: a. one has more than 100 different regular expression and b. they are ALL used very frequently - as in 1.000 of times in a few minutes. It may be interesting to actually count the number of times the regular expression are accessed beforehand.

References

1. Python Regular Expression Documentation: http://docs.python.org/2/library/re.html
2. Is Regex Pre-Compilation Worth It: http://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile