Approximate string matching with regular expressions

homeblogtwitterthingiverse



This came up for a microarray design. I wanted to find all the near matches to a string. The following Python snippet will generate a regular expression to find all matches with at most some given number of substitutions.

The basic idea is to divide the string in two and consider all possible assignments of the substitutions to the first and second halves. We then do the same thing for each half, and so on.

import re

def regex(pattern, n_errors=2):
    if n_errors == 0:
        return re.escape(pattern)
    if n_errors >= len(pattern):
        return '.'*len(pattern)
        
    mid = len(pattern)//2
    parts = [ 
        regex(pattern[:mid], i)+regex(pattern[mid:],n_errors-i)
        for i in xrange(min(n_errors+1, mid+1))
    ]
    
    return '(?:' + '|'.join(parts) + ')'


Update 24/10/2007: Disappointingly, it seems Python's regular expression library does not compile regular expressions into DFAs, so the utility of the above is limited.




[æ]