Finding and returning zero or one of a marked sub-expression in Python regex

& (verbiage overflow), Thu 01 August 2013

[Moved to https://github.com/brannerchinese/notes/blob/master/CONTENT/Python/Zero-or-one_operator_within_group_in_perl_vs_python_regex.md. Any future edits will be done there. 20140215.]

One feature of Perl-style regular expressions that Python's re module does not fully support is the zero-or-one operator (metacharacter “question mark”, ?) applied to a parenthesis-delimited “group” or marked sub-expression that is then back-referenced in a replacement. In Python, if such a group is found (below, (ghi)?), it can be referred to with a back-reference (below, \2; the code below is Python 3.3 in IPython 0.13.2, with re already imported):

In [1]: re.sub('(abc)\t(ghi)?', r'\2', 'abc\tghi')
Out[1]: 'ghi'

but if the group is not found, an error is raised instead of the back-reference yielding an empty string (which is the norm in Perl-style engines):

In [2]: re.sub('(abc)\t(ghi)?', r'\2', 'abc\t')
...
error: unmatched group
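
The error arises because a group that takes no part in the match has no value at all, rather than an empty one. A minimal sketch, using only the standard re module, that inspects the match object directly (the variable names are illustrative, not from the session above):

```python
import re

pattern = '(abc)\t(ghi)?'

# When "ghi" is present, group 2 holds the matched text.
found = re.match(pattern, 'abc\tghi')
print(found.group(2))               # ghi

# When "ghi" is absent, the overall match still succeeds,
# but group 2 did not participate, so it is None rather than ''.
missing = re.match(pattern, 'abc\t')
print(missing.group(0) == 'abc\t')  # True
print(missing.group(2))             # None
```

It is this None that the replacement template \2 cannot turn into text.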

I’m aware of two work-arounds for this curious situation.

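First, wrap the optional sub-expression in a non-capturing group and put the capturing parentheses around the whole thing: ((?:ghi)?). There are two pairs of parentheses: the inner pair, (?:...), is non-capturing syntax (not a look-ahead, which is written (?=...)) and carries the zero-or-one operator; the outer pair is the capturing group, which now always takes part in the match, even if only with an empty string: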

In [3]: re.sub('(abc)\t((?:ghi)?)', r'\2', 'abc\tghi')
Out[3]: 'ghi'

In [4]: re.sub('(abc)\t((?:ghi)?)', r'\2', 'abc\t')
Out[4]: ''
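
To see why this works, it may help to compare what each pattern captures when "ghi" is missing. A small sketch under the same assumptions as the session above (the names optional_capture and always_capture are made up for illustration):

```python
import re

optional_capture = '(abc)\t(ghi)?'     # the capturing group itself is optional
always_capture = '(abc)\t((?:ghi)?)'   # the capturing group always participates

text = 'abc\t'
groups_optional = re.match(optional_capture, text).groups()
groups_always = re.match(always_capture, text).groups()

print(groups_optional)  # ('abc', None)
print(groups_always)    # ('abc', '')
```

In the second pattern, group 2 captures an empty string rather than nothing, so the back-reference always has something to substitute.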

Second, discard the zero-or-one operator and put an alternation operator (metacharacter “pipe”, |) inside the parentheses so that the sub-expression now means “either nothing or else the target in question” (below, (|ghi)):

In [5]: re.sub('(abc)\t(|ghi)', r'\2', 'abc\tghi')
Out[5]: 'ghi'

In [6]: re.sub('(abc)\t(|ghi)', r'\2', 'abc\t')
Out[6]: ''
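
One caveat worth checking: Python tries alternatives from left to right, so with (|ghi) the empty alternative wins even when "ghi" is present. The output 'ghi' above then comes from the unmatched remainder of the string, not from group 2. The sketch below illustrates this; the reversed form (ghi|) and the bracketed replacement template are additions for comparison, not part of the original session:

```python
import re

text = 'abc\tghi'

# Empty alternative first: group 2 captures '' and "ghi" is left unmatched.
empty_first = re.match('(abc)\t(|ghi)', text)
print(repr(empty_first.group(0)), repr(empty_first.group(2)))    # 'abc\t' ''

# "ghi" first: group 2 captures 'ghi' when it is present.
target_first = re.match('(abc)\t(ghi|)', text)
print(repr(target_first.group(0)), repr(target_first.group(2)))  # 'abc\tghi' 'ghi'

# The difference shows up once the replacement adds visible text.
sub_empty_first = re.sub('(abc)\t(|ghi)', r'[\2]', text)
sub_target_first = re.sub('(abc)\t(ghi|)', r'[\2]', text)
print(sub_empty_first)   # []ghi
print(sub_target_first)  # [ghi]
```

If the goal is to mirror the greedy behaviour of the zero-or-one operator, putting the target first, as in (ghi|), seems the safer spelling.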

The first work-around is marginally faster than the second:

$ python -m timeit -n 1000000 -s "from re import sub" \
    "sub('(abc)\t((?:ghi)?)', r'\2', 'abc\tghi')"
1000000 loops, best of 3: 6.48 usec per loop

$ python -m timeit -n 1000000 -s "from re import sub" \
    "sub('(abc)\t(|ghi)', r'\2', 'abc\tghi')"
1000000 loops, best of 3: 6.97 usec per loop

though it’s longer to type and more complex to understand and remember.
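
The same comparison can be run from inside Python with the standard timeit module. This is only a sketch: the labels and the reduced repetition count are arbitrary choices, and the figures will vary by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

patterns = {
    'non-capturing': '(abc)\t((?:ghi)?)',
    'alternation': '(abc)\t(|ghi)',
}

repetitions = 100000  # smaller than the 1000000 used on the command line
timings = {}
for label, pattern in patterns.items():
    # Time repeated calls to re.sub with the current pattern.
    timings[label] = timeit.timeit(
        lambda: re.sub(pattern, r'\2', 'abc\tghi'),
        number=repetitions,
    )

for label, seconds in timings.items():
    print('{}: {:.3f} s for {} calls'.format(label, seconds, repetitions))
```

Because re.sub caches compiled patterns, both measurements mostly reflect the cache lookup plus the matching and substitution work, not compilation.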

I vote for the second work-around.


I've just noticed that the regex module now on PyPI, intended as a replacement for re, addresses this issue (“regex 2013-06-26”):

Unmatched group in replacement: An unmatched group is treated as an empty string in a replacement template.
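
A hedged sketch of how one might try this out: it assumes the third-party regex package is installed and falls back to the standard re module otherwise. Versions of re from Python 3.5 onward also substitute an empty string for an unmatched group, so the fallback only raises on older interpreters such as the 3.3 used above.

```python
import re

try:
    import regex  # third-party package from PyPI; assumed to be installed
except ImportError:
    regex = None  # fall back to the standard library below

engine = regex if regex is not None else re

try:
    result = engine.sub('(abc)\t(ghi)?', r'\2', 'abc\t')
except re.error:
    # Older versions of the standard re module raise "unmatched group" here.
    result = None

print(repr(result))
```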

[end]
