Gotta Love/Hate Regex

I remember once calling regular expressions “absolute magic”. I still think they are powerful, but they get messy and unreadable as they grow. In fact, one such expression is the reason I wrote this post. It was in the NLTK book I am currently reading, and it was a little more complex than what I usually use in my code. So, here I am rambling about regex and trying to explain the one I studied today as an example.

The goal

That regex was supposed to be used as a sort of word tokenizer on this example string:

"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very well
without--Maybe it's always pepper that makes people hot-tempered,'..."

There is more to words than being a set of alphanumeric characters separated by whitespace. There is punctuation in the middle: hyphens, parentheses, full stops, etc., but not every punctuation symbol actually separates two words. You can see compound words (e.g. “hot-tempered”) and punctuation without whitespace like the double-hyphen.
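
To see why whitespace alone is not enough, here is a minimal sketch that splits the example string with Python's built-in str.split(). The variable names raw and naive_tokens are my own choices for illustration.

```python
# The example string from above (the name `raw` is an assumption for this sketch).
raw = (
    "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\n"
    "though), 'I won't have any pepper in my kitchen AT ALL. Soup does very well\n"
    "without--Maybe it's always pepper that makes people hot-tempered,'..."
)

# Splitting on whitespace leaves punctuation glued to the neighbouring words.
naive_tokens = raw.split()

print(naive_tokens[:4])  # the opening quote and the comma stay attached
print(naive_tokens[-1])  # the last piece drags a comma, a quote, and an ellipsis along
```

Pieces like "(not" and "without--Maybe" come out as single tokens, which is exactly what the regex is meant to avoid.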

The solution

"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"

I had never written a regex this long before. So, looking at it, my brain’s first response was “gobbledygook!” To be fair, though, this one is relatively simple in the grand scheme of things: regexes can get much bigger. (See for yourself…)
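
Before taking the pattern apart, it may help to see it run. This is a minimal sketch that assumes plain re.findall from Python's standard library is enough to try it out; the variable names raw, pattern, and tokens are mine.

```python
import re

# The example string (the name `raw` is an assumption for this sketch).
raw = (
    "'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\n"
    "though), 'I won't have any pepper in my kitchen AT ALL. Soup does very well\n"
    "without--Maybe it's always pepper that makes people hot-tempered,'..."
)

# A raw string literal keeps the backslashes exactly as written.
pattern = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"

# findall scans left to right and returns every non-overlapping match.
tokens = re.findall(pattern, raw)

print(tokens[:7])   # start of the sentence
print(tokens[-4:])  # end of the sentence
```

The opening quote, the comma after "Duchess", and the final ellipsis each come out as their own token, while "I'M" and "hot-tempered" stay in one piece.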

Breaking it down

Just like any programming problem, this one is better solved by breaking it down. I will split it at each | because it means “or”, so the whole regex represents multiple choices, and we are trying to match any of them.
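
The “or” reading can be made visible by wrapping each alternative in a named group and asking Python which one matched. The group names (word, quote, punct, other) and the short sample string are labels I made up for this sketch; the original regex has no named groups.

```python
import re

# Same four alternatives, each wrapped in a hypothetical named group.
labelled_pattern = re.compile(
    r"(?P<word>\w+(?:[-']\w+)*)"
    r"|(?P<quote>')"
    r"|(?P<punct>[-.(]+)"
    r"|(?P<other>\S\w*)"
)

sample = "(not hot-tempered),'..."

# m.lastgroup is the name of the alternative that produced the match.
labelled = [(m.group(), m.lastgroup) for m in labelled_pattern.finditer(sample)]
for token, name in labelled:
    print(f"{token!r:16} matched by {name}")

# Order matters: alternatives are tried left to right at each position.
# Moving the catch-all to the front lets it swallow "(not" whole.
reordered = re.findall(r"\S\w*|\w+(?:[-']\w+)*|'|[-.(]+", sample)
print(reordered[0])
```

At each position the first alternative that matches wins, which is why the catch-all sits at the very end of the original pattern.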

Part 1: \w+(?:[-']\w+)*

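This part can be thought of as the main pattern. \w+ matches a run of word characters, and (?:[-']\w+)* allows any number of further runs, each joined to the previous one by a single hyphen or apostrophe. It covers most of the desired tokens, including compound words joined with a hyphen (e.g. “hot-tempered”) as well as words with apostrophes (e.g. “I’M”).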
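
Here is a small sketch of Part 1 on its own. The sample words are partly taken from the example string; "mother-in-law" is my own addition to show the repetition. It also shows why the group is written as (?:...) rather than (...): with a capturing group, re.findall returns the group's contents instead of the whole match.

```python
import re

sample = "I'M hot-tempered won't mother-in-law"

# Non-capturing group: findall returns each whole match.
part1 = r"\w+(?:[-']\w+)*"
whole_words = re.findall(part1, sample)
print(whole_words)

# Capturing group: findall returns only what the group matched last.
fragments = re.findall(r"\w+([-']\w+)*", sample)
print(fragments)

# A double hyphen is not followed by a word character after the first hyphen,
# so Part 1 stops before it.
split_on_dashes = re.findall(part1, "without--Maybe")
print(split_on_dashes)
```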

Part 2: '

The example string wraps its quotations in literal single quotes. Apostrophes inside words are already consumed by Part 1, so the remaining ones are the quotation marks themselves, and the pattern covers those by, well… using a single quote.

Part 3: [-.(]+

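This part matches runs of hyphens, full stops, and opening parentheses. In the example that covers the opening parenthesis, the full stop, the ellipsis, and the double hyphen. Closing parentheses are not in the character class, so they are left for Part 4.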
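
A quick sketch of Part 3 in isolation. The sample string is a shortened mix of fragments from the example, put together by me for illustration.

```python
import re

part3 = r"[-.(]+"

# Inside a character class, "-" (placed first), "." and "(" are all literal.
sample = "(though), without--Maybe ALL. hot,'..."
punct_runs = re.findall(part3, sample)
print(punct_runs)

# On its own, Part 3 would also grab the hyphen inside a compound word.
# In the full regex that never happens because Part 1 is tried first.
standalone = re.findall(part3, "hot-tempered")
print(standalone)
```

The closing parenthesis, the commas, and the quote are not in the result, because none of them belongs to the character class.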

Part 4: \S\w*

The last part cleans up with some final touches. It matches one non-whitespace character followed by zero or more word characters (letters, digits, or underscores). Here it picks up two tokens that wouldn’t be matched otherwise: a comma and a closing parenthesis.
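
To see what Part 4 contributes, this sketch runs the regex with and without it on a fragment of the example string. The variable names are my own.

```python
import re

fragment = "though), 'I won't"

# Only the first three alternatives: unmatched characters are silently skipped.
without_part4 = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+", fragment)
print(without_part4)

# The full pattern: the catch-all keeps the closing parenthesis and the comma.
with_part4 = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", fragment)
print(with_part4)
```

Without the catch-all, re.findall does not raise an error for the characters it cannot match; it simply moves on, so those tokens would vanish from the output.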


That’s it! Although it is not the most complicated thing in the world, I still think it proves the point I made at the beginning: regex can be messy but still really powerful!
