Recursive regular expressions

Prelude

I found a strange interesting regex in the pragmatic programming ruby book:

re = /
  \A
    (?<brace_expression>
      {(
        (
          [^{}]
        |
          (\g<brace_expression>)
        )*
      )}
    )
  \Z
/x

correct = "{alma}"
incorrect = "al{ma"
p re =~ correct        # 0 (matches at position 0)
p re =~ incorrect      # nil (doesn't match)

The book explains that this beauty above checks whether a string has a balanced set of curly braces.

This is another, similarly interesting regex borrowed from the Perl Monks, that checks whether a word is a palindrome:

use re 'eval';
my $ch = '[a-zA-Z]';
my $palindrome;
my $r   = "(??{\$palindrome})";
$palindrome = qr/$ch|($ch)($r)?\1/;

while (<>) {
    print if /^$palindrome$/;
}

Here you can find the explanation how this expression works. Both of them made me raise an eyebrow. Didn’t make you? It should – unless you are already a regex ninja.

Etymology

Let’s see the etymology of the word ‘regex’.

No, first let’s see what does the word ‘etymology’ mean. According to the Wikipedia, “Etymology is the study of the history of words, their origins, and how their form and meaning have changed over time.”

The recursive example is this: etymology of etymology.

So what’s the origin of the word ‘regex’? It begun with Noam Chomsky:

Regular Languages

According to the Chomsky Hierarchy, the simplest formal languages are the Regular Languages. You can generate them by a very simple set of rules. Also, you can define a Regular Language by a regular expression. Equivalently, you can define a Regular Language by a finite state automaton.

The second simplest languages are the context free languages. You can define such a language with a pushdown automaton. At the other end of the spectrum you can find the unrestricted grammars: the one that can be defined by Turing automaton. I.e. most of the programming languages belong to this category.

Back to the regular languages. The fact that they can be defined by a ‘finite state automation’ has a particularly interesting result: an automaton defining a regular language cannot have a stack – because that would be a pushdown automaton. Since their automaton cannot have a stack, their structures cannot be recursive (only accidentally). So the regular expressions cannot have recursive structures either. Here we go.

We’ll get back to etymology later.

Parsing JSON

Since there is no recursion in a regular language, folks will quickly tell you that you cannot parse JSON with a regex. E.g. see the comments at this SO question. Theoretically they are right, however, Brian D Foy tries to explain here how to actually do that. However, sometimes he admits that he doesn’t fully comprehend the regex because he borrowed it from Randal L Schwartz. Here is the real thing, without any explanation.

Lesson learned: Regexes are capable of far more than I thought. Sooner or later I’ll read the regex book. A warning for my future self: if it’s hard to understand then it’ll be hard to maintain too. So, well, I’ll read the regex book just for the fun of it.

Back to etymology

Regex, 1956: N. Chomsky: a way to define a Regular Language, the simplest kind of formal languages.
Regex, now: a tricky language for pattern magic matching. It can become extremely cryptic.

Advertisements

About tamasrev

A software developer, specialized in Java, addressing himself as generalist. A proud daddy.
This entry was posted in Uncategorized and tagged , , , , , . Bookmark the permalink.

2 Responses to Recursive regular expressions

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s