Matching Strings with Regular Expressions

Appeared in Volume 6/1, February 1993

Keywords: strings.

ok@goanna.cs.rmit.oz.au
Richard A. O'Keefe
8th September 1992

Milan Richter writes:
I am looking for public domain code for matching strings with regular expressions. Essentially, given a string and a regular expression (such as those used in emacs), I want to tell whether they match.

The simplest thing to do is to use Prolog grammar rules.

For example, suppose you want to make the regular expression:

b(|r)e(|a)d

Then you can write

re --> [b], ([] | [r]), [e], ([] | [a]), [d].

Terminal symbol T ==> [T]

Sequencing A B ==> A, B

Alternation A | B ==> A | B

Parentheses ( A ) ==> ( A )

The only thing which doesn't map directly onto definite clause grammar syntax is the Kleene star. There are two ways of handling that.

If your Prolog system handles variables in grammar rules correctly (a hint to look for is the predicate phrase/3; if that is not provided, chances are your Prolog doesn't handle variables in grammar rules), then do the following:

:- op(500, fx, *).
NT* --> [] | NT, NT* .

For example,
(abra)*ca(dabra)*

would be
re --> [a,b,r,a]*, [c,a], [d,a,b,r,a]* .

A more portable method is to take each closure and give it a name. One then defines that name as a non-terminal with two choices:

(XXX)* ==> xxx where xxx --> [] | XXX, xxx.

In this particular example:

re --> abras, [c,a], dabras.
abras --> [] | [a,b,r,a], abras.
dabras --> [] | [d,a,b,r,a], dabras.

Emacs also has

(XXX)+ ==> xxx 
            where xxx --> XXX, ([] | xxx).
. ==> [_]
[c...d] ==> [Char], {member(Char, "c...d")}

although the latter translation doesn't handle ranges.

For C, you should look at LEX, flex, and Alan Holub's "Compiler Construction in C".

gregory@cs.sfu.ca
Gregory Sidebottom
7th September 1992

It's actually really easy in Prolog. Assuming symbols are atoms, strings are lists of symbols, '.' is concatentation, '+' is alternation, and '*' is Kleene closure, all you really need is lines 1 and 3-8 of the following program:

1.	in(W,E) :- in(E, W, []).
2.	in(E) --> {var(E), !}, prefix(E).
3.	in(X) --> {atom(X)}, [X].
4.	in([]) --> [].
5.	in([X|L]) --> [X], in(L).
6.	in(E1.E2) --> in(E1), in(E2).
7.	in(E1+E2) --> in(E1) | in(E2).
8.	in(E*) --> in([]) | in(E.E*).
9.	in(E^0) --> [].
10.	in(E^N) --> {N > 0, M is N - 1}, in(E), in(E^M).
11.	prefix([]) --> [].
12.	prefix([X|W]) --> [X], prefix(W).

The call:

?- in(Word, Expression)

is true if the string Word is in the set of strings Expression represents. This program can be used to verify that strings match expressions, generate strings matching expressions, or even generate expressions matching strings. But it is easy to send it off into infinite loops, especially if you don't put bounds on the length of the string.

Note that line 1 is a Prolog clause and the rest is a DCG, which many Prologs translate automatically to Prolog (check in Sterling and Shapiro's "The Art of Prolog" for more information). Note also that most Prologs won't handle '.' as in infix operator without quotes and probably won't handle '*' as a postfix operator very well, so you might have to use other symbols. Just make sure that '*' binds tighter than '.'.

Line 2, which uses lines 11 and 12, allows you to have variables in expressions which get bound to substrings. This gives you more power then regular expressions. For instance, call

?- in(X, A.x.W1), in(Y, W2.x.A}

force X and Y to denote strings containing a distinguished 'x', where X's prefix before the 'x' is the same as Y's suffix after the 'x'.

Lines 9 and 10 allow you to use repetition expressions for the form E^i where E is an expression and i is an integer, which denotes the concatination of i strings from the set denoted by E.

This program is described further in the paper:

Greg Sidebottom and Fred Popowich, "Extending PATR with Path Patterns and Constraints," In Proc. of the Ninth Canadian Conf. on AI (AI'92), Vancouver, 1992.

milan@bcsaic.boeing.com
Milan Richter
9th September 1992

I wasn't entirely clear about what I wanted: a procedure, let's say

match_regexp( +String, +Regexp)

that succeeds iff a character string String matches the regular expression Regexp. Both String and Regexp are represented as lists of characters. In other words, I need to be able to supply the regular expression as an argument. A grammar representing one specific expression will not do.

I received several pointers to C implementations of regular expression matchers:

regexp by Henry Spencer of Univ. of Toronto
/usr/include/regexp.h supplied with SunOS
wildmatch by Karl Heuer
GNU regular expression library

Thanks to all who helped. For my purposes, the /usr/include/regexp.h appears to fit best. I intend to link it with my Quintus Prolog code.

ok@goanna.cs.rmit.oz.au
Richard A. O'Keefe
10th December 1992

I would like to add a footnote about /usr/include/regexp.h

Using /usr/include/regexp.h to provide regular-expression string matching has an advantage and a disadvantage. The advantage is that your program will be consistent with other programs, not necessarily written in Prolog. The disadvantage is that it may not work. I recently ran across a System V Release 3.2 implementation where GCC complained about the function getrnge(). GCC was right: the source code assumed unsigned chars, while the system in question has signed chars. You would do far better to get a copy of Henry Spencer's free implementation.

No Prev