
The Complexity of DCGs

Appeared in Volume 7/2, May 1994

Keywords: DCGs.

[This news thread was kindly supplied by Norbert E. Fuchs (fuchs@ifi.unizh.ch). Ed.]


From: dowding@ai.sri.com (John Dowding) 
Norbert E. Fuchs writes:

The worst-case complexity of DCGs is exponential. Does anyone have information on the average-case complexity?

Fernando Pereira writes:

The worst-case complexity of DCGs is *infinite* (-:), since DCGs are Turing-equivalent. I believe what you are referring to is the use of Prolog to execute CFGs (or any DCG whose set of rule ground instances is finite). If no left recursion is present, then indeed the worst-case parsing performance is exponential. But this is not a property of DCGs, rather a property of top-down backtrack parsing. As for the average case, average over what? It all depends on the grammar and the grammar processor.
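To make the cost of top-down backtracking concrete, here is a small constructed example (assuming a standard Prolog with phrase/2, length/2 and maplist/2): a deliberately ambiguous DCG in which recognizing a non-sentence forces the parser to try every segmentation of the input before failing.

% as: one or more a's (matches any nonempty run)
as --> [a], as.
as --> [a].

% seq: one or more runs of a's; every way of cutting the input
% into nonempty runs is a distinct analysis, so a string of
% n a's has 2^(n-1) of them
seq --> as, seq.
seq --> as.

% s succeeds only if the input ends with the token 'stop'
s --> seq, [stop].

% ?- length(As, 20), maplist(=(a), As), phrase(s, As).
% fails, but only after exploring all 2^19 segmentations.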

Norbert E. Fuchs writes:

Maybe I should have used the word "experience" to avoid misunderstandings. What is the experience with DCGs that people have developed for large(r) projects, e.g. DCGs for NL (Chat-80, Pundit, etc.), or DCGs for compilers?

Pundit does not use DCGs but Restriction Grammar, which is much more powerful than CFGs. The Restriction Grammar is then translated into a form that is very close to DCGs, but one that does use infinite-valued features.

I am not sure what you mean by 'experience' in this case. The comments that I can make are fairly obvious:

Parsing can be slow, very very slow.

Parsing time dominated overall execution time.

Parsing time increased with the size of the grammar.

A poorly written grammar rule could kill you.

It is a good idea to collapse rules like this:

a --> b, c.
a --> b, d.
a --> b, e.

to a rule like this:

a --> b, (c ; d ; e).

or create new nonbranching rules like this:

a --> b, a1.

a1 --> c.
a1 --> d.
a1 --> e.

In general, the lower in the grammar that you can put your disjunctions, the less speculative processing you will do.

If you are not trying to find all analyses (you probably aren't in a syntax checker), then you can reorder disjunctions to improve efficiency: either put the most common option first, or put the easiest-to-rule-out option first.
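For instance (a hypothetical fragment, assuming np and s are defined elsewhere), if complements are far more often noun phrases than embedded clauses, a parser that stops at the first analysis does less work with the common case first:

complement --> np.    % common case first
complement --> s.     % rare, more expensive case last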

If you have a rule that contains explicit words, you can eliminate some work by doing look-ahead:

a --> big_category, [to, paris].

Make sure that "to paris" occurs in the input string before trying to parse big_category.
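One way to implement the check is an auxiliary nonterminal that inspects the remaining input without consuming it. The following is a minimal sketch (the name lookahead is illustrative, not a standard predicate):

a --> lookahead([to, paris]), big_category, [to, paris].

% lookahead(+Words): succeed iff Words occurs as a contiguous
% subsequence of the remaining input; consumes nothing
lookahead(Words, S0, S0) :-
    append(_, Suffix, S0),
    append(Words, _, Suffix),
    !.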

Norbert E. Fuchs writes:

The background to my request is that a student of mine developed, in the form of DCGs, a lexical scanner and a syntax checker for a slightly modified version of the Z specification language standard. The runtime behaviour of the syntax checker is absolutely unacceptable, and seems to come close to the worst case.

If you are really doing context-free parsing, then I think it is a good idea to use some sort of tabular parsing. Even if you are not parsing CFLs, tabular parsing is the way to go.


From: pereira@alta.research.att.com (Fernando Pereira) 

This is generally good advice for natural language, but for formal languages like Z, which are designed to be (at least globally) unambiguous, a less heavyweight approach is to transform the grammar into an equivalent one that, when executed by Prolog, follows the same sequence of steps as the original grammar would under a bottom-up shift-reduce processor. (The transformation has to be done by hand; I don't know of any practical mechanical aid for it. There are ways of applying standard CFG techniques to the CF backbone of a DCG, but they don't deal with arguments in general.) With appropriate lookahead and preserved state information, such a grammar can often be made deterministic. This is no more than a hand-adaptation of LR parsing techniques to DCGs.

In other words, the typical problem for artificial languages is not that Prolog per se is slow, but that a naive grammar does not have enough information about context to avoid local ambiguity, and thus its execution by Prolog requires backtracking. An example is the grammar for Prolog in the public-domain Prolog implementation of read/1, which amounts to a hand-built transformation of a DCG for Prolog into a program that, when executed by Prolog, behaves as a "left-cornerish" parser.
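As a toy illustration of the shape such a transformed grammar takes (a constructed sketch, not the actual construction used for read/1), here is a hand-compiled shift-reduce recognizer for the left-recursive grammar e -> e '+' t, e -> t, t -> a | b | c. The stack is an ordinary Prolog list, written top-first, and the cuts commit each shift or reduce step, so no backtracking occurs:

% parse(+Tokens): deterministic shift-reduce recognition of e
parse(Tokens) :-
    sr([], Tokens).

sr([e], []) :- !.                                 % accept
sr([t, '+', e|S], In) :- !, sr([e|S], In).        % reduce e -> e '+' t
sr([t|S], In) :- !, sr([e|S], In).                % reduce e -> t
sr([Id|S], In) :-
    member(Id, [a, b, c]), !, sr([t|S], In).      % reduce t -> Id
sr([e|S], ['+'|In]) :- !, sr(['+', e|S], In).     % shift '+'
sr(S, [Id|In]) :-
    member(Id, [a, b, c]), sr([Id|S], In).        % shift an identifier

% ?- parse([a, '+', b, '+', c]).   succeeds without backtracking
% ?- parse([a, '+']).              fails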


From: feliks@carlstedt.se (Feliks Kluzniak) 

Doesn't everything depend on the kind of grammar we are using?

If the grammar is LL(1), i.e. suitable for deterministic top-down parsing, then it is a very easy matter to obtain a rather efficient DCG parser by systematic addition of cuts (these are needed to avoid falling into a black hole when parsing a non-sentence). It is even possible to provide sensible error diagnostics.
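A minimal sketch of the technique on a hypothetical statement grammar (condition and expression assumed defined elsewhere): once the discriminating first token has been read, the cut commits to that rule, so a non-sentence makes the parser fail, or report an error, instead of backtracking into the black hole.

statement --> [if], !, condition, [then], statement, else_part.
statement --> [while], !, condition, [do], statement.
statement --> [Id], { atom(Id) }, [:=], expression.

else_part --> [else], !, statement.
else_part --> [].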

Several years ago I wrote a tool that checks whether an EBNF grammar is LL(1) and (if it is) converts it to a reasonably efficient DCG with error detection. (This has been published only as an internal report at Bristol University; see also a letter to the editor in "The Computer Journal", vol. 35, no. 3 (1992)).


From: fuchs@ifi.unizh.ch (Norbert E. Fuchs) 

In response to my request I received a number of answers, some in this newsgroup, some by direct email. Thanks to all who responded.

Nevertheless, my question "What is the run time experience with DCGs for large projects?" remained unanswered. Most people suggested improving parsing efficiency by using bottom-up, left-corner, tabular, or other forms of parsing instead of the recursive-descent parsing that results from Prolog's direct execution of DCGs.

As it happened, yesterday our librarian gave me an inspection copy of Michael Covington's new book 'Natural Language Processing for Prolog Programmers'. In section 6.7 entitled 'Which parsing algorithm is really the best', Covington reports some sobering results: with "better" algorithms (BUP, left-corner, chart, Earley) parsing gets slower and slower. Though Covington points out that his results were achieved with a very small grammar, the sad results remain.

In section 6.7.2 Covington discusses the complexity of parsing algorithms. He writes: "... the n**3 [of Earley's algorithm] and the k**n [of recursive-descent] results apply only to the worst case, i.e. the situation in which the parser guesses wrong as much as possible and uses the right rule only after trying all the wrong ones. It is common for a k**n-time parser to take much less than k**n time on any particular parse, because it is almost certain to guess right some of the time and thereby avoid unnecessary work." My original question referred exactly to these "average" cases.


From: dowding@ai.sri.com (John Dowding) 

I think that these comments apply in cases where either your grammar is unambiguous, or (if ambiguous) you don't want to generate all possible analyses. As Fernando points out, since the grammar for Z is presumably unambiguous, and a style-checker would not need to find all possible analyses even if it were not, you can probably hand-build a deterministic parser.

Since programs like Chat and Pundit are parsing with ambiguous grammars for NL, I don't think that experience with them will be all that relevant to parsing Z.


From: fuchs@ifi.unizh.ch (Norbert E. Fuchs)  

In fact, the Z language - or rather its grammar published as BNF in the Z Base Standard - is not so well behaved. The grammar is not small (a more or less direct translation BNF -> DCG gave 165 grammar rules), most rules are complex, the grammar is indirectly left-recursive (which we eliminated by local transformations), it is ambiguous, and, as published, incomplete and inconsistent. Also, we wanted more than a style-checker. In its current implementation, the parser generates a parse tree.
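The local transformation involved is, modulo arguments, the textbook elimination of left recursion (indirect left recursion must first be unfolded into the direct form). A schematic example, assuming term is defined elsewhere:

% Left-recursive original; loops under Prolog's top-down execution:
%   expr --> expr, [+], term.
%   expr --> term.

% Equivalent right-recursive form:
expr --> term, expr_rest.

expr_rest --> [+], term, expr_rest.
expr_rest --> [].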

Because of the shortcomings of the grammar, it seems that our approach (translation of BNF to DCGs, recursive-descent parsing), which was dictated by extreme time constraints on the student's thesis and by our intention to adhere closely to the Z standard, was not optimal. We intend to clean up the Z grammar first, before we decide on a parsing strategy.


From: ted@crl.nmsu.edu (Ted Dunning)  

Does anybody find it ironic that the published grammar for a language which is designed to support formal program verification is *wrong*?


From: pereira@alta.research.att.com (Fernando Pereira) 

My comments were based on actual experience with substantial DCGs or DCG-like grammars (e.g. constraint-based grammars compilable into Prolog) since 1977. There is no such thing as "average case" here, but as grammars become larger, tabular methods easily outperform direct execution by Prolog, at least for NL grammars. As John Dowding and I noted, the situation may be different for artificial, unambiguous languages like Z, for which DCG versions of standard grammar compilation techniques can be used to convert the initial DCG into another one that runs deterministically when executed directly by Prolog (the overheads of tabulation are not needed here; I gave as an example the public-domain Prolog implementation of read/1).

Norbert E. Fuchs writes:

As it happened, yesterday our librarian gave me an inspection copy of Michael Covington's new book 'Natural Language Processing for Prolog Programmers'...

I think this is rather misleading, at least as you put it (I haven't seen the book in question yet). The question of tabulation and the question of parsing direction (bottom-up vs. top-down) are to some extent independent. Tabulation is costly in Prolog because it involves using assert, a fairly expensive operation. Tabulation is what yields n**3 asymptotic complexity for CFGs and "well-behaved" DCGs (I'll leave a definition of "well-behaved" in this sense for another time). This better asymptotic complexity is essential for large NL grammars like the one in the SRI Core Language Engine. With specialized tabulation engines like the XWAM, it should be possible to reduce the constants and make tabulation even more competitive. Further improvements may be achievable by avoiding wholesale copying of tabulated terms in favor of an appropriate structure-sharing technique, like the one proposed by Eric de la Clergerie. But even with the overheads of assert, tabulation is crucial for large, ambiguous grammars.
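As a rough sketch of what assert-based tabulation involves (a naive illustration of the general idea, not the CLE's or the XWAM's machinery; np/2 is the usual DCG expansion of an np nonterminal), here is a tabled wrapper for a single nonterminal. Note that this naive version still loops on left-recursive grammars, which real tabling engines handle:

:- dynamic np_table/2, np_done/1.

% np_tab(?S0, ?S): tabled version of np/2. The first call at a
% given input position computes and stores all analyses; later
% calls at that position are pure table lookups.
np_tab(S0, S) :-
    (   np_done(S0)
    ->  true
    ;   forall(np(S0, S1), assertz(np_table(S0, S1))),
        assertz(np_done(S0))
    ),
    np_table(S0, S).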

Bottom-up techniques (and left-corner techniques to a lesser extent) provide a parser with more context information, and thus can make it more deterministic (although the additional context information may itself be a problem if determinism is not complete). Artificial languages are often designed to be deterministic with small lookahead (e.g. LALR(1)). In these cases, a corresponding "compilation" of the original DCG into a (deterministic) pushdown automaton represented in Prolog is an excellent solution, because search is avoided and the number of computation steps to complete a parse is of the same order as the number of steps in a successful path in the top-down search space (O(number of steps in the derivation)). Thus such constructions always win for deterministic languages, although they may introduce a (small) constant factor that hides the advantage for short inputs (for long inputs the backtracking costs dominate).

The reason why there are no analogues of YACC for DCGs around is that dealing with arguments messes up the termination of grammar compilation algorithms. It is possible to sidestep this issue by doing grammar compilation on a CFG approximation of the DCG (e.g. its CF backbone) and attaching the appropriate constraints to the resulting PDA, since the moves of the PDA (shifts and reductions) can be connected to uses of the rules of the original grammar. The problem is that sometimes the arguments carry essential context information without which the resulting PDA has conflicts; it is however possible to include any bounded part of that information in the construction (in other words, to use a tighter approximation). (An early proposal along these lines is in a paper by Ulf Nilsson in 5th Generation Computing, for which I unfortunately cannot find the exact reference; it was given in this newsgroup several months ago.)

If BU techniques leave residual nondeterminism, one can tabulate the PDA to replace exponential search by polynomial dynamic programming. General techniques for doing this have been developed by Bernard Lang and his students, but practical implementations for the DCG case still need to be developed.

The difficulty with much of what I've described above is that there are no "ready-to-use" tools to do the constructions automatically, and hand-compilation is laborious and error-prone. As I said a while ago, there's an interesting project for a graduate student somewhere here...

References.

Bernard Lang, Deterministic Techniques for Efficient Non-deterministic Parsers, in J. Loeckx (ed.), Proceedings of the 2nd Colloquium on Automata, Languages and Programming, pp. 255-269, Springer-Verlag, 1974.

Bernard Lang, Complete Evaluation of Horn Clauses: an Automata Theoretic Approach, Rapport de Recherche No. 913, INRIA, November 1988.

Sylvie Billot and Bernard Lang, The Structure of Shared Forests in Ambiguous Parsing, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 143-151, ACL, 26-29 June 1989.

Eric Villemonte de la Clergerie, Layer Sharing: an Improved Structure-Sharing Framework, ACM Symposium on Principles of Programming Languages, pp. 345-356, 1993.


From: ted@expert.demon.co.uk (Ted Walker) 

DCG Performance

We have developed an alternative mechanism for the support of DCGs in Prolog. Rather than the standard recursive descent, we execute the grammar using a breadth-first, bottom-up chart parser.

Using a 108-rule grammar of English, the performance for two test cases (the first made up of short simple sentences, the second of compound sentences of about 30 words) is as follows:


        Input           Time (s)        Time (s)
        K bytes         Simple          Complex
          1               7.0             20.8
          2              15.8             51.2
          4              33.2               -
          8              71.8            231.4
         16             145.7               -
         32             289.9            946.9

The tests were performed using Prolog-2 for Windows on a 386 PC.
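For readers who have not seen one, a minimal bottom-up chart recognizer looks roughly as follows (an illustrative sketch with a toy grammar, not the Prolog-2 system measured above; rules are restricted to binary bodies for brevity, and edge(I, J, C) records that category C spans input positions I to J):

:- dynamic edge/3.

rule(s,  np, vp).      % toy grammar
rule(np, d,  n).
rule(vp, v,  np).

word(the, d).  word(dog, n).  word(cat, n).  word(saw, v).

% recognize(+Words): bottom-up chart recognition of category s
recognize(Words) :-
    retractall(edge(_, _, _)),
    lex_edges(Words, 0, N),
    edge(0, N, s).

% enter a lexical edge for every word
lex_edges([], N, N).
lex_edges([W|Ws], I, N) :-
    J is I + 1,
    forall(word(W, C), add_edge(I, J, C)),
    lex_edges(Ws, J, N).

% add_edge(+I, +J, +C): record an edge (once) and close the chart
% under the rules by combining it with its neighbours
add_edge(I, J, C) :-
    edge(I, J, C), !.                      % already in the chart
add_edge(I, J, C) :-
    assertz(edge(I, J, C)),
    forall(( rule(P, C, R), edge(J, K, R) ),
           add_edge(I, K, P)),
    forall(( rule(P, L, C), edge(H, I, L) ),
           add_edge(H, J, P)).

% ?- recognize([the, dog, saw, the, cat]).   succeeds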


From: raaijmakers@rulxho.leidenuniv.nl (Stephan Raaijmakers) 

For what it's worth, I have developed a small (Prolog) compiler from (deterministic) DCGs into Marcus grammars (grammars for Mitch Marcus' deterministic PDA parsing strategy). I did not run into the problems of non-termination that Fernando Pereira mentioned; it could be that I missed something.


From: pereira@alta.research.att.com (Fernando Pereira) 

Bernard De Cuyper writes:

I have used an idea of Uhlmann's to transform left-recursive DCGs into right-recursive DCGs. The idea seems to be OK, but I have no idea whether it will work all the time.

No, it won't work in general if sufficiently complex argument passing is used (non-termination). However, there is a generalization of Greibach Normal Form for DCGs that can always be built, developed by Marc Dymetman in his 1992 Université de Grenoble dissertation 'Transformations de Grammaires Logiques et Réversibilité en Traduction Automatique'. Marc should be reachable by e-mail at dymetman@ccrit.doc.ca. It's a nice idea.
