Strings vs Atoms

Appeared in Volume 7/3, August 1994

Keywords: strings, atoms.


deforest@indirect.com 
Darin Deforest
21st April 1994

I have a running argument with someone I work with. The argument deals with whether a string is more efficent to store in Prolog than the same characters represented as a term. I argue that a string is represented as a list of integers and thus requires a cons predicate and two cells (or boxes) for every characters, while a atom can be stored as simply a sequence of bytes. And what's more important, if the same atom occurs more than once, the same set of characters (the print representation) are shared, plus look-up of an atom is faster since hashing can be employed, while for a string unification needs to occur.

Am I off-base here? I realize that different versions of Prolog may optimize slightly differently but I can't forsee how a string can be represented (in space resources) any better than an atom.


jan@swi.psy.uva.nl 
Jan Wielemaker
22nd April 1994

Most WAM-based machines require two machine-words for storing a list-cell. Thus, the amount of space occupied by "text" would be 8-words which is 32 bytes (normally). Of course, if you generate "text" twice, 2 copies will be around. "text" lives on the runtime stacks and is discarded on backtracking in all Prolog systems and properly garbage collected in any Prolog system having a garbage collector.

'text' lives in the atom- or program-space. The exact amount of space occupied by it depends on the representation chosen. Normally it should be 5 bytes (possible rounded up to the word-boundery to 8) for the ASCII values, 2 words for the cell in the hash-table and a little for the main hash-table array. Some additional storage may be involved for system-specific book-keeping. I would estimate about 20-bytes for our 'text' example, adding 1 byte for each additional character. Only a few commercial Prolog systems however will reclaim the space occupied by atoms.


mcovingt@aisun3.ai.uga.edu 
Michael Covington
22nd April 1994

Darin Deforest writes:

Am I off-base here?

You are exactly right. Furthermore, some Prologs have a "compact string" representation ($like this in Arity Prolog$) which is stored compactly (like an atom) but does not take up an entry in the symbol table.


johowland@aol.com 
Jo Howland
24nd April 1994

I hardly see this as an advantage. All it means to me is that they didn't get atoms right. Atoms can be garbage collected just as easily as what they call "compact string"s.


andrew@cee.hw.ac.uk 
Andrew Dinn
25th April 1994

If I understand Michael correctly then he is talking about what Lispers call CDR-coded lists i.e. a block of memory containing successive members of a list in contiguous storage thereby avoiding the need to allocate two mempory cells per list cell. This gives an obvious real big advantage i.e. half the list cell storage that would be required for the normal representation. Furthermore, use of such lists is beneficial when it comes to garbage collection since it allows block copying, a less obvious but perhaps equally important advantage.

I will object to the garbage collection comment on principle, the principle being 'it depends'. Most systems which leave atoms in an atom table even after they are no longer referenced from code/data end up reusing the same atoms when they reappear (which they often do). Sometimes (especially if programmers try to be clever and construct atoms to hold vast amounts of text :-) they end up hanging on to large amounts of redundant memory. However, given the appropriate pattern of usage, garbage collecting system could end up recreating the same atoms again and again, wasting lots of CPU and churning over lots of memory. So getting it 'right' depends on what you want it to be 'right for'. As usual the questions is not clear-cut until you get down to nitty gritty by which time the answer is not generally valid.


ok@goanna.cs.rmit.oz.au 
Richard A. O'Keefe
26th April 1994

No, he wasn't talking about CDR-coding, but "packed vector of byte", the kind of string representation you get in Pop, Lisp, C, ...

In practice, CDR-coding is not a win. To start with, the way Prolog code tends to build lists is from front to back, with CDRs initially being unknown and later being instantiated. This defeats CDR-coding, which has to know _when the cons cell is allocated_ what the CDR is.

Second, CDR-coding costs run-time. Xerox Quintus Prolog is/was a version of Quintus Prolog that runs/ran on Xerox Lisp Machines. These machines ran a microcoded implementation of Interlisp-D, which used CDR-coding. The microcode for XQP was done by an experienced hacker who had done a lot of work on the Lisp microcode. The Prolog microcode for car/cdr was a lot smaller and simpler, and could be replicated in several places. The net effect was that simple list processing ran about 5 times faster in Prolog than in Lisp, on the same hardware, microcode by the same skilled hacker. (My benchmarks, not the latest version.)

The CDR-coded Lisps that I have knowledge of didn't move anything during garbage collection, so they didn't use any sort of copy, still less a "block" copy. (Note that CDR-coded lists don't have a length field, just some code that says whether the CDR is nearby or not, so


johowland@aol.com 
Jo Howland
22nd April 1994

First some terminology: what you call "strings" I call "char lists", which I guess is short for character codes lists. Some Prologs have an elementry type called strings, that are neither "atoms" nor "char lists".

The important thing to remember is that parsing char lists is much faster and elegant than parsing atoms. Prologs DCG's work on lists. Of course storing things as lists of character codes is a silly thing to do, since it wastes so much space.

I've written in Prolog lots of small utility programs that start out by reading in a file as one large char list, and then parses it.


pets@munta.cs.mu.OZ.AU 
Peter Schachte
28th April 1994

Darin Deforest is mostly right. Storing a string of characters as a list of character codes (a "chars") costs 8 bytes per character in most Prolog implementations, while an atom probably costs 1 byte per character, plus a fixed overhead, which is probably in the neighborhood of 8 to 32 bytes. So a character string of more than a half dozen characters or so will occupy more storage space as a chars than as an atom.

But be careful! In most Prolog systems, atoms are never garbage collected, so while an atom is usually smaller, it will be there forever. And if you want to do any manipulation or examination of character strings, chars is the preferred representation. In fact, it's the only representation: you want to look inside an atom, you'll have to turn it into a chars by calling name/2 or atom_chars/2 anyway. And note that name/2 and atom_chars/2 are fairly expensive.

Generally speaking, chars is usually the preferred representation for character strings. One exception to this is in constant character strings that are to be printed. If you write a predicate put_chars/1 to write out a chars (or use it from a library), you're still better with the line

write('hello, world')

than you are with

put_chars("hello, world")

In the former case, you have the atom 'hello, world' hogging space in your atom table, but in the latter case, you have the code to generate the chars "hello, world" hogging up code space, which is almost as permanent, and will be much bigger.


ok@goanna.cs.rmit.oz.au 
Richard A. O'Keefe
26 April 1994

One of the hardest lessons for many programmers to learn is:

	STRINGS ARE WRONG.

If you represent a string as a list of integral codes, then each character requires one cons cell. In a good Prolog implementation, a cons cell is likely to cost 2 machine "words", say 64 bits. (On a micro-computer it is quite possible to implement list cells so that they occupy just two heap addresses. So a PC Prolog *could* have had 32-bit list cells. With today's >=4Mb PCs, that's not an issue any more.) Is this really bad? Well, not if you consider the future. The current standard character set is ISO 10646, which is in principle a 32-bit character set. Currently only 16 bits worth of characters are defined, but it's due to overflow that limit fairly shortly. (The Unicode people did a wonderful job of compressing some many character sets into 16 bits, but as I said, they're running out of space.)

There are opportunities for sharing in lists. If I want to represent the strings 'ark" and "bark", I can do:

	X = "ark",
	Y = [0'b|ark].

This opportunity for sharing also exists with atoms, but is rarely exploited.

Some Prolog implementations have small limits on the size of the atom table (one Prolog I've used limited it to 907 atoms).

But the main point is that representing anything as a string is almost a mistake. If something can be represented more abstractly (with characters being generated only when it is time to print the thing) it almost always should be so represented.

Michael Covington writes:

The main use of such things is communicating with other programming languages. I am deadly serious when I say that strings are just about the worst possible representation for anything, including natural language text.

To give you one example of the kind of fun that's coming, consider the fact that ISO 10646 includes both "floating diacriticals" (e.g. sequences of codes like "e,grave accent" are possible) and a large number of precomposed characters (e.g. "e grave" exists). You really don't want to have to worry about this kind of stuff in the guts of your program. Have an input phase that normalises the code sequence from a file and presents it to you as a list of words (perhaps even a list of morphemes), and never ever have a string as such. There are huge space savings to be made that way.