Enabling Multi-Threaded Execution in SWI-Prolog


Jan Wielemaker
Social Science Informatics (SWI),
University of Amsterdam,
Roetersstraat 15, 1018 WB Amsterdam, The Netherlands,

jan@swi.psy.uva.nl

Area Editor: Roberto Bagnara


Abstract

We have explored the consequences of adding native preemptive threads to the originally single-threaded SWI-Prolog system. Unlike classical approaches such as Parlog, SWI-Prolog's approach follows recent initiatives to have multiple completely separated Prolog engines sharing the same predicates. In this article we describe multi-threading from the Prolog user as well as from the implementation aspects. It includes metrics comparing the multi-threaded version to the single-threaded version as well as speedup that can be achieved using SMP hardware.

Multi-threaded SWI-Prolog is in its late beta-releases and will become the default on all systems that provide the required primitives shortly.

1 Introduction

In todays computing world there is an important reason for concurrency: being able to talk to multiple agents in the outside world concurrently. We see programmers tackle this challenge in a variety of ways. The two most popular are embedding of a Prolog engine in a concurrent programming environment (C, C++, Java, etc.) which serialises requests on the Prolog engine or using event-driven programming. Neither is very attractive. Embedding is hard to program, requires skills in at least two languages and the resulting program is often hard to debug and maintain. Event-driven programming, realising multiple state-engines to handle events from multiple input sources is a non-trivial task at best.

A much cleaner solution can be realised using multiple Prolog threads, especially if native preemptive threading is used, avoiding complications with blocking system calls and exploiting SMP hardware if available.

Initially most of the research aimed at running multiple provers sharing at all Prolog data [3]. This approach is difficult to implement and the role of backtracking is complicated at best. Recently a number of systems (CIAO, Qu-Prolog, Jinni) have adopted threads only sharing the Prolog database. As access to the database is generally structured and localised in the implementation the implementation is relatively simple, while programming such systems is very close to programming a classical single-threaded Prolog system.

In this article we describe our API choices and experience implementing native preemptive threads in SWI-Prolog. In Section 2 we summary our rationale. Next we describe what constitutes a thread and what primitives are used to make threads communicate and synchronise. Section 5 describes implementation experience, followed by performance analysis in Section 6, issues and conclusions.

2 Requirements

Smooth cooperation with (threaded) foreign code
Prolog applications operating in the real world often require substantial amounts of `foreign' code for interaction with the outside world: window-system interface, interfaces to dedicated devices and networks. Prolog threads must be able to call arbitrary foreign code without blocking the other (Prolog-) threads and foreign code must be able to create, use and destroy Prolog engines.

Simple for the Prolog programmer
We want to introduces few and easy to use primitives to the Prolog programmer.

Robust during development
We want to be as robust as feasible during interactive use and the test-edit-reload development cycle. In particular this implies the use of synchronisation elements that will not easily create deadlocks when used incorrectly.

We have chosen to use the POSIX thread (pthread) library [2] as a basis for its clean design and the availability on many platforms. On Windows we use a mixture of pthread-win32 and the native Win32 thread-API.

3 The Prolog Thread-API

A Prolog thread is a native (POSIX) thread with a Prolog engine capable of proving a goal. A Prolog engine consists of the runtime stacks and required state for proving goals independently. In this view a Prolog term is entirely local to an engine and can only be communicated through other engines through the shared (code-)area. Threads share all other global data, such atoms, predicates, modules, records, streams, etc.

3.1 Creating Threads

thread_create(:Goal, -Id, +Options)
Create a thread which immediately starts executing Goal. Id is unified with the thread-identifier. The description of all Options is beyond the scope of this article. It allows for naming the thread, setting stack-size limits, etc. The thread_create/3 call completes immediately.

The new Prolog engine runs completely independent. If the thread is attached, any thread can wait for its completion using thread_join/2. Otherwise all resources are reclaimed silently on its completion.

thread_join(+Id, -Result)
Wait for the thread Id to finish and unify Result with the completion status, which is one of true(), false() or exception(Term).

3.2 Synchronisation

The most difficult aspect of multi-threaded programming is the need to synchronise the concurrently executing threads: ensure they use proper protocols to exchange data and maintain data consistency of shared-data as maintained in dynamic predicates.

Our primarily synchronisation primitive is a FIFO (first-in-first-out) queue of Prolog terms, an approach found in other programming languages under the name port if one queue is attached to each thread or channel if a queue is a stand-alone object. SWI-Prolog supports both ports and channels. Channels were added because they simplify the implementation of the popular worker-crew model significantly as illustrated in Section 4. Queues are manipulated through the following predicates:

message_queue_create(-Queue)
Create a FIFO message queue (channel). Message queues can be read from multiple threads. Each thread has a message queue (port) attached as it is created. Channels are used to realise a worker-crew (Figure 3).

thread_send_message(+QueueOrThread, +Term)
Add a term to the given queue or default queue of the thread. Return immediately. (1)

thread_get_message([+Queue], ?Term)
Get a message from the given queue (channel) or default queue (port). The first message that unifies with Term is removed from the queue and returned. If multiple threads are waiting, only one will be given the term. If the queue has no matching terms, execution of the calling thread is suspended.

To make consistent changes to shared data or execute code not designed for threading, SWI-Prolog provides access to mutex objects (2) to serialise access to a code-fragment. The principal API is with_mutex/2 as described below. Wrapping mutex ownership in a goal allow to unlock the mutex on failure or exception without user concern. This is vital as a mutex held incorrectly by a thread easily deadlocks the application. For convenience and to simplify enabling existing applications with thread-support, SWI-Prolog mutex objects are recursive. To facilitate code that can run in both the single- and multi-threaded version, the single-threaded version provides with_mutex/2 simply calling once/1.

with_mutex(Name, :Goal)
Execute Goal as once/1 while holding the named mutex.

Debugging interaction as well as manager threads are simplified using the last communication primitive called thread_signal/2, which is similar to Qu-Prolog's thread_push_goal/2. Figure 1 show this predicate in action to start the command line tracer in another thread, while Figure 2 shows it interacting with a worker for a work-crew.

Figure 1: Attach a console and start the debugger in another thread.

Worker Manager
worker(Queue) :-
   thread_get_message(Queue, Work),
   catch(do_work(Work), stop, cleanup),
   worker(Queue).
   ...
   thread_signal(Worker, throw(stop)),
   ...
Figure 2: Stopping a worker using thread_signal/2. Bold fragments show the relevant parts of the code.

thread_signal(+Thread, :Goal)
Make Thread execute Goal on the first opportunity. `First opportunity' is defined to be the first pass through the call-port or foreign code calling PL_handle_signals(). The latter mechanism is used to make threads handle signals during blocking I/O, etc.

3.3 Predicates

All predicates, both static and dynamic, are shared between all threads. Changes to static predicates only influence the test-edit-reload cycle, which is discussed in Section 5. For dynamic predicates we kept the `logical update semantics' as defined by the ISO standard [4]. This implies that a goal uses the predicate with the clause-set as found when the goal was started, regardless of whether or not clauses are asserted or retracted by the calling thread or another thread. The implementation ensures consistency of the predicate as seen from Prolog's perspective. Consistency as required by the application such as clause-order and consistency with other dynamic predicates must be ensured using synchronisation as discussed in Section 3.2.

We introduced the notion of thread-local predicates to deal with code that uses dynamic predicates to store intermediate results of a computation. Purists will advocate this is almost invariably bad programming style, but if one examines real-life code written in the Prolog community it is commonly found. Without a new construct, such code must be re-written to store their data in Prolog terms, something that may require a complete redesign or the assert-, retract- and query-code must add an extra argument to indicate the calling thread or the code must be fully serialised (i.e. only one thread may run it at any time). Re-write is often infeasible, adding an extra argument can seriously harm performance and makes it hard to maintain a single and a multi-threaded version in parallel, while serialisation breaks the whole point of multi-threading. The declaration below is similar to declaring foo/2 and bar/1 as dynamic and volatile.


:- thread_local
        foo/2,
        bar/1.

Thread-local predicates have the following properties:

4 A Short Example

This section describes the implementation of a simple network service. The SWI-Prolog socket commands are described in http://www.swi-prolog.org/packages/clib.html. Our service handles a single TCP/IP request per connection, using a specified number of `worker threads' and a single `accept-thread'. The accept-thread executes acceptor/2, accepting connection requests and adding them to the queue for the workers. The workers execute worker/1, getting the accepted socket from the queue, read the request and execute process/2 to compute a reply and write this to the output stream. After this, the worker returns to the queue for the next request.

:- use_module(library(socket)).

make_server(Port, Workers) :-
   create_socket(Port, Socket),
   message_queue_create(Queue),
   forall(between(1, Workers, _),
   thread_create(worker(Queue), _, [])),
   thread_create(acceptor(Socket, Queue), _, []).

create_socket(Port, Socket) :-
   tcp_socket(Socket),
   tcp_bind(Socket, Port),
   tcp_listen(Socket, 5).
acceptor(Socket, Queue) :-
   tcp_accept(Socket, Client, _Peer),
   thread_send_message(Queue, Client),
   acceptor(Socket, Queue).


worker(Queue) :-
   thread_get_message(Queue, Client),
   tcp_open_socket(Client, In, Out),
   read(In, Command),
   close(In),
   process(Command, Out),
   close(Out),
   worker(Queue).

process(hello, Out) :-
   format(Out, 'Hello world! n', []).

Figure 3: Implementation of a multi-threaded server. Threading primitives are set in bold. The left column builds the server. The top-right runs the acceptor thread, while the bottom-right contains the code for a worker of the crew.

The advantages of this implementation over a traditional single-threaded Prolog implementation are evident. Our server exploits SMP hardware and will show much more predicable response-times, especially if there is a large distribution in time required by process/1. In addition, we can easily improve on it with more monitoring components. For example, acceptor/2 could immediately respond with an estimated reply-time, or commands can be provided to examine and control activity of the workers. All this can be achieved without complicating the implementation of process/2 with code to be aware of its environment.

5 Implementation Issues

We tried to minimise the changes required to turn the single-engine and single-threaded SWI-Prolog system into a multi-threaded version. As explained by Butenhof [2], code for multi-threaded usage should be designed starting from shared data. Changing an existing implementation is often a compromise.

For the first implementation we split all global data into three sets: data that is initialised when Prolog is initialised and never changes afterward, data that is used for shared data-structures, such as atoms, predicates, modules, etc. and finally data that is only used from a single engine such as the stacks and state of the virtual machine. For each of these we defined a large nested C-structure. In the single-threaded version we have a single globally defined instance of all three. In the multi-threaded version the local structure is dynamically allocated and associated with the thread using pthread_setspecific().

Update to shared data was split into a number of independent subsets, each protected by its own mutex object.

In the second phase, fetching the current engine using pthread_getspecific() was reduced by caching this information inside functions that use it multiple times and passing it as an extra-variable to commonly used small functions as identified using the gprof [5] profiling tool. Mutex contention was analysed and reduced from some critical places:

All predicates
used reference-counting to clean up deleted clauses after retract/1 for dynamic or (re-)consult/1 for static code. Dynamic clauses require synchronisation to make changes visible and cleanup erased clauses, but static code can do without. Reclaiming dead clauses from static code as a result from the test-edit-reconsult cycle is left to a garbage collector that operates similar to the atom-garbage collection described in Section 5.1.

Permanent heap allocation
uses a pool of free memory chucks associated with the thread's engine. This allows threads to allocate and free permanent memory without synchronisation.

5.1 Garbage Collection

Normal (stack-) garbage collection is not affected by threading and continues completely concurrently as it only refers to the stacks. This is an important observation as it allows for threads monitoring and handling the external world under real-time constraints by writing them such that they do not perform garbage collections, while other threads can perform garbage collection.

Atom-garbage collection is more complicated because atoms are shared resources. Atoms referenced from global data such as clauses and records use reference-counting, while atoms reachable from the Prolog stacks are marked during the marking phase of the atom garbage collector. With multiple threads this implies that all threads have to mark their atoms before the collector can reclaim unused atoms. Marking references atoms is executed asynchronously based on signals in Unix and SuspendThread()/ResumeThread() on Windows. While atom garbage collection is in progress, threads that wish to create atoms or change the reference count of an atom (e.g. use assert/1 of a term holding atoms) will block.

5.1.1 Atom-GC and GC Interaction

SWI-Prolog uses a sliding garbage collector [1]. During the execution of GC, it is very hard to mark atoms. Therefore during atom-GC, normal GC cannot start. As threads cannot access atoms anyhow during atom-GC, this is not likely to cause much further harm. Because atom-GC is such a harmful activity, we should avoid it being blocked by a normal GC. Therefore the system keeps track of the number of normal GC instances active. If atom-GC finds GC is in progress, it signals the atom-GC request and aborts. After the last GC dies it re-schedules atom-GC.

6 Performance Evaluation

We studied the relative performance to the single-threaded version and speedup on SMP systems.

6.1 Comparing Multi-Threaded to Single Threaded Version

We used the benchmark suite by Fernando Pereira for comparing the single-threaded to the multi-threaded version running one thread. Achieving comparable performance is important to avoid the need for two distributions, leading to much confusion in the user community. We used a 550Mhz single-CPU Crusoe machine for our tests.

On Windows-2000 compiled with MSVC 5 professional edition, the multi-threaded version performs equal within the measuring accuracy on all tests. On Linux compiled with gcc 2.95 the overall performance of the multi-threaded version is comparable to Windows-2000, but tests involving synchronisation are upto 30% slower while other tests are upto 30% faster. Compiled for single-threading the Linux version is approx. 15% faster on the same hardware. Approx. 8% thereof was attributed to the extra variable required in many functions to keep track of the current Prolog engine.

6.2 A Case Study: Speedup on SMP Systems

This section describes the results of multi-threading the Inductive Logic Programming system Aleph [9]. Aleph is an ILP system developed by Ashwin Srinivasan at the Oxford University Computing Laboratory. Inductive Logic Programming (ILP) [8] is a branch of machine learning that synthesises logic programs using other logic programs as input. The program explores a vast search-space of candidate programs using a generate-and-test approach. One of the techniques used by Aleph is called randomised local search. In this approach the system selects a number of random start-points in the search space and exploring some space around it. As the local searches are independent, this search technique is a good candidate for concurrency.

The Aleph implementation heavily uses dynamic predicates for keeping track of the domain as well as the search process. It uses a largely failure-driven control-structure. We realised concurrency by:

Isolating the work that will be carried out by each worker
We reworked the internal communication isolating a predicate taking a specification where to start and returning the best result from that start.

Decide on thread-local
Some of the dynamic predicates need to be shared as they express the domain, while other needed to be local to each searching thread. This work requires a good overview of the application.

Design a work-crew model
We designed a work-crew model where a manager created two queues, start N workers and fill the work-queue with start-locations. The workers return best clauses in the other queue. The manager collects the best result. If it finds a perfect result it kills the crew using thread_signal/2. Otherwise it waits for as many results as it has given start-locations.

6.2.1 Experimental Results and Discussion

An exploratory study was performed to study the speedup resulting from using multiple threads on an SMP machine. We realised a work-crew model implementation for randomised-local-search in Aleph. As the task is completely CPU bound we expected optimal results if the number of threads equal the number of utilised processors. The task consisted of 16 random restarts, each making 10 moves using the carcinogenesis [7] data set. This task was carried out using a work-crew of 1, 2, 4, 8 and 16 workers scheduled on an equal number of CPUs. Figure 4 shows the result, providing near optimal speedup, upto 8 processors, while using 16 processors hardly improves performance. Due to limited time and access to the hardware we have not been able to analyse the bottleneck.

Figure 4: Speedup with increasing number of CPUs defined as elapsed time using one CPU divided by elapsed time using N CPUs. The task consisted of 16 restarts of 10 moves. The values are averaged over 30 runs. The study was performed on a Sun Fire 6800 with 24 SUN UltraSPARC III 900 MHz Processors, 48 GBytes of shared memory, utilising exclusively up to 16 processors. Each processor has 2 GBytes of memory.

7 Conclusions

We have described the implementation of adding concurrency based on native preemptive threads using a shared database in an existing Prolog system. The implementation of such a system is relatively straightforward. The last beta-release of multi-threaded SWI-Prolog has been released at April 25, 2003 and will become the default release on all platforms that can support it for the following reasons.

Although we have concentrated on speedup that can be reached on SMP hardware, experience with Aleph described in Section 6.2 illustrates interesting speedups can be achieved. In the current implementation dynamic predicates, atom handling and meta-calling are the bottlenecks that limit the reachable speedup. It is assumed the use of POSIX rw-locks and/or the use of lock-free algorithms [6] can improve concurrent performance.

We believe Prolog threads based on a shared database has a bright future, especially if a standardised set of primitives becomes available.

Acknowledgements

SWI-Prolog is a Free Software project which, by nature, profits heavily from user feedback and participation. We would like to express our gratitude to Sergey Tikhonov for his courage to test and help debug early versions of the implementation. The work reported regarding the Aleph system was performed jointly with Ashwin Srinivasan and Steve Moyle at the Oxford University Computing Laboratory. We gratefully acknowledge the Oxford Supercomputing Centre for the use of their system, and in particular Fred Youhanaie for his patient guidance and support.

1 Bibliography

[1]
Karen Appleby, Mats Carlsson, Seif Haridi, and Dan Sahlin. Garbage collection for Prolog based on WAM. Communications of the ACM, 31(6):719--741, 1988.

[2]
David R. Butenhof. Programming with POSIX threads. Addison-Wesley, Reading, MA, USA, 1997.

[3]
Manuel Carro and Manuel V. Hermenegildo. Concurrency in Prolog using threads and a shared database. In International Conference on Logic Programming, pages 320--334, 1999.

[4]
P. Deransart, A. Ed-Dbali, and L. Cervoni. Prolog: The Standard. Springer-Verlag, New York, 1996.

[5]
Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler. In SIGPLAN Symposium on Compiler Construction, pages 120--126, 1982.

[6]
Maurice Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems, 15(5):745--770, November 1993.

[7]
R.D. King and A. Srinivasan. Prediction of rodent carcinogenicity bioassays from molecular structure using inductive logic programming. Environmental Health Perspectives, 104(5):1031--1040, 1996.

[8]
S. Muggleton and L. De Raedt. Inductive Logic Programming: Theory and Method. Journal of Logic Programming, 19-20:629--679, 1994.

[9]
A. Srinivasan. The Aleph Manual, 2003.

Footnotes

note-1
For a memory-efficient realisation of the pipeline model it may be desirable to suspend if the queue exceeds a certain length, waiting for the consumers to drain the queue.
note-2
Also known as critical sections