Social Science Informatics (SWI),
University of Amsterdam,
Roetersstraat 15, 1018 WB Amsterdam, The Netherlands,
Area Editor: Roberto Bagnara
We have explored the consequences of adding native preemptive threads
to the originally single-threaded SWI-Prolog system. Unlike classical approaches
such as Parlog, SWI-Prolog follows recent initiatives in which
multiple completely separated Prolog engines share the same predicates.
In this article we describe multi-threading from the Prolog user's perspective
as well as from the implementation perspective. We include metrics comparing the multi-threaded
version to the single-threaded version, as well as the speedup that can be achieved
on SMP hardware.
Multi-threaded SWI-Prolog is in its late beta releases and will shortly become the default on all systems that provide the required primitives.
In today's computing world there is an important reason for concurrency: being able to talk to multiple agents in the outside world at the same time. We see programmers tackle this challenge in a variety of ways. The two most popular are embedding a Prolog engine in a concurrent programming environment (C, C++, Java, etc.), which serialises requests to the Prolog engine, and event-driven programming. Neither is very attractive. Embedding is hard to program, requires skills in at least two languages, and the resulting program is often hard to debug and maintain. Event-driven programming, which realises multiple state machines to handle events from multiple input sources, is a non-trivial task at best.
A much cleaner solution can be realised using multiple Prolog threads, especially if native preemptive threading is used, avoiding complications with blocking system calls and exploiting SMP hardware if available.
Initially, most research aimed at running multiple provers sharing all Prolog data. This approach is difficult to implement, and the role of backtracking is complicated at best. Recently a number of systems (CIAO, Qu-Prolog, Jinni) have adopted threads that share only the Prolog database. As access to the database is generally structured and localised in the implementation, such systems are relatively simple to build, while programming them is very close to programming a classical single-threaded Prolog system.
In this article we describe our API choices and our experience implementing native preemptive threads in SWI-Prolog. In Section 2 we summarise our rationale. Next we describe what constitutes a thread and which primitives are used to make threads communicate and synchronise. Section 5 describes implementation experience, followed by a performance analysis in Section 6, open issues and conclusions.
We have chosen the POSIX thread (pthread) library as a basis for its clean design and its availability on many platforms. On Windows we use a mixture of pthread-win32 and the native Win32 thread API.
A Prolog thread is a native (POSIX) thread with a Prolog engine capable of proving a goal. A Prolog engine consists of the runtime stacks and the state required for proving goals independently. In this view a Prolog term is entirely local to an engine and can only be communicated to other engines through the shared (code) area. Threads share all other global data, such as atoms, predicates, modules, records and streams.
The new Prolog engine runs completely independently. If the thread is attached, any thread can wait for its completion using thread_join/2; otherwise all its resources are reclaimed silently on completion.
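As a minimal sketch (the predicate names run_and_join/2 and fire_and_forget/1 are ours, not from the paper), creating and joining a thread looks as follows. Note that, because terms are local to an engine, bindings made by the goal are not communicated back to the caller:

```prolog
% Run Goal in a fresh Prolog engine and wait for its completion.
% Status is true, false or exception(E); bindings made inside Goal
% stay local to the new engine.
run_and_join(Goal, Status) :-
    thread_create(Goal, Id, []),        % attached (joinable) thread
    thread_join(Id, Status).

% A detached thread instead reclaims its resources silently.
fire_and_forget(Goal) :-
    thread_create(Goal, _, [detached(true)]).
```

For example, ?- run_and_join(member(X, [a,b]), S) binds S to true, while X remains unbound in the calling thread.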
The most difficult aspect of multi-threaded programming is the need to synchronise the concurrently executing threads: to ensure they use proper protocols to exchange data and maintain the consistency of shared data maintained in dynamic predicates.
Our primary synchronisation primitive is a FIFO (first-in-first-out) queue of Prolog terms, an approach found in other programming languages under the name port if one queue is attached to each thread, or channel if a queue is a stand-alone object. SWI-Prolog supports both ports and channels. Channels were added because they simplify the implementation of the popular worker-crew model significantly, as illustrated in Section 4. Queues are manipulated through a small set of predicates for creating queues and for sending and receiving terms.
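In SWI-Prolog the queue API consists of message_queue_create/1, thread_send_message/2 and thread_get_message/1,2. A channel-based worker crew can be sketched as follows (the job-handling predicates handle/1 and submit/2 are illustrative):

```prolog
% Create a stand-alone queue (channel) served by N workers.
start_crew(N, Queue) :-
    message_queue_create(Queue),
    forall(between(1, N, _),
           thread_create(worker(Queue), _, [detached(true)])).

% Each worker blocks in thread_get_message/2 until a term arrives.
worker(Queue) :-
    thread_get_message(Queue, Job),
    handle(Job),                      % illustrative job handler
    worker(Queue).

% Any thread may add work to the channel.
submit(Queue, Job) :-
    thread_send_message(Queue, Job).
```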
To make consistent changes to shared data or to execute code not designed for threading, SWI-Prolog provides mutex objects to serialise access to a code fragment. The principal API is with_mutex/2, described below. Wrapping mutex ownership in a goal allows the mutex to be unlocked on failure or exception without user concern. This is vital, as a mutex held incorrectly by a thread easily deadlocks the application. For convenience, and to simplify adding thread support to existing applications, SWI-Prolog mutex objects are recursive. To facilitate code that runs in both the single- and the multi-threaded version, the single-threaded version provides a with_mutex/2 that simply calls once/1.
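A typical use of with_mutex/2 is making a read-modify-write sequence on a dynamic predicate atomic. In the sketch below the mutex name counter_mutex and the predicate counter/1 are illustrative (SWI-Prolog creates a mutex on first use of its name):

```prolog
:- dynamic counter/1.
counter(0).

% Without the mutex, two threads could both retract counter(0) and
% one increment would be lost.
increment(New) :-
    with_mutex(counter_mutex,
               (   retract(counter(Old)),
                   New is Old + 1,
                   assert(counter(New))
               )).
```

Because with_mutex/2 unlocks the mutex on failure or exception, a failing update cannot leave the application deadlocked.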
Debugging interaction as well as manager threads are simplified by the last communication primitive, thread_signal/2, which is similar to Qu-Prolog's thread_push_goal/2. Figure 1 shows this predicate in action, starting the command line tracer in another thread, while Figure 2 shows it interacting with a worker of a work-crew.
Figure 1: Attach a console and start the debugger in another thread.
catch(do_work(Work), stop, cleanup),
Figure 2: Stopping a worker using thread_signal/2. Bold fragments show the relevant parts of the code.
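Following the pattern of Figure 2, a manager can inject an exception into a worker at a safe point; the predicate names worker_loop/1 and stop_worker/1 are ours:

```prolog
% The worker wraps its work in catch/3, so an injected 'stop'
% exception leads to orderly cleanup instead of killing the thread.
worker_loop(Queue) :-
    thread_get_message(Queue, Work),
    catch(do_work(Work), stop, cleanup),
    worker_loop(Queue).

% The manager asks the worker to raise the exception.
stop_worker(Worker) :-
    thread_signal(Worker, throw(stop)).
```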
All predicates, both static and dynamic, are shared between all threads. Changes to static predicates only affect the test-edit-reload cycle, which is discussed in Section 5. For dynamic predicates we kept the `logical update semantics' as defined by the ISO standard. This implies that a goal uses the predicate with the clause set as found when the goal was started, regardless of whether clauses are asserted or retracted by the calling thread or by another thread. The implementation ensures consistency of the predicate as seen from Prolog's perspective. Consistency as required by the application, such as clause order and consistency with other dynamic predicates, must be ensured using the synchronisation discussed in Section 3.2.
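The consequence of logical update semantics can be illustrated even within a single thread (the predicate p/1 is illustrative): a clause asserted while a goal enumerates the predicate is not seen by that running goal.

```prolog
:- dynamic p/1.
p(1).
p(2).

% The assert/1 below does not extend the solutions of the running
% p/1 enumeration: Xs unifies with [1,2] even though p(99) is added.
% A later, fresh call to p/1 does see the new clause.
demo(Xs) :-
    findall(X,
            ( p(X),
              ( X == 1 -> assert(p(99)) ; true )
            ),
            Xs).
```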
We introduced the notion of thread-local predicates to deal with code that uses dynamic predicates to store intermediate results of a computation. Purists will argue this is almost invariably bad programming style, but it is commonly found in real-life code written in the Prolog community. Without a new construct, such code must either be rewritten to store its data in Prolog terms, which may require a complete redesign; or the assert-, retract- and query-code must carry an extra argument identifying the calling thread; or the code must be fully serialised (i.e. only one thread may run it at any time). Rewriting is often infeasible, an extra argument can seriously harm performance and makes it hard to maintain a single- and a multi-threaded version in parallel, while serialisation defeats the whole point of multi-threading. The declaration below is similar to declaring foo/2 and bar/1 as dynamic and volatile.
:- thread_local foo/2, bar/1.
Thread-local predicates have the following properties: each thread operates on its own clause set, a new thread starts with no clauses for the predicate, and all its clauses are reclaimed automatically when the thread terminates.
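A sketch of its use (the predicates seen/1 and count_distinct/2 are illustrative): each thread can use seen/1 as scratch storage without interfering with other threads.

```prolog
:- thread_local seen/1.

% Count the distinct elements of List using a thread-local table.
% Every thread has its own seen/1 clause set, so several threads may
% run this predicate concurrently without any synchronisation.
count_distinct(List, N) :-
    retractall(seen(_)),
    forall(member(X, List),
           ( seen(X) -> true ; assert(seen(X)) )),
    aggregate_all(count, seen(_), N).
```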
This section describes the implementation of a simple network service. The SWI-Prolog socket predicates are described at http://www.swi-prolog.org/packages/clib.html. Our service handles a single TCP/IP request per connection, using a specified number of `worker threads' and a single `accept thread'. The accept thread executes acceptor/2, accepting connection requests and adding them to the queue for the workers. The workers execute worker/1: get an accepted socket from the queue, read the request, execute process/2 to compute a reply and write it to the output stream. After this, the worker returns to the queue for the next request.
make_server(Port, Workers) :-
	create_socket(Port, Socket),
	message_queue_create(Queue),
	forall(between(1, Workers, _),
	       thread_create(worker(Queue), _, [])),
	thread_create(acceptor(Socket, Queue), _, []).

create_socket(Port, Socket) :-
	tcp_socket(Socket),
	tcp_bind(Socket, Port),
	tcp_listen(Socket, 5).

acceptor(Socket, Queue) :-
	tcp_accept(Socket, Client, _Peer),
	thread_send_message(Queue, Client),
	acceptor(Socket, Queue).

worker(Queue) :-
	thread_get_message(Queue, Client),
	tcp_open_socket(Client, In, Out),
	read(In, Command),
	close(In),
	process(Command, Out),
	close(Out),
	worker(Queue).

process(hello, Out) :-
	format(Out, 'Hello world!~n', []).
Figure 3: Implementation of a multi-threaded server. Threading primitives are set in bold. The left column builds the server; the top-right runs the acceptor thread, while the bottom-right contains the code for a worker of the crew.
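Assuming the code of Figure 3 is loaded, the server can be exercised with a small client. We use tcp_connect/3 in its convenience form and read_line_to_string/2 from library(readutil); the predicate name ask/2 is ours:

```prolog
:- use_module(library(socket)).
:- use_module(library(readutil)).

% Connect to the server, send the term hello (the server reads the
% request with read/1, hence the terminating full stop) and fetch
% the one-line reply written by process/2.
ask(Port, Reply) :-
    tcp_connect(localhost:Port, Stream, []),
    format(Stream, 'hello.~n', []),
    flush_output(Stream),
    read_line_to_string(Stream, Reply),
    close(Stream).
```

For example, after ?- make_server(4000, 5), the query ?- ask(4000, R) should bind R to the greeting produced by process/2.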
The advantages of this implementation over a traditional single-threaded Prolog implementation are evident. Our server exploits SMP hardware and shows much more predictable response times, especially if there is a large spread in the time required by process/2. In addition, we can easily extend it with monitoring components. For example, acceptor/2 could immediately respond with an estimated reply time, or commands could be provided to examine and control the activity of the workers. All this can be achieved without complicating the implementation of process/2 with code making it aware of its environment.
We tried to minimise the changes required to turn the single-engine, single-threaded SWI-Prolog system into a multi-threaded version. As explained by Butenhof, code for multi-threaded use should be designed starting from the shared data. Changing an existing implementation is inevitably a compromise.
For the first implementation we split all global data into three sets: data that is initialised when Prolog is initialised and never changes afterwards; data that belongs to shared data structures such as atoms, predicates and modules; and finally data that is used only by a single engine, such as the stacks and the state of the virtual machine. For each of these we defined a large nested C structure. In the single-threaded version there is a single globally defined instance of all three; in the multi-threaded version the local structure is dynamically allocated and associated with the thread using pthread_setspecific().
Updates to shared data were split into a number of independent subsets, each protected by its own mutex object.
In the second phase, fetching the current engine using pthread_getspecific() was reduced by caching this information inside functions that use it multiple times and by passing it as an extra variable to commonly used small functions, as identified using the gprof profiling tool. Mutex contention was analysed and reduced in a number of critical places.
Normal (stack) garbage collection is not affected by threading and proceeds completely concurrently, as it only refers to the stacks. This is an important observation, as it allows threads that monitor and handle the external world under real-time constraints to be written such that they do not perform garbage collection, while other threads can collect garbage freely.
Atom garbage collection is more complicated because atoms are shared resources. Atoms referenced from global data such as clauses and records use reference counting, while atoms reachable from the Prolog stacks are marked during the marking phase of the atom garbage collector. With multiple threads this implies that all threads have to mark their atoms before the collector can reclaim unused ones. Marking referenced atoms is executed asynchronously, based on signals on Unix and SuspendThread()/ResumeThread() on Windows. While atom garbage collection is in progress, threads that wish to create atoms or change the reference count of an atom (e.g. assert/1 of a term holding atoms) will block.
SWI-Prolog uses a sliding garbage collector. During the execution of GC it is very hard to mark atoms; therefore normal GC cannot start during atom-GC. As threads cannot access atoms during atom-GC anyhow, this is not likely to cause much further harm. Because atom-GC is such a disruptive activity, we should avoid it being blocked by a normal GC. The system therefore keeps track of the number of normal GC instances active. If atom-GC finds a GC in progress, it signals an atom-GC request and aborts; after the last GC completes, the atom-GC is re-scheduled.
We studied the relative performance to the single-threaded version and speedup on SMP systems.
We used the benchmark suite by Fernando Pereira to compare the single-threaded version with the multi-threaded version running one thread. Achieving comparable performance is important to avoid the need for two distributions, which would cause much confusion in the user community. We used a 550 MHz single-CPU Crusoe machine for our tests.
On Windows 2000, compiled with MSVC 5 professional edition, the multi-threaded version performs equally within the measuring accuracy on all tests. On Linux, compiled with gcc 2.95, the overall performance of the multi-threaded version is comparable to Windows 2000, but tests involving synchronisation are up to 30% slower while other tests are up to 30% faster. Compiled for single-threading, the Linux version is approximately 15% faster on the same hardware; approximately 8% of this was attributed to the extra variable required in many functions to keep track of the current Prolog engine.
This section describes the results of multi-threading the Inductive Logic Programming system Aleph. Aleph is an ILP system developed by Ashwin Srinivasan at the Oxford University Computing Laboratory. Inductive Logic Programming (ILP) is a branch of machine learning that synthesises logic programs using other logic programs as input. The program explores a vast search space of candidate programs using a generate-and-test approach. One of the techniques used by Aleph is randomised local search. In this approach the system selects a number of random start points in the search space and explores some of the space around each of them. As the local searches are independent, this technique is a good candidate for concurrency.
The Aleph implementation makes heavy use of dynamic predicates for keeping track of the domain as well as of the search process, and uses a largely failure-driven control structure. We realised concurrency chiefly by declaring the dynamic predicates that hold intermediate search state thread-local, as introduced above, and distributing the independent local searches over a crew of worker threads.
An exploratory study was performed to measure the speedup from using multiple threads on an SMP machine. We realised a work-crew implementation of randomised local search in Aleph. As the task is completely CPU-bound, we expected optimal results if the number of threads equals the number of utilised processors. The task consisted of 16 random restarts, each making 10 moves, using the carcinogenesis data set. The task was carried out with work-crews of 1, 2, 4, 8 and 16 workers scheduled on an equal number of CPUs. Figure 4 shows the result: near-optimal speedup up to 8 processors, while using 16 processors hardly improves performance. Due to limited time and access to the hardware we have not been able to analyse the bottleneck.
Figure 4: Speedup with increasing number of CPUs, defined as elapsed time using one CPU divided by elapsed time using N CPUs. The task consisted of 16 restarts of 10 moves. The values are averaged over 30 runs. The study was performed on a Sun Fire 6800 with 24 SUN UltraSPARC III 900 MHz processors and 48 GBytes of shared memory, utilising exclusively up to 16 processors. Each processor has 2 GBytes of memory.
We have described the addition of concurrency, based on native preemptive threads sharing the database, to an existing Prolog system. The implementation of such a system is relatively straightforward. The last beta of multi-threaded SWI-Prolog was released on April 25, 2003, and it will become the default release on all platforms that can support it.
Although we have concentrated on the speedup that can be reached on SMP hardware, the experience with Aleph described in Section 6.2 illustrates that interesting speedups can be achieved. In the current implementation, dynamic predicates, atom handling and meta-calling are the bottlenecks that limit the reachable speedup. We expect that the use of POSIX rw-locks and/or lock-free algorithms can improve concurrent performance.
We believe Prolog threads based on a shared database have a bright future, especially if a standardised set of primitives becomes available.
SWI-Prolog is a Free Software project which, by nature, profits heavily from user feedback and participation. We would like to express our gratitude to Sergey Tikhonov for his courage to test and help debug early versions of the implementation. The work reported regarding the Aleph system was performed jointly with Ashwin Srinivasan and Steve Moyle at the Oxford University Computing Laboratory. We gratefully acknowledge the Oxford Supercomputing Centre for the use of their system, and in particular Fred Youhanaie for his patient guidance and support.