Errors were found in encodings and/or checkers of three benchmarks
after the competition took place. Below we discuss these errors in
detail and their possible impact on the outcome of the competition.
The errors on this page have been identified by Broes De
Cat and Johan Wittocx.
We offer our sincere apologies for these errors.
Impact on the competition results
In all three cases, the errors in the encodings make the benchmark
problems computationally easier for the ASP solvers that use them. In
the case of Labyrinth, the Asparagus encoding solves a variant of the
benchmark that is computationally simpler than the actual benchmark. In
the case of EdgeMatching, the same can be said for the encodings of
Asparagus and of DLV. For these three benchmarks, the teams that did
not use the erroneous encodings therefore had a clear disadvantage.
These are:
- Labyrinth and LabyrinthOptimize: DLV and IDP
- EdgeMatching: bpsolver-CLP(FD), IDP and Enfragmo
The results of the competition show that in the track "Optimization
problems", to which LabyrinthOptimize belongs, the margins between the
different contenders were quite large and the error most likely had no
impact on the ranking.
The impact is possibly larger in the track "Decision problems in
NP", to which Labyrinth and EdgeMatching belong. If we drop these
two benchmarks and compare the total number of solved instances, then
Potassco still leads with a good margin, followed by the single-system
teams IDP, Claspfolio and Cmodels; here, IDP overtakes Claspfolio and
Cmodels. This does not guarantee that IDP would have been the first
single-system team in this track. Indeed, rankings were made on the
basis of weighted sums of the number of solved instances for the
individual benchmarks, and these weights depended on various parameters.
The only fully correct way to recompute rankings would be to rerun all
solvers on all instances of the reduced set of benchmarks. We have
decided against this.
Lessons for the future
In our opinion, a modeling competition like this ASP competition is
worthwhile and provides valuable information on various aspects of
modeling and solving (see the competition report).
Unfortunately, in such a competition there is an inherent risk
that bugs go unnoticed, or even occur in the programs that should check
for correctness. Based on our experience, here are some ideas that
would have led to an early detection of the above problems and could
help to reduce the risk of errors in future modeling competitions. For
this competition, several combinations of the techniques listed below
would have allowed a timely detection of the errors:
- Multiple checkers: To test the correctness of solutions, one
could use more than one checker program. This would have revealed the
problem with EdgeMatching (see the first sketch after this list).
- Visualization methods: The organizers could provide, for
each benchmark, a visualization method for benchmark solutions. This
is the way the bugs in Labyrinth and EdgeMatching were
discovered. Several good visualization programs for structures/answer
sets are available.
- Apply the "small scope hypothesis": if there are bugs in
an encoding, then probably they will show up in small domains. This
idea could be exploited in various ways.
- Exhaustive list of solutions for some test instances: The
organizers could provide, for some small problem instances, an
exhaustive list of all solutions. Since every benchmark encoding can
easily be converted into a checker, each team could verify that its
encoding accepts all of these solutions (see the second sketch after
this list).
- Exhaustive list of test instances: For the previous idea to be
effective, it is important that the set of problem instances is large
enough. ASP or FO(.) benchmark encodings can easily be converted into a
generator that, for a given input domain, generates all combinations of
problem instances and solutions within this domain. The organizers
could provide all such instance/solution combinations for a fixed small
domain.
- Number of solutions: The above method might be infeasible
because, even for a small domain, the number of problem instances and
solutions may be prohibitive. A more compact method might be to
provide the number of solutions for a given instance, or the number of
combinations of instances and solutions for a given small domain (see
the third sketch after this list).
- A test script: Alternatively, the organizers could provide a
script that generates a large number of combinations of instances and
solutions and tests whether a benchmark encoding accepts all of them.
The input of this script would be the participant's benchmark encoding
and a (small) domain or (small) problem instance (see the last sketch
after this list).
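
First, a minimal sketch of the "multiple checkers" idea, assuming each
checker is a command-line program that reads an instance file and a
candidate solution file and exits with status 0 exactly when it accepts
the solution. The checker names and the exit-code convention are
illustrative assumptions, not part of the competition infrastructure:

```python
import subprocess
import sys

# Hypothetical checker commands; replace with the real checker binaries.
CHECKERS = [["./checker_reference"], ["./checker_alternative"]]

def accepted(checker_cmd, instance, solution):
    """Run one checker on (instance, solution); exit code 0 means 'accepted'."""
    result = subprocess.run(checker_cmd + [instance, solution])
    return result.returncode == 0

def cross_check(instance, solution):
    """Report when two independent checkers disagree on the same solution."""
    verdicts = [accepted(cmd, instance, solution) for cmd in CHECKERS]
    if len(set(verdicts)) > 1:
        print(f"DISAGREEMENT on {instance} / {solution}: {verdicts}")
        return False
    return True

if __name__ == "__main__":
    cross_check(sys.argv[1], sys.argv[2])
```

Any instance on which the checkers disagree points to a bug in at least
one of them, which is exactly the kind of discrepancy that would have
exposed the EdgeMatching checker problem early.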
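Second, a sketch of the "exhaustive list of solutions" check, under the
assumption that the encoding and instance are ASP programs run with a
clingo-style solver and that a reference solution is given as a list of
ground atoms. Each solution atom is pinned with a constraint ":- not
atom.", so the combined program is satisfiable only if the encoding can
reproduce that solution; file names and atom syntax are illustrative:

```python
import subprocess
import tempfile

def encoding_accepts(encoding, instance, solution_atoms, solver=("clingo",)):
    """Check that the encoding admits an answer set containing all atoms of
    one reference solution (it checks acceptance of these atoms, not that
    nothing else is derived)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lp", delete=False) as pin:
        for atom in solution_atoms:            # e.g. "route(3,4)"
            pin.write(f":- not {atom}.\n")
        pin_file = pin.name
    # Ask for at most one model; an UNSATISFIABLE result means rejection.
    result = subprocess.run([*solver, encoding, instance, pin_file, "1"],
                            capture_output=True, text=True)
    return "UNSATISFIABLE" not in result.stdout

# Usage sketch: every solution in the published reference list must pass.
# for atoms in reference_solutions:
#     assert encoding_accepts("encoding.lp", "instance.lp", atoms)
```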
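Third, for the "number of solutions" idea, a sketch that counts the
answer sets of an encoding on a small instance and compares the count
with a number published by the organizers. It assumes clingo-style
output in which each enumerated answer set is announced by a line
starting with "Answer:":

```python
import subprocess

def count_answer_sets(encoding, instance, solver=("clingo",)):
    """Enumerate all answer sets (model limit 0) and count the
    'Answer: N' lines printed for each model."""
    result = subprocess.run([*solver, encoding, instance, "0"],
                            capture_output=True, text=True)
    return sum(1 for line in result.stdout.splitlines()
               if line.startswith("Answer:"))

# Usage sketch: expected_count is the number published by the organizers.
# assert count_answer_sets("encoding.lp", "small_instance.lp") == expected_count
```

A mismatch in the count does not pinpoint the bug, but it is a very
cheap signal that the encoding and the reference disagree.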
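Finally, a sketch of the proposed test script. It assumes the
organizers ship a generator (the hypothetical generate_pairs callable
below, which could itself enumerate the answer sets of an "open"
version of a reference encoding over the given small domain) and an
acceptance test such as the encoding_accepts function from the second
sketch; both are passed in as parameters here:

```python
def test_encoding(encoding, domain, generate_pairs, accepts, max_pairs=1000):
    """Feed up to max_pairs generated (instance, solution) pairs to the
    participant's encoding and collect every solution it rejects.

    generate_pairs(domain) yields (instance_file, solution_atoms) pairs;
    accepts(encoding, instance_file, solution_atoms) returns True/False.
    """
    failures = []
    for i, (instance_file, solution_atoms) in enumerate(generate_pairs(domain)):
        if i >= max_pairs:
            break
        if not accepts(encoding, instance_file, solution_atoms):
            failures.append((instance_file, solution_atoms))
    return failures

# Usage sketch (all names are illustrative):
# failures = test_encoding("team_encoding.lp", "domain_size_3.lp",
#                          generate_pairs, encoding_accepts)
# if failures:
#     print("Encoding rejects", len(failures), "reference solutions")
```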