Clustering With User Feedback

This post is based on the following publications:

Software is available in the following toolboxes: COBRAS and nCOBRAS

Data analysts often try to make sense of data by clustering it. This means they try to identify groups of objects that are somehow very similar and “belong together”. For instance, this is what biologists have done for ages when they categorize animals and plants: they identify, say, felines, as a set of animals with certain characteristics, which can be further subdivided into lions, tigers, leopards, etc. Similarly, one could cluster movies according to their genre: western, science fiction, etc. This is a task that service providers such as Netflix are interested in — understanding clusters helps them make better recommendations to customers!

But objects can often be clustered in different ways. Movies could be considered “similar” when they fit in the same genre, but also, when they are from the same director, share important actors, etc. So when we say “similar” objects should be clustered together, we first need to be precise about what “similar” means. Once that is done, a computer program can cluster the data for us. But how do you explain to a computer what you consider similar?

Traditionally, the user needs to provide a mathematical formula that expresses similarity. But it is often very hard for users to provide such a formula or algorithm. And that’s not the only problem: even for a fixed similarity measure, different clustering algorithms may return different clusterings. Moreover, clustering algorithms have so-called hyperparameters, which the user can set and which change the algorithm’s behavior. So, to obtain a good clustering, a user needs to get at least three things right: the similarity measure, the clustering algorithm, and its hyperparameter settings. Often, this requires a lot of trial and error and a good understanding of how clustering works.
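
To see why the similarity measure alone already matters, here is a minimal sketch with hypothetical toy data: which object is “nearest” to a given point changes depending on whether we measure similarity with Euclidean distance or with cosine distance.

```python
import math

# Euclidean distance: how far apart two points are.
def euclidean(p, q):
    return math.dist(p, q)

# Cosine distance: how different the *directions* of two points are.
def cosine_distance(p, q):
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.hypot(*p) * math.hypot(*q)
    return 1 - dot / norm

a = (1.0, 1.0)
candidates = {"b": (2.0, 2.0), "c": (1.5, 0.2)}

# The same data, two notions of "similar", two different answers.
nearest_euclid = min(candidates, key=lambda k: euclidean(a, candidates[k]))
nearest_cosine = min(candidates, key=lambda k: cosine_distance(a, candidates[k]))
print(nearest_euclid, nearest_cosine)  # c b
```

Under Euclidean distance the closest object to `a` is `c`, but under cosine distance it is `b` (which points in exactly the same direction as `a`). A clustering algorithm fed one measure or the other will group the data differently.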

Our research focuses on how to make it easier for users to get the clusters they want. We achieve this through active, semi-supervised, robust clustering.

Unsupervised clustering: given an algorithm, distance metric and hyperparameters, a clustering is produced.

Semi-supervised learning

To avoid having to go through the process of fine-tuning the distance metric, algorithm and hyperparameters, some clustering algorithms allow the user to simply provide some example pairs of objects that belong in the same cluster and some pairs of objects that belong in different clusters.

In the literature, these example pairs are often referred to as constraints: a must-link constraint between objects reflects that the two involved objects must be in the same cluster, a cannot-link constraint reflects that the two involved objects must be in different clusters.
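
In code, such constraints can be represented simply as pairs of objects. The following sketch (with hypothetical example data) shows how one can check whether a given clustering, represented as a mapping from objects to cluster labels, satisfies a set of must-link and cannot-link constraints:

```python
# Pairwise constraints: must-links share a cluster, cannot-links do not.
must_link = [("lion", "tiger")]        # both are felines
cannot_link = [("lion", "western")]    # an animal and a movie genre

def satisfies(clustering, must_link, cannot_link):
    """Check whether a clustering (object -> cluster label) obeys all constraints."""
    ok_ml = all(clustering[a] == clustering[b] for a, b in must_link)
    ok_cl = all(clustering[a] != clustering[b] for a, b in cannot_link)
    return ok_ml and ok_cl

clustering = {"lion": 0, "tiger": 0, "western": 1}
print(satisfies(clustering, must_link, cannot_link))  # True
```

Semi-supervised clustering algorithms use such constraints to steer the clustering process toward a result the user has in mind, without the user ever having to write down a similarity formula.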

Active learning

Not all examples a user provides are equally informative. Some of the provided examples might be useless for the algorithm. To minimize the effort the user needs to spend labelling, it is better to let the algorithm decide which pairs of instances the user should label. This way, the algorithm can query the most informative object pairs first and produce a high-quality clustering with little effort from the user. Concretely, the algorithm will ask the user: “Should object x and object y be in the same cluster or not?”. Answering with “yes” results in a must-link constraint between the objects, while answering with “no” results in a cannot-link constraint. This setting is called active semi-supervised clustering.
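
The query loop described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the pair selection here just processes a given list in order, whereas a real active clustering algorithm chooses the most informative pair at each step.

```python
def active_query_loop(pairs_to_query, answer_fn):
    """Ask the user about each pair; 'yes' yields a must-link, 'no' a cannot-link."""
    must_link, cannot_link = [], []
    for x, y in pairs_to_query:
        # In an interactive setting this is the question
        # "Should object x and object y be in the same cluster or not?"
        if answer_fn(x, y):
            must_link.append((x, y))
        else:
            cannot_link.append((x, y))
    return must_link, cannot_link

# A toy oracle standing in for the user: movies of the same genre belong together.
genre = {"alien": "scifi", "star_wars": "scifi", "django": "western"}
answer = lambda x, y: genre[x] == genre[y]

ml, cl = active_query_loop([("alien", "star_wars"), ("alien", "django")], answer)
print(ml, cl)  # [('alien', 'star_wars')] [('alien', 'django')]
```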

Active semi-supervised clustering: the user answers queries to produce a high-quality clustering.

COBRAS is an active semi-supervised clustering algorithm developed at DTAI. It is built on the observation that, to be practical, an active semi-supervised clustering algorithm needs to satisfy three requirements: it needs to be query-efficient, i.e. it should produce high-quality clusterings with a small number of queries; it should be time-efficient, i.e. the time between two queries should be short; and it should be an anytime algorithm, i.e. at any point in the clustering process the user should be able to inspect the current clustering and decide that the result is good enough. COBRAS satisfies all three requirements.

For more on how COBRAS can be used to cluster time series, see Time Series Clustering.

User mistakes in active semi-supervised clustering

Asking the user questions is a good way to gather informative constraints. However, the user might also make mistakes when answering these queries. Most semi-supervised clustering algorithms assume that all answers are correct; in these algorithms, a single user mistake can have a detrimental impact on the quality of the produced clustering.

COBRAS is one of these algorithms: it is very sensitive to user mistakes because it assumes that all constraints are correct, so any mistake can result in a significant decrease in clustering quality. To solve this problem, researchers at DTAI developed a novel strategy to detect and correct the user’s mistakes. The strategy actively asks additional questions that introduce redundancy in the known constraints. This redundancy allows the algorithm to reason about the correctness of the constraints. If a redundant constraint agrees with previously obtained constraints, this increases the belief that those constraints are correct. Conversely, if a redundant constraint contradicts any of the previously obtained constraints, this signals the presence of a noisy constraint. By reasoning probabilistically, the approach can correct the user’s mistakes with a high degree of certainty.
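
To illustrate how redundancy exposes mistakes, here is a simplified, hypothetical sketch (not nCOBRAS’s actual probabilistic reasoning): must-link constraints are transitive, so a cannot-link between two objects that are connected by a chain of must-links contradicts the other constraints and flags a likely error.

```python
def find_contradictions(must_link, cannot_link):
    """Return cannot-links that contradict the (transitive) must-link constraints."""
    # Union-find over the must-link components.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in must_link:
        parent[find(a)] = find(b)

    # A cannot-link inside a single must-link component is inconsistent.
    return [(a, b) for a, b in cannot_link if find(a) == find(b)]

ml = [("a", "b"), ("b", "c")]      # together these imply a and c must link
cl = [("a", "c")]                  # ... which this cannot-link contradicts
print(find_contradictions(ml, cl))  # [('a', 'c')]
```

Detecting such a contradiction tells the algorithm that *some* involved constraint is wrong, but not which one; that is where the probabilistic reasoning described above comes in.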

nCOBRAS is a modification of COBRAS where this noise correcting mechanism is added. This mechanism makes COBRAS significantly more robust to mistakes from the user at the cost of a few additional redundant constraints. Extensive experiments show that, in the presence of noise, nCOBRAS produces on average higher quality clusterings than COBRAS for the same number of queries.

Jonas Soenen
PhD student

My research interests include active and semi-supervised learning applied to usually unsupervised problems such as clustering and anomaly detection.

Wannes Meert
Research Manager

Applications of Artificial Intelligence and Machine Learning.

Hendrik Blockeel
Full professor

Theory and algorithms for machine learning and data mining in general.
