Naive Bayes

Independent and Identically Distributed examples (iid)

Similar to learning a full Bayesian network, we can also learn a Naive Bayes classifier. A Naive Bayes classifier can be represented by the following Bayesian network structure:

digraph naivebayes {
  "topic(t1)" -> "word(w1)";
  "topic(t1)" -> "word(w2)";
}

Here topic/1 is the class and word/1 are the features. The arrows point from the class to the features because we assume that the features are conditionally independent given the class. This assumption is too strong for most real-world domains, but it has proven to be a good approximation in many practical cases. Since it also allows for efficient (linear-time) inference, the Naive Bayes classifier is very popular and is used in a wide variety of applications.
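Concretely, the assumption means that the posterior over the class factorizes into the class prior and one factor per feature. This is the standard Naive Bayes formulation, written here for a document with words word_1 through word_n:

\[
P(\mathit{topic} \mid \mathit{word}_1,\dots,\mathit{word}_n) \;\propto\; P(\mathit{topic}) \prod_{i=1}^{n} P(\mathit{word}_i \mid \mathit{topic})
\]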

Example 1

Given a set of words that appear in documents, we can train a Naive Bayes classifier to predict the topic of a document. This is the standard setting in which Naive Bayes classifiers are usually introduced.

t(0.5)::topic(t1).
t(0.5)::topic(t2).

t(_)::word(w1) :- topic(t1).
t(0.0)::word(w1) :- \+topic(t1).
t(_)::word(w2) :- topic(t1).
t(0.0)::word(w2) :- \+topic(t1).

t(_)::word(w1) :- topic(t2).
t(0.0)::word(w1) :- \+topic(t2).
t(_)::word(w2) :- topic(t2).
t(0.0)::word(w2) :- \+topic(t2).

evidence(topic(t1),true).
evidence(topic(t2),false).
evidence(word(w1), true).
evidence(word(w2), false).
-----
evidence(topic(t1),false).
evidence(topic(t2),true).
evidence(word(w1), false).
evidence(word(w2), true).
-----

Using the command line interface of ProbLog, we can write the learned program directly to a file that can then be interpreted by problog to perform classification:

$ problog lfi model.pl evidence.pl -O learned_model.pl
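The file learned_model.pl contains the program with the learned probabilities filled in for the t(_) parameters. As a minimal sketch of how it could be used for classification (the word observations below are made up for illustration and are not part of the tutorial data), one can append evidence for the observed words and queries for the topics, and then run standard ProbLog inference:

% Hypothetical observations for a new document, appended to learned_model.pl.
evidence(word(w1), true).    % the document contains word w1
evidence(word(w2), false).   % the document does not contain word w2
query(topic(t1)).            % posterior probability of topic t1
query(topic(t2)).            % posterior probability of topic t2

$ problog learned_model.pl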

Example 2

In the example above, we train a classifier per topic. If we want to train just one Naive Bayes classifier in which the topics are mutually exclusive, we can express the root node as an annotated disjunction. In that case we also know that \+topic(t1) is equivalent to topic(t2).

t(0.5)::topic(t1); t(0.5)::topic(t2).

t(_)::word(w1) :- topic(t1).
t(_)::word(w1) :- topic(t2).
t(_)::word(w2) :- topic(t1).
t(_)::word(w2) :- topic(t2).

evidence(topic(t1),true).
evidence(topic(t2),false).
evidence(word(w1), true).
evidence(word(w2), false).
-----
evidence(topic(t1),false).
evidence(topic(t2),true).
evidence(word(w1), false).
evidence(word(w2), true).
-----

Example 3

Instead of writing down a rule per word (feature) and topic (class), we can write abstract rules and parameterize the tunable t(_) probabilities by adding extra terms that are variables. For example, when using t(_,A), where A is a variable, LFI will learn a different weight for every grounding of A. This only works if the variables in the tunable parameter can be grounded by ProbLog.

is_topic(t1).
is_topic(t2).
is_word(w1).
is_word(w2).

t(_,T)::topic(T).
t(_,W,T)::word(W) :- is_topic(T), is_word(W), topic(T).
t(_,W,T)::word(W) :- is_topic(T), is_word(W), \+topic(T).

evidence(topic(t1),true).
evidence(topic(t2),false).
evidence(word(w1), true).
evidence(word(w2), false).
-----
evidence(topic(t1),false).
evidence(topic(t2),true).
evidence(word(w1), false).
evidence(word(w2), true).
-----

Features with Unused Terms

Sometimes the observed features contain terms that are not used by the Naive Bayes classifier. For example, we may be given a set of word/2 atoms that represent a word together with its count in a document.

t(0.5)::topic(t1).
t(0.5)::topic(t2).

t(_)::word(w1,_) :- topic(t1).
t(0.0)::word(w1,_) :- \+topic(t1).
t(_)::word(w2,_) :- topic(t1).
t(0.0)::word(w2,_) :- \+topic(t1).

t(_)::word(w1,_) :- topic(t2).
t(0.0)::word(w1,_) :- \+topic(t2).
t(_)::word(w2,_) :- topic(t2).
t(0.0)::word(w2,_) :- \+topic(t2).

evidence(topic(t1), true).
evidence(topic(t2), false).
evidence(word(w1,2),true).
evidence(word(w2,0),false).
-----
evidence(topic(t1), false).
evidence(topic(t2), true).
evidence(word(w1,0),false).
evidence(word(w2,5),true).
-----

Alternatively, this can be represented by introducing an intermediate, latent variable word/1.

t(0.5)::topic(t1).
t(0.5)::topic(t2).

t(_)::word(w1) :- topic(t1).
t(0.0)::word(w1) :- \+topic(t1).
t(_)::word(w2) :- topic(t1).
t(0.0)::word(w2) :- \+topic(t1).

t(_)::word(w1) :- topic(t2).
t(0.0)::word(w1) :- \+topic(t2).
t(_)::word(w2) :- topic(t2).
t(0.0)::word(w2) :- \+topic(t2).

word(W,C) :- word(W).

evidence(topic(t1), true).
evidence(topic(t2), false).
evidence(word(w1,2),true).
evidence(word(w2,0),false).
-----
evidence(topic(t1), false).
evidence(topic(t2), true).
evidence(word(w1,0),false).
evidence(word(w2,5),true).
-----

Documents in a Network (non-iid)

Suppose these documents are web pages in a large linked network. Instead of one entry per document, we might then get one large example that represents a snapshot of the network. This is useful, for example, when we also want to express relations between the documents, such as “linked pages have a higher probability of being about the same topic”. Such relations between examples cannot be represented in an iid setting.

The next program extends the previous programs by adding a term that identifies the document. It is also supplied with one example instead of one example per document.

t(0.5)::topic(t1,D).
t(0.5)::topic(t2,D).

t(_)::word(w1,_,D) :- topic(t1,D).
t(0.0)::word(w1,_,D) :- \+topic(t1,D).
t(_)::word(w2,_,D) :- topic(t1,D).
t(0.0)::word(w2,_,D) :- \+topic(t1,D).

t(_)::word(w1,_,D) :- topic(t2,D).
t(0.0)::word(w1,_,D) :- \+topic(t2,D).
t(_)::word(w2,_,D) :- topic(t2,D).
t(0.0)::word(w2,_,D) :- \+topic(t2,D).

evidence(topic(t1,d1), true).
evidence(topic(t2,d1), false).
evidence(word(w1,2,d1),true).
evidence(word(w2,0,d1),false).
evidence(topic(t1,d2), false).
evidence(topic(t2,d2), true).
evidence(word(w1,0,d2),false).
evidence(word(w2,5,d2),true).

When we add the links between documents and the induced dependencies between their topics, we obtain the following network (with two topics, two documents, and two words). Note that this is no longer a Bayesian network because it contains directed cycles.

digraph naivebayes_noniid {
  "topic(t1,d1)" -> "topic(t1,d2)";
  "topic(t2,d1)" -> "topic(t2,d2)";
  "links(d1,d2)" -> "topic(t1,d2)";
  "links(d1,d2)" -> "topic(t2,d2)";
  "topic(t1,d2)" -> "topic(t1,d1)";
  "topic(t2,d2)" -> "topic(t2,d1)";
  "links(d2,d1)" -> "topic(t1,d1)";
  "links(d2,d1)" -> "topic(t2,d1)";
  "topic(t1,d1)" -> "word(w1,_,d1)";
  "topic(t1,d1)" -> "word(w2,_,d1)";
  "topic(t2,d1)" -> "word(w1,_,d1)";
  "topic(t2,d1)" -> "word(w2,_,d1)";
  "topic(t1,d2)" -> "word(w1,_,d2)";
  "topic(t1,d2)" -> "word(w2,_,d2)";
  "topic(t2,d2)" -> "word(w1,_,d2)";
  "topic(t2,d2)" -> "word(w2,_,d2)";
}

The resulting ProbLog program becomes:

t(0.5)::topic(t1,D).
t(0.5)::topic(t2,D).

t(_)::word(w1,_,D) :- topic(t1,D).
t(0.0)::word(w1,_,D) :- \+topic(t1,D).
t(_)::word(w2,_,D) :- topic(t1,D).
t(0.0)::word(w2,_,D) :- \+topic(t1,D).

t(_)::word(w1,_,D) :- topic(t2,D).
t(0.0)::word(w1,_,D) :- \+topic(t2,D).
t(_)::word(w2,_,D) :- topic(t2,D).
t(0.0)::word(w2,_,D) :- \+topic(t2,D).

links(d1,d3).
t(_)::topic(T,D1) :- links(D2,D1), topic(T,D2).

evidence(topic(t1,d1), true).
evidence(topic(t2,d1), false).
evidence(word(w1,2,d1),true).
evidence(word(w2,0,d1),false).
evidence(topic(t1,d2), false).
evidence(topic(t2,d2), true).
evidence(word(w1,0,d2),false).
evidence(word(w2,5,d2),true).
evidence(topic(t1,d3), true).
evidence(topic(t2,d3), false).
evidence(word(w1,0,d3),false).
evidence(word(w2,0,d3),false).

In this example, we added a third document that contains none of the words but is linked to from document 1, which is about the same topic. The program thus combines the presence of features (words) with the links between documents to infer the topic of a document.
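As a final sketch, the learned relational model could be used for classification in the same way as before: append the observed words and links of new documents together with queries for their topics, and run problog on the resulting program. Document d4, its incoming link, and its word counts below are hypothetical and only serve as an illustration:

% Hypothetical new document d4, linked to from d1, appended to the learned program.
links(d1,d4).
evidence(word(w1,2,d1), true).    % observed words of d1
evidence(word(w2,0,d1), false).
evidence(word(w1,0,d4), false).   % observed words of d4
evidence(word(w2,3,d4), true).
query(topic(t1,d4)).              % posterior over the topic of d4
query(topic(t2,d4)).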