# A Bayesian classifier in Clojure. Part 1

## Intro

Sometimes you need to classify incoming data. Consider yourself lucky if this can be done deterministically. In most interesting cases you have to use machine learning for that. There has been prominent progress in this field lately: you might’ve already heard about deep learning, ImageNet, and friends. On the other hand, not all classification problems are related to images. Not every problem needs to be solved with a mysterious artificial neural network, be it shallow or deep. There are plenty of “old-school” machine learning methods applicable to real-world problems.

While working for Shuttlerock, I had to solve one such problem. Here’s the story behind it.

Imagine that you’re a fan of sports and there’s a local stadium in your city where sports, well, take place. People come there, practice, watch others and, most importantly, post to Instagram! You want to create a live wall of Instagram posts related to the stadium, for that would be very motivational and just fun. Since there’s no default hashtag for the things happening at the stadium, you’re better off using location search. So you fire up Google Maps, find out the exact coordinates of the stadium, and use location endpoints to fetch fresh stuff.

Then you implement the refresh algorithm, open the whole thing on a wide screen Friday evening and sip a cold beer like every other sports fan does. Everything is alright, and you are going to boast about it to your friends and neighbors, but… wait, what, do… DOGS? What the hell?! YOU HATE DOGS! Well, apparently some big dog lover lives near the stadium and posts his photos with almost the same coordinates. It’s not like he even brings his dogs onto the stadium!

Ok, you add a rule to remove his Instagram account from your show and try to forget the whole thing. Well, dogs were mean to you in your childhood and you hate them. Crisis averted!

…Eh, what? It seems that the guy has a whole network of friends who come to his house and post dogs to Instagram! And these dog people seem to make social connections at an enormous rate! What do they do, sniff at each other’s noses?! It’s a new Instagram account every day!

So, you can’t ditch location-based search and can’t refine it (the Instagram API allows you to specify a range, but GPS devices don’t give exact coordinates, so you can’t reduce this range down to, like, 1 meter). On the other hand, dogs must be filtered out.

## Classification

What we have here is a classification task, similar to spam detection. In our setting sports are “ham”, and everything dog-related is “spam”. If a cat fan appears later, it will be “spam” too. How could we establish the difference to be able to automatically assign a class to incoming posts?

First let’s settle on what we are looking at. Classifying images by their contents (i.e., pixels) is hard, but luckily most Instagram posts contain hashtags which we can use as an input (“features” in machine learning speak).

The intuition here is that a post with the following hashtags:

#doggie #Pluto #mylovelydog #Scotch #terrier

will most certainly contain a dog image, while a post with

#stadium #sweat #game #soccer

is expected to be sports-related.

Now, we are not going to create whitelists and blacklists of words. The machine learning approach is different. Instead of coming up with rules, we prepare a training set which contains a number of “posts” (in fact, we only need the hashtag portion, so a post can be trimmed down to a set of hashtag strings) with a boolean flag attached to each one (e.g., true marks a post as ham, and false — as spam). Then we let the classifier “learn”, i.e., somehow figure out the difference. After learning is done, the classifier, when presented with a post of unknown class, will tell us whether it’s spam or ham. Depending on the approach used, sometimes the probabilities (e.g., spam 86%, ham 14%) are also reported.

## Counting words

The task is going to be separated into two subtasks:

1. Single-word reasoning, i.e., what can we say about a post containing some hashtag, like #soccer?
2. Combining individual hashtag scores. If it’s #soccer #match, how about that?

Let’s start with the former. Basically, we’d want to calculate the probability of a post being spam given that there’s some #hashtag in it.

And here comes the Bayesian approach. There’s a well-known Bayes’ formula that can give us exactly what we want:

$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$

Cryptic, huh? Let me describe what it means if you’re new to the field. It says something about probabilities. $A$ and $B$ are events. $P(A)$ is the probability of event $A$ taking place, and $P(B)$ is related to $B$ in the same fashion. $P(A\mid B)$ is a notion of conditional probability, i.e., the probability of event $A$ in case event $B$ takes place.

Let’s use the formula in our setting:

$$P(post~is~spam\mid hashtag) = \frac{P(hashtag\mid post~is~spam)\,P(post~is~spam)}{P(hashtag)}$$

Wow, we’ve just expressed an unknown probability in terms of three other unknown probabilities! Sounds like progress!

But, in fact, these three can be calculated. What’s $P(hashtag\mid post~is~spam)$? Remember, we have a training set which consists of some spam and some ham posts. What if we just go over all spam posts and count how many of them contain our #hashtag? Let’s denote the count as $N_{spam}(hashtag)$. Dividing by $N_{spam}$, which is the total number of spam posts in the training set, brings us the desired value:

$$P(hashtag\mid post~is~spam) = \frac{N_{spam}(hashtag)}{N_{spam}}$$
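For example, with made-up numbers: if the training set has $N_{spam} = 200$ spam posts and 50 of them contain #dog, the estimate is simply the fraction of spam posts containing the hashtag:

$$P(\#dog\mid post~is~spam) = \frac{N_{spam}(\#dog)}{N_{spam}} = \frac{50}{200} = 0.25$$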

Now let’s attack $P(hashtag)$. Using the formula of full probability, we can express it in this way:

$$P(hashtag) = P(hashtag\mid post~is~spam)\,P(post~is~spam) + P(hashtag\mid post~is~ham)\,P(post~is~ham)$$

Calculating the conditional probability for ham in a similar fashion, we finally come to the following:

$$P(post~is~spam\mid hashtag) = \frac{\frac{N_{spam}(hashtag)}{N_{spam}}\,P(post~is~spam)}{\frac{N_{spam}(hashtag)}{N_{spam}}\,P(post~is~spam) + \frac{N_{ham}(hashtag)}{N_{ham}}\,P(post~is~ham)}$$

Basically, only two unknown probabilities are left here: $P(post~is~spam)$ and $P(post~is~ham)$. They are called prior probabilities, which means that they are not related to our post in question, but rather reflect the total proportion of spam and ham out there. And we’d better not fixate on any specific proportion because it’s really hard to predict. In particular, the spam/ham ratio in the training set is a very bad estimate, because we don’t know how it was collected. Therefore the safe way to go is to assume both priors to be equal to 0.5. The formula gets even simpler:

$$p = P(post~is~spam\mid hashtag) = \frac{\frac{N_{spam}(hashtag)}{N_{spam}}}{\frac{N_{spam}(hashtag)}{N_{spam}} + \frac{N_{ham}(hashtag)}{N_{ham}}}$$

What we just did was estimate the probability that a post containing #hashtag is spam. We’ll denote it simply as $p$.
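To make this concrete with made-up counts: suppose #dog appears in 40 of 100 spam posts and in 2 of 100 ham posts in the training set. Then

$$p = \frac{40/100}{40/100 + 2/100} = \frac{0.40}{0.42} \approx 0.95,$$

so under equal priors #dog would be a strong (though not certain) spam marker.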

## Some Gory Details about $p$

There’s more to consider about $p$. First, what if #hashtag was never seen before? The formula gives us $0/0$, which is clearly bad. In this case, we should resort to some predefined probability, $p_{unknown}$. Per research, the best value here is again 0.5.

Another problem is that the formula for $p$ doesn’t take the total number of hashtag occurrences into consideration. Only proportion is used. Let’s see why it might be important.

Let’s imagine that some random hashtag, #evening, appeared in one of the spam posts in our training set. Apparently, it’s not relevant because you can do both things in the evening: play sports (ham) and mess with dogs (spam). However, $p$ would be equal to 1! Let’s take another hashtag, #dog. Say it appeared in 98 spam posts and 2 ham posts (out of 100 of each); here $p=0.98$. A random word beats a real spam marker, which occurred in most of the spam posts! The whole system is way too sensitive to rare hashtags. We need a way to make it more robust.

Luckily, there’s a solution by Gary Robinson, presented in his seminal post “Spam Detection” and later in the Linux Journal article “A Statistical Approach to the Spam Problem”. I won’t go into much detail, but rather describe the logic in its essence. Follow the links if you’re interested.

So, basically, if a word is completely missing from our dataset, we assign the spam probability to 0.5. If the word occurred once in the spam corpus and never in the ham one, the probability should be greater than 0.5, but not by much, because it’s, well, only one occurrence; it might be random, as discussed earlier. However, if the word occurred 100 times in a spam corpus of size 100, this ought to be serious and the probability had better be close to 1.

This intuition is captured in the following formula suggested by Gary:

$$f(w) = \frac{s \cdot p_{unknown} + N(w) \cdot p(w)}{s + N(w)}$$

where:

- $w$ is the word (hashtag) in question;
- $p(w)$ is the same $p$ we already know how to calculate (see the previous formula);
- $s$ is the strength we want to give to our background information ($s\le1$);
- $N(w)$ is the number of posts in the training set that contain the word $w$ (both spam and ham): $N(w)=N_{spam}(w)+N_{ham}(w)$.

$f(w)$ is called the degree of belief. It can easily be seen that for small $N(w)$, $f(w)$ is close to $p_{unknown}$, while higher $N(w)$ values result in $f(w)$ approaching $p(w)$. This value, $f(w)$, is not a real probability; it’s our best guess at it, better than $p(w)$ (which is, by the way, an estimate too!). So we will be using $f(w)$ instead of $p(w)$ from now on.
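To see why this helps, plug $s = 1$ and $p_{unknown} = 0.5$ into $f(w) = \frac{s \cdot p_{unknown} + N(w) \cdot p(w)}{s + N(w)}$ for the earlier made-up examples: the one-off #evening has $p = 1$ but $N = 1$, while #dog has $p = 0.98$ with $N = 100$:

$$f(\#evening) = \frac{1 \cdot 0.5 + 1 \cdot 1}{1 + 1} = 0.75, \qquad f(\#dog) = \frac{1 \cdot 0.5 + 100 \cdot 0.98}{1 + 100} \approx 0.975$$

The rare random word is pulled back towards 0.5, while the genuine spam marker keeps its high score.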

## Combining probabilities

So far we were able to calculate the probability of a post being spam given some hashtag (denoted by $w$). But let’s remember that there might be multiple hashtags in one post (we’ll name them $w_1, w_2, \ldots, w_n$). Accordingly, there will be $n$ probabilities: $p_1, p_2, \ldots, p_n$ ($p_i=f(w_i)$). What do we do about them? Like, add up, or multiply?

### The Naive Bayes Approach

This method of combining separate probabilities is based on the premise that the occurrences of $w_1, w_2, \ldots, w_n$ are independent events: words are not related to each other. That is, having #dog #terrier in a post is a pure coincidence. The #terrier hashtag does not increase the probability of also seeing a #dog hashtag. As you can conclude, that’s not exactly true in real-world situations. The classifier is called naive precisely because of that.

Nevertheless, it’s still a viable approach. It can be shown (by using the Bayes’ formula, equating the priors and doing some simple math) that the overall spam probability is

$$P = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1-p_1)(1-p_2)\cdots(1-p_n)}$$

This formula was used by Paul Graham, who pioneered the statistical approach to fighting spam.
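As an invented numeric illustration of the combination $P = \frac{\prod p_i}{\prod p_i + \prod (1-p_i)}$: take two hashtags with individual spam scores $p_1 = 0.95$ and $p_2 = 0.9$. Then

$$P = \frac{0.95 \cdot 0.9}{0.95 \cdot 0.9 + 0.05 \cdot 0.1} = \frac{0.855}{0.860} \approx 0.994,$$

which is more confident than either score alone: two independent spam signals reinforce each other.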

### The $\chi^2$ Approach

As Gary Robinson points out in his article, there’s another approach to combining probabilities, originating from the field of statistics known as meta-analysis. If we calculate the following value,

$$\chi^2 = -2 \ln \prod_{i=1}^{n} p_i = -2 \sum_{i=1}^{n} \ln p_i,$$

the result will have a $\chi^2$ distribution with $2n$ degrees of freedom as shown by R. A. Fisher. Therefore we can use a $\chi^2$ table to compute the probability of getting a result as extreme, or more extreme, than the one obtained. This “combined” probability meaningfully summarizes all the individual probabilities.
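For even degrees of freedom the $\chi^2$ tail probability has a closed form, $Q(x;\,2n) = e^{-x/2} \sum_{k=0}^{n-1} \frac{(x/2)^k}{k!}$, so the lookup can be done without a table. With invented scores $p_1 = 0.1$, $p_2 = 0.05$, $p_3 = 0.2$:

$$-2 \ln(0.1 \cdot 0.05 \cdot 0.2) = -2 \ln 0.001 \approx 13.82,$$
$$Q(13.82;\,6) = e^{-6.91}\left(1 + 6.91 + \tfrac{6.91^2}{2}\right) \approx 0.032$$

Such a small combined probability says these $p_i$’s are too low to be a chance fluctuation, i.e., the post looks hammy.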

In fact, this Fisher probability is tricky (read Gary’s article for explanations, especially about the null hypothesis and its verification). What we can notice from the formula is that the probabilities are multiplied. If we consider a ham post, most $p_i$’s will be close to zero, and the product will be a very small value. Due to that, the classifier will be very sensitive to “hammy” words. As for “spammy” words, their probabilities will be close to 1 and thus make less impact on the final product. So the classifier will be less sensitive to “spammy” words than it is to “hammy” ones (which will result in “false negatives”, i.e., spam posts mistakenly classified as ham).

But there’s a technique to deal with this bias towards ham. We can just repeat the Fisher calculation with “reversed” probabilities ($p'_i=1-p_i$) and obtain another indicator, which will be more spam-sensitive. Finally, both indicators (let’s call them $H$ and $S$) can be united into one:

$$I = \frac{1 + H - S}{2}$$

If $I$ is close to 0, the conclusion is that the post in question is ham. If $I$ is near 1, the conclusion is just the opposite. However, if both $H$ and $S$ are near 0 (this might happen if a post contains both “spammy” and “hammy” hashtags), $I$ will be around 0.5 (meaning “unsure”). This is a very handy feature that makes the $I$-based classifier’s results much more robust.
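Using $I = (1 + H - S)/2$ with invented indicator values: a clearly hammy post might give $H = 0.02$, $S = 0.97$, while a mixed post with both kinds of hashtags might give $H = 0.03$, $S = 0.02$:

$$I = \frac{1 + 0.02 - 0.97}{2} = 0.025 \quad \text{(ham)}, \qquad I = \frac{1 + 0.03 - 0.02}{2} = 0.505 \quad \text{(unsure)}$$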

## Conclusion

We’ve discussed some of the theory behind statistical classification methods. The next part will show how to implement the $f(w)$ and $\chi^2$ calculations in Clojure and build a classification microservice.

Stay tuned!
