This is the second in my series of implementing low-level machine learning algorithms in Matlab. We first did linear regression with gradient descent and now we’re working with the more popular naive bayes classifier. As is evident from the name, NB it is a classifier i.e. it sorts data points into classes based on some features. We’ll be writing code for NB using low-level matlab (meaning we won’t use matlab’s implementation of NB). Here’s the example we’ve taken (with a bit of modification) from here.

Consider the following vector:

(likes shortbread, likes lager, eats porridge, watched England play football, nationality)T

A vector $x = (1, 0, 1, 0, 1)^T$ would describe that a person likes shortbread, does not like lager, eats porridge, has not watched England play football and is a national of Scottland. The final point is the class that we want to predict and takes two values: 1 for Scottish, 0 for English.

Here’s the data we’re given:

 X = [ 0 0 1 1 0 ; 1 0 1 0 0 ; 1 1 0 1 0 ; 1 1 0 0 0 ; 0 1 0 1 0 ; 0 0 1 0 0 ; 1 0 1 1 1 ; 1 1 0 1 1 ; 1 1 1 0 1 ; 1 1 1 0 1 ; 1 1 1 1 1 ; 1 0 1 0 1 ; 1 0 0 0 1 ]; 

Notice that usually when we represent data, we write features in columns, instances in rows. If this is the case, we need to get the data in proper orientation: features in rows, instances in columns. That’s the convention. Also, we need to separate the class from the feature set:

Y = X(:,5);
X = X(:,1:4)'; % X in proper format now.


Alright. Now, that we have the data, let’s hear some theory. As always, this isn’t a tutorial on statistics. Go read about the theory somewhere else. This is just a refresher:

In order to predict the class from a feature set, we need to find out the probability of Y given X (where

$X = ( x_1, x_2, \ldots x_n )$

with n being the number of features. We denote the number of instances given to us as m. In our example, n = 4, m = 13. The probability of Y given X is:

$P(Y=1|X) = P(X|Y=1) * P(Y=1) / P(X)$

Which is called the Bayes rule. Now, we make the NB assumption: All features in the feature set are independant of each other! Strong assumption but usually works. Given this assumption, we need to find $P(X|Y=1), P(Y) and P(X)$.

(The weird braces notation that follows is the indicator notation. $1\{ v \}$ means use 1 only if condition v holds, 0 otherwise.)

$P(X) = P(X|Y=1) + P(X|Y=0)$

$P(X|Y=1) = \prod_j{P(x_i|Y=1)}$

To find $P(X|Y=1)$, you just have to find $P(x_i|Y=1)$ for all features and multiply them together. This is where the assumption comes in. You need the assumption of independence here for this.

$P(x_i|Y=1) = \sum_j{1\{x_i^j = 1, y^j = 1\}} / \sum_j{1\{y^j = 1\}}$

This equation basically means count the number of instances for which both x_i and Y are 1 and divide by the count of Y being 1. That’s the probability of x_i appearing with Y. Fairly straight forward if you think about it.

$P(Y=1) = \sum_j{1\{y^j = 1 \}} / \sum_j{1\{y^j = 1, y^j = 0 \}}$

Same as above. Count the ratio of Y=1 with the total number of Ys. Notice that we need to calculate all these for both Y=0 and Y=1 because we need both in the first equation. Let’s begin from the bottom up. For all of below, consider E as 0 and S as 1 since we consider being Scottish as being in class 1 (positive example).

P(Y):

pS = sum (Y)/size(Y,1);     % all rows with Y = 1
pE = sum(1 - Y)/size(Y,1);  % all rows with Y = 0


P(x_i|Y):

phiS = X * Y / sum(Y);  % all instances for which attrib phi(i) and Y are both 1
% meaning all Scotts with attribute phi(i)  = 1
phiE = X * (1-Y) / sum(1-Y) ;  % all instances for which attrib phi(i) = 1 and Y =0
% meaning all English with attribute phi(i) = 1


PhiS and PhiE are vectors that store the probabilities for all attributes. Now that we have the probabilities, we’re ready to make a prediction. Let’s get a test datapoint:

x=[1 0 1 0]';  % test point


And calculate the probabilities P(X|Y=1) and P(X|Y=0)

pxS = prod(phiS.^x.*(1-phiS).^(1-x));
pxE = prod(phiE.^x.*(1-phiE).^(1-x));


And finally, the probabilities of P(Y=1|X) and P(Y=0|X)

pxSF = (pxS * pS ) / (pxS + pxE)
pxEF = (pxE * pS ) / (pxS + pxE)


They should add upto 1 since there are only two classes. Now you can define a threshold for deciding whether the class should be considered 1 or 0 based on these probabilities. In this case, we can consider this test point to belong to class 1 since the probability pxSF > 0.5.

And there you have it!

1. Thank you for it. Your work is interesting. We are working on speech recognition based on the processing strategy of the cochlear nucleus and interested to apply the probabilistic gradient descent method to evaluate its performance.

2. recluze

June 5, 2011 at 1:25 pm

Thanks for the comment Seraj. I’m taking a look at your webpage to find out more about your work. It sounds very interesting.

(btw, the layout of your webpage is a little messed up on my Chrome 11 browser.)

3. I think you’re missing the multiplication by the priors pS and pE in the calculation of pxSF and pxEF. You’re using the Bayes rule here right?

4. test

5. Hi,

I am trying to learn the Naive Bayes classifier. I just copy your code but the final results do not add up to 1.

Many thanks

Ricardo

X = [ 0 0 1 1 0 ;
1 0 1 0 0 ;
1 1 0 1 0 ;
1 1 0 0 0 ;
0 1 0 1 0 ;
0 0 1 0 0 ;
1 0 1 1 1 ;
1 1 0 1 1 ;
1 1 1 0 1 ;
1 1 1 0 1 ;
1 1 1 1 1 ;
1 0 1 0 1 ;
1 0 0 0 1 ];
Y = X(:,5);
X = X(:,1:4)’;
pS = sum (Y)/size(Y,1);
pE = sum(1 – Y)/size(Y,1);
phiS = X * Y / sum(Y);
phiE = X * (1-Y) / sum(1-Y) ;
x=[1 0 1 0]’;
pxS = prod(phiS.^x.*(1-phiS).^(1-x));
pxE = prod(phiE.^x.*(1-phiE).^(1-x));
pxSF = (pxS * pS ) / (pxS + pxE)
pxEF = (pxE * pS ) / (pxS + pxE)

>> NaiveBayesRecluze

pxSF =

0.3967

pxEF =

0.1417

>>

6. Hi,
According to Nait and Ricardo, I recalculated it like this :
pxSF = (pxS * pS ) / (pxS * pS + pxE * pE)
pxEF = (pxE * pE ) / (pxS * pS + pxE * pE)
===============================================
X = [ 0 0 1 1 0 ;
1 0 1 0 0 ;
1 1 0 1 0 ;
1 1 0 0 0 ;
0 1 0 1 0 ;
0 0 1 0 0 ;
1 0 1 1 1 ;
1 1 0 1 1 ;
1 1 1 0 1 ;
1 1 1 0 1 ;
1 1 1 1 1 ;
1 0 1 0 1 ;
1 0 0 0 1 ];
Y = X(:,5);
X = X(:,1:4)’;
pS = sum (Y)/size(Y,1);
pE = sum(1 – Y)/size(Y,1);
phiS = X * Y / sum(Y);
phiE = X * (1-Y) / sum(1-Y) ;
x=[1 0 1 0]’;
pxS = prod(phiS.^x.*(1-phiS).^(1-x));
pxE = prod(phiE.^x.*(1-phiE).^(1-x));
pxSF = (pxS * pS ) / (pxS * pS + pxE * pE)
pxEF = (pxE * pE ) / (pxS * pS + pxE * pE)

pxSF =

0.7656

pxEF =

0.2344

pxEF = (pxE * pS ) / (pxS + pxE)
to
pxEF = (pxE * pE ) / (pxS + pxE)

8. This one is correct:

pxSF = (pxS * pS ) / (pxS * pS + pxE * pE)
pxEF = (pxE * pE ) / (pxS * pS + pxE * pE)

9. hello -please send to me simple matlab code for bayesian method-statistical classifier hurrily
best regards…..

10. hello please send to me some source codes for linear classifiers :1-widrow-Hoff
2-Ho-Koshyop completly and correctly – and for Non-parametric classifiers:
1-parzen 2-k.nearest neighbor and N-dimensional bayesian method hurrily
best regards……

11. sir,
I need the matlab codes for automated detection of diabetic retinopathy from microanerysms using naive bayes classifier

12. what are the steps hav i do detect the diabetic retinopathy from microanerysms using naive bayes classifier

13. the P(X) formula doesn’t make sense, it should be possible to be rearranged back to the original Bayes formula but it doesn’t.

14. Ok the other comments above have corrected it. This is why you always sufficiently explain the algorithm or theory properly otherwise you end up with messed up coding later.

15. sir,
I need the matlab codes for prediction of missing in data set using naive bayes classifier