This post is inspired by a discussion thread on the Piazza forum for course 10-605/10-805, which I am taking at CMU. The question is summarized below:
In the classification process of a Naive Bayes (NB) classifier, we use Maximum Likelihood Estimation (MLE) to compute the joint distribution of the label Y and all words from a vocabulary of size |V| as follows:
Thus, to mitigate the effect of overfitting, we use a smoothing method:
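For concreteness, the two per-word estimators are commonly written as below. The notation is an assumption made for this sketch, not necessarily the exact notation of the course slides: $C(\cdot)$ is a count over the training data, $C(Y = y)$ is the total number of word occurrences seen with label $y$, and $\alpha_0 > 0$ is a pseudo-count added to every word.

```latex
\hat{P}_{\mathrm{MLE}}(X = x \mid Y = y)
  = \frac{C(X = x \wedge Y = y)}{C(Y = y)}

\hat{P}_{\mathrm{smooth}}(X = x \mid Y = y)
  = \frac{C(X = x \wedge Y = y) + \alpha_0}{C(Y = y) + \alpha_0 |V|}
```

Writing $\alpha_0 = mq$ with $q = 1/|V|$ gives the equivalent form $(C(X = x \wedge Y = y) + mq) / (C(Y = y) + m)$. The joint distribution is then the class prior times the product of these per-word terms.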
Actually, in class last Thursday I was confused by this too. It was not THAT obvious to me at first glance. Then my intuition told me that it makes sense: smoothing does not change the probabilities much, but it avoids a zero probability for a feature X from the vocabulary that has not been seen in training. That convinced me to stop thinking and go out for a spicy pot. Yes, it is a kind of smoothing, but could we call it a mitigation of overfitting? Mmm... I should go for dinner first.
If I had forgotten about this, I would still have nothing on my blog. :( Fortunately, I found the discussion thread on Piazza. I think this is a good question, and my classmates and the TA have given some inspiring explanations of the extreme case (say, the zero-probability case). Professor Cohen, the lecturer of the course, shared a great piece of reading material in the discussion, which shows that the improved method is called MAP (also an example of "additive smoothing", which is part of the title) and is obtained by using a Dirichlet prior. This took me further than I had ever been on this topic. How I wish I could make the thread available to the reader.
Well, that really makes sense. Actually, I had always thought that MLE causes some kind of overfitting, while Bayesian Inference (BI) may mitigate the problem a little: MLE returns only a single optimal parameter value, whereas BI returns a posterior distribution over the whole parameter space. We may then use the expectation of a parameter under that distribution as the desired output. Taking the expectation naturally smooths the parameters, since it is a kind of weighted average over the whole parameter space. I think this mitigates the overfitting problem to some extent.
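To make the contrast concrete, here is a minimal Python sketch comparing the MLE point estimate with the posterior mean under a symmetric Dirichlet prior. The toy vocabulary, the counts, and the function names `mle_estimate` and `posterior_mean_estimate` are made up for illustration, and `alpha=1.0` is just one possible choice of pseudo-count.

```python
from collections import Counter


def mle_estimate(counts, vocab):
    """Maximum-likelihood estimate: the relative frequency of each word."""
    n = sum(counts[w] for w in vocab)
    return {w: counts[w] / n for w in vocab}


def posterior_mean_estimate(counts, vocab, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior (assumed here)."""
    n = sum(counts[w] for w in vocab)
    return {w: (counts[w] + alpha) / (n + alpha * len(vocab)) for w in vocab}


# Toy data (hypothetical): "exam" is in the vocabulary but never observed.
vocab = ["spicy", "pot", "dinner", "exam"]
counts = Counter(["spicy", "pot", "spicy", "dinner"])

mle = mle_estimate(counts, vocab)
smoothed = posterior_mean_estimate(counts, vocab, alpha=1.0)

print(mle["exam"])       # 0.0
print(smoothed["exam"])  # 0.125
```

The MLE assigns the unseen word probability zero, which would wipe out the whole product in NB, while the posterior mean keeps it small but positive and still sums to one over the vocabulary.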
So the rest is just the beautiful math, mostly from reference 1. Please enjoy.
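For the NB classifier, we suppose that the words of a document are i.i.d. draws, so that the word counts $x = (x_1, \dots, x_{|V|})$ follow a multinomial $(n, \theta)$ with $\theta = (\theta_1, \dots, \theta_{|V|})$, where $n = \sum_i x_i$ is the document size. Then

$$p(x \mid \theta) = \frac{n!}{\prod_{i=1}^{|V|} x_i!} \prod_{i=1}^{|V|} \theta_i^{x_i}$$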
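When we use a Dirichlet distribution as the prior, with parameter $\alpha = (\alpha_1, \dots, \alpha_{|V|})$ whose elements are all equal to a common value $\alpha_0$ and $\alpha_0 > 0$, we have:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{|V|} \alpha_i\right)}{\prod_{i=1}^{|V|} \Gamma(\alpha_i)} \prod_{i=1}^{|V|} \theta_i^{\alpha_i - 1}$$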
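Then we have a posterior distribution, which is again a Dirichlet distribution, now with parameters $x_i + \alpha_i$:

$$p(\theta \mid x, \alpha) = \frac{p(x \mid \theta)\, p(\theta \mid \alpha)}{p(x \mid \alpha)} \propto \prod_{i=1}^{|V|} \theta_i^{x_i + \alpha_i - 1}$$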
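The normalizing constant is given by the integral of the unnormalized posterior over the probability simplex:

$$\int \prod_{i=1}^{|V|} \theta_i^{x_i + \alpha_i - 1}\, d\theta = \frac{\prod_{i=1}^{|V|} \Gamma(x_i + \alpha_i)}{\Gamma\left(n + \sum_{i=1}^{|V|} \alpha_i\right)}$$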
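To estimate the value of a specific parameter $\theta_i$, which is also the estimate of the probability of the $i$-th word, we calculate the mean of this posterior:

$$E[\theta_i \mid x, \alpha] = \frac{x_i + \alpha_i}{n + \sum_{j=1}^{|V|} \alpha_j}$$

When all $\alpha_j$ are equal to a common value $\alpha_0$, this becomes $(x_i + \alpha_0) / (n + |V| \alpha_0)$.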
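As a sanity check on the posterior mean, the sketch below draws samples from the Dirichlet posterior and compares their average with the closed-form expression. It uses only the standard library and samples a Dirichlet by normalizing independent Gamma draws; the counts, the prior value 0.5, the seed, and the helper names are arbitrary choices for illustration.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one Dirichlet(params) sample by normalizing Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_mean(counts, alpha):
    """Closed form: (x_i + alpha_i) / (n + sum_j alpha_j)."""
    n = sum(counts)
    alpha_sum = sum(alpha)
    return [(x + a) / (n + alpha_sum) for x, a in zip(counts, alpha)]


rng = random.Random(0)        # fixed seed so the run is repeatable
counts = [3, 0, 1]            # hypothetical word counts x_i
alpha = [0.5, 0.5, 0.5]       # symmetric prior, value chosen arbitrarily
posterior_params = [x + a for x, a in zip(counts, alpha)]

num_samples = 20000
sums = [0.0] * len(counts)
for _ in range(num_samples):
    theta = sample_dirichlet(posterior_params, rng)
    for i, t in enumerate(theta):
        sums[i] += t

monte_carlo_mean = [s / num_samples for s in sums]
closed_form_mean = posterior_mean(counts, alpha)

for mc, cf in zip(monte_carlo_mean, closed_form_mean):
    print(f"monte carlo: {mc:.3f}   closed form: {cf:.3f}")
```

The two columns should agree to about two decimal places, and the word with zero count still gets a non-zero mean.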
Finally, using the NB assumption that the words are i.i.d. given the label, together with some basic conditional probability to bring in Y (which I omit deliberately for simplicity), we get the original formula from the course.
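The step that brings in Y amounts to keeping one set of counts per label and applying the same posterior-mean formula within each label. Below is a minimal sketch of that idea, not the course's implementation: the function names `train`, `log_word_prob`, and `predict`, the toy documents, and the choice `alpha=1.0` are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """docs is a list of (label, words) pairs. Returns the counts as a model."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word counts within each label
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab, alpha


def log_word_prob(word, label, model):
    """Smoothed log P(word | label): (count + alpha) / (total + alpha * |V|)."""
    _, word_counts, vocab, alpha = model
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))


def predict(words, model):
    """Pick the label with the highest log prior plus summed log word probs."""
    label_counts, _, vocab, _ = model
    num_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / num_docs)
        # Words outside the training vocabulary are skipped in this sketch.
        score += sum(log_word_prob(w, label, model) for w in words if w in vocab)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy corpus (hypothetical).
docs = [
    ("food", ["spicy", "pot", "dinner"]),
    ("food", ["spicy", "dinner"]),
    ("school", ["exam", "lecture", "piazza"]),
    ("school", ["lecture", "exam"]),
]
model = train(docs, alpha=1.0)
print(predict(["spicy", "exam", "dinner"], model))  # food
```

Without smoothing, the word "exam" would have probability zero under the "food" label and the log would be undefined; with the pseudo-count, the document is still classified by its remaining words.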
Thank you for reading. Feel free to point out any mistake or discuss with me.
1. Christopher M. Bishop. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA.