"Ten points for Numpy!"

Recently I was asked to generate samples from a multinomial distribution. "Sounds easy," I thought, having no idea that my day would be ruined by this assignment.

I used scipy.stats.rv_discrete at first, inspired by this post (am I a typical Stack Overflow programmer?). It's really slow when the number of samples is large: the task took me about half an hour to run. Later my (girl)friend told me to use numpy.random.multinomial instead, and the task finished within seconds.

I was totally shocked.

To test these methods, I wrote some simple code to run an experiment.


'''
    Comparing different ways to sample from a given discrete distribution
    Inspired by 11761 hw 5
    Author: Ang Lu
    Date: 03/14/2015
'''
import sys, random
import numpy as np
from scipy import stats
from timeit import default_timer as timer

def generate_dist(voca):
    # Generate a random discrete distribution over `voca` outcomes.
    unnormed_dist = [random.random() for x in range(voca)]
    norm = sum(unnormed_dist)
    dist = [x * 1.0 / norm for x in unnormed_dist]
    return dist

def naive_sample(size, dist):
    # Naive inverse-transform sampling: for every sample, walk the
    # cumulative distribution until it passes a uniform random draw.
    #dist = sorted(dist, reverse=True)
    sample = []
    for i in range(size):
        rp = random.random()
        cum = 0
        for j, p in enumerate(dist):
            cum += p
            if rp <= cum:
                sample.append(j)
                break
    return sample

def scipy_rv_wrapper(size, dist):
    # Build a scipy.stats.rv_discrete distribution over the indices
    # 0..len(dist)-1 and draw all samples at once.
    sample_machine = stats.rv_discrete(values=(range(len(dist)), dist))
    sample = sample_machine.rvs(size=size)
    return sample

def sample_with(function, dist, size):
    print("Sample with " + function.__name__)
    start = timer()
    sample = function(size, dist)
    end = timer()
    print("Time usage: %f" % (end - start))

def main():
    voca = int(sys.argv[1])   # distribution size
    size = int(sys.argv[2])   # number of samples
    dist = generate_dist(voca)
    methods = [naive_sample, np.random.multinomial, scipy_rv_wrapper]
    for method in methods:
        sample_with(method, dist, size)

if __name__ == '__main__':
    main()

I tested three methods: my naive inverse-transform sampling method, Numpy and Scipy. Here are the results (the numbers don't vary much between runs).

Running time (unit: s) of three methods (Sample number = 10000)

Dist size    MySampling    Numpy     Scipy
100          0.0492        0.0000    0.0600
1000         0.4054        0.0003    0.0589
10000        3.9504        0.0015    0.1851

Running time (unit: s) of three methods (Dist size = 1000)

Sample num   MySampling    Numpy     Scipy
100          0.0041        0.0001    0.0024
1000         0.0454        0.0001    0.0073
10000        0.4307        0.0003    0.0675

From the tables we can tell that Numpy is much faster than Scipy and my naive method, especially when the size of the distribution or the number of requested samples is large. A 200x or greater speedup of Numpy over Scipy is typical.

"Ten points for Numpy!"

It seems that all the methods suffer much more from an increase in distribution size than from an increase in the number of samples. It also seems that, for the libraries, the running time doesn't grow linearly with the number of samples. This is interesting.

I can't pinpoint the exact reason for all of this: there are too many variables, and I don't have a deep understanding of the Numpy implementation. There may be a better algorithm (for example, sorting the distribution for an early break makes my method about 1.5x faster) or simply a better implementation. I would like to hear from anyone who can uncover the reason. So the takeaway message of this post is: next time you want to sample from a multinomial, choose Numpy.
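
For reference, here is a minimal sketch of the sorted early-break variant mentioned above (sorted_naive_sample is my own name for it, not part of my original script). It sorts (index, probability) pairs instead of the bare probabilities, so the returned indices still refer to the original categories:

import random

def sorted_naive_sample(size, dist):
    # Sort (index, probability) pairs by descending probability so the
    # cumulative sum tends to pass the uniform draw sooner, letting the
    # inner loop break earlier while keeping the original category indices.
    order = sorted(enumerate(dist), key=lambda pair: pair[1], reverse=True)
    sample = []
    for _ in range(size):
        rp = random.random()
        cum = 0.0
        for j, p in order:
            cum += p
            if rp <= cum:
                sample.append(j)  # j is the original category index
                break
    return sample

The sampling is still correct after reordering, because the probability of landing in category j's interval of the cumulative sum is p_j regardless of where that interval sits.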

Thanks to Yulan Huang for sharing the Numpy module with me. There is also a short discussion about this on Stack Overflow.


Stepping A Little Deeper into Additive Smoothing

This post is inspired by a discussion thread on the Piazza forum of 10-605/10-805, a course I am taking at CMU. The question is summarized below:

In the classification process of a Naive Bayes (NB) classifier, using Maximum Likelihood Estimation (MLE) to compute the joint distribution of the label Y and all words from a vocabulary of size |V| as follows:

Pr(y,x_1,\ldots,x_d) = \prod_{j=1}^d \frac{Count(X=x_{j},Y=y)}{Count(X=ANY,Y=y)}\frac{Count(Y=y)}{Count(Y=ANY)}

may overfit.

Thus to mitigate the effect of overfitting, we will use a smoothing method:

Pr(y,x_1,\ldots,x_d) = \prod_{j=1}^d \frac{Count(X=x_{j},Y=y) + mq_{x}}{Count(X=ANY,Y=y)+m}\frac{Count(Y=y)+mq_{y}}{Count(Y=ANY)+m}

Here q_{x} = 1/|V|, q_{y} = 1/|dom(Y)|, and m = 1.

Actually, back in class last Thursday I was confused by this too; it was not THAT obvious to me at first glance. Then my intuition, that this makes sense because it barely changes the probabilities while avoiding zero probability for features X that haven't been seen with a class, convinced me to stop thinking and go out for a spicy pot. Yes, it is a kind of smoothing, but can we really call it a mitigation of overfitting? Mmm... should go for dinner first.
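
To make that intuition concrete, here is a minimal numeric sketch (the vocabulary and counts are made up for illustration, not taken from the course): frequent words barely move, while an unseen word goes from probability 0 to a small positive value.

# Toy illustration of additive smoothing with made-up counts.
# Vocabulary size |V| = 4, so q_x = 1/4; smoothing strength m = 1.
V = 4
m = 1.0
q_x = 1.0 / V

# Hypothetical word counts for one class y; the last word is unseen.
counts = {"good": 6, "great": 3, "fine": 1, "awful": 0}
total = sum(counts.values())  # Count(X=ANY, Y=y) = 10

for word, c in counts.items():
    mle = float(c) / total
    smoothed = (c + m * q_x) / (total + m)
    print("%-6s MLE=%.3f  smoothed=%.3f" % (word, mle, smoothed))

# "good" goes from 0.600 to about 0.568, while "awful" goes from 0.000 to
# about 0.023: frequent words barely change, but an unseen word no longer
# zeroes out the whole product in the Naive Bayes formula.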

If I had forgotten about this, I would still have nothing on my blog. :( Fortunately, I found the discussion thread on Piazza. I think this is a good question, and my classmates and the TA gave some inspiring explanations of the extreme case (the zero-probability case). Professor Cohen, the lecturer of the course, shared a great reading in the discussion, which shows that the improved method is the MAP estimate (also an example of "additive smoothing", hence part of the title), obtained using a Dirichlet prior. That took me further than I had gotten on my own. I wish I could make the thread available to the reader.

Well, that really makes sense. Actually, I had always thought that MLE causes some kind of overfitting, while Bayesian Inference (BI) may mitigate the problem a little: MLE returns only a single optimal parameter value, but BI returns a posterior distribution over the whole parameter space. We can then use the expectation of a parameter under that distribution as the desired output. By taking the expectation, the parameters are naturally smoothed, since it is a kind of weighted average over the whole parameter space. I think this mitigates the overfitting problem to some extent.

The rest is just the beautiful math, mostly from ref [2]. Please enjoy.

For the NB classifier, we suppose that X_{1}, X_{2}, \ldots, X_{n} are i.i.d. draws from a multinomial (n, \theta) distribution, where n = d is the document size. Then

P(X=x|\theta) \propto \theta_{1}^{\sum_{j=1}^n 1(x_{j} = 1)} \ldots \theta_{k}^{\sum_{j=1}^n 1(x_{j} = k)}


When we use a Dirichlet prior with parameter \alpha \in \mathbb{R}^k, with all elements equal to mq_{x} and k = |V|, we have:

P(\theta|\alpha) \propto \theta_{1}^{mq_{x}-1} \ldots \theta_{k}^{mq_{x}-1}


Then the posterior distribution is:

P(\theta|x,\alpha) \propto \theta_{1}^{\sum_{j=1}^n 1(x_{j} = 1)+mq_{x}-1} \ldots \theta_{k}^{\sum_{j=1}^n 1(x_{j} = k)+mq_{x}-1}

The normalizing constant is given by

\frac{\Gamma\left(\sum_{i=1}^k mq_{x} + n\right)}{\prod_{i=1}^k \Gamma\left(mq_{x} + \sum_{j=1}^n 1(x_{j}=i)\right)}

To estimate a specific parameter \theta_i, which is also the estimate of P(X=x_i|x, \alpha), we compute the mean of this posterior:

\mathbb{E}[\theta_i|x,\alpha] = \frac{mq_{x} + \sum_{j=1}^{n} 1(x_{j}=i)}{n+\sum_{l=1}^k mq_{x}} = \frac{Count(X=x_{i}) + mq_{x}}{Count(X=ANY)+m}

Finally, using the NB assumption that the X_{i} are i.i.d. and a bit of conditional probability to bring Y back in (steps I deliberately omit for simplicity), we recover the original formula from the course.
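
As a quick sanity check of the posterior mean formula (a sketch with made-up counts, not part of the original derivation), the closed-form value (Count(X=x_i) + mq_x) / (Count(X=ANY) + m) should match a Monte Carlo average of draws from the posterior Dirichlet, sampled here with numpy.random.dirichlet:

import numpy as np

# Made-up setup: k = |V| = 5 categories, n = 20 observed draws.
k = 5
m = 1.0
q_x = 1.0 / k
counts = np.array([9, 6, 3, 2, 0], dtype=float)  # n_i = sum_j 1(x_j = i)
n = counts.sum()

# Closed-form posterior mean: (n_i + m*q_x) / (n + m)
closed_form = (counts + m * q_x) / (n + m)

# Monte Carlo estimate: sample theta from the posterior Dirichlet
# with parameters (m*q_x + n_i) and average the samples.
posterior_samples = np.random.dirichlet(m * q_x + counts, size=200000)
monte_carlo = posterior_samples.mean(axis=0)

print(closed_form)
print(monte_carlo)  # should agree with closed_form up to sampling noise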

Thank you for reading. Feel free to point out any mistakes or to discuss them with me.

----------------

References:

1. Christopher M. Bishop. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA.

2. http://www.cs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture4.pdf

3. http://www.cs.cmu.edu/~wcohen/10-605/prob-tour+bayes.pdf

4. http://www.cs.cmu.edu/~wcohen/10-601/prob-tour+bayes.pdf

5. https://en.wikipedia.org/wiki/Additive_smoothing
