"Ten points for Numpy!"

Recently I was asked to generate samples from a multinomial distribution. "Sounds easy," I thought, having no idea that my day would be ruined by this assignment.

I used scipy.stats.rv_discrete at first, inspired by this post (am I a typical Stack Overflow programmer?). It's really slow when the number of samples is large: the task took me about half an hour to run. Later my (girl)friend told me to use numpy.random.multinomial instead, and the task finished within seconds.
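For reference, the two calls look roughly like this (a minimal sketch; the toy dist and size values are made up for illustration):

import numpy as np
from scipy import stats

dist = [0.2, 0.5, 0.3]  # toy distribution, made up for illustration
size = 10000

# Scipy: build a discrete distribution object, then draw from it.
rv = stats.rv_discrete(values=(range(len(dist)), dist))
scipy_samples = rv.rvs(size=size)

# Numpy: a single call. Note it returns per-category counts
# (an array of length len(dist) summing to size), not a sequence
# of drawn indices.
numpy_counts = np.random.multinomial(size, dist)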

I was totally shocked.

To test these methods, I wrote some simple code to run an experiment.


'''
    Comparing different ways to sample from a given discrete distribution
    Inspired by 11761 hw 5
    Author: Ang Lu
    Date: 03/14/2015
'''
import sys, random
import numpy as np
from scipy import stats
from timeit import default_timer as timer

def generate_dist(voca):
    # Build a random discrete distribution over `voca` categories.
    unnormed_dist = [random.random() for _ in range(voca)]
    norm = sum(unnormed_dist)
    dist = [x / norm for x in unnormed_dist]
    return dist

def naive_sample(size, dist):
    # Naive inverse-transform sampling: draw a uniform number and
    # walk the CDF until it is exceeded.
    #dist = sorted(dist, reverse=True)  # sorting for an early break (discussed below)
    sample = []
    for i in range(size):
        rp = random.random()
        cum = 0
        for j, p in enumerate(dist):
            cum += p
            if rp <= cum:
                sample.append(j)
                break
    return sample

def scipy_rv_wrapper(size, dist):
    # Build a scipy discrete distribution object, then draw from it.
    sample_machine = stats.rv_discrete(values=(range(len(dist)), dist))
    sample = sample_machine.rvs(size=size)
    return sample

def sample_with(function, dist, size):
    print("Sample with " + function.__name__)
    start = timer()
    sample = function(size, dist)  # the sample itself is discarded; we only time the call
    end = timer()
    print("Time usage: %f" % (end - start))

def main():
    voca = int(sys.argv[1])  # distribution size (number of categories)
    size = int(sys.argv[2])  # number of samples to draw
    dist = generate_dist(voca)
    # Note: np.random.multinomial is called here as multinomial(size, dist),
    # i.e. n=size draws with probabilities dist; it returns per-category
    # counts rather than a list of sampled indices.
    methods = [naive_sample, np.random.multinomial, scipy_rv_wrapper]
    for method in methods:
        sample_with(method, dist, size)

if __name__ == '__main__':
    main()
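The script takes the distribution size and the number of samples as command-line arguments, e.g. (the file name is just a placeholder for wherever you saved it):

python compare_sampling.py 1000 10000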

I tested three methods: my naive inverse-transform sampling method, Numpy, and Scipy. Here are the results (the numbers don't vary much across runs).

Running time (unit: s) of three methods (Sample number = 10000)

Dist size    MySampling   Numpy    Scipy
100          0.0492       0.0000   0.0600
1000         0.4054       0.0003   0.0589
10000        3.9504       0.0015   0.1851

Running time (unit: s) of three methods (Dist size = 1000)

Sample num   MySampling   Numpy    Scipy
100          0.0041       0.0001   0.0024
1000         0.0454       0.0001   0.0073
10000        0.4307       0.0003   0.0675

From the tables we can tell that Numpy is much faster than Scipy and my naive method, especially when the distribution size or the number of requested samples is large. Numpy is typically 200x+ faster than Scipy.

"Ten points for Numpy!"

It seems that all the methods suffer much more from increasing the distribution size than from increasing the number of samples. For the two libraries, the running time doesn't even grow linearly with the number of samples. This is interesting.
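One caveat worth flagging (and, I suspect, part of the story, though I can't confirm it): numpy.random.multinomial returns per-category counts rather than a sequence of drawn indices, so its cost plausibly tracks the distribution size more than the sample count. If an actual sample sequence is needed, the counts can be expanded; a minimal sketch:

import numpy as np

dist = [0.2, 0.5, 0.3]  # toy distribution, made up for illustration
counts = np.random.multinomial(10000, dist)        # length-3 array of counts summing to 10000
samples = np.repeat(np.arange(len(dist)), counts)  # expand counts into an index sequence
np.random.shuffle(samples)                         # shuffle, since np.repeat emits indices in order
# (np.random.choice(len(dist), size=10000, p=dist) draws an index sequence directly.)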

I can't pin down the exact cause, since there are many variables and I don't have a deep understanding of the Numpy implementation. There may be a better algorithm (for example, sorting the distribution for an early break makes my method about 1.5x faster) or simply a better implementation. I would like to hear from anyone who can uncover the reason. So the takeaway of this post is: next time you want to sample from a multinomial, choose Numpy.
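For the curious, here is what I mean by the sorting trick (a sketch of my assumption of how it would look; the key point is that the original category index has to travel with each probability so the output still refers to the right categories):

import random

def sorted_naive_sample(size, dist):
    # Sort (index, probability) pairs so high-probability categories
    # come first and the inner loop breaks earlier on average.
    order = sorted(enumerate(dist), key=lambda x: x[1], reverse=True)
    sample = []
    for _ in range(size):
        rp = random.random()
        cum = 0.0
        for j, p in order:
            cum += p
            if rp <= cum:
                sample.append(j)  # j is the original category index
                break
    return sample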

Thanks to Yulan Huang for pointing me to the Numpy function. There is also a short discussion about this on Stack Overflow.
