Recently I was asked to generate samples from a multinomial distribution. "Sounds easy," I thought, having no idea that my day would be ruined by this assignment.
I used scipy.stats.rv_discrete at first, inspired by this post (am I a typical Stack Overflow programmer?). It's really slow when the number of samples is large: the task took me about half an hour to run. Later my (girl)friend told me to use numpy.random.multinomial instead, and the task finished within seconds.
I was totally shocked.
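For reference, here is roughly what the two library calls look like side by side (a minimal sketch; the toy distribution is made up for illustration):

```python
import numpy as np
from scipy import stats

dist = [0.2, 0.5, 0.3]  # toy distribution, for illustration only

# Scipy: build a discrete distribution object, then draw sample indices
rv = stats.rv_discrete(values=(range(len(dist)), dist))
scipy_samples = rv.rvs(size=10)  # array of 10 sampled indices

# Numpy: one multinomial draw of 10 trials, returned as per-outcome counts
numpy_counts = np.random.multinomial(10, dist)  # length-3 array summing to 10
```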
To test these methods, I wrote a simple script for an experiment.
```python
''' Comparing different ways to sample from a given discrete distribution
Inspired by 11761 hw 5
Author: Ang Lu
Date: 03/14/2015 '''
import sys, random
import numpy as np
from scipy import stats
from timeit import default_timer as timer

def generate_dist(voca):
    # Build a random distribution over `voca` outcomes and normalize it
    unnormed_dist = [random.random() for x in range(voca)]
    norm = sum(unnormed_dist)
    dist = [x * 1.0 / norm for x in unnormed_dist]
    return dist

def naive_sample(size, dist):
    # Inverse-CDF sampling: walk the cumulative sum until it passes rp
    #dist = sorted(dist, reverse=True)
    sample = []
    for i in range(size):
        rp = random.random()
        cum = 0
        for j, p in enumerate(dist):
            cum += p
            if rp <= cum:
                sample.append(j)
                break
    return sample

def scipy_rv_wrapper(size, dist):
    sample_machine = stats.rv_discrete(values=(range(len(dist)), dist))
    sample = sample_machine.rvs(size=size)
    return sample

def sample_with(function, dist, size):
    print("Sample with " + function.__name__)
    start = timer()
    sample = function(size, dist)
    end = timer()
    print("Time usage: %f" % (end - start))

def main():
    voca = int(sys.argv[1])   # size of the distribution
    size = int(sys.argv[2])   # number of samples to draw
    dist = generate_dist(voca)
    methods = [naive_sample, np.random.multinomial, scipy_rv_wrapper]
    for method in methods:
        sample_with(method, dist, size)

if __name__ == '__main__':
    main()
```
There are three methods that I tested: my naive inverse-CDF sampling method, Numpy, and Scipy. Here are the results (the numbers don't vary much among runs).
Running time (unit: s) of three methods (Sample number = 10000)
Running time (unit: s) of three methods (Dist size = 1000)
From the tables we can tell that Numpy is much faster than Scipy and my naive method, especially when the size of the distribution or the number of samples is large. A 200x+ speedup of Numpy over Scipy is typical.
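One subtlety worth keeping in mind when comparing these timings (an aside of mine, not from the original experiment): np.random.multinomial returns per-outcome counts, not a list of sampled indices like the other two methods. If individual indices are what you need, np.random.choice draws them directly and is also vectorized:

```python
import numpy as np

dist = [0.2, 0.5, 0.3]  # toy distribution, for illustration only

# What np.random.multinomial gives you: counts per outcome
counts = np.random.multinomial(10000, dist)

# If you need the individual sampled indices instead,
# np.random.choice with the `p` keyword draws them directly
samples = np.random.choice(len(dist), size=10000, p=dist)
```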
"Ten points for Numpy!"
It seems that all the methods suffer much more from increasing the distribution size than from increasing the number of samples. It also seems that, for the libraries, the running time doesn't grow linearly with the number of samples. This is interesting.
I can't pin down the exact cause, for there are many variables and I don't have a deep understanding of the Numpy implementation. There may be a better algorithm (for example, sorting the distribution for an early break makes my method 1.5x faster) or a better implementation at work. I would like to hear from anyone who can uncover the reason. So the takeaway message of this post is: next time you want to sample from a multinomial, choose Numpy.
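The sorting trick mentioned above can be sketched like this (my own sketch of the idea, with a hypothetical function name): scanning outcomes in descending probability means the inner loop breaks earlier on average, while an index map recovers the original outcome labels.

```python
import random

def naive_sample_sorted(size, dist):
    # Order outcomes by descending probability, remembering original indices,
    # so the cumulative scan usually terminates after a few iterations
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    probs = [dist[i] for i in order]
    sample = []
    for _ in range(size):
        rp = random.random()
        cum = 0.0
        for j, p in enumerate(probs):
            cum += p
            if rp <= cum:
                sample.append(order[j])  # map back to the original index
                break
        else:
            sample.append(order[-1])  # guard against float round-off
    return sample
```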
Thanks to Yulan Huang for sharing the Numpy module with me. There is also a short discussion about this on Stack Overflow.