Disco was born at Nokia Research Center in 2008 to solve real challenges in handling massive amounts of data. Since then it has been actively developed by Nokia and many other companies, which use it for a variety of purposes such as log analysis, probabilistic modelling, data mining, and full-text indexing.

Disco is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.

Disco in action

    from disco.core import Job, result_iterator

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        input = ["http://discoproject.org/media/text/chekhov.txt"]
        job = Job().run(input=input, map=map, reduce=reduce)
        for word, count in result_iterator(job.wait()):
            print word, count

This is a fully working Disco script that computes word frequencies in a text corpus. Disco automatically distributes the script to a cluster so that it can use all available CPUs in parallel. For details, see the Disco tutorial.
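To see what the map and reduce steps compute without a cluster, the same data flow can be imitated in plain Python. The following is only a minimal single-process sketch, not part of Disco: the names `map_words`, `group_by_key`, `reduce_counts`, and `word_count` are hypothetical, and `group_by_key` is an assumed stand-in for the grouping that `disco.util.kvgroup` performs on sorted key-value pairs. It runs sequentially on one machine and does no distribution.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    # Map step: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word, 1


def group_by_key(pairs):
    # Hypothetical local stand-in for disco.util.kvgroup: groups the values
    # of consecutive pairs that share a key, so the input must be sorted.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def reduce_counts(pairs):
    # Reduce step: sort the pairs, group them by word, and sum the counts.
    for word, counts in group_by_key(sorted(pairs)):
        yield word, sum(counts)


def word_count(lines):
    # Run map over every line, then feed all pairs to a single reduce.
    mapped = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(mapped))


if __name__ == "__main__":
    sample = ["the cat sat", "the cat ran"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In the Disco script, the framework takes over the parts that this sketch does by hand: it feeds input lines to the map function, routes the intermediate pairs to the reduce function, and collects the results across the cluster.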

Highlights

...and more! See the documentation for details.