About

How do we measure similarity?

The similarity between scores is measured based purely on their content, i.e. the music itself. No metadata are used.

The algorithm relies on a huge inverted index used in our score search engine, Peachnote.com. It is pretty straightforward and works as follows. We scan through the inverted index (the mapping between musical n-grams, in our case chord progressions, and the lists of scores they can be found in). We consider only rare patterns (n-grams that occur at most 40 times in our whole data set of about a million score sheets). For every such rare pattern, and for every pair of scores from the list of scores containing it, we increment a co-occurrence counter for the score pair by one. Then, for each score, we store a list of the scores for which the accumulated pair count exceeds 20 (we store the counts as well). This way, for every score containing rare musical patterns, we obtain a list of other scores that share more than 20 rare patterns with that score.
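The counting step described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the production map-reduce job: the index is assumed to be a plain dictionary from n-gram to a list of score ids, "occurs at most 40 times" is interpreted as "appears in at most 40 distinct scores", and the names `similar_scores`, `max_freq` and `min_shared` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Thresholds taken from the description; the parameter names are hypothetical.
MAX_PATTERN_FREQUENCY = 40
MIN_SHARED_PATTERNS = 20


def similar_scores(inverted_index,
                   max_freq=MAX_PATTERN_FREQUENCY,
                   min_shared=MIN_SHARED_PATTERNS):
    """Return {score: {other_score: shared_rare_pattern_count}}.

    inverted_index maps each n-gram to the list of score ids containing it
    (assumed layout; the real index lives in HBase).
    """
    pair_counts = defaultdict(int)
    for ngram, score_ids in inverted_index.items():
        unique_ids = sorted(set(score_ids))
        # Assumption: rarity is judged by the number of distinct scores.
        if len(unique_ids) > max_freq:
            continue
        # Every pair of scores sharing this rare pattern gets one more vote.
        for a, b in combinations(unique_ids, 2):
            pair_counts[(a, b)] += 1

    neighbours = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        # Keep only pairs whose accumulated count exceeds the threshold.
        if count > min_shared:
            neighbours[a][b] = count
            neighbours[b][a] = count
    return dict(neighbours)
```

With a toy index and lowered thresholds, for example `similar_scores(index, max_freq=3, min_shared=2)`, only pairs sharing more than two rare patterns survive, and each score's list carries the shared-pattern counts that can later be used for ranking.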

The more rare (and thus characteristic) patterns a pair of scores has in common, the more similar we judge them.

We find it surprising that this simple definition of similarity seems to capture musical semantics so well. Not only does it capture the specificity of a particular piece; scores by the same composer, of the same genre, or from the same period are also rated as more similar to each other.

The inverted index we use contains several hundred million n-grams and is stored in an HBase database. The algorithm is implemented in a single map-reduce job running in our Hadoop environment. We use the chord n-gram type. Rhythmic information is not considered.

The algorithm could be improved by using normalization (as a second map-reduce step). This way we could improve the quality of ranking and avoid the problem of hubs (large scores sharing many rare patterns with lots of other scores). Examples of such hubs currently are some large scores by Bach, orchestral scores of Fidelio and some Wagner operas. We plan to implement the normalization in the near future. Stay tuned!
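The text does not say which normalization is planned, so the following is only one possible choice, shown to illustrate how hubs could be damped: a cosine-style normalization that divides each shared count by the geometric mean of the two scores' rare-pattern totals. The function name `normalize_counts` and the `rare_pattern_totals` input (number of rare patterns per score) are assumptions made for this sketch.

```python
from math import sqrt


def normalize_counts(neighbours, rare_pattern_totals):
    """Cosine-style normalization of shared rare-pattern counts.

    neighbours: {score: {other_score: shared_count}}
    rare_pattern_totals: {score: number of rare patterns in that score}
    (hypothetical input; it would have to be computed in a separate pass)
    """
    normalized = {}
    for a, partners in neighbours.items():
        normalized[a] = {
            # Large scores have large totals, which shrinks their weight.
            b: count / sqrt(rare_pattern_totals[a] * rare_pattern_totals[b])
            for b, count in partners.items()
        }
    return normalized
```

Under this scheme a very large score that shares 30 rare patterns with a small one can rank below a small score that shares 25, because the large score's total inflates the denominator.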

The similarity data is available via the Peachnote API.

For more information about our system and the origin of the underlying data please refer to our paper "Peachnote: Music Score Search and Analysis Platform".

Who made this?

Mai Bui, Bachelor in Computer Science at the LMU Munich