My Codegram retreat was well spent working on a fun problem I never had the chance to work on before. One of our in-house app projects (which we hope to release to the public later this year!) is a curated platform to discover new tech talks to watch, a Netflix for tech talks if you wish.

Sourcing content is an easy task, as most tech conferences (including our very own Full Stack Fest) upload their videos on YouTube or a similar video hosting platform freely available.

However, content is only one part of the story -- how do we find relevant content, tag it, and make it searchable?

The nail: automatic document tagging

So we have all these videos now -- how can we know what they're talking about? YouTube metadata offers us little help, since most videos (if tagged at all!) are tagged something along the lines of "Science & Technology", which is not all that useful. We need to look deeper in the videos to understand the subtleties of tech-y language, and extract relevant tags about their content.

Fortunately, most videos on YouTube have available captions (user-uploaded or automatic), which are essentially a pretty faithful transcript of the talk. Working with transcripts is much more comfortable than working with video, as it's just raw text.

So the first step is being able to convert raw text into individual tokens and removing the stopwords (common words such as articles, prepositions and some verbs). You'll need to download a list of stopwords in your language of choice.

( require ' [ clojure.java.io :as io ] ' [ clojure.string :as str ] ) ( def stopwords ( with-open [ rdr ( io/reader "stopwords_en.txt" ) ] ( set ( line-seq rdr ) ) ) ) ( defn tokenize [ s ] ( ->> ( str /split ( str /lower-case s ) # "[^a-zäöüáéíóúãâêîôûàèìòùçñ]+" ) ( remove stopwords ) ) ) ( tokenize "Clojure is a pretty nice language." ) # = > ( "clojure" "pretty" "nice" "language" )

Great! Now we just need a way to extract relevant terms.

The hammer: TF-IDF

A possible approach is the so-called TF-IDF, which stands for Term Frequency and Inverse Document Frequency , respectively. Let's unpack that!

Term Frequency is a map of terms (words) to their frequency in the document, usually normalized (i.e divided by) the total word count of the document. This metric will tell us how often a word is repeated in the document, presumably telling us that it is an important word.

( defn term-frequencies [ tokens ] ( let [ freqs ( frequencies tokens ) term-count ( count tokens ) ] ( ->> freqs ( map ( fn [ [ term frequency ] ] [ term ( / frequency term-count ] ) ) ( into { } ) ) ) ) ( term-frequencies [ "a" "b" "b" "c" "c" "c" ] ) # = > { "a" 1 / 6 , "b" 1 / 3 , "c" 1 / 2 }

As expected, a only appears 1 out of 6 times, b 1 out of 3, and c half the time.

You're probably seeing the catch here: wouldn't the most frequent word always be something like a or the or some such article or preposition, just by virtue of being a common English word in any text? Yes, and even if we removed all those (also known as stopwords ) when pre-processing the document, the problem remains true of other common words that are likely to appear in many, many talks without being directly relevant to the topic (think code , development , build , like s and uhm s, etc).

To account for that, we need another metric: Inverse Document Frequency. This measures precisely how rare a term is across all the documents , assuming that terms that repeat a lot everywhere are less meaningful within a particular document. If we combine both these factors, literally multiplying them, we get TF-IDF for each term in a document.

Calculating the IDF for a single term is simple ( corpus is a vector of documents, but in this case a document can simply be a set of terms):

( defn idf [ term corpus ] ( let [ documents-matching-term ( count ( filter # ( % term ) corpus ) ) ] ( if ( > documents-matching-term 0 ) ( -> ( count corpus ) ( / documents-matching-term ) Math/log ( + 1 ) ) 1 . 0 ) ) ) ( let [ corpus [ # { "clojure" "pretty" "nice" "language" } # { "scala" "nice" "type" "system" } # { "rust" "nice " "borrow" "checker" } ] ] { "clojure" ( idf "clojure" corpus ) "nice" ( idf "nice" corpus ) } ) # = > { "clojure" 2 . 09861228866811 , "nice" 1 . 4054651081081644 }

As we see, clojure seems much more rare than nice (the latter appears in every document), which makes it more meaningful when it shows up.

The last step is multiplying our TFs with our IDFs.

( defn tf-idf [ document corpus ] ( ->> ( term-frequencies document ) ( map ( fn [ [ term freq ] ] [ term ( * freq ( idf term corpus ) ) ] ) ) ( into { } ) ) ) ( let [ corpus [ # { "clojure" "pretty" "nice" "language" } # { "scala" "nice" "type" "system" } # { "rust" "nice " "borrow" "checker" } ] ] ( tf-idf [ "go" "verbose" "language" "type" "system" ] corpus ) ) # = > { "go" 0 . 2 , "verbose" 0 . 2 , "language" 0 . 41972245773362205 , "type" 0 . 41972245773362205 , "system" 0 . 41972245773362205 }

If we take the terms with the highest value, we can treat them as tags or keywords, which is known as automatic document tagging -- very handy to start making sense of a large corpus of documents.

Recommendations with cosine similarity

Now that we distilled such an opaque data point (a video) into a neat set of keywords with a relevance score, we can proceed to calculate recommendations based on those scores.

These TF-IDF scores that we just computed can be seen as vectors of numbers in a vector space. And if we see a document as just a vector, we can compute similarity between documents by computing the distance between them in the vector space!

There are many ways to calculate the distance, but we'll go with one called cosine similarity (the math behind it is not that important):

( defn dot-product [ document another-document ] ( ->> document ( map ( fn [ [ term tf-idf ] ] ( * tf-idf ( get another-document term 0 . 0 ) ) ) ) ( reduce + ) ) ) ( defn magnitude [ document ] ( ->> document ( map ( fn [ [ _ tf-idf ] ] ( * tf-idf tf-idf ) ) ) ( reduce + ) Math/sqrt ) ) ( defn cosine-similarity [ document another-document ] ( let [ dot-p ( dot-product document another-document ) document-magnitude ( magnitude document ) another-document-magnitude ( magnitude ( select-keys another-document ( keys document ) ) ) magnitude-product ( * document-magnitude another-document-magnitude ) ] ( if ( zero? magnitude-product ) 0 . 0 ( / dot-p magnitude-product ) ) ) ) ( let [ doc- a { "hello" 2 . 5 "goodbye" 0 . 3 } ] ( cosine-similarity doc- a doc- a ) ) # = > 0 . 9999999999999999 ( cosine-similarity { "hello" 2 . 5 "goodbye" 0 . 3 } { "clojure" 2 . 1 "goodbye" 2 . 8 } ) # = > 0 . 11914522061843065

As expected, {"hello" 2.5 "goodbye" 0.3} is very similar to itself, and not so similar to {"clojure" 2.1 "goodbye" 2.8} .

If we compute similarities between a document and the rest of the corpus, we can then provide the n top recommendations by similarity.

Search can be approached as a similar problem to recommendations. If we think of the search query as a document in itself, and the search results as the most similar documents to it, we've already built a simple, content-based search feature!

( defn search [ query all-documents ] ( let [ query-document ( tf-idf ( term-frequencies ( tokenize query ) ) all-documents ) ] ( ->> all-documents ( map ( fn [ document ] [ document ( cosine-similarity query-document document ) ] ) ) ( sort-by last ) reverse ) ) ( def corpus [ { "clojure" 1 . 5 "pretty" 1 . 1 "nice" 0 . 2 "language" 2 . 5 } { "scala" 1 . 5 "nice" 0 . 2 "type" 0 . 8 "system" 0 . 9 } { "rust" 1 . 5 "nice" 0 . 2 "borrow" 0 . 8 "checker" 0 . 9 } ] ( search "type system" corpus ) # = > [ [ { "scala" 1 . 5 , "nice" 0 . 2 , "type" 0 . 8 , "system" 0 . 9 } 0 . 9982743731749959 ] [ { "rust" 1 . 5 , "nice" 0 . 2 , "borrow" 0 . 8 , "checker" 0 . 9 } 0 . 0 ] [ { "clojure" 1 . 5 , "pretty" 1 . 1 , "nice" 0 . 2 , "language" 2 . 5 } 0 . 0 ] ] ( search "nice language" corpus ) # = > [ [ { "clojure" 1 . 5 , "pretty" 1 . 1 , "nice" 0 . 2 , "language" 2 . 5 } 0 . 9341787663087262 ] [ { "rust" 1 . 5 , "nice" 0 . 2 , "borrow" 0 . 8 , "checker" 0 . 9 } 0 . 430165282498796 ] [ { "scala" 1 . 5 , "nice" 0 . 2 , "type" 0 . 8 , "system" 0 . 9 } 0 . 430165282498796 ] ]

The value we get next to our records is the relevance of the result (the similarity of the document to our query).

Running in production

Quickly prototyping the recommender system was fun, but my aim was to build a production system from the very beginning. As soon as I started importing content, the limitations of a naive prototype became apparent: calculating TF-IDFs for every document was expensive and had to be done often, precisely because to calculate the IDFs we need the whole corpus of documents in memory.

The first steps were pre-processing TF-IDFs for all documents in batches. This way, the recommendations and search could be done on the fly, as it involves just a few operations on numbers (yet still we need the whole corpus in memory!).

Every time we import a large number of documents, the IDFs of many terms are very likely to have changed -- then we re-run the batch process, and we're done. At this point, with a corpus of just a few thousand videos, performance seems perfectly fine and the whole corpus fits in a very tiny amount of memory (for Clojure standards, heh).

Making it really webscales

Now for the elephant in the room. As the corpus of documents grows, it will eventually stop fitting in memory. Here are some ideas to mitigate that, in no particular order.

Buy more memory (seriously)

It will get you surprisingly far. Clojure's persistent data structures are already quite memory efficient thanks to memory sharing, but admittedly I wouldn't bet on a corpus of documents sharing a lot of data.

Calculate TF-IDF on a subset of the corpus

It won't yield results of the same quality, but the advantage is that you can choose how much of the corpus you keep in memory, and change that depending on your budget.

Offload TF-IDF calculation to Apache Spark or similar

This is probably the last resort, as it increases the complexity of the system considerably, but it's worth considering if you do reach that corpus size and you're not willing to keep it in one huge process.

Conclusion

The design, development and optimization of this recommender system from scratch took less than a week, deploying about 10 times every day to production with much tweaking, re-importing and reprocessing.

It runs in a single process with only 512mb of memory, with a one-off task (recreating the TF-IDFs) needing about 1024mb of memory to run. The data is persisted in PostgreSQL and everything is hosted on Heroku for $25 a month.

Looking forward to iterating on this, stay tuned for the release!

Cover image credits to https://www.flickr.com/photos/brianjmatis/4333643093