I will start working on the 2018 OpenSubtitles dataset soon. Watch this space.

Download: the frequency word lists for the 2016 OpenSubtitles dataset, and the code used to generate them, are now publicly available.

Click here to go to the GitHub repository

Previous post and links to old data files

Go to the SkyDrive download page

I originally created the word lists while trying to improve the dictionaries I used for my Windows Phone app, Slydr.

Of course there were commercial options. However, I was quoted about £500 per language for a nice, cleaned word list. Being a cheap git, I decided to create my own.

If you decide to use the lists, please let me know what you are using them for. They're yours to use.

Note: I used public / free subtitles to generate these lists and, like most things, they will have errors.

I would like to thank opensubtitles.org, as their subtitles form the basis of the word lists. I would also like to thank Tehran University for the Persian language corpus, which allowed me to build the Persian / Farsi word list (2011 version).

While the subtitles are free, donations do motivate further work. If you would like to donate, please click the Donate button to pay using PayPal.

If you would like to create your own word lists, here's something to get you started: download FrequencyWordsHelper. When you run the app, it asks for a directory to scan and then for an output filename. Once you provide both, it scans the directory for all txt files and creates a word list from them. The app requires .NET Framework 4.5.
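If you would rather not run the .NET app, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not the code behind FrequencyWordsHelper: the function names, the tokenizing regular expression, the lower-casing, and the choice to scan only the top level of the directory are all my assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenizer: runs of Unicode letters, allowing apostrophes inside a word
# (so "don't" stays one word). The real tool may split text differently.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_words(directory):
    """Count lower-cased words across every .txt file directly inside `directory`."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        # Subtitle files come in mixed encodings; undecodable bytes are skipped here.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def write_word_list(counts, output_path):
    """Write one 'word count' pair per line, most frequent word first."""
    with open(output_path, "w", encoding="utf-8") as out:
        for word, n in counts.most_common():
            out.write(f"{word} {n}\n")
```

Calling `write_word_list(count_words("subtitles"), "wordlist.out")` would scan a (hypothetical) `subtitles` folder and write the result to `wordlist.out`.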

Format of the frequency lists:

word1 number1 (number1 is the number of occurrences of word1 across all files)

word2 number2 (number2 is the number of occurrences of word2 across all files)
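Reading a list back is straightforward. Here is a minimal sketch, assuming each line holds a word and its count separated by whitespace; `parse_frequency_list` is a name I made up for illustration, and the sample lines are placeholders rather than real data.

```python
def parse_frequency_list(lines):
    """Yield (word, count) pairs from 'word count' lines, skipping malformed ones."""
    for line in lines:
        # Split on the last run of whitespace so the count is always the final field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        word, number = parts
        try:
            yield word, int(number)
        except ValueError:
            # The list may contain errors; ignore lines whose count is not a number.
            continue


# Placeholder lines in the same shape as the files.
sample = ["word1 120\n", "word2 45\n", "\n"]
print(list(parse_frequency_list(sample)))  # [('word1', 120), ('word2', 45)]
```

To read a downloaded file, pass an open file object instead of the sample list, for example `parse_frequency_list(open(path, encoding="utf-8"))`. The UTF-8 encoding is also an assumption, so check it against the file you have.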

Language                      2011      2012
Arabic – ar                   Download  Download
Bulgarian – bg                Download  Download
Czech – cs                    Download  Download
Danish – da                   Download  Download
German – de                   Download  Download
Greek – el                    Download  Download
English – en                  Download  Download
Spanish – es                  Download  Download
Estonian – et                 Download  Download
Farsi – fa                    Download  Download
Finnish – fi                  Download  Download
French – fr                   Download  Download
Hebrew – he                   Download  Download
Croatian – hr                 Download  Download
Hungarian – hu                Download  Download
Indonesian – id               Download  Download
Icelandic – is                Download  Download
Italian – it                  Download  Download
Korean – ko                   Download  Download
Lithuanian – lt               Download  Download
Latvian – lv                  Download  Download
Macedonian – mk               Download  Download
Malay – ms                    Download  Download
Dutch – nl                    Download  Download
Norwegian – no                Download  Download
Polish – pl                   Download  Download
Portuguese – pt               Download  Download
Portuguese Brazilian – pt-br  Download  Download
Romanian – ro                 Download  Download
Russian – ru                  Download  Download
Slovak – sk                   Download  Download
Slovenian – sl                Download  Download
Albanian – sq                 Download  Download
Serbian Cyrillic – sr-Cyrl    Download  Download
Serbian Latin – sr-Latn       Download  Download
Swedish – sv                  Download  Download
Turkish – tr                  Download  Download
Ukrainian – uk                Download  Download
Simplified Chinese – zh-CN    Download  Download