“Soon, there will be no survivors to tell their stories, and your children will hear about the Holocaust only from books and videos.”

My teacher said this about 18 years ago after showing my class a documentary film about Holocaust survivors.

Photo by Alessio Maffeis on Unsplash

The Holocaust, also known as “The Shoah” (Hebrew: השואה), was a genocide during World War II in which Nazi Germany, aided by local collaborators, systematically murdered ~6,000,000 Jews.

As a grandson of Holocaust survivors and an avid data researcher, I thought that in today's “Data Science” era, where big data can be found with little effort, I could find some raw data about Holocaust victims with a short Google search.

Surprisingly, my Google searches yielded almost nothing.

The only significant “Holocaust database” I could find was that of “Yad Vashem”, The World Holocaust Remembrance Center. It contains millions of records with biographical details of Holocaust victims, records which have been carefully gathered from 2004 up to this day.

Even though this database is publicly accessible, access is only through a limited online query form, which makes it impossible to work with the data in more suitable analysis tools.

Can it be that the data of one of the most significant episodes in world history is not available in its raw form for non-profit research?

So I decided to investigate the “Yad Vashem” website (technical note: I worked out the “hidden API” by filtering XHR and fetch requests in Chrome DevTools), and was able to write a Python script that automatically queries and stores the data of ~7.5 million entries of Holocaust victims (note: it was all done according to the “proper usage” and “privacy” terms of the “Yad Vashem” website).
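For readers curious what such a script might look like, here is a minimal sketch of the general pattern: request a JSON endpoint page by page, pause between requests, and store the results. It is not the script I used. The URL, the `page` and `page_size` parameters, and the `records` key are all placeholders I made up for illustration; the real names would have to be read from the requests shown in the DevTools Network tab, and any real use must respect the site's terms.

```python
import json
import time
from typing import Callable, Iterable, Iterator
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only (not the real service).
BASE_URL = "https://example.org/api/search"


def fetch_page(page: int, page_size: int = 100) -> dict:
    """Request one page of results and decode the JSON body."""
    # "page" and "page_size" are assumed parameter names.
    query = urlencode({"page": page, "page_size": page_size})
    request = Request(f"{BASE_URL}?{query}", headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_records(
    fetch: Callable[[int], dict] = fetch_page,
    delay_seconds: float = 1.0,
) -> Iterator[dict]:
    """Yield records page by page until the service returns an empty page."""
    page = 0
    while True:
        payload = fetch(page)
        records = payload.get("records", [])  # "records" is an assumed key
        if not records:
            break
        yield from records
        page += 1
        time.sleep(delay_seconds)  # pause between requests to stay polite


def save_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps Hebrew and other non-ASCII text readable
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the fetch function as a parameter keeps the paging loop separate from the network call, so the loop can be tried against canned pages before touching a real server. Calling `save_jsonl(iter_records(), "records.jsonl")` would then stream everything to disk one line at a time.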

Even though “Yad Vashem’s” information is far from complete (duplicate entries, missing entries, unknown sources, etc.), it is still the best there is. With this exclusive data in hand, I could now see the Holocaust from a convenient yet disturbing perspective:

Europe heat map — victims per country of death (note: borders are not accurate as I used a modern geocoding engine instead of a 1938 one)

Cause of death, per country of residence

Number of victims between 1938–1945, colored by documented fate

Top originating cities of the victims — colored by country of death

Main traffic routes of victims between 1938–1945

Needless to say, this is just the tip of the iceberg. These are just a few summaries I made with “Tableau” and “Python”; they do not claim to be accurate, only to show how such data could be utilized.

Imagine if we took into account the relationships between surnames and geographic locations of victims, or if we clustered groups of victims by the similarity of their locations on certain dates.
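As a rough starting point for that kind of analysis, the sketch below counts surnames per place and groups records that share a place and a date. The field names `surname`, `place`, and `date` are assumptions of mine, not the actual schema, and the grouping is by exact match only; a real study would need proper clustering and fuzzy matching of names and place spellings.

```python
from collections import Counter, defaultdict
from typing import Iterable


def surnames_by_place(records: Iterable[dict]) -> dict[str, Counter]:
    """Count how often each surname appears at each place."""
    table: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        surname, place = record.get("surname"), record.get("place")
        if surname and place:  # skip entries with missing fields
            # Normalize case and whitespace so trivial variants are merged.
            table[place][surname.strip().casefold()] += 1
    return dict(table)


def group_by_place_and_date(
    records: Iterable[dict],
) -> dict[tuple[str, str], list[dict]]:
    """Group records that share the same place on the same date."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for record in records:
        place, date = record.get("place"), record.get("date")
        if place and date:
            groups[(place, date)].append(record)
    return dict(groups)
```

Calling `surnames_by_place(records)[place].most_common(10)` would list the ten most frequent surnames recorded for a given place, which is one simple way to start looking for family and community ties.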

One can imagine that a comprehensive study could provide new insights that would benefit the international community, from historical insights about the war to further information on the fate of one’s family.

Unfortunately, there are almost no Holocaust survivors left, and time doesn’t make it better. But in a reality where the human survivors disappear, let us at least utilize the “digital survivors” to tell their stories.

Written by Yoav Tepper, grandson of Holocaust survivors.