I was asked to write a blog post about how to build a personal wiki, like this "HandWiki", from Wikipedia content. The purpose of this post is to show the basic steps in text processing. It requires some basic knowledge of MediaWiki setup and some knowledge of the Python programming language. I hope this post will be useful for data scientists.

So I decided to make my own wiki portal for researchers, with fewer restrictions (than in Wikipedia) on posting scholarly articles. The good news is that the Wikipedia license allows this. It turns out this takes only a few days if you know the basic ideas of data science. Here are the steps:

(1) Install MediaWiki and basic extensions (~1 hour of work), and import some templates.

(2) Download a Wikipedia dump file (with the extension *.bz2) from https://dumps.wikimedia.org/. A BitTorrent client is recommended since the file is large (~17 GB).

(3) Create a Python script that reads this file and writes out only articles with certain categories. Read the file wisely: you cannot load it entirely into the computer memory (my old computer had only 8 GB of RAM), so use streaming techniques to parse it. In data science, this step is called data skimming. I wanted all categories related to data science and science. My script creates a TXT file with wiki markup (30 minutes on a commodity computer). However, it took about two days of (not very hard) work to write this Python code.
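The skimming step can be sketched as follows. This is my own minimal version, not the author's actual script: it assumes the standard MediaWiki XML dump layout and uses a simple "page mentions a wanted category" rule. The key point is streaming: `bz2.open` decompresses on the fly, and `iterparse` plus `elem.clear()` keeps memory usage flat no matter how large the dump is.

```python
import bz2
import re
import xml.etree.ElementTree as ET

def skim_dump(dump_path, wanted_categories, out_path):
    """Stream a .bz2 MediaWiki dump and write out pages tagged with
    any of the wanted categories. Returns the titles that were kept."""
    cat_re = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)
    kept = []
    with bz2.open(dump_path, "rb") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if not elem.tag.endswith("page"):
                continue
            title, text = None, ""
            for child in elem.iter():
                tag = child.tag.rsplit("}", 1)[-1]  # strip XML namespace
                if tag == "title":
                    title = child.text
                elif tag == "text":
                    text = child.text or ""
            cats = {m.strip() for m in cat_re.findall(text)}
            if title and cats & set(wanted_categories):
                out.write(f"== {title} ==\n{text}\n\n")
                kept.append(title)
            elem.clear()  # crucial: free the finished <page> element
    return kept
```

For a real dump you would also want to skip redirects and talk pages, but the structure stays the same.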

(4) Data cleaning. Since my wiki is about "concepts", not about people and other things, I did a second pass with my Python script to remove articles on people that ended up in my categories. While doing this, I detected a massive number of empty "stubs", self-promotions and junk articles without references. I was surprised that professional Wikipedians do nothing about such entries. It took about two hours of work during my weekend to write this script.
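A cleaning filter along these lines might look like the sketch below. The heuristics are my own guesses, not the author's exact rules: birth/death categories and person infoboxes flag biographies, and a length plus `<ref>` check flags empty stubs and unreferenced junk.

```python
import re

# Assumed heuristics: Wikipedia biographies usually carry categories like
# [[Category:1950 births]] or a {{Infobox person}} template.
PERSON_RE = re.compile(
    r"\[\[Category:\s*\d{4}\s+(births|deaths)|\{\{Infobox\s+(person|scientist)",
    re.IGNORECASE)

def should_drop(wikitext, min_length=500):
    """Return True for pages a concept wiki does not want."""
    if PERSON_RE.search(wikitext):
        return True                       # article about a person
    if len(wikitext.strip()) < min_length:
        return True                       # near-empty stub
    if "<ref" not in wikitext.lower():
        return True                       # no references at all
    return False
```

The thresholds (500 characters, "no `<ref>` tag") are illustrative; in practice you would tune them by eye on a sample of rejected pages.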

(5) During data cleaning I also had to shorten some long articles (down to about 10 sections) and remove many Wikipedia infoboxes (installing their templates takes time). Plus, I wanted easy-to-read articles. This step could be called data slimming.
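Removing infoboxes is slightly tricky because templates nest: a naive regex on `{{...}}` breaks on `{{Infobox ... {{nested}} ...}}`. A brace-counting sketch (my own, under the assumption that you only drop templates whose name starts with "Infobox"):

```python
def strip_templates(wikitext, names=("Infobox",)):
    """Remove {{...}} templates whose name starts with one of `names`,
    handling nested {{ }} pairs by counting braces."""
    out, i, n = [], 0, len(wikitext)
    while i < n:
        if wikitext.startswith("{{", i):
            # peek at the template name right after the braces
            name = wikitext[i + 2:i + 42].lstrip()
            if any(name.lower().startswith(p.lower()) for p in names):
                depth, j = 1, i + 2
                while j < n and depth:
                    if wikitext.startswith("{{", j):
                        depth += 1; j += 2
                    elif wikitext.startswith("}}", j):
                        depth -= 1; j += 2
                    else:
                        j += 1
                i = j          # skip the whole template
                continue
        out.append(wikitext[i])
        i += 1
    return "".join(out)
```

Other templates (citations, navboxes) can be dropped the same way by adding their prefixes to `names`.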

(6) The hardest work was moving some articles (on particular topics) to new namespaces and replacing internal links. I wanted a better-organized wiki: Wikipedia just dumps all articles into the "main" namespace. In my wiki, I wanted links like [[Gyroscope]] to go to the "Physics" namespace, i.e. Physics:Gyroscope. I also had to convert some Wikipedia links to plain text when I did not have such articles. This required building a Python map with all my titles and using it for the link replacements.
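The link rewriting could be done with one regex pass over each page, driven by the title map. This is a sketch of the idea, not the author's code; it assumes the usual `[[target]]` and `[[target|label]]` link forms.

```python
import re

# [[target]] or [[target|label]]
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def rewrite_links(wikitext, title_map):
    """Redirect [[Title]] links through title_map (title -> namespaced
    title); links to pages we do not have become plain text."""
    def repl(m):
        target, label = m.group(1).strip(), m.group(2)
        shown = label or target
        if target in title_map:
            return f"[[{title_map[target]}|{shown}]]"
        return shown                     # no such page: keep plain text
    return LINK_RE.sub(repl, wikitext)
```

For example, with `{"Gyroscope": "Physics:Gyroscope"}` as the map, `[[Gyroscope]]` becomes `[[Physics:Gyroscope|Gyroscope]]`, while a link to a page you never imported collapses to its visible text.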

(7) Duplicate removal. Some articles appeared both in the main namespace and under the dedicated namespaces for specific sciences. A small script removed such duplicate entries.
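A small script in this spirit only needs a set lookup: if a namespaced copy like Physics:Gyroscope exists, drop the bare Gyroscope. A guess at its shape:

```python
def drop_duplicates(titles):
    """Given titles like 'Gyroscope' and 'Physics:Gyroscope', keep the
    namespaced copy and drop the main-namespace duplicate."""
    namespaced = {t.split(":", 1)[1] for t in titles if ":" in t}
    return [t for t in titles if ":" in t or t not in namespaced]
```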

(8) Finally, I added a link to the full Wikipedia article in the "External links" section, noting that the article was sourced from Wikipedia. All of this took me one day of thinking, but the actual implementation was simple: create a Python map and use it to replace words between the [[ and ]] tags.
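Appending the attribution is a one-liner per page. The URL scheme and the exact wording below are my assumptions, not necessarily what HandWiki uses:

```python
def add_source_link(title, wikitext):
    """Append an 'External links' section pointing back to the
    original Wikipedia article (en.wikipedia.org URL scheme assumed)."""
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    return (wikitext.rstrip()
            + "\n\n== External links ==\n"
            + f"* [{url} Original article] sourced from Wikipedia\n")
```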

(9) The very last step is to import the TXT files with the selected articles into MediaWiki. The MediaWiki "maintenance" directory provides PHP scripts for this. The import takes about 3 hours for 20,000 slimmed articles. With the default MediaWiki setup, you do not need to copy images from Wikimedia Commons.

Done. You can find this wiki here. The whole project took about a week, spending ~2-3 hours per day. Later I noticed that some old Wikipedia templates are still missing; this should be easy to fix in the future.

Enjoy HandWiki.Org.

S.Chekanov