Part 1 — Availability of personal information

The ease with which one is able to retrieve the electoral rolls for all of Delhi, and the fact that this can be adapted in a few minutes to any other part of the country, is worrisome.

The application needed some form of lookup service that takes a voter ID and returns the corresponding constituency and polling booth, so that I could then direct the user to it.

Such a service already exists. While this made my life very simple, it also did the same for anyone trying to locate a person, especially in bulk. Since the form isn’t protected by anything that can prevent automation, such as a CAPTCHA, a script can easily be written to look up any number of people and discover their approximate address, as well as their age, gender and father’s name.

While this service would have served my immediate purpose, I was trying to build something that can be scaled for all of India. Keeping that in mind, I decided to build my own database of Voter ID numbers, and their corresponding polling booths and constituencies.

Every state and union territory in India offers PDF Electoral Rolls for each polling booth. These PDF rolls contain the name, father’s name, age, Voter ID, and enough information to figure out a more or less exact address for any person, down to the house number. Some rolls also contain photos of the people.

The first step to using these to create an index would be to obtain all of the PDFs in question. These are stored as individual files (though Arunachal Pradesh offers ZIP files of areas), and downloading them manually is unfeasible to say the least.

Since manually downloading everything is out of the question, I turned to scripting. The files are stored in a very structured manner, which makes automating their retrieval a trivial task. The URL for any file fits the following format:

http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency/AC<AC NUMBER>/A<THREE DIGIT AC NUMBER><FOUR DIGIT BOOTH NUMBER>.pdf

An example URL therefore looks like: http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency/AC22/A0220161.pdf
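As a minimal sketch of how that format can be filled in from code (this is an illustration, not the download script itself), the helper below builds the URL for one booth. The names `BASE_URL` and `roll_url` are hypothetical, and I am assuming that the directory part (`AC22`) uses the constituency number without zero padding, since the only example available has a two-digit number.

```python
BASE_URL = "http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency"


def roll_url(ac_number: int, booth_number: int) -> str:
    """Build the URL of the electoral roll PDF for one polling booth."""
    # Assumption: the directory uses the plain AC number (AC22), while the
    # file name pads the AC number to three digits and the booth to four.
    return f"{BASE_URL}/AC{ac_number}/A{ac_number:03d}{booth_number:04d}.pdf"


if __name__ == "__main__":
    # Constituency 22, booth 161, as in the example URL above
    print(roll_url(22, 161))
```

Called with constituency 22 and booth 161, this reproduces the example URL given above.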

Delhi has 70 assembly constituencies, each with a different number of polling booths. I built the following Python script to retrieve all the electoral rolls:

The script is available on GitHub here. Running it is as simple as changing the path in directory to point to somewhere on your system, and then executing:

python delhidownload.py

This is a single-threaded script, so it could be optimised to download more than one file at once. However, it served the immediate requirement, and I had downloaded every single PDF for Delhi in a few hours. All 11,832 of them. Before running the script, keep in mind that this is about 5.47 GB of data in all, and we haven’t even processed it yet.
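For anyone curious what downloading more than one file at once might look like, here is a rough sketch using a thread pool from the standard library. It is not the script I ran; the function names, the default of 8 workers and the 60 second timeout are all assumptions made for illustration.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> bytes:
    """Download one file and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def save(url: str, directory: str, fetcher=fetch) -> str:
    """Fetch one URL and store it in `directory` under its own file name."""
    path = os.path.join(directory, url.rsplit("/", 1)[-1])
    if not os.path.exists(path):  # skip files left over from an earlier run
        data = fetcher(url)  # fetch first, so a failure leaves no empty file
        with open(path, "wb") as handle:
            handle.write(data)
    return path


def download_all(urls, directory, workers=8, fetcher=fetch):
    """Download every URL with a small pool of worker threads."""
    os.makedirs(directory, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the input URLs
        return list(pool.map(lambda url: save(url, directory, fetcher), urls))
```

Since each constituency has a different number of booths, some generated URLs will not exist. In this sketch the first failed request raises an exception out of `download_all`, so a real run would want to catch HTTP errors inside `save` and move on.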

PDFs aren’t a very edit-friendly format, and building an index directly from the files would be quite a task. Instead, I converted them to text files, which are far simpler to handle, with the help of Xpdf and another Python script:

Once again, the code is available as a gist, and you can run the script by pointing it at the same directory you used earlier and executing (you must have Xpdf installed):

python delhipdftotext.py

This gave me a text version of every PDF file, sorted by constituency and polling booth.
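The conversion step boils down to calling Xpdf's `pdftotext` command once per file. The sketch below shows one way to do that, assuming `pdftotext` is on your PATH; the function names are hypothetical, and the `-layout` option is my choice for keeping the columns of the roll roughly in place, not necessarily what the original script used.

```python
import os
import subprocess


def text_path_for(pdf_path: str) -> str:
    """Return the .txt path that sits next to a given .pdf file."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def convert_directory(directory: str) -> int:
    """Run pdftotext on every PDF under `directory`; return how many were converted."""
    converted = 0
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            pdf_path = os.path.join(root, name)
            # check=True raises CalledProcessError if pdftotext fails on a file
            subprocess.run(
                ["pdftotext", "-layout", pdf_path, text_path_for(pdf_path)],
                check=True,
            )
            converted += 1
    return converted
```

Writing each text file next to its PDF keeps the constituency and booth recoverable from the path alone, which is what the later steps rely on.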

The next step in the indexing process is to extract all the voter IDs from each file, and store them in a database along with metadata such as which polling booth they belong to.

I accomplished this with yet another Python script, which iterated through the files and produced another set of files containing all the voter IDs:

As always, the code can be found in a gist, and you can run it yourself.
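To give a feel for the extraction, here is a small sketch built around a regular expression. It assumes a voter ID is three capital letters followed by seven digits; IDs in any other format would be missed by this pattern, so treat it as an illustration and not as the logic of the gist. The IDs in the sample text are made up.

```python
import re

# Assumption: three capital letters followed by seven digits.
# IDs printed in any other format will not match this pattern.
VOTER_ID = re.compile(r"\b[A-Z]{3}[0-9]{7}\b")


def extract_voter_ids(text: str) -> list:
    """Return the distinct voter IDs found in `text`, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving order
    return list(dict.fromkeys(VOTER_ID.findall(text)))


if __name__ == "__main__":
    sample = "1 ABC1234567 Some Name\n2 XYZ7654321 Another Name\n3 ABC1234567"
    print(extract_voter_ids(sample))
```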

The final step in the indexing process is to add all this information into a database. I opted to use some more Python, with SQLite as the database:

This code too is available in a gist.
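The shape of such an index is simple enough to sketch with Python's built-in `sqlite3` module. The table name, column names and helper functions below are assumptions for illustration and may not match the gist.

```python
import sqlite3


def open_index(path: str = "voters.db") -> sqlite3.Connection:
    """Open (or create) the index database and make sure the table exists."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS voters ("
        "voter_id TEXT, constituency INTEGER, booth INTEGER)"
    )
    # An index on voter_id keeps lookups fast with millions of rows
    connection.execute(
        "CREATE INDEX IF NOT EXISTS voters_by_id ON voters (voter_id)"
    )
    return connection


def add_booth(connection, constituency, booth, voter_ids):
    """Insert all voter IDs found in one polling booth's roll."""
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT INTO voters VALUES (?, ?, ?)",
            [(voter_id, constituency, booth) for voter_id in voter_ids],
        )


def lookup(connection, voter_id):
    """Return every (constituency, booth) pair recorded for a voter ID."""
    cursor = connection.execute(
        "SELECT constituency, booth FROM voters WHERE voter_id = ?",
        (voter_id,),
    )
    return cursor.fetchall()
```

Inserting one booth at a time inside a single transaction keeps the number of commits down, which matters when the total runs into millions of rows.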

Having done all this, I was left with a database of every Voter ID in those files, along with which constituency and polling booth it belonged to. In all, I had 13,066,244 such IDs with corresponding data.

Once I had all this data, I decided to check just how many of the voter IDs match the format and guidelines laid down by the ECI in 2000. Which leads us to…