Scrape the population data

After hours of trudging through the sluggish Indian Census Library, I came across the City Population website, which neatly displays census data for each district in India for the last three decennial censuses. However, this information is spread across separate webpages, so the next step was to write a simple scraper to collect it into a nicely formatted CSV file.

The code below shows the scraping steps and saves the scraped data to a CSV file.

Scraping population data for each Indian district
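A minimal sketch of such a scraper might look like the following. The URLs, the `Name` column, and the `Population...` column prefix are placeholders assumed for illustration, not the site's actual layout, and `pd.read_html` needs an HTML parser such as `lxml` installed.

```python
import pandas as pd

# Hypothetical placeholders: the real page URLs and table layout may differ.
STATE_PAGES = {
    "Goa": "https://example.com/india/goa/",
    "Kerala": "https://example.com/india/kerala/",
}


def tidy_district_table(raw, state):
    """Keep the district name and census columns, and tag each row with its state."""
    table = raw.copy()
    table.columns = [str(c).strip() for c in table.columns]
    # Assumption: the district name sits in a column called "Name".
    table = table.rename(columns={"Name": "District"})
    census_cols = [c for c in table.columns if c.startswith("Population")]
    table = table[["District"] + census_cols].copy()
    table.insert(1, "State", state)
    return table


def scrape_population(pages):
    """Read the first HTML table on each state page and stack the results."""
    frames = []
    for state, url in pages.items():
        raw = pd.read_html(url)[0]  # read_html returns a list of DataFrames
        frames.append(tidy_district_table(raw, state))
    return pd.concat(frames, ignore_index=True)


# Usage (requires network access and real URLs):
# population_df = scrape_population(STATE_PAGES)
# population_df.to_csv("population.csv", index=False)
```

Keeping the tidy-up step separate from the download step makes it easy to check the column handling on a small hand-made table before hitting the website.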

As the code above shows, pandas' ability to read tables directly from webpages is very handy, especially for small tasks like this one. Here is a brief look at the contents of the saved CSV file:

district-wise population data

Merge the data

At this stage, we have geospatial data for each district in one CSV file and population data for them in a separate CSV file. Naturally, the next step is to combine them. One important reason for keeping two CSV files is that our geospatial data is missing a few districts (yes, I counted). So it makes sense to keep the location and population data in separate tables and then join the two to get a clean merged table. The Python code below does this by merging the two dataframes built in the earlier examples.

# merge data
merged_df = pd.merge(df, population_df, on=['District', 'State'], how='left')

# save merged data
merged_df.to_csv('merged.csv', index=False)

# drop null-valued rows (not recommended, see below)
merged_df.dropna(axis=0, how='any').to_csv('merged_withoutNA.csv', index=False)

In the above code, we performed a merge operation with District and State as key columns. Unfortunately, the merged result contains some rows with null values. This is because the website we scraped for population data uses slightly different naming conventions for a few states and cities (for example, our location data has the state name “Jammu and Kashmir” while the population data has “Jammu & Kashmir”, and similarly for a few districts). One could write a simple fuzzy match to handle this situation, but since only a few rows had missing values, I filled in the corresponding population data manually by reading the numbers from the City Population website.
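For anyone who would rather not fill the gaps by hand, the sketch below shows one possible way to find the unmatched rows and suggest likely matches. It is not what I did for this post: the helper names are made up for illustration, and the similarity cutoff of 0.8 is an arbitrary starting point that would need tuning.

```python
import difflib

import pandas as pd


def normalize_name(name):
    """Lower-case, replace '&' with 'and', and collapse repeated whitespace."""
    return " ".join(str(name).lower().replace("&", " and ").split())


def unmatched_rows(left, right, keys):
    """Return the key columns of rows in `left` with no partner in `right`."""
    check = left.merge(
        right[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    return check.loc[check["_merge"] == "left_only", keys]


def closest_name(name, candidates, cutoff=0.8):
    """Suggest the most similar candidate, or None if nothing is close enough."""
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

Running `unmatched_rows(df, population_df, ['District', 'State'])` lists the rows that would end up with null population values, and `closest_name` can then propose a spelling from the other table for each of them. Any fuzzy suggestion should still be checked by eye, since two different districts can have very similar names.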

Calculate Weighted Mean Center

In the previous step, we effectively assigned a weight (the population) to each district centroid. The next step is to calculate the center of population density for each state based on the weighted averages.

First, I considered using ArcGIS Pro’s MeanCenter_stats functionality for this purpose. ArcGIS has a powerful set of tools for analyzing geospatial data, along with a very useful Python package, arcpy. However, after digging into the Esri documentation, I realized that the mean center calculation is pretty straightforward. So I decided to write my own code to calculate the weighted means, based on the formula below.
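In its standard form, the weighted mean center is the weight-averaged coordinate along each axis. In the notation assumed here, x_i and y_i are the coordinates of the i-th district centroid in a state, w_i is that district's population, and n is the number of districts in the state:

```latex
\bar{X}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i},
\qquad
\bar{Y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}
```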

read more about the formula here

Below is the Python implementation to generate the desired weighted means.

calculating weighted means for each state based on district coordinates
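A minimal sketch of this calculation with pandas could look like the following. The column names `State`, `Latitude`, `Longitude`, and `Population_2001` are assumptions for illustration and would need to match the columns in the merged CSV file.

```python
import pandas as pd


def weighted_mean_centers(df, weight_col, lat_col="Latitude",
                          lon_col="Longitude", group_col="State"):
    """Population-weighted mean latitude and longitude for each group."""
    # Rows without a weight or coordinate cannot contribute to the average.
    rows = df.dropna(subset=[weight_col, lat_col, lon_col])
    weighted = pd.DataFrame({
        group_col: rows[group_col],
        "w": rows[weight_col],
        "w_lat": rows[weight_col] * rows[lat_col],
        "w_lon": rows[weight_col] * rows[lon_col],
    })
    sums = weighted.groupby(group_col).sum()
    # sum(w * coordinate) / sum(w) for each state
    return pd.DataFrame({
        "mean_lat": sums["w_lat"] / sums["w"],
        "mean_lon": sums["w_lon"] / sums["w"],
    })


# Usage (column name is hypothetical):
# centers_2001 = weighted_mean_centers(merged_df, "Population_2001")
```

Passing the weight column as an argument means the same function can be reused for another census year just by changing the column name.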

Repeat the above process for the 2011 population data, and we have all the weighted mean centers along with the year-wise population data.