Introduction

Time is an American weekly news magazine and news website published in New York City. It was founded in 1923 and originally run by Henry Luce.

Time has the world’s largest circulation for a weekly news magazine. The print edition has a readership of 26 million, 20 million of whom are based in the United States. In mid-2012, its circulation was over three million, which had fallen to two million by late 2017.

The uploaded dataset sheds light on the gender balance of the people pictured on the magazine’s covers, from its beginning in 1923 through 2013.

Does Time really abide by equality? In a world where media has infiltrated our lives to the greatest extent, is Time a responsible bearer of gender diversity? These are a few of the questions I try to address in this blog post.





Project Details

The Kaggle notebook for this project to fork is linked here.

The Time Cover data used here is another Kaggle dataset.

The GitHub repo for this can be accessed from here.

The Python libraries used extensively are:

pandas - For analysing the data.

Matplotlib - For plotting the stacked bar graph.

seaborn - For plotting the scatter graph.







Exploratory Analysis

A. Importing libraries

To begin this exploratory analysis, we first import the libraries we need: matplotlib, numpy and pandas. We then show how the gender demography on the covers of Time has shaped itself over a period of about 90 years.

import matplotlib.pyplot as plt  # plotting
import numpy as np  # linear algebra
import os  # accessing directory structure
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

B. Accessing the data

There is 1 csv file in the current version of the dataset:

print(os.listdir('../input'))

['TIMEGenderData.csv']

C. Reading the data

We read the data and store it in a dataframe using read_csv.

data = pd.read_csv("../input/TIMEGenderData.csv")

D. Analysing the data

Take a look at what the data looks like.

data.head()

   Year  Female  Male  Total Female %  Male %
0  1923       1    34     35    2.86%  97.14%
1  1924       4    48     52    7.69%  92.31%
2  1925       1    51     52    1.92%  98.08%
3  1926       7    46     52   13.46%  88.46%
4  1927       4    49     52    7.69%  94.23%

We see from data.head() that data has 6 columns.

Year - The year of release.

Female - The number of female personalities in issues for that whole year.

Male - The number of male personalities in issues for that whole year.

Total - Total number of issues in the year.

Female % - Female/Total * 100%

Male % - Male/Total * 100%
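As a quick sanity check on the two percentage columns, the formula can be worked through by hand for the 1923 row (1 female and 34 male personalities over 35 issues). This is only a sketch: the helper name pct_label is hypothetical and not part of the original notebook.

```python
def pct_label(count, total):
    """Return count/total as a percentage string with two decimals (hypothetical helper)."""
    return f"{count / total * 100:.2f}%"

# Values taken from the 1923 row shown in data.head()
female_1923 = pct_label(1, 35)
male_1923 = pct_label(34, 35)

print(female_1923, male_1923)
```

The two strings match the Female % and Male % values shown for 1923 in the table above.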

Teasing the data more,

data.describe()

             Year     Female       Male      Total
count    91.00000  91.000000  91.000000  91.000000
mean   1968.00000   5.835165  39.549451  45.362637
std      26.41338   3.163204   7.968359   6.433448
min    1923.00000   1.000000  20.000000  26.000000
25%    1945.50000   4.000000  33.000000  41.000000
50%    1968.00000   5.000000  41.000000  47.000000
75%    1990.50000   7.000000  46.000000  51.000000
max    2013.00000  18.000000  51.000000  52.000000

We see from the count row of the above table that each numeric column has 91 values, one for each year from 1923 to 2013, so there are no null values in these columns.
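The absence of null values can also be checked directly instead of being read off the count row. The following is a minimal sketch on a small stand-in frame; the name sample and its three rows are assumptions for illustration, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the dataset (not the real CSV).
sample = pd.DataFrame({
    "Year": [1923, 1924, 1925],
    "Female": [1, 4, 1],
    "Male": [34, 48, 51],
    "Total": [35, 52, 52],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)

# True only if no cell in the frame is missing.
has_no_nulls = not sample.isnull().values.any()
print(has_no_nulls)
```

Unlike describe(), this check also covers non-numeric columns such as the percentage strings.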

And already, when we look at the max values in the description, the difference is… Umm, let’s wait for that.

When we think of what we need to plot the data, i.e., how gender demography on the covers changed over the years, we can assume that we need the percentage values of both Female and Male covers. But we have two problems here.

At first glance the percentage values look like floats, and I want whole-number percentages for the stacked bar graph (which is how I plan to plot the data in the end).

If we check the type of the values in the percentage columns, we see

type(data['Female % '][0])

str

The percentage values are string values here. So we cannot use them as they are; we would first have to turn them into float and then into int before plotting.

We could have done this easily by typecasting from string to float and then to int. But there is another catch: the percentage values have a % symbol appended.

To get around this problem, again, we can do one of two things.

Trim the symbol from the values and typecast.

Calculate the percentage from scratch.

I personally prefer calculating the percentage values from scratch since we already have the numbers given too. And hence, we will proceed with that here. But the first approach can be used too.
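For completeness, the first approach (trimming the symbol and typecasting) could look roughly like the sketch below. The frame sample is a hypothetical stand-in, and the column name "Female %" is an assumption: the exact name in the CSV may differ, for example in trailing whitespace.

```python
import pandas as pd

# Hypothetical stand-in for the string percentage column (not the real CSV).
sample = pd.DataFrame({"Female %": ["2.86%", "7.69%", "1.92%"]})

# Strip the trailing % symbol, then cast the remaining text to float.
female_pct = sample["Female %"].str.rstrip("%").astype(float)

# Casting float to int drops the decimal part (no rounding).
female_int = female_pct.astype(int)

print(female_pct.tolist())
print(female_int.tolist())
```

This gives the same kind of whole-number values as recalculating from the counts, but it depends on every entry being formatted the same way.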

E. Modifying the data

First we drop the columns from the frame which cannot be used.

data = data.drop(['Female % ', 'Male % '], axis=1)

Now we calculate the percentage values by using data from the frame.

femaleperc = data.Female / data.Total * 100
maleperc = data.Male / data.Total * 100

We also convert the float values to int values before plotting.

femaleperc = [int(x) for x in femaleperc]
maleperc = [int(x) for x in maleperc]

Now we add two more columns in the frame and assign these calculated percentage values to those columns.

data = data.assign(FemalePerc=femaleperc)
data = data.assign(MalePerc=maleperc)

Now, we take a look at data again, checking whether what we did worked and we can work with the data now.

data.head()

   Year  Female  Male  Total  FemalePerc  MalePerc
0  1923       1    34     35           2        97
1  1924       4    48     52           7        92
2  1925       1    51     52           1        98
3  1926       7    46     52          13        88
4  1927       4    49     52           7        94

Yay! Our problem is solved. We have the percentage values in int which we can now plot on a beautiful stacked bar chart.



Visualization of Data

The percentage values have been converted from float to int with int() here, which truncates the decimal part instead of rounding to the nearest whole number. For positive values like these, that is the same as flooring.

For example, 2.86 should have been rounded to 3 but has been floored to 2 .

And because of this, we get discrepancies in the sum of the two percentages. What we need to create beautiful stacked bars is a constant sum of percentages (Male and Female), else some bars will be shorter or taller than others.
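The discrepancy is easy to reproduce with the 1923 row (1 female and 34 male personalities over 35 issues). This is a small illustration using plain Python, with variable names chosen only for this sketch.

```python
# Percentages for the 1923 row, computed from the counts
female_pct = 1 / 35 * 100    # about 2.857
male_pct = 34 / 35 * 100     # about 97.143

# int() drops the decimal part; round() goes to the nearest whole number
truncated = (int(female_pct), int(male_pct))
rounded = (round(female_pct), round(male_pct))

print(truncated, sum(truncated))
print(rounded, sum(rounded))
```

For this row the truncated pair adds up to 99 while the rounded pair adds up to 100, which is why some bars would end up shorter than others.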

To fix this issue, we perform a simple trick. We find the rows whose sum of MalePerc and FemalePerc is not equal to 100 (because we know the sum of percentages should always be 100) and adjust one of MalePerc or FemalePerc so that the sum is equal to 100. This process is actually a work-around for rounding the values, which we did not do during conversion.

for i, row in data.iterrows():
    total = data.FemalePerc[i] + data.MalePerc[i]
    if total > 100:  # Check whether there is any sum value above 100
        diff = total - 100  # Find out the difference
        # We modify the MalePerc values for adjusting the difference.
        # data.loc writes to the frame itself and avoids chained assignment.
        data.loc[i, 'MalePerc'] = data.MalePerc[i] - diff
    elif total < 100:  # Check whether there is any sum value less than 100
        diff = 100 - total
        data.loc[i, 'MalePerc'] = data.MalePerc[i] + diff

Now we check the sum of percentages again to verify the adjustment:

for i, row in data.iterrows():
    total = data.FemalePerc[i] + data.MalePerc[i]
    if total != 100:
        print("Error")  # Print Error if any row's sum of percentages is not 100.

We run the above cell and do not get any message saying Error. Hence, we are good to go now!
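The same adjustment and check can also be written without an explicit loop. Since the loop only ever changes MalePerc so that the pair adds up to 100, it amounts to setting MalePerc to 100 minus FemalePerc. The following is a minimal sketch on a hypothetical three-row stand-in frame, not the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in using the 1923, 1924 and 1926 counts from data.head()
sample = pd.DataFrame({
    "Female": [1, 4, 7],
    "Male": [34, 48, 46],
    "Total": [35, 52, 52],
})

# Truncated female percentage, as in the walkthrough above
sample["FemalePerc"] = (sample["Female"] / sample["Total"] * 100).astype(int)

# Forcing the pair to sum to 100 is the same as taking the complement
sample["MalePerc"] = 100 - sample["FemalePerc"]

# Vectorized version of the "Error" check
all_sum_to_100 = bool((sample["FemalePerc"] + sample["MalePerc"]).eq(100).all())
print(sample[["FemalePerc", "MalePerc"]])
print(all_sum_to_100)
```

The design trade-off is the same as in the loop: the female percentage stays truncated and the male percentage absorbs the difference.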

Now finally we can proceed to making the plot.



A. Plotting on a Stacked Bar Graph

# Plotting the data using a Stacked Bar Graph
plt.figure(figsize=(25, 15))  # Setting the figure size
barWidth = 0.9  # Setting width of each bar
x_values = data.Year  # For setting the x-axis values as the Years of the publications
plt.bar(x_values, data.FemalePerc, color='#b5ffb9', edgecolor='white', width=barWidth, label='Female')
plt.bar(x_values, data.MalePerc, bottom=data.FemalePerc, color='#f9bc86', edgecolor='white', width=barWidth, label='Male')
plt.xticks(x_values, rotation=90, fontsize=15)
plt.yticks(fontsize=18)
# bbox_to_anchor makes the legend visible outside the graph. The placement of the legend
# follows different x and y axes than the graph: for the legend, (0,0) is the lower left
# point of the chart and (1,1) is the upper rightmost point of the chart.
# loc=2 means 'upper left', so the upper left corner of the legend box is placed at (1,1),
# just outside the upper right of the chart. prop sets the font size of the legend.
plt.legend(bbox_to_anchor=(1, 1), loc=2, prop={'size': 15})
plt.xlabel('Years', fontsize=20)
plt.ylabel('Percentage', fontsize=20, rotation=90)
plt.title('Analysis of male and female personalities on covers of TIME (1923-2013)', fontsize=25)
plt.show()







B. Plotting on a Scatter Plot with a Regression Line

import seaborn as sns

plt.figure(figsize=(22, 13))
sns.set(color_codes=True)
sns.set_style("darkgrid")
ax = sns.regplot(x="Year", y="MalePerc", data=data, color='#FF7F50', label='Male')
ax1 = sns.regplot(x="Year", y="FemalePerc", data=data, color='#008000', label='Female')
plt.xlabel('Year', fontsize=15)
plt.ylabel('Percentage', fontsize=15)
plt.title('Trend of male and female covers on Time (1923-2013)', fontsize=20)
plt.legend(bbox_to_anchor=(1, 1), loc=2, prop={'size': 15})
plt.show()









Conclusion