Step 1: Describe your Sample



The National Sample Survey Office (NSSO) carried out an all-India household survey on employment and unemployment between July 2011 and June 2012, as part of the 68th round of its survey program. The purpose of this nationwide enquiry was to generate estimates of various employment, unemployment and labor force characteristics at the national and State levels. The data are sample survey data (ssd), and the units of analysis are households and persons. The number of persons surveyed was 4,59,784 (2,81,327 in rural areas and 1,78,457 in urban areas), drawn from all the states of India (as stated in the survey description).

The microdata are available for download at the following link: http://164.100.34.62/index.php/catalog/143/datafile/F9. This data set has 2712 records, one per individual person, and 44 variables, including State, Age, Sector, Principal activity status, Nature_of_employment and No_of_Months_Without_Work. In the microdata, most persons were between the ages of 30 and 45 years, with 25% below 30 years. Most of them (72.5%) were rural residents, while 27.5% were urban residents. Most (59%) were self-employed, 31% were salaried and the rest had other employment types. Most (87%) worked more or less regularly. Among them, 61% had permanent employment and 31% had temporary employment; for the rest, this value was missing.

Step 2: Describe the Procedures that were used to collect the data



The fieldwork of the 68th round of NSSO started on 1st July, 2011 and continued until 30th June, 2012. As usual, the survey period of this round was divided into four sub-rounds of three months each: the 1st sub-round ran from July to September 2011, the 2nd from October to December 2011, and so on. An equal number of randomly sampled villages/blocks (FSUs) was allotted for survey in each of these four sub-rounds. The survey used the interview method of data collection on a sample of randomly selected households (as stated in the survey description).



Step 3: Describe your variables and Measures



There are 44 variables in total in the microdata. Most of them are listed and described below (excluding the serial number and ID variables), since the variable names are not self-explanatory.



Explanatory and Response variables



I would like to understand how the variable No_of_Months_Without_Work (the response variable) depends on the explanatory variables Age, State, District, Sector, Nature_of_employement, Full_Time_or_Part_Time, Made_Any_Efforts_to_Get_Work, Any_union_association, Worked_more_or_less_Regularly and Usual_Principal_Activity_Status.



Data Management



First, let’s subset the data to only the variables that we are primarily interested in working with, as described above.

The downloaded microdata sample contains only one value for the State variable, which makes it uninformative, so let’s drop this variable.
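A minimal sketch of this subsetting step is given here. It is not the original code: the toy rows, the `Serial_No` column and the helper names `columns_of_interest` and `constant_cols` are assumptions made for illustration. Instead of naming State directly, it looks for any column that holds a single distinct value.

```python
import pandas as pd

# Toy stand-in for the microdata; all values here are made up for illustration.
raw = pd.DataFrame({
    "Serial_No": [1, 2, 3],          # hypothetical bookkeeping column
    "State": [19, 19, 19],           # only one distinct value, as in the sample
    "Age": [34, 41, 28],
    "Sector": [1, 2, 1],
    "No_of_Months_Without_Work": [0, 3, 12],
})

# Keep only the variables of interest (shortened list for the toy example).
columns_of_interest = ["State", "Age", "Sector", "No_of_Months_Without_Work"]
subset = raw[columns_of_interest].copy()

# A column with a single distinct value carries no information, so drop it.
constant_cols = [c for c in subset.columns if subset[c].nunique(dropna=False) == 1]
subset = subset.drop(columns=constant_cols)

print(constant_cols)
print(list(subset.columns))
```

Checking `nunique` rather than hard-coding the column name makes the same step reusable if another variable turns out to be constant in a different download.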



Then we convert all the variables to their appropriate numeric types (int64 and float64) in pandas.
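One way this conversion could look is sketched here, assuming the columns arrive as strings with blanks for missing entries (the toy values are invented). `pd.to_numeric` with `errors="coerce"` turns anything unparseable into nan, which is why columns with blanks end up as float64 while complete columns become int64.

```python
import pandas as pd

# Toy columns read as text; the empty string stands for a blank cell.
df = pd.DataFrame({
    "Age": ["34", "41", "28"],
    "Any_union_association": ["1", "", "2"],
})

# Convert every column; unparseable entries become nan instead of raising.
converted = df.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```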



The response variable No_of_Months_Without_Work contains many very high values (1264 of 2712 values), possibly due to some numeric error. We replace these values with 13 to mark them; the other values range from 0 to 12.
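The recoding of the out-of-range values can be sketched as follows. The placeholder value 99999 is an assumption, since the actual very high values in the microdata are not stated; the rule shown simply maps anything above 12 to the code 13.

```python
import pandas as pd

# Toy response values; 99999 is a made-up stand-in for the "very high" entries.
months = pd.Series([0, 3, 12, 99999, 7, 99999], name="No_of_Months_Without_Work")

# Keep values in the valid 0-12 range and recode everything else as 13.
recoded = months.where(months <= 12, 13)

print(recoded.tolist())
```

Using a separate code keeps these rows in the data while still distinguishing them from genuine 0-12 month counts.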



Some of the explanatory variables such as Age, District, Sector and Usual_Principal_Activity_Status don’t contain any missing or incorrect (outlier) values and all of them are of type integer (these are categorical variables but coded as integers), so not much pre-processing is required for these variables.

For the other explanatory variables of interest, the values are either the corresponding numeric values / categorical codes listed above in the table of variable description, or they are missing (blank, converted to nan by pandas).

Only the explanatory variable Made_Any_Efforts_to_Get_Work has a majority of its values missing (2520 out of 2712), so we decided to replace the missing values with the categorical code 0, in addition to its two pre-existing codes 1 and 2.
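A small sketch of this replacement, using invented toy values, is shown here. The cast back to int64 is an extra assumption: once the nan values are gone, the column can hold plain integer codes again.

```python
import numpy as np
import pandas as pd

# Toy column with the two existing codes (1, 2) and missing entries.
efforts = pd.Series([np.nan, 1, np.nan, 2, np.nan],
                    name="Made_Any_Efforts_to_Get_Work")

# Use 0 as a new category for "missing", then restore an integer dtype.
filled = efforts.fillna(0).astype("int64")

print(filled.value_counts().sort_index())
```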



For all the other variables, the number of missing values is quite small compared to the number of rows in the dataset. Also, since we don’t have a good idea of what values to fill in for those nan entries, we decided to drop the rows in which a missing value appears in these variables.
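The row-dropping step might look like the sketch below. It assumes that a row is removed when any of the listed variables is missing (the default behaviour of `dropna`); if the intent was to drop only rows where every listed variable is missing, `how="all"` would be passed instead. The toy data and the `cols_with_gaps` list are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries in two explanatory variables.
df = pd.DataFrame({
    "Age": [34, 41, 28, 52],
    "Nature_of_employement": [1.0, np.nan, 2.0, 1.0],
    "Full_Time_or_Part_Time": [1.0, 1.0, np.nan, 2.0],
})

# Assumption: drop a row if ANY of these variables is missing.
cols_with_gaps = ["Nature_of_employement", "Full_Time_or_Part_Time"]
cleaned = df.dropna(subset=cols_with_gaps).reset_index(drop=True)

print(len(df), "rows before,", len(cleaned), "rows after")
```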



Below is the pre-processing / data management code:

