We trained all our algorithms using 5-fold cross-validation and tested them on separate, previously unseen samples of 15k people in EU and 10k people in SA. We trained five different algorithms: logistic regression (Logistic), SVM with a linear kernel (SVM-Linear) and with a radial basis function kernel (SVM-RBF), k-nearest neighbors (KNN), and random forests (RF). As our original feature space consisted of a large number of features, we used a cross-validated SVM with an L1 penalty for feature selection prior to training any of the five models. This feature selection was particularly important when we artificially reduced the size of the training set to 10k people or fewer. Throughout the paper, the training phase therefore always refers to an initial feature selection followed by a cross-validation to find the best parameters.
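
As a concrete illustration of this feature-selection step, the sketch below uses an L1-penalized linear SVM, tuned by cross-validation, to keep only the features with non-zero coefficients. The placeholder data, the grid of C values, and the use of the current scikit-learn API (rather than v0.17's module layout) are our assumptions, not the paper's exact setup.

```python
# Minimal sketch of L1-penalized SVM feature selection (assumed setup,
# written against the current scikit-learn API; v0.17 imports differ).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(1000, 300)       # placeholder behavioral indicators
y = rng.randint(0, 2, 1000)   # placeholder gender labels

# Cross-validate the sparsity level C of the L1-penalized selector.
search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False),
    param_grid={"C": [0.01, 0.1, 1.0]},  # illustrative grid
    cv=5,
)
search.fit(X, y)

# Keep only the features with non-zero coefficients; the five models
# are then trained on this reduced feature space.
selector = SelectFromModel(search.best_estimator_, prefit=True)
X_selected = selector.transform(X)
```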

The parameters of all models were tuned through a grid search with stratified 5-fold cross-validation. The tuned parameters were selected so that the F1 scores of the two classes were as close as possible. This is a crucial step in eliminating bias from models trained on unbalanced data, such as our SA data (71% male) and data from most developing countries [49]. This balancing is, along with the ability to work well with small training sets, a key property of our framework, and it is particularly important for our second use case, estimating gender balance. To further balance the model and make it equally good at identifying men and women, we tried two approaches: modifying the penalty term for each gender (inversely proportional to its relative frequency in the training set) and creating a balanced training set by undersampling the majority class while retaining the original population distribution in the test set. The results were equivalent and we kept the latter, undersampling the majority class (see the sketch below). All results were obtained using Bandicoot v0.3.0 and scikit-learn v0.17.
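
The two balancing strategies and the tuning procedure can be sketched as follows. The arrays, the class ratio, the hyper-parameter grids, and the `f1_macro` scoring choice (a stand-in for "F1 scores as close as possible in both classes") are our assumptions, shown against the current scikit-learn API.

```python
# Sketch of the two balancing strategies (assumed data and grids).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.RandomState(0)
X_train = rng.rand(2000, 50)
y_train = (rng.rand(2000) < 0.29).astype(int)  # ~71% men (0), as in SA

# Option 1: penalty term inversely proportional to class frequency.
weighted_svm = SVC(kernel="rbf", class_weight="balanced")

# Option 2 (the one kept): undersample the majority class in the
# training set only; the test set keeps the original distribution.
def undersample(X, y, rng):
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X_bal, y_bal = undersample(X_train, y_train, rng)

# Stratified 5-fold grid search; macro-averaged F1 rewards parameters
# that score well on both classes.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_macro",
)
grid.fit(X_bal, y_bal)
```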

Overall performance

Phone usage, and how it distinguishes men from women or high- from low-income people, is likely to be affected by the geographical and cultural context of a country and to vary over time. Indeed, changes in pricing schemes, penetration rates, and the socio-economic development of the country mean that training sets have to be country- and time-specific. This differs strongly from traditional classification problems: training sets used for image classification, for instance, are less time-sensitive and potentially less culturally sensitive. This matters for our application because labeled data will need to be specifically acquired through surveys to train the model, with a collection cost roughly linear in the number of labels. The applicability of our work thus depends largely on the performance of the framework with a small training set: to be useful, our framework needs to reach a high accuracy with a training set that is a small fraction of the considered dataset.

Figure 2 shows the performance of our framework as a function of the size of the training set: our framework reaches a high accuracy with a small training set. With a training set of 10k people, we already reach an accuracy of 74.1% in both EU and SA, and an AUC of 0.81 in EU and 0.76 in SA. Increasing the size of the training set beyond 10k people only marginally increases accuracy: in EU, we reach an accuracy of 75.5% with 500k people and, in SA, an accuracy of 75.4% with 20k people. SVM-RBF gives the best accuracy in both EU and SA for training sets of at least 5000 people, and KNN the worst. On the entire dataset, RF and SVM-Linear give, in EU, an accuracy similar to SVM-RBF, while Logistic and KNN give a lower accuracy. In SA, SVM-Linear, Logistic, and RF perform similarly to one another but worse than SVM-RBF, and KNN has a lower accuracy still. As mentioned before, all these results are obtained on the subset of users who are active at least 2 days per week on average over the 3-month CDR period. Figure 3 shows how the performance changes as we vary this preprocessing threshold.
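
Learning curves such as those in Figure 2 can be reproduced, in spirit, by subsampling training sets of increasing size and scoring each fitted model on the fixed test set. The data and hyper-parameter values below are placeholders, not the tuned values from the paper.

```python
# Sketch of a learning curve: accuracy vs. training-set size
# (placeholder data and illustrative hyper-parameters).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(20000, 50), rng.randint(0, 2, 20000)
X_test, y_test = rng.rand(5000, 50), rng.randint(0, 2, 5000)

for n in (1000, 2000, 5000, 10000, 20000):
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = SVC(kernel="rbf", C=10, gamma=0.01).fit(X_train[idx], y_train[idx])
    print(n, model.score(X_test, y_test))  # accuracy on the held-out set
```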

Figure 2 Accuracy in EU (top) and SA (bottom) as a function of the size of the training sample. We reach an accuracy above 74% with training sets of 10k people. In case of data scarcity, the training size can be further reduced to 5k with minimal deterioration in performance. SVM-RBF reaches a higher accuracy than the other algorithms in both countries for training sets larger than 5k people. In EU, increasing the training set size from 15k to 500k increases the best accuracy by only 1%.

Figure 3 Accuracy and AUC in EU (left) and SA (right) as a function of the minimum required level of activity. The model is trained and tested on the subset of users who are active, on average, at least x days per week throughout the 3-month CDR period. As we increase this minimum required threshold x, the coverage decreases and the performance slightly improves.

Beyond accuracy on the entire dataset, it is important to consider how well the model performs on the users it is most or least confident about (see, for example, [44] and [46]). Figure 4 shows the accuracy in EU and SA as a function of the algorithm's confidence in its prediction, ranging from the top 25% of users to the entire dataset (100%), with a training set of 10k users. In EU, SVM-RBF's accuracy ranges from 88.4% on the top 25% of users to 74.3% on the entire dataset; in SA, it ranges from 79.7% to 74.5%. In the EU country, accuracy scales linearly with the percentage of users we consider, and RF and logistic regression are slightly better than SVM-RBF on the top 25% of the dataset. In SA, by contrast, accuracy saturates at around the top 60% of the database, with KNN reaching the best accuracy on the top 25% at 82.8%. We believe that the accuracy saturation in SA might be due to strong non-linearities or even behavioral reversals between rural and urban areas: our models generalize without any location information, and the SA country is known to exhibit a significantly higher level of inequality between rural and urban areas than the EU country. KNN is the only algorithm that can effectively circumvent such data complexity, but it performs poorly on the bottom 50% of the users.
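
A sketch of this confidence analysis: rank test users by the classifier's confidence (for an SVM, the distance to the separating hyperplane) and measure accuracy on the top x% only. The `model`, `X_test`, and `y_test` names refer to the hypothetical objects from the earlier sketches.

```python
# Accuracy on the top x% most-confident predictions (assumed setup;
# for RF or KNN one would rank by the maximum of predict_proba instead).
import numpy as np

confidence = np.abs(model.decision_function(X_test))
order = np.argsort(confidence)[::-1]  # most confident first
y_pred = model.predict(X_test)

for frac in (0.25, 0.50, 0.75, 1.00):
    top = order[: int(frac * len(order))]
    print(frac, (y_pred[top] == y_test[top]).mean())
```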

Figure 4 Accuracy in EU (left) and SA (right) on the top x% of users the algorithms are most confident about. In EU, accuracy scales linearly with the top x% of the dataset, while in SA most algorithms plateau around the top 60% of the database.

Table 2 shows that our results are correctly balanced: the true positive rates (TPR) in SA and EU are roughly equal to the true negative rates (TNR), both on the entire dataset and on the top 25%. In our application, misclassifying women (or men) at a higher rate than the other gender, as might arise with unbalanced datasets, would be highly problematic. These results validate the effectiveness of our balancing scheme in SA and the applicability of our results.

Table 2 Performance measures for SVM-RBF at different confidence thresholds in EU and SA (Positive = Women, Negative = Men): True Positive Rate (TPR), False Negative Rate (FNR), True Negative Rate (TNR), False Positive Rate (FPR)
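
For reference, the rates reported in Table 2 follow directly from the confusion matrix with women as the positive class; the toy labels below only illustrate the computation.

```python
# Rates of Table 2 from a confusion matrix (toy labels; 1 = woman, 0 = man).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
TPR = tp / (tp + fn)  # women correctly identified as women
FNR = fn / (tp + fn)  # women misclassified as men
TNR = tn / (tn + fp)  # men correctly identified as men
FPR = fp / (tn + fp)  # men misclassified as women
```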

The performance results above are obtained by training the model on 10k samples. However, if training data is particularly hard or expensive to acquire in a new country, or if a slightly lower accuracy is enough for the application at hand, 4 out of the 5 algorithms we tried reach a good level of accuracy with even smaller training sets. For instance, Figure 2 shows that SVM-RBF has an accuracy of 73.6% in EU and 72.9% in SA with a training set of only 5000 people. This means that, in scenarios where coverage need not be 100% or where roughly 75% accuracy is sufficient, large-scale mobile phone datasets can be labeled with gender information at a fraction of the cost of traditional national surveys (roughly two orders of magnitude less), as we only need to survey a few thousand individuals to label a database containing millions.

Use cases

We showed above that large-scale mobile phone datasets can be labeled with gender information at a fraction of the cost of traditional surveys. In this section, we evaluate the applicability of our framework to two use cases. Unless specified otherwise, all results use SVM-RBF with a training set of 10k people.

Finding women in a dataset

The high prevalence of mobile phones has made them one of the main communication tools in developing countries. The first use case we evaluate our framework against is finding the relevant population to send text messages to [50]. More specifically, we focus on the task of identifying the users who are most likely to be women in a dataset, e.g. to send them prenatal care and child immunization messages [51] or information about the importance of measles immunization [52]. Concretely, we evaluate how effectively our framework identifies a given number of women in the dataset.

Figure 5 shows the precision of our framework at finding women (and men) in EU and SA. Precision here is the fraction of true women among the people we flag: if we pick the 10% of people in the dataset that the model considers most likely to be women, precision is the percentage of them who actually are women. Precision in finding women in EU ranges from 90.3% when we try to find 25% of the women in the dataset to 78.1% when trying to find all of them; precision for men is, respectively, 86.2% and 70.8%. We obtain similar results with the other algorithms. Precision is slightly higher for women, probably because the dataset is slightly unbalanced towards women (46% men). In SA, precision ranges from 71.4% to 52.7% for women and from 88.5% to 84.9% for men. The precision for women is 19 to 38% lower than for men, likely because the dataset (and therefore the test set) is highly unbalanced towards men (71% men), making women much harder to find even with our balancing of the training set. Nevertheless, our method reaches a precision up to 2.5 times higher than random.
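
This retrieval task can be sketched as follows: rank all users by the model's score for the positive class, take the top-ranked users, and measure precision among them. As before, `model`, `X_test`, and `y_test` are the hypothetical objects from the earlier sketches, with 1 coding for women.

```python
# Sketch of precision at finding women (Figure 5).
import numpy as np

scores = model.decision_function(X_test)  # higher = more likely a woman
order = np.argsort(scores)[::-1]
n_women = int((y_test == 1).sum())

for frac in (0.25, 0.50, 0.75, 1.00):
    k = int(frac * n_women)               # aim to retrieve frac of all women
    precision = (y_test[order[:k]] == 1).mean()
    print(frac, precision)
```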

Figure 5 Precision at finding women (and men) in EU (left) and SA (right) as a function of the percentage of the dataset's women we are trying to find. Precision is strongly influenced by how balanced the datasets are. Our method reaches a much higher precision than random (54% in EU and 29% in SA).

Estimating gender balance

In many situations, we are more interested in knowing the gender composition of a group than the gender of individual users. For instance, during response planning for crises and migration flows, more resources can be allocated to areas where a larger fraction of the vulnerable populations are women, children, and the elderly [53–55]. Beyond crises, estimating the number of women residing in a particular area allows us to estimate the number of births and the need for reproductive health care services [56], and to inform the need for protection against gender-based violence [57]. The gender composition of areas at a particular time of day would further refine recent time-dynamic census data based on mobile phone data [10]. To evaluate the effectiveness of our method for this use case, we create groups of 5k people with gender balance varying from 0 (all men) to 1 (all women) in steps of 0.1, and classify each individual in the group as a man or a woman. While this gives us a first estimate of the group's gender balance, we know from our training set that we have non-zero true and false positive rates (Table 2). This means that, in a group in SA composed only of men, we would still, on average, predict 1150 women. We therefore control for the false positive and false negative rates as:

$$ \mathit{calibrated} = \frac{\mathit{predicted}}{\mathit{TPR} - \mathit{FPR}} -\frac{\mathit{FPR}}{\mathit{TPR} - \mathit{FPR}} $$ (1)
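
Equation (1) simply inverts the relation predicted = TPR × true + FPR × (1 − true) to recover the true share of women. A minimal sketch, with illustrative rates chosen so that an all-men group of 5,000 yields the roughly 1,150 predicted women mentioned above:

```python
# Calibration of Eq. (1); the TPR/FPR values are illustrative, with
# FPR = 0.23 matching the all-men example (0.23 * 5000 = 1150).
def calibrate(predicted_share, tpr, fpr):
    # predicted = tpr * true + fpr * (1 - true)  =>  solve for true
    return (predicted_share - fpr) / (tpr - fpr)

print(calibrate(0.23, tpr=0.75, fpr=0.23))  # all-men group -> 0.0
```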

Figure 6 shows that the gender balance predicted with the SVMs and logistic regression is, in both EU and SA, very close to the true gender balance of the group. Table 3 shows that the mean absolute error of the gender balance predicted with SVM-Linear is 1.10% in EU (\(r^{2}=0.9993\)) and 1.21% in SA (\(r^{2}=0.9992\)). This means that we are, on average, at most one or two percentage points off from the true proportion of men and women in the group. The calibrated predictions are not as good for random forest and KNN, mainly because of the difference between their training and test recalls: both achieve a considerably higher recall on the training set than on the test set, so their calibration is not aggressive enough. For example, in EU, the training-set recall of RF is 0.18 higher than its test-set recall, while both recalls are around 0.73 for SVM-Linear.

Figure 6 True versus predicted gender balance in EU (left) and SA (right).