The dataset was analyzed to recognize customer types through the extracted features. Customers with abnormal behavior were detected using k-Means clustering, such as customers averaging more than 120 min of calls per day or more than 60 calls per day (we later found that their types of jobs explain this abnormality). Then, many classification algorithms were tested with the extracted features. The R language environment and its packages, such as caret and xgboost, were used to preprocess those features and for modeling. The preprocessing methods used are:

PCA: PCA with 10 and 100 principal components was tested; although it accelerated model execution due to dimensionality reduction, it did not improve model results.

Z-score, or standard score: although it slightly improved the SVM model, it did not improve the best model we obtained, which is XGBoost.

$$\begin{aligned} Z{\text -}score= \frac{x-\mu }{\sigma } \end{aligned}$$ (6)
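As a minimal sketch of these two preprocessing steps (the paper used R's caret; this illustration uses Python with NumPy and a synthetic feature matrix, so none of the values come from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=12.0, size=(200, 20))  # hypothetical feature matrix

# Z-score (Eq. 6): subtract the per-feature mean, divide by the std deviation
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# PCA via SVD of the standardized data: keep the first 10 components
# (10 and 100 components were the settings tested in the paper)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
X_pca = X_std @ Vt[:10].T
```

The projection keeps the directions of largest variance, which is why it speeds up training (fewer columns) without necessarily adding information.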

The following classification algorithms were tested: linear discriminant analysis (LDA), support vector machine (SVM, with a radial basis kernel), extreme gradient boosting (XGBoost), random forest, logistic regression, GLMNET, KNN, Naive Bayes, CART, C5.0, gradient boosting machine (GBM) and Bagged CART; the best model was then selected based on the evaluation results.

For model training and validation, the reliable dataset was split 80%/20%. All classification algorithms were trained using 10-fold cross-validation, and the following metrics were relied on for model evaluation:
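The split and cross-validation scheme can be sketched with index arrays alone (a Python illustration; the actual work used R's caret, and the dataset size of 18,000 customers is taken from the limitations discussion below):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 18000  # approximate size of the reliable dataset
idx = rng.permutation(n)

# 80%/20% train-test split
cut = int(0.8 * n)
train_idx, test_idx = idx[:cut], idx[cut:]

# 10-fold cross-validation indices over the training portion
folds = np.array_split(rng.permutation(train_idx), 10)
for val_fold in folds:
    tr = np.setdiff1d(train_idx, val_fold)
    # fit the model on `tr`, validate on `val_fold` (model fitting omitted)
```

Each fold serves once as the validation set, so every training example is validated exactly once per configuration.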

Accuracy The number of correct predictions divided by the total number of predictions. It is calculated by the formula $$\begin{aligned} Accuracy=\frac{T_p+T_n}{P+N} \end{aligned}$$ (7)

Area under the curve (AUC) Measures the classifier's performance [27]. It can be calculated by the formulas $$\begin{aligned} AUC= \int _0^{1}TPR(x)dx \end{aligned}$$ (8) $$\begin{aligned} TPR = \frac{T_p}{T_p+F_n} \end{aligned}$$ (9)

F1-measure The harmonic mean of precision and recall. It is calculated by the formula $$\begin{aligned} F1{\text -}measure = \frac{2*Precision*Recall}{Precision + Recall} \end{aligned}$$ (10) For the age model evaluation, the Mean F1 is $$\begin{aligned} Mean \, F1= \frac{F1\,group\, (A)+ F1\,group\, (B)+ F1\, group\, (C)}{3} \end{aligned}$$ (11)
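Equations (7), (10) and (11) can be checked with a small worked example (the confusion counts below are hypothetical, not the study's results):

```python
# Accuracy (Eq. 7): correct predictions over all predictions
def accuracy(tp, tn, p, n):
    return (tp + tn) / (p + n)

# F1 (Eq. 10): harmonic mean of precision and recall
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Binary gender model: one confusion matrix
acc = accuracy(tp=850, tn=820, p=1000, n=1000)  # 0.835

# Age model: mean F1 over the three groups A, B, C (Eq. 11)
mean_f1 = (f1(400, 80, 120) + f1(350, 90, 150) + f1(300, 70, 100)) / 3
```

Note that F1 simplifies to 2*Tp / (2*Tp + Fp + Fn), which is why the first group above yields exactly 800/1000 = 0.8.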

These metrics were used in this research to evaluate the models on the testing set. Table 5 and Fig. 4 show the evaluation results for gender prediction. Table 6 and Fig. 5 show the best four evaluation results for age prediction, obtained using a big data platform (6 nodes, each with a 16-core processor and 32 GB of memory).

Table 5 Results for gender prediction

Fig. 4 Results of classification algorithms for gender model

Table 6 Results for age prediction

Fig. 5 Results of classification algorithms for age model

As a result, ensemble learning algorithms such as GBM, XGBoost [28] and random forest outperformed the other classification algorithms and achieved the best Accuracy, AUC and F1-measure (XGBoost scored 0.8903 in F1-measure for gender prediction).

The tuning of XGBoost for the gender prediction model (using the xgboost package) is: max_depth = 10, eta = 0.1, gamma = 0, min_child_weight = 0.9, lambda = 0, alpha = 0.9, nrounds = 150, subsample = 1.

The tuning of xgboost for age prediction is: max_depth = 20, eta = 0.1, gamma = 0, min_child_weight = 0.9, lambda = 0, alpha = 0.9, nrounds = 50, subsample = 1.

Increasing the number of trees beyond 150 for XGBoost did not improve gender model accuracy, and increasing it beyond 50 did not improve age model accuracy. That is, during the learning process, each time a tree is added to the XGBoost model the error rate is measured on both the training set and the testing set; if the error rate on the test set stops decreasing, learning should be stopped, even if the error rate on the training set continues to decline, because the model is most likely starting to overfit.
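The stopping rule described above can be sketched independently of any boosting library (the error sequence below is synthetic; in practice xgboost reports per-round evaluation errors from which the same decision follows):

```python
# Stop adding trees once the test-set error has not improved for
# `patience` consecutive rounds, even if training error keeps falling.
def early_stop(test_errors, patience=1):
    """Return the number of boosting rounds to keep (1-based)."""
    best, best_round = float("inf"), 0
    for rnd, err in enumerate(test_errors, start=1):
        if err < best:
            best, best_round = err, rnd
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round

# Test error flattens after round 5, so only 5 trees are kept
rounds = early_stop([0.30, 0.25, 0.22, 0.21, 0.20, 0.20, 0.21], patience=2)
```

This is the same logic that caps the gender model at 150 rounds and the age model at 50 rounds.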

The Gain metric is considered a measure of feature importance; therefore it was used to detect the informative features for the gender and age models.

Gain implies the relative contribution of the corresponding feature to the model, calculated by taking each feature's contribution for each tree in the model. A higher value of this metric, compared to another feature, implies that the feature is more important for generating a prediction. It is calculated by the formula [28]

$$\begin{aligned} Gain = \frac{1}{2} \Bigg [\frac{\left( \sum _{i\in I_L}g_i\right) ^2}{\sum _{i\in I_L}h_i+\lambda } + \frac{\left( \sum _{i\in I_R}g_i\right) ^2}{\sum _{i\in I_R}h_i+\lambda } - \frac{\left( \sum _{i\in I}g_i\right) ^2}{\sum _{i\in I}h_i+\lambda } \Bigg ] - \gamma , \qquad I=I_L\cup I_R \end{aligned}$$ (12)

$$\begin{aligned} g_i = \left. \frac{\partial L\left( y_i,f(x_i)\right) }{\partial f(x_i)}\right| _{f(x)=f^{(m-1)}(x)} \end{aligned}$$ (13)

$$\begin{aligned} h_i = \left. \frac{\partial ^2 L\left( y_i,f(x_i)\right) }{\partial f(x_i)^2}\right| _{f(x)=f^{(m-1)}(x)} \end{aligned}$$ (14)
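Equation (12) can be checked numerically for a single split. The gradient and Hessian values below are made up for illustration; `lam` and `gamma` correspond to the lambda and gamma regularization parameters (both 0 in the tuned models above):

```python
# Gain of one split (Eq. 12): gradients g_i and Hessians h_i are summed
# over the left child set I_L, the right child set I_R, and their union I.
def split_gain(g_left, h_left, g_right, h_right, lam=0.0, gamma=0.0):
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Opposite-signed gradients separate cleanly, so this split has positive gain
gain = split_gain([0.5, 0.4], [1.0, 1.0], [-0.6, -0.3], [1.0, 1.0])
```

A split whose children carry the same gradients as the parent (the degenerate case) yields zero gain, which matches the intuition that such a split adds nothing.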

The proposed framework was selected based on the comparison between the results of the classification methods, as shown in Fig. 6.

According to the XGBoost model, the top predictive features for the gender model based on the gain measure are the entropy of received-call durations and the use of services oriented toward young girls. Figure 7 shows the gain measure for the top 5 informative features for gender prediction. Figure 8 shows the gain measure for the top 4 informative features for age prediction, such as the entropy of all call durations and the average number of SMSs sent by the customer.

Fig. 7 Top 5 predicted features for gender

Fig. 8 Top 4 predicted features for age

These results reflect the nature of our conservative society, where males usually bear more responsibilities than females, so males handle several types of contacts (business, family, friends, ...). This explains the entropy in their telecommunication behavior (Figs. 1, 5). The age model (Figs. 2, 7) likewise shows that group (A), which contains young people of university age, has less entropy because of fewer contact types; the older the age, the more entropy we find in telecommunication behavior in groups (B) and (C). Fewer average transactions per contact can be explained by the increase of commitments toward more people and bigger families, which is common in our country.

A limitation of this work is that collecting reliable data (for training and testing) from random customers (only about 18,000 customers) took a long time (about 6 months), because direct methods were followed and the human resources for this process were limited.

This work could be improved by extending it to include two new age groups that are not covered at present, one for people under 18 years old and another for people above 60 years old, and by achieving a more balanced distribution with respect to gender.

In addition, this work was conducted on only two types of CDRs (call CDRs and SMS CDRs); other CDR types could not be handled due to storage and processing limitations. Internet usage CDRs are another data source from which more valuable features could be extracted. If the work overcame the previously mentioned limitations, the reliable dataset would be larger and more suitable for deep learning algorithms, and the models would be more robust and accurate.

Another limitation is that this work was applied in Syrian society, which may differ from other societies, so the informative features in this study could be more or less important in other societies.