In order to build the churn prediction system at SyriaTel, a big data platform had to be installed. The Hortonworks Data Platform (HDP) was chosen because it is a free, open-source framework released under the Apache 2.0 License. HDP bundles a variety of open-source big data systems and tools that are integrated with each other. Figure 1 presents the HDP ecosystem, where each group of tools is categorized under a specific specialization such as Data Management, Data Access, Security, Operations, and Governance Integration.

Fig. 1 Hortonworks Data Platform (HDP) big data framework

The installation of the HDP framework was customized to include only the tools and systems needed to carry out all phases of this work. This customized package of installed systems and tools is called the SYTL-BD framework (SyriaTel's big data framework). We installed the Hadoop Distributed File System (HDFS) to store the data, the Spark execution engine to process the data, YARN to manage the resources, Zeppelin as the development user interface, Ambari to monitor the system, Ranger to secure the system, and the Flume system and Sqoop tool to acquire the data from outside the SYTL-BD framework into HDFS.

The hardware resources consisted of 12 nodes, each with 32 GB of RAM, 10 TB of storage capacity, and a 16-core processor. A dataset covering nine consecutive months was collected and used to extract the features of the churn prediction model. The data life cycle went through several stages, as shown in Fig. 2.

Fig. 2 Proposed churn prediction system architecture

The Spark engine was used in most phases of the model, such as data processing, feature engineering, and training and testing, since it performs the processing in RAM. Among its other advantages, it provides a variety of libraries covering all stages of the machine learning life cycle.

Data acquisition and storage

Moving the data from outside SYTL-BD into HDFS was the first step of the work. The data is divided into three main types: structured, semi-structured, and unstructured.

Apache Flume is a distributed system used to collect and move the unstructured (CSV and text) and semi-structured (JSON and XML) data files to HDFS. Figure 3 shows the Flume architecture designed in SYTL-BD. Flume has three main components: the data Source, the Channel through which the data moves, and the Sink to which the data is delivered.

Fig. 3 Apache Flume configured system architecture

As configured in SYTL-BD, Flume agents transport the files found in the defined Spooling Directory Source through a single channel. This channel is defined as a Memory Channel because it performed better than the other channel types in Flume. The data moves across the channel and is finally written to the sink, which is HDFS. The data transferred to HDFS keeps the same format it originally had.

Apache Sqoop is a distributed tool used to transfer bulk data between HDFS and relational databases (structured data). This tool was used to transfer all the data residing in databases into HDFS by means of Map jobs. Figure 4 shows the architecture of the Sqoop import process, in which four mappers are defined by default. Each Map job selects part of the data and moves it to HDFS. The data is saved as CSV files after being transported by Sqoop to HDFS.

Fig. 4 Apache Sqoop data import architecture

After transporting all the data from its sources into HDFS, it was important to choose the file type that gives the best performance in terms of space utilization and execution time. This experiment was performed with the Spark engine, using the DataFrame library to transform 1 terabyte of CSV data into the Apache Parquet and Apache Avro file types. In addition, three compression scenarios were taken into consideration in this experiment.

Fig. 5 Differences in space utilization and execution time per file type

Parquet, a columnar storage format, was chosen because it gave the best results: its performance was the most efficient compared with the other formats, especially for feature engineering and data exploration tasks. Moreover, combining Parquet with the Snappy compression technique gave the best space utilization. Figure 5 compares the file types.
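The conversion itself is a straightforward Spark DataFrame job. The following is a minimal PySpark sketch of such a benchmark; the HDFS paths are illustrative assumptions, and writing Avro additionally requires the spark-avro package on the classpath (the format name differs between Spark versions).

```python
# Rough sketch of the file-format experiment with the Spark DataFrame API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-benchmark").getOrCreate()

# Read the raw CSV data (illustrative path).
df = spark.read.option("header", "true").csv("hdfs:///raw/cdr/*.csv")

# Parquet with Snappy compression (the combination that gave the best results here).
(df.write.mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///benchmark/cdr_parquet_snappy"))

# Avro, for comparison (needs the spark-avro package).
(df.write.mode("overwrite")
   .format("avro")
   .save("hdfs:///benchmark/cdr_avro"))
```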

Feature engineering

The data was processed to convert it from its raw state into features that can be used by machine learning algorithms. This process took the longest time due to the huge number of columns. The first idea was to aggregate the values of the columns per month (average, count, sum, max, min, ...) for each numerical column per customer, and the count of distinct values for categorical columns.

Another type of feature was calculated based on the social activities of the customers through SMS and calls. The Spark engine is used for both statistical and social features; the library used for the SNA features is GraphFrames.

Statistics features These features are generated from all types of CDRs, such as the average number of calls made by the customer per month, the average upload/download internet usage, the number of subscribed packages, the percentage of each Radio Access Type per site per month, the ratio of call count to SMS count, and many features generated by aggregating the CDR data. Since we have data related to all customers' actions in the network, we aggregated the data related to calls, SMS, MMS, and internet usage for each customer per day, week, and month over the nine months. Therefore, the number of generated features grew to more than three times the number of original columns. In addition, we included features related to the complaints submitted by customers across all systems, such as the number of complaints, the percentage of coverage complaints out of all submitted complaints, the average time between consecutive complaints, the time in hours to close a complaint, the closure result, and others. Features related to IMEI data, such as the type of device, the brand, dual or mono device, and how many devices the customer changed, were also extracted. We held many rounds of brainstorming with senior staff in the marketing section to decide which features to create in addition to those mentioned in previous research. We created many features such as the percentage of incoming/outgoing calls, SMS, and MMS to competitors and landlines, binary features indicating whether customers were subscribed to certain services, the rate of internet usage across 2G, 3G, and 4G, the number of devices used each month, the number of days out of coverage, the percentage of friends belonging to the competitor, and hundreds of other features. Figures 6 and 7 visualize some of the basic categorical and numerical features to give more insight into the difference between the churn and non-churn classes.
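As an illustration of these aggregations, the following PySpark sketch computes a few of the monthly call statistics described above; the table path and column names (msisdn, month, call_duration, radio_access_type) are assumptions, not the actual schema, and an existing SparkSession named spark is assumed.

```python
# Illustrative per-customer, per-month aggregation of call CDRs.
from pyspark.sql import functions as F

calls = spark.read.parquet("hdfs:///dwh/cdr_calls")  # illustrative path

monthly_call_stats = (calls.groupBy("msisdn", "month")
    .agg(F.count("*").alias("call_count"),
         F.avg("call_duration").alias("avg_call_duration"),
         F.sum("call_duration").alias("total_call_duration"),
         F.max("call_duration").alias("max_call_duration"),
         F.min("call_duration").alias("min_call_duration"),
         F.countDistinct("radio_access_type").alias("distinct_rat_count")))
```

The same pattern is repeated per day and per week, and for SMS, MMS, and internet-usage CDRs, which is how the feature count grows so quickly.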

Fig. 6 Distribution of some main categorical features

Fig. 7 Feature distribution for some main numerical features. Panel (a) visualizes the distribution of the Day of Last Outgoing Transaction feature. Panel (b) visualizes the distribution of the Average Radio Access Type Between 3G and 2G feature. Panel (c) visualizes the distribution of the Total Balance feature. Panel (d) shows the distribution of the Percentage Transaction with Other Operators feature. Panel (e) visualizes the distribution of the Percentage of Signaling Error/Dropped Calls feature. Panel (f) visualizes the distribution of the GSM Age feature. In all panels, red represents the churned customer class and blue the active customer class

Social Network Analysis features Data transformation and preparation are performed to summarize the connections between every two customers and to build a social network graph based on the CDR data of the last 4 months. The GraphFrames library on Spark is used to accomplish this work. The social network graph consists of nodes and edges: nodes represent the GSM numbers of subscribers, and edges represent interactions between subscribers (calls, SMS, and MMS). The graph edges are directed, since A-to-B and B-to-A interactions are distinguished. Figure 8 visualizes a sample of the built social network in SyriaTel, where the red nodes are SyriaTel customers and the yellow nodes are MTN customers; the lines between the nodes express the interactions between them. The total social graph contained about 15 million nodes, representing SyriaTel, MTN, and landline numbers, and more than 2.5 billion edges.
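A minimal sketch of how such a graph can be assembled with the GraphFrames library is shown below; the interaction table and its columns (caller, callee, duration) are illustrative assumptions, and an existing SparkSession named spark is assumed.

```python
# Illustrative construction of the directed social graph with GraphFrames.
from graphframes import GraphFrame
from pyspark.sql import functions as F

interactions = spark.read.parquet("hdfs:///dwh/cdr_interactions")  # last 4 months

# Vertices: one row per distinct GSM number ("id" is required by GraphFrames).
vertices = (interactions.select(F.col("caller").alias("id"))
            .union(interactions.select(F.col("callee").alias("id")))
            .distinct())

# Directed edges: one row per (caller, callee) pair with aggregated interaction counts.
edges = (interactions
         .groupBy(F.col("caller").alias("src"), F.col("callee").alias("dst"))
         .agg(F.count("*").alias("n_events"),
              F.sum("duration").alias("total_duration")))

graph = GraphFrame(vertices, edges)
```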



Fig. 8 Visualization of a sample of the Syrian social community

Graph-based features are extracted from the social graph, which is a weighted directed graph. We built three graphs depending on the edge weight used; the weight of an edge reflects the shared events between every two customers. We used three types of weights: (1) the normalized calling duration between customers, (2) the normalized total number of calls, SMS, and MMS, and (3) the mean of the previous two normalized weights. The normalization process varies according to the algorithm used to extract the features, as seen in the formulas of these algorithms. Based on the directed graphs, we use the PageRank [19] and SenderRank [20] algorithms to produce two features for each graph.

The weighted PageRank equation is defined as follows: $$\begin{aligned} PR(m)=(1-d)+d*\sum _{n\in N(m)}\frac{W_{n\rightarrow m}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}} PR(n) \end{aligned}$$ (1)

The weighted SenderRank equation is defined as follows: $$\begin{aligned} SR(m)=(1-d)+d*\sum _{n\in N(m)}\frac{W_{m\rightarrow n}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}} SR(n) \end{aligned}$$ (2)

Graph networks built from telecom data may contain two types of nodes: nodes with zero outgoing and many incoming interactions, and nodes with zero incoming and many outgoing interactions. These two kinds of nodes are called sink nodes. With respect to Eq. (1), the sinks are the nodes with zero outgoing edges, while in Eq. (2) the sinks are the nodes with zero incoming edges. The damping factor d is used to prevent these sinks from accumulating ever higher SR or PR values in each round of calculation. In the telecom social graph, the damping factor represents the interaction-through probability. The first part, (1-d), represents the chance of randomly selecting a sink node, while d ensures that the sum of PageRank or SenderRank values equals 1 at the end. In addition, it prevents the nodes with zero outgoing edges from getting zero SenderRank values and the nodes with zero incoming edges from getting zero PageRank values, since these values would otherwise be passed to the sink nodes in each round. If d = 1, the equations need an infinite number of iterations to converge, while a low d value makes the calculations easier but gives inaccurate results. We set d to 0.85, as in most of the related research [21, 22].

N(m) is the list of friends of customer (m) in his social network, \(W_{n \rightarrow m}\) is the weight of the directed edge from n to m, and \(\frac{W_{n\rightarrow m}}{\sum _{n'\in N(n)}W_{n\rightarrow n'}}\) is the normalized weight of that edge; the same notation applies to SenderRank. Due to the random-walk nature of Eqs. (1) and (2), PR and SR become stable after a number of iterations. These values indicate the importance of the customers, since higher values of PR(m) and SR(m) correspond to higher importance in the social network. Other SNA features were also calculated, such as the degree centrality and the IN and OUT degrees, which are the numbers of distinct friends in receiving and sending behavior, respectively. The Neighbor Connectivity feature, which is the average connectivity of the neighbors of each customer, is also calculated [23].
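For illustration only, the following small-scale Python sketch implements the weighted PageRank update of Eq. (1) on an in-memory adjacency structure; SenderRank (Eq. 2) is obtained by running the same procedure on the reversed edges. The production computation was done distributed on Spark, so this is a conceptual sketch, not the actual job.

```python
# Conceptual sketch of the weighted PageRank iteration in Eq. (1).
def weighted_pagerank(out_edges, d=0.85, iterations=30):
    """out_edges: dict mapping node -> {neighbor: edge weight} of outgoing edges."""
    nodes = set(out_edges) | {v for nbrs in out_edges.values() for v in nbrs}
    pr = {n: 1.0 for n in nodes}
    # Sum of outgoing weights per node, used to normalize each edge weight.
    out_weight_sum = {n: sum(nbrs.values()) for n, nbrs in out_edges.items()}
    for _ in range(iterations):
        new_pr = {n: (1.0 - d) for n in nodes}
        for n, nbrs in out_edges.items():
            for m, w in nbrs.items():
                # Node n passes d * (normalized weight) * PR(n) to its neighbor m.
                new_pr[m] += d * (w / out_weight_sum[n]) * pr[n]
        pr = new_pr
    return pr

# Example: A interacts with B and C; B interacts with C.
ranks = weighted_pagerank({"A": {"B": 3.0, "C": 1.0}, "B": {"C": 2.0}})
```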

The Neighbor Connectivity equation is defined as follows: $$\begin{aligned} NC(m)= \frac{\sum _{k\in N(m)} \left| N(k) \right| }{\left| N(m) \right| } \end{aligned}$$ (3) The local clustering coefficient of each customer is also calculated. This feature tells us how closely connected the customer's friends are (the number of existing connections in the neighborhood divided by the number of all possible connections) [24].

The local clustering coefficient equation is defined as follows: $$\begin{aligned} LC(m)= \sum _{k\in N(m)} \frac{\left| N(m)\cap N(k) \right| }{ \left| N(m) \right| * (\left| N(m) \right| -1)} \end{aligned}$$ (4) This social network is also used to find similar customers based on the mutual-friends concept. Each customer has two similarity features, the Jaccard similarity and the Cosine similarity, with each of the other customers in his network. These calculations were performed for each distinct pair of customers in the social network. To reduce the complexity, customers who do not have mutual friends were excluded from these calculations. The highest values of both measures were then selected for each customer (the top Jaccard and Cosine similarity with a SyriaTel customer, and the top Jaccard and Cosine similarity with an MTN customer). The Jaccard measure normalizes the number of mutual friends by the size of the union of the two friend lists [25].

The Jaccard similarity equation between customer (m) and customer (k) is defined as follows: $$\begin{aligned} JS(m,k) = \frac{\left| N(m)\cap N(k)\right| }{\left| N(m)\cup N(k)\right| } \end{aligned}$$ (5) Another similarity measure is the Cosine measure, which is similar to Jaccard's; it calculates the cosine of the angle between the two customers' vectors, where each vector is the customer's friend list [25].

The Cosine similarity equation between customer (m) and customer (k) is defined as follows: $$\begin{aligned} CS(m,k) = \frac{\left| N(m)\cap N(k)\right| }{\sqrt{\left| N(m) \right| \left| N(k) \right| }} \end{aligned}$$ (6)
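Both measures reduce to simple set operations on the two friend lists, as in the following Python sketch of Eqs. (5) and (6), where each friend list is represented as a set of GSM numbers.

```python
# Sketch of the pairwise similarity measures on two friend lists (sets).
def jaccard_similarity(friends_m, friends_k):
    """Eq. (5): mutual friends divided by the size of the union."""
    union = len(friends_m | friends_k)
    return len(friends_m & friends_k) / union if union else 0.0

def cosine_similarity(friends_m, friends_k):
    """Eq. (6): mutual friends divided by the geometric mean of the list sizes."""
    denom = (len(friends_m) * len(friends_k)) ** 0.5
    return len(friends_m & friends_k) / denom if denom else 0.0
```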

The cosine similarity is useful when the customer is in the phase of leaving the company for the competitor, where he starts building a network on the new GSM line that is similar to the old, churning one, taking into consideration that the new line has a small friend list compared with the old one.

These features are used for the first time to enhance churn prediction, and they had a positive effect along with the other statistical features. The distributions of the main SNA features are presented in Fig. 9.

Fig. 9 Distribution of some main SNA features. Panel (a) visualizes the distribution of the Cosine Similarity Between GSM Operators feature, panel (b) visualizes the distribution of the Local Cluster Coefficient feature, and panel (c) visualizes the distribution of the Social Power Factor feature. In all panels, red represents the churned customer class and blue the active customer class

Table 1 lists some of the main calculated SNA features with a description of each.

Table 1 Some main SNA features with description

Feature transformation and selection

Some features, such as Contract ID, MSISDN, and other identifiers unique to each customer, were removed. They are not used in the training process because they have a direct correlation with the target output (they are specific to the customer itself). We also deleted features whose values were identical or completely missing, duplicated features, and features that have only a few numeric values. We found that more than half of the features had more than 98% missing values. We tried deleting every feature that had at least one null value, but this approach gave bad results.

Finally, we filled in the missing values with values derived either from the same feature or from other features. This approach is preferable because it allows us to use the information in most features for the training process. We applied the following rules (a sketch of these steps is shown after the list):

Records that contain more than 90% of missing features were deleted.

Features that have more than 70% of missing values were deleted.

Missing categories in categorical features were replaced by a new category called 'Other'.

The missing numerical values were replaced with the average of the feature.

The number of categorical features was 78; for each of them, the 31 most frequent categories were kept and the remaining categories were replaced with a new category, giving 32 categories in total.
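A minimal PySpark sketch of these cleaning rules is shown below, assuming the joined feature table is already loaded in a DataFrame named df; the thresholds follow the rules above, while the type-based column split is an assumption of the sketch.

```python
# Rough sketch of the missing-value handling rules; not the exact production job.
from pyspark.sql import functions as F

n_features = len(df.columns)

# Rule 1: drop records where more than ~90% of the features are missing
# (i.e. keep rows with at least ~10% non-null values).
df = df.dropna(thresh=int(0.1 * n_features))

# Rule 2: drop features with more than 70% missing values.
total = df.count()
null_ratio = df.select([(F.count(F.when(F.col(c).isNull(), c)) / total).alias(c)
                        for c in df.columns]).first().asDict()
df = df.drop(*[c for c, r in null_ratio.items() if r > 0.70])

# Rules 3 and 4: impute 'Other' for categorical gaps and the column mean for
# numerical gaps (integer columns would need a cast to double first).
categorical = [c for c, t in df.dtypes if t == "string"]
numerical = [c for c, t in df.dtypes if t in ("double", "float")]
df = df.fillna("Other", subset=categorical)
means = df.select([F.avg(c).alias(c) for c in numerical]).first().asDict()
df = df.fillna({c: v for c, v in means.items() if v is not None})
```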

Some other features have a numeric type but contain only a limited number of distinct values repeated across many records, which indicates that they are in fact categorical. We therefore treated them as categorical features, but the experiments showed that this made the model perform worse, so they were deleted.

We also calculated the Pearson correlation between the numerical features and removed the correlated ones; this removal had no effect on the final result. Many other methods were tested, but the approach described above gave the best performance across the four algorithms. The number of remaining features after these operations exceeded 2000.
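The correlation-based removal can be sketched with the Spark ML Correlation utility as follows, reusing the list of numerical column names from the previous sketch; the 0.95 threshold is an illustrative assumption, as the exact cutoff is not stated here.

```python
# Sketch: drop one feature from each highly correlated pair (Pearson).
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

assembler = VectorAssembler(inputCols=numerical, outputCol="corr_features")
vec_df = assembler.transform(df).select("corr_features")
corr_matrix = Correlation.corr(vec_df, "corr_features", "pearson").head()[0].toArray()

to_drop = set()
for i in range(len(numerical)):
    for j in range(i + 1, len(numerical)):
        if abs(corr_matrix[i][j]) > 0.95:   # illustrative threshold
            to_drop.add(numerical[j])       # keep the first of the pair
df = df.drop(*to_drop)
```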

We need this data labeled for training and testing, so we contacted experts from the marketing section to provide us with a labeled sample of GSM lines. They provided us with prepaid customers who entered the idle phase within the 2 months following the nine months of data, considering them as churners. The other customers were labeled as active (customers acquired in the last 4 months were excluded). The total sample was 5 million customers, containing 300,000 churned customers and 4,700,000 active customers. Figure 10 shows the periods of historical data and the future period in which a customer may leave the company.

Fig. 10 Periods of historical and future data

The experts in marketing decided to predict churn 2 months before the actual churn action, in order to have sufficient time for proactive action with these customers.

Classification

The proposed solution divided the data into two groups: a training group and a testing group. The training group consists of 70% of the dataset and is used to train the algorithms; the test group contains the remaining 30% and is used to test them. The hyperparameters of the algorithms were optimized using k-fold cross-validation with k = 10. The target class is unbalanced, which could have a significant negative impact on the final models. We dealt with this problem by rebalancing the training sample so that the two classes become balanced [25]. We started with oversampling, duplicating the churn class until it was balanced with the other class. We also used random undersampling, which reduces the sample size of the large class until it matches the second class. This is the same method used in several other research papers [8, 26], and it gave the best result for some algorithms. The training sample size became 420,000.
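A minimal PySpark sketch of the 70/30 split and the random undersampling step is shown below, assuming a DataFrame df with a binary label column named label (1 = churn, 0 = active); the column name and seed are illustrative.

```python
# Sketch of the train/test split and random undersampling of the majority class.
from pyspark.sql import functions as F

train, test = df.randomSplit([0.7, 0.3], seed=42)

churners = train.filter(F.col("label") == 1)
active = train.filter(F.col("label") == 0)

# Sample the active class down to roughly the churner count.
fraction = churners.count() / active.count()
balanced_train = churners.union(active.sample(withReplacement=False,
                                              fraction=fraction, seed=42))
```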

We started by training the Decision Tree algorithm and optimizing the depth and maximum-number-of-nodes hyperparameters. We experimented with several values; the optimal number of nodes in the tree was 398 and the optimal depth was 20. The Random Forest algorithm was also trained, and we optimized its number-of-trees hyperparameter by building the model with 100, 200, 300, 400, and 500 trees. The best results show that 200 trees was the best value; increasing the number of trees beyond 200 did not give a significant increase in performance. The GBM algorithm was trained and tested on the same data, optimizing the number-of-trees hyperparameter with values up to 500 trees; the best value after the experiments was also 200 trees. GBM gave better results than RF and DT. Finally, we installed XGBoost on the Spark 2.3 framework, integrated it with the Spark ML library, and applied the same steps as with the previous three algorithms. We also optimized the number of trees, and the best value after multiple experiments was 180 trees.
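As an example of this tuning procedure, the following PySpark sketch runs 10-fold cross-validation over the number-of-trees grid for the Random Forest; the feature/label column names, the AUC evaluator, and the use of the balanced training set from the previous sketch are assumptions, and the same pattern applies to the other tree-based learners.

```python
# Sketch: tuning the number of trees with 10-fold cross-validation in Spark ML.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [100, 200, 300, 400, 500])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=10)

cv_model = cv.fit(balanced_train)       # balanced training set from the earlier sketch
predictions = cv_model.transform(test)  # held-out 30% test set
```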