Feature Correlation: I am using a number of features in this model; after an extensive evaluation I will drop the features that contribute little to overall performance.

Feature list and computations:

- friends_count
- followers_count
- listed_count
- statuses_count
- digit_count_in_name
- bot_in_name — regex search for "bot" in the name
- len_url
- tweet_length
- accountage — statuses_count / accountage
- activeness
- names_ratio
- followership — followers_count / friends_count
- friendship — (friends_count, followers_count)
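To make the derived features concrete, here is a minimal Python sketch of how a few of them could be computed before loading the data. The helper names mirror the feature list, but the exact formulas and regex used in the post may differ.

```python
import re

def digit_count_in_name(name: str) -> int:
    # digit_count_in_name: number of digit characters in the screen name
    return sum(ch.isdigit() for ch in name)

def bot_in_name(name: str) -> int:
    # bot_in_name: regex search for "bot" in the name
    # (case-insensitive matching is an assumption)
    return 1 if re.search(r"bot", name, re.IGNORECASE) else 0

def followership(followers_count: int, friends_count: int) -> float:
    # followership: followers_count / friends_count, guarding against
    # division by zero for accounts that follow nobody
    return followers_count / friends_count if friends_count else 0.0

print(digit_count_in_name("user123"))  # 3
print(bot_in_name("NewsBot2000"))      # 1
print(followership(500, 250))          # 2.0
```

A high followership ratio (many followers, few friends) is typical of legitimate popular accounts, while bot-like accounts often show the inverse pattern.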

The choice of features has been guided by patterns observed in fake account and or their associations with other accounts, their tweeting patterns among others.

Step 1 Create the Model

In the code above I create a model in the Social dataset in my project space; the model name is ispopaganda. A lot of the string and Date handling in the script is there because I did not transform the data before loading it. I intend to use https://cloud.google.com/dataprep/ when I productionize the model so that data preparation is handled dynamically and reliably, but for now I will have to live with the code.
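The CREATE MODEL statement could look roughly like the sketch below. The model path (Social.ispopaganda) comes from the post, but the model type, label column, and training table name are assumptions for illustration.

```sql
-- Sketch only: 'logistic_reg', the is_propaganda label column, and the
-- training table name are assumed, not taken from the original script.
CREATE OR REPLACE MODEL `Social.ispopaganda`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['is_propaganda']
) AS
SELECT
  friends_count, followers_count, listed_count, statuses_count,
  digit_count_in_name, bot_in_name, len_url, tweet_length,
  accountage, activeness, names_ratio, followership, friendship,
  is_propaganda
FROM `Social.training_accounts`;  -- hypothetical training table
```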

Step 2 Predict using our Model
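Prediction uses the ML.PREDICT table function against the trained model. A minimal sketch, assuming a hypothetical Social.new_accounts table of unlabelled accounts and an is_propaganda label:

```sql
-- BigQuery ML prefixes the label column with predicted_ in the output;
-- table and label names here are assumptions.
SELECT
  predicted_is_propaganda,
  predicted_is_propaganda_probs
FROM ML.PREDICT(
  MODEL `Social.ispopaganda`,
  (SELECT * FROM `Social.new_accounts`)
);
```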

Result

Step 3 Evaluate Model

One of the key steps is evaluating the performance of the model. Depending on the desired KPI for your model (recall, in our case), you may have to tune your threshold and features until the model gives you the desired result. That said, we have to continuously evaluate our model; this can be automated, and we will look at that in a later post. For now, let's go through how to get evaluation results using the ML.EVALUATE and ML.ROC_CURVE functions.
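ML.EVALUATE can take an optional threshold, which is how the tuning described above would be done in practice. A sketch, with a hypothetical evaluation table and an example threshold value:

```sql
-- Returns precision, recall, accuracy, f1_score, log_loss and roc_auc
-- for a classification model; 0.55 is an example threshold, not from
-- the post, and the eval table name is hypothetical.
SELECT *
FROM ML.EVALUATE(
  MODEL `Social.ispopaganda`,
  (SELECT * FROM `Social.eval_accounts`),
  STRUCT(0.55 AS threshold)
);
```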

For the ROC Curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) we will execute the query with the SELECT * FROM ML.ROC_CURVE(MODEL …) option and click Explore with Data Studio to create a graph with Google's free reporting tool. We will set dimension = false_positive_rate and metric = recall, and the result is below.
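The ROC query itself could be sketched as follows; ML.ROC_CURVE returns one row per threshold, which is exactly the false_positive_rate and recall pair charted above (the evaluation table name is again a hypothetical stand-in):

```sql
-- Each output row gives the trade-off at one classification threshold,
-- ready to be plotted in Data Studio.
SELECT threshold, false_positive_rate, recall
FROM ML.ROC_CURVE(
  MODEL `Social.ispopaganda`,
  (SELECT * FROM `Social.eval_accounts`)
);
```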

WHAT'S NEXT

I am going to do a part 2, where we will tune the model and serve it to predict on real data that I am collecting from Twitter using Google Connected Sheets. I have glossed over a lot of detail; I will attempt a deep dive in part 3. If you have questions you can reach me at muchemwal@gmail.com

Credits: https://www.sciencedirect.com/science/article/pii/S0925231218308798

https://towardsdatascience.com/how-to-use-k-means-clustering-in-bigquery-ml-to-understand-and-describe-your-data-better-c972c6f5733b