Detecting so-called “fake news” is no easy task. First, you must define what fake news is; given that the term has become a political statement, even that is contested. If you can find or agree upon a definition, you then need to collect and properly label real and fake news (ideally on similar topics, to best show clear distinctions). Once the data is collected, you must find useful features that distinguish fake from real news.

For a more in-depth look at this problem space, I recommend taking a look at Miguel Martinez-Alvarez’s post “How can Machine Learning and AI Help Solve the Fake News Problem”.

Around the same time I read Miguel’s insightful post, I came across an open data science post about building a successful fake news detector with Bayesian models. The author even created a repository with the dataset of tagged fake and real news examples. I was curious if I could easily reproduce the results, and if I could then determine what the model had learned.

In this tutorial, you’ll walk through some of my initial exploration and see if you can build a successful fake news detector!

Tip: if you want to learn more about Natural Language Processing (NLP) basics, consider taking the Natural Language Processing Fundamentals in Python course.

Data Exploration

To begin, you should always take a quick look at the data and get a feel for its contents. To do so, load it into a Pandas DataFrame, check the shape and head, and apply any necessary transformations.

```python
import pandas as pd

# Import `fake_or_real_news.csv`
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")

# Inspect shape of `df`
df.shape

# Print first lines of `df`
df.head()
```

The first column of the CSV is an unnamed article identifier, so set it as the index:

```python
# Set index
df = df.set_index("Unnamed: 0")

# Print first lines of `df`
df.head()
```
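While exploring, it is also worth checking whether the two classes are roughly balanced, since a heavily skewed label distribution would make raw accuracy misleading later on. A minimal sketch using `value_counts` (shown here on a tiny hand-made frame so it runs stand-alone; on the real data you would call it on `df['label']`):

```python
import pandas as pd

# Toy stand-in for the real dataset: a few labeled articles
df_toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "text": ["...", "...", "...", "..."],
    "label": ["FAKE", "REAL", "REAL", "FAKE"],
})

# Count how many articles carry each label
counts = df_toy["label"].value_counts()
print(counts)

# Express the split as fractions of the dataset
print(df_toy["label"].value_counts(normalize=True))
```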

Extracting the training data

Now that the DataFrame looks closer to what you need, you want to separate the labels and set up training and test datasets. For this notebook, I decided to focus on using the longer article text. Because I knew I would be using bag-of-words and Term Frequency–Inverse Document Frequency (TF-IDF) to extract features, this seemed like a good choice: longer text will hopefully allow for distinct words and features for the real and fake news classes.

```python
from sklearn.model_selection import train_test_split

# Set `y`
y = df.label

# Drop the `label` column (assign the result: `drop` is not in place by default)
df = df.drop("label", axis=1)

# Make training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)
```
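To make the bag-of-words vs. TF-IDF distinction concrete, here is a minimal sketch of how the training text could be vectorized with scikit-learn. The tiny two-document corpus and the `stop_words`/`max_df` settings are illustrative assumptions, not the only reasonable choices; on the real data you would fit on `X_train` instead:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny stand-in corpus; on the real data you'd fit on `X_train`
docs = [
    "the president signed the bill today",
    "aliens secretly control the government",
]

# Bag-of-words: raw token counts per document
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(docs)

# TF-IDF: counts reweighted to down-weight terms common across documents
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(docs)

# Both produce sparse document-term matrices with one row per document
print(count_train.shape, tfidf_train.shape)
```

Note that you would only call `fit_transform` on the training text; the test text gets `transform` so that it is mapped into the vocabulary learned from training data.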