Neutral Bias

This article has neutral bias with a bias score of -3.63 from our political bias detecting A.I.

In our hyper-partisan political climate, discerning between fake and real news has grown ever more challenging. Even worse is when political bias is thrown into the mix and facts and opinions are twisted.

We’ve always valued transparency here at The Bipartisan Press, which is why we label the bias of all our articles. To make that process more systematic and nonpartisan, we pursued the use of AI and natural language processing to predict political bias.

Table of Contents

Wait…So how does A.I. work?

Dataset

Building the Model

Initial Conclusion

Building (Better) Models

Interesting Observations

Areas for Improvement

Try the model out for yourself!

Wait…So how does A.I. work?

Classifying political bias is considered a Natural Language Processing (commonly abbreviated NLP) task, meaning it relates to the processing and analysis of textual content. NLP is a branch of AI, or Artificial Intelligence, the field concerned with building systems (usually algorithms) that can mimic human intelligence.

Machine learning, the practice of building systems that learn from experience, is usually considered a branch of AI and a method of building AI systems. In this case, machine learning is what we use to build the NLP model that predicts political bias.

To build a successful machine learning model, we need a dataset of prelabeled data containing the model’s inputs and desired outputs. In our case, that means the text content we want to feed the ML model and the bias values we want it to output. This is the data that we will show to the model and have it learn from.

Once our dataset is ready, we need to set up the actual model, which will look at and learn from the dataset in order to output the targeted values. There are countless libraries, parameters, and architectures on which we can build our model. Two of these libraries are TensorFlow and PyTorch (we touch on both during our experiment). These base libraries simplify the work needed to set up basic algorithms and layers in the ML model.

Running on top of these frameworks are libraries like Keras, which simplify the process even further. We will be using Keras and FastAI in our process.

Now, within these libraries are different configurations for the ML model. These configurations include things like max word count, batch size (the number of data pieces to show the model at one time) and more. The model also has various “layers” that carry out various functions. In NLP ML models, some layers remember certain words, some look for word relationships, some analyze how the text is structured.

If you combine many layers with various functions, you form an architecture. The two main architectures we will be using are LSTM (Long Short-Term Memory, the backbone of ULMFiT) and Transformer networks (the basis of BERT). We won’t go into detail here, but they are basically just different configurations for the model to learn with.

Once we have assembled our ML model and dataset, we move on to training the model, that is, actually showing the model the data and having it learn to classify political bias. We go through many rounds of training to let the ML model learn as much as it can from the data.

After training the model, we need to evaluate how well it is doing, that is, how close the predicted outputs are to the actual outputs. We do this on a separate dataset called the “validation” set, which the model was not allowed to see during training. There are many metrics we can use to judge the performance of the model. For simplicity, we will be using accuracy (number correct / total) for classification tasks (ie, “right” or “left”) and mean absolute error (MAE: take the absolute value of (actual – predicted) for each data entry, then average those values) for regression tasks (continuous output, ie 1.0 through 10.0).

Keep in mind that this is a vastly simplified description of how we built our ML model; the actual process had many more steps and much more experimentation.
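To make the two metrics concrete, here is a minimal sketch in plain Python. The function names and the sample values are ours for illustration only; they are not taken from the actual evaluation code or data.

```python
from typing import Sequence


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the true label."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over all entries."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Illustrative values only (not from our datasets).
print(accuracy(["left", "right", "right", "left"],
               ["left", "right", "left", "left"]))            # 3 of 4 correct
print(mean_absolute_error([-10.0, 4.0, 20.0], [-6.0, 1.0, 25.0]))  # (4 + 3 + 5) / 3
```

A lower MAE means the predicted scores sit closer to the labels on average, while accuracy only counts exact label matches.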

Past Experiments

There have been attempts by others in the past to predict political bias. Linalgo built a classifier that used five labels for the data, achieving 70% accuracy. Researchers at Stanford also built a three-label classifier, achieving 83% accuracy. MIT’s CSAIL has also built a three-label classifier, achieving 70% accuracy. However, these studies are slightly dated and only cover classification. With the release of new, large, pre-trained models like BERT and GPT-2, we wondered how well they could be put to use in a regression task (predicting a continuous variable).

Dataset

In order to train our AI, we need to first build a dataset to train it on. Ultimately, we ended up trying out four different datasets to find out which ones were the most accurate.

Our first dataset was based on AdFontesMedia’s list of articles with prelabeled per-article bias (on a scale of -42 to 42). We extracted the article content using the Python newspaper library and used the bias values as the prediction target. The total dataset consisted of ~1200 entries totaling ~1 million words. The distribution of bias is as follows (further left is more left-biased, center is more neutral, right is more right biased):

In an attempt to expose the ML model to a greater variety of text, we then extracted 10k articles using webhose.io and labeled each with a numeric bias value converted from MediaBiasFactCheck data. This dataset had significantly more noise than the prior one because we generalized that all articles published by an outlet share the same bias, which is obviously false. Here is the distribution of the data by sites and by bias:
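The outlet-level labeling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the domains and scores in OUTLET_BIAS are hypothetical placeholders, and the helper names are ours.

```python
from urllib.parse import urlparse

# Hypothetical outlet-level scores; real values would be converted
# from an external rating source.
OUTLET_BIAS = {
    "left-outlet.example": -15.0,
    "center-outlet.example": 0.0,
    "right-outlet.example": 15.0,
}


def outlet_domain(url: str) -> str:
    """Return the host of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def label_articles(articles):
    """Attach the outlet's bias score to every article from a known outlet.

    Every article from one outlet receives the same label, which is
    exactly where the label noise comes from.
    """
    labeled = []
    for article in articles:
        domain = outlet_domain(article["url"])
        if domain in OUTLET_BIAS:
            labeled.append({**article, "bias": OUTLET_BIAS[domain]})
    return labeled


sample = [
    {"url": "https://www.left-outlet.example/politics/story-1", "text": "..."},
    {"url": "https://right-outlet.example/opinion/story-2", "text": "..."},
    {"url": "https://unknown.example/story-3", "text": "..."},
]
for row in label_articles(sample):
    print(row["url"], row["bias"])
```

Articles from outlets without a known score are simply dropped in this sketch; another reasonable choice would be to keep them for language-model training only.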

The distribution by domain isn’t very even, as a majority of articles were from CNN and Breitbart. Bias-wise though, the skew was less obvious.

Our third dataset was based on the AllTheNews dataset, which contained ~100k articles and was labeled the same way as the second. Here’s its distribution:

The bias distribution was more skewed in this dataset, with ~65k articles labeled with right bias versus ~45k with left.

Our fourth dataset was the Ideological Books Corpus, which is slightly different because it contains sentences annotated with ideology (2025 liberal sentences, 1701 conservative sentences, and 600 neutral sentences). We ultimately ended up not using this because of the limited data.

Model Layout

We wanted the output to be a range of numbers (ie, 0-42) showing the degree of bias, as well as negative or positive denoting left or right bias.

We considered two different setups for this:

Create a regression model that predicts both the degree of bias and its direction at the same time, as a single signed value (-42 to 42). This will be denoted as a “full” model in the rest of the article.

Create a regression model that predicts only the degree of bias (0-42), then a separate classification model that predicts direction (-1 = left, 1 = right), and combine the results (ie, 18 * -1 = -18 bias). This will be denoted as a “dual” model in the rest of the article.
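The relationship between the two setups can be sketched in a few lines of Python: a signed “full” label splits into a direction and a magnitude for the “dual” setup, and the two “dual” outputs multiply back into one signed score. The function names are ours, and treating a score of exactly 0 as “right” is an assumption made only so the sketch is complete.

```python
from typing import Tuple


def split_label(signed_bias: float) -> Tuple[int, float]:
    """Split a signed score (-42..42) into (direction, magnitude).

    Direction is -1 for left and 1 for right. A score of exactly 0 is
    mapped to 1 here, which is an arbitrary choice for this sketch.
    """
    direction = -1 if signed_bias < 0 else 1
    return direction, abs(signed_bias)


def combine(direction: int, magnitude: float) -> float:
    """Recombine the outputs of the two 'dual' models into one signed score."""
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 (left) or 1 (right)")
    return direction * magnitude


print(split_label(-18.0))   # direction and magnitude used as training targets
print(combine(-1, 18.0))    # the article's example: 18 * -1 = -18
```

One consequence of this design is that a wrong direction flips the sign of the whole score, so the classifier’s errors matter far more than small errors in the magnitude.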

Architecture

We experimented with a variety of architectures and pre-trained models for the AI, including transformers, LSTM networks, BERT, GPT, and ULMFiT. Ultimately, we found that transformer-based models such as BERT were the most accurate in predicting both bias and direction.

Building the Model

Full Regression

ULMFiT

We first built a ULMFiT full regression (-42 to 42) model using FastAI’s native implementation (AWD LSTM) on the first two datasets. A batch size of 64 and a train:validation split of 8:2 were used throughout the experiment.

We didn’t train it on the third dataset because we lacked the considerable computational resources required. Our language models (pre-trained on Wikitext-103) achieved the following accuracies:
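For readers unfamiliar with these settings, the sketch below shows what an 8:2 split and a batch size of 64 mean in practice, using only the Python standard library. It is not the FastAI code we ran; the helper names and the fixed seed are our own assumptions.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def train_validation_split(rows: Sequence, train_fraction: float = 0.8,
                           seed: int = 42) -> Tuple[List, List]:
    """Shuffle the rows and cut them into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def batches(rows: Sequence, batch_size: int = 64) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Stand-in for ~1200 articles; integers replace real article records here.
articles = list(range(1200))
train, validation = train_validation_split(articles)
print(len(train), len(validation))
print(sum(1 for _ in batches(train)))  # number of batches per training round
```

The validation rows never appear in a training batch, which is what lets them serve as an honest check on what the model has learned.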

Dataset #1: 34% accuracy

Dataset #2: 42% accuracy

From there, we trained a regressor and achieved the following mean absolute errors:

Dataset #1: 10.23

Dataset #2: 4.23

Despite the seemingly low error of the second model, trained on dataset #2, we actually found the first model to be marginally more accurate in predicting actual bias, especially the direction of bias (left or right). This is probably due to the more specific, per-article labels in dataset #1 versus the per-outlet labels in dataset #2.

Sampling the large All The News dataset, we ran a separate validation measuring only the accuracy of the predicted direction, and found that model #1 had an accuracy of just 62% and model #2 just 47%.
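A direction-only check like this can be sketched as follows: reduce both the label and the regression output to a sign, then compute plain accuracy. This is a minimal illustration that assumes direction is read off the sign of the score (with 0 counted as right); the function names and sample numbers are ours.

```python
from typing import Sequence


def direction(score: float) -> str:
    """Map a signed bias score to a direction label."""
    return "left" if score < 0 else "right"


def direction_accuracy(actual_scores: Sequence[float],
                       predicted_scores: Sequence[float]) -> float:
    """Share of entries where the predicted sign matches the labeled sign."""
    if len(actual_scores) != len(predicted_scores) or len(actual_scores) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(direction(a) == direction(p)
               for a, p in zip(actual_scores, predicted_scores))
    return hits / len(actual_scores)


# Illustrative values only: the third prediction lands on the wrong side of 0.
labels = [-12.0, 8.0, -3.0, 20.0]
predictions = [-5.0, 2.0, 4.0, 15.0]
print(direction_accuracy(labels, predictions))
```

This also shows why a low MAE and a poor direction accuracy can coexist: predictions that hover near zero have small absolute errors against mildly biased labels yet can still land on the wrong side.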

Best Political Bias Prediction Model: Dataset #1, ULMFit, 10.23

BERT

Adapting BERT classification code to work for regression, we used a batch size of 8 and a max_sequence_length of 512. Here are our regressor MAEs for BERT:

Dataset #1: 7.94

Dataset #2: 3.04

Despite the seemingly low absolute error for dataset #2 again, its results were still quite inaccurate on the separate test set: D#1 outperformed D#2 by about 28 percentage points at simply classifying left versus right (85% vs 56%).

(New) Best Political Bias Prediction Model: Dataset #1, BERT, 7.94

Keras (LSTM)

Using the Keras library and TensorFlow, we also built an LSTM regression model. We used a max_sequence_length of 500, one LSTM layer with 100 memory units and dropout, and a vocabulary of 50,000 words. We attained a mean absolute error of 9.3 with dataset #1 and an MAE of 3.5 with datasets #2 and #3 combined, but separate testing once again showed the models trained on the second and third datasets to be quite inaccurate.

(Still) Best Political Bias Prediction Model: Dataset #1, BERT, 7.94

Conclusion

From this, we were able to conclude that BERT had the best performance versus ULMFiT, achieving the lowest mean absolute error out of all the models, as well as being the most accurate in determining the direction of bias.

Dual Model Ensemble

Next, we built the dual model ensemble using architectures similar to those above.

ULMFiT

Splitting each dataset into the categories “left” and “right” (negative values became “left” and positive values “right”), we first built language models based on Datasets #1, #2, #3, and #4 (the ideology corpus).

Our fine-tuned language models achieved the following accuracies:

Dataset #1: 34% accuracy

Dataset #2: 42% accuracy

Dataset #3: 40% accuracy*

Dataset #4: 30% accuracy

Following the same procedure as our regression model, we then trained a text classifier on the two labels. Accuracies are, once again, below:

Dataset #1: 72% accuracy

Dataset #2: 91% accuracy

Dataset #3: 95% accuracy*

Dataset #4: 65% accuracy

Curiously enough, the models achieved only ~2/3 of the validation accuracy above on the separate test set.

To build a bias-only model, we then took the absolute values of all the bias scores and, using the language models above, trained a regressor on them. The regressors achieved the following MAEs (dataset #4 was omitted because it lacks the individual bias values needed for regression; #3 was omitted due to lack of resources):

Dataset #1: 6.45

Dataset #2: 2.06

To produce a final value, we multiplied the direction (-1 or 1) by the bias value.

Due to the low accuracy of the directional classifier, we weren’t able to attain accurate results using this model setup.

*the language model of dataset #3 wasn’t completely trained to its potential, possibly bringing the maximum accuracies down.

(Still) Best Political Bias Prediction Model: Dataset #1, BERT, 7.94

BERT

Using the same BERT setup, we proceeded to build the same models using the bidirectional transformer architecture. Results are below:

Classification

Dataset #1: 68% accuracy

Dataset #2: 92% accuracy

Dataset #3: 96% accuracy

Dataset #4: 56% accuracy

Bias-only Regression (MAE)

Dataset #1: 5.02

Dataset #2: 1.07

We observed trends that mirrored our previous experiments. The classification accuracies were significantly lower when tested separately than the validation accuracies suggested. Our best classification model (BERT, dataset #3) only had a 69% accuracy on the withheld test set.

Furthermore, our regression model using dataset #2 also yielded quite inaccurate and volatile results, despite its low error.

(Still) Best Political Bias Prediction Model: Dataset #1, BERT, 7.94

Combining Models

Hoping to boost the accuracy, we experimented with combinations of the classification models, adding up the probabilities each model assigned to a category and taking the category with the highest total.

Our highest final (testing) accuracy, 77%, came from combining the BERT model trained on dataset #3 with the ULMFiT model trained on dataset #2.
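The combination step can be sketched as below: sum each category’s probability across models and pick the category with the largest total. The function name and the probabilities are illustrative assumptions, not outputs of our real models.

```python
from typing import Dict


def combine_predictions(*model_probs: Dict[str, float]) -> str:
    """Sum per-category probabilities across models and return the top category."""
    if not model_probs:
        raise ValueError("at least one model's probabilities are required")
    totals: Dict[str, float] = {}
    for probs in model_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)


# Hypothetical outputs from two classifiers for one article.
model_a = {"left": 0.55, "right": 0.45}   # slightly favors left
model_b = {"left": 0.20, "right": 0.80}   # strongly favors right
print(combine_predictions(model_a, model_b))
```

Because the probabilities are summed rather than the votes counted, a confident model can overrule a hesitant one, as in the example above.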

(Still) Best Political Bias Prediction Model: Dataset #1, BERT, 7.94

Conclusion

Despite the significant accuracy boost from combining two different classification models, the single, full regression model still outperformed the dual-ensemble model, possibly because it was able to learn more complex relationships and correlations that the simplified, dual models weren’t able to learn.

We were also able to conclude that the BERT transformer architecture was considerably more accurate than LSTM networks. Furthermore, we found that training on dataset #1 yielded the most accurate results, probably because it contains the least noise, despite other datasets having more data.

Building (Better) Models

Now that we were able to generally establish which dataset, model configuration, and architecture worked the best, we set out on building an even more accurate model to predict and classify political bias. Having already experimented with various learning rates and parameters, our options were to either:

Increase dataset size

Try a different model

We were pretty limited in terms of attaining a larger dataset since AdFontesMedia doesn’t add new sources very frequently, and so we decided to try a new model architecture.

After a bit of research, we found several new architectures that outperformed BERT on benchmarks like GLUE. With this, we built a regression model by adapting several guides we found. The models we tried are below.

RoBERTa

Created by Facebook, RoBERTa is a more robustly trained version of BERT: it has been trained for longer and on more data. Following a setup similar to the BERT models above, we tested both the “base” model and the “large” one (more parameters). Using the base model, we achieved an MAE of 7.55, a marked improvement over BERT’s 7.94. On the “large” model, we attained a breakthrough with an MAE of 6.04, roughly 24% lower than the initial BERT model’s error and a 20% improvement over the “base” model.
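The percentage improvements can be checked with a short calculation. The sketch below uses the MAE figures quoted in this article (7.94 for BERT, 7.55 for RoBERTa base, and 6.04 for RoBERTa large as listed for the best model); the function name is ours.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100


bert_mae = 7.94
roberta_base_mae = 7.55
roberta_large_mae = 6.04

# RoBERTa large versus the initial BERT model: about 24%.
print(round(relative_reduction(bert_mae, roberta_large_mae)))
# RoBERTa large versus RoBERTa base: about 20%.
print(round(relative_reduction(roberta_base_mae, roberta_large_mae)))
```

Note that these are relative reductions in error, not percentage points; the absolute drop from BERT to RoBERTa large is about 1.9 points on the -42 to 42 scale.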

(New) Best Political Bias Prediction Model: Dataset #1, RoBERTa (large), 6.04

XLNet

XLNet is a different transformer architecture that is comparable in performance to RoBERTa. We attempted to set up an XLNet model, but were unable to fit even the base model on our 16 GB GPU with a batch size of 1 and FP16, so we weren’t able to test its performance.

Albert

Albert is another variation of BERT, trained by Google, that achieves even better results than RoBERTa on some benchmarks. We also attempted to fine-tune the Albert model but were only able to achieve a minimum MAE of 10.02. We tested the “large-v1”, “large-v2”, “xlarge-v2”, and “xxlarge-v2” models.

(Final) Conclusion

Facebook’s RoBERTa model had significantly better performance than the initial BERT model, achieving an MAE of only 6.04. This is the model we currently have deployed to rate our articles.

Interesting Observations

While testing various models, we made several interesting observations.

Our largest BERT left-right classification model was adept at classifying bias by domain. For example, when given just the domain, it classified “cnn.com”, “washingtonpost.com”, and “nytimes.com” as Left, and “foxnews.com”, “washingtonexaminer.com”, and “nypost.com” as Right.

The BERT and RoBERTa regression models were not adept at recognizing short ideological sentences like “we need more gun control” and “gun control is bad”.

The BERT regression model assigned moderate bias values (range: -9 to 9) when given the input of only names like “trump”, “clinton”, and “bernie sanders.” We observed similar results when we only inputted the terms “democrats” (14.5) and “republicans” (-10.1).

RoBERTa, on the other hand, returned much more neutral results (range: -1 to 1) when given the same terms, showing the increased accuracy of the model versus the base BERT model.

Areas for Improvement

Dataset – Increasing the size of the dataset should expose the ML model to a greater variety of data and topics, increasing accuracy.

More architectures (Ernie, XLnet, Albert) – Different architectures process text differently. Changing the architecture type may improve the results for this task.

Better data preprocessing – The content was put in the dataset directly after being scraped with newspaper. Stripping out unnecessary content like “Advertisement” text, related post widgets, or newsletter subscription forms would probably have helped increase accuracy, as there would be less noise.

Larger batch sizes – The use of larger batch sizes could potentially increase the accuracy of the models, but requires more GPU memory.
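A cleanup pass of this kind could look roughly like the sketch below, which drops whole lines that match a few boilerplate patterns. We have not run this on our datasets; the patterns and the function name are hypothetical and would need tuning against real scraped pages.

```python
import re

# Hypothetical patterns for common page residue; extend as needed.
BOILERPLATE_PATTERNS = [
    re.compile(r"^advertisement$", re.IGNORECASE),
    re.compile(r"^related (posts|articles|stories)\b", re.IGNORECASE),
    re.compile(r"\b(subscribe|sign up) (to|for) (our|the) newsletter\b", re.IGNORECASE),
]


def strip_boilerplate(text: str) -> str:
    """Remove lines that look like ads, related-post widgets, or signup prompts."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and any(p.search(stripped) for p in BOILERPLATE_PATTERNS):
            continue  # drop the whole line, keep everything else untouched
        kept.append(line)
    return "\n".join(kept)


raw = (
    "The senate voted on Tuesday.\n"
    "Advertisement\n"
    "Related articles: five more stories\n"
    "Subscribe to our newsletter for updates\n"
    "The bill now moves to the house."
)
print(strip_boilerplate(raw))
```

Matching whole lines keeps the filter conservative: a sentence in the article body that merely mentions an advertisement is left alone.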

Try Out Our Classifier

We have created a tool that allows you to test out our dual model ensemble and combined direction classifier. We are also considering releasing that model and its weights for public use. Its performance is lower than that of our production model, which, at this time, we won’t be releasing or opening to public use.

Bias Classifier

As time goes on, the state of NLP will continue to improve and so will the political climate. With this, we will continue to improve our AI in pursuit of better accuracy. Subscribe to our newsletter to be the first to know of new developments.

However, if you are interested in gaining access to the models described in this experiment, learning more about this, or have any questions, we encourage you to email us at [email protected]

Content from The Bipartisan Press. All Rights Reserved.