It’s early afternoon. I have 4 hours until my flight from SFO to LAX, then a 2 hour layover before my 15 hour flight from LAX to Sydney. I have time to burn.

I decided to set myself a little challenge: can I develop an algorithm that finds people on social media to follow who are not like me? That was last week. I’m still going now that I’m back in Sydney, and it’s still keeping me up at night. It turns out this is quite a complex problem to solve.

I was inspired by 2 episodes of one of my favorite podcasts, “Talk Python to Me”: an episode on data science and language processing, and an episode on diversity.

Twitter’s recommendation algorithm

Twitter has a recommendation algorithm. It looks at who you follow, who they follow and any common keywords to suggest other people with similar friends, interests etc. This approach is used by all the social media tools.

This is really helpful and allows you to build up a lot of followers (or followees) in your world, but there is one fatal flaw in the algorithm.

It assumes you want to follow people who are all more or less the same. I realized that the 1,000 people I am following on Twitter are guys in their 20s and 30s, with kids, interested in cloud, technology, programming and Python. What am I really going to learn from them? Sure, I’ll polish my technical skills and find out about the latest cool new utility or project going around. But it won’t expand my world view on anything, and I’ll become a more narrow-minded individual.

I initially thought I could just work out the opposite (antonym) of my profile and search for that. It’s not that simple! What is the opposite of ‘enthusiast’, and does that even make sense? What is the opposite of ‘developer’? Any answer would be subjective.

The diversity wheel

Diversity is complex. This is an illustration of a wheel that shows some vectors you can consider when assessing diversity.

I’m going to pick out the ones that I think people are likely to share on social media.

People don’t have profiles that say “I’m a right-wing, 32 year old straight woman with 2 kids, earning $35,000/yr as a tax accountant”.

Profiles can be cryptic or irrelevant. Robert Downey Jr’s is simply “You know who I am.”

Stage 1: The noun cloud

The first phase of the algorithm is to collect all the profiles of the people I follow, then match them to certain “diversity wheels”. The first wheel I experimented with was Profession, since this was the most likely to be shared in a profile.

I’m using the Python library NLTK (Natural Language Toolkit) for this analysis. The code is open: https://github.com/tonybaloney/wntf

I thought I would try to characterize the words that I care about (nouns) and group them to establish patterns in my social circle.

The first thing to look at is the nouns in the descriptions of the people I follow, and which of them are most common. For me, this is:

{'NN': [('https', 80), ('cloud', 56), ('@', 39), ('technology', 36), ('http', 31), ('software', 30), ('developer', 28), ('business', 28), ('world', 26), ('father', 24), ('news', 23), ('fan', 21), ('account', 20), ('source', 20), ('husband', 20), ('enthusiast', 18), ('team', 18), ('web', 18), ('geek', 17), ('code', 17)]}
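A tally like this can be sketched with the standard library alone. This is a minimal sketch, not the code in the wntf repository: it assumes the descriptions have already been tokenized and tagged into (word, tag) pairs (the kind of output NLTK's pos_tag produces), and the function name top_words_by_tag and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def top_words_by_tag(tagged_words, tags=("NN", "NNP"), limit=20):
    """Count how often each word appears under each part-of-speech tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if tag in tags:
            counts[tag][word] += 1
    # most_common() returns (word, count) pairs, highest count first.
    return {tag: counts[tag].most_common(limit) for tag in tags}


# Hypothetical pre-tagged tokens, standing in for real tagger output.
sample = [
    ("cloud", "NN"), ("developer", "NN"), ("cloud", "NN"),
    ("Python", "NNP"), ("loves", "VBZ"), ("Python", "NNP"),
]
result = top_words_by_tag(sample)
print(result)
```

Keeping common nouns (NN) and proper nouns (NNP) in separate counters matters here, because ‘cloud’ and ‘Cloud’ end up telling slightly different stories.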

So what can we tell about those nouns?

We then filter out certain nouns that commonly occur, such as ‘tweet’, ‘views’ and ‘opinions’, since a lot of people have a statement about their views not representing their employer.

Once that list is filtered, I can see the characteristics of the people I follow in a few traits:
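The filtering step could look something like the sketch below. The blacklist contents are an assumption: the first three words come from the text, and I have added the URL and @ tokens that the tagger mistook for nouns. The names BLACKLIST and filter_counts are hypothetical.

```python
# Hypothetical blacklist: three words from the text, plus tokens that are
# not really nouns but were tagged as such.
BLACKLIST = {"tweet", "views", "opinions", "https", "http", "@"}


def filter_counts(word_counts, blacklist=BLACKLIST):
    """Drop blacklisted words from a list of (word, count) pairs."""
    return [(w, n) for w, n in word_counts if w.lower() not in blacklist]


nouns = [("https", 80), ("cloud", 56), ("@", 39), ("technology", 36)]
filtered = filter_counts(nouns)
print(filtered)
```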

Their industry ‘business’, ‘technology’

Their role ‘developer’

Their gender ‘husband’, ‘father’

Their interests ‘web’, ‘code’, ‘software’

The way they describe themselves ‘geek’, ‘enthusiast’

Looking at the Proper nouns (NNP) I can also get some other interesting information:

'NNP': [('@', 313), ('Cloud', 92), ('|', 74), ('Data', 63), ('IT', 44), ('Dimension', 39), ('Software', 36), ('Microsoft', 35), ('Director', 35), ('Python', 32), ('Manager', 31), ('Husband', 26), ('Developer', 25), ('CTO', 25), ('Architect', 25), ('CEO', 24), ('Engineer', 24), ('/', 24), ('Technology', 23), ('Dad', 23)]

Again, filtering out some of the fluff, like @ and /, leaves:

Company data Microsoft, Dimension (Data)

Role ‘CTO’, ‘CEO’, Engineer, Architect

My noun cloud

Stage 2: Sorting the nouns into wheels

Once we have a collection of the top 50 most commonly occurring nouns (excluding the blacklist), we want to sort them into their respective diversity wheels.

For each of these words we build up a synset. A synset is a set of synonyms that share a common meaning. So for ‘technology’, the synset includes the nouns ‘technology’ and ‘engineering’.

I compiled a list of professions, sorted by similarity, so an oceanographer is next to a meteorologist but far from a mechanic. These professions then form a circle like this:

If I then match my social network to those professions and weight them on a polar diagram, I can see the distribution over this chart.

OK, so I follow a lot of CTOs, CEOs, engineers, developers, directors and clergy. Wait, what? I follow a lot of clergy?

For each profession, I listed other terms. For clergy we have clergy, vicar, rector and priest. When I debugged the algorithm, it was finding a lot of uses of the word ‘father’, and a hypernym for father is of course priest. So my algorithm needs a little work.
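The clergy problem is easy to reproduce in a small sketch. The term lists below are hypothetical (only the clergy terms come from the text), and adding ‘father’ to the clergy set by hand is a stand-in for what the WordNet expansion did automatically.

```python
from collections import Counter

# Hypothetical term lists. "father" is included under clergy on purpose,
# to mimic the false match caused by the hypernym lookup.
PROFESSION_TERMS = {
    "clergy": {"clergy", "vicar", "rector", "priest", "father"},
    "developer": {"developer", "programmer", "coder"},
    "engineer": {"engineer", "architect"},
}


def weight_professions(noun_counts, profession_terms=PROFESSION_TERMS):
    """Sum noun counts into every profession whose term list contains them."""
    weights = Counter()
    for noun, count in noun_counts:
        for profession, terms in profession_terms.items():
            if noun.lower() in terms:
                weights[profession] += count
    return weights


weights = weight_professions(
    [("developer", 28), ("father", 24), ("Engineer", 24)]
)
print(weights)
```

Every dad in my feed is counted as a priest, which is why clergy shows up so strongly on the polar diagram.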

Stage 3: tipping the scales

Now that I have the profession distribution, I build up other diversity wheels for age and gender. I considered sexual orientation, but it was unreliable, as it is not something people generally use to describe themselves in a profile.

Imagine each of these wheels in a series. They are interconnected, since a single person’s diversity reflects many aspects.

To create a diverse social network we want to rotate each of those wheels away from its current position. Think of the combination lock on your bike: you don’t just turn the numbers one notch away from your key combination, you turn each one (hopefully) randomly to make it impossible for someone to guess the original combination.

Reconsider the profession wheel. I don’t want the algorithm to suggest only 1 profession based on the statistical point which is furthest from my median, because that still wouldn’t be diverse. I want a collection of professions that are furthest away from my social network, so that it expands. Those are (according to my charts) doctors, physicians, dentists and physiotherapists.
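One way to pick that collection is to score every position on the wheel by its weighted average distance from the professions already in the network, then take the top few. This is a sketch under my own assumptions rather than the exact method: the wheel ordering, the weights and the function names are all invented for illustration, and it uses a weighted mean distance rather than a median.

```python
def circular_distance(i, j, size):
    """Shortest number of steps between positions i and j on a wheel."""
    step = abs(i - j) % size
    return min(step, size - step)


def furthest_professions(wheel, weights, how_many=3):
    """Rank professions by weighted average distance from the network."""
    total = sum(weights.values())
    size = len(wheel)
    scores = {}
    for i, profession in enumerate(wheel):
        scores[profession] = sum(
            weight * circular_distance(i, wheel.index(name), size)
            for name, weight in weights.items()
        ) / total
    return sorted(wheel, key=lambda p: scores[p], reverse=True)[:how_many]


# Hypothetical wheel ordering and weights, for illustration only.
wheel = ["developer", "engineer", "mechanic", "dentist",
         "doctor", "physiotherapist", "teacher", "director"]
weights = {"developer": 28, "engineer": 24, "director": 35}
suggestions = furthest_professions(wheel, weights)
print(suggestions)
```

Returning several professions instead of the single furthest point keeps the suggestions spread out rather than swapping one narrow circle for another.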

According to my gender distribution, I need to follow a lot more women and those who identify as non-binary.

All of this was done in less than 200 lines of Python.

What does this mean?

I would love to see research like this picked up by the social media engines. Once I have the outputs from my algorithm, I’m still constrained by Twitter’s keyword search feature and the rate-limiting APIs.

Twitter has some characteristics that impact natural language processing. The most obvious is that the character limit forces non-natural language. People over-punctuate, use abbreviations and rarely use prepositions. It was a lot of trial and error to get NLTK to do what I wanted.

Looking at Twitter, you can see how much of an issue this is. We should have a “who [not] to follow” feature that reverses the suggestion list and takes diversity into consideration.

What next

I’m going to develop the algorithm to implicitly infer location, gender and age, then build out the diversity wheels better for those aspects.

I’m going to measure the central point of the diversity wheels and their interconnections.

I’m going to offer a range of suggestions based on varying diversity, instead of simply assuming the opposite.