This week Occupy Math tries to give a sense of how big data, social media, and some simple machine learning techniques can be used to invade your privacy. If you know how some of these techniques work, you can be a little safer — Aikido is a defensive martial art, hence the title of this week’s post. Having said that, it is important to note that the greatest source of safety is that, for the most part, we are happy to live and let live. While not required reading, an earlier post, With Big Data comes Big Responsibility, may supply some useful context. In this post we will look at how machine learning lets people guess who you are and what you believe.

Most people think of digital self defense in terms of building a privacy wall around your online presence.

This isn’t a bad idea, and it can help with some things, but it does little to keep the character of your beliefs and personality from being inferred. To be really safe from this sort of invasion of privacy, you would need to cut up your credit cards, buy everything with cash (not online!), and never even surf the web without total anonymity. We will begin our journey by looking at an old, reliable way of figuring out what a document — from a scientific paper to a Facebook post — is about. This technique is based on linear algebra, but Occupy Math will explain it in as close to plain language as can be managed.

If we think of vectors as arrows in space with a starting and ending point, then those arrows have a length. We usually encode an arrow using the coordinates of the point at its end, with its beginning assumed to be at the point with all coordinates 0. That means that (1,1,3) represents an arrow that goes from (0,0,0) one unit right, one up, and three units in the third direction that isn’t right-left or up-down. If you take two vectors, multiply the corresponding terms, and then add up the results, you have something called the dot product of the vectors. Here’s an example: (1,1,3) DOT (1,-1,2) is (1×1)+(1×(-1))+(3×2)=1-1+6=6. Once you have lengths of vectors and dot products you can do something remarkable. The dot product of two vectors divided by the product of their lengths is the cosine of the angle between those vectors.

If this number is near one, the vectors point in the same direction.

If this number is near zero, the vectors point in unrelated directions.

If this number is near minus one, the vectors point in opposite directions.

This measure of similarity of vectors is called cosine similarity, but what on earth does this have to do with document analysis?
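The definitions above fit in a few lines of code. Here is a minimal sketch in Python, using the example vectors from earlier (the function names are just illustrative choices):

```python
import math

def dot(u, v):
    """Multiply corresponding coordinates and add up the results."""
    return sum(a * b for a, b in zip(u, v))

def length(v):
    """Euclidean length of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of lengths."""
    return dot(u, v) / (length(u) * length(v))

u = (1, 1, 3)
v = (1, -1, 2)
print(dot(u, v))                    # 6, matching the worked example
print(cosine_similarity(u, v))      # about 0.739, so the vectors point in broadly similar directions
```
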

We are almost there — the missing part is how to turn a document into a vector. Take a document and count how many times each group of three adjacent words appears in a sentence (you could just count single words, but triples of words capture a lot more meaning). This gives you hundreds or thousands of groups of words that appear at least once. You make a vector of these counts. This turns a document into a vector and so lets us use cosine similarity to tell how similar documents are. A much more detailed explanation is available in this blog. These vectors are in hundreds or even thousands of dimensions, but this is not at all difficult once you have computers available. Let’s start with the potential applications and then consider the implications.
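Here is a small sketch of the whole pipeline in Python. The documents and helper names are invented for illustration; count vectors are stored sparsely (only the word-triples that actually occur), which is how this stays manageable in thousands of dimensions:

```python
import math
from collections import Counter

def trigram_counts(text):
    """Count every group of three adjacent words (a word trigram) in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a)   # only trigrams present in both contribute
    len_a = math.sqrt(sum(v * v for v in a.values()))
    len_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (len_a * len_b) if len_a and len_b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"   # one word rephrased
doc3 = "machine learning reduces mountains of data to useful information"

print(cosine_similarity(trigram_counts(doc1), trigram_counts(doc2)))  # high: shared trigrams survive
print(cosine_similarity(trigram_counts(doc1), trigram_counts(doc3)))  # zero: no trigrams in common
```

Note that the rephrased document still scores well above the unrelated one, which is exactly why small edits don’t defeat this kind of comparison.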

This system is a wizard-good plagiarism detector. Even after a pass of rephrasing, the cosine similarity between a plagiarized document and its source remains very high.

This system has been used to grade essays on standardized tests with 98% agreement with the grading done by human experts. You need a large body of already graded essays to make this work — and this is a key feature; you always need examples.

This technique can be used to sort publications — like scientific papers or newspaper stories — by topic. If you have several publications on a topic you are interested in, cosine similarity applied to documents can find more documents on the topic. It can even be used to look at chunks of documents and find the good bits.
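One way the topic-sorting idea could be sketched: score each candidate document by its average similarity to a handful of example documents on the topic, then sort. Everything here — the seed and candidate texts, the function names — is made up for illustration:

```python
import math
from collections import Counter

def trigram_counts(text):
    """Sparse vector of word-trigram counts for a document."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_by_topic(seed_docs, candidates):
    """Rank candidates by average cosine similarity to the seed documents."""
    seeds = [trigram_counts(d) for d in seed_docs]
    def score(doc):
        v = trigram_counts(doc)
        return sum(cosine_similarity(v, s) for s in seeds) / len(seeds)
    return sorted(candidates, key=score, reverse=True)

seed = "cosine similarity measures the angle between document vectors"
on_topic = "this paper measures the angle between document vectors with care"
off_topic = "the stock market fell sharply on tuesday afternoon trading"

print(rank_by_topic([seed], [off_topic, on_topic])[0])  # the on-topic document ranks first
```
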

This technique can be used to capture the ideological or political intent or character of a document, if examples of other documents of a similar character are available.

People used to think the government or big companies couldn’t read everything. Cosine similarity means they don’t need to.

The core implication is this. An organization with a good desktop computer and access to the web can let the computer screen out the 99.999% of documents, posts, and transactions that they do not care about and pick out the few hundred things they actually want to monitor with pretty high accuracy. They can also automate the process of building lists of potential customers or political enemies. This is a very useful technology that is also very capable of supporting totalitarian goals. The main defense? Don’t let them do it. Informing yourself and screaming (in outrage or for help) when you detect abuse of this technology will work, if enough people do it.

Freedom requires informed and active citizens.

Which is why Occupy Math writes these posts. Security lies in behaving as if anything you write online is being read by the government, potential employers, and teen-hackers. You may manage to keep your stuff private, but probably not. We are really bad at digital security at this point in our history. What can you do? There are several options:

Mouse up. Don’t have controversial opinions outside of your own head or in person conversation. Keep your on-line persona interesting but politically bland. Sound peaceful and calm. All of this puts you at the end of a very long line to be bothered — a line that “they” will probably never reach the end of.

Panther out. Act like you are a free citizen of a free society and express your opinions in a forthright manner. Avoid breaking the law, but also don’t hide or cower. Act to defend your freedoms and the freedoms of others. When they come for someone, stand in the way.

Chameleon style. Stay informed, but don’t go out of your way to become involved in controversy or political activism. Be prepared, but pick your battles. Help where you can.

It’s important to keep in mind that cosine similarity is one of several hundred machine learning techniques that can reduce mountains of data to small piles of useful information. A nice article appearing on the website Phys.org entitled How algorithms (secretly) run the world provides some useful perspective. These algorithms are not only used to invade privacy and sell stuff. Occupy Math uses them all the time for things like sorting gene sequences or picking representative samples from a data set.

Occupy Math hopes this hasn’t been too scary. An excellent fantasy series, The Belgariad, contains a situation where the main characters have the problem that if someone speaks the bad-guy’s name, he can hear it a thousand miles away. Their clever solution is to have the bards and minstrels of the realm re-tell certain old stories ensuring the name is spoken in every inn and tavern — thus creating a shield of white noise against the bad-guy’s ability to hear them. This is a workable approach to preventing a machine-learning attack on our freedom. If we all speak out, speak up, and defend the right (as we see it), then knowing who we are won’t help much. Suppression of free speech works best when free speech is timid and rare. Would you like to see more posts on machine learning and digital privacy? Comment or tweet!

I hope to see you here again,

Daniel Ashlock,

University of Guelph,

Department of Mathematics and Statistics