Why Spam Looks Like That (Part 1): A Laymen’s peek into Natural Language Processing, Statistics, and Neural Networks



I had always wondered why spam looked the way it did. Is it written by people in the third world that don’t really know English? Why does the sentence structure look kinda correct but not quite? Do people really click the links in blogspam? What is all this hubbub about SEO? In this two part blogpost I will share what I’ve learned as I looked into all these questions and ultimately how I (as a sophomoric hacker-type) was able to use a little math (neural classifiers and natural language processing) to write code that artificially influenced search engine results (namely Google Page rank and keyword relevance). But before I dive into the details about all that SEO crap, a little background…

In 2003 I read the book “Linked: How Everything is Connected to Everything Else and Why It Matters“. This was one of those books that comes out every few years and immediately becomes a bestseller for it’s ability to make a niche science comprehendible for us laymen. I am a sucker for these kinds of books, favorites being: “Hyperspace“, “Guns, Germs, and Steel“, “The Black Swan“, “Blink“, and “Freakonomics“. But “Linked” remains one of my all time favorites because it taught me a bit about the science of networks just as terms like “social network” and “internet meme” were entering the vernacular. Before this, I’d never really though about these concepts beyond road-trip games of of “Six Degrees of Kevin Bacon“. This book also led me to other books that would become favorites like: “Turtles, Termites, and Traffic Jams” and strangely “The Hero With A Thousand Faces“. While the subjects differ, the themes remain the same: finding common unifying threads in seemingly disparate pieces of information.

Shortly after reading this book, I stumbled upon an article about a “Matrix-style” brain interface device created by a team from a university in Italy that was being demo’d at Neuroscience conferences. The device was supposedly revolutionary because it was non-invasive and adaptive. I was called “non-invasive” because it used only an Electro-Encepalagraph (aka EEG which is worn on the head) and “adaptive” because the device supposedly worked on virtually everyone. At the conferences, the demonstration of the device was essentially a booth, open to anyone willing to try it. Because every brain is different and produces different signals, a short 15 minute “training session” was needed for the device to adapt itself to the new user. During those 15 minutes the ABI device would hone in on the specific electro-encephalograph signals produced by parts of the brain responsible for motor control. After those fifteen minutes, when the user merely thought about moving their left hand, a mouse cursor would move left. When they thought about moving their right hand, the cursor would move right.

This blew my mind, not only because of the crazy Lawnmower Man powers it might give me or the potential of the device (in all it’s Neuromancer-esque cyberpunkish glory) to be a boon for paraplegics but because it implied that there was some fascinating signal processing going on inside the device. Something inside that thing was able to do the data mining necessary to find those needles in the proverbial haystack of brain noise. If it could mine that data could it also be used to find things we repress? Lies? Secrets? (Update 17Aug2013: The answer is ‘yes’ it can.)

Anyway, I wanted to know what it was and how it worked, so I tried to consume the few research papers produced by the ESPRIT ABI team. As a full scholarship-third year Physics dropout this was very ambitious for me (thats right ladies, I’m a quitter, take a number!). After a few hours googling every odd word in the research paper and instant messaging all the smart people in my buddy list, I had a general idea of how the Adaptive Brain Interface worked.

The device essentially worked like this:

After running the EEG signals through a lot of signal filters and transformations (things with fancy names like Fast-Fourier Transforms and Butterworth Filters) the system shat out massive matrices of information. All this data was ultimately passed into a ‘neural classifier‘ that was itself largely based upon a single simple classification algorithm that I learned was called Mahalanobis Distance Statistic.

If you are thinking: “That equation looks like Greek!” (which is kinda is 😉 and “What the hell is a ‘neural classifier‘?” Then you and I aren’t too dissimilar, ‘cuz that exactly what I first thought.

Firstly, it is important to note that just because the ESPRIT ABI is a neurological interface the ‘neural’ in the phrase ‘neural classifier’ is just a coincidence. The neural classifier has nothing to do with the fact that this is a neurological device. Don’t let it confuse you, the way it did me. The neural classifier is merely an algorithm that makes use of basic neural network concepts to optimize the task of separating data into classes. The neural network is the “decision engine” that decides if a piece of data is an apple or an orange. In the case of the ESPRIT Adaptive Brain Interface, its neural classifier relied heavily upon Mahalanobis Distance Statistic. Other than this, we don’t need to know anything more about neural networks to continue understanding how this thing works. ( I, myself, know nothing more about neural networks other than that oft-quoted Ant-Colony Optimization technique wherein computer programs mimic ants and their pheromone trails to do “AI stuff”.)

At the time that I was diving into this, there was very little information on the web about Mahalobis Distance (certainly not the Wikipedia page that there is now) so in early 2003 or 2004 I ordered a book from Amazon called the Mahalanobis-Taguchi System. In this book, I learned that a nerdy Japanese Quality Assurance guy was able to get famous and make a buncha money by applying the Mahalanobis Distance Statistic to manufacturing processes. How? Well, take the task of photo development as an example. (EDIT: Daniel Bilar shared this link on the Taguchi method May 2014)

With photo processing you have to apply costly chemicals to film gradually over many stages before you can produce a final photo that can be determined to be “bad” (overexposed, out of focus, etc). Well if you are able to determine that the photo will turn out bad at earlier stages in the process, you don’t have to invest the resources in developing the full final product before you decide to throw it out. This is also true of other kinds of manufacturing: from automobiles, to Spacely Sprockets. (Especially for things like CPUs where almost half of your production yield gets thrown away.)

This japanese Quality Assurance guy was able to collect metrics about the “bad yields” of different products as they flowed through their manufacturing processes.

He applied Mahalanobis Distance Statistic to these metrics at “checkpoints” along the manufacturing process and was able to make early predictions about if a yield would be good or bad. This saves companies a whole grip of cash. He presumably made loads of money and got girls from it. I too wanted money and girls, and this math crap seemed like a perfect angle for my new “tortured genius” affectation.

Anyway, in addition to the Mahalanobis-Taguchi book, I found that Mahalanobis is used for lots of other sexy applications like “computer vision“. But perhaps my favorite other application of Mahalanobis Distance I found in some obscure articles on the web. These articles were about how Astronomers and Chemists use Mahalanobis with spectrographs (things that measure wavelengths from light sources) to determine the chemical composition of distant stars based solely on the type of light the star emits.

“Wait a minute. How the hell does this crap have anything to do with stars?” Well, the universe is imperfect. When you burn a chemical in a lab it produces light that you can

measure with a spectrograph. A pure sample of a chemical will basically always produce the same spectrograph reading. However, when the same chemical gets mixed in with a buncha other stuff and burns as part of a chemical compound (like in a star) it will produce an “imperfect” spectrograph reading. A human can overlay the perfect andimperfect graphs to see that they look kinda similar, but your average computer program would likely nitpick at all the little abberations and imperfections in the “real-world” sample and call the two samples vastly different. Mahalanobis Distance Statistic is one tool that Astronomers use (coded into computer programs) to measure the “sameness” between the real-world readings (of stuff burning in distant stars) to the samples of purer chemicals burning in their labs. This is how we (as humans) determine the chemical composition of real stuff in the universe. Effectively, Mahalanobis Distance Statistic helps humans discover what stars are made of without actually having to visit them. It’s real sh*t, that really works!

After reading all this crap, my mind was flush with crazy ideas:

“If this equation could filter through seemingly chaotic radio noise from our brains, then surely it could be used to do something simple like classify mp3s on my filesystem based on the ‘way they sounded'” (A few short years later, I would read about Pandora which uses a similar classification system.)

Or:

“Surely if this thing can data-mine brain noise then it can be used to help decide if an email is spam!” (It turns out that Mahalanobis Distance offers no real improvement over Bayesian filtering because it requires some computationally expensive matrix math and matrix transformations.)

Or even better:

“I could use this to classify streams of traffic from libpcap or Snort to help discover previously overlooked intrusions and data exfiltrations!”

Or:

“I could use this to find out the encoding or compression schemes of data inside of seemingly arbitrary data streams.”

Or:

“I could use this to decide if an executable is malicious based on traces of it’s execution!”

My young over-zealous mind was awash in great applications for a flexible “decision engine” based on statistical classifiers like Mahalanobis Distance. All anyone would need to do is extract “features” from any data set and feed it to my magical solver to have all the data classified! All these ideas teemed in my head for a few years and like most of my ideas and life aspirations only manifested as half-assed and unfinished code in a repository on some forgotten backup disk. (That is until a few months ago 🙂

But lets rewind a bit. First I had to understand a bit of more about how Mahalanobis Distance worked. I will not bore you with the details in this blogpost (if you want to know there is a Wikipedia page and plenty of step-by-step tutorials) but what is important is why Mahalanobis Distance is uniquely useful over other types of classification.

Lets imagine that you and I were tasked with creating a system that was able to categorize the two dimensional data points like the ones in the chart below:

After a few minutes of looking at the data, we would probably come up with a scheme that probably worked something like this:

1. Graph all of the observations as points on a graph.

2. Finding the center-point (or average) of the cluster of points and plot this average as a point on the graph.

3. Calculate some kind of radius from this center-point. This radius will serve as our “class threshold”. (In this case we will use a radius of ~2.o)

4. Draw a circle using the radius or “class threshold” and the center-point of the cluster.

Using this scheme, any new points added to the graph would either fall within the circle our outside of the circle. With this system, those outside can be said to be “a part of the class” or “not part of the class”.

The fancy terms for this common-sense technique are “Euclidean Mean” or “Minimum Distance Classifier“.

Well if you think like a hacker, you are probably already imagining scenarios where this classification system breaks down. Such as:

“What happens if a point happens to fall on the line, is it part of the class or not? How close to the line is too close?”

“If two points are plotted with one point falling just inside the circle and the other falling just outside of the circle, they themselves are very similar to one another but determined to be drastically different simply because they don’t both fall within the “class threshold”. Is that ok?”

And lastly, and most importantly:

“How would this system scale?” That is to say, how would it hold up if we tried to classify richer sets of data.

For example: what about data sets that have more than just X and Y coordinates? If they had X, Y, and Z coordinates we could maybe extend this idea into three dimensions by drawing a sphere as our threshold. But what if we wanted to have the same system work for data with X, Y, Z, D, E, F, H coordinates? What if we want an equation that would classify data sets with N dimensions just as easily as 2 dimensions? As coders we don’t want to hand code a different classification system for every possible dimension, we want a single algorithm we can reuse for N dimensions!

That is precisely what Mahalanobis Distance gives us. It is a scheme that does not fall apart for any of these above issues (as well as a bunch of other ones that us noobs can’t yet dream up).

Well let’s bring this back the the ESPRIT ABI thingy I mentioned earlier. You’ll recall that the ABI uses an EEG to read electrical signals in the brain from electrodes attached the surface of the head.

At any given time in a recording there is a reading for each electrode of the EEG. This data, at any given time, can be represented as a “multi dimensionalarray”. Instead of the simplistic 2-dimensional data-points like the ones from earlier, these readings have many “features”. Instead of having just X and Y coordinates the EEG readings have F3, C3, F7, T7, P7, O1, P3, (and so on), with each coordinate representing a reading from a corresponding electrode. At any given time, the EEG is constantly producing a stream of data in the form of a matrix containing the frequencies of data for each electrode.

If a researcher wanted to add more electrodes, this would effectively increase the number of dimensions (or features) that the datastream from the EEG produced. Thus, the analysis engine that consumes this data would have to be flexible and able to process data with matrices of N-dimensions. This is precisely why the researchers chose to use Mahalanobis Distance in the Adaptive Brain Interface it is one of few statistical classifiers that can be used on matrices of N-dimensions.

Well this is all very interesting, but how the hell does this have anything to do with SEO, spam, and tricking search engines? When you perform a message digest (like md5) on two blocks of text that differ by only one byte, comparing the md5 sums only tells you that the two blocks differ.

Thus, message digests are not suited to give you a measure of similarity because they can only indicate uniqueness. Well Mahalanobis is suited for exactly this. (In fact, the S-Matrix can even be represented by something that looks like a SHA256 hash). With this in mind, I set out to try to use Mahalanobis Distance Statistic to tell me “how similar” two blocks of text were.

After implementing this, I found that I could also use a bit of Natural Language Processing and word databases to not only tell me how unique a block of text was (based on the characters it contained) but also to give me a measure of uniqueness based on the lexical sentence structure (i.e. placement of nouns, pronouns, determiners, etc within the text).

From this, I was able to write proof-of-concept applications that “borrowed” the lexical structure of one body of text to programmatically generate hundreds and thousands of articles that made absolutely no sense to a human reader but were grammatically and syntactically correct enough that Google considered it indexable (including the content and links it contained). This in turn increases page rank for the sites referenced in the generated text. When performed in an automated fashion you could effectively implement this on a scale large enough to influence search engine results for specific keywords….Sound familiar? Blog spam!

In the next blogpost I will dive into the technical details, code, and applications of all this. Some of the more interesting points are:

SEO vernacular: templates, articles, keywords oh my!

Combinatorics and Set Math: How your fuzzer is not so different from what spammers use to generate article content.

WTF is a Markov Chain?: How Natural Language Processing is used to “game” the search engines by determining parts of speech and “spinning” blocks of text.

Automation: How spammers and SEOs defeat Captchas. How spammers and SEOs mass mail and automate control of many web identities (blogs, webmail, etc).

Code and technical details for all of the above.

Lastly, my ideas on how to defeat/improve this stuff by using Statistical classifiers like Mahalanobis distance to see if blocks of text are lexically similar.

[spinner screenshot]

[nlp post screenshot]

[mahal nlp screenshot]