‘Rot. Forest’ shows the results I got from automatic classification. I placed players’ agreement and the F-score produced by the classifier on the same scale — they are not directly comparable, but it gives you an idea of what’s going on.

In most categories we have pretty good results, and I went on to produce maps — see the ‘Results’ section. The ‘PvP — E-war’ category is an exception: while there are dedicated ships for E-war, players often fit some E-war modules on a random combat ship, and that confuses the classifier.

Also, ‘Cyno’ is a group unique to Eve Online; again, see the Results section for an explanation.

Classifiers

The struggles

I started off using Weka: it has a lovely, simple interface and many, many classifiers. Unfortunately I burnt way too much midnight oil dealing with its refusal to read a simple CSV file. Then I realised that I could read data directly from the MSSQL server, but again Weka would throw a tantrum with astounding regularity.

Then I found Knime, which, like Weka, is written in Java, and it reads files and connects to SQL server without any fuss. But Knime doesn’t have nearly as many nice classifiers, so I found Weka plugins for Knime, and those worked fine, even if not all of them did. Now that I had completed this inception gymnastics, I could finally get to work.

Useful bit

I split the responses into training and verification sets, roughly 4:1, and gave several classifiers a go. Then I looked at the ones that gave the best results, and chose one I could understand. That was rotation forest — it produces decision trees, but unlike a ‘normal’ decision tree, it treats each column in the table as a dimension in n-dimensional space. Before building each tree, the algorithm rotates the feature axes (applying principal component analysis to subsets of the features) to find better split directions, hence the name.
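The split itself is simple. Here is a minimal sketch in Python — the actual work was done inside Knime/Weka, so this is purely illustrative:

```python
import random

def split_responses(rows, train_frac=0.8, seed=42):
    """Shuffle the rated kills, then split them roughly 4:1."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# 1000 rated player deaths -> 800 for training, 200 for verification
train, verify = split_responses(range(1000))
```

Shuffling before splitting matters here: the killmails arrive in time order, and a straight cut would put all the recent deaths in the verification set.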

The above means that it can combine several parameters and weight them relative to one another; otherwise it acts as a normal decision tree. It has plenty of knobs to tweak, yet unlike with a neural network, I actually understood what was going on. Another advantage of trees is that the results are easy to inspect, so you can make sure you aren’t overfitting your data or doing something ridiculous.

Getting the results was relatively straightforward — I inspected the tree to make sure it was not overfitting, and pruned it a lot. I didn’t want to see the names of individual players, player organisations, or specific locations anywhere in the tree. I considered that overfitting, and my goal was to get the classifier to operate on higher-level data — security status, parameters of ships, not individual occurrences.
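That rule can also be applied as a pre-processing filter rather than post-hoc pruning. The column names below are hypothetical, but the idea is to drop anything that identifies a specific player, organisation, or place before the classifier ever sees it:

```python
# Hypothetical column names: anything identifying an individual player,
# organisation or place is an overfitting risk and gets dropped up front.
IDENTIFYING = {"victim_name", "corporation", "alliance", "solar_system"}

def strip_identifying(row: dict) -> dict:
    """Keep only higher-level features such as security status or ship stats."""
    return {k: v for k, v in row.items() if k not in IDENTIFYING}

row = {"victim_name": "Pilot X", "security_status": 0.4, "ship_hp": 5200}
strip_identifying(row)  # -> {'security_status': 0.4, 'ship_hp': 5200}
```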

I spent a lot of time pre-processing the data before feeding it to the classifier. This seems to be a bit of a dark art, as I couldn’t find a robust method for how it should be done.

I tried to assist the classifier using my knowledge of the game. For example, the classifier struggled to tell apart what was equipped on the ship and what was in the cargo, and thus useless in battle. So I created a rating of the ship’s attacking ability based on the kinds of weapons it had fitted, excluding the ones sitting in the cargo. After that, it stopped categorising freighters as a massive threat. Check out the GitHub repo if you would like to see this in detail.
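The attack rating might look something like this sketch — the weapon scores and item names here are made up, and the real pre-processing lives in the repo:

```python
# Made-up weapon scores for illustration; real values are in the repo.
WEAPON_SCORE = {"turret": 3, "launcher": 3, "combat_drone": 2, "ewar_module": 1}

def attack_rating(fitted_items, cargo_items):
    """Rate attacking ability from fitted modules only; cargo is useless in battle."""
    return sum(WEAPON_SCORE.get(item, 0) for item in fitted_items)

# A freighter hauling weapons in its cargo hold no longer looks like a threat:
attack_rating([], ["turret", "turret", "launcher"])  # -> 0
attack_rating(["turret", "launcher"], [])            # -> 6
```

The point is that the cargo list is accepted but deliberately ignored, which is exactly the distinction the classifier could not learn on its own.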

Of course, the amount of hand-holding the classifier needs would be vastly reduced if, instead of 1,000 rated player deaths, I had 100,000. But you can always have more data, can’t you? It’s all about what you can do with the resources you’ve got.

Unsupervised Classification

I spent a little time and had a stab at unsupervised classification, but it didn't yield any results. Actually, it did yield results, but they were useless.

All the unsupervised ML methods I tried categorise deaths by the ship class of the victim. I think that happens because all the parameters I extracted from the dataset, such as the ship’s armour and speed, are mostly a function of the ship’s size class. For instance, a battleship has literally two orders of magnitude more hit points than a frigate. If you do principal component analysis of the data, virtually all the variance in the dataset is in the ship’s size.

PCA of the data — a lot of variance in one dimension, much less in the others.

In the chart above, dimension 0 is the amount of damage taken, which is basically the ship’s size. I also found there aren’t many packages that let you visualise the results of PCA in an easily digestible format.
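You can reproduce the lopsided-variance effect with plain NumPy and synthetic numbers; the ship stats below are invented, but the mechanism is the same — one feature spanning two orders of magnitude swallows nearly all the variance:

```python
import numpy as np

def explained_variance(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the covariance matrix, largest first
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Synthetic ships: hit points span two orders of magnitude, top speed does not.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(250, 25_000, 200),   # hit points
                     rng.uniform(100, 400, 200)])     # top speed
ratios = explained_variance(X)  # first component dominates
```

This is also why people usually standardise each column to unit variance before PCA; without that step, the analysis is really just measuring ship size.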

I can’t remember the algorithms I used for classification, but take a look at the results. In the charts below, colour is the group assigned by the algorithm, and X and Y are the first two dimensions of the PCA. The data I put in was player deaths; the categories I got out were ship classes — the top-right group is kills of players in Titans, the biggest ships in the game. The next, blue group is other capital ships.
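To see how little work the clustering actually has to do, here is a toy one-dimensional k-means — not the algorithm I used, and the hit-point figures are only illustrative — showing that well-separated size classes fall out on their own:

```python
import random

def kmeans_1d(values, k=3, iters=25, seed=1):
    """Toy k-means on a single feature, enough to recover ship-size groups."""
    centres = random.Random(seed).sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre
            clusters[min(range(k), key=lambda i: abs(v - centres[i]))].append(v)
        # Move each centre to the mean of its cluster (keep it if empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Rough hit-point scales: frigates, battleships, titans (invented numbers)
hp = [4_000, 5_500, 90_000, 120_000, 28_000_000, 30_000_000]
kmeans_1d(hp)  # -> three centres in ascending order of hit points
```

When one feature dominates like this, the clusters simply slice the size axis, which is exactly the useless-but-correct result described above.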