$\begingroup$

First, let me set the scene.

For each AFL match played, the umpires of that match decide who they deem to be "best and fairest" players on the ground. They allocate 3 votes to the most deserving player, 2 votes to the next deserving and 1 vote to the 3rd most deserving player. Towards the end of the season these votes are tallied and the player with the highest wins the Brownlow Medal.

The problem I am trying to solve is to try and predict the allocation of these votes based on the match statistics which are publicly available. For example, see this game: http://afltables.com/afl/stats/games/2015/031420150402.html

I am taking the statistics of each match (marks, kicks, handballs, goals etc.) for each player and using them to try and predict the BR (Brownlow vote) value assigned to each player for the match.

Unfortunately, I'm fairly new to machine learning and not sure of the best way to structure my data and what algorithms to try etc. I am using this project as a means of learning more about machine learning. I'm currently working with a Python stack, using NumPy, Scikit Learn and also Keras.

Currently, each vector in my input data is the statistics table for each player for a given match, flattened. Since there are usually 44 players in each match, and I am only interested in 21 of the supplied stats, each vector is of length 44*21 = 924. I do perform some column-wise normalization on the input data, before flattening the table to reduce the impact of a low-activity vs high-activity match.

The target data is a vector which contains the brownlow vote assigned to the i-th player (e.g. [0, 0,....,3, 0, 2, 0, 0, 1]), and is of length 44.

I'm processing data from matches spanning years 1998-2015 inclusive, around 3200 matches.

My questions are:

Is my current approach for formulating the input data sound? Does it make sense to normalize and flatten the input data in this way? As far as I could tell, most machine learning classifiers require data to be flattened (i.e. they do not take a 2-d matrix input)

What would be some good candidate algorithms for me to use to solve this problem?

To me this seems to be a Multi-output classification problem, as the output is a vector with possible values 0, 1, 2 or 3. Thus I have tried using a MultiOutputClassifier with a RandomForestClassifier with terrible results - I get an accuracy of 0.0?

I've also tried using a Keras basic Sequential model, again with terrible results - my accuracy is around 0.15?

Here is the code I'm using to process Pipe-delimited files with stats in them (each file is one match):