This opens up the opportunity to use clustering on the data, grouping hits into tracks that share an originating particle. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can be used for this purpose. Written in Python, the method translates to the following code snippet:
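A minimal sketch of that clustering step, assuming the hits have already been transformed into clustering-friendly features (the article's exact feature transformation and `min_samples` choice are not reproduced here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

def cluster_hits(features, eps=0.00715):
    """Assign a track label to every hit via density-based clustering.

    `features` is an array of shape (n_hits, n_features), e.g. derived
    from the x, y, z hit coordinates. Hits sharing a label form one
    candidate track.
    """
    X = StandardScaler().fit_transform(features)   # zero mean, unit variance
    return DBSCAN(eps=eps, min_samples=1).fit_predict(X)
```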

StandardScaler subtracts the mean and scales to unit variance; it is used to improve the performance of the clustering by normalizing the variables. Feature scaling is a common preprocessing step in machine learning. Both the scaler and DBSCAN are imported from scikit-learn, and there is also a custom TrackML library (not shown here) for loading events, shuffling and so on. In any case, the clustering gave me a score of 0.20817 out of 1 on the test data using eps=0.00715, where eps is the maximum distance between two samples for one to be considered in the neighborhood of the other.

With that score in mind we can conclude that helix clustering in this simple form is really not that accurate. At least it serves as a baseline, and, spoiler alert, the top-performing methods are a bit trickier to implement. Clustering was still used in a number of algorithms, some reaching scores upwards of 0.8.

Challenge runner-up

Kaggle user outrunner based his solution on an artificial neural network. The model outputs an adjacency matrix of ones and zeros, one meaning two hits are on the same track and zero that they are not. Since the number of hits per event is so large (N ~ 100k), the matrix could not be predicted in one piece and had to be built up from smaller units: the model takes a pair of two hits as input and outputs their relationship, i.e. the probability of them being on the same track.

Outrunner’s neural network model

All pairs of hits are considered, with 27 features computed per pair. A multilayer perceptron (MLP) with 5 hidden layers of 4k-2k-2k-2k-1k neurons is then trained on them. Methods that reconstruct tracks usually do not take all hit pairs as input, but only adjacent ones, more or less a conventional connect-the-dots approach. Among the features used are unit vectors for the hit direction, and by assuming that tracks are linear or helix shaped, the number of false positives can be reduced. This matters because, not surprisingly, pairs not belonging to the same track predominate when, as stated, all possible pairs are used.
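As a toy illustration of such pair features (the actual 27 are not listed in the source, so the names and choices below are stand-ins), one could combine the hit positions, the unit direction vector between them, and quantities useful for line/helix consistency checks:

```python
import numpy as np

def pair_features(h1, h2):
    """Toy feature vector for a hit pair; a stand-in, not outrunner's 27 features."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    d = h2 - h1
    dist = np.linalg.norm(d)
    unit = d / dist                  # unit vector of the pair's direction
    r1 = np.linalg.norm(h1[:2])      # transverse radii of each hit, useful
    r2 = np.linalg.norm(h2[:2])      # for line/helix consistency checks
    return np.concatenate([h1, h2, unit, [dist, r1, r2]])
```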

To reconstruct the tracks, outrunner starts with an initial hit and forms a seed from the pair with the highest predicted score. The third hit added is the one maximizing the sum of probabilities when put together with the pair; it also undergoes a check to see if it fits the circle passing from the origin through the seed pair. This procedure is iterated until no further hits qualify for the track. After that, new initial hits leading to new tracks are picked in the same manner until no hits remain in the data.
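The growing step can be sketched as a greedy loop over a pairwise score matrix. The matrix, the stopping threshold and the omission of the circle check are all simplifying assumptions for illustration:

```python
import numpy as np

def grow_track(seed, score, used):
    """Greedy track following, a simplified sketch of the procedure above.

    `score[i, j]` is the predicted probability that hits i and j share a
    track (assumed given by the model); `used` holds hits already claimed
    by other tracks. The circle-consistency check is omitted.
    """
    track = list(seed)
    while True:
        best_hit, best_sum = None, 0.0
        for h in range(score.shape[0]):
            if h in track or h in used:
                continue
            s = sum(score[h, t] for t in track)
            if s > best_sum:
                best_hit, best_sum = h, s
        # stop when no remaining hit is sufficiently compatible on average
        if best_hit is None or best_sum / len(track) < 0.5:
            break
        track.append(best_hit)
    return track
```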

The last step is to resolve inconsistencies between the tracks, since many of them include the same hits. The quality of an individual track is quantified by the number of hits uniquely assigned to it. This measure is used to order overlapping tracks and merge them optimally, as seen below.
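A simplified sketch of this resolution step, with the quality metric reduced to the count of still-unclaimed hits (an assumption, not outrunner's exact metric):

```python
def resolve_overlaps(tracks):
    """Resolve hit conflicts between candidate tracks (simplified sketch).

    Tracks are processed in order of quality, here taken as the number of
    hits not yet claimed by a better track; claimed hits are removed from
    all remaining tracks.
    """
    resolved, claimed = [], set()
    remaining = [set(t) for t in tracks]
    while remaining:
        # pick the track with the most still-unclaimed hits
        remaining.sort(key=lambda t: len(t - claimed), reverse=True)
        best = remaining.pop(0) - claimed
        if not best:
            break
        resolved.append(sorted(best))
        claimed |= best
    return resolved
```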

Example track reconstruction

This deep learning method came in second in the Kaggle competition with a score of 0.90302. Further reading about the runner-up over here:

Winning contestant

The winning contribution came from team Top Quarks with a score of 0.92182. In terms of machine learning, their solution uses some logistic regression for pruning away the excess tracks that arise, but other than that it is based on classical mathematical modeling with statistics and 3D geometry. It also includes a crude model of the magnetic field as a function of z position in the detector. Interestingly, they used very few training events, with most model fitting done using only one event [2].

The algorithm

Seed generation: The algorithm starts by selecting promising pairs of hits. Most tracks were covered by 50 pairs of adjacent layers in the innermost part of the detector, so the candidates for starting-point pairs, or seeds, are all pairs of hits on those layers. The number of candidates was reduced by logistic regression pruning on different parameters, most notably how far from the origin an extrapolated line through the two points passes, and secondly the hit angles, which give the direction of the particle.

Extension to triplets: A third point is added to the track by extending the line between the existing points to adjacent layers and storing the 10 closest hits as candidates. Bad triplets are rejected, again using logistic regression pruning.

Track following: A helix is fitted to the triplet and extended into the other layers. The hits closest to it are added to the track.

Track consolidation: There are overlapping modules, which lead to measuring points not yet accounted for, in other words multiple hits per layer in the detector. If these hits are closer than a threshold they are added to the tracks.

Track ambiguity resolution: Each track has been considered on its own so far, leading to a massive overlap between the different tracks in terms of hits. A scoring metric based on the training data is used to select the best track; then all hits contained in that track are removed from all other conflicting tracks. This is done iteratively until no conflicts remain.

Winning algorithm: 1. pair finding, 2. extension to triplet, 3. addition of hits from overlapping modules and 4. final track disambiguation. [3]
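The helix fit in the track-following step rests on the fact that a helical track projects onto a circle in the transverse (x-y) plane. A basic sketch of that geometric primitive, finding the circle through three projected hits (not Top Quarks' actual implementation):

```python
import numpy as np

def circle_through(p1, p2, p3):
    """Centre and radius of the circle through three transverse-plane points.

    Equating squared distances to the centre gives a 2x2 linear system in
    the centre coordinates (cx, cy).
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a = np.array([[x2 - x1, y2 - y1],
                  [x3 - x1, y3 - y1]])
    b = 0.5 * np.array([x2**2 - x1**2 + y2**2 - y1**2,
                        x3**2 - x1**2 + y3**2 - y1**2])
    cx, cy = np.linalg.solve(a, b)
    r = np.hypot(x1 - cx, y1 - cy)
    return (cx, cy), r
```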

Conclusion

A direct comparison with the state of the art in particle tracking is not available according to the publication from the competition [3], but we do know that current charged-particle tracking algorithms are based on three stages: seeding, track following and track selection, which many of the top-ranking solutions also used. A total of 651 teams participated in the competition, leading to many novel solutions.

I have presented a simple way of applying clustering to the data, and there are more complex and more accurate solutions using that same technique that are also fairly fast. The same goes for the winning algorithm, which uses heavily optimized data structures and little in terms of machine learning, whereas the neural network solution has a run time that is “an astronomical number” according to the Kaggle interview linked earlier. I read a promising comment from the winner to the runner-up regarding merging their two successful techniques: “I can only imagine how well a solution using your near-optimal pair scoring and my track extension / selection could score”. All these different approaches most definitely gave the CERN scientists some food for thought.