Dilated convolutions (DCs) are pretty cool. If you haven’t heard of them before, I’d recommend http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/ as a good starting point on their benefits.

One of the first applications of them in ML seems to be Multi-Scale Context Aggregation By Dilated Convolutions, which primarily discusses their use in semantic segmentation (where they perform remarkably well). Since then, this interesting little kernel operation has gone on to do great things in computer vision, with nearly every one of its 700+ citations coming from segmentation-related papers attempting to use dilation in interesting ways.

Past that, there’s another interesting and intuitive application of dilation: time series data. As the pivotal WaveNet paper demonstrated, dilated CNNs are able to consider the global context of time series datasets, and are often able to achieve new highs in accuracy. However, try as I might, I cannot find a general comparison of dilation vs. no dilation on any dataset whose signals do not carry a huge amount of global information (relative to local); i.e., not voice. Thus, in this blog post, I will establish this comparison on a time-series dataset that is short and is a ‘snapshot’ of a particular target.

For this, I’ve selected an interesting, multi-channeled dataset: the TUH EEG Six-Way Event Classification Dataset, which can be found here, underneath TUH EEG Six-Way Event Classification Corpus (v1.0.0). To give a little backstory, the TUH EEG corpus is a massive repository of EEG signals, collected and annotated in excruciating detail. Within this repository, Temple University (the academic group running the corpus) has created quite a few training/testing sets for specific problems, including abnormal EEG identification, EEG seizure identification, and, the one we’re focusing on today, EEG event identification.

The data is structured slightly confusingly (at least to somebody who’s never worked with EEG data before), but the README gives a very good overview of how to work with this dataset. The signals themselves have a variable number of channels, with some patients having 37 and others having fewer than 25, but the creators of the dataset have ensured that every patient has at least all the electrodes included in the internationally recognized 10/20 configuration.

Furthermore, the authors have created a conversion guide to go from the 10/20 configuration to something called the ‘TCP Montage’ (which is 22 signals), stating that this montage is ‘[their] preferred way of viewing seizure data‘. I’m not entirely sure why it is preferred, but as I dropped my neuroscience major years ago, I decided to go with best practices and convert it. In case you’re hoping to work with this dataset, I’ve attached a function for this conversion in the code that goes along with this project.
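The conversion itself boils down to taking differences between pairs of 10/20 electrodes. Here’s a minimal sketch (not my exact function from the repo, and the pair list is the standard ACNS TCP arrangement as I understand it, so double-check it against the TUH montage documentation before relying on it):

```python
import numpy as np

# The 22 bipolar electrode pairs of the TCP montage (standard ACNS
# arrangement; verify against the TUH docs before relying on this).
TCP_PAIRS = [
    ("FP1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1"),
    ("FP2", "F8"), ("F8", "T4"), ("T4", "T6"), ("T6", "O2"),
    ("A1", "T3"), ("T3", "C3"), ("C3", "CZ"), ("CZ", "C4"),
    ("C4", "T4"), ("T4", "A2"),
    ("FP1", "F3"), ("F3", "C3"), ("C3", "P3"), ("P3", "O1"),
    ("FP2", "F4"), ("F4", "C4"), ("C4", "P4"), ("P4", "O2"),
]

def to_tcp_montage(signals):
    """Convert a dict of {electrode_name: 1-D signal} in the 10/20
    configuration into a (22, n_samples) TCP montage array, where each
    row is the difference between one pair of electrodes."""
    return np.stack([signals[a] - signals[b] for a, b in TCP_PAIRS])
```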

Past that, the description at the top of the README says everything else you really need to know about the dataset classes:

This [dataset] is a subset of the TUH EEG Corpus and contains sessions that are known to contain events including periodic lateralized epileptiform discharge, generalized periodic epileptiform discharge, spike and slow wave discharges, artifact, [eye movement, and background].

In other words, we’ve got some set of target classes (periodic lateralized epileptiform discharge, generalized periodic epileptiform discharge, and spike and slow wave discharges), along with a few classes that can be considered noise (artifact, eye movement, and background). These are ALL considered EEG events. To that end, we can see that there are really two possible problems we can attempt to solve: the two-way problem (target vs. noise) or the six-way problem (subsets of targets vs. subsets of noise).

Finally, the dataset is organized such that each of these events occurs in a 1-second timeframe, and as the signals are recorded at 250Hz, each input feature will be a (250 x 22) matrix with a corresponding EEG event as its target.
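To make that shape concrete, here’s a hedged sketch of the windowing step, assuming the recording has already been converted to a (n_samples, 22) array and the annotations have been simplified down to (start_second, label) pairs (the real TUH annotation format carries more detail than this):

```python
import numpy as np

SAMPLE_RATE = 250  # Hz

def extract_event_windows(recording, events):
    """Slice a (n_samples, 22) recording into 1-second (250, 22) windows.

    `events` is a list of (start_second, label) annotations -- a
    simplified stand-in for the TUH annotation format.
    """
    X, y = [], []
    for start_sec, label in events:
        start = int(start_sec * SAMPLE_RATE)
        window = recording[start:start + SAMPLE_RATE]
        if window.shape[0] == SAMPLE_RATE:  # drop truncated windows at the end
            X.append(window)
            y.append(label)
    return np.array(X), np.array(y)
```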

There’s definitely some other stuff I’m missing out on (and feel free to read through the README for further information) but that’s pretty much all that’s necessary to know for this blog post.

Alright, so how can we craft our CNN to, theoretically, solve this problem better with dilation? My first thought is to use an ensemble of CNNs, each with a different starting dilation rate in its first few layers. Something like this:

Where each block represents a layer (or multiple layers, in the case of the larger blocks), and the variables represent the dilation rate for that specific layer. Given that we’re tackling both the two-way and six-way problems with our network, we end up with four networks.

Where the dilated CNN would have dilation rates:

[figures showing the per-layer dilation rates of the dilated networks]

And the undilated CNN would have dilation rates:

[figures showing the per-layer dilation rates of the undilated networks]

So, what’s the rationale behind the dilation rates in the dilated CNN? Well, I want each CNN branch to represent a different level of context (or receptive field) of the overall signal. Specifically, I’d like the first branch to have a tiny receptive field, the second to have a larger receptive field, and the third’s receptive field to encompass nearly the entire signal. This way, I can get both the local and global context while training my CNN. I’ll be comparing this to a vanilla ensemble CNN, where all three branches are identical to each other.

Luckily, I got to avoid doing any math here (!!) by using a neat receptive field calculation tool released just a few days ago (https://fomoro.com/tools/receptive-fields/) to figure out good dilation rate values. Here are some nice visualizations of each of the three CNN branches and their effective receptive field:
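If you’d rather do the math yourself, the calculation the tool performs is only a few lines. Here’s a sketch for a stack of 1-D conv layers (the example layer stack at the bottom is illustrative, not the actual rates used in my networks):

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    applied in order.
    """
    rf, jump = 1, 1  # receptive field size; distance between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# e.g. three stride-1 layers with kernel 3 and dilations 1, 2, 4
# cover 1 + 2*1 + 2*2 + 2*4 = 15 input samples, versus 7 undilated.
```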

Is there anything else we can throw at this problem? How about multi-task learning? Here’s something I didn’t mention: EEG events do not occur on all channels simultaneously. In fact, every single event occurs on a specific montage channel, from 0 to 21. Perhaps we can also train the network to predict the ‘offending channel‘, and see if that helps the accuracy? This would, intuitively, allow the network to gain a bit more information about the nuances of the signals and, hopefully, foster more informative gradients. These network architectures would look something like this:
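As a rough sketch of what the two-headed output looks like (plain numpy, with random weights standing in for the trained trunk; the layer sizes match the FCN dimensions described below, but everything else here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical shared 256-dim features from the trunk (batch of 4).
shared = rng.standard_normal((4, 256))

# Each head: 256 -> 64 -> its own softmax output.
W_trunk_e = rng.standard_normal((256, 64))
W_trunk_c = rng.standard_normal((256, 64))
W_event = rng.standard_normal((64, 6))   # six-way event head
W_chan = rng.standard_normal((64, 22))   # offending-channel head

event_probs = softmax(np.maximum(shared @ W_trunk_e, 0) @ W_event)
channel_probs = softmax(np.maximum(shared @ W_trunk_c, 0) @ W_chan)

# Training would then minimize the sum of the two cross-entropy losses.
```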

This brings the total number of architectures we’re testing to 8 (as we’d also be trying out the dilation rates on the multi-task networks)! Finally, here are some finer details about the networks:

The FCN layer sizes are 1024 -> 512 -> 256 -> 64 -> 6 (or 2), depending on which network we’re using. For the multi-task architecture, the 256 layer splits into two branches: 64 -> 6 (or 2) and 64 -> 22. There’s also 0.5 Dropout between each FCN layer.

All layers use ReLUs, with BatchNorm between subsequent layers.

Depthwise separable convolutions are used for all convolutional layers.

‘Same’ padding is used in all convolution layers.

In both the single-task and multi-task architectures, we weight all the classes equally during training.

Training is done with SGD: a learning rate of 0.001 decaying linearly with epoch, momentum of 0.9 with Nesterov acceleration, and a batch size of 128.

Training runs for a maximum of 30 epochs, because p2.2x instances are expensive.
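The learning-rate schedule in the details above amounts to something like this (a sketch; decaying all the way to zero at the final epoch is my assumption, not a detail confirmed above):

```python
INITIAL_LR = 0.001
MAX_EPOCHS = 30

def lr_for_epoch(epoch):
    """Linearly decay the learning rate from INITIAL_LR toward 0
    over the course of training."""
    return INITIAL_LR * (1 - epoch / MAX_EPOCHS)
```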

Whew! Now that we’re done with all that, what do the results look like? Well, before that, let’s establish what the current SOTA is. While there may be one better than this, this seems to be the best performer (published a little over six months ago): Automatic Analysis of EEGs Using Big Data and Hybrid Deep Learning Architectures.

Let’s start with the two-way task: target (1) vs noise (0). We have four normalized confusion matrices for this task:

Alright, not bad. However, the results aren’t as one-sided as I originally hoped; some architectures perform well in certain areas and terribly in others. For example, the best overall performer is definitely ‘Single-Task w/o Dilation‘. But if we care more about false positives, as is usually the case in medically-oriented research, it’s Multi-Task w/o Dilation by a close margin, which boasts a false-positive rate of 3.45% and a specificity of over 96%! In any case, it seems that dilation does relatively little for the two-way classification problem, which surprises me. As for our ability to beat the SOTA, here is the contending confusion matrix:

Hmm. We can claim victory on false-positive and specificity rates with our Multi-Task w/o Dilation architecture (which is amazing!), but the victory is Pyrrhic; our false-negative rate is extremely high compared to the SOTA. Perhaps I overdid it on the class-balancing :).
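For reference, the false-positive, specificity, and false-negative numbers I’m quoting are straightforward functions of the raw confusion-matrix counts. A quick sketch (the counts in the test case are made up for illustration, not taken from my actual matrices):

```python
def two_way_rates(tn, fp, fn, tp):
    """False-positive rate, specificity, and false-negative rate from raw
    confusion-matrix counts (noise = negative class, target = positive)."""
    fpr = fp / (fp + tn)
    specificity = tn / (tn + fp)  # = 1 - fpr
    fnr = fn / (fn + tp)
    return fpr, specificity, fnr

# e.g. a model that mislabels 3.45% of noise windows as targets
# has a specificity of 96.55%.
```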

Next up is the six-way classification challenge. Again, the same paper seems to hold the SOTA result there too. Here are our four competing confusion matrices:

Ugh, even worse than before. No clear winners and no clear losers. And here’s the SOTA, with identical row/column ordering as our matrices:

Not really sure how to call this one. My method definitely has an edge when it comes to detecting the noise classes, but I hesitate to call that a win on my end, purely because it feels practically useless. In any case, the SOTA definitely destroys all of my methods when it comes to the subsets of the target class.

Turns out dilation is not the holy grail I originally thought it was (at least when applied to relatively small, localized time series waves!). As to whether or not I beat the SOTA with my method, I’d err on the side of caution and say that the authors have me beat on clinical usefulness, but I can proudly state that I have the lowest false-positive score in the two-way EEG Event Classification task (and likely one of the highest false-negative scores)!

And with that, I end this blog post. The code for this project can be found here: https://github.com/Abhishaike/EEG_Event_Classification. Please let me know if you have any questions, and thank you for reading!