Creating The Reinforcement Learning Playground

Now that we have a playground for our bot (i.e., an image with blood vessels and brain tissue), we need a place for it to learn to avoid blood vessels while maximizing electrode density. To do this, we shrink our spatial volume down to a small square crop of the original input image.
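
As a rough sketch of that cropping step (the crop size, offsets, and the `image` array here are illustrative stand-ins, not the values from my actual run):

```python
import numpy as np

# Stand-in for the segmented input frame (H x W x 3).
image = np.zeros((512, 512, 3), dtype=np.uint8)

# Take a small square crop to serve as the play area.
CROP_SIZE = 64
y0, x0 = 100, 100  # top-left corner of the crop (illustrative)
play_area = image[y0:y0 + CROP_SIZE, x0:x0 + CROP_SIZE]
```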

This small play area will be the place our bot learns to make selections that maximize the distance from blood vessels.

Blue (Blood Vessels) | Green (Learner's Selection) | Red (Nearest Blood Vessel Point)

Given any point in the image where the learner chooses to “place an electrode thread,” we will calculate the distance to the nearest pixel containing a blood vessel.
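
One convenient way to get that distance for every pixel at once is a Euclidean distance transform. Here is a minimal sketch, assuming `vessel_mask` is a boolean array that is True wherever the color mask found a vessel:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def vessel_distance_map(vessel_mask: np.ndarray) -> np.ndarray:
    """Distance (in pixels) from every pixel to the nearest vessel pixel.

    distance_transform_edt measures the distance to the nearest zero,
    so we invert the mask: vessels become zero, tissue stays nonzero.
    """
    return distance_transform_edt(~vessel_mask)

# Example: reward for placing a thread at (row, col).
vessel_mask = np.zeros((64, 64), dtype=bool)
vessel_mask[32, 32] = True  # a single vessel pixel, for illustration
dist = vessel_distance_map(vessel_mask)
reward = dist[10, 10]  # distance from (10, 10) to the vessel pixel
```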

Q Learning

For Part 1, our learner is going to be the tried-and-true, model-free Q Learning algorithm. To fully understand Q Learning, you Have To Memorize This Equation Entirely.
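
For the record, the equation in question is the standard Q Learning update rule:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```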

Seriously! Get this tattooed on your arm.

Of course I jest. Although that might appear terrifying for those of us less acquainted with the math, the Q Learning formula really describes something quite simple.

Let's make this personal. How do you make decisions? If we were to follow you everywhere for a couple of days and summarize your decision-making process, I am willing to bet it would contain these elements in one form or another.

1. You observe: what is going on around you, what facts do you have available?
2. You act: usually based on habits or principles.
3. You receive feedback: you decided to yell at your boss; your feedback is a demotion.
4. You update: your habits or principles, based on that feedback.
5. Rinse & repeat.

As we go through life, we repeat this cycle countless times. It is analogous to what Q Learning is doing.

In the case of Q Learning, we explicitly record the actions the learner took, and the feedback it received from those actions, in something called the Q Table.

In the image above, if the learner found itself in state 1 (row 1), it would immediately see that taking action 5 (column 5) yields the highest reward for that state: 100.

The Q Table starts out filled with random values. As the learner spends more time in the environment observing, taking actions, and receiving feedback, the Q Table gets updated and becomes a better guide to which decision should be made in a given state. It is quite literally a lookup table telling the learner what to do in a scenario it has previously encountered.
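
A minimal sketch of the table as a data structure (the state and action counts here are my assumptions for illustration; in our playground, each cell of the crop could be a state and the four moves the actions):

```python
import numpy as np

N_STATES = 64 * 64  # e.g., one state per cell of a 64 x 64 crop
N_ACTIONS = 4       # left, right, up, down

# Start with random values, as described above.
q_table = np.random.uniform(low=0.0, high=1.0, size=(N_STATES, N_ACTIONS))

def best_action(state: int) -> int:
    """Look up the row for this state and take the highest-valued action."""
    return int(np.argmax(q_table[state]))
```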

That nasty Q formula, then, is simply the mathematical rule that tells us how to update the table as the learner takes actions in states and receives rewards.
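
In code, one application of that update might look like this (alpha and gamma here are typical illustrative values, not tuned ones):

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.95):
    """One step of the Q Learning update rule.

    alpha is the learning rate, gamma the discount factor.
    """
    best_next = q_table[next_state].max()   # max_a Q(s', a)
    td_target = reward + gamma * best_next  # r + gamma * max_a Q(s', a)
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```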

Combining Q Learning With Our Playground

Now that we have a place in which the learner can receive feedback on how close its selections are to blood vessels, and a method to allow the learner to select better sites over time (Q Learning), let's learn!

Here I let the learner play for 25,000 iterations. In each iteration, it has 4 choices: move left, right, up, or down. If at any point the learner collides with a blood vessel, it receives a massive penalty. Otherwise, its reward is the distance to the nearest blood vessel. In this simulation, we expect the average distance to the nearest blood vessel to increase over time (i.e., the learner learns to avoid hemorrhaging).
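
Putting it together, the training loop might look roughly like the sketch below. The `env` object is a hypothetical wrapper around the cropped image; its `reset`/`step` interface, the epsilon-greedy exploration, and the hyperparameters are all my assumptions, not details from the original run:

```python
import numpy as np

def train(env, q_table, n_iterations=25_000,
          alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q Learning with epsilon-greedy exploration.

    env.reset() -> state id; env.step(action) -> (next_state, reward, done),
    where reward is the distance to the nearest vessel and done means
    the learner collided with one (and ate the massive penalty).
    """
    state = env.reset()
    for _ in range(n_iterations):
        # Occasionally explore; otherwise exploit the table.
        if np.random.rand() < epsilon:
            action = np.random.randint(4)  # left, right, up, down
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done = env.step(action)

        # Same Q Learning update rule as above.
        best_next = q_table[next_state].max()
        q_table[state, action] += alpha * (
            reward + gamma * best_next - q_table[state, action])

        state = env.reset() if done else next_state
    return q_table
```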

Our learner's baby photos! Here, most of the actions it's taking are nonsensical and semi-random.

Over time, however, it did succeed in (unstably) maximizing reward (distance to the nearest vessel). This plot shows our per-episode reward over time.

The y-axis represents our reward; the x-axis represents the episode number.

There is barely a trend towards higher average reward over time.

Lessons Learned & Part 2

It should go without saying that Q Learning is not the optimal solution to my problem formulation. A lookup table simply cannot generalize across the enormous state space of real images. To create something robust and truly functional, I think the powers of Deep Learning will be required.

Despite that, I still had a blast getting a hands-on look at some image manipulation techniques (color segmentation) and Q Learning. Although the learner didn’t perform as well as I would have liked, it was a more effective lesson in Q Learning than any of the university courses I have taken.

Part 2

In Part 2, I am going to drill down on blood vessel segmentation. Although my naive color mask produced usable images for the simulation, I think I can do better using Semantic Segmentation.

I am going to reach out to Neuralink and see if they are willing to supply me with video footage from their surgery bot. This would make the task of training a ConvNet to perform semantic segmentation of blood vessels much easier. If you know anyone who works at Neuralink, please ask for that video footage on my behalf. Thus far I have received no responses.

See you in Part 2!
