A Reinforcement Learning method applied to a car simulator developed using pyGame. The car learned to drive on its own with less than 10 minutes of simulated training. The states considered were the signal color (red, green) and the directions in which other cars were moving, with the assumption that the car travels using a GPS system which acts as a guide.

To test how well any system works we need metrics to judge it by. For this particular example, two metrics are used: the safety and the reliability of the driving.

Consider a system where the choice is made randomly at each step, i.e. the car moves in a random direction at each step without following any rules. The results would be as follows.
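For concreteness, here is a minimal sketch of such a random baseline; the action set and the agent interface are illustrative assumptions modeled on the pyGame simulator, not its exact API.

    import random

    # Hypothetical action set: stay idle, or move in one of three directions.
    ACTIONS = [None, 'forward', 'left', 'right']

    class RandomAgent:
        """Baseline agent that ignores all state information."""

        def choose_action(self, state):
            # Pick uniformly at random, paying no attention to the
            # signal light, the traffic, or the GPS direction.
            return random.choice(ACTIONS)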

A 10-trial rolling-mean experiment was conducted 10 times to remove any chance that the system might give biased results. As expected, the system fails miserably: both the safety and the reliability ratings fail.

States / Features used to make decisions

- GPS direction (left, right, straight): the direction suggested by the GPS guide toward the destination.
- Signal light (red, green): the state of the traffic light at the intersection.
- Oncoming traffic (left, right, straight, None): the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
- Right sensor (forward, None): the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.

We only need to consider whether the vehicle to the right is moving forward; in all other cases the cab has the right of way according to US driving rules.

If we use the above features, the total number of combinations is 3 * 2 * 4 * 2 = 48 states, so if the system can learn the correct behavior for all of these states it will be able to drive perfectly.
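As a sketch of how such a state could be assembled from the sensor readings (the input names here are illustrative assumptions, not the simulator's exact interface):

    def build_state(waypoint, inputs):
        """Collapse the raw sensor readings into one of the 48 learnable states.

        waypoint: 'left' | 'right' | 'straight'   (3 values, from the GPS)
        inputs:   dict with 'light', 'oncoming' and 'right' readings
        """
        return (
            waypoint,                      # 3 values
            inputs['light'],               # 'red' | 'green'            -> 2 values
            inputs['oncoming'],            # direction or None          -> 4 values
            inputs['right'] == 'forward',  # only 'forward' matters     -> 2 values
        )  # 3 * 2 * 4 * 2 = 48 distinct states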

Q-Learning

The concept of Q-Learning is fairly straightforward: for every state the agent visits, create an entry in the Q-table for all state-action pairs available in that state. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received, using the iterative update rule. An additional benefit of Q-Learning is that we can have the agent choose the best action for each state based on the Q-values of each possible state-action pair.
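A minimal sketch of that update rule is shown below; the learning rate alpha and discount factor gamma are illustrative assumptions, not values prescribed by the project.

    def update_q(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
        # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
        best_next = max(Q[next_state].values()) if next_state in Q else 0.0
        Q[state][action] = ((1 - alpha) * Q[state][action]
                            + alpha * (reward + gamma * best_next))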

This is how the Q-table is formed: each state is a key of the self.Q dictionary, and each value is another dictionary that maps actions to Q-values. Here is an example:

    {
        'state-1': {'action-1': Qvalue-1, 'action-2': Qvalue-2, ...},
        'state-2': {'action-1': Qvalue-1, ...},
        ...
    }
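Creating an entry for a newly visited state might look like the following sketch; the default Q-value of 0.0 and the method name are assumptions.

    ACTIONS = [None, 'forward', 'left', 'right']  # hypothetical action set

    def create_q_entry(self, state):
        # Register the state the first time it is visited, giving every
        # available action a neutral starting Q-value.
        if state not in self.Q:
            self.Q[state] = {action: 0.0 for action in ACTIONS}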

The agent is expected to eventually learn from its behavior, but it is also important that the agent acts on what it has learned so far. This balance between exploration and exploitation can be struck using a decaying epsilon factor, as sketched below.
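A sketch of epsilon-greedy selection under these assumptions: with probability epsilon the agent explores a random action, and otherwise it exploits the best action learned so far, breaking ties at random.

    import random

    ACTIONS = [None, 'forward', 'left', 'right']  # hypothetical action set

    def choose_action(self, state):
        # Explore: with probability epsilon, try a random action.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        # Exploit: otherwise take an action with the highest Q-value.
        q_values = self.Q[state]
        best = max(q_values.values())
        return random.choice([a for a, q in q_values.items() if q == best])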

The following are the results with the Q-Learning method rather than random choices.

As we can see, the system is more reliable now and the number of bad actions has dropped, but the system isn't ready yet: it is able to reach the destination, but not without accidents. This indicates that it needs to learn more; to achieve that, I modified epsilon so that the model learns more slowly.

epsilon = 1 / (t * t), where t is the number of trials
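With this schedule the exploration factor decays as the trials accumulate; a few lines verify the values:

    for t in range(1, 6):
        print(t, 1 / (t * t))
    # 1 1.0
    # 2 0.25
    # 3 0.1111111111111111
    # 4 0.0625
    # 5 0.04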

After modifying the epsilon schedule, the new results are more reliable, as shown below.

To ensure robustness, the number of trials has also been increased to 300 instead of 200, and you can observe how the exploration factor decreases after some time, so the system tends to act based on the state information it has learned from previous trials.

Full details of the code, along with the Python simulator code, are available on my GitHub page.