A Demonstration of Q-Learning
A Simple Example of Reinforcement Learning

The Algorithm

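This implementation of the Q-learning algorithm uses the following update rule to calculate the new Q(s,a):
Q(s,a) = R(a') + gamma * max(Q(s',a'))
where gamma = 0.9, s' is the state reached by taking action a in state s, and the max is taken over the actions a' available in s'.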
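
The update rule can be sketched in a few lines of Python. This is a minimal illustration, not the applet's source: the dictionary-of-lists table layout, the function name q_update, and the state labels are assumptions made for the example.

```python
GAMMA = 0.9  # discount factor stated in the text

def q_update(q, state, action, next_state, reward):
    """Overwrite Q(state, action) with reward + GAMMA * max over a' of Q(next_state, a')."""
    q[state][action] = reward + GAMMA * max(q[next_state])
    return q[state][action]

# Hypothetical two-state table: each state holds four Q values, one per direction.
q = {"s": [0.0, 0.0, 0.0, 0.0], "s_next": [1.0, 4.0, 2.0, 0.0]}
new_value = q_update(q, "s", 2, "s_next", reward=0.0)
print(new_value)  # 0.9 * 4.0, roughly 3.6
```

Note that the rule replaces the old value outright rather than blending it with the new estimate, so there is no separate learning-rate parameter.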

The probability that the robot will move in a given direction is calculated with the following function:
P(a) = exp(alpha Q(s,a)) / sum(i = 0..3, exp(alpha Q(s,ai)))
where alpha = 0.5
Using this approach, all directions are possible, but those with higher Q values are more likely.
The alpha value was included to flatten the difference between higher and lower Q values somewhat.
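
As a rough sketch of how this selection rule might be coded (the function names are hypothetical, and subtracting the largest Q value before exponentiating is an added numerical safeguard that does not change the resulting probabilities):

```python
import math
import random

ALPHA = 0.5  # scaling factor stated in the text

def action_probabilities(q_values, alpha=ALPHA):
    """Return P(a) = exp(alpha * Q(s,a)) / sum_i exp(alpha * Q(s,ai))."""
    # Shifting by the max cancels out in the ratio but avoids overflow in exp().
    top = max(q_values)
    weights = [math.exp(alpha * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(q_values, rng=random):
    """Sample a direction index (0..3) according to the probabilities above."""
    probs = action_probabilities(q_values)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

# One direction has Q = 2, the other three have Q = 0.
probs = action_probabilities([0.0, 2.0, 0.0, 0.0])
print([round(p, 3) for p in probs])
```

With alpha = 0.5 the favored direction in this example is chosen a little under half the time. A larger alpha would sharpen the preference, and a smaller one would flatten it further.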

Instructions

The display

The display consists of three regions.

The first, on the left side of the applet, is the largest. It shows the whole 20 by 20 grid, including walls (black squares), absorbing reward locations (numbered red and green squares), and the current location of the robot (a blue circle).

The second region is at the upper right. It shows a sample of nine grid locations from the whole map, centered on the robot's current location. Each location is drawn as a wall (blue square), a reward location (red or green square), or a normal location, which displays the Q values associated with moving in each of the four cardinal directions.

The third region is the button panel. Here the user sets the value for a reward region, starts the animation, and sets its speed.

Inserting Walls

To insert a wall tile into the grid, click on the desired location with the left mouse button. This adds a single obstacle tile.

Inserting a Reward region

Right-clicking on the grid adds a reward region with the value currently displayed in the reward value field of the button panel.

Starting the robot

Once the grid is laid out, pressing the start button starts the robot moving around its environment. When the robot reaches a reward state, it is immediately moved to a random location on the grid map. The Q values are updated in real time, and the changes are reflected in the Q value display in the upper right-hand corner of the applet. The delay between movements is controlled by the lower slider bar. Delay times range from 0 to 1 second.
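
The robot's behavior once started can be approximated by the loop below. This is only a sketch under several assumptions that the description does not settle: the order of the four directions, that bumping into a wall or the grid edge leaves the robot in place, that the reward is collected on entering a reward square, and that the random relocation avoids walls and reward squares. All names are hypothetical.

```python
import math
import random

SIZE = 20     # the grid is 20 by 20
GAMMA = 0.9   # discount factor from the update rule
ALPHA = 0.5   # scaling factor from the selection rule
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # assumed order: up, right, down, left

def pick_action(q_values, rng):
    # Weighted choice with weights proportional to exp(ALPHA * Q);
    # shifting by the max avoids overflow without changing the ratios.
    top = max(q_values)
    weights = [math.exp(ALPHA * (q - top)) for q in q_values]
    return rng.choices(range(4), weights=weights, k=1)[0]

def random_free_cell(walls, rewards, rng):
    # Assumption: the robot is never dropped onto a wall or a reward square.
    while True:
        cell = (rng.randrange(SIZE), rng.randrange(SIZE))
        if cell not in walls and cell not in rewards:
            return cell

def step(pos, q, walls, rewards, rng):
    """Move once, update Q for the move taken, and return the robot's new position."""
    action = pick_action(q[pos], rng)
    dx, dy = MOVES[action]
    target = (pos[0] + dx, pos[1] + dy)
    inside = 0 <= target[0] < SIZE and 0 <= target[1] < SIZE
    # Assumption: a blocked move leaves the robot where it was.
    nxt = target if inside and target not in walls else pos
    reward = rewards.get(nxt, 0.0)
    q[pos][action] = reward + GAMMA * max(q[nxt])
    if nxt in rewards:
        # Reward squares are absorbing: relocate the robot at random.
        return random_free_cell(walls, rewards, rng)
    return nxt

rng = random.Random(0)
walls = {(5, 5)}
rewards = {(3, 3): 10.0}
q = {(x, y): [0.0] * 4 for x in range(SIZE) for y in range(SIZE)}
pos = (0, 0)
for _ in range(1000):
    pos = step(pos, q, walls, rewards, rng)
```

Because the robot never acts from a reward square, the Q values stored there stay at zero, so the value learned for stepping onto one is just the reward itself.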

Caveat

As the applet stands, there is no way to stop the animation once it has started. In order to see the performance of the robot in a new grid environment, the user must reload the applet.

The Applet