CM50270 Reinforcement Learning
Graded Assessment: Racetrack
In this assignment, you will compare the performance of three reinforcement learning algorithms – On-Policy First-Visit Monte-Carlo Control, Sarsa, and Q-Learning – in a simple racetrack environment. You will then implement a modified TD agent that improves upon the learning performance of a basic Q-Learning agent.
Total number of marks: 30 Marks
Contribution to Unit Grade: 30%
What to submit: Your completed Jupyter notebook (.ipynb file) which should include all of your source code. Please do not change the file name or compress/zip your submission.
Where to submit: CM50270 Moodle Page.
This coursework will be marked anonymously. Please do not include any identifying information on the files you submit.
You are required to work individually on this coursework. You are welcome to discuss ideas with others, but you must design your own implementation and write your own code and answers. If you include any third-party code or text in your submission, please reference it appropriately.
Do not plagiarise. Plagiarism is a serious academic offence. Both your code and written answers will be automatically checked for possible instances of plagiarism. For details on what plagiarism is and how to avoid it, please visit the following webpage: http://www.bath.ac.uk/library/help/infoguides/plagiarism.html
If you are asked to use specific variable names, data-types, function signatures and notebook cells, please ensure that you follow these instructions. Not doing so will cause our marking software to reject your work and assign you a score of zero for that exercise. Please do not delete or duplicate existing cells: if you need additional cells, please insert new ones. If our marking software rejects your work because you have not followed our instructions, you may not get any credit for your work.
For this coursework, you may use the Python standard library, numpy, and matplotlib. You should also use the racetrack_env.py file, which we have provided for you. Please do not use any other non-standard, third-party libraries. If we are unable to run your code because you have used unsupported external libraries, you may not get any credit for your work.
Please ensure that your code is readable. If we cannot tell what your code is doing when marking, you may not get full credit for your work.
Please remember to save and backup your work regularly.
Please be sure to restart the kernel and run your code from start-to-finish (Kernel → Restart & Run All) before submitting your notebook. Otherwise, you may not be aware that you are using variables in memory that you have deleted.
Your total runtime must be less than 10 minutes on the University’s lab computers. If your submission exceeds this, it will be automatically interrupted, and you may not get full credit for your work.
Please adhere to written answer length limits. When marking, we will not read beyond the specified maximum word counts, and you may not get any credit for anything written beyond them.
The Racetrack Environment
We have implemented a custom environment called “Racetrack” for you to use during this piece of coursework. It is inspired by the environment described in the course textbook (Reinforcement Learning, Sutton & Barto, 2018, Exercise 5.12), but is not exactly the same.
Environment Description
Consider driving a race car around a turn on a racetrack. In order to complete the race as quickly as possible, you would want to drive as fast as you can but, to avoid running off the track, you must slow down while turning.
In our simplified racetrack environment, the agent is at one of a discrete set of grid positions. The agent also has a discrete speed in two directions, $x$ and $y$. So the state is represented as follows:
$$(\text{position}_y, \text{position}_x, \text{velocity}_y, \text{velocity}_x)$$
The agent collects a reward of -1 at each time step, an additional -10 for leaving the track (i.e., ending up on a black grid square in the figure below), and an additional +10 for reaching the finish line (any of the red grid squares). The agent starts each episode on a randomly selected grid square on the starting line (green grid squares) with a speed of zero in both directions. At each time step, the agent can change its speed in each direction by +1, -1 or 0, giving a total of nine actions. For example, the agent may change its speed in the $x$ direction by -1 and its speed in the $y$ direction by +1. The agent's speed cannot be greater than +10 or less than -10 in either direction.
The agent’s next state is determined by its current grid square, its current speed in two directions, and the changes it makes to its speed in the two directions. This environment is stochastic. When the agent tries to change its speed, no change occurs (in either direction) with probability 0.2. In other words, 20% of the time, the agent’s action is ignored and the car’s speed remains the same in both directions.
If the agent leaves the track, it is returned to a random start grid-square and has its speed set to zero in both directions; the episode continues. An episode ends only when the agent transitions to a goal grid-square.
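As a worked example (assuming, as in the textbook version of this task, that the car moves by its current speed each time step): suppose the agent is in state $(2, 3, 0, 1)$ and selects the action that adds $+1$ to its speed in both directions. With probability 0.8 its speed becomes $(1, 2)$ and the next state is $(3, 5, 1, 2)$; with probability 0.2 the action is ignored, the speed stays at $(0, 1)$, and the next state is $(2, 4, 0, 1)$. In either case the agent receives a reward of -1, provided it remains on the track.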
Environment Implementation
We have implemented the above environment in the racetrack_env.py file, for you to use in this coursework. Please use this implementation instead of writing your own, and please do not modify the environment.
We provide a RacetrackEnv class for your agents to interact with. The class has the following methods:
reset() – this method initialises the environment, chooses a random starting state, and returns it. This method should be called before the start of every episode.
step(action) – this method takes an integer action (more on this later), and executes one time-step in the environment. It returns a tuple containing the next state, the reward collected, and whether the next state is a terminal state.
render(sleep_time) – this method renders a matplotlib graph representing the environment. It takes an optional float parameter giving the number of seconds to display each time-step. This method is useful for testing and debugging, but should not be used during training since it is very slow. Do not use this method in your final submission.
get_actions() – a simple method that returns the available actions in the current state. Always returns a list containing integers in the range [0-8] (more on this later).
In our code, states are represented as Python tuples – specifically a tuple of four integers. For example, if the agent is in a grid square with coordinates ($Y = 2$, $X = 3$), and is moving zero cells vertically and one cell horizontally per time-step, the state is represented as (2, 3, 0, 1). Tuples of this kind will be returned by the reset() and step(action) methods. It is worth noting that tuples can be used to index certain Python data-structures, such as dictionaries.
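For example, because states are hashable tuples, a tabular action-value estimate can be stored in a dictionary keyed by (state, action) pairs. The sketch below is illustrative only; none of these variable names are required by the marking software.

from collections import defaultdict

# Action-value table: unseen (state, action) pairs default to a value of 0.0.
q_values = defaultdict(float)

state = (2, 3, 0, 1)   # y = 2, x = 3, moving 0 cells vertically and 1 horizontally
action = 4             # any action index in the range [0-8]

q_values[(state, action)] += 0.1   # update the entry for this state-action pair
print(q_values[(state, action)])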
There are nine actions available to the agent in each state, as described above. However, to simplify your code, we have represented each of the nine actions as an integer in the range [0-8]. The table below shows the index of each action, along with the corresponding changes it will cause to the agent’s speed in each direction.
For example, taking action 8 will increase the agent’s speed in the $x$ direction, but decrease its speed in the $y$ direction.
Racetrack Code Example
Below, we go through a quick example of using the RacetrackEnv class.
First, we import the class, then create a RacetrackEnv object called env. We then initialise the environment using the reset() method, take a look at the initial state variable, and render the environment.
%matplotlib inline
# Set a random seed to make this example reproducible.
import random

import numpy as np

seed = 5  # the specific value is arbitrary; any fixed seed will do
random.seed(seed)
np.random.seed(seed)

from racetrack_env import RacetrackEnv

# Instantiate environment object.
env = RacetrackEnv()

# Initialise/reset environment.
state = env.reset()
env.render()

print("Initial State: {}".format(state))
As you can see, reset() has returned a valid initial state as a four-tuple. The render() method uses the same colour-scheme as described above, but also includes a yellow grid-square to indicate the current position of the agent.
Let’s make the agent go upward by using step(1), then inspect the result (recall that action 1 increments the agent’s vertical speed while leaving the agent’s horizontal speed unchanged).
# Let us increase the agent's vertical speed (action 1).
next_state, reward, terminal = env.step(1)
env.render()

print("Next State: {}, Reward: {}, Terminal: {}".format(next_state, reward, terminal))
You can see that the agent has moved one square upwards, and now has a positive vertical speed (indicated by the yellow arrow). Let’s set up a loop to see what happens if we take the action a few more times, causing it to repeatedly leave the track.
# Take the same action repeatedly, rendering the environment at each step.
num_steps = 50
for t in range(num_steps):
    next_state, reward, terminal = env.step(1)
    env.render()
    if terminal:
        break  # stop early if the agent happens to reach the finish line
Exercise 1: Comparing Fundamental RL Algorithms (12 Marks)
Below, we have plotted learning curves showing the performance of On-Policy Monte Carlo Control, Sarsa, and Q-Learning in the Racetrack environment.
We have included an unaltered version of the learning curve, as well as a cropped version to make it easier to compare agents’ performance towards the end of training.
from racetrack_env import plot_combined_results
# Plotting Combined Learning Curve.
%matplotlib inline
plot_combined_results()
Based on these results, and your understanding of the three algorithms used to produce them, please answer the following discussion questions.
Question 1: Briefly compare the performance of each of the three agents.
Question 2: Why do you think that the Monte Carlo and Temporal-Difference agents behaved differently?
Question 3: Does the performance of the Sarsa and Q-Learning agents meet your expectations? Why do you think that this is the case?
Question 4: What could be done to improve the performance of these agents?
Please do not exceed 60 words for any of your answers.
Please write your answers for Exercise 1 in this markdown cell.
Exercise 2: Modified Temporal-Difference Learning Agent (18 Marks)
Exercise 2a: Implementation
In this exercise, you must implement a Temporal-Difference learning agent that learns to reach a goal state in the racetrack more efficiently than the Q-Learning agent shown above. You may base your implementation on Q-Learning (Reinforcement Learning, Sutton & Barto, 2018, Section 6.5 p.131), the pseudocode for which is reproduced below, but you may also base your implementation on Sarsa if you wish.
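Q-Learning (off-policy TD control), for estimating $\pi \approx \pi_*$:

Algorithm parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$.
Initialise $Q(s, a)$ for all states $s$ and actions $a$, arbitrarily, except that $Q(\text{terminal}, \cdot) = 0$.
Loop for each episode:
    Initialise $S$.
    Loop for each step of the episode, until $S$ is terminal:
        Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy).
        Take action $A$, observe $R$, $S'$.
        $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]$.
        $S \leftarrow S'$.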
In order to score high marks in this exercise, you will need to extend your solution beyond a simple Q-Learning or Sarsa agent to achieve a higher return and/or to learn more efficiently (i.e. using fewer interactions with the environment). Ideas for improving your agent will have been discussed in lectures, and more can be found in the unit textbook (Reinforcement Learning, Sutton & Barto, 2018). However you go about improving your agent, it must still use a tabular Temporal-Difference learning method at its core (i.e., it should not make use of function approximation, neural networks etc.).
Please use the following parameter settings:
Number of training episodes $= 150$.
Number of agents averaged should be at least 5.
If you use incorrect parameters, you may not get any credit for your work.
You may adjust all other parameters as you see fit.
Your implementation of a tabular modified Temporal-Difference learning agent should produce a list named modified_agent_rewards. This list should contain one list for each agent that you train. Each sub-list should contain the undiscounted sum of rewards earned during each episode by the corresponding agent.
For example, if you train $20$ agents, your modified_agent_rewards list will contain $20$ sub-lists, each containing $150$ integers. This list will be used to plot an average learning curve, which will be used to mark your work.
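To make the required structure concrete, here is a minimal sketch of how the results list might be assembled. Everything other than the name modified_agent_rewards and the parameter values is illustrative; the agent itself, and how you train it, is for you to implement.

# Illustrative skeleton only -- replace the placeholder logic with your modified TD agent.
NUM_AGENTS = 5        # train and average at least 5 agents
NUM_EPISODES = 150    # fixed number of training episodes per agent

modified_agent_rewards = []
for agent in range(NUM_AGENTS):
    # ... initialise your agent's value table and parameters here ...
    agent_rewards = []
    for episode in range(NUM_EPISODES):
        # ... run one training episode against RacetrackEnv ...
        episode_return = 0  # placeholder: undiscounted sum of rewards for this episode
        agent_rewards.append(episode_return)
    modified_agent_rewards.append(agent_rewards)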
# Please write your code for Exercise 2a in this cell or in as many cells as you want ABOVE this cell.
# You should implement your modified TD learning agent here.
# Do NOT delete or duplicate this cell.
# YOUR CODE HERE
Exercise 2b: Comparison & Discussion
Below, we have used your results to plot the performance of your modified agent and a Q-Learning agent on the same set of axes.
A cropped version of this learning curve has also been plotted, to make it easier to compare the performance of your agents towards the end of training.
If you wish, you may plot additional graphs below these learning curves to support the points you make in your discussion.
from racetrack_env import plot_modified_agent_results
from racetrack_env import simple_issue_checking
# Checking Modified Agent Results for Obvious Issues.
simple_issue_checking(modified_agent_rewards, modified_agent=True)
# Plotting Modified Agent Learning Curve.
%matplotlib inline
plot_modified_agent_results(modified_agent_rewards)
Based on your results, and your understanding of the algorithm and modifications that you have implemented, please answer the following discussion questions.
Question 1: What modifications did you make to your agent?
Question 2: What effect(s) did you expect your modifications to have on the performance of your agent?
Question 3: Did your modifications have the effect(s) you expected? Why do you think that this was the case?
Question 4: If you had more time, what would you do to further improve the performance of your agent?
Please do not exceed 60 words for any of your answers.
Please note that your implementation and discussion will be assessed jointly. This means that, in order to score highly, you will need to correctly implement appropriate modifications to your agent AND discuss them well.
Please write your answers for Exercise 2b in this markdown cell.