## ST455 Reinforcement Learning (Half Unit)

**This information is for the 2024/25 session.**

**Teacher responsible**

Mr Chengchun Shi COL 8.08

**Availability**

This course is available on the MPA in Data Science for Public Policy, MSc in Applicable Mathematics, MSc in Applied Social Data Science, MSc in Data Science, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Management of Information Systems and Digital Innovation, MSc in Operations Research & Analytics, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access). MSc Data Science students will be given priority for enrolment in this course, followed by students in the Department of Statistics (including students on the MSc in Health Data Science) and those with the course listed in their programme regulations.

**Pre-requisites**

The course requires some mathematics, in particular some use of vectors and some calculus. Basic knowledge of computer programming is expected. Knowledge of Python is useful.

**Course content**

This course is about reinforcement learning, covering the fundamental concepts of the reinforcement learning framework and its solution methods. The focus is on the underlying methodology as well as practical implementation and evaluation using software code. The course will cover the following topics:

- Introduction – course overview, epsilon-greedy, upper confidence bound algorithm, Thompson sampling
- Foundations of reinforcement learning – Markov decision process, Bellman optimality equation, the existence of optimal stationary policy
- Dynamic programming and Monte Carlo methods – policy evaluation, policy improvement, policy iteration, value iteration based on dynamic programming, and Monte Carlo methods for reinforcement learning, including Monte Carlo estimation and Monte Carlo control
- Temporal difference learning – temporal difference prediction, Sarsa, Q-learning, double Q-learning, n-step temporal difference prediction, TD(lambda)
- TD learning with function approximation – types of function approximators (value and action-value function approximators), gradient-based methods for value function prediction, convergence guarantees with linear function approximators, fitted Q-iteration
- Applications of TD learning – deep Q-network and the MDP order dispatch policy
- Policy-based learning – policy-gradient theorem, REINFORCE, actor-critic methods that combine policy function approximation with action-value function approximation
- Model-based learning – Dyna, Monte Carlo tree search, AlphaGo
- Batch policy optimisation – pessimistic principle, MOPO, lower confidence bound algorithm
- Batch off-policy evaluation – importance sampling-based method, doubly robust method, marginalized importance sampling, double reinforcement learning
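As a rough illustration of the first topic listed above, the following is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit. It is not taken from the course materials: the function name `epsilon_greedy_bandit`, the Gaussian reward model with unit variance, and all parameter values are assumptions chosen for the example.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a bandit with Gaussian rewards (illustrative only).

    true_means: hypothetical mean reward of each arm (unknown to the agent).
    Returns the agent's estimated mean per arm and the pull count per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of times each arm was pulled
    estimates = [0.0] * n_arms   # running sample-mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: choose an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: choose the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Assumed reward model: Gaussian noise with unit standard deviation.
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental update of the sample mean for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.9])
print("estimated means:", [round(e, 2) for e in estimates])
print("pull counts:", counts)
```

With a small `epsilon`, most pulls typically concentrate on the arm the agent currently believes is best, while the occasional random pull keeps the other estimates from going stale.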

**Teaching**

20 hours of lectures and 15 hours of classes in the WT.

This course will be delivered through a combination of classes and lectures totalling a minimum of 35 hours in Winter Term. This course includes a reading week in Week 6 of Winter Term.

**Formative coursework**

Students will be expected to produce 8 problem sets in the WT.

**Indicative reading**

- Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316887
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press. http://incompleteideas.net/book/RLbook2020.pdf
- OpenAI Gym, https://gym.openai.com/

**Assessment**

Project (80%), continuous assessment (10%) and continuous assessment (10%) in the WT.

Two of the weekly problem sets will be assessed (20% in total). Each assessed problem set is worth 10%, with submission required in WT Weeks 4 and 7. In addition, there will be a take-home exam (80%) in the form of a group project, in which students will demonstrate the ability to apply and evaluate different reinforcement learning algorithms.

**Key facts**

Department: Statistics

Total students 2023/24: 63

Average class size 2023/24: 31

Controlled access 2023/24: Yes

Value: Half Unit

**Personal development skills**

- Self-management
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Commercial awareness
- Specialist skills