ST443      Half Unit
Machine Learning and Data Mining

This information is for the 2025/26 session.

Course Convenor

Dr Philipp Sterzinger

Availability

This course is compulsory on the MSc in Data Science. This course is available on the MPA in Data Science for Public Policy, MRes in Management (Marketing), MSc in Applied Social Data Science, MSc in Econometrics and Mathematical Economics, MSc in Economics and Management, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit. This course uses controlled access as part of the course selection process.

How to apply: Compulsory for MSc Data Science students. Priority is given to Department of Statistics students, including students on the MSc in Health Data Science and those with the course listed in their programme regulations. Early application is advisable (space permitting).

Students should check that they meet the pre-requisites in the course guide before applying. Students not from the Department of Statistics should submit a short statement indicating a) why they think the course is suitable for them given their background knowledge and b) their motivation for their choice.

Deadline for application: Interested students should apply as soon as possible after course selection opens, and no later than 10.00am on Friday 26 September 2025.

Course lecturers will aim to make initial offers to students on LSE For You by Friday 26 September.

For queries contact: Stats-Msc@lse.ac.uk

This course has a limited number of places (it is controlled access) and demand is typically high. This may mean that you’re not able to get a place on this course.

Requisites

Mutually exclusive courses:

This course cannot be taken with MA429 at any time on the same degree programme.

Additional requisites:

The course will be taught from a statistical perspective and students must have a very solid understanding of linear regression models.

Course content

Machine learning and data mining are fields situated between statistics and computer science. They focus on objectives such as prediction, classification and clustering, particularly in contexts where datasets are large, commonly referred to as the world of 'big data'.

This course will commence with the classical statistical methodology of linear regression as a foundation. From there, it will progress to provide an introduction to machine learning and data mining methods from a statistical perspective. In this framework, machine learning will be conceptualised as 'statistical learning', aligning with the titles of the books in the essential reading list.

The course aims to cover modern non-linear methods such as decision trees, random forests, bagging, boosting and support vector machines. Additionally, it will delve into advanced approaches, such as ridge regression, the LASSO, linear and quadratic discriminant analysis, k-means clustering, and k-nearest neighbours.

Teaching

15 hours of seminars and 20 hours of lectures in the Autumn Term.

This course has a reading week in Week 6 of Autumn Term.

The first part of the course reviews regression methods and covers logistic regression, linear and quadratic discriminant analysis, cross-validation, variable selection, nearest neighbours and shrinkage methods. The second part of the course introduces non-linear models and covers tree methods, bagging, random forests, boosting, support vector machines, principal components analysis, k-means clustering, and hierarchical clustering.

Formative assessment

Students will be expected to produce 5 problem sets in the Autumn Term.

The problem sets will consist of both theoretical questions and data problems that require the implementation of various methods in class using a computer.

Indicative reading

James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. 2nd Edition, Springer, 2021. Available online at https://www.statlearning.com/

Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition, Springer, 2009. Available online at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html

Assessment

Exam (70%), duration: 120 minutes, in the Spring exam period

Project (30%) in Autumn Term Week 11


Key facts

Department: Statistics

Course Study Period: Autumn Term

Unit value: Half unit

FHEQ Level: Level 7

Total students 2024/25: 77

Average class size 2024/25: 19

Controlled access 2024/25: No

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills