ST443 Half Unit
Machine Learning and Data Mining

This information is for the 2020/21 session.

Teacher responsible

Dr Xinghao Qiao

Availability

This course is compulsory on the MSc in Data Science. This course is available on the MSc in Applied Social Data Science, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (LSE and Fudan), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

Pre-requisites

The course will be taught from a statistical perspective, and students must have a very solid understanding of linear regression models.

Students are not permitted to take this course alongside Algorithmic Techniques for Data Mining (MA429).

Course content

Machine learning and data mining are emerging fields between statistics and computer science that focus on the statistical objectives of prediction, classification and clustering, and are particularly orientated to contexts where datasets are large, the so-called world of 'big data'. This course will start from the classical statistical methodology of linear regression and then build on this framework to provide an introduction to machine learning and data mining methods from a statistical perspective. Thus, machine learning will be conceived of as 'statistical learning', following the titles of the books in the indicative reading list. The course will aim to cover modern non-linear methods such as spline methods, generalised additive models, decision trees, random forests, bagging, boosting and support vector machines, as well as more advanced linear approaches such as ridge regression, the lasso and linear discriminant analysis, together with k-means clustering and nearest neighbours.

Teaching

The first part of the course reviews regression methods and covers linear and quadratic discriminant analysis, cross-validation, variable selection, nearest neighbours, shrinkage and dimension reduction methods. The second part of the course introduces non-linear models and covers splines, generalised additive models, tree methods, bagging, random forests, support vector machines, principal components analysis, k-means and hierarchical clustering.

This course will be delivered through a combination of classes and lectures totalling a minimum of 15 hours across Michaelmas Term / 20 hours across Michaelmas Term. This year, some or all of this teaching may be delivered through a combination of virtual classes and flipped lectures delivered as short online videos. This course includes a reading week in Week 6 of Michaelmas Term.

Formative coursework

Students will be expected to produce 5 problem sets in the MT.

The problem sets will consist of theory questions and data problems that require implementing different methods covered in class using a computer package.

Indicative reading

James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2017. Available online at http://www-bcf.usc.edu/~gareth/ISL/

Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition. Springer, 2009. Available online at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html

Assessment

Exam (70%, duration: 2 hours) in the summer exam period.
Project (30%) in the MT Week 11.

Student performance results

(2016/17 - 2018/19 combined)

Classification   % of students
Distinction      32.2
Merit            35.6
Pass             24.9
Fail             7.3

Important information in response to COVID-19

Please note that during the 2020/21 academic year some variation to teaching and learning activities may be required to respond to changes in public health advice and/or to account for the situation of students in attendance on campus and those studying online during the early part of the academic year. For assessment, this may involve changes to the mode of delivery and/or the format or weighting of assessments. Changes will only be made if required, and students will be notified about any changes to teaching or assessment plans at the earliest opportunity.

Key facts

Department: Statistics

Total students 2019/20: 78

Average class size 2019/20: 26

Controlled access 2019/20: Yes

Value: Half Unit

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills