## ST443 Half Unit

Machine Learning and Data Mining

**This information is for the 2015/16 session.**

**Teacher responsible**

Dr Xinghao Qiao

**Availability**

This course is available on the MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research) and MSc in Statistics (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

**Pre-requisites**

The course will be taught from a statistical perspective and students must have a sound knowledge of statistical methods of regression analysis, as covered for example in *Statistical Inference: Principles, Methods and Computation* (ST425).

Students are not permitted to take this course alongside MG4E1 Algorithmic Techniques for Data Mining.

**Course content**

Machine learning and data mining are emerging fields between statistics and computer science which focus on the statistical objectives of prediction, classification and clustering and are particularly orientated to contexts where datasets are large, the so-called world of 'big data'. This course will start from the classical statistical methodology of linear regression and then build on this framework to provide an introduction to machine learning and data mining methods from a statistical perspective. Thus, machine learning will be conceived of as 'statistical learning', following the titles of the books in the essential reading list. The course will aim to cover modern non-linear methods such as generalized additive models, decision trees, boosting, bagging and support vector machines, as well as more advanced linear approaches, such as LASSO, linear discriminant analysis, k-means clustering and nearest neighbours. Methods suitable for time series data, such as the use of state-space models, will also be considered.

**Teaching**

20 hours of lectures and 10 hours of computer workshops in the LT.

The first part of the course reviews regression methods and covers linear discriminant analysis, variable selection, nearest neighbours, shrinkage and dimension reduction methods. The second part of the course introduces non-linear models and covers polynomial regression, splines, generalized additive models, tree methods, bagging, boosting, support vector machines, k-means, hierarchical clustering and some methods suited to time series.

Week 6 will be used as a reading week and exercises will be given out to students to work through at home.

**Formative coursework**

Students will be expected to produce 8 problem sets in the LT.

The problem sets will consist of some theory questions and data problems that require the implementation of different methods in class using a computer package.

**Indicative reading**

James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2014. Available online at http://www-bcf.usc.edu/~gareth/ISL/

Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition, Springer, 2009. Available online at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html

Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

**Assessment**

Exam (80%, duration: 1 hour and 50 minutes, reading time: 10 minutes) in the main exam period.

Project (20%) in the ST.

**Key facts**

Department: Statistics

Total students 2014/15: Unavailable

Average class size 2014/15: Unavailable

Controlled access 2014/15: No

Value: Half Unit

**Personal development skills**

- Self-management
- Team working
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Specialist skills