ST446 Half Unit
Distributed Computing for Big Data

This information is for the 2017/18 session.

Teacher responsible

Prof Milan Vojnovic COL 2.05A

Availability

This course is available on the MSc in Data Science. This course is available with permission as an outside option to students on other programmes where regulations permit.

Pre-requisites

Some basic programming knowledge of Python or another programming language is desirable.

Course content

The course covers basic principles and techniques for distributed processing of large-scale datasets across clusters of computers, with an emphasis on machine learning tasks. It covers the basic principles of the different computation paradigms developed for batch, streaming, iterative, and graph data processing. The course largely uses the Apache Hadoop computing framework and, in particular, Apache Spark, a popular, fast and general engine for large-scale data processing. The course also covers the basic principles of numerical computation using data flow graphs, which are used to run computations on multiple CPUs and GPUs when training deep neural networks, for example in the popular open-source software libraries TensorFlow, developed by Google, and the Microsoft Cognitive Toolkit, developed by Microsoft.
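As a rough illustration of the batch paradigm mentioned above, the following minimal Python sketch counts words over a dataset that has been split into partitions: each partition is processed independently (the map step) and the partial results are then merged (the reduce step). It runs on a single machine and uses only the standard library; the function names and the toy data are hypothetical and are not taken from the course materials, and a framework such as Spark would distribute the same two steps across a cluster.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count the words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two partitions."""
    return left + right


def word_count(partitions):
    # Each partition is processed independently, so on a cluster this
    # step could run in parallel on different machines.
    partial = [map_partition(p) for p in partitions]
    # Merge the per-partition results into a single total.
    return reduce(merge_counts, partial, Counter())


# Hypothetical toy dataset, already split into two partitions.
partitions = [
    ["big data on a cluster", "data flow graphs"],
    ["streaming data", "graph data processing"],
]
totals = word_count(partitions)
print(totals.most_common(3))
```

Because the merge step is associative, the partial counts can be combined in any order, which is what allows this kind of computation to be spread over many machines.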

The course covers canonical machine learning tasks that arise in real-world applications such as recommendation of items to users, k-means clustering for anomaly detection, latent semantic analysis of textual data sources such as Wikipedia, analysis of online social networks, analysis of geospatial data, estimation of financial risk, and deep learning for image recognition.

Teaching

20 hours of lectures and 15 hours of computer workshops in the LT.

Formative coursework

Students will be expected to produce 10 problem sets in the LT.

Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.

Indicative reading

Francesco Pierfederici. Distributed Computing with Python. Packt Publishing, 2016.

Tom White. Hadoop: The Definitive Guide. O'Reilly, 2015.

Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.

Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly, 2015.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Open source software

Dispy: Distributed and parallel computing with Python: http://dispy.sourceforge.net 

Spark: http://spark.apache.org  

TensorFlow: Open Source Software Library for Machine Intelligence: https://www.tensorflow.org

The Microsoft Cognitive Toolkit: https://www.microsoft.com/en-us/research/product/cognitive-toolkit

NVIDIA GPUs – The Engine of Deep Learning: https://developer.nvidia.com/deep-learning

Assessment

Project (80%) in the LT.
Continuous assessment (10%) in Week 4.
Continuous assessment (10%) in Week 7.

The main assessment will consist of an individual project to develop a package for fitting statistical models of the student's own choice to big data sets.

In addition, among the 10 weekly problem sets, there will be two (in weeks 4 and 7) which will contribute to summative assessment (10% each).

Key facts

Department: Statistics

Total students 2016/17: Unavailable

Average class size 2016/17: Unavailable

Controlled access 2016/17: No

Value: Half Unit


Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills