ST446 Half Unit
Distributed Computing for Big Data

This information is for the 2021/22 session.

Teacher responsible

Prof Milan Vojnovic COL 5.05


Availability

This course is available on the MSc in Applied Social Data Science, MSc in Data Science, MSc in Econometrics and Mathematical Economics, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Operations Research & Analytics, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (LSE and Fudan), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access) and demand is typically high, so you may not be able to get a place. Students on the MSc in Data Science are given priority for enrolment.


Pre-requisites

Basic knowledge of Python or another programming language is desirable.

Course content

The course covers basic principles of systems for distributed processing of big data, including distributed file systems; distributed computation models such as MapReduce, resilient distributed datasets, and distributed dataflow graph computations; structured querying over large datasets; graph data processing systems; stream data processing systems; and scalable machine learning algorithms for classification, regression, collaborative filtering, topic modelling and other tasks. The course enables students to learn about the principles and gain hands-on experience in working with state-of-the-art computing technologies such as Apache Spark, a general engine for large-scale data processing, and TensorFlow, a popular software library for (distributed) learning of deep neural networks. Through weekly exercises and course project work, students can gain experience in performing data analytics tasks on their laptops and on cloud computing platforms.

For more information, please see the course handout:


Teaching

This course will be delivered through a combination of classes, lectures and Q&A sessions totalling a minimum of 35 hours across Lent Term. This year, some of this teaching may be delivered through a combination of virtual classes and flipped lectures provided as short online videos.

Formative coursework

Students will be expected to produce 10 problem sets in the LT.

Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.

Indicative reading

  • Karau, H., Konwinski, A., Wendell, P. and Zaharia, M., Learning Spark: Lightning-Fast Big Data Analysis, O’Reilly, 2015
  • Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
  • Drabas, T. and Lee, D., Learning PySpark, Packt, 2016
  • White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
  • Apache Spark Documentation
  • TensorFlow Documentation


Assessment

Project (80%) in the LT.
Continuous assessment (10%) in the LT Week 4.
Continuous assessment (10%) in the LT Week 7.

The project assessment consists of applying distributed computing methods to analyse a large dataset and evaluating the efficiency of those methods.

In addition, two of the 10 weekly problem sets (in Weeks 4 and 7) will contribute to summative assessment (10% each).

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Student performance results

(2017/18 - 2019/20 combined)

Classification   % of students
Distinction      42.9
Merit            39.3
Pass             15.5
Fail             2.4

Important information in response to COVID-19

Please note that during the 2021/22 academic year some variation to teaching and learning activities may be required to respond to changes in public health advice and/or to account for the differing needs of students in attendance on campus and those who might be studying online. For example, this may involve changes to the mode of teaching delivery and/or the format or weighting of assessments. Changes will only be made if required, and students will be notified about any changes to teaching or assessment plans at the earliest opportunity.

Key facts

Department: Statistics

Total students 2020/21: 53

Average class size 2020/21: 27

Controlled access 2020/21: Yes

Value: Half Unit

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills