ST446 Half Unit
Distributed Computing for Big Data
This information is for the 2025/26 session.
Course Convenor
Dr Marcos Barreto
Availability
This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Data Science, MSc in Econometrics and Mathematical Economics, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit. This course uses controlled access as part of the course selection process.
How to apply: Please be advised that spaces on this course will be extremely limited, so early application is advisable. Priority will be given to students on the MSc Data Science.
Students from any other programmes should submit a short statement indicating a) any experience with cloud computing and/or big data tools, and b) why they think the course is suitable for them given their background knowledge.
Deadline for application: Due to the nature of the method of application, interested students should apply as soon as possible after course selection opens and no later than 10.00am on Friday 26 September 2025.
Course lecturers will aim to make initial offers to students on LSE For You by Friday 26 September.
For queries contact: Stats-Msc@lse.ac.uk
This course has a limited number of places (it is controlled access) and demand is typically high. This may mean that you are not able to get a place on this course. The MSc in Data Science students are given priority for enrolment in this course.
Requisites
Additional requisites:
Basic knowledge of Python or some other programming knowledge is desirable.
Course content
The course covers principles of distributed processing systems for big data, including distributed file systems (such as the Hadoop Distributed File System); distributed computation models (such as MapReduce); resilient distributed datasets (Spark RDDs); structured querying over large datasets (Spark DataFrames and SQL); stream data processing systems (Kafka and Kinesis); and scalable machine learning models (Spark MLlib and TensorFlow).
The course makes use of AWS Academy learning resources and enables students to learn the principles of, and gain hands-on experience with, industry-standard cloud computing technologies. Through weekly exercises and course project work, students can gain experience in performing data analytics tasks on their laptops and on cloud computing platforms.
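As an informal illustration of the MapReduce computation model named above, the following is a minimal single-machine sketch in plain Python. It is not course material and does not use Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only to mirror the three conceptual stages. In a real distributed system each stage would run in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data big compute", "data on the cloud"]
    print(word_count(docs))
```

The map step treats each document independently and the reduce step treats each key independently, which is the property that allows frameworks to spread the work over a cluster.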
Teaching
15 hours of seminars and 20 hours of lectures in the Winter Term.
This course has a reading week in Week 6 of Winter Term.
Formative assessment
Students will be expected to produce 10 problem sets in the WT.
Eight of the weekly problem sets will represent formative coursework. The other two will represent summative assessment.
Indicative reading
- Damji, J., Wenig, B., Das, T., Lee, D. Learning Spark: Lightning-Fast Data Analytics, O’Reilly, 2nd Edition, 2020
- Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
- Drabas, T. and Lee, D., Learning PySpark, Packt, 2016
- White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
- Triguero, I. and Galar, M. Large-Scale Data Analytics with Python and Spark: a hands-on guide to implementing machine learning solutions. Cambridge, 2024.
Additional reading:
- Marz, N., Warren, J. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.
- Kleppmann, M. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly, 2016.
- Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., Lane, J. (Eds.). Big Data and Social Science: Data Science Methods and Tools for Research and Practice. CRC Press, 2nd edition, 2021.
- Li, K-C., Jiang, H., Zomaya, A. (Eds.). Big Data Management and Processing. CRC Press, 2017.
- Huang, S., Deng, H. Data Analytics: A Small Data Approach. CRC Press, 2021.
- Apache Spark Documentation https://spark.apache.org/docs/latest
- TensorFlow Documentation https://www.tensorflow.org
Assessment
Problem sets (20%) in Winter Term Week 6
Problem sets (20%) in Winter Term Week 11
Project (60%) in May
This component of assessment includes an element of group work.
Summative assessments: a problem set submitted in WT Week 6 (20%), a problem set submitted in WT Week 11 (20%), a project (60%) in the WT.
Key facts
Department: Statistics
Course Study Period: Winter Term
Unit value: Half unit
FHEQ Level: Level 7
CEFR Level: Null
Total students 2024/25: 98
Average class size 2024/25: 33
Controlled access 2024/25: No
Course selection videos
Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.
Personal development skills
- Self-management
- Problem solving
- Application of information skills
- Communication
- Application of numeracy skills
- Specialist skills