ST445      Half Unit
Managing and Visualising Data

This information is for the 2020/21 session.

Teacher responsible

Dr Chengchun Shi

Availability

This course is compulsory on the MSc in Data Science. This course is available on the MSc in Applied Social Data Science, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (LSE and Fudan), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

Priority will be given to Applied Social Data Science students and students in the Department of Statistics, where the course is listed on their programme regulations.

 

Course content

The course focuses on the fundamental principles and best practices of data manipulation and visualisation. It uses Python as the primary programming language, together with various software packages.

The first five weeks focus on data manipulation, covering basic concepts such as data types and data models. Students learn how to create data model instances, load data into them, and manipulate and query the data. The course covers data structures for scientific computing and their manipulation through the Python package NumPy, and high-level data structures and functions for working with structured or tabular data through the Python package pandas. We will also cover the basic concepts of relational data models and the SQL query language for creating and querying database tables.
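As an illustration of the kind of material covered in the first half, the sketch below performs the same small aggregation with NumPy, pandas and SQL. It is not taken from the course materials: the table name `students`, its columns and the marks are invented for the example, and an in-memory SQLite database (Python's built-in `sqlite3` module) is assumed as a stand-in for whichever database system the course actually uses.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical records: (student_id, programme, mark).
rows = [
    (1, "Data Science", 62.0),
    (2, "Statistics", 55.0),
    (3, "Data Science", 90.0),
]

# NumPy: a homogeneous array with vectorised arithmetic.
marks = np.array([row[2] for row in rows])
standardised = (marks - marks.mean()) / marks.std()

# pandas: labelled tabular data, here grouped and averaged by programme.
students = pd.DataFrame(rows, columns=["student_id", "programme", "mark"])
mean_by_programme = students.groupby("programme")["mark"].mean()

# SQL: create a table, load the same rows, and run the equivalent query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, programme TEXT, mark REAL)"
)
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT programme, AVG(mark) FROM students "
    "GROUP BY programme ORDER BY programme"
).fetchall()
conn.close()

print(mean_by_programme)
print(sql_result)
```

The pandas `groupby` call and the SQL `GROUP BY` query express the same operation, which is one way to see how a data frame and a relational table relate to each other.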

The last five weeks focus on data visualisation, starting with exploratory data analysis using various statistical plots. We will explain visualisations used to evaluate binary classifiers, such as receiver operating characteristic (ROC) curves and precision-recall curves. We will explain the principles of some dimensionality reduction methods used to visualise high-dimensional data points, from classical methods such as multidimensional scaling to more recent methods such as stochastic neighbour embedding. We will discuss the basic principles of graph data visualisation methods and different graph layouts. The visualisations will be implemented in code using Python packages such as matplotlib, Seaborn, and various scikit-learn modules.
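A minimal sketch of two of the visualisation topics named above, under stated assumptions: the data set is synthetic (generated with scikit-learn's `make_classification`), logistic regression is chosen arbitrarily as the binary classifier, and scikit-learn's `MDS` class (which fits the embedding with the SMACOF algorithm rather than the classical eigendecomposition) stands in for multidimensional scaling. None of these choices comes from the course itself.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import MDS
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit a classifier and score the held-out points.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# Points on the ROC and precision-recall curves, one per score threshold.
fpr, tpr, _ = roc_curve(y_test, scores)
precision, recall, _ = precision_recall_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

# Two-dimensional embedding of the ten-dimensional test points.
embedding = MDS(n_components=2, random_state=0).fit_transform(X_test)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(fpr, tpr)
axes[0].plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
axes[0].set(xlabel="False positive rate", ylabel="True positive rate",
            title=f"ROC curve (AUC = {auc:.2f})")
axes[1].plot(recall, precision)
axes[1].set(xlabel="Recall", ylabel="Precision",
            title="Precision-recall curve")
axes[2].scatter(embedding[:, 0], embedding[:, 1], c=y_test, s=12)
axes[2].set(title="MDS embedding of test points")
fig.tight_layout()
```

Calling `fig.savefig("plots.png")` (the file name is arbitrary) writes the three panels to disk. Swapping `MDS` for `sklearn.manifold.TSNE` gives a stochastic neighbour embedding of the same points.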

The course handout is available here: https://lse-st445.github.io/. 

Teaching

This course will be delivered through a combination of classes and lectures totalling a minimum of 35 hours in Michaelmas Term. This year, some or all of this teaching may be delivered through a combination of virtual classes and flipped lectures delivered as short online videos. This course includes a reading week in Week 6 of Michaelmas Term.

Formative coursework

Students will be expected to produce 6 problem sets in the MT.

Indicative reading

McKinney, W., Python for Data Analysis, 2nd Edition, O’Reilly, 2017

Müller, A. C. and Guido, S., Introduction to Machine Learning with Python, O’Reilly, 2016

Géron, A., Hands-On Machine Learning with Scikit-Learn and TensorFlow, O’Reilly, 2017

Ramakrishnan, R. and Gehrke, J., Database Management Systems, 3rd Edition, McGraw Hill, 2002

Obe, R. and Hsu, L., PostgreSQL: Up and Running, 3rd Edition, O’Reilly, 2017

Robinson, I., Webber, J. and Eifrem, E., Graph Databases, 2nd Edition, O’Reilly, 2015

Wickham, H., ggplot2: Elegant Graphics for Data Analysis, Springer, 2009

Murray, S., Interactive Data Visualization for the Web, O’Reilly, 2013

Matplotlib, https://matplotlib.org

Seaborn: statistical data visualization, https://seaborn.pydata.org

scikit-learn: Machine Learning in Python, http://scikit-learn.org




Assessment

Project (60%) and continuous assessment (40%) in the MT.

Four of the weekly problem sets submitted by students will be assessed (40% in total). Each assessed problem set is worth 10%, and submission will be required in MT Weeks 3, 5, 8 and 10. In addition, there will be a take-home exam (60%) in the form of an individual project, in which students will demonstrate the ability to manage data and visualise it through effective statistical graphics using principles they have learnt on the course. This may be done by publishing the visualisation and code to a GitHub repository and a GitHub Pages website.

Important information in response to COVID-19

Please note that during the 2020/21 academic year some variation to teaching and learning activities may be required to respond to changes in public health advice and/or to account for the situation of students in attendance on campus and those studying online during the early part of the academic year. For assessment, this may involve changes to the mode of delivery and/or the format or weighting of assessments. Changes will only be made if required, and students will be notified about any changes to teaching or assessment plans at the earliest opportunity.

Key facts

Department: Statistics

Total students 2019/20: 37

Average class size 2019/20: 37

Controlled access 2019/20: Yes

Value: Half Unit

Guidelines for interpreting course guide information

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills