ST445      Half Unit
Managing and Visualising Data

This information is for the 2018/19 session.

Teacher responsible

Dr Steve Bell (guest lecturer)

Availability

This course is compulsory on the MSc in Data Science. This course is available on the MSc in Applied Social Data Science. This course is available with permission as an outside option to students on other programmes where regulations permit.

Course content

The course consists of two parts, covering data manipulation and data visualisation respectively. It focuses on the fundamental principles of both, with hands-on exercises using Python as the main programming language alongside various packages used by modern data scientists. The course covers workflow management for data cleaning and preparation, which is typically the most time-consuming part of a data science project, as well as data analysis methods and the presentation of data analysis results using various data visualisation means.

The first five weeks focus on data manipulation, covering basic concepts such as data types and data models, including relational and non-relational database data models and query languages. Students learn how to create data model instances, load data into them, and manipulate and query data using various application programming interfaces. The course covers data structures for scientific computing and their manipulation through the Python package NumPy, which includes manipulation of multidimensional array objects, functions for performing element-wise computations on arrays, tools for reading and writing array-based datasets to files, linear algebra operations and random number generators. The course also covers the use of high-level data structures and functions designed for working with structured or tabular data through the Python package pandas. This involves using the DataFrame, a tabular column-oriented data structure, and the Series, a one-dimensional labelled array object. We cover the basic concepts of relational data models and the SQL query language for creating and querying database tables, as well as some NoSQL database models. Students will learn how to perform data analytics in Python on data imported from various data sources, including delimiter-separated file formats such as CSV and TSV files, JSON and XML files, SQL databases such as MySQL and PostgreSQL, and NoSQL databases such as various document, key-value and graph databases.
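
As a rough illustration of the workflow described above (parse a delimiter-separated source, create a relational table, load the data, then query it with SQL), the following minimal sketch uses only the Python standard library. It is not taken from the course materials: the in-memory SQLite database stands in for a server such as MySQL or PostgreSQL, and the table name, column names and data values are made up for the example.

```python
import csv
import io
import sqlite3

# Made-up delimiter-separated input; names and values are illustrative only.
CSV_TEXT = "product,week,units\nwidget,1,10\nwidget,2,14\ngadget,1,3\n"


def load_rows(text):
    """Parse CSV text into (product, week, units) tuples with typed fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["product"], int(r["week"]), int(r["units"])) for r in reader]


def build_database(rows):
    """Create an in-memory SQLite table (hypothetical schema) and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, week INTEGER, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_units_per_product(conn):
    """Aggregate with SQL: total units sold per product, ordered by product name."""
    query = """
        SELECT product, SUM(units) AS total_units
        FROM sales
        GROUP BY product
        ORDER BY product
    """
    return conn.execute(query).fetchall()


rows = load_rows(CSV_TEXT)
conn = build_database(rows)
totals = total_units_per_product(conn)
print(totals)
```

The same steps carry over to other sources and back ends by swapping the parsing function or the database connection; only the SQL dialect details are likely to differ.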

The last five weeks focus on data visualisation, starting with the elements of exploratory data analysis using various statistical plots. We discuss standard plots for univariate data analysis such as histograms, smoothed histograms using kernel density estimators, empirical cumulative distribution functions, boxplots and violin plots. We then move on to standard plots for bivariate data analysis such as scatter plots, and to matrix data visualisation using cluster heat maps, with seriation and spectral bi-clustering methods for reordering the rows and columns of matrix data. We discuss data visualisation techniques for common tasks such as evaluation of the predictive performance of machine learning classifiers, data dimensionality reduction, and graph data visualisation. We explain plots for evaluation of binary classifiers such as receiver operating characteristic (ROC) curve plots and precision-recall plots. We explain the theoretical principles of dimensionality reduction methods used for visualisation of high-dimensional data points, from classical methods such as multidimensional scaling to more recent methods such as stochastic neighbour embedding. We explain the basic principles of graph data visualisation methods and different graph data layouts. The data visualisations are implemented in code using various Python packages such as matplotlib, Seaborn, the scikit-learn modules for clustering, manifold learning and metrics, and the Graphviz and NetworkX libraries for graph data visualisation.
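
To make the idea of a receiver operating characteristic plot concrete, the sketch below computes the points of a ROC curve and the area under it in plain Python. It is an illustrative assumption rather than course code: the function names, labels and scores are invented for the example, and in practice a library routine from the scikit-learn metrics module would normally be used and the points drawn with matplotlib.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels are 0/1 class labels and scores are classifier scores, where a
    higher score means "more likely positive". The threshold is swept from
    high to low, and tied scores are handled as a single step.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented example data: six observations with known labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc)
```

The area equals the fraction of positive-negative pairs that the scores rank correctly, which is why it is a common one-number summary of such a plot.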

Teaching

20 hours of lectures and 15 hours of computer workshops in the MT.

Formative coursework

Students will be expected to produce 6 problem sets in the MT.

Indicative reading

McKinney, W., Python for Data Analysis, 2nd Edition, O’Reilly, 2017

Müller, A. C. and Guido, S., Introduction to Machine Learning with Python, O’Reilly, 2016

Géron, A., Hands-On Machine Learning with Scikit-Learn & TensorFlow, O’Reilly, 2017

Ramakrishnan, R. and Gehrke, J., Database Management Systems, 3rd Edition, McGraw Hill, 2002

Obe, R. and Hsu, L., PostgreSQL: Up and Running, 3rd Edition, O’Reilly, 2017

Robinson, I., Webber, J. and Eifrem, E., Graph Databases, 2nd Edition, O’Reilly, 2015

Wickham, H., ggplot2: Elegant Graphics for Data Analysis, Springer, 2009

Murray, S., Interactive Data Visualization for the Web, O’Reilly, 2013

Matplotlib, https://matplotlib.org

Seaborn: statistical data visualization, https://seaborn.pydata.org

scikit-learn: Machine Learning in Python, http://scikit-learn.org

Assessment

Project (60%) and continuous assessment (40%) in the MT.

Four of the problem sets submitted by students weekly will be assessed (40% in total). Each assessed problem set carries an individual mark of 10%, and submission will be required in MT Weeks 3, 6, 8 and 10. In addition, there will be a take-home exam (60%) in the form of an individual project in which students will demonstrate the ability to manage data and visualise it through effective statistical graphics using the principles they have learnt on the course. This may be done by publishing the visualisation and code to a GitHub repository and a GitHub Pages website.

Key facts

Department: Statistics

Total students 2017/18: 30

Average class size 2017/18: 29

Controlled access 2017/18: Yes

Lecture capture used 2017/18: Yes (MT)

Value: Half Unit

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills