ST445 Half Unit
Managing and Visualising Data
This information is for the 2017/18 session.
Prof Kenneth Benoit COL 8.11 and Prof Milan Vojnovic COL 2.05A
This course is compulsory on the MSc in Data Science. This course is available on the MSc in Media and Communications (Data and Society) and MSc in Social Research Methods. This course is available with permission as an outside option to students on other programmes where regulations permit.
Priority will be given to students on the MSc in Data Science.
The course will be divided into two halves taught by Prof Kenneth Benoit and Prof Milan Vojnovic, respectively.
The first five weeks will focus on data structures and databases, covering the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project.
This part of the course will introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing fundamental data types and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. In the final section of this part of the course, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project.
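As a rough illustration of the create, populate and query cycle described above, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so that the example runs without a database server. The table and column names (country, city, country_id) are hypothetical and not taken from the course materials.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: two hypothetical tables linked by country_id.
cur.execute(
    "CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
cur.execute(
    "CREATE TABLE city ("
    "city_id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "country_id INTEGER REFERENCES country(country_id))"
)

# Populate: parameterised inserts avoid building SQL strings by hand.
cur.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [(1, "France"), (2, "Japan")],
)
cur.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [(1, "Paris", 1), (2, "Lyon", 1), (3, "Osaka", 2)],
)
conn.commit()

# Query: a join combines rows from the two tables on the shared key.
rows = cur.execute(
    "SELECT city.name, country.name "
    "FROM city JOIN country ON city.country_id = country.country_id "
    "ORDER BY city.name"
).fetchall()
conn.close()

print(rows)
```

The join on country_id is the same kind of operation used for data linkage between datasets from different sources, although real linkage problems rarely come with such a clean shared key.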
The second five weeks will focus on visualising data, starting with an introduction to basic statistical plots and best practices, and then covering standard plots used for classification and regression error analysis, graph data visualisation, projection methods for visualising high-dimensional data, interactive data visualisation, and map data visualisation.
This part of the course will cover: basic statistical plots including basic univariate data plots, line plots, bar plots, histograms, empirical distribution function plots, boxplots, scatter plots, and heat maps; standard plots for classification and regression error analysis such as precision-recall curves, receiver operating characteristic (ROC) curves, area under the curve (AUC), and the relations between them; graph data visualisation methods and the underlying principles; projection methods for high-dimensional data such as Principal Component Analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE); basic interactive plots using D3 and other tools; overlaying data on geographical maps.
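To make the relation between the ROC curve and the AUC concrete, here is a minimal sketch in plain Python. It is not the course's own code: the function names roc_points and auc are hypothetical, and the labels and scores are made-up illustrative values. The ROC points are obtained by sweeping a decision threshold from the highest score downwards, and the AUC is the trapezoidal area under those points.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for a ROC curve.

    labels are 0/1 class labels; scores are classifier scores where a higher
    score means "more likely positive". Tied scores are handled together.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every example sharing this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative values only: three positives and three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(round(area, 4))
```

For these values the area equals 8/9, which matches the fraction of positive-negative pairs in which the positive example receives the higher score. Plotting the points with any plotting library gives the ROC curve itself.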
20 hours of lectures and 15 hours of computer workshops in the MT.
Students will be expected to produce 6 problem sets in the MT.
Kristina Chodorow. MongoDB: The Definitive Guide, 2nd Ed. O'Reilly, 2013.
Clare Churcher. Beginning Database Design: From Novice to Professional. Apress, 2007.
Seyed M. M. Tahaghoghi and Hugh E. Williams. Learning MySQL. O'Reilly, 2006.
Narasimha Karumanchi. Data Structures and Algorithms Made Easy: Data Structures and Algorithmic Puzzles, 2nd Ed., CreateSpace Independent Publishing Platform, 2011.
Kent Lee. Data Structures and Algorithms with Python. Springer, 2015.
Peter Lake and Paul Crowther. Concise Guide to Databases: A Practical Introduction. Springer, 2013.
Thomas Nield. Getting Started with SQL: A Hands-On Approach for Beginners. O'Reilly, 2016.
Alberto Cairo. The Functional Art: An Introduction to Information Graphics and Visualization, New Riders, 2013.
Rafe M. J. Donahue. Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics, 2011 downloadable from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RafeDonahue/fscipdpfcbg_currentversion.pdf.
Leland Wilkinson. The Grammar of Graphics, 2nd Ed., Springer, 2005.
Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis, 2nd edition, Springer, 2016.
Dianne Cook and Deborah Swayne. Interactive and Dynamic Graphics for Data Analysis - with R and GGobi, Springer, 2007.
Scott Murray. Interactive Data Visualization for the Web, O'Reilly, 2013.
Project (60%) in the MT.
Continuous assessment (10%) in Week 3.
Continuous assessment (10%) in Week 6.
Continuous assessment (10%) in Week 8.
Continuous assessment (10%) in Week 10.
Four of the problem sets submitted by students weekly will be assessed (40% in total). In addition, there will be a take-home exam (60%) in the form of an individual project in which students will demonstrate the ability to manage data and visualise it through effective statistical graphics, using principles they have learnt on the course. This may be done by publishing the visualisation and code to a GitHub repository and a GitHub Pages website.
Total students 2016/17: Unavailable
Average class size 2016/17: Unavailable
Controlled access 2016/17: No
Value: Half Unit
Personal development skills
- Problem solving
- Application of information skills
- Application of numeracy skills
- Specialist skills