MY573 Half Unit
Managing and Visualising Data

This information is for the 2017/18 session.

Teacher responsible

Prof Kenneth Benoit COL8.11

Availability

This course is available on the PhD in Methodology. This course is available with permission as an outside option to students on other programmes where regulations permit.

This course is available to all research students where regulations permit.

Course content

The course will be divided into two halves.

The first five weeks will focus on data structures and databases, covering the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project.

This part of the course will introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing concepts in fundamental data types, and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students will be introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. In the final part of this half of the course, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project.
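As a rough illustration of the create, populate and query cycle described above, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so that the example runs without a database server. The table names, column names and values are hypothetical and not taken from the course materials. The final step reshapes the same data into nested JSON documents of the kind a document store such as MongoDB would hold, for comparison with the relational layout.

```python
import json
import sqlite3

# Assumption: SQLite (in memory) stands in for MySQL; tables and data are hypothetical.
conn = sqlite3.connect(":memory:")

# Create: two tables linked by a shared key (country.code).
conn.executescript("""
    CREATE TABLE country (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE election (
        id      INTEGER PRIMARY KEY,
        code    TEXT NOT NULL REFERENCES country(code),
        year    INTEGER NOT NULL,
        turnout REAL
    );
""")

# Populate: parameterised inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO country (code, name) VALUES (?, ?)",
    [("AA", "Country A"), ("BB", "Country B")],
)
conn.executemany(
    "INSERT INTO election (code, year, turnout) VALUES (?, ?, ?)",
    [("AA", 2010, 61.5), ("AA", 2014, 58.0), ("BB", 2012, 70.25)],
)

# Query: join the two tables on the shared key and aggregate per country.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_elections, AVG(e.turnout) AS mean_turnout
    FROM election AS e
    JOIN country AS c ON c.code = e.code
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Comparison: the same data as nested documents, one per country,
# with the elections embedded rather than held in a separate table.
countries = conn.execute("SELECT code, name FROM country ORDER BY code").fetchall()
documents = []
for code, name in countries:
    elections = conn.execute(
        "SELECT year, turnout FROM election WHERE code = ? ORDER BY year",
        (code,),
    ).fetchall()
    documents.append({
        "code": code,
        "name": name,
        "elections": [{"year": y, "turnout": t} for y, t in elections],
    })

as_json = json.dumps(documents, indent=2)
conn.close()
```

In the relational version the link between tables is made at query time through the join; in the document version the link is fixed when the data is stored, which makes per-country reads simple but cross-country queries less direct.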

The second five weeks will focus on visualising data, starting with univariate and bivariate data, discussing the advantages/disadvantages of some commonly used graphics, then turning to more sophisticated tools, including three-dimensional tools, maps and interactive and dynamic graphics.

This part of the course will cover: data visualisation basics (history and classic examples; best practice for univariate and bivariate data; image formats and resolution); data visualisation principles (cognition and human visual perception; grammar of graphics; application to examples); design principles (graphic design; layout; visual style; titles and annotations; animations; interactive and dynamic graphics); statistical analysis and maps (binwidths/bandwidths for histograms and kernel density estimation; regression diagnostics; maps).
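To make the binwidth topic concrete, here is a minimal sketch of one common rule of thumb for choosing a histogram binwidth, the Freedman-Diaconis rule (binwidth = 2 × IQR / n^(1/3)). The course guide does not say which rules are taught, so the choice of rule is an assumption, and the function names below are hypothetical. Only the Python standard library is used.

```python
import math
import statistics


def freedman_diaconis_binwidth(values):
    """Binwidth from the Freedman-Diaconis rule: 2 * IQR / n^(1/3)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; rule is undefined")
    return 2 * iqr / n ** (1 / 3)


def histogram_counts(values, binwidth):
    """Count observations in equal-width bins starting at the minimum value."""
    lowest = min(values)
    n_bins = max(1, math.ceil((max(values) - lowest) / binwidth))
    counts = [0] * n_bins
    for v in values:
        # The maximum value can land exactly on the upper edge; keep it in the last bin.
        index = min(int((v - lowest) / binwidth), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data, used only to exercise the functions.
sample = list(range(1, 9))
width = freedman_diaconis_binwidth(sample)
counts = histogram_counts(sample, width)
```

Calling histogram_counts on the same data with a much smaller or larger binwidth shows the trade-off: narrow bins produce a noisy picture, wide bins hide structure.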

Teaching

20 hours of lectures and 15 hours of lectures in the MT.

Formative coursework

Students will be expected to produce 6 problem sets in the MT.

Indicative reading

Wilkinson, Leland. The Grammar of Graphics, 2nd Ed., Springer, 2005.

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis, Springer, 2009.

Cook, Dianne and Swayne, Deborah. Interactive and Dynamic Graphics for Data Analysis: With R and GGobi, Springer, 2007.

Murray, Scott. Interactive Data Visualization for the Web, O'Reilly, 2013.

Assessment

Project (60%) and continuous assessment (40%) in the MT.

Four of the problem sets submitted by students weekly will be assessed (40% in total). In addition, there will be a take-home exam (60%) in the form of an individual project in which students will demonstrate the ability to manage data and visualise it through effective statistical graphics using principles they have learnt on the course. This may be done by publishing the visualisation and code to a GitHub repository and GitHub Pages website.

Marking of these assessments will be at a level appropriate for PhD students. For the project, PhD students are expected to submit a more detailed project than is expected of students taking the MSc-level course.

Key facts

Department: Methodology

Total students 2016/17: Unavailable

Average class size 2016/17: Unavailable

Value: Half Unit

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills