DS105M Half Unit
Data for Data Science

This information is for the 2021/22 session.

Teacher responsible

Prof Kenneth Benoit PEL.4.01C

Course content

This course covers the fundamentals of data, with the aim of understanding how data is generated, how it is collected, how it must be transformed for use and storage, how it is stored, and how it can be retrieved and communicated. The course also covers workflow management for typical data transformation and cleaning projects, which are frequently the starting point and the most time-consuming part of any data science project. It uses a project-based learning approach to the study of online publishing and group-based collaboration, both essential ingredients of modern data science projects.

The course introduces the principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, how data is stored and recorded electronically, and the concept and fundamentals of databases. It also covers how data is formatted and communicated, and presents basic methods for obtaining data from the Internet, including simple methods for web scraping and the use of APIs to submit queries that return structured data. Finally, it covers methods for formatting and publishing data.

Sharing and publishing data also form a key part of this module. This includes key skills in online publishing, such as the elements of web design and the technical elements of web technologies and web programming, as well as the use of revision-control and group collaboration tools such as GitHub. Each student will build an interactive website based on content relevant to their domain-related interests, and will use GitHub for accessing and submitting course materials and assignments. The final project will involve group work to create a data-based website published on GitHub.

This module is not designed to be a hands-on introduction to the use of databases, but it does introduce the concepts of databases. Students who want more detailed coverage of databases are encouraged to take ST207 Databases.

Teaching

16 hours and 40 minutes of lectures and 13 hours and 30 minutes of classes in the MT.

A combination of classes and lectures totalling a minimum of 33.5 hours (counting 50 mins as an hour) across Michaelmas Term, with a reading week in Week 6.

Formative coursework

Students will be expected to produce 5 pieces of coursework in the MT.

Students will work on weekly, structured problem sets in the staff-led class sessions. Example solutions will be provided at the end of each week. 

Indicative reading

  • Duckett, Jon. HTML and CSS: Design and Build Websites. New York: Wiley, 2011.
  • Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013.
  • Sklar, David. Learning PHP 5. O'Reilly, 2004.
  • GitHub Guides at https://guides.github.com, including: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
  • Jacobson, Daniel. APIs: A Strategy Guide. O'Reilly, 2012.
  • Zafarani, R., Abbasi, M. A. and Liu, H. (2014) Social Media Mining: An introduction. Cambridge University Press.
  • Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
  • Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. Sage.

Assessment

Coursework (60%, 1000 words) and group project (40%) in the LT.

Key facts

Department: Data Science Institute

Total students 2020/21: Unavailable

Average class size 2020/21: Unavailable

Capped 2020/21: No

Value: Half Unit

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills