DS105M Half Unit
Data for Data Science
This information is for the 2021/22 session.
Prof Kenneth Benoit PEL.4.01C
This course will cover the fundamentals of data, with the aim of understanding how data is generated, how it is collected, how it must be transformed for use and storage, how it is stored, and how it can be retrieved and communicated. The course will also cover workflow management for typical data transformation and cleaning projects, which are frequently the starting point and the most time-consuming part of any data science project. The course takes a project-based learning approach to the study of online publishing and group-based collaboration, essential ingredients of modern data science projects.
It introduces the principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, how data is stored and recorded electronically, and the concept and fundamentals of databases. It also covers how data is formatted and communicated. It presents basic methods for obtaining data from the Internet, including simple methods for web scraping and the use of APIs to submit queries that return structured data. Finally, it covers methods for formatting and publishing data.
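As a rough illustration of the kind of task described above, the following Python sketch takes structured data of the sort an API query might return (JSON), extracts a table from a web page (simple scraping), and formats the combined records as CSV for publication. It is not taken from the course materials: the JSON body, the HTML fragment, the field names `city` and `population`, and all function names are hypothetical, and the inputs are hard-coded strings so that the example runs without network access.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical JSON body, standing in for what an API query might return.
API_RESPONSE = (
    '{"results": [{"city": "London", "population": 8982000},'
    ' {"city": "Leeds", "population": 793000}]}'
)

# Hypothetical HTML fragment, standing in for a downloaded web page.
HTML_PAGE = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Bristol</td><td>467000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:                 # ignore whitespace between tags
            self.rows[-1][-1] += data.strip()


def records_from_api(body):
    """Decode the JSON body and return its list of records."""
    return json.loads(body)["results"]


def records_from_html(page):
    """Scrape the table: first row is the header, the rest are records."""
    parser = TableParser()
    parser.feed(page)
    header, *rows = parser.rows
    records = [dict(zip(header, row)) for row in rows]
    for record in records:
        record["population"] = int(record["population"])  # text -> integer
    return records


def to_csv(records):
    """Format the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["city", "population"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = records_from_api(API_RESPONSE) + records_from_html(HTML_PAGE)
print(to_csv(records))
```

In practice the two input strings would come from an HTTP request rather than being hard-coded, and a real page would usually need a more robust parser than this minimal one.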
Sharing and publishing data will also form a key part of this module and will include key skills in online publishing, including the elements of web design, the technical elements of web technologies and web programming, and the use of revision-control and group collaboration tools such as GitHub. Each student will build an interactive website based on content relevant to their domain-related interests, and will use GitHub for accessing and submitting course materials and assignments. The final project will involve group work to create a data-based website published on GitHub.
This module is not designed to be a hands-on introduction to the use of databases, but it does introduce the concepts of databases. Students who want more detailed learning on databases are encouraged to take ST207 Databases.
16 hours and 40 minutes of lectures and 13 hours and 30 minutes of classes in the MT.
A combination of classes and lectures totalling a minimum of 33.5 hours (counting 50 mins as an hour) across Michaelmas Term, with a reading week in Week 6.
Students will be expected to produce 5 pieces of coursework in the MT.
Students will work on weekly, structured problem sets in the staff-led class sessions. Example solutions will be provided at the end of each week.
- Duckett, Jon. HTML and CSS: Design and Build Websites. New York: Wiley, 2011.
- Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013.
- Sklar, David. Learning PHP 5. O’Reilly, 2004.
- GitHub Guides at https://guides.github.com, including: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
- Jacobson, Daniel. APIs: A Strategy Guide. O'Reilly, 2012.
- Zafarani, R., Abbasi, M. A. and Liu, H. (2014) Social Media Mining: An introduction. Cambridge University Press.
- Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
- Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. Sage.
Coursework (60%, 1000 words) and group project (40%) in the LT.
Department: Data Science Institute
Total students 2020/21: Unavailable
Average class size 2020/21: Unavailable
Capped 2020/21: No
Value: Half Unit
Personal development skills
- Team working
- Problem solving
- Application of information skills
- Application of numeracy skills
- Specialist skills