MY572 Half Unit
Data for Data Scientists
This information is for the 2022/23 session.
Dr Friedrich Geiecke
This course is available on the MPhil/PhD in Social Research Methods. This course is available with permission as an outside option to students on other programmes where regulations permit.
This course is not controlled access. If you register for a place and meet the prerequisites, if any, you are likely to be given a place.
This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, which are frequently the starting point and most time-consuming part of any data science project. This course uses a project-based learning approach towards the study of online publishing and group-based collaboration, essential ingredients of modern data science projects. The coverage of data sharing will include key skills in online publishing, including the elements of web design, the technical elements of web technologies and web programming, as well as the use of revision-control and group collaboration tools such as GitHub. Each student will build one or more interactive websites based on content relevant to their domain-related interests, and will use GitHub for accessing and submitting course materials and assignments.
In this course, we introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing concepts in fundamental data types, and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. In the final part of the data section of the course, we will step through a complete workflow including data cleaning and transformation, illustrating many of the practical challenges faced at the outset of any data analysis or data science project.
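As an illustration of the create, populate and query cycle described above, the following is a minimal sketch only and not course material. It assumes Python's built-in sqlite3 module as a stand-in for the MySQL server named in the course description, and the table names, column names and figures are hypothetical. The last lines show the same information as a nested JSON document, the kind of structure a document store such as MongoDB works with.

```python
import json
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: two related tables (hypothetical schema).
cur.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE surveys ("
    "id INTEGER PRIMARY KEY, "
    "country_code TEXT REFERENCES countries(code), "
    "respondents INTEGER)"
)

# Populate: parameterised inserts with made-up example rows.
cur.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("GB", "United Kingdom"), ("DE", "Germany")],
)
cur.executemany(
    "INSERT INTO surveys (country_code, respondents) VALUES (?, ?)",
    [("GB", 1200), ("GB", 800), ("DE", 1500)],
)
conn.commit()

# Query: a join links the two tables, then rows are aggregated per country.
rows = cur.execute(
    """
    SELECT c.name, SUM(s.respondents) AS total
    FROM surveys AS s
    JOIN countries AS c ON c.code = s.country_code
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)

# For comparison: the same facts as one nested, document-style JSON record.
document = {
    "code": "GB",
    "name": "United Kingdom",
    "surveys": [{"respondents": 1200}, {"respondents": 800}],
}
doc_json = json.dumps(document)
print(doc_json)
```

In the relational version the link between a survey and its country is made at query time through the join; in the document version the related records are nested inside their parent, so no join is needed to read them back.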
This course is delivered through a combination of classes and lectures totalling a minimum of 20 hours across Michaelmas Term.
This course has a reading week in Week 6 of MT.
Students will be expected to produce 10 problem sets in the MT.
Students will work on weekly, structured problem sets in the staff-led class sessions. Example solutions will be provided at the end of each week.
- Chodorow, Kristina. MongoDB: The Definitive Guide, 2nd Edition. O’Reilly, 2013.
- Churcher, Clare. Beginning Database Design: From Novice to Professional. Apress, 2007.
- Tahaghoghi, Seyed M. and Hugh E. Williams. Learning MySQL. O'Reilly, 2006.
- Karumanchi, Narasimha. Data Structures and Algorithms Made Easy: Data Structure and Algorithmic Puzzles, Second Edition. CreateSpace Independent Publishing Platform, 2011.
- Lee, Kent. Data Structures and Algorithms with Python. Springer, 2015.
- Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013.
- Nield, Thomas. Getting Started with SQL: A hands-on approach for beginners. O’Reilly, 2016.
- Byron, Angela, Addison Berry, Nathan Haug, Jeff Eaton, James Walker, and Jeff Robbins. Using Drupal: Choosing and Configuring Modules to Build Dynamic Websites. O'Reilly Media, 2008.
- Duckett, Jon. HTML and CSS: Design and Build Websites. New York: Wiley, 2011.
- Rice, Dylan. Twitter Bootstrap In Your Pocket. CreateSpace Independent Publishing Platform, 2016.
- Sklar, David. Learning PHP 5. O’Reilly, 2004.
- GitHub Guides at https://guides.github.com, including: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.
- Jacobson, Daniel. APIs: A Strategy Guide. O’Reilly, 2012.
- Loudon, Kyle. Developing Large Web Applications: Producing Code That Can Grow and Thrive. O’Reilly, 2010.
Take-home assessment (50%) and problem sets (50%) in the MT.
Marking of these assessments will be at a level appropriate for PhD students.
Total students 2021/22: 6
Average class size 2021/22: 3
Value: Half Unit
Personal development skills
- Team working
- Problem solving
- Application of information skills
- Application of numeracy skills
- Specialist skills