DS205 Half Unit
Advanced Data Manipulation

This information is for the 2024/25 session.

Teacher responsible

Dr Jonathan Cardoso Silva COL 1.03

Availability

This course is available on the BSc in Data Science and BSc in Politics and Data Science. This course is available as an outside option to students on other programmes where regulations permit and to General Course students.

Pre-requisites

Students should have a firm grounding in programming, for example by having completed Data for Data Science (DS105) or Programming for Data Science (ST101). Students who have covered programming, or the more introductory data manipulation concepts from those courses, by another route will also be admitted. Examples include advanced secondary school programmes, summer school training such as Data Engineering for the Social World (ME204), or Macroeconomics I (EC1B1).

For students whose programming skills were acquired outside the usual route (i.e., the LSE courses mentioned above), we can accept evidence of proficiency in the form of a programming course certificate or a coding portfolio. This evidence should confirm their ability to independently write `for` and `while` loops, manipulate lists and dictionaries, write custom functions, and execute scripts.
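As a rough indication of the level described above, the following minimal sketch combines a `for` loop, a `while` loop, a list, a dictionary, and custom functions in one script. It is an illustration only, not an official entry test, and the function names `count_words` and `first_power_of_two_above` are hypothetical.

```python
# Illustrative self-check only; the function names are hypothetical
# and not part of any course material.

def count_words(words):
    """Return a dictionary mapping each word in a list to its frequency."""
    counts = {}
    for word in words:  # `for` loop over a list
        counts[word] = counts.get(word, 0) + 1  # update a dictionary entry
    return counts


def first_power_of_two_above(limit):
    """Return the smallest power of two strictly greater than `limit`."""
    value = 1
    while value <= limit:  # `while` loop with an explicit stopping condition
        value *= 2
    return value


if __name__ == "__main__":
    # Running this file as a script executes the two calls below.
    print(count_words(["data", "science", "data"]))
    print(first_power_of_two_above(100))
```

A student who can write and run a script of this kind without assistance is likely to meet the programming prerequisite.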

Course content

The primary objective of this course is to equip students with the skills to collect and manage 'real data' in a computationally efficient manner. The course emphasises practical learning, with a focus on live coding demonstrations during all lectures and seminars.

The course begins with a review of vectorised programming concepts, using the pandas library in the Python ecosystem. Students will then learn how to independently collect data from websites and publicly available APIs, with structured problem sets provided for practice over several weeks. Students will also be introduced to best practices for integrating generative AI tools such as ChatGPT and GitHub Copilot when writing code and resolving errors, with code maintainability in mind. The curriculum then covers appropriate techniques for storing data in databases, along with the fundamentals of SQL needed to query and manipulate that data efficiently. Integration of SQL with vectorised programming libraries will also be covered.
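To give a flavour of two of the ideas mentioned above, vectorised programming and combining SQL with pandas, here is a minimal sketch. It assumes pandas is installed and uses Python's built-in `sqlite3` module with an in-memory database. The table name `sales`, the column names, the values, and the helper functions are all made up for illustration and are not taken from the course materials.

```python
import sqlite3

import pandas as pd


def build_sales_frame():
    """Create a small illustrative DataFrame (values are invented)."""
    df = pd.DataFrame(
        {
            "city": ["London", "Leeds", "London"],
            "price": [10.0, 8.0, 12.0],
            "quantity": [3, 5, 2],
        }
    )
    # Vectorised: one whole-column operation instead of a Python loop over rows.
    df["revenue"] = df["price"] * df["quantity"]
    return df


def total_revenue_by_city(df):
    """Store the DataFrame in SQLite and aggregate it with a SQL query."""
    conn = sqlite3.connect(":memory:")  # temporary in-memory database
    try:
        df.to_sql("sales", conn, index=False)  # write the DataFrame to a table
        query = """
            SELECT city, SUM(revenue) AS total_revenue
            FROM sales
            GROUP BY city
            ORDER BY city
        """
        return pd.read_sql_query(query, conn)  # read the result back into pandas
    finally:
        conn.close()


if __name__ == "__main__":
    print(total_revenue_by_city(build_sales_frame()))
```

The same aggregation could be written directly in pandas with `groupby`. The sketch routes it through SQL only to show how the two tools can be used together.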

In the later part of the course, we will address advanced data manipulation techniques for unstructured data such as text. This includes natural language processing methods and the use of APIs that give access to AI tools and large language models for common tasks such as sentiment analysis and text classification. We will demonstrate applications of these techniques to social media analysis.

Finally, students will practise collaborating on a group project using agile methods and GitHub. Groups will build data manipulation workflows, from data collection to interactive visualisations, with the freedom to choose from publicly available data sources.

Students keen to go deeper into data science may find DS205 a valuable complement to DS202. Both courses emphasise practical skills and programming, but DS202 focuses on introductory machine learning algorithms that are useful to social scientists, while DS205 focuses on the essential skills required to prepare data before any algorithm is applied. Neither course is a prerequisite for the other, and the two can be taken concurrently.

Teaching

20 hours of lectures and 15 hours of seminars in the WT.

This course is delivered through a combination of classes and lectures totalling a minimum of 35 hours across Winter Term. This course has a Reading Week in Week 6 of WT.

Formative coursework

Students will be expected to produce 2 problem sets and 1 presentation in the WT.

In addition to weekly coding exercises during seminars, students will independently complete two structured problem sets, due in Week 3 and Week 6. These problem sets serve as preparation for the individual summative problem set due in Week 9.

Group projects start in the Week 10 seminar, with instructors assisting students in forming groups and selecting suitable data sources. In the Week 11 seminar, students present their project plans for feedback, leading up to the final project development and submission in the Spring Term.

Indicative reading

  • Mitchell, Ryan E. Web Scraping with Python: Collecting More Data from the Modern Web. 2nd edition. Sebastopol [CA]: O'Reilly Media, 2018.
  • Abiteboul, Serge, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart. Web Data Management. 1st edition. Cambridge University Press, 2011.
  • McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter. 3rd edition. Sebastopol [CA]: O'Reilly Media, 2022. Made freely available online by the author: https://wesmckinney.com/book/.
  • VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. 2nd edition. Sebastopol [CA]: O'Reilly Media, 2023. Made freely available online by the author: https://jakevdp.github.io/PythonDataScienceHandbook/.
  • Duckett, Jon. HTML & CSS: Design and Build Websites. Indianapolis, Indiana: John Wiley & Sons Inc, 2014.
  • Jacobson, Daniel, Dan Woods, and Gregory Brail. APIs: A Strategy Guide. Sebastopol [CA]: O'Reilly, 2012.
  • Domdouzis, Konstantinos, Peter Lake, and Paul Crowther. Concise Guide to Databases: A Practical Introduction. 2nd edition. Undergraduate Topics in Computer Science. Cham: Springer, 2021. 
  • Chacon, Scott, and Ben Straub. Pro Git. 2nd edition. New York, NY: Apress, 2014.
  • Zafarani, Reza, Mohammad Ali Abbasi, and Huan Liu. Social Media Mining: An Introduction. 1st edition. Cambridge University Press, 2014.
  • Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press, 2022.

Assessment

Problem sets (60%) in the WT.
Group project (40%) in the period between WT and ST.

Key facts

Department: Data Science Institute

Total students 2023/24: Unavailable

Average class size 2023/24: Unavailable

Capped 2023/24: No

Value: Half Unit

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills