DS205      Half Unit
Advanced Data Manipulation

This information is for the 2025/26 session.

Course Convenor

Dr Jon Cardoso Silva

Availability

This course is available on the Erasmus Reciprocal Programme of Study and Exchange Programme for Students from University of California, Berkeley. This course is freely available as an outside option to students on other programmes where regulations permit. It does not require permission. This course is available with permission to General Course students.

Requisites

Pre-requisites:

Students must have completed DS105A or DS105W or ST101 or EC1B1 before taking this course.

Assumed prior knowledge:

Students who have already completed relevant training in programming or introductory data manipulation, at a level comparable to those modules, will also be admitted. This may be, for instance, through advanced secondary school programmes or summer school training such as Data Engineering for the Social World (ME204).

Additional requisites:

Students should have a firm grounding in programming, for example by having completed Data for Data Science (DS105), Programming for Data Science (ST101), or Macroeconomics I (EC1B1). Students who have already completed training in programming or introductory data manipulation, at a level comparable to those modules, will also be admitted. This may be, for instance, through advanced secondary school programmes or summer school training such as Data Engineering for the Social World (ME204).

For students who have not followed the usual route (i.e., the LSE courses mentioned above), we can accept evidence of programming proficiency in the form of a programming course certificate or a coding portfolio. This evidence should confirm their ability to independently write 'for' and 'while' loops, manipulate lists and dictionaries, write custom functions, and execute scripts.
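As a rough illustration of this baseline (a sketch only, not an official admissions test), the short script below combines the skills listed above: a 'for' loop, a 'while' loop, list and dictionary manipulation, and custom functions, in a file that can be executed as a script. The function names and the sample text are hypothetical.

```python
def count_words(lines):
    """Return a dictionary mapping each lower-cased word to its frequency."""
    counts = {}
    for line in lines:  # 'for' loop over a list of strings
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, n):
    """Return the n most frequent (word, count) pairs using a 'while' loop."""
    remaining = dict(counts)  # copy so the caller's dictionary is not mutated
    result = []
    while remaining and len(result) < n:
        best = max(remaining, key=remaining.get)
        result.append((best, remaining.pop(best)))
    return result


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["data data science", "science of data"]
    print(top_words(count_words(sample), 2))
```

A student who can write and run a script of this kind without assistance is likely to meet the programming expectation described here.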

Course content

What the course is about: This course provides a capstone-like experience where students master advanced data engineering techniques while collaborating with a real industry client. Students apply programming, web scraping, API development, and natural language processing skills to build data products that have practical impact. The course begins with technical foundations and culminates in group projects that address real-world challenges, bridging academic learning with professional practice.

The Intended Learning Outcomes of this course (what you can expect to learn) are:

  • Write code following programming best practices that aim to optimise computer memory and execution time
  • Efficiently collect data from diverse online sources and store it in databases
  • Write SQL queries to retrieve data from databases and perform basic data analysis
  • Master advanced data manipulation techniques, with a central focus on vectorisation for efficient data analysis
  • Create markdown reports and interactive data visualisations to communicate insights gained from data analysis to technical and non-technical audiences
  • Organise group work tasks using techniques from agile methodologies, such as project boards
  • Use AI tools for working with data, code or database queries in an effective manner
  • Master the use of Git commands for group work

The course begins with a review of vectorised programming concepts using pandas within the Python ecosystem. Students then learn to collect data through web scraping (using Scrapy and Selenium) and from public APIs. We cover database storage techniques, SQL programming fundamentals, and advanced methods for handling unstructured data using natural language processing and embedding models.
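To give a flavour of the database storage and SQL steps mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. It assumes an in-memory database and a made-up table; the table name, column names, and values are hypothetical and are not taken from the course materials or from any client dataset.

```python
import sqlite3

# Hypothetical records standing in for data collected by scraping or from an API.
RECORDS = [
    ("Company A", "energy", 100.0),
    ("Company B", "energy", 200.0),
    ("Company C", "transport", 50.0),
]


def build_database(rows):
    """Create an in-memory SQLite database and store the rows in one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emissions (company TEXT, sector TEXT, value REAL)"
    )
    # Parameterised inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO emissions VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_by_sector(conn):
    """Return (sector, average value) pairs, sorted by sector name."""
    cursor = conn.execute(
        "SELECT sector, AVG(value) FROM emissions "
        "GROUP BY sector ORDER BY sector"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_database(RECORDS)
    print(mean_by_sector(connection))
    connection.close()
```

The aggregation is done in SQL rather than in a Python loop, which is the same habit of pushing work into an optimised engine that vectorised pandas code relies on.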

The culminating group project involves applying these skills to real-world data challenges in partnership with an industry client. In the 2024/25 academic year, students worked with the Transition Pathway Initiative (TPI), developing APIs for climate data and retrieval systems for policy documents, resulting in public GitHub repositories (tpi-apis and rag-fact-sheets). For 2025/26, we anticipate continuing our partnership with TPI, with specific projects to be defined in January 2026.

This client-focused structure provides students with professional experience in agile teamwork, client relationship management, and delivering impactful solutions. The inclusive curriculum incorporates diverse datasets and case studies, ensuring students engage with both technical challenges and ethical considerations in data representation across different contexts.

Notes

  • This is an advanced course in computing. You must already be very familiar with Markdown, Python programming (writing your own modules and functions, documenting code, scripting vs Jupyter Notebook programming), the numpy and pandas libraries, and the basics of Git and GitHub.
  • Students who enjoyed DS105's web scraping components will find more advanced treatment of these topics in DS205, along with more sophisticated data manipulation techniques and client-focused project experience.
  • This course complements DS202 (Introduction to Data Science). While DS202 focuses on machine learning algorithms for social scientists, DS205 focuses on the data engineering skills needed to prepare and manage data throughout the entire pipeline.

Teaching

15 hours of seminars and 20 hours of lectures in the Winter Term.

This course has a reading week in Week 6 of Winter Term.

Formative assessment

The course includes practical, skill-building formative exercises that prepare students for the summative assessments:

Technical Challenge Series: Students complete weekly coding exercises focusing on core data engineering skills: data visualisation, custom API development, and web scraping. These hands-on tasks use real climate data and professional development workflows through GitHub.

Project Preparation: During the final weeks of term, students receive guidance on group formation and project selection for the collaboration with our industry partner. Students discuss potential project tracks and form teams based on technical interests and complementary skills.

All formative work deliberately mirrors the structure and requirements of the summative assessments, giving students the opportunity to practise with simpler versions of real-world data engineering challenges before tackling their client project.

Indicative reading

  • Lawson, R. (2023). Web Scraping with Python: Collecting More Data from the Modern Web (3rd ed.). O'Reilly Media.
    Provides comprehensive coverage of web scraping techniques using Scrapy and Selenium, with practical examples.
  • Chacon, S., & Straub, B. (2014). Pro Git (2nd ed.). Apress.
    The definitive guide to Git and GitHub for collaborative development, written by GitHub's co-founder. Freely available at https://git-scm.com/book/en/v2.
  • Voron, F. (2023). Building Data Science Applications with FastAPI (2nd ed.). Packt Publishing.
    Focuses specifically on integrating data science models with FastAPI, making it ideal for creating APIs that serve the type of data we will use in the course.
  • Tunstall, L., von Werra, L., & Wolf, T. (2022). Natural Language Processing with Transformers (Revised ed.). O'Reilly Media.
    Written by Hugging Face team members, covers implementing transformer models for text analysis tasks.
  • Bouchard, L.F., & Peters, L. (2024). Building LLMs for Production. Towards AI, Inc.
    Comprehensive guide to developing production-ready LLM systems with extensive focus on RAG implementation for document analysis.
  • Rothman, D. (2024). RAG-Driven Generative AI: Build custom retrieval augmented generation pipelines with LlamaIndex, Deep Lake, and Pinecone. Packt Publishing.
    Cutting-edge resource on building custom RAG pipelines, with emphasis on traceable outputs.
  • Webersinke, N., Kraus, M., Bingler, J., & Leippold, M. (2021). ClimateBERT: A Pretrained Language Model for Climate-Related Text. arXiv:2110.12010.
    Specialized transformer model pre-trained on 1.6M paragraphs of climate-related texts, directly applicable to TPI projects. Code available at https://github.com/ClimateBert/language-model.

Assessment

Problem sets (60%)

Project (40%)

This course features problem sets (60%) that assess individual technical proficiency in data engineering and NLP, followed by a client-focused group project (40%). All assessments require professional GitHub workflows and emphasise both practical implementation skills and documentation practices. Detailed feedback is provided after each assessment with opportunities to earn additional marks by implementing suggested improvements.


Key facts

Department: Data Science Institute

Course Study Period: Winter Term

Unit value: Half unit

FHEQ Level: Level 5

CEFR Level: Null

Total students 2024/25: 23

Average class size 2024/25: 12

Capped 2024/25: No

Personal development skills

  • Leadership
  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Commercial awareness
  • Specialist skills