ME204: Data Engineering for the Social World

Subject Area: Research Methods, Data Science, and Mathematics

Course details

  • Department
    Data Science Institute
  • Application code
    SS-ME204
Dates
Session one: Not running in 2024
Session two: Open, 8 Jul 2024 - 26 Jul 2024
Session three: Not running in 2024

Apply

Applications are open

We are accepting applications. Apply early to avoid disappointment.

Overview

Data science has opened up a world of exciting and endless possibilities in the social sciences. Recent innovative tools, ranging from big data and data visualisation to machine learning models, are increasingly used by social scientists to make sense of their data.

However, and perhaps surprisingly to many, the day-to-day of a data science project involves very little machine learning. It is often said, anecdotally, that data scientists spend 80% of their time cleaning and pre-processing data and only 20% building or deploying machine learning models. Therefore, whether you want to pursue a career in data science or to experience the data science way of doing things, it is crucial that you first learn how to handle data.

A person proficient at collecting, storing, and adequately pre-processing data is more likely to extract interesting insights from their data even before applying complex algorithms to a data set. This process is part of a data science and analytics subfield called data engineering.

This course will teach you to reason about data and to collect real data from websites, APIs or other sources. It will also teach you best practices for efficient data storage, the basics of SQL, and the tools available in R to pre-process and reshape data. You will learn to put data in a "tidy" format, allowing you to re-purpose it for future analysis, be it exploratory data analysis, visualisation or machine learning. You will also be free to choose the data sources that align most closely with your interests.

By the end of this course, you will be able to produce a visual dashboard to display your collected data and showcase your newly acquired data-wrangling abilities.

Key information

Prerequisites: Students should already be familiar with computer programming at an introductory level (variables, if-else, loops, functions). If you have not used R before, we strongly encourage you to familiarise yourself with it before the start of the course. Suggested reading: R for Data Science book, chapters 1-8.

Level: 200 level. Read more information on levels in our FAQs

Fees: Please see Fees and payments

Lectures: 36 hours

Classes: 18 hours

Assessment: A mid-term problem set (25%) and a final project (75%). 

Typical credit: 3-4 credits (US); 7.5 ECTS points (EU)

Please note: Assessment is optional but may be required for credit by your home institution. Your home institution will be able to advise on how you can meet their credit requirements. For more information on exams and credit, read Teaching and assessment.

Is this course right for you?

This course is ideal for those seeking hands-on experience with a data science project, whether you want to pursue a career in data science or to experience the data science way of doing things. It is also recommended if you want to strengthen your programming skills, and it will be relevant if you are starting an MSc or MBA programme of study and wish to learn introductory concepts in the area.

Outcomes

Aims of this course:

Develop the skills to collect public data from the Web or from APIs, connect multiple data sources and build dashboards to communicate insights obtained from data.

Learning Objectives:

  • Understand the basic elements of computer programming, using the language R
  • Understand the basic structure of data types and common data formats
  • Show familiarity with international standards for common file formats
  • Manage a typical data acquisition, cleaning, structuring, and analysis workflow using practical examples
  • Understand the concept and fundamentals of databases
  • Recognise the difference between “messy data” and “tidy data”
  • Clean data, and diagnose and fix common problems involved in data corruption, using the R package tidyverse
  • Match and combine data from multiple sources using Structured Query Language (SQL)
  • Use Extensible Markup Language (XML) and the Markdown format for formatting documents and web pages
  • Create and maintain simple websites using HTML and CSS
  • Use APIs to send and retrieve data from Internet sources
  • Use regular expressions (regex) to parse and filter text data
  • Use GitHub, a collaboration platform based on the Git version control system
  • Implement a standard structure of files and directories suitable for a data collection workflow
  • Build a real-time Shiny dashboard and publish it on GitHub

Content

Prachin Patel, India

I enjoyed that the course was practical. All of the theory we learned in lectures was then applied in classes, and the reinforcement of the ideas really helped me to learn.

Faculty

The design of this course is guided by LSE faculty, as well as industry experts, who will share their experience and in-depth knowledge with you throughout the course.

Dr Jonathan Cardoso-Silva

Assistant Professor (Education)

Department

The Data Science Institute (DSI) forms the institutional cornerstone of data science activity at the London School of Economics and Political Science. Working alongside the academic departments across the School, the DSI's mission is to foster the study of data science and new forms of data with a focus on their social, economic, and political aspects.

The DSI aims to host, facilitate and promote research in social and economic data science through an annual programme of seminars, workshops and research projects delivered by a range of academic experts and research students.
