DS105A      Half Unit
Data for Data Science

This information is for the 2023/24 session.

Teacher responsible

Dr Jonathan Cardoso-Silva COL.1.03

Availability

This course is available on the BSc in Politics and Data Science. This course is available as an outside option to students on other programmes where regulations permit and to General Course students.

While this course is not capped and, in principle, any student who requests a place is likely to be given one, restrictions might need to be imposed if the demand is too high.

Material from the previous year can be found on the course's dedicated public webpage: https://lse-dsi.github.io/DS105/.

Course content

The main goal of this course is to teach students how to collect and handle 'real data' in a hands-on manner. The first few weeks of the course will cover theoretical concepts through traditional lectures with slides, after which the format will shift to a more practical approach. Live coding demonstrations will guide students through the material, and students can follow along in real time on their laptops. Python will be the primary programming language used in staff-led lectures and classes, but students may use R for their assignments if they prefer.

An important note on programming: While programming is not strictly required for this course, basic programming knowledge, preferably in Python or R, is highly recommended. Students should be comfortable creating and updating variables, writing simple functions, and using flow control constructs such as if-else statements and for and while loops. Those who are new to coding may find the course challenging, and we encourage them to consider the Winter iteration of the course, DS105W, which gives them additional time to improve their programming skills beforehand. We recommend that students with limited programming experience explore courses such as ST101, Digital Skills Lab workshops or the pre-sessional courses listed on the DS105 Moodle page.
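
As a rough self-check of the programming level described above, the following is a minimal sketch in Python. It is not taken from the course materials: the function names and the sample data are hypothetical and chosen only to illustrate variables, simple functions, if-else statements, and for and while loops.

```python
# Hypothetical self-check example; names and data are illustrative only.

def count_long_words(words, min_length=5):
    """Count how many words have at least min_length characters."""
    count = 0                          # create a variable
    for word in words:                 # for loop over a list
        if len(word) >= min_length:    # if-else flow control
            count += 1                 # update the variable
        else:
            continue                   # skip short words
    return count


def halve_until_small(value, limit=1.0):
    """Count how many times value must be halved to be at most limit."""
    steps = 0
    while value > limit:               # while loop
        value = value / 2
        steps += 1
    return steps


sample = ["data", "science", "python", "r", "github"]
print(count_long_words(sample))        # words with 5 or more characters
print(halve_until_small(10))           # halvings needed to get from 10 to at most 1
```

Students who can read this code and predict what it prints are probably at the level the note above has in mind.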

In terms of content, the learning objectives of this course are to:

  • Understand the basic structure of data types and common data formats.
  • Show familiarity with international standards for common data types.
  • Manage a typical data acquisition, cleaning, structuring, and analysis workflow using practical examples.
  • Clean data, diagnose common problems caused by data corruption, and fix them.
  • Understand the concept and fundamentals of databases.
  • Link data from various sources.
  • Use the collaboration and version control system GitHub, based on the git version control system.
  • Use markup languages and the Markdown format to format documents and web pages.
  • Create and maintain simple websites using HTML and CSS.
  • Use APIs to send and retrieve data from Internet sources.
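
To give a flavour of the cleaning and linking objectives listed above, here is a small sketch using only the Python standard library. It is an illustration under stated assumptions, not course material: the two CSV sources, the counts in them, and the function names are all made up, and real assignments may well use other tools such as pandas.

```python
import csv
import io

# Hypothetical raw data: two tiny CSV sources that share a country code.
# The counts are made-up values; the first source has inconsistent casing,
# stray whitespace, and a missing value.
COUNTS_CSV = """code,count
gb, 100
FR,200
de,
"""
NAMES_CSV = """code,name
GB,United Kingdom
FR,France
DE,Germany
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_counts(rows):
    """Normalise codes to upper case and convert counts to int or None."""
    cleaned = {}
    for row in rows:
        code = row["code"].strip().upper()
        raw = row["count"].strip()
        cleaned[code] = int(raw) if raw else None  # keep missing values explicit
    return cleaned


def link(names_rows, count_by_code):
    """Attach the cleaned count to each name record via the shared code."""
    return [
        {"code": r["code"], "name": r["name"], "count": count_by_code.get(r["code"])}
        for r in names_rows
    ]


linked = link(read_rows(NAMES_CSV), clean_counts(read_rows(COUNTS_CSV)))
for record in linked:
    print(record)
```

The design choice worth noticing is that cleaning happens before linking: the codes only match across the two sources once casing and whitespace have been normalised.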

Note that this course introduces basic concepts of databases, but it is not a hands-on course on their use. If you are interested in learning more about databases, we recommend that you consider taking ST207 Databases, which covers the topic in more detail.

Teaching

16 hours and 40 minutes of lectures and 13 hours and 30 minutes of classes in the AT.

Reading Week in Week 6.

Formative coursework

In the initial sessions, students will work on weekly structured problem sets in the staff-led class sessions. Example exercises include navigating the terminal, accessing remote servers, writing code to scrape and pre-process data from the Web, and setting up GitHub.

Later on, students will be expected to work on their group projects in the staff-led class sessions.

Indicative reading

  • Mayer-Schönberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work and Think. 1st edition. London: Murray, 2013.
  • Kitchin, Rob. The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. Los Angeles, California: SAGE Publications, 2014.
  • Newham, Cameron, and Bill Rosenblatt. Learning the Bash Shell. 3rd edition. Beijing [China]; Sebastopol [CA]: O'Reilly, 2005.
  • Shotts, William E. The Linux Command Line: A Complete Introduction. 2nd edition. San Francisco: No Starch Press, 2019. Made freely available online by the author: https://www.linuxcommand.org/tlcl.php
  • Sweigart, Al. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. 2nd edition. San Francisco: No Starch Press, 2020.
  • Ramalho, Luciano. Fluent Python: Clear, Concise, and Effective Programming. Second edition. Beijing [China]; Boston [MA]: O'Reilly, 2022.
  • Mitchell, Ryan E. Web Scraping with Python: Collecting More Data from the Modern Web. 2nd edition. Sebastopol [CA]: O'Reilly Media, 2018.
  • Abiteboul, Serge, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart. Web Data Management. 1st edition. Cambridge University Press, 2011.
  • McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter. 3rd edition. Beijing Boston Farnham Sebastopol Tokyo: O'Reilly, 2022. Made freely available online by the author: https://wesmckinney.com/book/.
  • VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. 2nd edition. Beijing Boston Farnham Sebastopol Tokyo: O'Reilly, 2023. Made freely available online by the author: https://jakevdp.github.io/PythonDataScienceHandbook/.
  • Duckett, Jon. HTML & CSS: Design and Build Websites. Indianapolis, Indiana: John Wiley & Sons Inc, 2014.
  • Jacobson, Daniel, Dan Woods, and Gregory Brail. APIs: A Strategy Guide. Sebastopol [CA]: O'Reilly, 2012.
  • Domdouzis, Konstantinos, Peter Lake, and Paul Crowther. Concise Guide to Databases: A Practical Introduction. 2nd edition. Undergraduate Topics in Computer Science. Cham: Springer, 2021.
  • Chacon, Scott. Pro Git. 2nd edition. New York, NY: Apress, 2014.
  • Zafarani, Reza, Mohammad Ali Abbasi, and Huan Liu. Social Media Mining: An Introduction. 1st edition. Cambridge University Press, 2014.
  • Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press, 2022.

Assessment

Coursework (60%, 1000 words) and group project (40%) in the WT.

The coursework will involve three individual programming assignments.

The group project will include progress presentations and a final public report. The final part of the group project will involve creating a data-based public website that will be published on GitHub. The deadline for the final project will fall early in the Winter Term.

Key facts

Department: Data Science Institute

Total students 2022/23: Unavailable

Average class size 2022/23: Unavailable

Capped 2022/23: No

Value: Half Unit

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills