This course introduces the data science approach to the quantitative analysis of data using the methods of statistical learning, which blend classical statistical methods with recent advances in computation and machine learning. We will cover the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using these methods. We also cover data preparation and processing, key-value formatted data (JSON), and unstructured textual data. At the end of this course students will have a sound understanding of the field of data science, the ability to analyse data using some of its main methods, and a solid foundation for more advanced or more specialised study.
The course will be delivered as a series of morning lectures, followed by afternoon lab sessions in which students apply the lessons through instructor-guided exercises using the data provided.
The course will cover the following topics:
- an overview of data science and the challenge of working with big data using statistical methods
- how to integrate the insights from data analytics into knowledge generation and decision-making
- how to acquire data, both structured and unstructured, and to process it, store it, and convert it into a format suitable for analysis
- the basics of statistical inference, including probability and probability distributions, modelling, and experimental design
- an overview of classification methods and related methods for assessing model fit and cross-validating predictive models
- supervised learning approaches, including linear and logistic regression, decision trees, and naïve Bayes
- unsupervised learning approaches, including clustering, association rules, and principal components analysis
- quantitative methods of text analysis, including mining social media and other online resources
- social network analysis, covering the basics of social graph data and analysing social networks
- data visualisation through a variety of graphs.
Main texts
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer.
Zumel, N. and Mount, J. (2014). Practical Data Science with R. Manning Publications.
The following are supplemental texts which you may also find useful:
Lantz, B. (2013). Machine Learning with R. Packt Publishing.
Conway, D. and White, J. (2012). Machine Learning for Hackers. O'Reilly Media.
Leskovec, J., Rajaraman, A. and Ullman, J. (2011). Mining of Massive Datasets. Cambridge University Press.
Zafarani, R., Abbasi, M. A. and Liu, H. (2014). Social Media Mining: An Introduction. Cambridge University Press.
Software used
R.
Schedule
Please note: A full timetable will be provided at registration on Monday 14 August. The schedule below is subject to change.
| Week one (hours) | Morning lecture | Afternoon class |
|---|---|---|
| Mon | 3 hours | 1.5 hours |
| Tues | 3 hours | 1.5 hours |
| Weds | 3 hours | 1.5 hours |
| Thurs | 3 hours | 1.5 hours |
| Fri | 3 hours | 1.5 hours |

| Week two (hours) | Morning lecture | Afternoon class |
|---|---|---|
| Mon | 3 hours | 1.5 hours |
| Tues | 3 hours | 1.5 hours |
| Weds | 3 hours | 1.5 hours |
| Thurs | 3 hours | 1.5 hours |
| Fri | 3 hours | Exam |