ST457      Half Unit
Graph Data Analytics and Representation Learning

This information is for the 2022/23 session.

Teacher responsible

Prof. Zoltan Szabo (COL.5.14)

Homepage: https://zoltansz.github.io/

Availability

This course is available on the MSc in Data Science, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Operations Research & Analytics, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (LSE and Fudan), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access) and the demand can be high. This means that you might not be able to get a place on this course. MSc in Data Science students are given priority for enrollment in this course.

Pre-requisites

No particular course is required as pre-requisite. The course requires basic knowledge of linear algebra, calculus, probability, (un)supervised learning, and programming experience in Python (used throughout the classes). Familiarity with notions such as vector, matrix, matrix-vector multiplication, inner product and distance of vectors, transpose and inverse of a matrix, eigenvalue, eigenvector, derivative of a function, probability mass/density function, some formulation of regression, classification and clustering is beneficial.

Course content

Graphs are among the most widely-used data structures in machine learning. Their power comes from the flexibility of capturing relations (edges) of collections of entities (nodes) which arise in a variety of contexts including economic, communication, transportation, citation, social, neuron, computer, or particle networks, knowledge, scene or code graphs, molecules or 3D shapes. Graphs naturally generalize unstructured vectorial data and structured data such as time series, images or bags of entities. The goal of this course is to provide an overview of the fundamental computational methods leveraging this additional relational structure and leading to improved prediction. We will cover examples and techniques for node classification (which can be applied for example to determine whether a user is a bot, to classify the topic of papers, or to determine the function of proteins), link prediction (for instance to recommend content on online platforms, to complete knowledge graphs, or to predict drug side-effects), clustering and community detection (for example to determine collaborating communities in citation networks, or to reveal fraudulent groups of users in financial transaction networks), graph classification / regression / clustering (for instance to predict the toxicity or the solubility of molecules, or to detect malicious computer programs) and graph generation (e.g. for drug discovery or material design).

We will cover the following topics:

  1. types and representation of graphs, examples of prototype tasks tackled,
  2. basic graph statistics for node classification, neighborhood overlap statistics for link prediction, PageRank,
  3. spectral methods,
  4. traditional dimensionality reduction techniques,
  5. learning node embeddings, encoder-decoder framework, factorization-based methods, random walk embeddings,
  6. extension of node embeddings to multi-relational data, knowledge graphs,
  7. node embedding with graph neural networks (GNNs), message passing framework, extension to graph-level embedding,
  8. practical hints for GNNs, relation to approximate graph isomorphism tests,
  9. R-convolution framework, graph kernels based on bag of structures and information diffusion,
  10. generative graph models: Erdos-Renyi model, stochastic block model, preferential attachment model, generative adversarial network-based techniques.

Teaching

20 hours of lectures and 15 hours of classes in the LT.

This course will be delivered through a combination of lectures and classes totalling a minimum of 35 hours across Lent Term. This course includes a reading week in Week 6 of Lent Term.

Formative coursework

Students will be expected to produce 10 problem sets in the LT.

Indicative reading

  1. William L. Hamilton. Graph Representation Learning. Morgan and Claypool, 2020. [https://www.cs.mcgill.ca/~wlh/grl_book/]
  2. Karsten Borgwardt, Elisabetta Ghisu, Felipe Llinares-Lopez, Leslie O’Bray, and Bastian Rieck. Graph kernels: State-of-the-art and future challenges. Foundations and Trends in Machine Learning, 13(5-6):531-712, 2020. [https://www.nowpublishers.com/article/Details/MAL-076, https://ieeexplore.ieee.org/document/9307216, https://arxiv.org/abs/2011.03854]
  3. Mark Newman. Networks. Oxford University Press, 2018. [https://global.oup.com/academic/product/networks-9780198805090?cc=gb&lang=en&]

Assessment

Project (80%) in the LT.
Continuous assessment (10%) in the LT Week 4.
Continuous assessment (10%) in the LT Week 9.

Two problems sets submitted by students will be assessed (20% in total). In addition, there will be a graded take-home research project (80%) which will be completed by the students individually, in which they will demonstrate the ability to apply and train an appropriate model to a specific problem and dataset using principles they have learnt in the course. 

Key facts

Department: Statistics

Total students 2021/22: Unavailable

Average class size 2021/22: Unavailable

Controlled access 2021/22: No

Value: Half Unit

Guidelines for interpreting course guide information

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills