
Features

Our spotlight series

"When survey science met web tracking: presenting an error framework for metered data" - Oriol Bosch-Jover, fourth-year PhD candidate in Social Research Methods


Each half-term, we showcase one piece of work from our faculty or research students.

In our first feature of 2023, Oriol Bosch-Jover talks us through his recent publication, "When survey science met web tracking: presenting an error framework for metered data". The paper was published in the Journal of the Royal Statistical Society: Series A in November 2022.

 

First of all, could you tell us about the context in which your research sits?

Given the widespread adoption of the Internet, measuring what people consume and do online is crucial across almost all areas of social science research. Just think about all the debate around the prevalence and potential dangers of online misinformation. To measure online phenomena, researchers are increasingly relying on digital trace data, under the assumption that the non-intrusive, objective, and granular nature of these traces will lead to higher-quality statistics than those obtained through surveys.

Although this kind of data may well be of higher quality than surveys, with potentially transformative consequences for many fields, most research to date has embraced digital trace data uncritically, without assessing its quality in the way we assess surveys. This is problematic, since errors could distort the conclusions and policy decisions reached using this data. Hence, the standards need to be raised: we need to develop quality standards and best practices for using digital traces in social research, similar to those that exist for surveys.

 

What is web tracking / metered data? What are its benefits and drawbacks?

Web tracking data / metered data (maybe we should agree on one umbrella term!) is one of the most common types of digital trace data used in the social sciences. It documents the URLs and apps that individuals visit online, along with extra information such as the time they spend there or the HTML content that they see. This information is collected using digital tracking solutions. These solutions, sometimes called meters, are a heterogeneous group of tracking technologies that participants can install, upon agreement, on their computers, smartphones, and tablets.
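
To make the unit of observation concrete, here is a sketch of what a single metered observation might look like. The field names and structure are illustrative, not the schema of any specific meter:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of one metered observation; real meters differ in
# which fields they capture and how they name them.
@dataclass
class MeteredEvent:
    participant_id: str    # pseudonymous ID linking traces to survey data
    device: str            # e.g. "android-phone" or "windows-pc"
    url: str               # URL or app identifier visited
    start_time: str        # ISO 8601 timestamp of the visit
    duration_seconds: int  # time spent on the page or app
    html: Optional[str]    # captured page content, if the meter records it

event = MeteredEvent(
    participant_id="p-0001",
    device="android-phone",
    url="https://news.example.org/article",
    start_time="2022-11-03T18:24:05Z",
    duration_seconds=142,
    html=None,
)
print(event)
```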

Web tracking data presents many benefits. To name a few: metered data is objective, meaning that we do not need to rely on individuals remembering what they did online. It is very granular, which allows researchers to collect more information than would be possible with surveys (e.g., we cannot ask thousands of questions on a single questionnaire, every day). And it is collected in real time, allowing researchers to analyse live events and external shocks. However, not everything is perfect. Although little is known about the drawbacks of this kind of data, some things are clear. On the one hand, installing a meter can be perceived as intrusive and even burdensome by some individuals. Those willing to participate in such studies might therefore differ from the general population, leading to biases. On the other hand, metered data depends on highly complex technologies that are often not designed for research. The complexity and novelty of these technologies, combined with their dependency on what devices and operating systems allow them to track, make them prone to errors. For instance, iOS devices are considerably harder to track, with the few available approaches producing lower-quality data than that coming from Android devices.

 

At what stages can errors be introduced into a study? How can researchers recognise and report the decisions they make and errors they encounter?

At any stage! Every decision that a researcher must make when collecting web tracking data can introduce errors. Obviously, some decisions are more complex than others, mainly because little evidence exists about their associated errors or how to tackle them. For instance, the stages related to developing or choosing a technology, installing it on people's devices, and then collecting the data are key to any project, but very little is known about how to do them properly.

Part of the problem that most metered data research faces now is that researchers are designing their data collection strategies in the dark. Researchers cannot recognise and report errors they encounter without a clear understanding of what those errors might look like. That is precisely the gap we are filling with our Total Error framework for digital traces collected with Meters (TEM). By following our framework, researchers can 1) document the decisions made at each stage of the process, and 2) be aware of the errors that they might encounter when making those decisions.

For instance, a key stage of any web tracking project is making sure that participants are tracked on all the devices they use to go online. If this is not achieved, researchers will miss part of what people do online, leading to potential biases. By being mindful of that, researchers can clearly define which devices need to be tracked and try to maximise their coverage. If this is not possible or is out of their control, they should collect auxiliary information to assess the proportion of participants affected by what we call tracking undercoverage (not being tracked on all the devices used to go online) and report it, as is done for nonresponse in surveys.
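
As a toy illustration (mine, not a procedure from the paper), the rate of tracking undercoverage could be estimated by comparing the devices each participant reports using against the devices the meter actually covers. The data and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical auxiliary data: survey responses list the devices each
# participant uses to go online; meter logs list the devices tracked.
participants = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "devices_used":    [{"pc", "phone"}, {"phone"},
                        {"pc", "phone", "tablet"}, {"pc"}],
    "devices_tracked": [{"pc", "phone"}, {"phone"},
                        {"pc"}, {"pc"}],
})

# A participant is undercovered if at least one device they use to go
# online is not covered by the meter.
participants["undercovered"] = [
    not used.issubset(tracked)
    for used, tracked in zip(participants["devices_used"],
                             participants["devices_tracked"])
]

# Report the rate alongside the data, as one would report nonresponse
# rates for a survey.
print(f"Tracking undercoverage rate: "
      f"{participants['undercovered'].mean():.0%}")
```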

 

Could you tell us about the case study you presented as part of this paper?

An error framework is only useful insofar as it can be applied in real life. Therefore, we teamed up with international scholars to use the TEM to design the Triangle of Polarisation, Political Trust, and Political Communication (TRI-POL) project (https://www.upf.edu/web/tri-pol). This international project, funded by the Spanish Ministry of Science and Innovation and the BBVA Foundation, is the first attempt to collect both longitudinal survey and metered data from the same individuals, to understand whether and how online behaviours are related to affective polarisation across Southern European and Latin American countries. Specifically, TRI-POL consists of a three-wave survey conducted in Argentina, Chile, Italy, Portugal, and Spain, matched at the individual level with metered data. TRI-POL is also the first project combining surveys and web tracking data to make all the data available open access, allowing any researcher to study this and other phenomena. Anyone can find and use the data and the data protocols here: https://osf.io/3t7jz/. By using the TEM, TRI-POL is the first ever project designed to acknowledge the errors of metered data, with strategies in place to minimise, quantify, and report those errors. This will hopefully help set the standard for future research using web tracking data.

 

And what were the key findings or outcomes of this piece of work?

The main finding of this paper is that collecting high-quality metered data is complex and involves a high degree of uncertainty. The paper shows, both theoretically and empirically, that web tracking data is indeed affected by bias-inducing errors. It makes sense: it is a complex endeavour! This does not imply that metered data should not be used, or that previous research is necessarily biased. It means, instead, that working with metered data requires a high degree of care and transparency, as well as best practices for design, analysis, and reporting. What would you say about a survey that does not report its mode or response rate? That is what has happened with almost all research published so far.

The TEM is in itself the main outcome. It is a tool that can be used not only to conduct better research with web tracking data, but also as a foundation for future empirical research.

 

Finally, what will you be working on next?

Now that we know that digital trace data can be affected by a plethora of errors, the next step is to quantify them!

Right now, I am focusing on developing approaches to estimate the quality of digital trace data in a way comparable to surveys, by combining traditional psychometric techniques with computational methods. For instance, I have developed a method that predicts the true-score reliability and validity of measures created with digital trace data based on the characteristics of those measures. This is possible by using Structural Equation Modelling techniques and random forests of regression trees at the same time. Results so far are quite enlightening: (1) the reliability and validity of digital trace data measures are, on average, not that different from those of survey measures, and (2) design decisions do make a big difference in the resulting data quality.
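
A minimal sketch of the machine-learning half of that idea, under stated assumptions: the measure characteristics used as features are hypothetical, and the reliability values used as training targets are simulated here, whereas in the approach described they would come from the SEM side:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200  # hypothetical set of digital trace measures

# Hypothetical design characteristics of each measure: tracking window
# in days, number of domains aggregated, and whether the measure counts
# visits (0) or time spent (1).
X = np.column_stack([
    rng.integers(1, 91, n),
    rng.integers(1, 500, n),
    rng.integers(0, 2, n),
])

# Reliability estimates per measure; in the described approach these
# would come from SEM / true-score models, simulated here for the demo.
y = np.clip(0.4 + 0.004 * X[:, 0] + 0.1 * X[:, 2]
            + rng.normal(0, 0.1, n), 0, 1)

# A random forest of regression trees maps design characteristics to
# predicted reliability, so the quality of a new measure can be
# anticipated before any data is collected.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5, scoring="r2").mean())
```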

Additionally, I am working on a paper that will showcase how to simultaneously estimate the measurement errors in survey and digital trace data, by adapting the MultiTrait-MultiMethod (MTMM) approach to the specific characteristics of digital trace data (i.e., zero-inflation). This should allow us to better understand the size of the measurement errors in digital trace data, and when it might be best to use each data source.
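
For intuition only (this is not the paper's estimator), the true-score MTMM logic decomposes the variance of an observed measure into trait, method, and random-error components. A toy decomposition with made-up loadings:

```python
# Toy true-score MTMM variance decomposition (illustrative only):
# observed = trait component + method component + random error.
def mtmm_quality(trait_loading: float, method_loading: float,
                 error_var: float) -> dict:
    trait_var = trait_loading ** 2
    method_var = method_loading ** 2
    total_var = trait_var + method_var + error_var
    reliability = (trait_var + method_var) / total_var  # systematic share
    validity = trait_var / (trait_var + method_var)     # trait share of it
    return {"reliability": reliability,
            "validity": validity,
            "quality": reliability * validity}          # = trait_var / total

# Made-up numbers: a survey item vs. a digital trace measure of the same
# trait, where the trace measure has less method variance but more noise.
print(mtmm_quality(trait_loading=0.8, method_loading=0.3, error_var=0.4))
print(mtmm_quality(trait_loading=0.8, method_loading=0.1, error_var=0.6))
```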
