National Statistics Disclosure Control: Now and the future
Tuesday 11 January 2005
Where available, presentations can be accessed by clicking on the link next to the programme item. All presentations are copyright the authors.
A British Society for Population Studies day meeting, held in the Graham Wallas Room at the London School of Economics, Houghton Street, London WC2A 2AE on 11 January 2005.
10.00am Registration and coffee
10.30am Introduction by the Chair. John Hollis, Greater London Authority
10.45am Where are we now? Keith Spicer, Office for National Statistics, with a Scottish view from Frank Thomas, GROS. Spicer Presentation. Thomas presentation.
11.30am Comfort break
11.35am Len Cook: hero or zero of the 2001 Census? A look at the impact of disclosure control on aggregate census outputs. Paul Williamson, University of Liverpool
12.20pm Impact of Disclosure Control on Labour Market Statistics. Jill Tuffnell, Cambridgeshire County Council. Presentation
1.00pm Lunch break
2.00pm The effects of small cell adjustment on origin-destination data in the 2001 census. Oliver Duke-Williams, University of Leeds, with additional material from Eileen Howes, Greater London Authority. Presentation
2.45pm Statistical Disclosure Control Methods for Census Outputs. Natalie Shlomo, Office for National Statistics. Presentation
3.30pm Panel discussion. All speakers. Chaired by Angela Dale, CCSR, University of Manchester, who will also comment on the progress of household SARs.
Report of the meeting
This meeting was organised by Oliver Duke-Williams and John Hollis to bring users of Census and other data together with representatives from the national statistical agencies. The meeting was very well attended, with all places reserved some time before the date of the meeting.
The meeting commenced with an introduction and welcome from John Hollis of the Greater London Authority.
The first presentation was from Keith Spicer of the Office for National Statistics (ONS), who set the context for the session by discussing ONS' current policies, and the confidentiality policy and legislative framework within which they operate. Keith described two facets of disclosure: identity disclosure, in which an individual's identity is revealed, and attribute disclosure, in which additional characteristics relating to a known individual are revealed. Keith argued that the nature of disclosure risk had changed since earlier Censuses, due to the policy decision to make results easily available via the Web, and the widespread electronic storage of many data sets. For the release of data from the 2001 Census, ONS used a combination of strategies: pre-tabulation measures including record swapping, the post-tabulation measure of small cell adjustment (SCA), and thresholds on area size. It was agreed that the procedure followed with the 2001 Census had not been perfect, especially with changes to the statistical disclosure control (SDC) methodology being introduced at a late stage. Users were advised to calculate desired values by adding together the minimum number of components; this might also include subtracting cells from a larger initial value.
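The advice to use the minimum number of components can be illustrated with a minimal sketch. It assumes a hypothetical adjustment rule in which counts of 1 or 2 are replaced by 0 or 3; this rule, the function names and the ward counts are all illustrative assumptions and not the procedure ONS applied.

```python
import random

def adjust_small_cell(count, rng):
    # Hypothetical rule (an assumption, not the ONS algorithm):
    # counts of 1 or 2 become 0 or 3, chosen so the expected value is unchanged.
    if count in (1, 2):
        return 3 if rng.random() < count / 3 else 0
    return count

def worst_case_error(n_components, per_cell_bound=2):
    # Under the rule above each published cell is off by at most 2, so a
    # total built from n independently adjusted cells is off by at most 2n.
    return n_components * per_cell_bound

rng = random.Random(2001)
wards = [1, 2, 7, 1, 0, 2, 12, 1, 2, 4]  # made-up true counts for ten wards
published = [adjust_small_cell(c, rng) for c in wards]

# Total for nine of the ten wards, obtained in two ways:
bound_by_adding = worst_case_error(9)       # add nine ward cells
bound_by_subtracting = worst_case_error(2)  # district total minus one ward
```

Under these assumptions the subtraction route uses two published cells rather than nine, so its worst-case error is correspondingly smaller.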
The release of microdata such as the Samples of Anonymised Records (SARs) brings separate concerns. Keith described the process of Post-Randomisation (also known as PRAMming) that has been applied to the individual SAR, in which some 'risky' variables have been perturbed on an undisclosed percentage of records. A description was also given of the proposals for in-house access to a richer SAR data set, at sites in London and Titchfield.
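The general idea of post-randomisation can be sketched as follows. The change probability, the category list and the rule of choosing uniformly among the other categories are assumptions made for illustration; the parameters actually applied to the SAR were not disclosed.

```python
import random

def pram(values, categories, change_prob, rng):
    """Perturb a categorical variable record by record.

    With probability change_prob a value is replaced by a different
    category chosen uniformly at random (an illustrative assumption).
    """
    perturbed = []
    for value in values:
        if rng.random() < change_prob:
            alternatives = [c for c in categories if c != value]
            perturbed.append(rng.choice(alternatives))
        else:
            perturbed.append(value)
    return perturbed

rng = random.Random(42)
occupations = ["clerical", "manual", "professional", "manual", "clerical"]
categories = ["clerical", "manual", "professional"]
released = pram(occupations, categories, change_prob=0.1, rng=rng)
```

Because a user cannot tell which records were changed, an apparent match on a perturbed variable no longer confirms an individual's identity.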
Keith's talk was complemented by a short presentation by Frank Thomas of the General Register Office, Scotland. The approach taken in Scotland differed in a number of key areas from that taken elsewhere in the UK: the target size and threshold size for small areas were lower in Scotland, and (with the exception of some workplace tables) small cell adjustment was not used. Frank argued that record swapping was ineffective at preventing disclosure of population uniques, and suggested that over-imputation should be used for the results of the 2011 Census.
Paul Williamson from the University of Liverpool gave an interesting presentation in which he applied various SDC methods to sets of base data, and compared the numbers of cells modified, and the effects on various types of analysis including generation of percentages, rank correlation and bi-variate and multi-variate regressions. The methods compared were SCA, rounding to base 3 and to base 5, and two types of 'Barnardisation', the method that was used with the small area results of the 1991 Census. Paul's results showed that for many purposes, rounding the results of the 2001 Census to base 3 (with table marginals being independently rounded) would have resulted in less information loss than the SCA approach actually used. In all cases, Barnardisation would have been preferable from an analytical point of view.
Paul concluded that for rate-based analysis, the decision to use SCA as opposed to rounding all values to base 3 was 'right', but for count-based analysis, the decision was 'wrong'. However, this assessment raises the question of whether or not additional modification to the data was justified: many Census users believe that under-enumeration, coupled with the edit and imputation process, already creates sufficient noise in the data. More research was required, it was argued, on issues of public perception of the collection and analysis of data.
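Two of the methods compared in Paul's talk can be sketched in a few lines. The sketch assumes unbiased random rounding as the rounding variant, and a form of Barnardisation that adds -1, 0 or +1 to non-zero cells with a hypothetical probability p; the report does not state which variants or parameters were tested.

```python
import random

def random_round(count, base, rng):
    # Unbiased random rounding (assumed variant): round up with
    # probability remainder/base, otherwise round down.
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    return lower + base if rng.random() < remainder / base else lower

def barnardise(count, rng, p=0.2):
    # Assumed form: zero cells are left alone; other cells get
    # -1 with probability p, +1 with probability p, else no change.
    if count == 0:
        return 0
    u = rng.random()
    if u < p:
        return count - 1
    if u < 2 * p:
        return count + 1
    return count

rng = random.Random(1991)
table = [0, 1, 4, 7, 12]  # made-up cell counts
rounded_base3 = [random_round(c, 3, rng) for c in table]
rounded_base5 = [random_round(c, 5, rng) for c in table]
barnardised = [barnardise(c, rng) for c in table]
```

Rounding moves a cell by up to base minus one, whereas this form of Barnardisation moves it by at most one, which is consistent with the smaller analytical impact reported for Barnardisation.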
Jill Tuffnell of Cambridgeshire County Council talked about labour market statistics, and highlighted problems caused by the use of disclosure control on industrial sector data. Concerns that publication of data by detailed industrial classifications may be disclosive to businesses have meant that there are no published tables containing industry data at ward level. This has resulted in local authorities being unable to generate the statistics that they require for planning purposes. The problem is particularly acute when data are required for spatial areas that do not match up with district boundaries (but rather must be aggregated from ward or output area level geographies).
Census data are not the only data produced by ONS that are subject to disclosure control. Jill also described the effect of disclosure control on claimant unemployment data. These are characterised by small numbers in many areas, rendering analysis by population sub-groups (such as age or gender specific groups) useless for many wards.
Oliver Duke-Williams, from the University of Leeds, described the effects of small cell adjustment on one group of data sets released from the 2001 Census: the interaction or origin-destination data. These data sets include the Special Migration Statistics and Special Workplace Statistics; the data themselves are characterised by the extreme dominance of small values. As a result, these data sets are the ones in which the effects of SCA are most noticeable. Oliver presented a number of graphs showing the distribution of values in these data sets. The effects of SCA remain apparent even when data for several small areas are aggregated together. One approach for overcoming the apparent problems of SCA is to generate mean values from several independently adjusted cells that all purport to show the same value. The resulting distributions of values fit expected distributions closely, and their research worth is now being investigated.
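The averaging approach can be sketched as below, assuming that the same flow appears in several tables, each adjusted independently. The published values are made up for illustration and the function name is hypothetical.

```python
def pooled_estimate(adjusted_values):
    # Mean of several independently adjusted versions of one true count.
    return sum(adjusted_values) / len(adjusted_values)

# The same origin-destination flow as published in six different tables
# (made-up values; each table is assumed to be adjusted independently).
same_flow = [0, 3, 3, 0, 3, 3]
estimate = pooled_estimate(same_flow)
```

If the adjustment is unbiased, the mean moves towards the unadjusted count as more independent versions of the cell become available.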
Oliver's presentation was complemented by a presentation from Eileen Howes of the Greater London Authority. Eileen concentrated on results for London drawn from the Special Workplace Statistics. Comparisons were made between data aggregated directly from Output Areas, and data drawn from the separately prepared Theme Tables. Although the results should be the same, the SCA process has led to some significant and unpredictable discrepancies. Eileen questioned the fitness-for-purpose of the output area level data, and commented that she would advise users of these data to exercise considerable caution in interpreting the results.
The final main presentation of the day was given by Natalie Shlomo, of ONS, who talked about current methodological research being done with a focus on the 2011 Census. Research was being carried out comparing alternative approaches within a risk-utility framework that takes into account both risks of disclosure and the needs of the user community (with utility being assessed via information-loss measures). Natalie introduced a series of Risk-Information loss graphs, which showed the amounts of information loss for a variety of SDC methods applied with different parameters, and allowed direct comparison between methods.
Natalie argued that the optimum strategy would be a mixture of different methods (pre- and post-tabular methods, as well as perturbative and non-perturbative), and commented that ONS would be soliciting feedback on proposed methodologies.
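A risk-information loss comparison of the kind described can be sketched as follows. Both measures are illustrative assumptions (the share of cells holding a 1 or 2 as a crude risk proxy, and the mean absolute change per cell as information loss); the measures used in the ONS research are not specified in this report.

```python
def information_loss(original, protected):
    # Assumed measure: mean absolute difference per cell.
    return sum(abs(o - p) for o, p in zip(original, protected)) / len(original)

def small_cell_share(table):
    # Assumed risk proxy: proportion of cells containing a 1 or a 2.
    return sum(1 for c in table if c in (1, 2)) / len(table)

original = [1, 2, 7, 0, 12, 2]  # made-up cell counts
candidates = {
    "no protection": original,
    "round all cells to base 3 (one possible outcome)": [0, 3, 6, 0, 12, 3],
    "adjust small cells only (one possible outcome)": [0, 3, 7, 0, 12, 3],
}

# One (risk, information loss) point per method, ready to be plotted.
points = {
    name: (small_cell_share(table), information_loss(original, table))
    for name, table in candidates.items()
}
```

Plotting such points for each method and parameter setting gives a graph in which methods lying closer to the origin offer a better trade-off under the chosen measures.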
The meeting concluded with a general question-and-answer session for all the presenters. This was led by Angela Dale of the University of Manchester, who first described the latest developments in the SARs. As outlined by the first speaker, Keith Spicer, proposals have been circulated for the establishment of a facility for SAR users to access a richer version of the individual sample under controlled conditions at ONS sites in London and Titchfield. A daily charge would apply to researchers using this facility. Angela described the differences between the samples, and also the likely specifications of a controlled access version of the household SAR.