Using Reproducible Analytical Pipelines (RAP) to improve statistics

This is a case study for Principle V4: Innovation and improvement.

In 2021, OSR published its review on Reproducible Analytical Pipelines: Overcoming barriers to adoption. The Reproducible Analytical Pipeline, also referred to as RAP, is a set of principles and good practices for data analysis and presentation.  

RAP was developed by statistics producers in the Department for Culture, Media and Sport and the Government Digital Service in 2017 as a solution to overcome several problems: in particular, time-consuming and error-prone manual processes, and an overreliance on spreadsheets and proprietary software for data storage, analysis and presentation. RAP combines modern statistical tools with software development good practice to carry out all the steps of statistical production, from input data to the final output, in a high quality, sustainable and transparent way.  

A minimum standard for RAP was developed by the Best Practice and Impact Team (now the Analysis Standards and Pipelines (ASAP) team). Its components are:

  • Peer review to ensure the process is reproducible and identify improvements 
  • No or minimal manual interference, for example copy-paste, point-click or drag-drop steps – instead the process should be carried out using computer code which can be inspected by others 
  • Open-source programming languages, such as R or Python, for coding so that processes do not rely on proprietary software licences and can be reproduced by statistics producers and users 
  • Version control software, such as Git, to guarantee an audit trail of changes made to code 
  • Publication of code, whenever possible, on code hosting platforms such as GitHub to improve transparency 
  • Well-commented code and embedded documentation to ensure the process can be understood and used by others 
  • Embedding of existing quality assurance practices in code, following guidance set by organisations and the GSS 

These fundamental principles that form the basis for the minimum standard can be further enhanced – for example by writing code in modular functions that allow for reuse, or introducing unit tests to ensure that code works as expected. It is also important to note that adopting RAP principles is not necessarily about incorporating all of the above – implementing just some of these principles will generate valuable improvements. 
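As an illustration only, the enhancements described above might look like the following minimal Python sketch: a small, reusable function paired with a unit test that checks it works as expected. The function name and figures are hypothetical and not taken from any real pipeline.

```python
# Hypothetical sketch of a modular, testable pipeline step.
# The function and the values used in the test are illustrative only.

def calculate_percentage_change(old_value, new_value):
    """Return the percentage change from old_value to new_value."""
    if old_value == 0:
        raise ValueError("old_value must be non-zero")
    return (new_value - old_value) / old_value * 100


def test_calculate_percentage_change():
    # Unit test: check the function behaves as expected on known inputs,
    # so the step can be reused with confidence elsewhere in the pipeline.
    assert calculate_percentage_change(200, 250) == 25.0
    assert calculate_percentage_change(100, 90) == -10.0


test_calculate_percentage_change()
```

Writing each step as a function like this means the same logic can be reused across tables, and the accompanying test documents the expected behaviour for peer reviewers.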

RAP benefits – enabling innovation and improvement in official statistics – the ONS Centre for Crime and Justice (CCJ) 

The Nature of Crime data tables produced by the Centre for Crime and Justice (CCJ) at ONS previously relied heavily on Excel and SPSS. To reduce manual effort, save time and improve reproducibility, the CCJ replaced the existing process with R and Python code and introduced Git for version control.  

Implementing RAP principles significantly reduced the time taken to produce the statistics: what was originally three weeks’ work for thirteen analysts was reduced to under an hour’s work for one analyst. The CCJ was also able to create new analysis more quickly (for example, it took an hour to add nine new tables to the Python pipeline).  

With the time saved, the CCJ focused on providing more value for users – publishing historic time series, adding more measures and granularity to the tables, and developing its survey processes to provide new crime estimates about COVID-19. The team adapted the code for this project in order to automate the production of other statistics, such as those on violent crime. Overall, implementing RAP allowed the CCJ to continue to meet its existing output commitments whilst freeing up resources to focus on meeting user needs.  

The code for the crime tables is available on GitHub and the team has blogged about its RAP transformation.   

Planning how to implement RAP principles – Statistics producers should be empowered to develop RAPs themselves  

The process to achieve the above results involved demonstrating the efficiency and quality improvements to senior leaders at the CCJ who then established a team to deliver further RAP developments. With agreement from their line managers, the members of staff who were interested dedicated two days a week to this team. Support from the Deputy Director and other senior leaders was essential in protecting this time commitment and prioritising development work among competing priorities. This level of senior support also meant that analysts felt more able to get involved in the project in the first place. 

To support the development work, the Good Practice Team (GPT), now ASAP, provided mentoring and training, which helped to embed RAP knowledge and skills within the CCJ. Despite some initial apprehension about implementing RAP, team members became confident in their new skills, felt proud of their work, and have since created and shared their own crime_analysis package. The CCJ has applied this approach to offering mentoring internally, without GPT support, and continues to focus on skills development across the division. For example, the CCJ now uses a pair-programming technique to quality assure code and has created a bespoke RAP learning pathway specific to the team’s data and table production processes. 

This example shows how producers can enable innovation and improvement in official statistics when they are empowered to develop RAP in their areas. With commitment and support from senior managers to implement RAP principles, the team has been able to continue meeting its existing output commitments while using its newly freed-up resources to focus on meeting new user needs.

Improving the clarity, comparability, and transparency of UK homelessness statistics

This is a case study for Principle V3: Clarity and Insight.

Homelessness is a devolved matter in the UK, with different legislative and policy requirements, so homelessness statistics are produced by each UK country separately and are drawn from administrative data systems. As such, there are significant differences between the official or government statistics in each country and information about comparability is generally limited. Nevertheless, users are interested in comparing these statistics, for example comparing regions and cities across the UK or understanding the UK picture of homelessness.

The following quote from the Scottish Government’s Homelessness and Rough Sleeping Action Plan, published November 2018, helps to highlight why enhanced clarity and insight is needed from the statistics produced on this topic:

“Everyone needs a safe, warm place they can call home. Home is more than a physical place to live. It’s where we feel secure, have roots and a sense of belonging. Home supports our physical and emotional health and wellbeing and to be without one seems unthinkable. Yet for too many people this is their reality as they face the blight of homelessness.”

Harmonisation, the process of increasing comparability and coherence of statistics, is an important enabler of cross UK comparisons, and can help avoid unnecessary confusion and erroneous comparisons. As stated in the Office for Statistics Regulation systemic review on Housing and Planning Statistics, published in November 2017, ‘transparent information about statistical definitions and methods, together with judgements about strengths and limitations, is essential in supporting users’ confidence in statistics’. Homelessness can be highly politicised and attract wide user attention. Therefore, it is important that users understand what is being measured, the extent to which it is comparable with related statistics, and the limitations of the statistics.

The GSS Strategy Delivery Team and GSS Harmonisation Team undertook a collaborative piece of work to address these issues. A cross-GSS Homelessness and Rough Sleeping Statistics Group was created to strengthen links across departments and the devolved administrations, and encourage collaboration.

In February 2019, after extensive stakeholder engagement with government departments, the devolved administrations, academics and third sector organisations, the GSS Harmonisation Team produced a report investigating the feasibility of harmonising UK definitions of homelessness. This report identified the different definitions of homelessness in use across the UK and assessed what can be done to improve the clarity, comparability and transparency of homelessness statistics.

This research concluded that although a general definition for homelessness could be created, developing a harmonised definition that government departments and the devolved administrations could incorporate into their statistics is challenging. The feasibility report was the first step in helping to provide transparency about the comparability of official homelessness statistics across the UK.

In September 2019, following the recommendations of the feasibility report produced earlier in the year, the GSS Harmonisation Team published an interactive tool for UK homelessness statistics to explain the comparability of homelessness statistics in a user-friendly format. This involved collaboration with statistics producers in the four UK countries and a wide array of stakeholders. In addition to this, as part of the wider work on coherence of Housing and Planning statistics, the GSS Strategy Delivery Team published an article bringing together existing homelessness data sources from across the UK to assess comparability, coherence and data limitations and to begin to identify patterns and trends for the UK as a whole. The article utilised the interactive tool as a framework for making homelessness comparisons and identifying UK trends. The GSS Harmonisation Team are also developing guidance on comparability for the statistical publications, which will help users to better understand the processes and legislation behind the different statistics, and where comparisons can and can’t be made.

This example shows how cross-ONS collaboration between the GSS Harmonisation Team and GSS Strategy Delivery Team is enhancing the clarity and insight provided by UK homelessness statistics. By engaging with the various producers to understand and document the extent of comparability and consistency of the different sources, and developing a UK wide perspective, their work is supporting users in the appropriate interpretation and use of the various homelessness statistics produced across the UK and helping to provide clarity in an area of significant public and policy concern.

Developing harmonised national indicators of loneliness

This is a case study for Principle Q2: Sound methods.

In 2018, in response to the manifesto published by the Jo Cox Commission on Loneliness, the Prime Minister called loneliness “one of the greatest public health challenges of our time”. A consistent approach is therefore needed to measuring how loneliness affects people’s lives and who is most susceptible to it. The Prime Minister tasked the Office for National Statistics (ONS) with developing the evidence base and national indicators of loneliness, suitable for use in major studies, to inform future policy in England.

The harmonisation of the new loneliness indicators was important for enabling more surveys to measure loneliness in the same way, in order to build a better evidence base more quickly. This is needed to enable a better understanding of what factors are most associated with loneliness, what the effects of loneliness are for different people, and how it can be prevented or alleviated. As this is a devolved matter, ONS took this work forward for England, with scope for future work to harmonise across the Devolved Administrations.

In December 2018, following consultations with key stakeholders and experts, and extensive collaboration with the ONS Quality of Life team, the GSS Harmonisation Team published the Harmonised Principles for measuring loneliness. The principles can be used to measure loneliness using any survey or administrative data source, which ensures a consistent approach can be adopted across major studies to inform future policy in England.

After identifying the need for indicators across all ages, the GSS Harmonisation Team agreed upon two sets of indicator questions and one direct loneliness question. The first set of four indicator questions is recommended for use with adults, while an alternatively worded set is recommended for use with children. The questions were tested and then used in several established surveys using different survey modes, including paper self-completion (English Longitudinal Study of Ageing), online self-completion (Community Life Survey, Good Childhood Index Survey), and telephone interview (Opinions Survey).

All four questions are also due to be adopted on the:

And the direct loneliness question is due to be included on the:

Given the important link between health and loneliness, there is also ongoing work with various agencies including Public Health England, NHS England and NHS Digital to include the loneliness measures in key surveys, such as the Health Survey for England. Work is also ongoing to continue harmonisation of the loneliness indicators across the GSS, including consultation with the Devolved Administrations.

This example shows how the GSS Harmonisation Team has worked effectively with statistics producers across government and experts in loneliness measurement, to develop consistent methods for measuring loneliness in both adults and children. These measures can then be adopted in a comparable way across major studies to help inform effective government policy responses in this area of current public debate.

Developing and refining UK House Price Index methods

This is a case study for Principle Q2: Sound methods.

The UK House Price Index (UK HPI) has been published since June 2016 and is produced by HM Land Registry in partnership with the Office for National Statistics (ONS), Registers of Scotland and Land and Property Services Northern Ireland (referred to as HM Land Registry and partners).

The method used to produce the UK HPI was originally published in Development of a single Official House Price Index which set out the rationale for the approach, the data sources used and how it complied with international standards. It also considered users’ questions raised during an earlier methods consultation and from a peer review conducted by the Government Statistical Service Methodology Advisory Committee.

Each month, the UK HPI presents a first estimate of average house prices in the UK based on the available sales transactions data for the latest reference period. The first estimate is then updated in subsequent months as more sales transaction data become available for inclusion in the calculation.

In March 2017, there was a large increase in the magnitude of revisions between first and subsequent estimates of annual change to average house prices. This negatively affected some users’ confidence in the UK HPI as they were unable to understand or explain house price trends using the first estimate with certainty. After investigating, ONS established that the revisions were being driven by volatility in new build property prices, compounded by an operational backlog at HM Land Registry in registering new build sales transactions.

HM Land Registry and partners took steps to improve the methods by changing the calculation for the first estimate to reduce its sensitivity to the impact of new build transactions. The approach was developed by GSS methodologists, and several options were tested before a final one was chosen.

HM Land Registry and partners communicated the method change to users prior to its implementation through the About the UK HPI section of the UK HPI release, a blog, and later produced an enhanced Quality and Methodology report which includes details of the impact of the changes and supporting analysis. Details about the HM Land Registry operational backlog have also been included in Section 4.4 of About the UK HPI, with a reference to HM Land Registry’s speed of service and its future plans, which present information about average completion times for new build registrations.

As a result, the scale of revisions to the first estimate of UK HPI annual change in average house prices has reduced and is more stable over time. HM Land Registry and partners, and UK HPI users, are now more assured that delays in processing new build registrations are not adversely impacting the robustness of the UK HPI first estimates.

HM Land Registry and partners also compare UK HPI with other non-official house prices indices to identify and explain any differences between the series, and publish their analyses in an annual article Comparing house price indices in the UK.

This example shows how HM Land Registry and partners have transparently developed UK HPI’s methods by collaborating with relevant experts during their development, informed users in advance about methods changes with clear reasons and explanations of their impact, and published supporting information that helpfully sets out the rationale behind their various decisions.

Clarity and insight in government statistical outputs

This is a case study for Principle V3: Clarity and insight.

In 2018, the Royal Statistical Society (RSS) selected the winner and two runners-up of the Campion Award for Official Statistics. The purpose of the award is to recognise outstanding innovations or developments in official statistics that improve the user experience.

  • The Department for Environment, Food and Rural Affairs (Defra) won for The Future Farming and Environment Evidence Compendium published in February 2018. According to the judging panel, Defra showed excellent use of administrative data with a direct impact on policy and communication with users.
  • The Northern Ireland Statistics and Research Agency (NISRA) received plaudits for Northern Ireland Multiple Deprivation Measure 2017 (NIMDM2017), the official measure of deprivation in Northern Ireland. Judges thought the Measure brings together complex data and presentation at a very local level in a politically sensitive context.
  • The Office for National Statistics (ONS) and Home Office (HO) were recognised for their 2017 joint article, What’s happening with international student migration. The judging panel thought this piece of work dealt with a sensitive matter of real public interest and that the statistics both informed the debate and corrected a misunderstanding.

All three are excellent examples of statistics that are presented clearly and explained meaningfully.

Defra’s compendium brings together data and statistics from a variety of sources and analysis such as statistical outputs, scientific research, operational research and economics. The output is an impressive effort compiling information from a wide array of sources and distilling that information into a digestible, interesting narrative. The compendium, through data and colourful graphics, demonstrates the importance of farm economics, food production, and environmental land management to help the reader understand UK agriculture and its contribution to the economy.

The newly developed NIMDM 2017 analysis package and online interactive maps provide users with robust insights into deprivation. The analysis package, for example, is simple to use, provides examples of what a user can do with the data and encourages interrogation. Users can break down the data by income, employment, health and disability and other deprivation domains that combine to produce the overall multiple deprivation measure. Other Excel spreadsheets are available for different geographies such as Wards and Assembly Areas.

The ONS/HO article is an update on progress towards developing a better understanding of student migration to and from the UK. The article is a major part of ONS’s work plan, in response to debates on student migration. ONS and Home Office statisticians analysed Home Office Exit Checks data to examine what happens to non-EU students when their visas have expired following their studies; whether they leave the UK or remain by extending their visas. Using new data sources to provide a complete and coherent picture of international student outcomes was one of the requirements of our compliance check of ONS’s long-term student migration estimates. The ONS/HO analysis goes some way towards building that picture – it provides insights into a sensitive political issue that attracts intense public and media interest.

The Defra and NISRA outputs provide clarity and insight by presenting relevant statistics and data in a clear and valuable way that enables use by all types of users. The ONS/HO article makes use of alternative data sources and explains the issues to generate insights into an important topic.

Publishing information about data quality assurance processes

This is a case study for Principle Q3: Assured quality.

The Consumer Price Index including Owner Occupiers’ Housing Costs (CPIH) is published monthly by the Office for National Statistics (ONS) in its UK Consumer Price Inflation bulletin.

ONS publishes information about the quality of the Valuation Office Agency (VOA) private rents data, which are used to estimate owner occupiers’ housing costs, a key component of the inflation measure:

ONS communicates clearly with VOA to understand the quality assurance of these data. ONS is currently looking into gaining access to the private rents microdata, using the powers granted through the Digital Economy Act 2017. This is expected to help ONS further understand data quality issues.

In addition, ONS has developed several comparative analyses to provide assurance to itself and to users about the behaviour of CPIH:

  • One analysis compared different methods of estimating owner occupiers’ housing (OOH) costs
  • Another analysis compared the CPIH private rents data with other data sources

By publishing clear and detailed information about data quality assurance and embedding quality assurance practices in its production process, ONS provides reassurances to itself and users about the quality of the data used to produce CPIH.