Automating statistical production to free up analytical resources

This is a case study for Principle V4: Innovation and improvement.

The Reproducible Analytical Pipeline (RAP) is an innovation initiated by the Government Digital Service (GDS) that combines techniques from academic research and software development. It aims to automate certain statistical production and publication processes – specifically, the narrative, highlights, graphs and tables. Tailor-made functions work raw data up into a statistical release, freeing up resource for further analysis. The benefits of RAP include:

  • Auditability – the RAP method provides a permanent record of the process used to create the report. Moreover, by using Git for version control, producers have access to all previous iterations of the code. This aids transparency, and the process itself can easily be published
  • Speed – it is quick and easy to update or reproduce the report, and producers can implement small changes across multiple outputs simultaneously. The statistician, freed from repetitive tasks, has more time to exercise their analytical skills
  • Quality – Producers can build automated validation into the pipeline and produce a validation report, which can be continually augmented. Statisticians can therefore perform more robust quality assurance than would be possible by hand in the timeframe from receiving data to publication.
  • Knowledge transfer – all the information about how the report is produced is embedded in the code and documentation, making handover simple
  • Upskilling – RAP gives individuals the opportunity to learn new skills or develop existing ones. It also strengthens teams by drawing on coding skills that may already exist but be underused; such skills are increasingly common, with many STEM students learning to code at university

RAP therefore enables departments to develop and share high-quality reusable components of their statistics processes. This ‘reusability’ enables increased collaboration, greater consistency and quality across government, and reduced duplication of effort.
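The validation and table-building steps described in the benefits above can be sketched in a few lines of code. The example below is a minimal illustration only, not DfT's or any other department's pipeline (those mentioned in this case study are written in R). The data, column names and function names (`load_rows`, `validate`, `build_table`, `render_highlight`) are all hypothetical, chosen to show how raw data might pass through automated checks into a table and a headline sentence.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data: the columns and values are invented for illustration.
RAW_CSV = """quarter,region,incidents
2018Q1,North,12
2018Q1,South,7
2018Q2,North,15
2018Q2,South,
"""


def load_rows(text):
    """Parse raw CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Return a list of human-readable validation issues (empty if none)."""
    issues = []
    # Data rows start on line 2 of the file, because line 1 is the header.
    for line_no, row in enumerate(rows, start=2):
        value = row["incidents"]
        if value == "":
            issues.append(f"line {line_no}: missing incidents value")
        elif not value.isdigit():
            issues.append(f"line {line_no}: non-numeric incidents value {value!r}")
    return issues


def build_table(rows):
    """Sum incidents by quarter, skipping rows without a valid count."""
    totals = defaultdict(int)
    for row in rows:
        if row["incidents"].isdigit():
            totals[row["quarter"]] += int(row["incidents"])
    return dict(sorted(totals.items()))


def render_highlight(table):
    """Produce a one-sentence highlight for the most recent quarter."""
    latest = max(table)
    return f"In {latest}, {table[latest]} incidents were recorded."


rows = load_rows(RAW_CSV)
validation_report = validate(rows)
table = build_table(rows)
highlight = render_highlight(table)
```

Because each stage is an ordinary function kept under version control, the same checks run every time new data arrive, and the validation report can grow as producers add further rules.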

In June 2018, the Department for Transport (DfT) published its RAP debut with the automation of the Search and Rescue Helicopter (SARH) statistical tables. This was closely followed by the publication of Quarterly traffic estimates (TRA25) produced by DfT’s first bespoke Road Traffic pipeline R package. RAP methods are now being adopted across the department, with other teams building on the code already written for these reports. DfT have begun a dedicated RAP User Group to act as a support network for colleagues interested in RAPping.

DfT’s RAP successes have benefited from the early work and community code sharing approach of other departments, including:

  • Department for Digital, Culture, Media & Sport first published statistics using a custom-made R package, eesectors, in late 2016, with the code itself made freely available on GitHub.
  • Department for Education first published automated statistical tables of initial teacher training census data in November 2016, followed by the automated statistical report of pupil absence in schools in May 2017. DfE are now in the process of rolling out the RAP approach across their statistics publications
  • Ministry of Justice, as well as automating their own reports, have made a huge contribution with the development of the R package xltabr, which can be used by RAPpers to easily format tables to meet presentation standards. xltabr has also been made available to all on the Comprehensive R Archive Network.

Combining data science coding skills with the traditional statistical production process, together with an online code sharing approach, increases collaboration, improves efficiency, and creates opportunities for government statisticians to provide further insights into their data.

Demonstrating transparency when linking and publishing data

This is a case study for Principle T6: Data governance.

The Scottish Government’s (SG) health and homelessness in Scotland project linked local authority data about homelessness between 2001 and 2016 with NHS data on hospital admissions, outpatient visits, prescriptions, drugs misuse, and National Records of Scotland information about deaths.

Transparency around the risk assessment process helps to demonstrate a producer’s Trustworthiness to users, suppliers and the public. One of the ways that SG were able to demonstrate this was by conducting and publishing their data privacy impact assessment alongside the main analysis report. SG also published the original application for the data, the public benefit and privacy panel application and the correspondence documenting its approval, and details of how to access the data. This approach is now standard practice for all SG publications based on linked data.

Since SG carried out this work, a new tool for risk assessment – the Data Protection Impact Assessment (DPIA) – has been introduced following the Data Protection Act 2018 (DPA), as a requirement of GDPR. DPIAs are mandatory where data are combined from multiple sources, and the Information Commissioner’s Office recommends that they are also conducted on a voluntary basis for any large-scale processing of personal data.

The accountability principle in the DPA requires organisations to have appropriate records in place to demonstrate compliance if required. Departments can meet the DPA accountability principle by conducting a DPIA, and publishing it helps to meet the Code’s requirements for transparency (provided that it is accessibly presented). It isn’t essential to publish a DPIA in full; a summary of the process and the lessons learnt would be sufficient to demonstrate transparency.

Another step producers can take to increase transparency is to publish details of all the data share requests made to them and their outcomes. SG publishes details of the data sharing requests submitted to its Statistics Data Access Panel on its website, which also includes details about past decisions made and the justifications for those decisions.

The Department for Education in England has also been publishing details of the data share requests and outcomes in relation to ad hoc National Pupil Data Sharing for several years. In December 2017, the Department for Education broadened the scope to cover all routine sharing of personal data, and it has recently consulted users about further changes to make this information easier to engage with and understand.

These examples show how statistics producers can demonstrate Trustworthiness by being transparent about how they manage data linkage and data shares, and how this transparency relates to some of the current legislation in this area.