Rapidly setting up an automated data collection

This is a case study for Principle V4: Innovation and improvement.

Prior to March 2020, the Department for Education (DfE) published termly and annual pupil absence data based on information provided to them through the school census, with a lag of around two terms. However, during the COVID-19 pandemic, school attendance became a key societal issue and there was a strong need for real-time data at a national level.

Initially, DfE introduced a form for schools and colleges in England to complete manually each day. Whilst this approach provided DfE with the key information that was needed, it placed a high burden on schools and so DfE explored options for automating the collection.

DfE rapidly set up a new system which automatically collects daily attendance data from schools. This method of data collection was revolutionary for the department and its stakeholders and, because it is automated, it created no additional burden for schools. Participation was voluntary at first and reached 90% of schools before the collection became mandatory at the start of the 2024/25 academic year.

The first outputs from these collections were published in September 2022 and have since been published fortnightly to meet user needs, which considerably reduces the lag. The information is presented in a bulletin and a dashboard. The figures relate to the attendance of 5-to-15-year-old pupils in state-funded primary, secondary and special schools in England, and include breakdowns for pupil groups.

This real-time automated collection has enabled policy makers in DfE to respond rapidly to arising issues, identify trends in attendance, and quickly understand and spread practice from areas showing improvements. For example, during the teacher strikes in 2023, DfE was able to produce rapid transparency data on the number of schools that were closed on strike days. Schools and local authorities are also able to use the attendance information operationally: they can monitor absence more efficiently, identify earlier the pupils who need support, and benchmark themselves against others, which saves time and enables earlier intervention.

In 2023, these statistics won the Royal Statistical Society (RSS) Campion Award for Excellence in Official Statistics. The RSS noted that “the judges considered this to be an example of agile, useful data provision and an exemplar for others to follow. They were also impressed with the efforts made to ensure transparency so the findings could be communicated to a broad audience, as well as the use of new administrative data.”

Archived: Automating statistical production to free up analytical resources

This is a case study for Principle V4: Innovation and improvement.

The Reproducible Analytical Pipeline (RAP) is an innovation initiated by the Government Digital Service (GDS) that combines techniques from academic research and software development. It aims to automate certain statistical production and publication processes, specifically the narrative, highlights, graphs and tables. Tailor-made functions work raw data up into a statistical release, freeing up resource for further analysis. The benefits of RAP include:

  • Auditability – the RAP method provides a permanent record of the process used to create the report. Moreover, by using Git for version control, producers have access to all previous iterations of the code. This aids transparency, and the process itself can easily be published
  • Speed – it is quick and easy to update or reproduce the report, and producers can implement small changes across multiple outputs simultaneously. The statistician, now free from doing repetitive tasks, has more time to exercise their analytical skills
  • Quality – Producers can build automated validation into the pipeline and produce a validation report, which can be continually augmented. Statisticians can therefore perform more robust quality assurance than would be possible by hand in the timeframe from receiving data to publication.
  • Knowledge transfer – all the information about how the report is produced is embedded in the code and documentation, making handover simple
  • Upskill – RAP is an opportunity to upskill individuals by giving them the opportunity to learn new skills or develop existing ones. This also upskills teams by making use of underused coding skills that may exist within their resource; coding skills are becoming ubiquitous nowadays with many STEM subject students learning to code at university
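
The idea behind the Speed and Quality points above can be illustrated with a minimal sketch, which assumes a small absence dataset held as a list of records. It is written in Python for illustration only (the departmental pipelines described in this case study are R packages), and the field names (region, sessions_missed, sessions_possible) and function names are hypothetical rather than taken from any department's code. It shows raw data passing through automated validation before tailor-made functions produce the table and a highlight sentence.

```python
from dataclasses import dataclass


@dataclass
class ValidationIssue:
    """One problem found in the raw data (hypothetical structure)."""
    row: int
    message: str


def validate(rows):
    """Return a list of issues found in the raw rows (empty if all pass)."""
    issues = []
    for i, row in enumerate(rows):
        if row["sessions_possible"] <= 0:
            issues.append(ValidationIssue(i, "sessions_possible must be positive"))
        elif not 0 <= row["sessions_missed"] <= row["sessions_possible"]:
            issues.append(ValidationIssue(i, "sessions_missed out of range"))
    return issues


def summarise(rows):
    """Aggregate raw rows into an absence rate (%) per region."""
    totals = {}
    for row in rows:
        missed, possible = totals.get(row["region"], (0, 0))
        totals[row["region"]] = (
            missed + row["sessions_missed"],
            possible + row["sessions_possible"],
        )
    return {
        region: round(100 * missed / possible, 1)
        for region, (missed, possible) in sorted(totals.items())
    }


def render_release(rows):
    """Run validation, then build the table and highlight text."""
    issues = validate(rows)
    if issues:
        # Stop the pipeline rather than publish figures from bad data.
        details = "; ".join(f"row {i.row}: {i.message}" for i in issues)
        raise ValueError(f"validation failed: {details}")
    table = summarise(rows)
    lines = ["Absence rate by region (%)"]
    lines += [f"{region}: {rate}" for region, rate in table.items()]
    highest = max(table, key=table.get)
    lines.append(
        f"Highlight: {highest} had the highest absence rate ({table[highest]}%)."
    )
    return "\n".join(lines)
```

Calling render_release on a new data delivery regenerates the whole output in one step, and each new check added to validate is applied automatically to every later run.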

RAP therefore enables departments to develop and share high-quality reusable components of their statistics processes. This ‘reusability’ enables increased collaboration, greater consistency and quality across government, and reduced duplication of effort.

In June 2018, the Department for Transport (DfT) published its RAP debut with the automation of the Search and Rescue Helicopter (SARH) statistical tables. This was closely followed by the publication of Quarterly traffic estimates (TRA25) produced by DfT’s first bespoke Road Traffic pipeline R package. RAP methods are now being adopted across the department, with other teams building on the code already written for these reports. DfT have begun a dedicated RAP User Group to act as a support network for colleagues interested in RAPping.

DfT’s RAP successes have benefited from the early work and community code sharing approach of other departments, including:

  • Department for Digital, Culture, Media & Sport first published statistics using a custom-made R package, eesectors, in late 2016, with the code itself made freely available on GitHub.
  • Department for Education first published automated statistical tables of initial teacher training census data in November 2016, followed by the automated statistical report of pupil absence in schools in May 2017. DfE are now in the process of rolling out the RAP approach across their statistics publications
  • Ministry of Justice, as well as automating their own reports, have made a huge contribution with the development of the R package xltabr, which can be used by RAPpers to easily format tables to meet presentation standards. xltabr has also been made available to all on the Comprehensive R Archive Network.

The incorporation of data science coding skills into the traditional statistical production process, coupled with an online code sharing approach, lends itself to increased collaboration and improved efficiency, and creates opportunities for government statisticians to provide further insights into their data.

Demonstrating transparency when linking and publishing data

This is a case study for Principle T6: Data governance.

The Scottish Government’s (SG) health and homelessness in Scotland project linked local authority data about homelessness between 2001 and 2016 with NHS data on hospital admissions, outpatient visits, prescriptions, drugs misuse, and National Records of Scotland information about deaths.

Transparency around the risk assessment process helps to demonstrate a producer’s Trustworthiness to users, suppliers and the public. One of the ways in which SG demonstrated this was by conducting and publishing their data privacy impact assessment alongside the main analysis report. SG also published the original application for the data, the public benefit and privacy panel application and the correspondence documenting its approval, and details of how to access the data. This approach is now standard practice for all SG publications based on linked data.

Since SG carried out this work, a new tool for risk assessment, the Data Protection Impact Assessment (DPIA), has been introduced following the 2018 Data Protection Act (DPA), as a requirement of GDPR. DPIAs are mandatory where data are combined from multiple sources, and the Information Commissioner’s Office recommends they are also conducted on a voluntary basis for any large-scale processing of personal data.

The accountability principle in the DPA requires organisations to have appropriate records in place to demonstrate compliance if required. Departments can meet the DPA accountability principle by conducting DPIAs, and publishing them helps to meet the Code’s requirements for transparency (providing that they are accessibly presented). It isn’t essential to publish a DPIA in full; a summary of the process and the lessons learnt would be sufficient to demonstrate transparency.

Another step producers can take to increase transparency is to publish details of all the data share requests made to them and their outcomes. SG publishes details of the data sharing requests submitted to its Statistics Data Access Panel on its website, which also includes details about past decisions made and the justifications for those decisions.

The Department for Education in England has also been publishing details of the data share requests and outcomes in relation to ad hoc National Pupil Data Sharing for several years. In December 2017, the Department for Education broadened the scope to cover all routine sharing of personal data, and it has recently consulted users about further changes to make this information easier to engage with and understand.

These examples show how Trustworthiness can be demonstrated by statistics producers being transparent about their approaches to the management of the data linkage process and data shares, and their relevance to some of the current legislation in this area.