Automating statistical production to free up analytical resources

This is a case study for Principle V4: Innovation and improvement.

The Reproducible Analytical Pipeline (RAP) is an innovation initiated by the Government Digital Service (GDS) that combines techniques from academic research and software development. It aims to automate certain statistical production and publication processes – specifically, the narrative, highlights, graphs and tables. Tailor-made functions transform raw data into a statistical release, freeing up resources for further analysis (a minimal sketch of this approach follows the list below). The benefits of RAP are laid out in the link above, and include:

  • Auditability – the RAP method provides a permanent record of the process used to create the report; moreover, by using Git for version control, producers have access to all previous iterations of the code. This aids transparency, and the process itself can easily be published
  • Speed – it is quick and easy to update or reproduce the report, and producers can implement small changes across multiple outputs simultaneously. The statistician, now free from repetitive tasks, has more time to exercise their analytical skills
  • Quality – Producers can build automated validation into the pipeline and produce a validation report, which can be continually augmented. Statisticians can therefore perform more robust quality assurance than would be possible by hand in the timeframe from receiving data to publication.
  • Knowledge transfer – all the information about how the report is produced is embedded in the code and documentation, making handover simple
  • Upskilling – RAP gives individuals the chance to learn new skills or develop existing ones. It also upskills teams by drawing on underused coding skills that may already exist within their resource; coding skills are becoming ubiquitous, with many STEM students learning to code at university
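
As an illustration of the approach, the sketch below shows how a tailor-made function might validate raw data and work it up into a publication-ready table. It is a minimal example only: the function and column names are hypothetical and are not drawn from any department's actual pipeline.

    library(dplyr)

    # Validate the raw data before it enters the pipeline; a failed check
    # halts the run, so errors cannot silently reach the publication.
    # (Column names here are hypothetical.)
    validate_raw <- function(raw) {
      stopifnot(
        "expected columns missing" = all(c("year", "region", "value") %in% names(raw)),
        "negative values found"    = all(raw$value >= 0, na.rm = TRUE)
      )
      raw
    }

    # Work the validated data up into a publication-ready summary table
    build_summary_table <- function(raw) {
      raw %>%
        validate_raw() %>%
        group_by(year, region) %>%
        summarise(total = sum(value, na.rm = TRUE), .groups = "drop")
    }

In a full pipeline, functions like these are typically called from an R Markdown document so that the narrative, tables and graphs are all regenerated together from the raw data.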

RAP therefore enables departments to develop and share high-quality reusable components of their statistics processes. This ‘reusability’ enables increased collaboration, greater consistency and quality across government, and reduced duplication of effort.

In June 2018, the Department for Transport (DfT) published its RAP debut with the automation of the Search and Rescue Helicopter (SARH) statistical tables. This was closely followed by the publication of Quarterly traffic estimates (TRA25), produced by DfT’s first bespoke Road Traffic pipeline R package. RAP methods are now being adopted across the department, with other teams building on the code already written for these reports. DfT has set up a dedicated RAP User Group to act as a support network for colleagues interested in RAPping.

DfT’s RAP successes have benefited from the early work and community code sharing approach of other departments, including:

  • Department for Digital, Culture, Media & Sport first published statistics using a custom-made R package, eesectors, in late 2016, with the code itself made freely available on GitHub.
  • Department for Education first published automated statistical tables of initial teacher training census data in November 2016, followed by the automated statistical report of pupil absence in schools in May 2017. DfE are now rolling out the RAP approach across their statistics publications.
  • Ministry of Justice, as well as automating their own reports, have made a huge contribution with the development of the R package xltabr, which RAPpers can use to easily format tables to meet presentation standards. xltabr has also been made available to all on the Comprehensive R Archive Network (a brief usage sketch follows this list).
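
The sketch below illustrates the kind of call xltabr supports: building a styled workbook from a cross-tabulation and saving it with openxlsx. The data and file name are invented for illustration; consult the package documentation for the exact interface.

    library(xltabr)

    # An illustrative cross-tabulation (invented figures)
    crosstab <- data.frame(region     = c("North", "South"),
                           trips_2017 = c(120, 340),
                           trips_2018 = c(131, 355))

    # Build a workbook with presentation-standard formatting applied
    wb <- xltabr::auto_crosstab_to_wb(crosstab)

    # Write the formatted table out to Excel
    openxlsx::saveWorkbook(wb, "formatted_table.xlsx", overwrite = TRUE)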

Incorporating data science coding skills into the traditional statistical production process, coupled with an online code-sharing approach, lends itself to increased collaboration and improved efficiency, and creates opportunities for government statisticians to provide further insights into their data.

Developing statisticians’ coding skills to meet future organisational needs

This is a case study for Principle T5: Professional capability.

The Department for Transport (DfT) has been upskilling its analysts to facilitate the adoption of data science methods in the department. To help with this, DfT has established weekly Coffee & Coding sessions and bespoke R coding workshops, building on successful models used in the Department for Education and the Department for Business, Energy and Industrial Strategy.

Coffee & Coding sessions aim to nurture and encourage a vibrant, supportive and inclusive coding community. They provide a regular opportunity for people to share coding skills, knowledge and advice, and to network and get to know each other. The format is usually a presentation followed by a Code Surgery. Presentations usually demonstrate a tool or technique and/or a show and tell of new work done within the department. Code Surgeries allow people to raise coding queries or ideas with the coding community; there is no such thing as a silly question and it is understood that the quest for knowledge necessarily includes failure.

The R workshops are a suite of sessions designed to train DfT’s statisticians in the basics of R coding. They are mainly based around tidyverse R packages to maintain consistent standards, and include topics such as data wrangling with dplyr, graphing with ggplot2, and report automation with rmarkdown. DfT’s first cohort graduated in late 2018 and the second is due to start in early 2019.
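
A minimal sketch of the style of exercise such workshops typically cover (this example uses ggplot2’s built-in mpg dataset and is not actual DfT course material):

    library(dplyr)
    library(ggplot2)

    # Data wrangling with dplyr: summarise highway fuel economy by class
    summary_df <- mpg %>%
      group_by(class) %>%
      summarise(avg_hwy = mean(hwy), .groups = "drop")

    # Graphing with ggplot2: plot the summarised data
    ggplot(summary_df, aes(x = class, y = avg_hwy)) +
      geom_col() +
      labs(x = "Vehicle class", y = "Average highway miles per gallon")

Chunks like these can also be embedded in an R Markdown document so that report text, tables and charts are regenerated together, which is the basis of the report-automation topic.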

DfT runs a mentorship programme (akin to the GDS Data Science Accelerator) to provide support to those taking on data science projects using a new tool or method. DfT expects that eventually there will be enough coders in the department that statistical coding advice will be as easy to source as advice on using Excel.

A big part of DfT’s approach is to encourage people to share knowledge, so that pioneers trying methods for the first time generate resources for others to use and adapt. GitHub has become central to this process – DfT uses it to share code, host materials from DfT’s weekly coding meetings, and signpost useful resources online. DfT has also developed coding standards that specify DfT’s minimum requirements for ‘good code’ without burdening the developer with lots of extra work. For example, DfT requires that the master version of a script is not edited without going through a code review, and encourages the use of automated testing (Continuous Integration) tools. The document is community edited, so standards can evolve as needed.
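
As an illustration of the automated-testing requirement, the sketch below shows a unit test written with the testthat package; a Continuous Integration service can run such tests automatically on every change. The function under test is invented for the example.

    library(testthat)

    # A deliberately simple function to test (illustrative only)
    add_vat <- function(x) {
      stopifnot(is.numeric(x))
      x * 1.2
    }

    # Tests like these run automatically under Continuous Integration,
    # so a broken change is caught before it reaches the master version
    test_that("add_vat applies the 20% rate and rejects bad input", {
      expect_equal(add_vat(100), 120)
      expect_error(add_vat("one hundred"))
    })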

DfT encourages analysts to write code in consistent ways and to follow a style guide. R and Python have proved popular language choices for data analysis, but there are also style differences within each language. For this reason, DfT’s coding standards suggest default packages, and the R workshops teach a consistent coding style, encouraging developers to use the tidyverse syntax. This means that a relatively new coder only has to learn one syntax style to be able to interpret typical code across the department.

DfT collaborates closely with its Digital Services team to ensure that the core software development tools work: analysts can install packages for Python and R, use Git to version control their code, and use dependency management tools like packrat.
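
For instance, packrat gives each project its own package library, so an analysis keeps working even as package versions move on. A minimal sketch (the project path is illustrative):

    # Create a project-local package library (run once per project)
    packrat::init("~/road_traffic_pipeline")

    # Record the exact package versions the project currently uses
    packrat::snapshot()

    # Recreate that library on another machine, or for a colleague
    packrat::restore()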

This example shows how DfT staff are provided with the time and resources required to develop new coding skills, knowledge and competencies to meet DfT’s future organisational needs and how DfT is developing new quality strategies and standards.

Innovating across the production and dissemination process

This is a case study for Principle V4: Innovation and improvement.

The National Travel Survey (NTS) team at the Department for Transport (DfT) has implemented a series of innovations and improvements during 2018/19. Some of these have been simple to implement but have a significant impact, while others have provided opportunities for the team to learn new skills that will provide long-term quality and efficiency benefits, for example, learning to use RStudio to automate data processing methods.

Making efficiencies has freed up analytical resource to make improvements in other areas, leading to a positive snowball effect. A user-first approach has been adopted, with every innovation aimed at further meeting users’ needs.

Recent NTS innovations and improvements include:

  • Improving the NTS questionnaire following a feedback exercise that checked the relevance of NTS questions and the burden placed on respondents. Because users still require many existing NTS questions, questions are rotated so that they are asked every other year, making space for new topics without extending the survey length. New questions undergo extensive cognitive and panel testing to ensure participants understand them and that they collect the data users want
  • Setting up an innovative NTS Panel, consisting of NTS participants who agree to be contacted for follow-up research. This allows additional, smaller pieces of research to be conducted while not making the full NTS interview longer. The panel can target a sub-section of the population (e.g. people who cycle) where it would be disproportionately burdensome to ask everyone in the full NTS. Panel responses can also be linked back to original NTS responses, to greatly enhance the utility of the data
  • Collaborating with other analysts, including those outside of Government, to produce NTS analytical reports, demonstrating the breadth of information available in the NTS. By making the dataset accessible via the UK Data Service, and the ONS Secure Research Service, far more analysis can be undertaken than could be done by the NTS team alone
  • Running advance letter and incentive experiments to investigate how to boost response rates
  • Making methodological improvements to collect walking data more accurately
  • Conducting a Discovery to explore whether developing a digital NTS diary could reduce respondent burden and increase data quality
  • Designing interactive tables and revising the data table categories so that it is easier for users to find the data they are searching for on GOV.UK
  • Publishing ad-hoc analyses, so they are accessible to all and enable the reuse of NTS data
  • Using RStudio to provide regular standard errors and confidence intervals for NTS statistics and ad-hoc analyses (a minimal sketch follows this list)
  • Producing a user-friendly quality report to inform users about the quality of the NTS data, including sampling, methodology, quality assurance procedures and confidentiality
  • Making efficiency improvements to NTS data processing methods to greatly increase levels of automation using R, SQL and more advanced Excel functions
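
As an illustration of producing standard errors and confidence intervals in R, the sketch below uses the survey package, which handles weighted survey designs. The variables, weights and design are invented for the example and do not reflect the actual NTS sample design.

    library(survey)

    # Invented survey microdata (not real NTS data)
    set.seed(1)
    df <- data.frame(psu    = rep(1:50, each = 4),
                     weight = runif(200, 0.5, 2),
                     trips  = rpois(200, 10))

    # Declare the sample design: clusters and weights
    design <- svydesign(ids = ~psu, weights = ~weight, data = df)

    # Weighted mean of trips with its standard error...
    est <- svymean(~trips, design)
    est

    # ...and a 95% confidence interval
    confint(est, level = 0.95)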

These improvements have led to increased engagement with a range of NTS stakeholders:

  • The publication of ad-hoc tables has drawn interest from academics and transport planners who have used the data as the basis for conducting further analysis in collaboration with DfT
  • The analytical reports produced in collaboration with external authors have provided a fresh look at what the NTS can provide and received mainstream and specialist press coverage
  • The NTS Panel has resulted in new demand from policy teams, with the team now looking forward to exploring these new research topics

The team is also testing the use of MailChimp as a new way to keep users up to date with NTS statistics and developments through a regular newsletter. The team hopes that this will increase its engagement with NTS users even further.

This example shows how the NTS team keeps up to date with developments that might improve NTS statistics for users, is transparent about its forthcoming development plans, and engages with users to get their feedback on plans to better meet their needs. It also shows how the NTS team collaborates with expert analysts to enhance value and insight, creates efficiencies by innovating methods and quality processes, and seeks to improve users’ experience by finding new ways to engage with them and enhancing the range of statistics that it makes available.