Reproducible Health Data Science

The development of technology to generate and analyse large, quantitative health datasets provides unprecedented opportunity to improve public health and clinical medical practices. Unfortunately, the analytical complexity of these datasets accompanied by poor data science practices has contributed to a reproducibility crisis. There is now a growing movement to improve practice in data science by integrating concepts, tools and strategies that are widely used in software engineering. This hands-on course will introduce participants to how these essential practices are now being used to improve reproducibility in research. Each practice, skill and tool will be introduced in lectures delivered by active data scientists at the MRC Integrative Epidemiology Unit (IEU) and the University of Bristol with a strong track record of both impactful and reproducible research. The course aims to demonstrate that reproducible research practices are accessible, will save time in the longer term, and will improve t

Date 29 - 30 January & 2 - 3 February 2026
Fee £1000
Format Online
Audience Open to all applicants (prerequisites apply)

Course profile

Essential pre-course computer setup (1-2 hours): Prior to the start of the course, participants will be given video instructions on how to set up their computers to be able to work through the course materials. It is essential that these steps are performed prior to the start of the course. There will be a pre-course online drop-in session to help any participants who encounter problems that prevent them from completing these steps.

Main course: The 4 days of the online course will consist of a variety of learning activities set by tutors. Concepts, skills and tools will be demonstrated by a mix of live and pre-recorded videos and detailed online instructions. Participants will put theory into practice by setting up tools on their own computers, connecting to and using remote servers, and developing a realistic project over the duration of the course.  

Participants will be given a simple but real-world project comprising a quantitative health-informatics dataset and some ‘baseline’ analytical code written in R and Python. Participants are not expected to have specific scientific experience of this example data and analysis. Over the duration of the course, participants will learn new reproducibility concepts, skills and tools for tracking script versions, managing software dependencies, generating dynamic documents, constructing and maintaining pipelines, compiling packages, and more. Participants will apply these to the baseline project with guidance to make it more reproducible. By the end of the course, they will have a final working version that can be re-used as a template for their real-world projects.  

All teaching will be conducted online using Blackboard and Blackboard Collaborate.  

Over 4 days, this online course will consist of a variety of learning activities set by tutors. Skills and tool use will be demonstrated by a mix of live and pre-recorded videos and detailed online instructions. Participants will put theory into practice by applying what they learn within one of their personal research projects. Practice data will be provided for participants who do not have a personal project or who choose not to work on one.

By the end of the course participants should be able to: 

  1. discuss the importance of reproducibility in health data science; 
  2. set up computers to securely and conveniently interact with different programming languages, filesystems and remote servers;
  3. understand principles of good project organisation using simple but scalable structures;
  4. understand different strategies towards reproducible software environments; 
  5. integrate version control and pipelining systems into daily practice;
  6. apply techniques to make analytical code more readable and reliable;
  7. create robust distributable software packages;
  8. understand how to integrate code review into scientific practice;
  9. describe approaches to reproducibility within trusted research environments;
  10. integrate dynamic documents into projects to create automated, sharable and expressive research outputs;
  11. handle and share data with minimum risk of loss or security violations. 

The course is intended for those who analyse health data and would like to learn how to improve the reproducibility of their work. It is an introductory to intermediate course. It does not include statistical instruction, and it does not cover practices specific to qualitative data analysis. For the course project, we’ll be using observational quantitative cohort data from The Cancer Genome Atlas, though participants will not be required to have any specific knowledge about these data or their underlying scientific aspects.

If your work involves a) obtaining or creating quantitative health data, b) performing complex analyses on those data independently or in collaboration with others, and c) creating scientific outputs to present those analyses, then this course is for you.  

This course will cover: 

  1. The importance of and pitfalls preventing reproducibility
  2. Setting up your compute environment using SSH keys, modern coding environments (VS Code) and relevant plugins
  3. Integrating version control into daily practice using Git and GitHub
  4. Containerization (Apptainer, Docker) and other approaches to software management (conda, mamba, renv)
  5. Transparent, portable and scalable organisation of project files and data (including config files)
  6. Techniques to improve readability and reliability using code review, code linting, unit testing
  7. Using Quarto and Jupyter to create dynamic and shareable analysis reports
  8. Creating R packages that adhere to standards to share analytical functions and processes
  9. Pipelining tools to formalise the documentation and running of more complex pipelines (Snakemake)
  10. Reproducibility within trusted research environments 
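
As a rough flavour of how a few of the topics above fit together (settings kept in a config, seeded randomness, and a unit test), here is a minimal Python sketch. It is not taken from the course materials: the `CONFIG` values and the function names `simulate_measurements`, `summarise` and `fingerprint` are hypothetical, and the data are synthetic.

```python
import hashlib
import json
import random
import statistics

# Hypothetical analysis settings; in a real project these would typically
# live in a version-controlled config file rather than inside a script.
CONFIG = {"seed": 2026, "n_samples": 200, "mean": 5.0, "sd": 1.5}


def simulate_measurements(config):
    """Draw synthetic measurements from a seeded, local random generator."""
    rng = random.Random(config["seed"])  # avoids relying on global state
    return [
        rng.gauss(config["mean"], config["sd"])
        for _ in range(config["n_samples"])
    ]


def summarise(values):
    """Return simple summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": round(statistics.fmean(values), 6),
        "sd": round(statistics.stdev(values), 6),
    }


def fingerprint(result):
    """Hash a result dictionary so two runs can be compared at a glance."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def test_analysis_is_reproducible():
    """Unit test: the same config must give the same result on every run."""
    first = summarise(simulate_measurements(CONFIG))
    second = summarise(simulate_measurements(CONFIG))
    assert first == second
    assert fingerprint(first) == fingerprint(second)
    assert first["n"] == CONFIG["n_samples"]


if __name__ == "__main__":
    test_analysis_is_reproducible()
    print(summarise(simulate_measurements(CONFIG)))
```

A test like this could be collected by a test runner such as pytest, or run directly as a script. The point is only that an analysis whose inputs and random seed are recorded can be re-run and checked automatically.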

All tutors are active researchers leading projects that involve the analysis of large health datasets in the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol. Several have academic backgrounds in data science or related fields.  

To make sure the course is suitable for you and you will benefit from attending, please ensure you meet the following prerequisites before booking:

Conditions

Attendance is monitored.  

Pre-course activities must be completed prior to the start of the course to avoid delays and disruption for other participants.

Knowledge

Participants should have some experience handling health data, and writing and running scripts that analyse that data.  

Participants should have experience running basic Linux commands to navigate between directories and to do basic file management.

Expertise is not required in any specific programming language; however, demonstrations will tend to focus on R and Python.

Participants are not expected to have any specific scientific or statistical expertise. The example project for the course will run some basic linear models, but understanding these is not essential. 

Software

Participants will carry out essential practical activities on their own computers. Software installation instructions will be provided prior to the course along with a short drop-in session for advice. 

Computers running Windows 11, macOS 13+ or Linux (e.g. Ubuntu 22.04+) can be supported on this course.

You may require elevated privileges to install some basic software on your computer such as VS Code, SSH agent (Windows), Xcode command line tools (Mac), Git for Windows including Git Bash (Windows). 

Recommendation

Access to two screens will be useful for practical sessions where one screen can be used to view instructions and the other to carry out instructions and view outputs.

Before booking this course, please make sure you read the information provided above about the target audience and prerequisites. It is important that you have access to the relevant IT resources needed for the course and meet the knowledge prerequisites to ensure you can get the most from the course.

Bookings are taken via our online booking system, for which you must register an account. To check if you are eligible for free or discounted courses please see our fees and voucher packs page. All bookings are subject to our terms & conditions, which can be read in full here.

For help and support with booking a course, refer to our booking information page or FAQs, or feel free to contact us directly. For available payment options please see: How to pay your short course fees.

Bookings close two weeks before the start of each course. Once all courses have finished for the current academic year, we close the booking system for updates and re-open it in the Autumn. To be notified about our timescales for opening annual registrations and bookings, sign up to our mailing list.

Participants are granted access to our virtual learning platform (Blackboard) 1 to 2 weeks in advance of the course. This allows time to complete any pre-course work and to become familiar with the platform.

To gain the most from the course, we recommend that you attend in full and participate in all interactive components. We endeavour to record all live lecture sessions and upload these to the online learning environment within 24 hours. This allows course participants to review these sessions at leisure and revisit them multiple times. Please note that we do not record breakout sessions.

All course participants retain access to the online learning materials and recordings for 3 months after the course. 

University of Bristol staff and postgraduate students who do not wish to attend the full course may instead register for access to the 'Materials & Recordings' version of this course: Further information and bookings.

100% of attendees recommend this course*.
*Attendee feedback from March 2025.

Here is a sample of feedback from the last run of the course:

“The range of topics taught in the course covered most of the basic needs do [sic] perform reproducible data science. All the topics were well explained by the tutors, they also spoke about their experiences and practices which was helpful. Most importantly, practical material provided in the course was well built and provided a realistic scenario to work with." - course feedback, March 2025

“Walkthrough videos were super helpful, as were the explanations [sic] of Snakemate and Containers and Packages, including when they are useful. The github repo, wiki and branches for different stages are were also great for the practicals. Troubleshooping also v. helpful." - course feedback, March 2025

“The course offered rich content, covering nearly every aspect of achieving reproducibility. The group of tutors was exceptional, and it was helpful to know which staff members had extensive experience in each aspect. The recorded practical videos were incredibly useful, ensuring I didn’t miss any commands. Thank you, Gib! Separating the troubleshooting room facilitated a balance between attendees with different skill levels." - course feedback, March 2025

“The contents of this course are very interesting and helpful - and it is clear throughout that everyone is very knowledgeable on the subject matter." - course feedback, March 2025

“Realy [sic] great content- ie [sic] I can imagine almost all of the tools covered being useful to our team. The wiki was excellent too- really essential to have this to follow easily when one's brain is feeling fried by new learning. The tutors were really great and did an amazing job." - course feedback, March 2025

“Overall, it was a good course, especially beneficial for those interested in coding. All the tutors were amazing." - course feedback, March 2025

“The subject area was really well introduced giving a good background for the importance of reproducable [sic] health data science. The speakers were clear and there were plenty of moderators at hand for the break out sessions for technical support. They answered questions and helped the course attendees." - course feedback, March 2025

“There was a lot of new information for me, and the course highlighted good practices for a professional data scientist. I also gained valuable insights simply by observing the tutors as they coded." - course feedback, March 2025