Reproducible Health Data Science

The development of technology to generate and analyse large, quantitative health datasets provides an unprecedented opportunity to improve public health and clinical practice. Unfortunately, the analytical complexity of these datasets, combined with poor data science practices, has contributed to a reproducibility crisis. There is now a growing movement to improve practice in data science by integrating concepts, tools and strategies that are widely used in software engineering. This hands-on course will introduce participants to how these essential practices are now being used to improve reproducibility in research. Each practice, skill and tool will be introduced in lectures delivered by active data scientists at the MRC Integrative Epidemiology Unit (IEU) and the University of Bristol who have a strong track record of both impactful and reproducible research.

Date 29 - 30 January & 2 - 3 February 2026
Fee £1000
Format Online
Audience Open to all applicants (prerequisites apply)

Course profile

Essential pre-course computer setup (1-2 hours): Prior to the start of the course, participants will be given video instructions on how to set up their computers to work through the course materials. It is essential that these steps are completed before the course begins. There will be a pre-course online drop-in session to help any participants who encounter problems completing these steps.

Main course: The 4 days of the online course will consist of a variety of learning activities set by tutors. Concepts, skills and tools will be demonstrated by a mix of live and pre-recorded videos and detailed online instructions. Participants will put theory into practice by setting up tools on their own computers, connecting to and using remote servers, and developing a realistic project over the duration of the course.  

Participants will be given a simple but real-world project comprising a quantitative health-informatics dataset and some ‘baseline’ analytical code written in R and Python. Participants are not expected to have specific scientific experience of this example data and analysis. Over the duration of the course, participants will learn new reproducibility concepts, skills and tools for tracking script versions, managing software dependencies, generating dynamic documents, constructing and maintaining pipelines, compiling packages, and more. Participants will apply these to the baseline project with guidance to make it more reproducible. By the end of the course, they will have a final working version that can be re-used as a template for their real-world projects.  

All teaching will be conducted online using Blackboard and Blackboard Collaborate.  

Over 4 days, this online course will consist of a variety of learning activities set by tutors. Skills and tool use will be demonstrated by a mix of live and pre-recorded videos and detailed online instructions. Participants will put theory into practice by applying what they learn within one of their personal research projects. Practice data will be provided to participants who do not have a personal project or choose not to work on one. All teaching will be conducted online using Blackboard and Blackboard Collaborate.

By the end of the course participants should be able to: 

  1. discuss the importance of reproducibility in health data science; 
  2. set up computers to securely and conveniently interact with different programming languages, filesystems and remote servers;
  3. understand principles of good project organisation using simple but scalable structures;
  4. understand different strategies towards reproducible software environments; 
  5. integrate version control and pipelining systems into daily practice;
  6. apply techniques to make analytical code more readable and reliable;
  7. create robust distributable software packages;
  8. understand how to integrate code review into scientific practice;
  9. describe approaches to reproducibility within trusted research environments;
  10. integrate dynamic documents into projects to create automated, sharable and expressive research outputs;
  11. handle and share data with minimum risk of loss or security violations. 

The course is intended for those who analyse health data and would like to learn how to improve the reproducibility of their work. It is an introductory-to-intermediate course. It does not include statistical instruction, and it does not cover practices specific to qualitative data analysis. For the course project, we'll be using observational quantitative cohort data from The Cancer Genome Atlas, though participants will not be required to have any specific knowledge about these data or their underlying scientific aspects.

If your work involves a) obtaining or creating quantitative health data, b) performing complex analyses on those data independently or in collaboration with others, and c) creating scientific outputs to present those analyses, then this course is for you.  

This course will cover: 

  1. The importance of reproducibility and the pitfalls that prevent it
  2. Setting up your compute environment using SSH keys, modern coding environments (VS Code) and relevant plugins
  3. Integrating version control into daily practice using git and GitHub
  4. Containerization (Apptainer, Docker) and other approaches to software management (conda, mamba, renv)
  5. Transparent, portable and scalable organisation of project files and data (including config files)
  6. Techniques to improve readability and reliability using code review, code linting and unit testing
  7. Using Quarto and Jupyter to create dynamic and shareable analysis reports
  8. Creating R packages that adhere to standards to share analytical functions and processes
  9. Pipelining tools to formalise the documentation and running of more complex pipelines (Snakemake)
  10. Reproducibility within trusted research environments 

All tutors are active researchers leading projects that involve the analysis of large health datasets in the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol. Several have academic backgrounds in data science or related fields.  

To make sure the course is suitable for you and you will benefit from attending, please ensure you meet the following prerequisites before booking:

Conditions

Attendance is monitored.  

Pre-course activities must be completed prior to the start of the course to avoid delays and disruption for other participants.

Knowledge

Participants should have some experience handling health data, and writing and running scripts that analyse that data.  

Participants should have experience running basic Linux commands to navigate between directories and perform basic file management.

Expertise in a specific programming language is not required; however, demonstrations will tend to focus on R and Python.

Participants are not expected to have any specific scientific or statistical expertise. The example project for the course will run some basic linear models, but understanding these is not essential. 

Software

Participants will carry out essential practical activities on their own computers. Software installation instructions will be provided prior to the course along with a short drop-in session for advice. 

Computers running Windows 11, macOS 13+ or Linux (e.g. Ubuntu 22.04+) can be supported on this course.

You may require elevated privileges to install some basic software on your computer such as VS Code, SSH agent (Windows), Xcode command line tools (Mac), Git for Windows including Git Bash (Windows). 

Recommendation

Access to two screens will be useful for practical sessions where one screen can be used to view instructions and the other to carry out instructions and view outputs.

Before booking this course, please make sure you read the information provided above about the target audience and prerequisites. It is important that you have access to the relevant IT resources needed for the course and meet the knowledge prerequisites to ensure you can get the most from the course.

Bookings are taken via our online booking system, for which you must register an account. To check if you are eligible for free or discounted courses please see our fees and voucher packs page. All bookings are subject to our terms & conditions, which can be read in full here.

For help and support with booking a course, refer to our booking information page and FAQs, or feel free to contact us directly. For available payment options please see: How to pay your short course fees.

Bookings close two weeks before the start of each course. Once all courses have finished for the current academic year, we close the booking system for updates and re-open it in the Autumn. To be notified about our timescales for opening annual registrations and bookings, sign up to our mailing list.

Participants are granted access to our virtual learning platform (Blackboard Ultra) 1 to 2 weeks in advance of the course. This allows time to complete any pre-course work and to become familiar with the platform.

To gain the most from the course, we recommend that you attend in full and participate in all interactive components. We endeavour to record all live lecture sessions and upload these to the online learning environment within 24 hours. This allows course participants to review these sessions at leisure and revisit them multiple times. Please note that we do not record breakout sessions.

All course participants retain access to the online learning materials and recordings for 5 months after the course. 

University of Bristol staff and postgraduate students who do not wish to attend the full course may instead register for access to the 'Materials & Recordings' version of this course: Further information and bookings.

100% of attendees recommend this course*.
*Attendee feedback from January 2026.

Here is a sample of feedback from the last run of the course:

“This is an interesting course and highly relevant to the current research environment.” - Course feedback, January 2026

“Enjoyed the practical element - and the way in which we all had to work on these projects ourselves. Very knowledgeable and engaging organisers/moderators.” - Course feedback, January 2026

“I didn't know about VS code and quarto. I'll definitely use these tools in the future. I also liked it that we learnt how to use jupyter notebooks. I thought they were very difficult to use but actually, they're not! Sometimes we're just too afraid to learn new things on our own. The course made it less daunting to try new software and code.” - Course feedback, January 2026

“I found the content of the course very interesting and it matched to the description of the course. I felt there was a good balance between practicals and teaching sessions.” - Course feedback, January 2026

“I thought each concept was introduced and shown very well. The use of a server similar to real life work places was also very appreciated and helpful. I learnt a lot and would recommend to anybody.” - Course feedback, January 2026

“The practicals were well organised. I enjoyed that you could work through the exercises on your own, and then ask questions / discuss issues in breakout rooms when needed.” - Course feedback, January 2026

“The pre-recorded walk-throughs for the practicals were very useful (being able to pause to follow at your own pace), and it's great that most of them were narrated - easier to follow what's going on. The memes were great, and so was the academic content and explanations of the process of performing different tasks and always proving rationale for specific good practices.” - Course feedback, January 2026

“This course covered a huge range of topics in a short time. The lecturers were hugely knowledgeable and supportive during the practicals. The practicals were a great opportunity an well integrated across the course.” - Course feedback, January 2026