The following are definitions of the terms we've discussed plus a few extras.

Accuracy: How close, on average, the sample statistic is to the population parameter that it estimates (see also bias).

Allele: The alternative form of a gene that can exist at a single locus.

Ascertainment: The process of determining what is happening in a population or study group, e.g. finding cases.

Audit: An examination or review that establishes the extent to which a condition, process or performance conforms to pre-determined standards or criteria.

Bar chart: A graph used to present discrete data. Each observation can fall into only one category. Frequencies of each group of observations are represented by the heights of the corresponding bars.

Baseline group: The exposure group (often the unexposed group) with which other exposure groups are compared. Also known as the reference group or reference category.

Bias: a departure from the true value when one observes a prevalence in a cross-sectional study or an association between an exposure and an outcome in an analytical study.

Selection bias: A systematic difference in the likelihood of subjects being selected to take part in the study that is related to both their exposure and their outcome status.

Loss to follow-up bias: Subjects are often lost over a follow-up period. If this loss is unrelated to both the exposure and outcome the results will not be biased. If loss to follow-up is associated with exposure and outcome then the results will be biased. This could operate in either direction so that the risk estimate may be greater or less than the true risk.

Measurement bias: A bias in how exposure and/or outcome is measured or classified that results in different quality (validity) of information collected between comparison groups.

Blinding: This is where subjects and / or the outcome assessors are unaware of treatment allocation in a trial until the study is completed.

Case control study: An epidemiological study design where subjects are recruited on the basis of the presence or absence of disease (cases or controls) and exposure is measured retrospectively. In this way it is possible to estimate the risk of disease associated with exposure usually by calculating an odds ratio.

Case definition: A set of diagnostic criteria used to classify individuals as having disease. Often but not always the same as what is used in normal clinical care.

Case series: Collections of individual case reports. May be helpful in recognising new diseases but cannot be used to test for the presence of a valid statistical association.

Central tendency: The centre or middle value of a frequency distribution. Commonly known as the average. Mean, median and mode are examples of measures of central tendency.

Chance: Variation which is due to random fluctuations.

Clinical equipoise: A state of genuine uncertainty about the benefits or harm that may result from each of two or more regimens. This is an ethical pre-requisite for a randomised controlled trial. In practice, some evidence or ‘hunch’ is required that the new treatment may be better than the old.

Clinical iceberg: The phenomenon that doctors are only aware of the relatively small proportion of disease that presents to them.

Cohort study: (from Latin cohors warriors, tenth part of a legion) An epidemiological study whereby a defined subset of the population can be identified and classified according to exposure status. The main feature of a cohort study is that it can determine the incidence rate of disease amongst exposed and unexposed individuals. Common synonyms include longitudinal or follow-up study.

occupational cohort: The definition of the cohort is based primarily on a common occupational exposure e.g. workers in the nuclear power industry. In this way the risk of disease can be compared with the general population or other occupational groups to determine the occupational risk.

prospective cohort: Healthy individuals are recruited, though some may already have the disease at baseline, and followed up for future disease occurrence, often for decades. Exposure status is measured at baseline and repeat measures for change in exposure may be undertaken over the follow-up period.

retrospective cohort: Disease status for a defined subset of the population is ascertained at baseline but this is linked to pre-existing historical data on exposure either from routine records or an earlier research project so that the cohort’s experience of disease risk can be reconstructed.

Concealment: Concealment is where random allocation is hidden from investigators making it impossible for them to have any influence over allocation of patients.

Confidence interval: An interval with given probability (e.g. 95%) that the true value of a parameter such as a mean, difference between proportions or risk ratio is contained within the interval.

Confounding: A situation in which a measure of the effect of an exposure is distorted because of the association of exposure with other factor(s) (“confounders”) that influence the outcome under study.

Contingency table: A table showing the frequencies of observations for two categorical variables such that sub-categories of one variable (exposure) are indicated in rows and sub-categories of the other variable (outcome) are indicated in columns. The simplest form is the 2×2 table, when both variables are binary (dichotomous). The notation for the cells of a 2×2 table used in this course is shown in the table below:
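
                 Disease (d)    Healthy (h)    Total
Exposed (1)      d1             h1             n1
Unexposed (0)    d0             h0             n0

d=disease, h=healthy, 1=exposed, 0=unexposed; n1 and n0 are the row totals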


Correlation co-efficient: A measure of the strength of linear association between 2 variables.

Cross-sectional study: A study that examines the relationship between diseases (or other health-related characteristics) and other variables of interest in a particular population at one particular time. Cross-sectional studies may be used to estimate the prevalence of disease, but not the incidence of disease.

Crude (unadjusted) association: the estimated association between exposure and outcome, before possible confounding variables are taken into account.

Denominator: The lower portion of a fraction used to calculate a rate or a ratio; the population at risk; often person-years.

Descriptive studies: Studies concerned with and designed only to describe the existing distribution of variables. This is in contrast to analytical studies, which examine a hypothesis.

Dose-response effect: The pattern of association between increasing exposure and disease risk, i.e. the greater the exposure, the bigger the effect.

Ecological fallacy: The bias that may occur because an association observed between variables on an aggregate level does not necessarily represent the association that exists at an individual level.

Ecological study: A study in which the units of analysis are populations or groups of people, rather than individuals. An example is the association between median income and cancer mortality rates in administrative jurisdictions such as Primary Care Trusts or Regions.

Epidemiology: The study of the distribution and determinants of health-related conditions or events in specified populations and the application of this study to the control of health problems.

descriptive epidemiology - observations relating measures of disease occurrence with basic characteristics such as age, sex, geography, ethnicity, socioeconomic status and secular trends (Time, Place, Person). Often used to generate aetiological hypotheses.

Error factor (EF): A measure of precision for a ratio measure. For example the error factor for a risk ratio (RR) is exp(1.96 × s.e. of log RR). It is used to calculate the 95% confidence interval; for example the 95% C.I. for the RR is RR/EF to RR × EF.
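
As a rough illustration of the error factor, the sketch below computes a 95% confidence interval for a risk ratio from the cells of a 2×2 table (d1 and d0 are the numbers with disease, n1 and n0 the totals, in the exposed and unexposed groups). The standard error of log RR is not given in this glossary; the usual large-sample approximation, sqrt(1/d1 - 1/n1 + 1/d0 - 1/n0), is assumed here, and the function name and counts are hypothetical.

```python
import math

def risk_ratio_ci(d1, n1, d0, n0, z=1.96):
    """Return (RR, lower, upper) using the error factor EF = exp(z * s.e. of log RR)."""
    rr = (d1 / n1) / (d0 / n0)
    # Assumed large-sample standard error of log RR (not stated in the glossary).
    se_log_rr = math.sqrt(1 / d1 - 1 / n1 + 1 / d0 - 1 / n0)
    ef = math.exp(z * se_log_rr)
    # The interval runs from RR / EF to RR * EF.
    return rr, rr / ef, rr * ef

# Hypothetical counts: 30 cases among 100 exposed, 15 cases among 100 unexposed.
rr, lower, upper = risk_ratio_ci(30, 100, 15, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is built by dividing and multiplying by the same factor, it is symmetric around the RR on the log scale rather than on the original scale.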

Evidence based medicine: The conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients.

Experimental study: A study in which the investigator intentionally alters one or more factors under controlled conditions in order to study the effects of doing so; usually a randomised controlled trial.

Exposure variable: A variable whose influence on the outcome variable we wish to assess. Exposure variables are also known as risk factors, explanatory variables, independent variables or x-variables. In the context of a randomised trial the exposure variable is the treatment being assessed.

Frequency distribution: The complete summary of the frequencies of the values or categories of a measurement made on a group of persons. The distribution tells either how many or what proportion of the group was found to have each value (or each range of values) out of all the possible values that the quantitative measure can have.

Geometric mean: The back transformation (antilog) of the mean log value.
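
A minimal sketch of this definition, assuming natural logs and strictly positive values (the function name is hypothetical):

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the natural-log values; all values must be > 0."""
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# For the values 1, 10 and 100 the arithmetic mean is 37, but the
# geometric mean is approximately 10.
print(geometric_mean([1, 10, 100]))
```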

Hierarchy of evidence: This is a simple guide to help assess evidence from different study designs. RCTs are viewed as the highest level of evidence, followed by cohort, case control, cross-sectional and ecological studies, and case series or anecdote. This should not be applied too rigidly, as a well-designed cohort study may be superior to a badly designed RCT.

Histogram: A graphic representation of the frequency distribution of a variable. The area of the bar represents the frequency of the variable.

Homozygotes: Individuals having two identical alleles of a particular gene.

Hypothesis:  An idea expressed in such a way that it can be tested and refuted.

Incidence rate: The number of new cases of a disease, divided by the total population at risk multiplied by the time interval (i.e. the person-time at risk).
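
As an illustration, the sketch below treats the denominator as person-years at risk. The cohort figures and the function name are hypothetical.

```python
def incidence_rate(new_cases, person_years_at_risk):
    """New cases divided by the total person-years at risk."""
    if person_years_at_risk <= 0:
        raise ValueError("person-years at risk must be positive")
    return new_cases / person_years_at_risk

# Hypothetical cohort: 1,000 people each followed for 4 years, 12 new cases.
rate = incidence_rate(12, 1000 * 4)
print(f"{rate * 1000:.1f} per 1,000 person-years")
```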

Informed consent: Consent given by the subject or responsible person for participation in a study. For informed consent to be ethically valid the investigator must disclose all risks and benefits, the participant must understand the condition and all the risks and benefits, the participant must be competent and consent must be given voluntarily.

Intention to treat analysis: Intention to treat analysis (ITT) is where all participants are analysed according to their group allocation, regardless of whether they completed the trial. The alternative is ‘on-treatment’ analysis, which is limited to those who completed the trial according to protocol. On treatment analysis defeats the main purpose of random allocation and may invalidate the results.

Inter-quartile range: The inter-quartile range describes the spread of data around the median. It is the distance between the lower quartile value and the upper quartile value of a distribution.
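
A small sketch using Python's standard library. Note that several conventions exist for locating quartiles in small samples; this example simply uses the default method of `statistics.quantiles`, which is an assumption rather than the course's stated method.

```python
import statistics

def interquartile_range(values):
    """Return (lower quartile, upper quartile, inter-quartile range)."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3, q3 - q1

q1, q3, iqr = interquartile_range([1, 2, 3, 4, 5, 6, 7])
print(q1, q3, iqr)
```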

Intervention study: An investigation involving intentional change in some aspect of the status of subjects; introduces a therapeutic or preventive regime; designed to test a hypothesis; usually a randomised controlled trial.

Logarithmic transformation: Data are transformed by converting them to their natural log values, usually to make the distribution closer to normal. This facilitates some statistical analysis. Other transformations are possible but log transformation is the most common.

Mean: the average of a set of observations, derived by adding their values and then dividing by the total number of observations.

Median: A measure of central tendency, which is useful if the data is skewed. It is the value that halves the distribution. It is the middle value when the values in a set are arranged in order. If there is an even number of values the median is defined as the mean of the two middle values.

Mendelian randomisation: A type of observational study design that mimics a randomised controlled trial, where genotypes are used as proxies for environmental and behavioural exposures. These genotypes are randomly allocated at conception and so in theory should minimise confounding and improve causal inference.

Mode: The mode, another measure of central tendency, is the most frequently occurring value in a set. It is rarely used in epidemiological practice. When there is a single mode, the distribution is known as unimodal. If there is more than one peak the distribution is said to be bimodal (two peaks) or multi-modal.

Natural experiment: A type of observational study in which individual exposure status is determined by nature or other factors outside the control of investigators, with the process regarding the allocation of exposure resembling random assignment.

Normal Distribution: This is a continuous symmetrical frequency distribution where both tails extend to infinity, the arithmetic mean, mode and median are identical and its shape is determined by the mean and standard deviation.

Null hypothesis: The hypothesis that there is no difference between two groups. Statistical methods look for evidence against the null hypothesis by calculating a P value.

Number needed to treat (NNT): The number of people with a specified condition who need to be treated for a specified period of time, according to a specified protocol, in order to achieve one additional beneficial outcome or prevent one adverse outcome (NNT benefit), or to cause one additional adverse outcome (NNT harm). It is the inverse of the risk difference.
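
The relationship with the risk difference can be sketched as follows. The risks used are hypothetical, and taking the absolute value of the difference is a choice made for this example so that the same function covers both benefit and harm.

```python
def number_needed_to_treat(risk_control, risk_treated):
    """Inverse of the absolute risk difference between control and treated groups."""
    risk_difference = risk_control - risk_treated
    if risk_difference == 0:
        raise ValueError("NNT is undefined when the risk difference is zero")
    return 1 / abs(risk_difference)

# Hypothetical trial: risk of the adverse outcome is 20% on control, 15% on treatment.
nnt = number_needed_to_treat(0.20, 0.15)
print(f"NNT is about {nnt:.0f}")
```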

Numerator: The upper portion of a fraction used to calculate a rate or a ratio. 

Observational study: Non-experimental study; Epidemiological study that does not involve any intervention, experimental or otherwise; nature is allowed to take its course, with changes in one characteristic being studied in relation to changes in other characteristics. Case control and cohort studies are observational studies because the investigator is observing without intervention other than to record, classify, count and statistically analyse.

Odds ratio: The ratio of odds of exposure amongst subjects with disease compared to the odds of exposure amongst a control group. It is equal to the cross-product ratio.

Odds ratio (OR) = odds in exposed ÷ odds in non-exposed = (d1/h1) ÷ (d0/h0) = (d1 × h0) ÷ (d0 × h1)
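
A short sketch of the cross-product calculation, using the glossary's cell notation (d = disease, h = healthy, 1 = exposed, 0 = unexposed). The counts and the function name are hypothetical.

```python
def odds_ratio(d1, h1, d0, h0):
    """Cross-product ratio (d1 * h0) / (d0 * h1) from a 2x2 table."""
    return (d1 * h0) / (d0 * h1)

# Hypothetical table: 30 diseased and 70 healthy among the exposed,
# 15 diseased and 85 healthy among the unexposed.
print(f"OR = {odds_ratio(30, 70, 15, 85):.2f}")
```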

Outcome variable: A variable, often a measure of disease occurrence, whose occurrence we wish to investigate and which is therefore the focus of interest of our analysis. Outcome variables are also known as response variables, dependent variables or y-variables. In a case-control study the outcome variable is case-control status.

P value: The probability that the difference between groups would be as big as or bigger than that observed, if the null hypothesis of no difference is true. The smaller the P value, the stronger is the evidence against the null hypothesis that there is no difference between the groups.

Pie chart: A (often unpleasant) circular diagram divided into segments, each representing a category or subset of data.

Placebo: An inert medication or procedure, i.e. having no pharmacological effect. It is intended to give patients the perception that they are receiving treatment for their complaint. From the Latin placebo ‘I shall be pleasing’.

Point estimate: A statistic calculated from the sample that is used as a (point) estimate of the value in the population (see also population parameter). For example, the sample mean might be a point estimate of the population mean.

Population: see target population. 

Population parameter: An unknown value in the population that we are trying to estimate using data collected in our sample.

Power: The ability of a study to demonstrate an association if one exists. The power of a study is determined by several factors, including the frequency of the condition under study, the magnitude of the effect, the study design and the sample size. It is the probability of observing evidence against the null hypothesis, if it is indeed false.

Precision: The amount of variation in the sample statistic; the greater the variation the smaller the precision.

Prevalence: The total number of individuals, who have an attribute or disease at a particular time or during a particular period, divided by the total population at risk. 

Proportion: the number of occurrences of an event divided by the total number of observations.

Random allocation, randomisation: Allocation of individuals, in randomised controlled trials, to the intervention group or the control group, by chance alone.

Randomised controlled trial: A study in which individuals are randomly allocated to two or more groups. Often, one of these groups will be the treatment group while the other will be a placebo group that receives no treatment other than standard care.  The key elements of randomised controlled trials are:

-        The comparison of a group receiving the treatment (or intervention) under evaluation, with a control group receiving either best practice, or an inactive intervention.

-        Use of a randomisation scheme to ensure that no systematic differences, in either known or unknown prognostic factors, arise during allocation between the groups. This should ensure that estimated treatment effects are not biased by confounding factors (see Chapter 18).

-        Allocation concealment: Successful implementation of a randomisation scheme depends on making sure that those responsible for recruiting and allocating participants to the trial have no prior knowledge about which intervention they will receive. This is called allocation concealment.

-        Where possible, a double blind design, in which neither participants nor study personnel knows what treatment has been received until the “code is broken” after the end of the trial. This is achieved by using a placebo. If a double-blind design is not possible then outcome assessment should be done by an investigator blind to the treatment received.

-        An intention to treat analysis in which the treatment and control groups are analysed with respect to their random allocation, regardless of what happened subsequently.

Rate: A measure of the frequency of occurrence of a phenomenon. The components of a rate are: the number of cases (numerator), the number at risk (denominator) and a specified period of time. Unlike a risk, the denominator usually comprises precise ‘person years at risk’.

Reference group: see Baseline group.

Reference range: This range measures how much variation there is between the individual observations in a sample. It tells us the likely values for an individual in the population.

Regression: Finds the best mathematical model to describe y, the outcome, with respect to x, the exposure. The most common form is linear regression. The regression co-efficient is an estimate of the change in outcome (y) for a unit change in exposure (x) according to the equation y = a + bx, where a is the intercept and b is the slope. The regression line is a diagrammatic presentation of a regression equation.
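
As a sketch of how the intercept a and slope b might be estimated, the example below assumes ordinary least squares as the fitting criterion, which the glossary does not name explicitly. The function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-products of deviations over sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```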

Reverse causality: This term is applied to an exposure - outcome association which is thought to be due to the outcome actually causing the exposure rather than the other way round. For example it was noted that low cholesterol levels were associated with an increased risk of cancer of the bowel. Initially it was thought that low cholesterol may cause this disease but it was subsequently shown that patients with cancer of the bowel, though not yet diagnosed, already had lower cholesterol presumably secondary to the disease. One must always consider this alternative possible explanation for an association.

Risk: The probability that an event will occur; number of new cases of a disease (numerator) / number of people initially disease free and at risk over a specified time (denominator).

Risk difference: The difference in risk between exposed subjects and non-exposed subjects.

Risk difference = risk in exposed - risk in non-exposed  = (d1/n1) - (d0/n0)

Risk factor: A factor or characteristic that might alter the risk of disease.

Risk ratio (sometimes also referred to as the relative risk and often abbreviated to RR): The risk ratio is the risk of developing disease associated with an exposure divided by the risk of developing the disease in the absence of exposure.

Risk ratio (RR) = risk in exposed ÷ risk in non-exposed    = (d1/n1) ÷ (d0/n0)
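
The risk difference and risk ratio can be computed from the same four numbers. The sketch below uses the glossary's notation (d1 and d0 are the numbers with disease, n1 and n0 the totals, in the exposed and unexposed groups); the counts and function name are hypothetical.

```python
def risk_measures(d1, n1, d0, n0):
    """Return the risk difference and risk ratio for exposed versus unexposed."""
    risk_exposed = d1 / n1
    risk_unexposed = d0 / n0
    return {
        "risk_difference": risk_exposed - risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
    }

# Hypothetical counts: 30 cases among 100 exposed, 15 cases among 100 unexposed.
measures = risk_measures(30, 100, 15, 100)
print(f"RD = {measures['risk_difference']:.2f}, RR = {measures['risk_ratio']:.2f}")
```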

Sample: A selected subset of a population.

Selected sample: usually a random sample of individuals that have been selected from the target population.

Study sample: the sub-group of subjects from the selected sample that actually agree to take part and contribute data to the study.

Sample size calculation: The mathematical process of deciding before the study begins, how many subjects should be studied. In order to calculate the required sample size, the investigator needs to specify four things:

  1. The expected level of outcome in the control (placebo) group.
  2. The smallest difference they wish to detect (% difference).
  3. The strength of the evidence (p value) they wish to find, usually 5%.
  4. The probability of detecting a difference at a specified p value, if the true difference is the size they expect. This is called the power of the study and is often set at 90%.
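
To show how the four inputs combine, here is a sketch for comparing two proportions. The formula used, n per group = (z for significance + z for power)² × [p0(1 - p0) + p1(1 - p1)] ÷ (p0 - p1)², is a common normal-approximation formula and is an assumption of this example; the glossary does not specify one. The function name and input values are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treated, alpha=0.05, power=0.90):
    """Approximate number of subjects needed in each of two equal-sized groups."""
    # Standard normal quantiles for a two-sided significance level and the power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p_control - p_treated) ** 2
    # Round up, since a fraction of a subject cannot be recruited.
    return math.ceil(n)

# Hypothetical trial: 20% outcome in controls, smallest difference of interest 5 points.
print(sample_size_per_group(0.20, 0.15))
```

Under this formula, halving the difference to be detected roughly quadruples the required sample size, which is why the smallest difference of interest matters so much.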

Sampling: The process of selecting a number of subjects from all the subjects in a particular population.

Sampling distribution: The distribution that would be observed if we derived a sample statistic, such as a mean or a difference between proportions, from repeated samples from the same population.

Sampling error: That part of the difference between the observed value of a sample statistic (such as a mean or a difference between proportions) and the true value in the population, caused by random variation.

Scatter plot: This is a graphical display of the association between two numerical values.

Skewed: An asymmetrical frequency distribution.

Standard deviation: A measure of how widely dispersed are the individual observations in a distribution. The standard deviation is the square root of the variance.

Standard error: The standard deviation of the sampling distribution of a sample statistic such as a mean or a difference between proportions.
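
For the particular case of a sample mean, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. That formula is assumed here (the glossary does not state it), and the data are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))

observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(f"SD = {statistics.stdev(observations):.3f}")
print(f"SE of mean = {standard_error_of_mean(observations):.3f}")
```

The standard deviation describes the spread of individual observations, whereas the standard error describes the precision of the sample mean and shrinks as the sample size grows.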

Statistics: The science of collecting, summarising, presenting, interpreting data, estimating the magnitude/strength of relationships and testing hypotheses.

Target population: The collection of individuals about whom we wish to draw inferences or to whom we wish to generalise. The population does not have to be human; it can refer to animals, settings or events. Sometimes the target population is intangible: for example, in a clinical trial it may be all future patients who will be prescribed a treatment, and in a laboratory experiment designed to understand some phenomenon it is the true behaviour of something.

Threshold effect: A pattern of association between exposure and disease in which only subjects whose exposure is above a certain level are at increased risk.

Variability: The extent to which the values of a variable in a distribution are spread out from the centre.

Variable: A quantity that varies. An attribute, phenomenon, or event that can have different values.

Numerical variable:  variables given a numerical value.

Continuous: A variable with a numerical value, which has a potentially infinite number of possible values along a continuum, within a specified range.

Discrete: A variable with a numerical value, which cannot take on any intermediate values, e.g. number of children, number of deaths.

Categorical variable: a variable, which refers to categories. It is given a ‘value label’, which is usually a number.

Dichotomous or binary: A variable where only two categories are possible.

Ordered categorical: A variable where values are ranked according to an ordered classification.

Unordered categorical: A categorical variable where categories have no order to them.