Strand 5: Methodology

Leaders: Fiona Steele and Frank Windmeijer

5.1 Hierarchical, Crossed and Multiple Membership Data Structures

Researchers: Steele, Goldstein, Clarke

ALSPAC data are highly structured. Measurements of the same or similar characteristics of the children are repeated over time, leading to a two-level hierarchical structure of measurement occasions nested within children. Children are clustered in schools and neighbourhoods in a nonhierarchical way, being nested within a three-way cross-classification of primary schools, secondary schools and neighbourhoods. In addition, the data have a multiple membership structure because families and children move over time.

Methods for estimating cross-classified models have been proposed by Raudenbush (1993) and Rasbash and Goldstein (1994). Browne et al. (2001) describe how models for mixtures of cross-classified and multiple membership structures may be estimated. However, applications of such models are scarce because of a lack of appropriate information from which to identify membership of higher-level classifications and changes over time. The ALSPAC data provide a unique opportunity to evaluate these methods and to determine the extent to which substantive conclusions change when they are used.
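
As a point of reference, one common way of writing a model that combines cross-classified and multiple membership terms is sketched below. The notation is an assumption made for illustration and is not taken from the proposal: y_i is the outcome for child i, n(i) and p(i) index the child's neighbourhood and primary school, S(i) is the set of secondary schools the child attended, and w_{ij} are membership weights.

```latex
y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}
      + u^{(\mathrm{nbhd})}_{n(i)}
      + u^{(\mathrm{prim})}_{p(i)}
      + \sum_{j \in S(i)} w_{ij}\, u^{(\mathrm{sec})}_{j}
      + e_i,
\qquad \sum_{j \in S(i)} w_{ij} = 1 .
```

In this sketch the neighbourhood and primary school terms are crossed random effects and the weighted sum is the multiple membership term. Each set of random effects is typically assumed to be normally distributed with its own variance.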

Although the focus of this methodological strand will be on educational attainment, the methods developed can be applied to studies of contextual effects on cognitive, health and behavioural outcomes in all our strands. A spatial model, which can be framed as a multiple membership model (Browne et al. 2001), will be used to allow for the effects not only of the neighbourhood of residence but also of surrounding neighbourhoods. The model will also allow simultaneously for primary and secondary school effects on attainment and for movement between schools and neighbourhoods over time. One aim of the project will be to assess the sensitivity of results to different weighting schemes in multiple membership models.
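
To make the idea of alternative weighting schemes concrete, the following minimal sketch builds multiple membership weights in two ways: proportional to the time spent in each school, and equal across all schools attended. The function name `membership_weights`, the exposure matrix and the school effects are all hypothetical and chosen only for illustration; the sketch assumes every pupil attended at least one school.

```python
import numpy as np

def membership_weights(exposure, scheme="proportional"):
    """Convert a pupils-by-schools exposure matrix into membership weights.

    exposure[i, j] is the (hypothetical) number of terms pupil i spent in
    school j. Each row of the returned matrix sums to 1. Assumes every
    pupil attended at least one school.
    """
    exposure = np.asarray(exposure, dtype=float)
    if scheme == "proportional":
        raw = exposure                        # weight by time spent
    elif scheme == "equal":
        raw = (exposure > 0).astype(float)    # same weight for each school attended
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return raw / raw.sum(axis=1, keepdims=True)

# Hypothetical data: 3 pupils, 3 schools, terms attended.
exposure = np.array([[6, 0, 0],
                     [4, 2, 0],
                     [1, 0, 5]])
school_effects = np.array([0.5, -0.2, 0.1])   # illustrative school random effects

for scheme in ("proportional", "equal"):
    weights = membership_weights(exposure, scheme)
    # Each pupil's school contribution is the weighted sum of school effects.
    print(scheme, weights @ school_effects)
```

Comparing the two printed vectors shows how the contribution attributed to schools shifts for pupils who moved, which is the kind of sensitivity the project proposes to examine.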

The project will use data from ALSPAC and the matched Pupil Level Annual School Census (PLASC). The research will draw on the expertise of colleagues at Bristol.


5.2 Identification of Causal Effects, Mendelian Randomisation

Researchers: Windmeijer, Davey-Smith, Sterne, Santos-Silva, Burton, Clarke

It is often difficult to determine causal directions due to the possible existence of unmeasured variables that determine both ‘predictor’ and ‘response’ variables. Instrumental variables (IV) and related techniques require the existence of variables with particular properties in terms of their relationships with the processes being studied. The richness of ALSPAC, especially the existence of genetic data, allows a very detailed study of this problem that would be expected to have important implications, both in terms of data to be collected and modelling procedures.

Social medicine researchers have recently adopted the IV method to establish the causal effects of an individual’s ‘phenotype’, such as cholesterol level, on health outcomes by using the person’s ‘genotype’, i.e. the genetic make-up of the individual, as an instrument (see e.g. Davey-Smith and Ebrahim, 2004, for an overview of this so-called Mendelian Randomisation approach). Use of genetic information may open new ways of establishing the magnitude of causal effects in social science research.

The research will focus on the development of IV methods for nonlinear models, such as probability models for binary outcomes. The standard setup is that of a triangular structural simultaneous equations model estimated by maximum likelihood, assuming multivariate normality of the error distribution. Because this method is not robust, other approaches have been proposed, such as the parametric (Heckman, 1978) or semi-parametric (Blundell and Powell, 2004) control-function approach, non-linear IV in the logit model (Foster, 1997) and use of a “super exogenous” variable (Lewbel, 2004). Our research will document differences in the properties of these estimators and aim to develop new estimation techniques. Using the ALSPAC data we will be able to assess the strength (or weakness) of genetic markers as instruments in certain settings, and we will test whether the instruments directly affect the outcomes. Results of this research could be used in other projects of the bid, especially 1.3ii, 1.4i and 2.2.
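
The logic of using a genotype as an instrument can be illustrated with a small simulation. The sketch below is a linear example only, under assumed parameter values that are not taken from ALSPAC: a genotype `g` affects a phenotype `x`, an unmeasured confounder `u` affects both `x` and the outcome `y`, and the true causal effect of `x` on `y` is set to 0.3. It compares naive least squares with two-stage least squares and with a control-function estimate that adds the first-stage residual as a regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_effect = 0.3                                # assumed causal effect of x on y

g = rng.binomial(2, 0.3, size=n).astype(float)   # genotype: allele count 0/1/2
u = rng.normal(size=n)                           # unmeasured confounder
x = 0.5 * g + u + rng.normal(size=n)             # phenotype
y = true_effect * x + u + rng.normal(size=n)     # outcome

def ols(design, target):
    """Least-squares coefficients of target on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)

# Naive regression of y on x: biased because u drives both x and y.
ols_est = ols(np.column_stack([ones, x]), y)[1]

# First stage: regress the phenotype on the genotype.
Z = np.column_stack([ones, g])
x_hat = Z @ ols(Z, x)
v_hat = x - x_hat                                # first-stage residual

# Two-stage least squares: regress y on the fitted phenotype.
tsls_est = ols(np.column_stack([ones, x_hat]), y)[1]

# Control function: regress y on x and the first-stage residual.
cf_est = ols(np.column_stack([ones, x, v_hat]), y)[1]

print(f"OLS {ols_est:.3f}  2SLS {tsls_est:.3f}  control function {cf_est:.3f}")
```

In this linear setting the two-stage and control-function estimates coincide. In nonlinear models, such as probit or logit models for binary outcomes, they generally do not, which is one reason the properties of the competing estimators need to be documented.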


5.3 Methods to Deal with Missing Data

Researchers: Sterne, Tilling, Heron, Leary, Goldstein, Steele, Carlin, Carpenter, White, Clarke, Spratt

The analysis of data from longitudinal studies is often complicated by the presence of missing values, caused by participant dropout or non-response. Failure to allow appropriately for missing data can lead to both biased and inefficient statistical analyses. The most flexible method to deal with missing data in the context of longitudinal studies is multiple imputation (MI) followed by complete-data analyses (Rubin, 1987). There are essentially two main approaches to performing MI: (i) multivariate normal models (Schafer, 1997), and (ii) “chained equations” (van Buuren et al., 1999). The relative advantages and disadvantages of these two methods are currently unclear.
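
Whichever imputation approach is used, the complete-data analyses are combined in the same way, using the pooling rules associated with Rubin (1987). The following minimal sketch shows that step for a single parameter. The function name `pool_rubin` and the five estimates and variances are hypothetical values chosen for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimate from each completed dataset.
    variances: squared standard error from each completed dataset.
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    pooled = estimates.mean()                  # average of the m estimates
    within = variances.mean()                  # average within-imputation variance
    between = estimates.var(ddof=1)            # between-imputation variance
    total = within + (1 + 1 / m) * between     # total variance
    return pooled, total

# Hypothetical results from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, total = pool_rubin(estimates, variances)
print(pooled, total, np.sqrt(total))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single imputed dataset as if it were complete would omit it.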

The research to be conducted under the proposed programme will focus on practical issues and difficulties in deriving multiple imputed datasets in large longitudinal studies such as ALSPAC. The very large number of variables that are of potential use in imputation can make the choice of appropriate imputation models difficult. We will focus on a number of issues. First, are certain types of variable particularly useful? Should we always include previous or subsequent measurements of a variable that we wish to impute? Second, do certain types of variable, such as measures of SEP, tend to be strongly associated with variables with missing values and therefore of use in imputation models? Third, how should we deal with potential interactions in the final model? Fourth, do imputation models that allow for longitudinal data structures have advantages over less structured models? Fifth, what are the advantages of imputation models that incorporate large numbers of variables, compared to an approach that constructs more parsimonious models? Finally, do imputations based on the multivariate normal distribution, as pioneered by Schafer, perform better than imputations using chained equations? We will address these questions using simulations and in practical examples using ALSPAC data. Results of this research could be used in all quantitative projects of the bid.
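
For readers unfamiliar with chained equations, the sketch below shows the basic cycle on simulated data: each incomplete variable is regressed in turn on all the others, and its missing values are replaced by predictions plus random noise. It is a deliberately simplified illustration and not a full implementation. In particular it uses linear regressions only and does not draw the regression parameters from their posterior distribution, as a proper multiple imputation procedure would. The function name `chained_imputation` and the simulated variables are hypothetical.

```python
import numpy as np

def chained_imputation(data, n_iter=10, seed=0):
    """Return one completed copy of data (NaN marks missing values).

    Simplified chained-equations cycle using linear regressions.
    Parameter uncertainty is ignored, so this is a sketch only.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()

    # Starting values: fill each gap with the observed mean of its column.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        for j in range(data.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue                       # nothing to impute in this column
            obs = ~miss_j
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])

            # Fit the regression on rows where column j was observed.
            coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            resid = filled[obs, j] - design[obs] @ coef
            dof = max(obs.sum() - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)

            # Replace missing values with prediction plus residual noise.
            noise = rng.normal(scale=sigma, size=miss_j.sum())
            filled[miss_j, j] = design[miss_j] @ coef + noise
    return filled

# Simulated example: z fully observed, x1 and x2 each about 20% missing.
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
data = np.column_stack([z, x1, x2])
mask = np.zeros_like(data, dtype=bool)
mask[:, 1:] = rng.random((n, 2)) < 0.2
data[mask] = np.nan

completed = chained_imputation(data)
```

Calling the function with different seeds would give several completed datasets, each of which could then be analysed and the results pooled.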
