e-Stat social science applications and methodology
The project involves statistical work in five specific social science application and methodology topic areas:
- Topic 1: Measuring segregation – proposal for complex modelling (Goldstein and LEMMA 2 {Jones, Pillinger, Leckie and Fielding})
- Topic 2: ESDS feasibility study project: changing circumstances during childhood (Plewis and Walthery)
- Topic 3: Social networks in multilevel structures (Tranmer and Browne)
- Topic 4: Handling missing data via multiple imputation (Goldstein, Browne and Charlton)
- Topic 5: Sample size calculations (Browne, Goldstein and Price)
Topic 1: Measuring segregation – proposal for complex modelling
(Goldstein and LEMMA 2 {Jones, Pillinger, Leckie and Fielding})
The quasi-market reforms of the secondary education system in England and Wales, from 1988 onwards, set up new incentives and opportunities for schools and parents. Parents were given greater opportunity to choose a school for their children and were supported in making their choices through the publication of examination league tables and OFSTED inspection reports. This has created an important debate about whether social diversity or segregation among schools has changed as a result of parents exercising choice and of continuing modifications to the curriculum and status of schools. In particular, the debate has focused extensively on how to define and measure segregation. Goldstein and Noden (2003) developed a modelling approach to this issue that avoids many of the drawbacks of traditional methods; its advantage is that it opens up the possibility of causal modelling by postulating underlying population processes.
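In outline (a minimal sketch in our notation, in the spirit of that paper's binomial formulation), the count $y_j$ of pupils eligible for free school meals among the $n_j$ pupils in school $j$ is modelled as

$$ y_j \sim \mathrm{Binomial}(n_j, \pi_j), \qquad \mathrm{logit}(\pi_j) = \beta_0 + u_j, \qquad u_j \sim N(0, \sigma_u^2), $$

so that the between-school variance $\sigma_u^2$ serves as the segregation measure, and changes in segregation over time become changes in $\sigma_u^2$ that can be estimated and tested within the model.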
The project will further extend the existing model to handle ordinal measures of segregation. With the availability of the National Pupil Database (NPD) we now have the possibility of modelling segregation across all maintained schools in England for at least 6 cohorts of schoolchildren from the time they enter school through to year 11. The database contains data on individual pupils, including their test scores at key stages. Our main responses will be their eligibility for free school meals (poverty) and ethnicity. The data set is very large, currently consisting of longitudinal records for over 500,000 pupils per cohort. Interest focuses on changes over time in the extent of segregation, in terms of poverty and ethnicity, among schools and among local authorities.
There is also interest in between-area segregation, where areas are defined at the middle or lower layer super output area level (MSOA and LSOA). A major challenge is to describe all these potential effects within a single statistical model that allows the interactions between the different variables to be explored. Such a model needs to handle very large datasets with many random effects that are both cross-classified and of multiple membership form, the latter being needed to account for individual pupils' mobility among schools. A further challenge is to meet the need to run several analyses with different 'starting values', within a Bayesian MCMC framework, to check model convergence properties.
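A sketch of the kind of combined structure required (again in our notation; the classifications and weights are illustrative): for pupil $i$,

$$ \mathrm{logit}(\pi_i) = \beta_0 + \sum_{j \in \mathrm{school}(i)} w_{ij}\, u_j^{(\mathrm{sch})} + u_{\mathrm{LA}(i)}^{(\mathrm{LA})} + u_{\mathrm{area}(i)}^{(\mathrm{area})}, $$

where school(i) is the set of schools pupil $i$ has attended and the weights $w_{ij}$ sum to one (for example, in proportion to time spent in each school), so that the school effects form a multiple membership classification, while schools, local authorities and areas enter as cross-classified random effects, each with its own variance to be monitored over time.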
Topic 2: ESDS feasibility study project: Changing circumstances during childhood
(Plewis and Walthery)
The extent to which children's life chances are determined by their parents' economic, social and demographic positions in society is a recurring theme of UK quantitative social science. It is often assumed that these positions are fixed throughout childhood and adolescence when, in fact, children experience many socio-economic and socio-demographic changes as they grow up. The UK is in the fortunate position of having a number of longitudinal datasets that can be used to document these changes.
Rarely, however, are more than a fraction of these studies brought together to generate a triangulated picture of childhood change. Considerable investment is needed to bring a range of studies together to describe changes of this kind, in terms of
- understanding how the data were produced
- understanding the characteristics of the databases used to archive the data
- processing large amounts of complex longitudinal data, and
- analysing a range of datasets with different structures.
A feasibility study is proposed to determine how best to bring relevant studies together into a form that makes comparative and triangulated analysis relatively straightforward. The relevant longitudinal studies here break down into three groups:
- studies run by university groups
- studies sponsored by government
- administrative datasets.
The feasibility study will assess the ability of the available data and metadata to describe change both with age and across cohorts, examine how these descriptions might be affected by differing population definitions, sampling and non-sampling errors, and measurement error, and report on these findings. This work is supported by the joint DWP/DCSF Child Poverty Unit.
Topic 3: Social networks in multilevel structures
(Tranmer and Browne)
The use of multilevel models to take into account complex social structure and allow for dependencies between individual units (such as people) in groups (such as geographical areas, institutions or organisations) is now well established. There is also great potential for using multilevel models for social networks, or for incorporating social network information into a more established multilevel analysis of the type described above. At present there are relatively few examples of multilevel models being used for social network analysis, yet interest in this topic is large and ever-growing. It is therefore timely to demonstrate the potential of social science datasets and specialist multilevel modelling software for investigating social networks, and for investigating social network effects within other multilevel structures.
The focus of the proposed research is thus to show how social networks can be fitted as multilevel models, including complex cross-classified models (CCMs) or multiple membership models (MMMs), for a variety of social science examples based on existing secondary datasets. Computationally intensive models, such as CCMs and MMMs, are the target for our e-enabled algorithmic tools. We will exploit the many archives of secondary data sources, such as those provided by ESDS and the DAMES Node, to search for and investigate the nature and extent of the social network information that is available. We will also search for potential proxies for network structure in these datasets. In many situations the network information will have been aggregated in some way in the secondary data: for example, we may know how many friends a survey respondent has in a local area without knowing the precise details of their network, yet this can still be very useful information when used appropriately in the models. Furthermore, network homophily may mean that certain key social variables are strongly associated with network structure, so we may be able to identify proxies for network structure and control for these in the modelling approaches.
To fully understand the way in which these models work - their precision, robustness, and behaviour when information is aggregated, missing, sampled or combined in a particular way - simulation studies will be needed alongside the real data analysis. We therefore propose to simulate datasets with full network and multilevel complexity, following real data situations as closely as possible. We will first fit the multilevel models to the full simulated dataset, and then test various scenarios, corresponding to real datasets that contain information about, or relating to, the network, to see how best to use this information. Based on the information available in real datasets, we will also use the simulation studies to look at the potential for combining data from several sources, and at the implications of data collected under different sampling strategies, reflecting the way in which real survey data are collected.
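A minimal sketch of such a simulation in Python (all parameter values and the random network here are hypothetical illustrations, not the project's designs): data are generated from a multiple membership structure in which each respondent's outcome depends on the random effects of the members of their ego network, and an aggregated version of the network information is constructed alongside the full version.

```python
import numpy as np

rng = np.random.default_rng(42)
N, K = 500, 4                          # respondents; ties per ego network

# individual random effects: the u_j of a multiple membership model
u = rng.normal(0.0, 1.0, N)

# each respondent names K distinct others at random; a crude stand-in
# for a real network, kept deliberately simple for illustration
ties = np.array([rng.choice(np.delete(np.arange(N), i), K, replace=False)
                 for i in range(N)])

# multiple membership part with equal weights 1/K: respondent i's outcome
# is influenced by the effects of everyone in their ego network
beta0, sigma_e = 0.5, 1.0
y = beta0 + u[ties].mean(axis=1) + rng.normal(0.0, sigma_e, N)

# an 'aggregated information' scenario: the archived dataset records only
# how many of each respondent's ties fall in some category (here a toy
# 'local' category defined by id), not who the ties are
n_local_ties = (ties < N // 2).sum(axis=1)
```

Fitting the multiple membership model using the full tie information, then refitting using only `n_local_ties`, shows how much is lost under aggregation; in practice the model fitting would be done in MLwiN or, later, the e-Stat tools rather than coded directly.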
Finally, we aim to demonstrate, in a very practical way, how this approach may be used by quantitative researchers to answer important substantive questions in the social sciences, initially through the use of MLwiN and subsequently through e-Stat tools once they become available.
Topic 4: Handling missing data via multiple imputation
(Goldstein, Browne and Charlton)
In another recent ESRC grant (RES-000-23-0140, REALCOM), Goldstein and Rasbash developed flexible but computationally slow procedures for multiple imputation. The statistical methodology is described in Goldstein et al (2009), and the work has been integrated into MLwiN (REALCOM-Imputation). The technique is doubly computationally demanding. First, an imputation model must be estimated: this involves taking all the variables (response and explanatory) in the scientific model and fitting a multilevel, mixed response type, multivariate model, from which estimates of the distributions of each missing data value are derived. Multiple datasets are then sampled from the imputation model, the scientific model is estimated on each of these sampled datasets, and the estimates from these multiple model runs are synthesised to provide final estimates. We will transfer this work to our new system and extend the range of structures it can handle.
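The final combination step uses Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and its variance is the average within-imputation variance plus the between-imputation variance inflated by a factor (1 + 1/M). A minimal sketch of the whole workflow in Python, on a toy single-level regression (the marginal imputation used here is a deliberately crude stand-in for the joint multilevel imputation model that REALCOM fits):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: y regressed on x, with roughly 30% of x missing at random
n = 200
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
miss = rng.random(n) < 0.3

M = 20                                   # number of imputed datasets
slopes, variances = [], []
for _ in range(M):
    x_imp = x.copy()
    # crude imputation: draw missing x from the observed distribution of x
    # (REALCOM instead draws from a fitted joint multilevel model)
    x_imp[miss] = rng.normal(x[~miss].mean(), x[~miss].std(), miss.sum())

    # scientific model: least-squares slope and its sampling variance
    X = np.column_stack([np.ones(n), x_imp])
    beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = res[0] / (n - 2)
    variances.append(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    slopes.append(beta[1])

# Rubin's rules: pool the M analyses into a single estimate and variance
qbar = np.mean(slopes)                   # pooled point estimate
W = np.mean(variances)                   # within-imputation variance
B = np.var(slopes, ddof=1)               # between-imputation variance
T = W + (1 + 1 / M) * B                  # total variance
print(f"pooled slope {qbar:.3f}, standard error {np.sqrt(T):.3f}")
```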
Topic 5: Sample size calculations
(Browne, Goldstein and Price)
In a current ESRC grant we have developed a system (MLPowSim) to construct sample size calculations by simulating multiple datasets similar to the one we wish to collect.
This work will be extended and transferred to our new system in a PhD studentship supervised by William Browne and Jon Rasbash.
It is well known that the dependence induced by clustering in social science datasets means that sample size requirements for testing hypotheses need inflating to account for the lack of independence. The MLPowSim software generates both MLwiN macro code and R code to perform sample size calculations for a selection of multilevel nested and crossed designs. The software is currently limited to two separate clustering factors (whether nested or crossed), and it is challenging to consider how to simulate realistic datasets for specific crossed scenarios.
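As an illustration of the general approach (a sketch with hypothetical parameter values, not the MLwiN macro or R code that MLPowSim generates), power for a cluster-level effect in a balanced two-level design can be estimated by repeatedly simulating data and analysing the cluster means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power_cluster_design(J=30, n_per=20, beta=0.4, sigma_u=0.5,
                         sigma_e=1.0, nsim=2000):
    """Simulation-based power for a cluster-level effect.

    J clusters (half 'treated'), n_per pupils per cluster, cluster random
    intercepts with sd sigma_u and pupil residual sd sigma_e. Analysing
    the cluster means with a t-test correctly reflects the design effect
    1 + (n_per - 1) * ICC for this balanced design.
    """
    treat = np.repeat([0, 1], J // 2)
    hits = 0
    for _ in range(nsim):
        u = rng.normal(0, sigma_u, J)                      # cluster effects
        ebar = rng.normal(0, sigma_e / np.sqrt(n_per), J)  # mean residuals
        ybar = beta * treat + u + ebar                     # cluster means
        hits += stats.ttest_ind(ybar[treat == 1],
                                ybar[treat == 0]).pvalue < 0.05
    return hits / nsim

print(power_cluster_design())
```

Re-running the function over a grid of J and n_per values traces out the sample sizes needed to reach a target power, the quantity that clustering inflates relative to a simple random sample.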
In recent work, Leckie (2008) examined using the National Pupil Database to account for the effects of primary school, secondary school and neighbourhood on student achievement while also controlling for student mobility. This database contains nearly half a million pupil records, which makes using the whole database rather unwieldy. There is therefore a need for simulation-based approaches to generating appropriate samples from such a large database, in order to establish sample sizes and sampling schemes that capture the basic structure of the data and have appropriate power for testing hypotheses, both with the current data and with future years of data. This challenge is the motivation for the PhD project.
There is also a need to investigate how to choose the parameter values used in the simulated designs, and how sensitive the sample size estimates are to those values.
Another consequence of increasing the number of classifications is that the estimation options for the models become more limited; in particular, Markov chain Monte Carlo (MCMC) estimation increasingly becomes the method used. MCMC estimation is inherently slow, as it can require running the algorithm for many iterations, and when combined with the simulation approach of producing thousands of datasets, the time taken to construct a sample size calculation becomes very large. Parallel computing can easily cut this time, since different datasets can be sent to different processors and the results linked together at the end. Parallel computing may also speed up individual runs by parallelising steps of the MCMC algorithm itself.
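The dataset level of this computation is embarrassingly parallel. A minimal sketch (reusing the hypothetical design above, with one simulated dataset per task):

```python
from multiprocessing import Pool

import numpy as np
from scipy import stats

# hypothetical design parameters, matching the sketch in Topic 5 above
J, N_PER, BETA, SIGMA_U, SIGMA_E = 30, 20, 0.4, 0.5, 1.0

def one_replicate(seed):
    """Simulate one dataset and test the effect.

    Replicates are independent, so they can be farmed out to separate
    processors; in the real setting each replicate would be a full, and
    far slower, MCMC model fit.
    """
    rng = np.random.default_rng(seed)
    treat = np.repeat([0, 1], J // 2)
    ybar = (BETA * treat + rng.normal(0, SIGMA_U, J)
            + rng.normal(0, SIGMA_E / np.sqrt(N_PER), J))
    return stats.ttest_ind(ybar[treat == 1], ybar[treat == 0]).pvalue < 0.05

if __name__ == "__main__":
    with Pool() as pool:                 # defaults to one worker per core
        rejections = pool.map(one_replicate, range(2000))
    print("estimated power:", np.mean(rejections))
```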