ABSTRACT

In a two-phase study, some variables are measured on all units in a large initial sample drawn from a cohort. Then, based on the values of these variables, a subsample of units is drawn and the values of additional variables are obtained for members of the subsample. The idea was first introduced in survey sampling by Neyman (1938) – he called it “double sampling” – and in epidemiology by White (1982) – she called it “two-stage sampling.” Such designs are particularly useful when the additional variables are expensive, invasive, or difficult to measure, and can result in considerable savings. Xu and Zhou (2012), Chatterjee and Chen (2007), and others point to the increasing importance of such sampling designs in genetic epidemiology, where they can reduce the cost of studies by limiting expensive ascertainment of genetic and environmental exposures to an efficiently selected subsample of the main study. These designs also have uses beyond reducing the cost of obtaining expensive covariates. For example, adding an extra phase of sampling can provide an efficient way of making an after-the-fact adjustment for a confounder that was overlooked and not measured in the original study – in fact this was the motivation for White (1982) – or for an exposure that was not recorded in the original study but has since become of particular interest.