ABSTRACT

There is little doubt that the well-known quote "All models are wrong, but some are useful" also applies to item response theory (IRT) models. For instance, in large-scale educational surveys, such as those conducted in the Programme for International Student Assessment (PISA) project, tests of model fit easily become significant. Since item calibration in the PISA project is typically carried out using samples of more than 17,000 students from some 34 OECD (Organisation for Economic Co-operation and Development) countries, this should come as no surprise. Rather than testing the fit of IRT models against unspecified alternatives, our interest should therefore be in assessing specific discrepancies between observations and model predictions, known as residuals, to evaluate whether the intended inferences made from the model are trustworthy. A plethora of residuals can be defined, including residuals that target potential violations of specific well-known model assumptions.