Psychometrics is the statistical discipline used to ensure that educational assessments are fair, reliable, and valid. In this blog, Dr Tom Benton and Carmen Lim, researchers in the field, explain how psychometric methods underpin every stage of assessment—from test design and scoring to maintaining standards over time. They explore two key approaches, Classical Test Theory and Item Response Theory, showing how these models help identify and improve the quality of test questions and ensure results remain consistent across different groups of learners.
What is psychometrics in assessment?
In educational settings, psychometrics refers to a field of study that applies a set of statistical theories and models to examine the extent to which assessments meet their fundamental principles. Assessments must be reliable, valid, and fair, regardless of their purpose or form of testing, and those that meet these principles are considered high quality. The objective of psychometrics is therefore to understand and maximise the quality of assessments.
How does psychometrics maximise the quality of assessments?
Using psychometrics, we can produce empirical evidence that indicates the quality of a test and identifies specific areas for improvement. Psychometrics plays a crucial role in every stage of an assessment, including test construction, performance reporting and standards maintenance.
For example, during test construction, psychometrics ensures that the questions included are valid for their intended purpose. It also helps us make informed decisions about test construction, such as determining the optimal test length to achieve the desired level of reliability. Furthermore, psychometrics ensures that we can accurately infer candidates’ ability or proficiency in the assessed subject by making sure the reported scores are aligned with the test standard. Lastly, using psychometrics in standard maintenance ensures that the grades awarded from a qualification are comparable across different test versions over time.
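As a concrete illustration of the kind of decision this supports, the Spearman-Brown prophecy formula—a standard Classical Test Theory result—predicts how reliability changes when a test is lengthened or shortened. Below is a minimal sketch in Python; the reliability values and item counts are hypothetical, chosen purely for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Spearman-Brown prophecy formula: predicted reliability of a test
    whose length is multiplied by `length_factor`, given its current
    reliability."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a 20-item test with reliability 0.75.
# Doubling it to 40 items is predicted to raise reliability to about 0.86.
current_reliability = 0.75
doubled = spearman_brown(current_reliability, 2.0)
print(f"Predicted reliability at double length: {doubled:.3f}")
```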
To achieve these purposes, practitioners commonly use techniques from two families of psychometric models: Classical Test Theory and Item Response Theory.
What is Classical Test Theory?
Educational assessment can be seen as a tool to measure the unobserved, underlying ability of candidates, such as their English language reading proficiency. Classical Test Theory refers to a family of statistical approaches that use candidates’ raw scores as a good enough approximation of their underlying ability. This leads to some fairly simple statistical formulae for exploring the difficulty and quality of items as well as the comparability of different tests.
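To give a flavour of how simple these formulae are, the sketch below computes two standard classical item statistics from a matrix of item scores: the facility value (the proportion of candidates answering each one-mark item correctly) and an item-rest correlation, a common measure of how well an item discriminates between stronger and weaker candidates. The response data here is made up for illustration.

```python
import numpy as np

# Hypothetical response matrix: rows = candidates, columns = one-mark items
# (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Facility value: proportion of candidates answering each item correctly.
facility = scores.mean(axis=0)

# Discrimination: correlation between each item and the total of the others.
discrimination = []
for j in range(scores.shape[1]):
    rest = scores.sum(axis=1) - scores[:, j]
    discrimination.append(np.corrcoef(scores[:, j], rest)[0, 1])

print("Facility values:      ", facility)
print("Item-rest correlations:", np.round(discrimination, 2))
```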
What is Item Response Theory?
Item Response Theory is more complex and became prevalent much later than Classical Test Theory. This family of models is based on the idea that we can infer how each item relates to candidates' unobserved, underlying ability from the relationships between scores on the items themselves. Although more complex, the approach allows a much more detailed assessment of how different combinations of items might work together to help create reliable assessments.
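For instance, in the widely used two-parameter logistic (2PL) model, the probability of a correct response is modelled as a function of the candidate's ability and the item's difficulty and discrimination parameters. The sketch below shows the shape of this relationship; the parameter values are made up for illustration.

```python
import math

def prob_correct(theta: float, difficulty: float, discrimination: float) -> float:
    """Two-parameter logistic (2PL) item response function:
    probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical item with difficulty 0.5 and discrimination 1.2:
# the probability of success rises smoothly with candidate ability.
for theta in (-2, -1, 0, 1, 2):
    print(f"ability {theta:+d}: P(correct) = {prob_correct(theta, 0.5, 1.2):.2f}")
```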
How does Classical Test Theory compare to Item Response Theory?
Classical Test Theory (CTT) tends to be applied in a localised way – that is, with a single analysis relating to a single set of candidates and a particular test version. In contrast, Item Response Theory (IRT) can draw data from different candidates taking different test versions into a single overarching framework. In practice this means that, if the same items have been reused within many different test versions, CTT would usually require us to analyse one test at a time. In contrast, IRT might allow us to summarise all our data on item performance from all our test versions in one go.
Similarly, classical item statistics are influenced by the sample of candidates used to obtain them. Take the measure of item difficulty as an example. In CTT, item difficulty is measured using the facility value, which, for one-mark items, is the percentage of candidates who answered the item correctly. If the item is analysed using a group of candidates with higher ability than the test's intended population, then the item will appear easier (i.e. have a higher facility value) than it would for the intended candidate group. This dependency limits the generalisability of CTT-based item statistics to other populations. In contrast, with suitable approaches to calibration, IRT can produce item difficulty estimates that are sample-independent, meaning they can remain stable across different groups of candidates. This makes IRT particularly useful for developing assessments that need to function reliably across diverse groups of candidates.
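The simulation below illustrates this sample dependence under the Rasch model, a simple one-parameter IRT model. The same item, with a fixed IRT difficulty, yields a much lower facility value when administered to a lower-ability group than to a higher-ability group. All numbers and the helper function name are hypothetical, chosen only to make the point concrete.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_facility(ability_mean: float, item_difficulty: float,
                      n_candidates: int = 10_000) -> float:
    """Simulate one-mark item responses under the Rasch model and
    return the observed facility value (proportion correct)."""
    theta = rng.normal(ability_mean, 1.0, n_candidates)
    p = 1.0 / (1.0 + np.exp(-(theta - item_difficulty)))
    responses = rng.random(n_candidates) < p
    return responses.mean()

# The same item (fixed IRT difficulty of 0.0) given to two different groups
# produces very different facility values:
print("Lower-ability group: ", simulate_facility(-1.0, 0.0))
print("Higher-ability group:", simulate_facility(+1.0, 0.0))
```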
However, IRT models tend to rely on more restrictive assumptions and are methodologically more complex than CTT, whose methods are generally simpler and easier to apply. For this reason, the two approaches are often used as complementary tools in practice.
If you're interested in exploring psychometrics further, we recommend our A104 course, which delves deeper into the concepts discussed in this blog and offers a more comprehensive understanding of how assessments are designed and evaluated using statistics.