When we invited colleagues to ask questions of the Digital High Stakes Assessment Programme, a theme arose about the comparability of paper and digital assessments. Comparability, understandably, is a concern: if assessment outcomes are used interchangeably then comparability is a requirement. One colleague asked:
"High stakes exams are necessarily conservative. How can we ensure that the constructs are not radically changed and that results remain comparable with paper-based exams whilst at the same time being innovative?"
In this blog we will describe what we mean by comparability when talking about digital assessment, and how we balance introducing innovation with comparability.
How we think about comparability in educational assessment
When we talk about comparability, we mean the extent to which standards are similar. Previous Cambridge research[i] identified four types of standards – reflecting those widely accepted in educational assessment[ii]:
- Content standards – the content assessed and its value and relevance
- Demand standards – the demand of knowledge, skill and understanding required
- Marking standards – how marks are assigned
- Grading standards – the kind of performance that should receive a particular outcome (e.g. a grade)
We (and any other assessment provider) must be clear about:
- What we claim about the comparability of our assessments (for example, which of the four standards we claim are the same: that an exam in 2018 was of the same demand standard as the exam with the same title in 2019)
- How we evidence those claims (a Special Issue of Research Matters[iii] is dedicated to comparability and describes many ways that claims can be evidenced)
In the Digital High Stakes Assessment Programme we work in two different ways which require separate approaches to comparability.
Approach 1: Lifting and shifting – assessments based on existing paper exams and migrated to screen
When paper tests are simply migrated to screen, the expectation is that a certificate gained through the paper route is of the same value as a certificate gained through the digital route.
The claim: The comparability claim is that the paper and onscreen versions will be comparable in all four of the standards.
The evidence: Researchers devised and piloted a framework[i] for describing and recording comparability claims and evidence between alternative assessments. This framework can be used during assessment development to guide assessment design and monitor success in achieving comparability claims. The framework describes comparability features for each of the standards and asks three questions in relation to these features:
- Is it intended that there should be comparability between tests in terms of each standard?
- What are the differences between tests, if any, in terms of these features?
- For the standards where comparability is intended, are you satisfied that there is sufficient comparability?
For example, we are trialling an onscreen Mock Service and have carried out research on how the constructs assessed compare across the onscreen and paper versions of the assessments, thereby providing evidence about the content standards. We are planning research which will look at how learners perform on the onscreen mock compared to the paper live exam.
Research shows that migrating to screen comes with comparability challenges:
- There are differences in performance between paper and digital versions of exams. The research shows a mixed picture – in some studies the paper version was easier, in others the digital version was easier.
- These differences are greater when the responses are not text-based (e.g. mathematical equations) or where learners need to show evidence that they can do something (e.g. construct a chart).
But these challenges can be mitigated to some extent by matching the assessment mode with the teaching and learning.
Approach 2: Developing ‘born digital’ assessments
We are developing digital assessments from scratch to assess content and constructs not assessed (or assessable) on paper and which meet teachers’ and learners’ needs.
Examples from the Digital High Stakes Assessment Programme include:
- A digital computing qualification for 16 year-olds which will be an alternative to existing IGCSE and GCSE Computer Science. The digital qualification will assess different things from existing qualifications, for example communication and collaboration. And it will assess in different ways from existing qualifications, for example, using interactive problem-solving items to assess computational thinking.
- A digitally-enabled research project for History learners which aims to do two things not achieved in the paper assessments:
- To encourage iterative and social learning through continuous upload and sharing of work
- To explicitly assess skills alongside the subject understanding.
When this approach to digital assessment is taken, the expectation is that a certificate gained for this qualification will be of the same value as other parallel qualifications. External agencies, such as Ofqual, UCAS and Naric, may have a role in accrediting that this is the case.
The claim: The content standards and marking standards are different, but the demand standard and grading standard are comparable. So, we are radically changing what we're assessing in this context.
The evidence: Evidence will be gathered using comparability research methods, some of which can be used before an assessment goes live, while others are suitable only once we have data from live assessments.
- When we trial these new digital assessments and before they are used in a live context, we can iterate the demand and grading standards as necessary. We can select from a number of existing comparability tools: the CRAS scale of demand allows experts to gauge the demand of tasks; content mapping tools provide frameworks for comparing coverage; and Comparative Judgement puts learners' responses to different tasks on a single scale of performance to allow us to compare across tasks (find more details of these methods in Research Matters Special Issue 2, 2011).
- Retrospective evidence can be gathered after the assessment has had one or two live series, e.g. comparisons of student performance in the digital assessment with performance in other qualifications.
- Longer term data will also be useful in adding to comparability evidence, e.g. data on the relationship between performance in the digitally-enabled research project and measures of university success. This data will take longer to emerge.
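For readers curious about the mechanics behind Comparative Judgement, judges' paired decisions are commonly analysed with a Bradley-Terry model, which estimates a quality score for each piece of work from the pattern of wins and losses. The sketch below is a minimal, illustrative fit using invented data and plain gradient ascent; it is not the Programme's own analysis pipeline, and dedicated tools exist for real studies.

```python
import math

# Hypothetical paired-comparison data: each tuple (winner, loser) records
# that a judge preferred the first script over the second.
judgements = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("A", "B"), ("C", "B"), ("A", "C"),
]

scripts = sorted({s for pair in judgements for s in pair})
quality = {s: 0.0 for s in scripts}  # log-quality parameter per script

# Fit a Bradley-Terry model by gradient ascent on the log-likelihood.
for _ in range(500):
    grad = {s: 0.0 for s in scripts}
    for winner, loser in judgements:
        # P(winner beats loser) given the current quality estimates
        p_win = 1.0 / (1.0 + math.exp(quality[loser] - quality[winner]))
        grad[winner] += 1.0 - p_win
        grad[loser] -= 1.0 - p_win
    for s in scripts:
        quality[s] += 0.1 * grad[s]

# Centre the scale: Bradley-Terry is identified only up to a constant shift.
mean = sum(quality.values()) / len(quality)
quality = {s: q - mean for s, q in quality.items()}

# Scripts ordered from strongest to weakest on the shared scale.
ranking = sorted(scripts, key=lambda s: quality[s], reverse=True)
print(ranking)
```

In this toy data, script A wins every comparison, so it ends up at the top of the shared scale; B and C, with symmetric records, receive equal scores. The key point for comparability is that responses to *different* tasks can be judged against each other and placed on this single scale.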
Whilst we are in a transitional period, with paper and digital assessments being taken in parallel, comparability is especially important to ensure fairness to all candidates, whether they are taking a digital or paper-based assessment.
To allow users of the qualifications to understand what a qualification means (e.g. for a university to understand what a learner knows and can do) we need clear comparability claims about the content, demand, marking and grading standards, and robust research evidence to support those claims. We are designing both into our assessment development process.
Blog by Sarah Hughes, Research and Thought Leadership, Digital High Stakes & Gill Elliot, Deputy Director, Research Division
If you would like to learn more about the work of our research team, and the development of new digital high-stakes assessments, you can sign up to join our seminar, Developing Research Informed Digital Assessments on 24 November.
[i] Shaw, Crisp and Hughes (2019) https://www.cambridgeassessment.org.uk/Images/579399-a-framework-for-describing-comparability-between-alternative-assessments.pdf
[ii] For other definitions and categorisations of comparability see Crisp 2017 https://uclpress.scienceopen.com/hosted-document?doi=10.18546/LRE.15.3.13