The recent launch of ChatGPT has accelerated the conversation around AI and assessment. We spoke with Nick Raikes, Director of Data Science at Cambridge University Press & Assessment, about the key opportunities and challenges assessment practitioners should be aware of, the power of data, and his own career path.
This discussion first appeared in Perspectives on Assessment, the Cambridge Assessment Network member newsletter, which features key voices from the assessment community along with other member-exclusive content.
What opportunities are presented by developments with Artificial Intelligence (AI) and assessment?
"I guess the first thing to say is that people disagree about what AI is, so let’s sidestep that discussion and talk about the opportunities made possible by developments in data science, machine learning and AI, without worrying too much about how to classify them. Many of those developments will depend on increasing digitalisation of assessment, which provides many opportunities.
For example, it’s easy to see how extending the range of question types that can be accurately marked automatically would be useful for formative assessment and revision. We’re a long way from a general-purpose AI that could mark questions it hasn’t been trained on, but we can get quite accurate marking for specific questions by training a machine learning model on a sample of free text answers that have been marked by human examiners.
Essentially the model learns the relationship between features of the answers and marks – for example, the relationship between the presence or absence of words of a particular meaning and the mark awarded – then uses this information to predict the marks that a human examiner would give to answers that the model hasn’t been trained on.
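To make the idea concrete, here is a minimal sketch of that kind of model. It is not the system used in the research described here: the six training answers, their marks, and the function names `tokenize`, `train` and `predict` are all invented for illustration, and the technique shown (a simple Bernoulli naive Bayes classifier over word presence or absence) is only one of many possible choices. A real marking model would need far more training data than this toy sample.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training sample: (answer text, mark awarded by a human examiner).
TRAINING = [
    ("plants use sunlight to make glucose", 1),
    ("chlorophyll absorbs light and the plant makes glucose", 1),
    ("light energy is used to make sugar from carbon dioxide and water", 1),
    ("plants eat soil to grow", 0),
    ("the plant breathes in oxygen at night", 0),
    ("roots take food from the soil", 0),
]


def tokenize(text):
    """Reduce an answer to the set of words it contains (presence/absence only)."""
    return set(text.lower().split())


def train(samples):
    """Count, for each mark, how many training answers contain each word."""
    word_counts = defaultdict(Counter)  # mark -> word -> number of answers containing it
    mark_counts = Counter()             # mark -> number of answers given that mark
    vocab = set()
    for text, mark in samples:
        mark_counts[mark] += 1
        for word in tokenize(text):
            word_counts[mark][word] += 1
            vocab.add(word)
    return word_counts, mark_counts, vocab


def predict(model, text):
    """Return the mark whose training answers best match the word pattern of `text`."""
    word_counts, mark_counts, vocab = model
    total = sum(mark_counts.values())
    words = tokenize(text) & vocab  # words never seen in training are ignored
    best_mark, best_score = None, -math.inf
    for mark, n in mark_counts.items():
        score = math.log(n / total)  # how common this mark was in training
        for word in vocab:
            # Smoothed probability that an answer with this mark contains the word.
            p = (word_counts[mark][word] + 1) / (n + 2)
            score += math.log(p if word in words else 1 - p)
        if score > best_score:
            best_mark, best_score = mark, score
    return best_mark


model = train(TRAINING)
print(predict(model, "the plant uses light to make glucose"))
print(predict(model, "plants eat food from the soil"))
```

The two calls at the end predict marks for answers the model has not seen. Note that the model only ever sees which words are present, which is exactly why the validity questions raised later in this interview matter.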
“In our research into automatic marking of short, free text answers to science questions, we found we needed several thousand marked answers for training.”
Similarly, if we ask teachers to label common misconceptions or mistakes in a sample of answers, we might be able to train a model to recognise these so that it could give useful feedback to students attempting the question in future. Machine learning requires quite a lot of training data, though.
Or to think of a different example, imagine having a digital practical assessment, where we set students tasks and use cloud computing to provide them with standardised tools, resources and workspaces. We could capture the detailed sequence of actions each student takes, as well as task outcomes. Potentially we could extract features from all this process data which provide assessors with far more direct evidence of a student’s practical skills than can be inferred from their finished product alone.
With enough data and assessor judgements, we might be able to train a machine learning model to make the judgements – not for use in high stakes assessment, necessarily; but this could be great for providing students with instantaneous, individualised feedback in low stakes contexts."
What challenges should assessment practitioners consider and how can we mitigate risks?
"I think the first point to make is that just because a computer can do something, that doesn't mean we shouldn't still ask students to do it in an assessment or learning context. For example, even if a computer does eventually learn to be able to write perfect essays to order, it still can be legitimate to ask students to write essays in an exam. That's because we're mainly asking students to write essays because they give us a “window” into their minds so that we can assess their understanding, skills and knowledge – we don’t particularly value the essays themselves.
Sometimes people argue that if you can look something up then you don't need to know it. But that misses the point, because you need to really understand things. For example, I did a physics degree, and of course there are explanations of general relativity in textbooks and so on. But as a physicist I need to understand general relativity if I'm going to be able to use it. I need to know about it in order to think about how I might use that knowledge and so on. And there were lots of other things I needed to know and understand before I could understand general relativity. So students still need to know and understand things, and we should use the most effective methods for eliciting evidence of what they know and understand for assessment.
So that's the first area, which means that, you know, we don't have to abandon everything that we've done before just because there is some new technology. The fundamentals of what makes a valid assessment haven’t changed.
Another issue is of course bias. Machine learning is trained on data, and it will replicate whatever is in that data: if there's bias in the data, then a machine will learn to be as biased as the people who produced the data. So we still need to be very aware of looking for bias.
We also need to be careful with automatic marking – there are a couple of issues here. Firstly, if the computer comes up with the same mark for an essay, let's say, as a human does, does it matter if the computer is actually looking at completely different things from what the human looks at? For example, the computer might actually be looking at quite superficial features of the writing which correlate very well with the mark, but aren’t the kind of things that the human marker is looking for.
And I would answer that yes, it does matter. It may be true at the moment, when we haven't got automatic marking and students know that we haven't, that those surface features correlate with the mark. But if we introduce automatic marking and people know that these are the features that are counted, then obviously that relationship may break down and we may end up assessing something different from what we intended.
So we still need to consider the validity of our assessments and we can't just focus on the reliability of our automatic marking. There is also the risk that the scope of subject areas could get reduced to what can be assessed in this way - we need to avoid those kinds of unintended consequences.
Another really important issue with some of the more complex machine learning methods is their lack of “explainability” – they can essentially be “black boxes”. At the very least in the context of assessment we need to investigate the validity of their outputs and confirm that the outputs are unbiased, but I’m uncomfortable about us using machine learning models that we can’t explain. In any case, we always would need to ensure there is a human in the loop for important assessment judgements, or at least an opportunity to appeal to a human."
Could you tell us a little about how you came to work in educational research and now data specifically?
"After temping at Cambridge Assessment over summer breaks from University, I started in quite a junior position as an educational researcher – I had some knowledge of statistics from my Physics degree which helped. At the time I was more interested in Computer Science, but once I’d been in the role for a while I started to find it really interesting. I gained a qualification in educational assessment, and the role has really grown with me. Now I’ve moved full time into data – I’m responsible for developing Cambridge University Press & Assessment’s capabilities in analytics and machine learning, and I also founded and lead our central Data Science unit.
“Why is data so important? Well it’s partly because we now have far more data. With our digital lives, there is so much data collected. Plus we also have the processing power and the analytical methods to be able to extract useful information from all of that data and learn from it. So there's much more opportunity.”
This means that there is more competitive pressure for us to make good use of data to help us make better decisions - so that we're basing decisions on data analysis as well as expertise and judgement.
Organisations that aren't able to do that well are at risk of being outcompeted by those that can. More fundamentally, there is the imperative to make the best decisions we can – the information we get from data helps us make good decisions. And as we’ve talked about already, there is so much value we can squeeze out of data with machine learning.
Cambridge University Press & Assessment is in the business of providing standardised information on what learners know and can do. It’s really important that we provide valid, reliable, useful information. Data is absolutely fundamental to our business.
Data is also of fundamental importance for how we run our assessment processes. For example, we have a process called statistical process control of marking quality. We have a number of essentially statistical indicators of marking quality. One of these, for example, reflects the extent to which an examiner’s marks agree with marks from the principal examiner on a subset of scripts.
We continually calculate and monitor these metrics automatically as examiners are marking online so that we can raise alerts if it looks like something's going wrong – and then a team leader can do an investigation and decide whether they need to provide some additional guidance to an examiner and whether some scripts need to be remarked, and so on.”
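As a rough illustration of how an indicator like this could be monitored, the sketch below tracks one examiner's agreement with the principal examiner over a rolling window of pre-marked scripts and flags when the average difference exceeds a tolerance. This is an assumption-laden toy, not a description of the actual process: the class name `MarkingMonitor`, the window size, the tolerance and the marks are all hypothetical, and a real system would use several indicators rather than one.

```python
from collections import deque


class MarkingMonitor:
    """Tracks one examiner's agreement with the principal examiner's marks."""

    def __init__(self, window=5, tolerance=1.0):
        # Keep only the most recent `window` absolute mark differences.
        self.diffs = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, examiner_mark, principal_mark):
        """Record one pre-marked script; return True if an alert should be raised."""
        self.diffs.append(abs(examiner_mark - principal_mark))
        window_full = len(self.diffs) == self.diffs.maxlen
        mean_diff = sum(self.diffs) / len(self.diffs)
        # Alert only once there is a full window of evidence and the average
        # difference is above the tolerance.
        return window_full and mean_diff > self.tolerance


# Hypothetical (examiner mark, principal examiner mark) pairs, in marking order.
seed_marks = [(7, 7), (5, 6), (8, 8), (4, 4), (6, 5), (9, 6), (3, 6), (8, 5)]

monitor = MarkingMonitor(window=5, tolerance=1.0)
alerts = [monitor.record(examiner, principal) for examiner, principal in seed_marks]
print(alerts)
```

In this made-up sequence the examiner drifts away from the principal examiner on the last three scripts, so the alert is raised on the seventh and eighth. An alert here would only prompt a team leader to investigate; it would not decide anything by itself.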