Maintaining comparability of results when GCSEs go digital - A reflection on lessons learnt from PISA

19 May 2022 (01:25:46)

Every 14/15-year-old in the country should sit a mock online GCSE as soon as next year, an academic has told a Cambridge Assessment Network seminar.

Video transcript

Tim Oates: [00:00:12] John, it's just such a great pleasure to have you here today in this room. Actually, we had a big inaugural lecture which sort of opened up the building, inaugural for the building, and it was Bill Schmidt. And, you know, he's that great guy, a towering presence in the big transnational surveys. And actually, you're becoming a towering presence in the big transnational surveys as well, I think. And we're going to deal with a really important issue, which is the question of mode effects: beginning to move to digital measurement or online measurement of kids' performance, and the kind of issues that gives us in a load of settings. But it will be explored particularly through the implications for measurement within PISA and the lessons learnt from PISA. I'll introduce you briefly, I'll raise a couple of issues and then invite you to start presenting. We'll go for a couple of minutes from me, about 40 to 45 minutes for you, and then we'll go to questions. We'll take them from the floor. Penny will be managing the live stream of questions.

Tim Oates: [00:01:29] By then there'll be loads of questions, and we'll box and cox that and put them to you, John. Oh, John's got a website, so I just looked at that, and John said, oh, God, it's out of date. Never mind, I'm going to read it anyway. So John's research interests include the economics of education, access to higher education, intergenerational mobility, cross-national comparisons in educational inequalities, all things which we're profoundly concerned with here. We're profoundly concerned with equity and attainment and trying to achieve both of those things in education systems. Now, John's worked extensively with the OECD Programme for International Student Assessment, the PISA data, with this research reported widely in the British media. John was a recipient of an ESRC research scholarship in 2006 to 2010 and was notably awarded the prize for the most promising PhD student in the quantitative social sciences at the University of Southampton. And now you're Professor in Educational and Social Statistics at the Institute. You know, that promise has been realised actually. So they're quite good at awarding prizes; their criteria must be pretty robust. In 2011, he was awarded a prestigious ESRC postdoctoral fellowship to continue the research into education and labour market expectations of adolescents and young adults. We've sort of mapped onto the kind of work which Anna Vignoles is doing in the faculty here. Since then, he's won the inaugural ESRC Early Career Outstanding Impact Award and an ESRC grant to study cross-national comparisons of educational attainment and social mobility.

Tim Oates: [00:03:16] But now, as I've said, Professor in Educational and Social Statistics at UCL Institute of Education, and recently actually invited to support Ofsted in terms of managing better and improving research processes around Ofsted. So that's great, John. I'll add a few personal notes. I mean, I really like working with John, and it is a great pleasure for the following reasons. You know, John tells it like it is. And I've been in meetings where he would say, well, that's wrong, and that's wrong, that's okay, that's right, and that's wrong. He is not afraid to talk truth to power. And it's quite easy to upset policymakers; it's quite easy to upset people who run the big transnational surveys. But John, despite saying that's not right and that is and that's not right, has maintained the respect of the top people in the OECD, so continues to work with people like Andreas. And I think that's no mean achievement, actually. You not only are prepared to say why things are problematic, you are prepared to support people in making them better. And that's something, again, we put great store by, in terms of things where, you know, we've said, well, that's not right. You know, we hear things like there are no private schools in Finland. Well, you know, apart from the 85 private schools in Finland which service the educational needs of 3% of the population, and then the other 9% of private upper secondary schools in Finland, and the 54% in the vocational sector.

Tim Oates: [00:05:02] These are truths which are really important, actually. And when you have untruths in policy statements about systems, it's really problematic to construct fair and analytic pictures of the performance of education systems. And that's really what you strive to do. I think, John, you focus on things which enable us to put together the jigsaw pieces about educational performance. We continue to do that. We've again tried to do it with saying responding to people who say no other nation has GCSEs. Well, when we looked at 24, high performing, 21, high performing jurisdictions, 14 of them had the equivalent to GCSEs and all of them had high stakes assessment at 16. Linking to PISA, you see statement after statement that England is not improving. Well, they are improving in mathematics. We are improving in mathematics in terms of the PISA data and in terms of other sources. We're flat in literacy. So we're not improving. Actually, we are, because so many other developing nations are declining in literacy. So our international performance, our international standing is improving. More recently from your very own institution, actually, I think I said I wouldn't say that, but never mind. I just thought.

John Jerrim: [00:06:16] You’ve started now, so you finish!

Tim Oates: [00:06:16] I'm going to finish. Recently there was a big research review by a colleague at your institution that said that phonics isn't working. Again, it was extremely selective work, which doesn't do the kind of overview work which you do, John, and is really problematic. I mean, that claim that phonics isn't working is a very, very restricted claim and one which doesn't accord with the international evidence. PISA went online having done a big field trial, and John will deal with that. I was in Oxford when TIMSS announced that it would have to do the same because of what PISA had done, but they were not able to do the same kind of deep background work that PISA did. And well, we'll hear how that deep background work was managed. It has been a great pleasure to work with you, John, over the years and come together when we're anxious and worried about things. And this work on mode effects is very close to our own hearts here in Cambridge University Press and Assessment because we're doing so much on trying to utilise the digital world in an appropriate way, both within assessment and learning. So we're very anxious to hear your reflections and your learnings from the way in which PISA responded to the challenge of going digital. I will just mention one thing you talk about in your opening slides. You talk about the chaos of 2020, and of course, here we were desperately trying not to make it chaos. But of course I think that probably the judgement of the nation and the judgement of young people and parents probably was, or is, that it was. So I think it's very important to again talk about the public reaction to what's happened in public examining and the implications that has for people's perspective on the role of digital in education, which I know you'll do with precision and care. Thank you very much indeed, John. It's a great pleasure to welcome you today.

John Jerrim: [00:08:25] My, thank you very much, Tim. What an introduction that was. I do like how everybody laughed when you said, I like working with John. The audience burst into laughter. All 650 people at home burst into laughter as well. Anyway, hopefully this presentation will go down well. As Tim was saying, it's about my reflections on being involved in the PISA 2015 study. A little bit old now, you know, seven years ago, but it was one of the first big, important, nationally representative studies to make that big transition to an online assessment. And I do think that's interesting, and I do think it holds some lessons that we can learn for if and when other exams such as GCSEs go digital as well. Why now? Why are we talking about this now? Well, why are we talking about anything now? It's the COVID pandemic, right? It suddenly got a lot more attention in the media, in public policy, in government. This is what Tim was mentioning when I talk about the grading chaos of 2020 and 2021. You know, it wasn't just grading chaos, everything was chaos. When that first lockdown hit in March 2020, everyone's lives were turned upside down, and exam grades came out in August 2020. You know, there wasn't a lot of time for things to happen to get those grades out there. What it did do was bring into question how robust the current system is to shocks, just having assessments at the end of a two-year course without anything like coursework or midway examinations in between. There's an argument that some people have put forward that, well, perhaps exams could have still taken place in some form if they were digital, hence the renewed interest in digital assessment.

John Jerrim: [00:10:17] I actually personally question that. I'll say a bit more about that in a few slides' time, but I think that's majorly underestimating the challenges of there being digital assessments. The idea that, hey, if things are digital the pandemic wouldn't have mattered for assessment: boy, it still would have. Anyway, whatever, the pandemic has led to one thing, and that's renewed high-level interest in this idea of digital assessment, and in particular high stakes digital assessment such as GCSEs and A-levels. And of course, where there's interest, or there's policy interest, the media will always follow. The exclusive from the TES: the DfE wanted to move 2021 exams online. I wanted to win the lottery in 2021. It was pretty wishful thinking. It didn't happen, right? This was something. Loic, your face there was brilliant. You're like, did they? Did they really? My God! It's so easy to say, right? But massively underestimating the challenge. This seems like one of those things that someone would have said in a meeting, and 30 seconds later would have said, yeah, actually this isn't going to be possible. But it shows the media interest around everything assessment-wise over this period, and this, that's looking backwards, has continued rumbling on, now looking forwards, around this idea of digital assessment. So, computer-based exams are a matter of not if but when, said by Colin Hughes, the head of AQA, and this idea that online tests from this point forward are possible within the next three years. So the May/June 2025 exams: it's possible that we're going to have digital assessments.

John Jerrim: [00:11:59] The claim is that it will happen, or at least that it's possible. And I'm going to come back to this idea at the end of my talk, saying, okay, if we're going to have that deadline of May/June 2025, how should we be approaching things up to then and beyond, with some kind of personal ideas. I certainly realised when I wrote this slide: I'm going to Cambridge Assessment and I've just quoted AQA. Jesus, I need to put the balance in there, to make sure there's some work reported by OCR, by Cambridge Assessment. Tell me about the trials of your own digital mock service, reported in Schools Week, being taken entirely online in nine countries and in three subjects. And apparently you guys are also making our high stakes exams available on screen by 2025, and have conducted a nine-month trial of digital progression tests. So I'd be very much interested to hear what the results of that are showing, what that's saying. But of course, yeah, this is the big picture. Lots of people are interested in this issue now. Why are people so keen on this idea of moving to digital assessment? And again, why now? Well, like I say, the narrative at the moment is that it will be more robust to shocks. When we have another pandemic, we'll be in a great position. I think this is a terrible reason, right? Or at least that's the least important reason. That horse has bolted and you're now shutting the stable door, right? When was the last pandemic before the one we had? It was the Spanish Flu from 19…

John Jerrim: [00:13:34] Thank you. I knew you would know, Tim. That's why I didn't have the date up there. So Tim is going to be here; he'll be able to tell me off the top of his head. They come around every hundred years. Is that really our number one motivation? It shouldn't be. It seems that's got a lot of high-level attention, but it shouldn't be. I think there are other important benefits as well. Efficiency of delivery to schools: you don't have couriers taking the exam papers around, or whatever else is insecure; you'd be able, in theory, to do it over the Internet, over the cloud. Efficiency in marking, depending upon question types. So obviously you've got the potential, right, for multiple choice items to be scored on the fly automatically, but potentially short constructed response items as well. The next one could be a major thing looking down the line: the idea of there being adaptive testing. So at the moment, right, you could argue that some kids are wasting a lot of time in the exam by answering some questions that are far too easy for them or far too hard for them. If we had some form of adaption in there, we would be able to get either better measurement precision, particularly around key grade boundaries, the four/five boundary, the three/four boundary or whatever, or, for the people out there that don't like assessment, don't like exams and think, well, kids take too many GCSE exams, it could end up with shorter testing time, right? If you have adaption, you may get to the solution, the measurement precision you need, for each individual kid quicker.
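
To make the adaptive-testing idea above concrete, here is a minimal sketch in Python of the basic loop: re-estimate a pupil's ability after each answer and serve the unseen question whose difficulty is closest to that estimate. Everything in it (the item bank, the Rasch-style model, the grid-search ability estimate) is an illustrative assumption, not any exam board's or PISA's actual algorithm.

```python
# Minimal sketch of simple adaptive item selection: after each answer,
# re-estimate ability and pick the unseen item closest in difficulty.
import math
import random

# Hypothetical item bank: item code -> difficulty on a logit scale.
ITEM_BANK = {f"q{i:02d}": d for i, d in enumerate(
    [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])}

def p_correct(ability, difficulty):
    """Rasch (1-parameter logistic) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses):
    """Crude maximum-likelihood estimate over a grid of ability values."""
    grid = [x / 10 for x in range(-40, 41)]
    def log_lik(theta):
        return sum(math.log(p_correct(theta, ITEM_BANK[q]) if right
                            else 1 - p_correct(theta, ITEM_BANK[q]))
                   for q, right in responses)
    return max(grid, key=log_lik)

def next_item(responses):
    """Choose the unseen item whose difficulty is closest to the current estimate."""
    theta = estimate_ability(responses) if responses else 0.0
    unseen = [q for q in ITEM_BANK if q not in dict(responses)]
    return min(unseen, key=lambda q: abs(ITEM_BANK[q] - theta))

# Simulate one pupil with true ability 0.8 taking a short adaptive test.
random.seed(1)
true_ability, responses = 0.8, []
for _ in range(5):
    q = next_item(responses)
    right = random.random() < p_correct(true_ability, ITEM_BANK[q])
    responses.append((q, right))
print(responses, "estimated ability:", estimate_ability(responses))
```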

John Jerrim: [00:15:05] So that could be a major advantage of digital assessment. Related to that, basically we've got a bit of a rubbish form of adaption at the moment, right, in terms of selecting kids into tiers in some subjects. They take the higher tier or the lower tier, which has to be preselected, and they're taught that kind of content in the curriculum. Well, actually, that could be superseded by adaption. Something that I personally think is really interesting and really exciting, where the future is in digital assessment, is this: the new question types that could come in, including things like interactive science quest… I pick science in particular, interactive science questions. So, for instance, one thing that's often said about science is, you know, what would be the ideal assessment, where you might get them on a computer to simulate doing an experiment? Right. You can't do that easily in a paper and pencil test. I can see people's faces in the audience wincing, wincing at that idea. I'm not saying this is easy. I'm not saying that should happen immediately.

John Jerrim: [00:16:09] But in theory, this is something that could come down the line. Right. Getting people to do an experiment on a computer screen. And the final reason I'm going to give you is what all my friends said to me: John, you did pretty well in your exams. We think you did pretty well in your exams because your handwriting is terrible. They couldn't see what you were actually saying and they just gave you the benefit of the doubt. Right. Reading typed responses is a lot better than reading my terrible handwriting. But there are challenges to this as well. And actually, I think a lot of the potentially really exciting stuff and the really interesting stuff that could be done with digital assessments comes later. It comes down the line; we get those potential benefits later, and we have to put up with a big lot of challenges first to be able to get there. Well, firstly, people just don't like change, whatever. Whenever this happens, people are going to turn around and say, no, what's wrong with what we have always done previously? It's always worked well. This isn't the right time to do whatever. People just don't like the change. People don't like any change to assessment. There's always going to be information and data security issues, hacking concerns, whatever. I remember having some conversations around this after doing the PISA study, where I think it was with the Welsh government, where they were talking about some of their assessments.

John Jerrim: [00:17:34] And one of the big barriers that they were concerned about was these kind of things around data security kind of and hacking and cybercrime issues, major disruptions around assessment time, logistical issues. So drawing upon my experience at the PISA 2015, I wasn't involved in actually physically doing the data collection. But obviously I worked very closely with our data collection partner and I saw what a hell of a time they had navigating schools IT systems because some schools use some anti-virus or whatever software, some use others. The software, the testing software will be liked in some, won't be liked in others, has to be very thoroughly field tested within each school. All types of logistical issues to be managed. Well, here's a secret computers crash. Computers can crash mid-assessment. And, you know, in a low stakes setting that may not be such a big deal, but in a high stakes setting like GCSEs, that's a big, big challenge that needs to be kind of considered and overcome. Differences in young people's ability with computers. So do you end up measuring their skills in mathematics or English, or do you end up measuring their skills in computers? Could be an issue becoming less and less and less of a concern over time now that everyone's getting so used to iPhones, tablets, whatever.

John Jerrim: [00:19:00] So actually, I don't think that's such a big issue, I think, as big an issue as it has been previously. But what I'm going to focus most of my discussion here around today is this idea of mode effects and the comparability between paper assessment and computer assessment. So to what extent do they give you essentially the same results anyway? All this together potentially adds up to another media examination storm because we know now that the media loves a good examination storm and you could just see how some of these challenges could all play out through a future exam season that kind of went fully digital. So if I'm going to talk about the comparability between paper and computer assessments, why might this be important? Well, firstly, comparability of results over time, some people are very interested in knowing about standards over time, how differences in achievement differ between socioeconomic groups and genders, between ethnicities or whatever. If you move from a paper based assessment to a computer based assessment, are you comparing apples with apples or apples with oranges? Do you necessarily measure the same thing over time? I put comparability across the UK here, so of course England might make a change in GCSEs or A-levels or something to digital, but other parts, the other devolved nations might not.

John Jerrim: [00:20:27] How big an issue is that? Well, the nations already have quite different examination systems, so it may not be a major concern, but it might also play out into things like A-level results. Right. If, for instance, young people struggle with computer assessments and tend to get worse marks on them, and only kids in England are doing them, that plays out into A-level grades or whatever. How might that then feed into university decisions? Not sure, but I think it's worth thinking about. Digital assessment could penalise certain demographic groups, so it could lead to, say, wider socioeconomic inequalities. Depending upon what the mode effect looks like, we might see a bigger mode effect for disadvantaged young people than for socioeconomically advantaged young people, because of various things like experience with computers. And we would want to know about that. A question would be, well, if we did move to digital GCSE assessments, would all schools move at the same time? You could instantly see a policy, right, where it would be optional: you can opt into doing a paper assessment or a digital assessment. Then if you have that optionality of doing digital or paper, this issue of mode effects becomes really, really important, right? Because you need to maintain the comparability of that assessment between those separate schools. So it depends upon how any rollout is handled. And something that also I think is worth thinking about, although I don't know how important it is or not, is the impact upon accountability measures such as Progress 8.

John Jerrim: [00:22:05] So your baseline will be a paper based test in Key Stage 2, and your outcome will be a computer based GCSE, at least for a couple of years. I don't know what the implications of that are, but I think it's something that's worth knowing about, worth thinking about. This talk is mainly going to reflect upon my involvement in PISA 2015. I'm assuming everyone here knows about PISA; the overview is that it's been conducted since 2000 across many countries across the world, in reading, maths and science. Between 2000 and 2012 it was conducted paper and pencil. In 2015 it moved to digital and has been digital ever since. I say it moved to digital; as I say here, there were 73 countries that participated in PISA 2015, and only 58 took the digital assessment. The remaining 15 took the old school paper based assessment. Now why is that important? PISA's really got two main reasons for being a thing. One is to measure trends: trends over time, trends in performance over time. And the other is to compare performance across countries. Well, if you used to do paper and now switch to computer, that could impact your measure of trends.

John Jerrim: [00:23:21] And likewise, if not all the countries are moving to computer based assessment, but some are still doing paper based assessment, that could potentially bring bias into comparisons across countries. So in PISA terms, although it's obviously a lower stakes assessment than GCSEs, this issue of comparability across modes was really, really important. And I'm going to reflect upon what they did, and what I think they got right, and what I think they got wrong and could have improved on. So what actually happened in PISA 2015 when I say digital? Well, what they did for the most part was they took the old PISA paper based questions and shoved them on computer. That was mostly what they did, other than in science, where they did introduce a limited number of new interactive test questions. So there was some interaction in the science, but mostly it was taking the old paper questions and shoving them onto a computer screen. There was no kind of adaption. It wasn't an adaptive test. It was a standard linear test progressing through. That's changed in PISA 2018 and 2022, where there is now some adaption in it. But when they first made the move, no, it was a standard linear progression test. Computer assessment. What are you all thinking? The cloud? The Internet, right? No. Don't be stupid. This was delivered on USB sticks. USB?

John Jerrim: [00:24:52] When's the last time anyone used a USB stick? A long time ago. I've got one hand. One hand. Today. Did you take the PISA test?

Audience member: [00:25:16] If you're in the public library, rather than on your own computer, and you want to save something, the likelihood is you will have to save it to a USB stick.

John Jerrim: [00:25:25] There is still a reason for USB sticks outside of PISA for existing. Anyway, this is what happened in 2015. I'm not entirely certain. I think in PISA 2018 it was still delivered by USB stick. And in preparation for this talk I googled about what's happening in 2022 and there was still quite a lot of talk about USB sticks for my liking and some very interesting things said about well, you know, it seems to be quite a lot of test taking behaviour related to how good the USB stick is because the software seems to work better on some higher quality USB sticks than others, which made me think, oh god. But anyway, this is delivered by USB sticks in school. Not any. Sorry. This talk is not just about USB sticks, but I love this point. These weren't any USB sticks. These were military grade USB sticks. Why? Because you got pupil data on them and say they cost 30 quid a pop. There's 12,000 kids in the UK took PISA. I'll let you do the maths. This is not a cheap thing to do, right? And it's a complete logistical nightmare in many countries, including this country, because you've got to get these secure military grade USB sticks to schools and upload stuff and get it back. And then you've got all the computer IT software, the computers hated the USB sticks, security software kind of kicked it out or whatever.

John Jerrim: [00:26:49] Like I said, I wasn't in charge of kind of or led the kind of data collection, but I saw years taken off people's lives who were in charge of this. Anyway, so that's what they did. How was this kind of issue of mode effects, this change tested? Well, they do a field trial in PISA about a year before the main study takes place. And I'm going to focus on one part of the field trial they did in 2015, and they basically run something like a mini randomised controlled trial in the 58 countries that took the move. So they boosted up the sample size and there was about 1200 pupils within each of these 58 countries that were given the trend items either on paper or on computer, and they were randomly assigned to do so. So 600 got it on paper, 600 got it on computer. The idea then, because you've got that random assignment between getting it on paper, a random, random computer, you could look at the mode effect, you could investigate. What impact does that have on kids’ performance, how they answer test questions or whatever? This was a smart thing to do. I stand by that. The OECD made a very good call in doing that, and it was a good decision to do it.
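
As a concrete illustration of the field-trial design described above, here is a minimal sketch in Python of how a mode effect can be estimated from that kind of random assignment: a standardised mean difference between the randomly assigned paper and computer groups, crudely converted to PISA points using the scale's standard deviation of roughly 100. The data are simulated and the sample sizes only echo the figures quoted in the talk; this is not the actual ETS/OECD analysis.

```python
# Sketch: with random assignment to paper or computer, the mode effect can be
# estimated as a standardised mean difference between the two groups.
import math
import random
import statistics

random.seed(42)
N_PER_MODE = 600                # roughly 1,200 pupils per country, split in two
TRUE_MODE_EFFECT = -15.0        # assumed penalty on the computer version, in PISA points

paper    = [random.gauss(500, 100) for _ in range(N_PER_MODE)]
computer = [random.gauss(500 + TRUE_MODE_EFFECT, 100) for _ in range(N_PER_MODE)]

def cohens_d(a, b):
    """Standardised mean difference (b minus a) using a pooled standard deviation."""
    sd_pooled = math.sqrt(
        ((len(a) - 1) * statistics.variance(a) + (len(b) - 1) * statistics.variance(b))
        / (len(a) + len(b) - 2))
    return (statistics.mean(b) - statistics.mean(a)) / sd_pooled

d = cohens_d(paper, computer)
# PISA scores are scaled with a standard deviation of about 100,
# so a crude conversion to PISA points is simply d * 100.
print(f"mode effect: {d:.3f} SD, roughly {d * 100:.1f} PISA points")
```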

John Jerrim: [00:28:07] There were, however, some issues with how I think it was implemented and being Captain Hindsight here, being able to kind of say, well, what things could have gone better? Well, firstly, this there were anecdotal reports of implementation issues. So this is actually my recall of conversations I had with other kind of national teams, project managers, seven years ago when slightly hung over some of these international meetings in a hot, stuffy room. But I do recall there being very interesting conversations about how the computer assessment went so I've spoken about the logistical issues, it wasn't just in this country, it was in all countries across the world. The software had a bit of a kind of annoying, a tendency to crash, annoying tendency to crash halfway through the assessment, not great for the test taking experience and, you know, affected some people in at least some kind of countries. There were some weird things about the assessment as well. I remember kind of having conversation or anecdotal stories about where you got the question being asked that pupils could copy and paste from the question into the answer box if they wanted. So some of the unmotivated, lower achieving whatever test takers would just copy and paste from the question into the answer box, which was slightly strange.

John Jerrim: [00:29:25] And one of the interesting things, I think, when you think about digital assessments, and this would impact GCSEs, particularly if there was any adaption brought in, was the case even in PISA when it was a linear assessment. What are you taught is the number one piece of exam technique? Get to a question: oh, that looks tricky, don't know the answer, I'm going to skip onto the next questions and come back to the tough question at the end. Exam technique. Well, one couldn't do that. Once you get past the question, you can't go back and try it again. That's particularly an issue when the kids don't necessarily know that, right, as well. So there were people skipping past the questions, thinking, oh, I'll be able to go back, and they couldn't. So there were various implementation issues with it. Secondly, in this field trial, and I still stand by this, the results never really got reported. They never really got reported to the countries in a very clear and transparent way. And even if you dig through the technical reports, I still don't think the results were that transparently reported either. ETS, who were the assessment company running it, and the OECD never made clear in advance how they would actually use this data from the field trial.

John Jerrim: [00:30:40] Now, obviously, if I was running this study, I know what I'd be hoping for. I'm hoping for no difference between the paper and the computer assessment. Right. And if there is, oh, we've got a bit of an issue here. That was clearly the hope, but, spoiler, I'll show in a minute that it didn't happen. The results were never actually reported to the countries in a way that was easy to understand. So before the main study happened, I remember again sitting in one of these rooms with the big wig assessment people at the front telling countries, and they weren't academics like me, they weren't quantitative; it was mainly a group of civil servants from various different non-research backgrounds. And you've got people coming up with some crazy IRT model saying, well, we've done this and this and it's all as we expected, or whatever. Never a very clear message about what actually happened: what actually were the results between the paper and the computer test? There were very reassuring noises, but not much evidence behind them. And I can back that up, actually, by the fact that I got three countries to work with me to produce some evidence, mainly because these three countries, like me, were getting annoyed with what we were being told, saying, give us the evidence.

John Jerrim: [00:31:56] So I kind of buddied up with these people from Sweden, Ireland and Germany, who I met through my links with PISA 2015, to say, okay, will you guys share your field trial data with me so I can look at what happens in this random assignment, work out the results, see what the aggregate impact was. Now, the data for England isn't up here, although we did actually have the data for England as well. That's because I wasn't completely sure whether I was allowed to publish it or not, whereas I was for Sweden, Ireland and Germany. If you care that much, put in an FOI to the Department for Education; it's still there, I still have it somewhere. But again, what you're seeing up here is not too different from what you would see for England. What do you see up here? Well, what you see here is the results in terms of effect sizes. So, standard deviation differences between paper and computer test scores, with the negative figures indicating that kids tended to score lower on the computer assessment than they did on the paper based assessment. This doesn't include the interactive questions or anything like that. This is just when they've got the same questions on computer versus paper.

John Jerrim: [00:33:07] How big are these effect sizes? Well, it's about 0.15-ish of a standard deviation, with some variation around that. Quite big confidence intervals within each of the countries, reflecting the limited sample sizes. But when you aggregate the data, pool it across, you can see quite a clear overall pattern. Of course, ETS would have had this, and the OECD would have had this data, for all 58 countries. But we never got to see that. How big is 0.15 of a standard deviation? Well, PISA scores have a standard deviation of 100. So 0.15 of a standard deviation, on a crude conversion, would be 15 PISA test points. Tim was saying earlier about how we've gone up in PISA maths; we've gone up by about ten PISA test points, I think it is. So on that scale, a difference of 15 PISA test points is pretty big. For another piece of context behind this, you guys will have heard about the Education Endowment Foundation, I assume, the organisation that's been running loads of education experiments in England over the last ten years, looking at intervention effects. They've tested various things like tutoring programmes, giving kids cash to go to school in preparation for their GCSEs, different teaching methods, the best and brightest ideas brought together. The average effect size there is 0.06 of a standard deviation, with very few reaching 0.15 of a standard deviation.

John Jerrim: [00:34:32] So the effect sizes there are actually, you know, pretty big in those contexts. What were the other limitations around what was done? Well, there's no evidence about differences between groups. We've got limited sample size at the country level, reflected by these whacking great confidence intervals, the black lines running through the centres of the bars. So we don't know if the mode effect is bigger in, for instance, Sweden, Ireland, Germany or wherever. And we also don't have data for differences across key subgroups. We did do a little bit by gender and couldn't see anything, but again, the sample size in what we had was really too small. So we don't know things like ethnic differences, socioeconomic differences or whatever. All we know is that there was a pretty substantial and important mode effect, but not exactly who was affected by it and how. On top of that, there was no qualitative work alongside this. So if you were doing a proper randomised controlled trial, you would have what's known as a process evaluation alongside it, i.e. qualitative work digging under the skin of what's happening, to work out, well, if we're seeing an effect, why. None of that really happened. So I mentioned these logistical issues, computers crashing, but, you know, there are other explanations as well, such as kids not being able to go back and retake the test questions.
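
To illustrate the kind of subgroup analysis described here as missing, this is a small sketch in Python: with random assignment to modes, the mode effect can be computed separately within each group (for example, disadvantaged versus advantaged pupils), and the difference between the two gaps is the interaction of interest. The data, group labels and the assumed larger penalty for the disadvantaged group are all made up for illustration; a real analysis would also use survey weights and proper standard errors.

```python
# Sketch of a subgroup (interaction) check on the mode effect,
# using simulated data and an assumed larger penalty for one group.
import random
import statistics

random.seed(7)

def simulate(n, disadvantaged):
    """Simulated scores: an assumed baseline gap plus an assumed, larger
    mode penalty for the disadvantaged group (purely for illustration)."""
    base = 470 if disadvantaged else 510
    penalty = -20 if disadvantaged else -10
    rows = []
    for _ in range(n):
        mode = random.choice(["paper", "computer"])
        score = random.gauss(base + (penalty if mode == "computer" else 0), 100)
        rows.append((mode, score))
    return rows

def mode_gap(rows):
    """Computer-minus-paper difference in mean scores."""
    by_mode = {"paper": [], "computer": []}
    for mode, score in rows:
        by_mode[mode].append(score)
    return statistics.mean(by_mode["computer"]) - statistics.mean(by_mode["paper"])

gap_disadvantaged = mode_gap(simulate(5000, disadvantaged=True))
gap_advantaged    = mode_gap(simulate(5000, disadvantaged=False))
print(f"mode effect, disadvantaged: {gap_disadvantaged:.1f} points")
print(f"mode effect, advantaged:    {gap_advantaged:.1f} points")
print(f"interaction (difference in mode effects): {gap_disadvantaged - gap_advantaged:.1f} points")
```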

John Jerrim: [00:35:59] We don't know the mechanisms. We don't know what was driving this, because there was none of this kind of rich qualitative work happening alongside. We have hunches about what may have happened, but we really needed to know a lot more. And the final point I'll raise about this is one of external validity. How far can we generalise results from this PISA 2015 field trial? Not even as far as GCSEs going digital, but even through to the main PISA study that happened a year later. And I think this is a real issue in all work around mode effects and the mode effects literature, because it really depends what's driving it. We'd need to know more about this kind of qualitative evidence before we know much about external validity. For instance, just say software crashing and bugs and things like that were driving this mode effect that we saw. Well, you know, any good assessment company would then fix that before the main study, and so you would expect these differences to potentially disappear or be substantially reduced. But if it's things such as kids just finding it harder to solve problems on computer than on paper, or if it's to do with this exam technique issue, these effects would transfer over to the main study.

John Jerrim: [00:37:29] But we just don't know about the generalisability of these results from this field trial through to the main study. And obviously taking this evidence across to GCSE assessments would be one step even further, and it's going to be very context dependent, very software dependent, in my view. Okay. So the study had a problem, quite clearly, right? How did they go and try and fix this problem? Because they had to do it ad hoc, right. Basically, the intuition behind their approach is as follows, and I'll give a very simplistic overview of it. They basically treated this mode effects issue as a question level problem. So the idea behind this is that some questions on the test suffer from mode effects, other questions don't. We're going to use these questions that don't seem to suffer from mode effects as the basis for linking and maintaining the comparability of the test across paper and computer assessments. So these questions that don't suffer from mode effects have the property of strong measurement invariance, and we can use them to form the link between paper and computer assessments. That was the idea of their approach, in a nutshell. I have some problems, obviously, with this approach. There are some unstated assumptions behind it. Is this really a question level problem? I think probably not.

John Jerrim: [00:38:54] But it's hard to say without knowing more about the mechanisms. But underpinning it, there's an assumption that it is indeed a question level problem. A common mode effect was specified across groups: so countries, gender, socioeconomic status. We didn't really have evidence in favour of that, but it was an implicit assumption. Basically, there's one common mode effect for all. And how do we know, and can we actually identify, questions that do and don't suffer from mode effects a priori? So could we sit down with a test, without any data, and say yes, yes, yes, no, no, no? I'm not so sure. And I think that's where one of my big concerns with the approach that they used comes in. What I think should have happened is that they should have gone through all the PISA trend test questions beforehand and pre-specified very clearly, with a rationale, saying, we think that question is going to suffer from mode effects, we think that one is, that one is, that one is, but these ones we think will be fine, and we're going to use those as the basis for linking. And then they should have tested that using the field trial data, saying, okay, these are our hypotheses, do they hold? That's not what actually happened. What they did was very much a data driven approach, using data not only from the field trial, but also the main PISA 2015 data as well.
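
A rough sketch, in Python, of the pre-specified, item-level workflow being argued for here: write down in advance which questions you expect to be mode-affected, then use the randomly assigned field-trial data to check item-by-item differences between paper and computer, and keep only the items that pass as candidate linking items. The real PISA linkage was done with IRT models; this uses simple proportion-correct gaps, and all item names, predictions and thresholds are illustrative assumptions.

```python
# Sketch of a pre-registered, item-level invariance check (illustrative only).
import random
import statistics

random.seed(3)

# Step 1: a priori predictions, written down before seeing any data.
PREDICTED_MODE_AFFECTED = {"q3", "q7"}   # e.g. items needing long typed answers (assumption)
ITEMS = [f"q{i}" for i in range(1, 11)]

# Step 2: simulated field-trial responses from randomly assigned paper/computer groups.
def simulate_group(on_computer, n=600):
    data = {}
    for item in ITEMS:
        p = 0.60
        if on_computer and item in PREDICTED_MODE_AFFECTED:
            p -= 0.10                     # assumed penalty on these items only
        data[item] = [random.random() < p for _ in range(n)]
    return data

paper, computer = simulate_group(False), simulate_group(True)

# Step 3: flag items whose paper-computer gap exceeds a pre-set threshold,
# and keep the rest as candidate "invariant" linking items.
THRESHOLD = 0.05
linking_items, flagged = [], []
for item in ITEMS:
    gap = statistics.mean(paper[item]) - statistics.mean(computer[item])
    (flagged if abs(gap) > THRESHOLD else linking_items).append((item, round(gap, 3)))

print("flagged as mode-affected:", flagged)
print("candidate linking items:", linking_items)
print("prediction confirmed for:", sorted({i for i, _ in flagged} & PREDICTED_MODE_AFFECTED))
```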

John Jerrim: [00:40:32] So basically they got the data from the field trial, the data from all previous PISA cycles and the data from PISA 2015, and went through saying, oh, well, that one looks pretty similar across modes, so let's have that one as no mode effect; and that one has no mode effect; and that one has no mode effect; but without any rhyme or reason as to why you pick those specific questions, which I have quite a problem with. To the point that I remember, after the field trial was done, I went to them with my paper saying, you've got a problem here, haven't you? What are you doing about it? And they said something about these strongly invariant items, so I asked them, right, tell me now, which are your strongly invariant items that you're going to be using as the basis of this linkage, which are the questions you don't think will suffer from mode effects? And I remember getting the answer of, no, we don't want to tell you now, we can't tell you now, we want to wait until after the main study results are in. So they quite clearly wanted to look at the data before making those decisions. And to me, the cart ended up driving the horse. And eventually, when we found out what the no mode effect questions were, we ran our analysis in the field trial data and we got these results.

John Jerrim: [00:41:48] And of course, things seem to look better. It does seem to have helped solve the problem, but of course, they peeked at the data. So they've gone through this field trial dataset and knocked out the outliers, but that hasn't been based on theory or anything other than the data that they've seen. And, oh, you see things get a little bit better, still not zero, and you still see, you know, non-trivial differences, perhaps with a lot of uncertainty around them. But, you know, I think probably what they did ended up having some benefit. I don't think it ended up really resolving the issue. I also think this could be an explanation for why the OECD average in science, at least, declined in PISA 2015. So you could see in 2006, 2009 and 2012 it was around the 500 mark. In 2015, compared to 2012, it went down by about eight points, and then in 2018 it went down by a further four points. So, you know, from a fairly stable trend, we're starting to see a big downward trend after this change to computer assessment, which I don't think is genuine; I think it is partly linked to the changes that have been made to the test.

John Jerrim: [00:43:01] So I don't think they actually managed to overcome all of the issues. And so, you know, take PISA trends with a little bit of a grain of salt. That's the kind of message. All right. So what the hell do we do with this if our GCSEs are going digital, other than: let's take our time, okay? Let's think about this. But we've given ourselves a deadline of three years. Crap, we need to start now. This isn't trivial. We've got a major challenge. Let's not start now; let's make a time machine and go back two years and start two years previously. This is going to take a lot of time, a lot of thought, a lot of effort. We need to be getting the evidence base sorted and in place now, about things like mode effects and how we might deal with them if they do crop up in our assessments. So the time for evidence generation is now. The second thing that I would suggest strongly should happen now, to policymakers, is that the national reference test should be extended from 2023 onwards to do essentially what the PISA field trial did, with an extended sample where pupils are randomly given an electronic version of the national reference test. So we can look at this mode effect issue and implementation issues in our own context, in the context of GCSE, or something close to GCSE, examinations.

John Jerrim: [00:44:25] And frankly, until this happens, digital GCSEs ain't happening. If the government is serious about this, they do this now. They do this in 2023. They put their money where their mouth is. They sign the cheque, absolutely, right at this point, and beg Ofqual to, you know, go out and do it. Instead of 450 schools, we double it to 700 schools. We do actually go through the test and pre-specify each question and say, how likely do you think this is to suffer from mode effects? We're going to do the random assignment that's done in PISA, and we're going to add some questions to the survey they do at the end of the test to dig into these mode effect issues. And we're going to also do some qualitative research alongside this so we can drill into the mechanisms more, which we didn't have previously. And if this doesn't happen at least one year, perhaps two years, in advance of when GCSEs do go digital, it's a complete abject failure of government. If this doesn't happen… Is anyone on Twitter tweeting that? Feel free to. Thirdly, we're going to hold a national mock afternoon for year ten in June 2024, one year before we actually start doing some digital GCSEs, and we're going to run it right at the end of the GCSE examination period that's happening for year 11, hey, because it's a natural time to do so, and we're going to ask all schools and all year tens to take part.
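
As a back-of-the-envelope illustration of why an extended sample matters for this kind of paper-versus-computer arm, here is a short sketch of a standard two-sample power calculation: roughly how many pupils per arm would be needed to detect mode effects of various sizes. The 5% significance level, 80% power and the neglect of school-level clustering are all simplifying assumptions, not a recommendation for the actual NRT design.

```python
# Rough power sketch: pupils per arm needed to detect a given mode effect
# with a two-sided test at the 5% level and 80% power. A real design would
# inflate this for the clustering of pupils within schools (ignored here).
import math

def pupils_per_arm(effect_size_sd):
    """n per arm is roughly 2 * (z_crit + z_power)^2 / d^2 for a two-sample comparison."""
    z_crit, z_power = 1.96, 0.84   # 5% two-sided alpha, 80% power
    return math.ceil(2 * (z_crit + z_power) ** 2 / effect_size_sd ** 2)

for d in (0.05, 0.10, 0.15, 0.20):
    print(f"detectable effect {d:.2f} SD -> ~{pupils_per_arm(d)} pupils per arm")
```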

John Jerrim: [00:45:56] Why are we going to do that? Because we want to make sure every school and every pupil as far as we can has had some experience of navigating GCSE, electronic GCSE assessment before we do so, so we can work out where the problems are. Is this going to work or not? We're also going to randomly assign, which they do first, the paper or the computer, if we can get that to flow the schools because hey, that's going to give us more evidence kind of about these mode effects. This is hopefully going to flag up any logistical issues that there will be. It's going to help us understand if this is going to be a complete disaster and can back out one year in advance. Right. And if there's going to be an absolute media storm around this, I want a media storm around the mock. I don't want a media storm around the live GCSEs. It's also going to create a huge dataset and evidence base for kind of people like me to kind of play about with and also help schools maybe understand some of the benefits of digital assessments as well. We then get a rollout in 2025 and hit our deadline. God, the first bullet point here is a bit controversial, isn't it? Not English and maths.

John Jerrim: [00:47:06] They're too high stakes. Let's go for subjects that are slightly less high stakes. Maybe humanities: geography, history. Sorry, geography and history teachers; I really don't want to muck up a GCSE maths assessment. For some reason I'm willing to throw geography and history under the bus, and I'm not too sure why. It feels like you need a subject that's kind of important, but not maths and English, because they're too important. Oh, I'm going to get in so much trouble for saying it. Anyway, I've said it now. Nothing fancy. We're not going to deal with that. We're going to get some paper questions and shove them on screen, because adaption has to come later. I'm saying not A-levels; we're going to do this just for GCSEs. Why? Because everyone does GCSEs; at A-level, people specialise. If we're only going to do a few subjects, you end up starting to worry about comparability across subjects, right? So it ends up becoming even more tricky given high stakes decisions around university entry. We're going to do the whole country at once for these geography and history exams, not allow opt-in or opt-out, because of these potential mode effect issues. I might be willing to back down on this one depending on what my two giant datasets prior to this show. But I think this is one of the biggest questions that will come in implementation, about how we do it.

John Jerrim: [00:48:26] Do we allow opt in or do we force everyone to do it at the same time? I can see there being big pros and cons and Minister for Education sorry you’re going to have to accept there's still going to be problems. This isn't going to be absolutely perfect everywhere. There will be computer crashes. There will be some schools that get their data wiped or some pupils or whatever, and there will be some media fuss around it. You need to just be prepared for it and you're going to have to be brave about it because there will still be problems and the media will find those kind of people. What should we then do looking beyond 2025, slowly roll out to other subject areas and GCSEs and A-levels? Let's start thinking about introducing interactive assessment questions, field trialling them way in advance before actually introducing them to kind of get the benefits of digital assessment. We'll start introducing adaption later and removing the tiers. And I think this is a key point. We need to overcome these challenges first and the benefits are going to come down the line, right? We're not going to get the benefits instantaneously. We're going to get them, you know, five years, ten years into the future. We need to be in this for the long haul. A caveat to all these brilliant policy ideas that I'm going to give I just gave you.

John Jerrim: [00:49:48] The last time I made such a suggestion, it didn't work out too well for me. 19th of March 2020, the day after they said, let's cancel the exams. What should we do? John Jerrim came in: let's use predicted grades, centre assessed grades. They're the things that we should be doing. Yeah, this wasn't my best moment in academia. Perhaps not my best idea. No one's picked me up on this point, actually, until now. Yeah. No, I feel after giving such great advice, I should fall on my sword and say not all my ideas are right. Anyway, some final thoughts. Despite what's been a fairly negative talk in places, I think moving to high stakes digital assessment should happen. It is a logistical nightmare, there could be mode effects, potential major storms, but there are good reasons for a well-planned, gradual transition of these high stakes assessments to happen. And I do think in the medium term there are potentially some really exciting assessment opportunities. PISA made a very brave decision to be one of the first big high profile studies to do this. The OECD and the ETS got many things right, in my view. I do think they should have been more transparent, and I do think they should have done more to get people to understand the mode effect issue and what was done about it.

John Jerrim: [00:51:14] And, Captain Hindsight here, we should have known more about cross-group and cross-country variation, though we need to appreciate that was hard as well. So I think they did an okay job. I think they made a brave decision. I think they did a lot of good stuff. I think, when they came up against a problem, there are things they could have done better. And the final point would be that the government should be doing a lot of work on this now; it's going to take time to realise any benefits. I've said probably ten years; that's probably a bit too negative, but it's going to take a while. We've got to bite the bullet at some point, right? At some point it's going to get stupid, us doing it all paper and pencil, and there's a lot of momentum behind it now. It does feel like now is the time to start thinking and gradually making that transition, and therefore we need to get pupils and schools used to taking such digital assessments. At which point, that's my talk over. I think I've gone over time. Not entirely sure. Happy to take questions. So let's look.

Tim Oates: [00:52:24] Not significantly over time. Really, really rich and bang on the topic that we asked you to go into in considerable detail, which you did. I mean, by rich, I mean it dealt with all of the issues of item and instrument design, but then the management challenge of actually doing that kind of stuff when you're under pressure to deliver an international survey. You've then extrapolated from that to when you're under pressure to do high stakes exams in national settings. You dealt with some of the very practical issues of the skills of young people and their response to these kinds of tests. It was really elegant and very, very detailed and extremely insightful, John. So thank you very much indeed. I've got a couple of questions. Perhaps I'll kick off with those, and then people can gather questions in the audience here. How many have you got? Oh, plenty. Okay, that's great. Thank you. Couple of things. Gabriel Sagan looked at this because he was concerned about the mode effect and actually was presented with a bit of a wall of denial when he started to ask questions. But then maybe it's the way that Gabriel asked questions. That was a joke. He could be quite intense.

John Jerrim: [00:53:55] Are you watching, Gabriel?

Tim Oates: [00:53:56] Yeah.

Tim Oates: [00:53:58] I mean, he looked at data across various nations in terms of the degree of computer familiarity of young people. That seemed like a very sensible thing to do and links to some of the things that you did. So I think that's worth considering.

John Jerrim: [00:54:13] Yeah.

Tim Oates: [00:54:14] You mentioned the decline in science and, you know, made an Occam's razor point, which was, well, it really was the mode effect, because there aren't other plausible mechanisms. I think there's an interesting thing there. And the final thing I was thinking of, which I think you might want to consider, is that when you do try and focus on items which don't have the mode effects, they're probably quite restricted in their characteristics. And that can then have quite an impact on what you currently do, and that can be quite narrowing. We've seen that in the US over a long period of time. So those are three things that occurred to me as you were talking.

John Jerrim: [00:54:50] Yeah. And all excellent things to point out. On Gabriel hitting a wall of denial: well, maybe I ask questions in a similar way to Gabriel, because I got exactly the same kind of wall of denial. Like I said, when it was presented to us, there was never this kind of clearly laid out evidence, such as I hope this is, presented to countries and presented to people. You can't find something like this in the technical report, which has obviously bugged other countries as well, hence why we went out and did it ourselves. So I feel that there could have been more transparency around this issue. There was quite clearly a big issue with how it was handled, or whatever. So yeah, there was a wall of denial. There was a tendency to be like, oh, it's fine, it's fine, to sweep it under the carpet, when actually I don't think that's true. On the items, I agree. I think we just don't know, because part of the issue is we know now that there were supposedly strongly invariant, non-mode-affected questions within PISA, but you can't see the PISA questions. They're not public, right, for good reasons, because of comparability over time. So you can't really marry it up: well, I know that that variable, that question code, is strongly invariant, but I can't go and actually see what that question looks like, and does it make sense to me that that one is and that one isn't? It could well be that it ends up being very restrictive, if it even is a question level problem.

Tim Oates: [00:56:25] But as a researcher with permissions, you could do that without disclosing the items.

John Jerrim: [00:56:32] As a researcher, you could do that if you still have access.

Tim Oates: [00:56:35] And negotiated with.

John Jerrim: [00:56:37] The access to the PISA items, yes, you could do that. But I always find, when you're talking about test questions in a paper and you can't show people the questions, it always feels a bit naff. And so I always try and steer away from that. And I can't remember your second question, Tim.

Tim Oates: [00:56:57] Well, just the issue of the plausible mechanisms around declines.

John Jerrim: [00:57:02] So, yeah, I can't think of any other plausible mechanisms given you've seen such stability. I believe there's been stability in TIMSS on average over time as well. So yeah, I can't think of a plausible mechanism.

Tim Oates: [00:57:12] And a quick footnote: I mean, there are strong differential effects in this data across different national settings. Is there any need to explain those?

John Jerrim: [00:57:22] Tough, because if you did the significance test, they're not significantly different. It's so hard; you just don't know whether that's noise or not, right? And so you end up almost thinking there's a difference, making a difference, and ex post coming up with a rationale for why it's there. So yeah, there could be cross-country variation. I don't think we have enough evidence on it.

Tim Oates: [00:57:47] Okay. Thank you very much indeed. You've all had time both during the presentation and during my questions to think about questions. Jo, how did I know that it might be you with all the work that you've done in this area? Rosie, do you want to dive in with a microphone that is switched on? That's it. I like my incompetence.

Jo Williamson: [00:58:07] Thank you.

Tim Oates: [00:58:08] Do you want to say who you are, Jo?

Jo Williamson: [00:58:09] Yeah. I'm Jo Williamson. I'm in the educational measurement team in the research division, and yeah, this talk could not have come at a better time, so I'm very, very interested. I wondered if you could comment on something. I know that the TIMSS team did do some work pre-emptively screening items, saying we think these will be strongly invariant, these won't. And it seemed to me it sort of got the direction right, but they couldn't really call it. And I wondered if you could comment on whether they were on the right track and what you think their results showed?

John Jerrim: [00:58:40] I certainly think that they were on the right track doing that a priori and trying to make that call beforehand, and I think that should happen in all mode effects studies. I know a lot less about TIMSS, actually, and what they actually did in terms of the transition to the computer assessment, partly because I was so burnt out and out of energy after dealing with this that I couldn't get my head around all the nuances and details of what they're doing in TIMSS. TIMSS said they couldn't do a field trial, but they have now got their bridging study, right, where they've still got a random pool doing the paper assessment and a random pool doing the computer assessment, as far as I am aware. And actually I think that is sensible, and I think that's what the NRT should be doing over the next five years if we are going to make the transition to digital: have a random pool doing paper, have a random pool doing computer. So I think that is sensible and I think they are on a sensible track in doing so.

Tim Oates: [00:59:47] Okay. Thanks, Jo. Taking the questions coming in from the online audience, Jason is asking about the degree of societal push for the move towards online, digital assessment. And I think, in considering that, you could also pick up some of the things you said as you were presenting, because you did talk about handwriting.

John Jerrim: [01:00:12] Yes.

Tim Oates: [01:00:14] And legibility. But of course, although everybody says, well, kids will just be producing a typewritten script, many schools aren't teaching them to touch type, so they don't have the same facility with a keyboard as they do, or as they did in the past, with handwriting. So to repeat the question: where's the societal push? And perhaps think about the nature of that push and the composition of kids' performance.

John Jerrim: [01:00:43] Yeah, I mean, how much of an actual societal push there is, I'm not sure. I think most people in broader society are probably fairly ambivalent towards it. I think the push and the emphasis will get stronger and stronger over time as we use digital devices more and more in everyday life, as we have been, but I don't feel there's a huge push from broader society at the moment. People aren't jumping up and down saying we should be moving GCSEs digital. As you know, the conversation at the moment is, well, what should our examination system look like? Do we need GCSEs and A-levels? That's been the broader fixation recently, rather than this idea of it being digital; I don't think there's actually so much behind it. I think a lot of that drive is coming from a higher level, from policymakers, from politicians. Again, like I said, the kind of crazy idea that came out of the pandemic: we need a more robust system, and digital assessment will give us that more robust system. I don't think that's the reason we should be doing it, but I think there are other reasons we should be doing it on top.

Tim Oates: [01:01:58] I mean, if we look at the item types, it's quite clear that we need to rehabilitate multiple choice, because of the complexity and sophistication of constructs you can explore with multiple choice. But will it drive towards convergence on items which just require a poke or a swipe?

John Jerrim: [01:02:19] You'd hope not, right? And I think that's where the really interesting assessment possibilities come in, in terms of the longer term, the interactive questions. And I think if people understood that, that could be something that comes in. I think there would be a much broader range of interest if people understood some of the benefits of digital assessment and some of the possibilities, on top of the challenges: people understanding the possible benefits of adaptive testing and it potentially making for a fairer system or shorter examination time, and, like I said, the interactive questions testing different types of skills that we think are particularly important in certain subjects. I think you would get more people on board and thinking, actually, this is a good idea. But across wider society, even amongst the education community, those thoughts haven't occurred to a lot of people. I think within assessment land they're probably there, but in the wider education field, perhaps not so much.

Tim Oates: [01:03:24] Okay. Thank you. Other questions?

John Jerrim: [01:03:26] Oh, God, I've said something controversial.

Tim Oates: [01:03:28] And all simultaneous. I'm just going to take the nearest. Lee.

Lee Davis: [01:03:33] Thank you. So, Lee Davis, I'm director of Teaching and Learning within our Cambridge International stream. I'm interested in your plan, and thank you for outlining the trial plan; I thought that was really interesting. What will you change and what will you keep fixed in that? I'm interested in the test-taking environment, for example 200 kids in a hall, or in a classroom, or at home, and the input devices: screen, keyboard, mouse, mobile phone. Can you tell me a little bit about that?

John Jerrim: [01:04:01] God, now you're making me flesh out my half-baked, partly thought-through plan. I'm just going to leave that on there while I discuss this.

Tim Oates: [01:04:12] It was a brave slide.

Tim Oates: [01:04:13] Welcome.

John Jerrim: [01:04:14] But for the trial, for my big giant mock day, I'd be tempted to keep it in a big, giant hall. Give people laptops and have them work in literally the very same examination settings that you would always have, just with a laptop instead of an exam script. I don't know if that's the right answer or the wrong answer; it was an off-the-cuff answer that I've clearly only partly thought through, but I think that's what I would try. In fact, I'm going to make a better suggestion, because you've added to my plan: as part of the NRT they should be trialling different ways of doing it, testing it in different schools and different settings, and that could be worked into my mock day idea, where you might let schools do as they wish, record how they've done it, and see the evidence around how it would fly. Pretty good question.

John Jerrim: [01:05:22] Yeah.

Tim Oates: [01:05:22] Good. Okay. Well, we'll take some online questions in a moment. But Sarah first.

Sarah Hughes: [01:05:30] Thank you. I'm Sarah Hughes. I'm a researcher here in Cambridge, focussing on digital assessment. One of the potential benefits of digital assessment is that it can help us to match with effective, possibly digital, teaching and learning, I think. And I'm really interested in the idea that we start with what we call this lift and shift: just take what you've got, put it on screen, and then later on we can introduce maybe some interactive stuff, some questions that assess concepts or constructs we don't currently assess, and so on. My question is: is that first thing, the lift and shift, actually a preparation for the second thing? Are they even on the same trajectory, given what's happening, or might be happening, in teaching and learning?

John Jerrim: [01:06:22] I would see it as a first preparatory step, just lifting and shifting, like you said. I guess partly that response from me is tainted to some extent by this mode effect issue, where we're seeing big things even when we just lift and shift, let alone if we lift and shift and add in things that are interactive. So what I don't capture in this graph is the new interactive questions that were brought in. That's what has brought me to my view that it's lift and shift first, and then we bring the other stuff in later. So I think that has to happen, and it's for that reason of mode effects. It partly depends again upon, if we're taking this from a concept to actually rolling it out, how we are going to roll it out. If we do roll out and make it optional for some schools to opt in, it has to be lift and shift; you can't then bring in interactive questions. So I take your point. I think it does link nicely to the teaching and learning point, and how much that moves and stays digital as well is a bit of an unanswered or unknown at the moment, right?

Tim Oates: [01:07:35] Thanks so much. So online, from Samantha, a question which actually reflects the content of quite a few questions: doesn't all that we're saying here today add up to a substantial rethink about what examinations and qualifications are actually measuring? I mean, we have, at the same time, a lot of calls for rethinking the purpose of curricula in various parts of education. We've been commenting a lot on that, on the fact that some of the statements are not underpinned by evidence; some of them are, some of them aren't. But there is a lot of call in all this future-of-education-and-assessment stuff for a radical rethink about the purpose of assessment. And you raised different sorts of items tapping into different sorts of constructs. Where do you think we are with this? It goes beyond the topic for today, obviously, but what's your reflection on that?

John Jerrim: [01:08:30] Yeah, I mean, I think it does feed into that wider debate about rethinking assessment: how do we assess things, what do we assess, how do we measure it? And the stuff that I've seen with computers, or thinking about the potential, does show you how things in examinations can evolve, right? We can test different, more varied skills with a computer than we can with paper and pencil, so I think it does feed into that debate. I think the broader issue around exams is that we don't want to throw the baby out with the bathwater. So what we've been doing previously, yeah, we might rethink some changes. Do we want to keep with just the GCSE end-of-course examinations, for instance, or do we want some percentage of coursework in there, or some other things in there? I think that's a legitimate, interesting debate to be had. I think the worst thing that we could do post-pandemic is go for the kind of nuclear big shift that a lot of people are talking about, in terms of, well, let's get rid of examinations, blah, blah. We've seen what the counterfactual to that looks like through the pandemic, around teacher-assessed grades: loads of workload issues, various things like that. People want some stability, and the teacher workforce needs some stability, after what were, even pre-pandemic, quite big changes to the assessment system that happened to GCSEs, as you all know. And I think we need now, in this year and looking into the future, to get back to some normality, but an evolving normality, where we look more at things like digital assessment as well and think about how that feeds into measurement.

Tim Oates: [01:10:18] Can I just add something to that, because just in the hour we had before the presentation we were discussing with Daniel Morrish the issue of competence-based curricula around the world. And you said something very interesting, I think, which was that in the commitment to deliver that, assessing high-level synthesis skills and analysis and so on, where those have been deployed in a particular country, what you saw were very strong floor effects in terms of the measurement.

John Jerrim: [01:10:45] Yeah.

Tim Oates: [01:10:45] And you were advising, well, why don't we just throw in some TIMSS-style items in mathematics actually dealing with the four operations and number.

John Jerrim: [01:10:53] Yeah. And you can see assessments, particularly those kind of competence-based assessments, where an assessment goes bad, essentially: where you're trying to assess competencies by asking a really convoluted question that you look at and think, oh my God, what on earth are they trying to get me to answer here? I think that's where assessment goes bad and gets a bad rap and a bad name, potentially. If you're asking these kind of crazy competence-based things and you just can't work out what it's trying to get at, what is that actually testing? What skill is that actually testing? So yeah, I think that's the risk around that.

Tim Oates: [01:11:34] Eckhard Klieme did similar work; he has cast a similar eye on the PISA items, as you have. And what he found, in respect of mathematics in the first four sweeps, was that the best predictor of performance on problem solving in mathematics was a high level of abstract understanding of mathematics.

John Jerrim: [01:11:57] Yeah, and my own reflection, when I've seen the PISA items, again one of the advantages of being involved in PISA is that you do get to see all the questions and sit through the computer-based software, and then it crashes and you go back and restart it and do it again. Anyway, you do get to see all those various questions, and my view of them was actually quite split. Quite a few of them you see and think, actually, that's a fair enough question, that's a completely reasonable thing. There were a few where you're just like, oh God, this is a bit out there, a little bit left field. I felt most positive about mathematics, actually, where I felt the questions were generally pretty fair for the most part. The reading ones, on the other hand, I looked at and thought, these are slightly more abstract, these are slightly weird, for want of a better word, in places. So I think it's mixed. I think it differs; there are some that are fine and some that are more challenging, and I think it varies across the domains.

Tim Oates: [01:12:57] Let's go on for another five, 10 minutes. There was a sea of hands in the middle. Oh you’re nearest the microphone.

Jo Tisi: [01:13:04] Hi, Jo Tisi. I work in Cambridge International. I love how enthusiastic you are about your big mocks.

Jo Tisi: [01:13:17] Brilliant idea, but it sort of implies a single platform for it all to be delivered on. So are you suggesting a single government built and mandated system?

John Jerrim: [01:13:32] Buying time.

John Jerrim: [01:13:41] Yes.

John Jerrim: [01:13:45] I don't know, is the honest answer. I haven't thought that far through my slightly crazy plan. For the field trial purposes, not necessarily; I think you would actually not want that. I think you would want to see how various different systems, made by various different organisations, end up working out, and that's what I'd want to see trialled, at least at a big scale, because it would give you a bit more handle on generalisability. So I think you probably would want that variety, and the competitive pressure might be quite useful as well, if I'm thinking as a government minister, sorry. That's probably the route that I would go down. Well done, everyone, though, for picking holes in my off-the-cuff five-minute idea.

Tim Oates: [01:14:40] Sylvia Vitello is cheating a bit, actually, because she's remote, but actually she's part of the team here. So she could be here.

John Jerrim: [01:14:46] Actually might skip the queue there.

Tim Oates: [01:14:49] What about all the other things that an exam candidate does because they've got an exam paper in front of them: making, you know, marginal notes, thinking things through. Where do the mode effects actually cross over into the whole nature of the transaction, the activity?

John Jerrim: [01:15:10] Yeah, I completely agree. And I remember, stretching my mind back several years to when I had these conversations with various people, that there was some talk about kids just acting differently when they were solving maths questions. Did they have paper and pencil by them to jot stuff down and figure stuff out? Or did they have a calculator on the screen which they used instead? God, I can't quite remember. But they were definitely approaching things in different ways. And actually I can massively sympathise with this, because when I write my academic papers I don't start on a computer. I write them down with pen and paper first, do a brainstorm, a bit like a mind map, and then transfer it onto a computer. I'm massively old school. And there are all those kinds of things that feed into this. So I do think there are big things around test-taking behaviour more broadly, and it's why we need that qualitative research alongside, as I mentioned, because you need to understand all these various nuances about what's actually happening to really get into this big mode effect issue. In some respects that's a big overarching number on the screen; the devil's in the detail behind it, and I just don't think we know how much those other things are contributing to it, and for different groups.

Tim Oates: [01:16:39] I mean, there's a richness and breadth now to the experimental work that's going on in respect of interaction with the digital. There was a recent study on textbooks, looking at what kids were learning from a textbook, and what they did was have one group with the textbook and another group with the textbook and a mobile phone, switched off, alongside it. And the level of retention of material was very strongly different. Very, very interesting.

John Jerrim: [01:17:10] Very interesting.

Tim Oates: [01:17:14] Jackie. Sorry. Were you trying to get in at one point? We'll come back to you.

Jackie Greatorex: [01:17:20] Thank you. So I'm Jackie Greatorex and I'm from the research division here in Cambridge University Press and Assessment. To pick up on a theme, really: to what extent do you think policymakers and decision makers are thinking about what we would like to test, and what would be the best use of digital in that situation? So, for example, I think there are some things which are already videoed for the purpose of marking, like drama and PE and so on, and I guess ten years ago handheld computers were being used in D & T in some exams, I believe, and data was collected that way. So, yes, I just wondered to what extent you thought that was going on. I know I'm stretching you beyond your title, but thank you.

John Jerrim: [01:18:28] Yeah, I guess the honest answer to that is I just don't know. I don't know how that's being thought through at the moment. We've done quite a lot with Ofqual previously, and there will definitely be people in their research group who are thinking like that; how much that ends up filtering up through the rest of the system, I don't know. Assessment people will certainly be thinking about it, because it's assessment bread and butter, right, you think that way. Whether the people at the top are thinking about it, or more generally in education more broadly, for instance in the Department for Education, I'm not so sure, and I wouldn't be so convinced that they are. So I think people in assessment are; elsewhere, probably not so much. But that's a hunch rather than anything firm.

Jackie Greatorex: [01:19:19] Thank you.

Tim Oates: [01:19:20] Thanks very much, John. Sally, can we put the microphone forward for Sally and then we'll take the last question from online.

Sally Brown: [01:19:28] Sally Brown from the Assessment Network. I was interested in whether we could fast-track our knowledge by looking at other educational institutions, like higher education, who have been looking at digital assessment and putting digital assessment in place. I just wondered whether there are things we can learn from that setting.

John Jerrim: [01:19:49] Possibly. Again, I go back to the external validity point that I raised earlier. I feel the external validity point is always important in educational studies, but it's particularly important when we're thinking about mode effects, because it's so specific to the assessment, the software, whatever, that I think that jump might be too far to learn very much specific knowledge from the higher education sector. I know UCL are talking a lot about delivering online exams, I believe, next year, and I think let's see how that goes; let's see how universities go first. On the logistics, I think we can learn a lot, and partly, maybe, we can learn something more broadly. We'll see if it's a complete disaster, I think.

Tim Oates: [01:20:47] Yeah.

John Jerrim: [01:20:48] But yeah, how much we can take and bring over, I'm not sure.

Tim Oates: [01:20:51] Well, the key thing there, of course, is purpose. One of the things you emphasised right at the beginning of this was the importance of comparability in PISA, because of the nature of the inferences that you want to make from PISA. And that is a big issue in higher education at the moment, of course: the comparability of standards over time.

John Jerrim: [01:21:13] 39% gained a first last year.

Tim Oates: [01:21:15] Yes, indeed. And then back to your very brave slide about predicted grades. So, a final question, bearing in mind the time, from Sarah, which is probably an unfair question in its scope, but never mind, we'll go with it. We've done a lot of very detailed stuff on the management of trials and the nature of items; this is about funding for schools. There's a lot of drive towards the use of applications and digital assets in learning, and this is talking about really driving national assessment and in-school assessment through digital approaches, but it will have funding implications. And that's really the question that Sarah has asked.

John Jerrim: [01:22:05] What are the funding implications or do we need to devote the money to it? I mean.

Tim Oates: [01:22:09] I think it's probably both. I mean, she asked what the funding implications are for schools in relation to this, but it is wider; explore the question more widely as well, I think.

John Jerrim: [01:22:17] Yeah, well, it's quite clear that if we are going to have digital assessments, kids need to be well equipped in using digital technology and to have access to it. So there needs to be funding to make sure kids have their own digital devices, which happened to some extent because of the pandemic, when more funding was put into it. It's got implications for schools in terms of the quality of their IT hardware, making sure it's kept up to date, the software, the security issues and things like that. So there are big funding questions. How is that funded? I think the key thing there is how much the Department for Education, or whoever's driving this, wants it, because they are going to have to stump up money. I've stumped up a bunch of money already in my plan, and they're going to have to stump up a bunch of money for schools. Yes, there are potentially great assessment possibilities, and there are potential savings in various places, and that feeds into it, but it's going to take upfront investment and there's no getting away from that. And I have no idea what that figure is.

Tim Oates: [01:23:20] Okay. So propositions around major changes in the education system driven by government requirements usually need to go to Treasury. Is this one that should go to Treasury quite systematically, do you think?

John Jerrim: [01:23:37] The scale of the amounts of money that you're talking about probably would have to go to Treasury, right, and I think that's the conversation that would have to be had. And I guess you always end up thinking: what's the counterfactual if you're going to pay for this? What could be cut elsewhere from the education budget to make room for it if the Treasury says no? And would you make that kind of sacrifice for it? That's an open question, everyone's going to have a different opinion on it, and I'm not going to say what I would cut, so that I don't get in trouble. Let's leave it there. Cut me off.

Tim Oates: [01:24:16] And I thank you, and in that thanks I also want to draw a couple of things together. First of all, thank you, John; it's been great. What you brought today was engagement with the realities of running something really important, and reflection on its nature, how it operates, how it probably can and should be improved, and how policymakers within the OECD handled something that they put in place quite responsibly. And this is the point: it all adds up, I think, to assembling what is probably craft knowledge that you described into a set of quite systematic approaches to what is a major change in the education system. What I kept thinking was, hang on a minute, these things have massive implications for our practices as researchers and developers, and it stimulated a lot of thought about further discussion of what the implications of all of this are for how we do things, in terms of what we think we should be doing in assessment and how we manage that transition. I think it's been most enlightening. So this is a real thanks. Thank you very much indeed.

John Jerrim: [01:25:35] Thank you all for coming.

