Developing research-informed digital assessments

Video transcript

0:00 - Juliet:

Hello everyone, and a very warm welcome to the audience here with us in Cambridge and to everyone who's joining online. I'm Juliet Wilson. I'm the Director of Assessment and Customer Support at Cambridge International, but I'm here today as the sponsor of Cambridge's digital high stakes assessment program, and I'll tell you a little bit more about that in a minute.

As the sponsor, I'm responsible for supporting and helping the team, making sure that we have a good strategic direction for what we're trying to do and generally just being there for them. They've asked me to introduce today's event and give you a little bit of background to the digital high stakes assessment program.

Why have we got a digital high stakes assessment program when we already have digital assessments in English, vocational digital assessments and formative digital assessments? So what's different here?

Well, we're thinking here about general school qualifications: high stakes GCSEs, International GCSEs and A Levels, where there are so many more challenges in trying to digitize the assessments.

So we realized in Cambridge that we actually now have to really get a good program together, a multidisciplinary team of assessment experts, curriculum experts, teaching and learning experts and digital product experts to work together in a new way to really try and launch some high stakes digital assessments for Cambridge.

We obviously had to make some strategic choices about what we mean by digital high stakes assessment. There are many of them, but the ones I'd particularly highlight are that digital is not the be all and end all; it has to be the servant of the assessment. And the assessment can't stand alone: solutions only work when teaching and learning and assessment work together.

We're thinking about lifting and shifting, so, you know, taking a history A Level and putting it on screen. But that's not really going to be the groundbreaker. What we also have to do is think about where digital will really help us to assess things that we can't currently assess, and the team will be telling you a little bit more about these things.

So I'm really pleased that we've got a lovely lineup of speakers all waiting there to tell you about their research projects. Sarah Hughes is going to report on the evidence base we're using to inform assessment design and evaluate the quality of the assessments.

Then Ed Sutton, our digital product owner, will describe the approach we've taken to developing digital assessments and how research has informed that decision making. And obviously that's a different way of developing assessments from the one we've used traditionally.

Sylvia Vitello's work helped us better understand constructs in historical research so that we can make some real evidence-based decisions about our historical research assessment, and she'll tell you more about that.

Vicki Crisp is going to describe how we have adapted the Cambridge approach to validity to build validity into our digital assessment thinking right from the start.

And finally, Martin Johnson will describe how we're applying previous Cambridge research on computer-based testing washback to evaluate what impact digital assessment is having in the classroom.

So it's going to be a very interesting session, with our speakers telling you about their different research projects and how we've used those to really inform some groundbreaking research and assessment design. So over to Sarah.

4:26 - Sarah:

Thank you, Juliet, and welcome everyone in the room and everyone online. Great to have you here.

So a quick outline of the session today, about 10 minutes from me and 20 minutes each for everything else you see there.

And I'm also planning, I think after research example 2, a little five-minute comfort break, so you can all take a moment to breathe and do what you need to do. So yes, after I've introduced the program with the support of Ed, Sylvia will be talking about the constructs, Vicki about validity and Martin about washback, and then we'll be discussing what the implications have been, with hopefully plenty of time for your questions and to get into some discussion.

So, the Cambridge digital assessment program. Often you'll hear us referring to this as the high stakes program because, although at the moment lots of our assessments are formative or low stakes, the intention is that we're building up to scale this to live high stakes general qualifications.

So why are we doing this, and what have our drivers been? Well, Juliet introduced some of the ideas around this, but I just want to talk about some more general drivers before I get into the specifics about our intentions. There's this kind of inevitability about moving towards digital assessment, which we've been talking about, and when I say we, I mean the education industry, for maybe 30 years now.

I know in the late 90s, Randy Bennett at ETS in the US was talking about Generation R assessments. What he meant by that was an assessment which is digital, with the teaching and learning also being digital, and the learner's experience being that they're not really aware of when they're being assessed and when they're learning; it's such an immersive, joined-up experience. We haven't got to that stage yet, even almost 30 years on.

But you know, there's a vision to work towards. Our key customers and our key markets have a massive appetite for digital assessment, and that's another driver for us. And of course we're in a position where we want to defend our market and to be at the forefront of this, to bring our markets the digital assessments that they have an appetite for.

But more specifically, we have the opportunity to match effective teaching and learning through the digital route. We can exploit learners' digital literacy and make use of it in our assessments, and we can assess constructs that are authentic to the skills, knowledge, understanding, mindsets and behaviors that learners will need in higher education or in employment.

So those are our key drivers. We're taking two approaches. The first is to migrate our existing assessments. We've got a massive archive; we're backed by good quality assessments that have been quality assured and validated over the last 150 years.

So we're often asked why we don't just shift those onto screen. There are certainly benefits in doing that, and I would warn of some potential risks and missed opportunities, but there's an appetite for that and we are doing it with our current curricula and with existing technology.

But we're also taking the second approach, which personally I find much more exciting and really takes the opportunities that digital offers us.

This is what we call born digital, that is, transforming the assessment by using technology, so we can develop from scratch: talking to our teachers and our learners about the problems they are addressing, what they are coming up against, and how we can develop something that solves those problems for them and with them.

So, some examples of our migrated assessments, and we can see these are generally formative. These lift and shift our paper assessments to screen.

The digital mocks service has been trialled in the last year, and nearly one and a half thousand tests were taken earlier this year in three of those mock subjects. We're about to embark on a pilot in January of these four subjects, with capacity for about 10,000 tests in about 100 schools.

So we are starting that in January. If you're in a school, or you're interested in supporting us and shaping what those assessments are like, then do get in touch; there will be contact details at the end about how you can get involved in those pilots.

So, just some examples of what those mocks look like. This is our English as a First Language, which looks quite similar to the AS History in that there's quite a lot of stimulus material on the screen and an extended response.

As another example, English as a Second Language. The items are much more structured. This is a listening item, so learners have control over when they play and listen to the audio file.

And we're also looking at GCSE Computer Science for our UK schools; just a glimpse of what those look like.

So what about these born digital assessments then, where we're really taking the opportunity to mediate what's happening with technology? Well, we're working in three subjects at the moment: computer science, historical research and data literacy. Our hope is that next in line will be the tricky subject of mathematics, which we're very much looking forward to getting our teeth into.

Um, and for these approaches we use very particular working practices which Ed is going to describe.

10:43 - Ed:

Good afternoon everybody. Yep, so our born digital assessments are being developed using agile working practices. And as Juliet said, we're working in multidisciplinary teams. We're engaging with our customers early and often. 

We're taking an iterative approach, including incorporating the research findings as they become available. And we're producing little things often to show our customers and seeking their feedback.

In practice, that means breaking our work into two-week-long sprints, constantly reviewing the progress we're making and replanning after each and every sprint. And we're using design thinking methodologies. Design thinking is what's known as an outside-in design methodology. It helps organizations see their products and services from a user perspective, and this approach enables us to balance the needs of users with the needs of our business and aims to create value-rich products and services.

And as you can see from this diagram, it's a nonlinear, iterative process that teams are using to understand our users, to challenge our assumptions, to redefine problems and create innovative solutions, and to prototype and test with our users before building complicated technology solutions.

Essentially, we start by identifying problems. What do our assessments look like from a user's perspective, and what problems do users have with them? That forces us to explore the underlying assumptions behind what we're assessing and why. We then engage early with our customers to validate that problem, to gain the evidence we need to be confident it's a problem that customers want us to address.

Then we can develop solutions and validate these, meaning both understanding the customer appetite and market for them, but also addressing the questions which my colleagues from our research teams are going to talk about, around validity and other things that we need to validate as part of our product concepts. So I'm going to hand back to Sarah briefly.

13:00 - Sarah:

Thank you, Ed. So let's get our teeth into this evidence base then that we're here to talk about. So of course there's a number of different types of evidence all inputting to our design decisions. What we're focusing on today is that academic research area, although of course there's overlap with other things, particularly I'd say maybe the UX research and the customer research.

But the focus really, and my responsibility, is the academic research evidence that inputs into our products for our product colleagues. So I see the academic research as having two purposes: one to inform design decisions, and the other to build in quality from the start and to evaluate it. Of course these aren't entirely separate, and one piece of research or one data set can fulfil both of those purposes.

The next four slides are basically lists, lots of lists, of research that we're doing, but I'm trying to give you an idea of the quantity of work that we've been doing in order to support our colleagues. The first slide is about the very beginning of one of those born digital developments, using the methodologies that Ed described, when questions often arise from our product teams: what is this thing we're trying to assess, or what do we have the opportunity to assess here?

So in all three subjects that we're working on at the moment, we've been answering questions. These are the questions that were raised by our product colleagues that research has gone in and started to answer or has answered. 

Historical research: Sylvia is going to talk in detail about that, so I won't. In computer science, questions came up from talking to higher education, for example: people keep talking about the need for creativity, collaboration, communication and resilience as a programmer.

What does that mean? What does it mean to be good at these things or not so good at these things? What are the skills within those? Are they assessable? Who's assessing them, and how? So these are questions we answered up front to support our colleagues.

Um, and data literacy is a bit behind the other two subjects. The team hasn't been together that long, but again, similar questions are being answered by our research teams to support decision making there.

And then there's some wider questions that apply across subjects, for example around comparability.

There's work around item type and what that does to demand. That particular question, the second in the list there, arose because some colleagues were thinking: could we automark this test? In order to automark really reliably, we might need to change the item type. So what does that do to the demand? What would that do to the standard?

There's work around what effect translating to screen has on the constructs. Are we introducing construct-irrelevant variance? Are there things that we don't intend to assess that we end up assessing? Important questions to answer.

And then there's a wider set of questions which aren't specific to a product or, at the moment, to a specific assessment, but which we think we need to address and start to think about: for example, general barriers and benefits. Why hasn't digital assessment been taken up despite all this talk about it and all this potential opportunity?

I'm not going to talk through all of these, but just to draw your attention to the second one, which is a really interesting piece of work, I think, that's underway, asking: what are the affordances of digital teaching and learning? What is that? What does it look like? How does it benefit learners and teachers?

Similarly, what are the affordances of digital assessment? Because I can't help thinking that those two aren't always quite as aligned as we imagine or would like. So hopefully that piece of work, which is a literature review underway, will help us to understand that much better.

So the evidence that we are collecting in order to build in and evaluate quality sits in these different areas. Today we're going to go into detail, with Vicki talking about validity and Martin talking about washback, but I'm happy to take any questions on those other aspects as well because they all work together. You'll notice fairness is not on that list, but nonetheless I see fairness as a kind of linking and bringing together of these things.

OK, that's me. I'm going to hand over to Sylvia now to talk about the historical research constructs. Thanks.

17:30 - Sylvia:

I'm going to talk to you about this program of work that looked at, really, essentially what Sarah was saying: what things should we be assessing in the subject of history? That's essentially what this question is asking.

My name is up there, but I want to say that I'm going to be drawing on work that both Jo Ireland and Emma Walland did, two other colleagues in the research division, so I'll probably be mentioning their names at various points during the presentation. So thank you to them.

So yes, as Sarah pointed out, there are these three questions that the team had come to us with in terms of the subject of history and research skills. They had been speaking to teachers and higher education, and they had identified that there is some kind of shortcoming or weakness, if you like, in students' research skills in history by the time they leave school at the age of 18. The idea was to assess that better and more authentically, and could digital assessments enable us to do that?

The team were primarily interested in the 16 to 18 age range, so the A Level space. What I'm not going to do is go through these questions one by one. Instead, what I'm going to do is draw on the research that we did

for these different questions and what it tells us about the constructs in history. That's how I'm going to frame this presentation for you. And, as Ed was saying, this was a really different approach to research: we were working with the team for a couple of months, but fitting in with the two-week sprints, so working for a couple of days within each set of two weeks.

We were trying to bring together, really quickly, pieces of evidence from the existing literature: from academic research, from looking at qualifications and existing assessments, and from looking more deeply into guidance documents about teaching practices going on in history, with loads and loads of thinking, all within this really short space, so that we could go to the team every two weeks and have conversations with them about what the literature is telling us, what the evidence is telling us, and let's talk about it.

At the same time, the product team were also going away and having their own conversations, talking to teachers, and the idea was to come together and have a conversation to see how it all fitted together. We were providing, as Sarah said, the academic side of it. So these are the sorts of things that we focused on: really a desk-based rapid review of existing literature and information.

And so the first big question was: what are historical research skills? Really, that was the foundation of a lot of the work that I was involved in with the team. We looked back at the key academic literature, at the key journal articles that we would often look at for educational research. We looked at programs of study; Jo did a bit of this to start off with. She would work with the product team to identify a set of qualifications or assessments that related to the kinds of assessment that the team wanted to develop.

These are post-16 history qualifications, and they got a list of ten programs of study, or qualifications, or courses if you like, depending on how you refer to them. They either had a substantial research project element, like an extended project qualification, or were specifically history qualifications, and they come from different countries. Essentially, what we wanted to do was find out what kinds of skills they were talking about in terms of research skills. What was the literature saying, the academic and applied research literature, in terms of research skills in the history domain?

And what were these programs of study also mentioning in terms of research skills? The idea was to extract all that information out. When we looked at the academic literature, there's not that much on history specifically when it comes to research skills. So we extended our net and looked at models or papers that had looked at research skills in other fields, like science, or that had brought together evidence from a whole range of different fields where research would be used, maths being one of them as well.

Right. So I just want to show you this, but there's no need to spend time reading it because you're going to see this list throughout my presentation. These are the 17 research skills. I'm using the word skills, as well as the word construct, in a very, very broad sense.

Each one is basically some kind of activity that students doing research would have to undertake. It was up to the later stages of product development to really pin down exactly what they meant by the construct, and that wasn't my job. Our job was to think about what we want our students to be doing as part of a research project and what skills that involves. So yes, for those of you who like specific terminology, I know skills is of course very broad; it's about what they have to do. These will probably look very familiar: the research literature, the academic literature, and looking at the programs of study enabled us to identify these 17.

So: identifying a problem; doing some background research, reading or exploring research to inform the research question; formulating the research question itself; making a plan; research ethics; gathering and selecting information; organizing this material; hypothesis generation. Numbers 10 through to 13 are all to do with analyzing and evaluating data.

In brackets there we've put things that are specific to history. So the word sources comes up because that was relevant to the history domain. But really you can see this relates to quite a lot of different research areas; what we found might actually apply to other subjects as well down the line.

Then: drawing conclusions; evaluating the research itself; developing the researcher's, or the student's, own personal point of view; communicating findings; and then references and citations.

So where did these 17 activities come from? Well, the one on the right there is a paper by Stokking et al (2004). They looked at the Dutch examination requirements across a whole load of subjects in secondary education and essentially identified these 10 steps, as they called them; we called them skills. We thought this was a really good way of forming our base.

I've color-coded it: there's pretty much a one-to-one relation with all of the Stokking ones, and then we filled in the gaps with things that we thought also needed to be included, shown in white: background reading, research ethics and evaluating sources. They don't use the word evaluate, but we thought it was important to include it in our seventeen.

So that's where we went with it. It's always good to focus on one framework and then work around it or build onto it, and this makes it really transparent where it came from.

And again, in terms of transparency and creating a kind of research audit trail for all of us involved, including Ed and the team, this is essentially what we tried to do to record our research evidence. We have this table which lists our 17 skills, and then we made a note of which of the post-16 qualifications, the ten programs of study that we looked at, referred to or mentioned these skills in some way, and which key literature had made us think we needed to include these skills in our framework.

This is an incomplete table, intentionally incomplete with gaps, because it's just to show how we were creating our research audit trail, so that later down the line, when we have conversations with Ed and the team, we could go back to it and say, well, this is where it came from. I think sometimes we forget to make a note of our research evidence and where it comes from. For a really research-informed process, you need always to be able to go back to where that research was, so I wanted to take a step back and say this is the way we try to keep track of that evidence base.

But really the question was about assessment; it wasn't just about skills, it was about whether we can assess them. So again, another table trying to record our process very transparently for everyone. If you remember, I said that when we were identifying the 17 skills, we looked at both the programs of study and the academic literature. So it could have been the case that we'd identified research skills that had never been assessed in any of the programs of study that we looked at. What we needed to do was really understand how they were being assessed. Which ones were specifically being assessed in the ten programs of study that we looked at, and with what method?

So again, we've got this table. I'm only showing you two of the skills, the first two: formulating a problem and background reading. What Emma and I did in this instance was go into the specifications, the qualification assessment criteria, and just copy and paste. We literally extracted the statements that Emma and I thought related to these different skills. That meant that when we handed over to the team, the team could look at them: did they agree, did they not agree? A very transparent way of looking at research. And we color-coded it to see where statements came from and whether they came up again in a different qualification, so lots of different techniques that we used to try to synthesize the evidence.

But what I'm going to show you is this graph. You're seeing all 17 there, and then how many of the ten programs that we looked at mentioned or referred to each of these skills, either explicitly or implicitly in some way. What you're seeing is that for skills from 10 down to about 14, and even 16, so analyzing data, evaluating the sources, drawing conclusions, communicating findings, pretty much all of the post-16 qualifications that we looked at assessed them. These are really core research skills that were being assessed in some way or another within all ten of the programs of study that we looked at.

There were far fewer for the top ones: formulating a problem, research ethics, even doing background reading. Now, this isn't to say that they're not part of the process; it's saying that there wasn't an explicit, or even really an implicit, statement in these programs that we were looking at. So perhaps they had less priority when it came to the assessment process. That was again just information for the team to reflect on in terms of what they wanted to do with the product they wanted to develop. So these are all post-16. The next slide looks at what happens if you look lower down in the education system.

So again, similar color coding. The green or blue one is the post-16 data that I showed you, exactly the same numbers. The next one, dark purple I guess, is the courses that we looked at in the 14 to 16 age range, the next the 11 to 14 age range, and then primary, so end-of-primary, age 11, assessments or programs of study.

And you can see, quite interestingly, there's a lot of similarity. There aren't skills that are being assessed only at upper secondary school; they are being assessed even at primary school. Interestingly, there's just lots of it, which means that in some specifications or curricula there would be some students who had been exposed to, or had experience of, these research skills really early on in their education system. So there was a lot of coherence in some: for example, Cambridge International Perspectives showed a lot of coherence when it comes to research skills and targeting a lot of these ones; Hong Kong, a lot of coherence; and the extended project qualification, the EPQ, we saw a lot of coherence there.

England's system, not so much, the GCSE particularly. At primary school in the National Curriculum, at key stage 2 and key stage 3, and at A Level, they touched on quite a lot of these things, including getting students, learners, children, to think about formulating a research question in history; you can see evidence of that in the curriculum statements. GCSE history is a bit of a gap there: there is a lot of focus on analyzing data, analyzing sources and coming up with conclusions, but not much on the earlier phases of research. That was an interesting one to think about with the team.

But really, knowing what these 17 are, and even how we've identified them, isn't enough for developing your assessments. You need to know what the relationships between these skills are.

Are all the skills separable? That's important because it's going to affect how we assess them. And what factors may affect these skills, both in terms of how students learn and develop these research skills, and also how they can demonstrate them in the context of an assessment? Even if they have all the skills, is there something about the assessment that might prevent them from actually being able to demonstrate them?

And this is where all these conversations need to take place, just to make sure that we can authentically assess students' skills. So yes, a lot more than just what the 17 are, but also all these interactions.

So Emma and I had to think about this. We went back to the literature, looked at the assessments and programs of study, and what we wanted to do first was to see if we could chunk these 17 into broader phases. Is there a relationship there? And so we came up with this, looking at other frameworks, and there are loads out there in terms of categorizing the research process.

So those are all the 17, and essentially each appears in only one of these phases, apart from one. We thought: coming up with questions, and then the research process: you have to plan it, collect the data, inspect it, synthesize findings, evaluate the research, present the findings.

So we've chunked them into groups that you might expect to occur in specific research phases during a typical-ish research process. When we looked at them, they have a broadly temporal order, so there are dependencies in here. You have to inspect the data before you can draw conclusions about it. You have to have a question before you can figure out which sources you're going to look for to analyze. There are dependencies, and you can see that in the curricula. In primary school curricula, students might have difficulty coming up with a question or hypothesis, so that's the teacher's role; in some of them, the teachers have inputted into those phases. If you understand the dependencies, you know who's going to fill those in within a research process and how you might structure an assessment to make sure that it's authentic and we've got all those pieces of the puzzle to fit a research process.

But as much as there are arrows going down them, it's really not a linear process. I started putting arrows in between these boxes and it was a mess, so I stopped where the arrows are, because doing research is such an iterative process. That's in the literature as well, if you look at how historians do research: they also go through quite an iterative process when they're analyzing sources. Robin Conway looked at how sources are analyzed by historians as part of a literature review he did for his PhD, citing other research, and you've got these different experts doing it in very different ways. One expert who's looking at sources is, at the stage of actually interrogating those sources, asking many more questions and generating more hypotheses.

So again, you're right back to coming up with questions, coming up with hypotheses; it's a nonlinear process going on here and it's very different depending on which historian you ask. There's not a single process through, so it's actually quite a complicated activity to then design an assessment process, because there's no standardization in how the process works.

And of course the underlying cognitive processes in each stage overlap, so questioning and critical thinking appear in many of these stages. They're not really distinct in that sense.

So that's what we felt in terms of the theoretical framework of how the research process might work in real life and how it might relate to history. But how does this align with what teachers do in practice when they're teaching research skills? Does the relationship that they see between skills match our framework? When we looked at teaching practices and the guidance given to teachers on teaching research skills or the research process, they seem to structure the lessons quite linearly: they might have a lesson on coming up with a question, another lesson on hypotheses, and then another lesson on what to do with sources. So they might do it in quite a linear way, actually. But when you delve into it, you realize that that's not what's going on underneath the surface when the students get to the lesson about sources. In Valerie Thaller's work looking at her own practice, teaching history skills in a university context, at the stage where students are interrogating their sources, they realize that the research question they had formulated isn't specific enough, or is too vague. So then they have to go back and reformulate their question, or reformulate their whole research project, really. So yes, there are iterative phases, repeating and going back to these different skills at various stages of teaching, in ways that are sometimes just implicit.

But some other work says that actually, even though there's loads of interconnectedness between these skills, structuring the teaching of research in a linear way can be really useful sometimes, because it can prevent students from being overwhelmed by the fact that these are so interconnected in real life. There's a literature review by Jo and Melissa from the research division looking at frameworks for teaching curricula that involve complex skills, and there's some suggestion in there that perhaps it can help in the beginning stages to break it down for students.

But as I say, they're all interconnected; can we separate them out? Can Ed's team go and say, OK, we're going to create an assessment and we want to assess sources, for example? Is that possible, and is it sensible? Again, if you look at the programs of study and how teaching is structured, it does happen; they are separated out in many instances. GCSE and A Level history, for example, separately assess source analysis. But there is a strong rumbling in the literature saying that the research process is really holistic and that that is how it should be taught and assessed.

There was a whole rumbling in there, and lots of evidence recommending teaching research skills as part of a project or an inquiry-based process. One of the reasons, and there are lots of reasons in terms of risks that come out of that, is the argument that testing specific skills restricts learning; it restricts students' understanding of what that stage is meant to do as part of a research process. So, for example, Robin Conway again said that students may learn to analyze sources in a very formulaic way if exam questions focus just on sources. He mentioned something about the fact that students focus on bias: if a source is very biased, they just disregard it, because they've been told bias is really bad. But actually, if you talk to historians, bias tells you so much information about a source, because they're using that information with regard to a question or a hypothesis. It's not just, shall I use it or not.

And I guess this isn't to say that you can't assess skills separately, or that we shouldn't; it's about making us aware of how we should be assessing them, and perhaps framing it in a way that removes some of these, I guess, even washback effects in some of these instances.

And so finally, I said that it's important to know what might affect these skills, both in terms of their development and how students demonstrate them within an exam or assessment context.

And I think it's important, or nice, to think of research skills as complex cognitive skills. That's essentially what we're talking about when we talk about research, and Stokking, I've put the reference there, has a nice description of why we should think about research skills as a complex activity. When you delve down into it, it's because in an assessment context we're asking students to do some kind of research activity, but that research activity can be affected by a variety of different factors. In the literature, for instance, there are lots of teachers in a history journal called Teaching History who say that students' demonstration of research skills is limited by how much subject knowledge they have in history. They might struggle to analyze a particular source if they don't know very much about the historical context surrounding it. So understanding the relationship between subject knowledge and research skills is really important for assessments.

As I said, there are these dependencies between research phases, which means that students can get stuck at different points in the research process if you're trying to assess it holistically. For example, students might not be able to find a good research topic or question, and teachers have found that this then hinders which sources they go away and try to collect as part of their project. Another one is that teachers have said that students just can't find good quality sources, so then they're limited in how well they can demonstrate that they can analyze a source, because the source itself isn't very good quality. So there are lots of examples of ways that teachers have constructed, for example, constrained web pages with a select number of sources that all students have access to, and it's then up to the students to select from them. But they start in the same position: they're all going to this particular page that has all the sources. That creates quite a level playing field in terms of assessing them in a fairer way.

And this is just because there are loads of interactions. The task features affect how well students can demonstrate their research skills; so do the student characteristics, their subject knowledge being one; and the teacher and the school environment they're in can affect the extent to which they can both develop and demonstrate them. So if I had all of this on paper to hand over, it would probably be something like this. Here you go, Ed: there's lots of information for you to take away. Really, it was about understanding what kinds of skills we're talking about, but it's not for us to say which ones should be assessed.

The team, on an ongoing basis, will have all these conversations about what they want to assess and how, hopefully taking some of this into account. So I'll leave that there. And here are the references.

40:20 - Vicki:

OK, so I was asked to get involved in working with the teams in relation to validity. I'm going to start with a bit of definition, because with validity it's good to start with that, and then you'll know where I'm coming from. As you'll know, traditional definitions of validity talk about validity being about making sure that we are assessing the things that we want to assess.

So we have this quote here from Kelley: the problem of validity is that of whether a test really measures what it purports to measure. But more contemporary definitions, and Samuel Messick has been really key here, take validity further and define it in a broader way. What Messick is saying here is that validity is about how appropriate it is to make certain inferences from our assessment results. So if we have a hypothetical learner who gets a grade A in whatever their qualification is, what is it appropriate for us to infer from that? Can we infer that they've got a good level of knowledge, understanding and skills in that area? Can we infer that they will do well in some kind of related future course or career? Can we infer anything about the teacher, for example?

So it's not about the assessment itself; well, it includes the assessment itself, but it goes much broader than that. One of the implications is that validity is not a property of the test, not a property of the assessment. We shouldn't be making statements like "this test has high validity", because the validity instead resides in the scores and the purposes for which we use them, so we can make statements like "the results from this assessment have high validity for certain purposes". So there's just that little distinction. But although these contemporary definitions of validity are much broader, they still contain that traditional definition. If we can't be sure that our assessments are assessing the constructs that we wanted to assess, then it's very difficult for us to make any of the kinds of inferences that we're interested in making.

So, as I think someone mentioned earlier, we've done quite a bit of work on this in the past, looking at ways to validate our existing qualifications. Stuart Shaw and I developed a framework for conducting post-hoc validation studies, drawing on the work of the American validity theorist Michael Kane, and we developed a set of methods for gathering relevant evidence to go alongside that, to allow us to evaluate the validity of some assessments for whatever the purposes of those assessments are.

This was in the context of international A Level initially, and then IGCSE. The framework was designed with the purposes of those assessments in mind, the ways that we use the results from those assessments, and making sure we thought it would be appropriate to make the inferences that we want to make from them.

But note that qualifications such as the international A Levels are, of course, very well established. These are existing assessments; they've been running for years. So we were conducting an evaluation of something pre-existing, and what was good about that was that we could then feed into reviews of those syllabuses. But it's quite a different context to developing a very new assessment.

So this is the framework that we developed, as I said, drawing on Michael Kane's work, and we have this sequence of inferences that one works through; we want to be able to infer one thing from the next. We start with construct representation: will the tasks that we are giving the learners elicit the kinds of performances from them that allow us to see their knowledge, understanding and skills in the areas of interest? Then, can we score or mark or grade that such that those scores, marks and grades reflect the constructs we're interested in? Then, can we generalize from that? Do the constructs that we sampled in the particular tests or assessments that those learners happened to take represent the fuller whole of all the possible tasks that we could have given those learners within that syllabus?

And can we extrapolate further from that about the learner's abilities beyond that syllabus, to the subject more widely, to other subjects? And then, is that such that we can make the kinds of decisions that we want to make based on it? You'll see that for each of these we have a validation question that sits alongside and acts as a research question for gathering evidence in relation to validity.

And these are some of the methods that we were using for these post-hoc studies: for example, conducting analysis of item-level data to look at how questions were functioning, different kinds of reliability analysis, asking subject experts to conduct ratings to evaluate what was being assessed, and so on.
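[Editor's illustration] To give a flavour of the item-level and reliability analyses mentioned above, here is a minimal sketch. This is not Cambridge's actual analysis code; the data, the function names and the choice of Cronbach's alpha as the reliability statistic are illustrative assumptions.

# Minimal sketch (illustrative only): item-level statistics of the kind
# described above, computed from a matrix of item scores where each row
# is a candidate and each column is an item.
import numpy as np

def item_facility(scores: np.ndarray, max_marks: np.ndarray) -> np.ndarray:
    """Mean mark on each item as a proportion of its maximum mark."""
    return scores.mean(axis=0) / max_marks

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability estimate across items."""
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example with made-up data: 5 candidates, 4 items.
scores = np.array([
    [2, 1, 3, 0],
    [3, 2, 4, 1],
    [1, 1, 2, 0],
    [3, 2, 5, 2],
    [2, 2, 4, 1],
], dtype=float)
max_marks = np.array([3, 2, 5, 2], dtype=float)

print("Facility per item:", item_facility(scores, max_marks))
print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))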

So with the background of that work, what could we do for the teams who are working on digital assessments? We could wait until they've created something and it's out there being used by learners and then conduct post-hoc validation work. Or would it be more useful to try to support the process along the way, to conduct some kind of validity by design and build it in? There is of course a lot of really good practice in normal development processes that relates to validity, but perhaps some of that is sometimes a little bit implicit or indirect, and here we were hoping to be much more explicit, to really bring that focus on validity at various points as the process goes on.

One of the next questions for me was: OK, can we draw on the framework that we already have, or use the themes from it, even if we don't need to use the framework in the form it was in before? My gut reaction was, well, I hope so, because I don't think validity becomes something different in the digital context. But we would need to be thinking about the purposes of these new assessments and qualifications, and there might be some different kinds of issues that we would need to consider for digital, different kinds of evidence that would inform us about validity. So we needed to be thinking about that.

And then the next thing we did was think about what work is already out there. What have other organizations done in relation to validity for digital assessments? In keeping with our agile approach, I conducted a small literature review to see what we could find on this. We're working with the teams using a collaborative whiteboard, so this is how I fed back that literature review to them, rather than as a full-blown report. The little post-its are each just a snippet on what was going on in one study, or a cluster of studies, about validating digital assessments.

What seemed to come through from these was that quite often these studies, or the ways that a particular digital assessment had been validated, had only used one or maybe two methods. So maybe they'd looked at the correlation between the scores on the new assessment and scores on some other assessment, digital or not, that is supposed to measure the same kinds of skills, and if you get a good correlation, well, hey, it's valid. Great. And then some studies were starting to take it a little bit further, thinking about different aspects of validity and using more than one evidence type.

Perhaps the most thorough piece that I've seen is by Burstein et al, who work on Duolingo. They have used a very complex framework to think about their development and how validity fits in with that, and their validity work has drawn on the work of Chapelle et al, who have done a lot of validation of the TOEFL assessments, and Chapelle based their framework on Michael Kane. So that was quite nice: everyone's still drawing on the same thing even when we're dealing with digital assessments.

So there were not a huge number of studies that were very theoretical or used a well-theorized structure to carry out their validation. Each of the types of evidence that we saw organizations using was relevant, but sometimes they were rather just using one area. And equally, there was nothing in here that suggested validity is different for digital; fundamentally it's still the same kind of thing that we are trying to achieve.

So what did we decide to do in order to support the teams to try to build validity in as they went along? We decided we would use an approach based around some workshops and discussion meetings with the teams, and it's been very iterative, though I've set it out as steps, because perhaps I was a little bit naive at the start and thought it would work very neatly in that way, and it didn't quite, but that's absolutely fine. So firstly, asking the teams to reflect on what the intended purposes of the results are. How do they anticipate the results being used, what are their aims for that, and, related to that, what does that mean we are trying to interpret the results to mean?

Then there's defining the constructs, of course, which Sylvia's talked a lot about, and so they were able to draw on all the information that Sylvia and others were providing to them. My job was just to corral that into: OK, how's your thinking going on this? How are we getting on? What can we put down on paper yet? What's provisional? What's a bit more solid? Step three is around marking, aggregation, grading and any auto-marking; the themes around making sure that the relevant constructs would be assessed. And then administration is an interesting one, because with the work that we've done in the past, where it's all been paper-based exams, we have such an established protocol for lining up learners at tables in an exam room that we already have a way to verify who the candidate is, and we already make sure that they haven't got access to things they shouldn't have access to. In the digital context, all of that is somewhat different. Of course, the operational teams are already working on that, and we hope to start having conversations soon about that side of things and how it's going.

And then, once there are trial materials ready, and we're getting closer and closer to that, we're hoping to collect some validity-type evidence alongside, relating to our validation questions one and two, maybe three, thinking about what kinds of evidence we've used in the past, or any new kinds of evidence, that would be useful to start to see where things are going with validity, and hence feed into the ongoing development.

And then potentially, once the assessments and qualifications are all up and running, we could conduct some full post-hoc validation studies if we felt that was useful. Or it might be that some of the evidence we already have means that we don't need to collect certain kinds of evidence later. That's still to be decided.

I'm going to show you now a few screenshots from the collaborative whiteboard area, which you won't be able to read, but I really just want to show you the approach. Broadly, that has been that I go off and gather a few bits of useful information, then we meet and have a conversation about it, and I explain it as a bit of a resource for them to work with and provide a space for the teams to start to fill in. It's very much been: start, then go away, things develop, they fill in some more, then we have another conversation, and so on. So this is the area to do with our step one, about the purposes, the ways that we expect or would like or hope that we can use results. The purple boxes on the left are a kind of rationale around why we need to think about purpose, and then the content in the middle gives some information about different kinds of assessment purposes.

In the top right of that middle box is the way that Stuart Shaw and I set out what the purposes of international A Levels appear to be, from our discussions with colleagues. So that was there as a template, along with some prompt questions and so on. The space on the right was for the teams to begin to work in, and there are boxes there for them to start to record what uses they expect stakeholders to make of the results, what that means in terms of what we want the results to mean, and a space for ways that we wouldn't want stakeholders to use the results. In our conversations with the teams it was quite clear that there are purposes in terms of the way that we want to use the results, and there are also purposes on the more educational side, like what we want learners to get out of this.

In our conversations we kept fluctuating between the two. And although the more educational side is a bit less about validity and a bit more about the washback side of things, because we ended up talking about them at the same time, we gave the teams a space there to record those as they were thinking about purpose.

And this is our area from that board relating to constructs. Here I provided them with some definitions of construct, for example from the US Standards document, and then a little bit on the different ways that we can operationalise, or set out, what our constructs are, and there's some space in which the teams could work; this has been very iterative. Also on constructs, I provided them with some additional prompts of areas to think about, and each of these boxes was connected with a blank space box for the teams to work in. One area asked them to think about whether there are things that are important within this construct domain that they think are going to be really difficult to assess. Are there ways around that? What are we going to do about it? What does that do to the claims that we're going to be able to make about the assessment? There were also issues around the balance and coverage of skills and content, and what that might mean for us in terms of representativeness and being able to generalize. And then, if one of the purposes of the assessments and qualifications is going to be that we can use results as some kind of indicator of learners' likely future success in other areas, the teams need to be thinking through how the constructs that they place within these assessments relate to those wider kinds of skills. We have similar areas within the collaborative whiteboard for the other themes, and we are continuing to develop those other areas of the whiteboard.

So, a few reflections here. What's been interesting for me is that I don't do that much work that's so closely tied to development, and it was not really my role in this to be telling the teams what they should and shouldn't be doing. It was really to make sure that, along the way, certain things were being thought through and that there was a good rationale for things. Quite often we'd start a conversation and of course they'd already been thinking about it, but I think sometimes the conversations we had then helped to develop that thinking, hopefully.

The overall development of these assessments is very much an ongoing iterative process, and so have my contributions been along the way. As I said, it was not as simple as doing one of the steps on my list and then moving on to the next, because the development evolves, things change, and that affects something else. One of the things that Sarah and I have been reflecting on recently is how we deal with starting to gather some provisional validity evidence alongside the trials: the extent to which we do that, and how much resource we put in, in these very early trials, for example, when we know quite a lot is still going to change. How do we prioritize which key areas are going to be most useful in informing the next stages of development? We're still working through that as the nature of what's going to be involved in the early trials develops.

And then I just want to finish with this quote from Stephen Sireci, a validity theorist from America: evaluating test validity is not a static, one-time event; it's a continuous process. He's talking there about evaluating validity for something that already exists, that's already up and running, and how that has to be a continuous process, and it feels very much like that in the world we're working in, where the development of the assessment is ongoing and the validity work is part of that continuing process too. So that's all from me. Should we break, Sarah? OK, so a five-minute comfort break. I make it 4:15, so back at 4:20 please.

59.00 – Martin

Thanks everyone, nice to see you all. I’m going to talk to you about washback. Or is it? There’s a question, and we’ll get to that one in a moment. The thing that we tried to do, and this is work that I did with Stuart Shaw back in 2019 when we published a paper, was to encourage schools to think about what was going on within their own classrooms when new developments were initiated. We felt that, as researchers, we could perhaps provide a framework to support that, not necessarily do that work, but provide a framework which would allow schools to self-reflect. So it was really nice when this initiative came around that it was an invitation to operationalise the stuff we put out there in a journal paper, in the Journal of Further and Higher Education back in 2019, aimed at teachers and school managers, and to actually think about that in our own institution.

So that’s where this comes from, and as you can see at the bottom, there’s the reference to the paper that this is all based on. Really what we were trying to do was distil what we felt washback was about. There’s a lot of writing on it, especially in the English-as-a-second-language and English language assessment area, so it’s very strong there, but what we were trying to do is bring it into the computer-based assessment world and distil some of the key elements of it.

So when we did that we really felt there were probably three or four key things which denote what washback is about, and before we get to our definition, they fall into this phrase. Washback itself is really something about anticipation: something’s going to happen, there might be a test being introduced in the future that you’re going to take, and what changes before that happens? It’s the anticipation of the thing happening which actually spurs changes in behaviour. So that’s what we were coming to.

And then the second aspect is this notion of impact, and John Gardner, I think it was at Futurelab, talks about how impact and washback are pretty synonymous, they are elements of the same thing. So you can get mixed up if you try to derive particular views on that; actually, don’t get too hung up on it.

But the next thing was about effect, and this was Tony Green, from English language assessment, and he talks about the aspects of impact in washback and the effects which happen. Those effects can be positive, but they can also be negative. They can be anticipated, so you can see them coming and you’d expect them, but they can also be unanticipated. So you’ve got to be open to things. When you are developing tools to try and find out what’s happening in relation to washback from a thing that’s going to happen in the future, you have to leave space to find out what those things may or may not be. So you can bring something of yourself to that, but it might be new.

And then finally, and the most important thing for me and Stuart, was thinking about the stakeholders: who are the most important people in this process? We were very taken by a talk by someone from the Swedish national testing agency, who talked about how they use student voice in the process of developing their national tests in Sweden. And we felt that’s probably one of the hidden voices that we can easily miss in test development. So that is one of the key elements that we felt should come through in a washback study: the voice of the student, but also the voices of the teachers, the people who are most directly affected by the initiative being introduced. There are many other stakeholders, of course, and you can ask them too, but don’t forget the teachers and students. So that was where we were coming from with that.

And so, really, you could distil it all into one phrase, and this is in the paper, but there are three aspects we think are really key. It’s a longitudinal process, so if you’re trying to evaluate washback you have to take some time; it’s not a one-off snap decision, you look at it over different occasions. And by doing that you’re doing the second thing, making comparisons. You’re comparing somebody’s state of behaviour, beliefs or whatever at one point with another. What’s changed? Because without comparison you’re not going to see change, or evidence of change.

And then thirdly you need to plan it. It needs to be planned into the process of development; it’s not something you just tag on, it’s something that you build into the process if you possibly can. And that’s what’s been happening within the development here, which has been really nice, and it’s been great to be part of.

So in a nutshell that’s sort of it. I’m not going to finish now, though, because my big question is: is it washback or is it washforward? I hadn’t really been aware of the concept of washforward, but I had a sense that we’re not doing washback here, because the thing that we’re looking at isn’t developed yet. So it’s not that we know what it is and can start to anticipate what its effects are going to be. But there is a concept called washforward: the idea that you can think and plan into your development the idea that there is going to be change, be open to that, and try to capture evidence of it. And then you can have formative influence, because you can feed back into that development and maybe make it better.

So I think we’re probably in the territory of washforward here. But does it matter? I don’t think it matters because the same framework can be used for both and that’s really the key point.

So the work that Stuart and I were doing was really to create these key dimensions of what a washback framework would need to consider. There are four of them in blue, and there’s one at the bottom which is probably the thing that people always capture anyway: the change in outcomes. We were less bothered about that because we felt that’s the thing people will look at. How do people’s performances change? How do their grades change? Do they do better or worse? Is it girls? Is it boys? All that sort of stuff. But actually what makes the change is what’s really important to focus on.

So the blue dimensions, which we outlined in the paper, are really what drive this initiative. So constructs or concepts, content, whichever you want to call it, I mean we talk now about constructs but I think in the paper we talked about content. Is it the right sort of content? Is it appropriate? Are you okay with that? Resources: are they the right sort of resources? Are they appropriate, usable? Those types of things.

For the interaction, that’s about quality and type. So what changes in the relationships between teachers and students and those around learning? That’s worth looking at. And then finally the affect, and that’s the idea of how does engagement change? How does motivation change? That heavily sort of falls upon trying to get the student voice in there.

So if nothing else, that’s the really big takeaway. That’s the framework that we’ve started to work around, and we think that all those things will affect outcomes in some way.

But the other key aspect, which I think we need to consider, is this longitudinal element. To do that we started to break it down into three phases. So if we were to design the ideal washback study, or washforward or whatever, we’d probably think of having three phases of data collection. At the first stage, when you’ve got a minimum viable product, something that a teacher can take away and start to consider how they might use it in the class, that’s your anticipation phase. It’s very speculative, it’s very forward-looking: I think this is how it might change what I’m doing, this is how it might go down. But it’s also looking backwards at how it compares with other products or other things that the teachers and students have worked with in the past, so they’ve got something to base that judgement on. But it’s very, very speculative. So you can get something out of it; it has formative potential for the developers because they can use that input for their further development.

But then you’ve got the second phase, which is the observation phase. When it’s actually bedded down, what can you find out? This is based on real practice and it involves teachers and students. Whereas the anticipation phase is really teacher-led, because the students probably don’t have much to say about it, this observation phase is very teacher- and student-led, and again it will have both forward- and backward-looking dimensions. Still formative, but it’s very in the moment. It means that they’ll be reflecting on what’s happening at that time, and that’s time-limited in a sense. So you might not want to stake too much on it, because the thing is still in development, so these are the thoughts of the moment.

And then finally you have the evaluation phase, which is where you have this summative moment where you gather everything and look retrospectively over the process of learning, over the period of time. And that’s probably the most robust aspect of the data gathering, where you can come to some sort of judgement about where you think you are.

What I’m very clear about is that there’s no time span on this. We haven’t specified anything. We haven’t even specified whether you have three, or even just two, elements of data gathering. I think you have to have more than one, because we need that comparative element of change, but you don’t necessarily need all three. And how far you’d spread that out depends on the nature of the development itself. How long have you got? That’s something that you’d need to negotiate with the developers.
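
To make the phased design concrete, here is a minimal sketch in Python of how the three optional data-gathering phases could be represented. It is an illustration only, under my own assumptions; the names, fields and wording are hypothetical and are not taken from the published framework or the programme’s actual instruments.

```python
# Illustrative sketch only: one way to represent the optional data-gathering
# phases of a washback/washforward study. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str            # e.g. "anticipation", "observation", "evaluation"
    informants: list     # who mainly supplies the data in this phase
    orientation: str     # speculative, in-the-moment, or retrospective

washback_study_plan = [
    Phase("anticipation", ["teachers"], "speculative, forward-looking"),
    Phase("observation", ["teachers", "students"], "in the moment"),
    Phase("evaluation", ["teachers", "students"], "retrospective, summative"),
]

# At least two phases are needed to make the comparisons that evidence change.
assert len(washback_study_plan) >= 2
```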

But the two things that really stand out are, first, that if you’re trying to operationalise this you need to think about what the constructs are that you’re really considering. The conversations that we’re having now with the developers, and with Vicky’s work and Sylvia’s work, are about trying to get those things out into the open so that we’re very clear, and then we can start to ask questions about them.

And then the second aspect is really about the timeline. So working with the developers to say, “What are we working with here? And when can we stage these different interventions in terms of data collection?” So that’s the framework as it stands, from a theoretical perspective, and now a little bit more about what each of those dimensions involves.

So, as I said, the constructs, the content, are very important; that drives that first aspect, but so does the notion of progression. So we ask the teachers to reflect on how students are improving, or whether they are improving at all. We have that notion of progression, and I think it links very clearly with the notion of outcomes. Those two things really heavily correlate.

You might not have a way of assessing these things because they’re so new. You might have to rely on teachers’ views, teachers’ perspectives and experience. And that’s okay. You might have standardised measures that you can take, but that’s something that you’d need to do.

And then the second dimension, resources, has really four subdimensions. A lot of this came from people in New South Wales in Australia, who are working on these thoughts as well. Alignment: do the resources align with the content and the constructs? So, again, you need those constructs really clearly embedded. Flexibility: is there flexibility built into the resources so that teachers can use them in whatever ways they would like, so that they are fluid, and do they allow some sort of feedback to students (though that’s more in the next one)? Engagement: are the resources engaging to students, meaning are they age-appropriate, are they suitable for the students that you work with? And then rigour is about whether the constructs are represented in the right sort of balance in those resources.

Interaction: again there are four subdimensions identified in the literature. One, Steve Higgins at Durham talks about the affordances of technology to bring out more collaborative processes of interaction between teachers and students, or students and students. So that’s something worth exploring.

Student-centredness is really about whether the technology allows teachers to feed back to students. Does it allow two-way communication, if you like? But certainly, can you focus on what students know and what they need to know next? And then you have the notion of learner participation: can students give back? Can they instigate learning moments with teachers, so it’s not so one-way? So I think what we’re talking about here is opening up the dynamic of learning through technology.

And then finally the idea of learning organisation: how does that change, and how might it be influenced? A lot of that, I think, focuses on how it changes the types of questions that are potentially used. Simon’s in the audience here, and we talked about this ages ago, but the idea is that you can change a learning dynamic by having different types of questions, and by who asks the questions, how many, and of what quality. And that could be opened up through technology quite happily.

And then finally affect. So the idea of behavioural engagement, people putting in more effort; but you can also have engagement measured through emotional engagement, so enthusiasm from learners; agency, so initiation from learners; and finally the notion of cognitive engagement, the idea that learners can have strategies, so that they take control over their learning.

And then more general notions of motivation. Johnmarshall Reeve has worked on self-determination theory, and there’s a lot around the idea of whether this intrinsically motivates at all.
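
Pulling the dimensions and subdimensions together, here is a minimal Python sketch of the framework as I have paraphrased it from the talk. The exact labels are my own shorthand assumptions, not the wording of the published paper.

```python
# Illustrative sketch only: the four washback dimensions and the
# subdimensions mentioned in the talk, arranged as a simple mapping.
# Labels are paraphrased assumptions, not quotations from the paper.
washback_dimensions = {
    "constructs_and_progression": ["content appropriateness", "progression"],
    "resources": ["alignment", "flexibility", "engagement", "rigour"],
    "interaction": ["collaboration", "student-centredness",
                    "learner participation", "learning organisation"],
    "affect": ["behavioural engagement", "emotional engagement",
               "agency", "cognitive engagement", "motivation"],
}

# Outcomes sit outside the four blue dimensions but are assumed to be
# influenced by all of them.
outcome_measures = ["performance", "grades", "group differences"]
```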

So really what I’ve learned from thinking about the framework in relation to the development is that a few things come forward. One is the idea that if you want to collect data you have to be flexible. All of those things that we’re trying to gather data on could be done in various ways. So you’ve got to make tools which are capable of being used in a survey, if you’ve got masses of students and teachers to work with, or through interviews, focus groups, whatever. But it needs to be tailor-made to the context of the development, and that would be different each time. The probes that I’m looking at using are largely based on pre-existing work. A lot of this exists elsewhere; we’ve just put a framework around it and tried to bring those things in in a way that makes sense. So you can see examples here; they’re the types of questions that I started to develop based on other people’s work. And they can be forced-response, they can be open, whatever.

The main burden of the data gathering falls on teachers because they’re involved in every stage, but then students are very importantly involved too, but yeah, teachers are really key to this.

And then I think the future aim, the ideal, would be to create a set of tools which could be used by any developers, so that it’s a package people could take and tailor to the context that they’re working in. That’s what we’re trying to get to, but this stage is very dependent on the researchers because we want to try it out and see if it works.

And so finally just a few takeaways. The first one is the idea that outcomes will be influenced by the dimensions of the washback framework, so what we really want to focus on is those four areas, the dimensions and the subdimensions within them.

The second is that we’ve got this longitudinal process where you can use repeated questions, and in fact that’s really important. If you change the tense of the questions you’re moving things on and you can learn along the way, so it’s not three times the work, as long as your framework is robust.
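
As a small illustration of what re-using a probe with a change of tense could look like, here is a sketch in Python. The question wording is invented for the example and is not taken from the actual research instruments.

```python
# Illustrative sketch only: the same probe re-used across phases by changing
# its tense. The topic and wording are hypothetical examples.
topic = "the balance of source-evaluation work in your lessons"

repeated_probes = {
    "anticipation": f"How do you expect {topic} to change?",
    "observation":  f"How is {topic} changing at the moment?",
    "evaluation":   f"How did {topic} change over the course?",
}

for phase, question in repeated_probes.items():
    print(f"[{phase}] {question}")
```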

And then finally the longitudinal nature of it is a process of developing confidence, so you start off quite tentative, you get a bit more confident and then by the end you hopefully have something that you think you’ve got a good picture of, but it’s actually got the voices of all the important people in there which helps you to design better. And that’s where I finish.

75.37 – Ed

Thank you. Okay, so I’m going to talk about the implications for assessment design of all that you’ve heard so far. And before I start on my material I’ve been sitting in the audience kind of nodding frantically, it’s been reminding me of how important all of this research has been. And how much it overlaps with some of the other evidence that we produce, and hopefully I can talk to that in a moment.

So, the research that you’ve heard described here has impacted our product designs in three key ways. Firstly it’s provided us with parameters, as you would expect. While the process and the methodologies we use are customer-led, to a degree, we’re not in a consumer business. We’re not just producing something that the customer wants; we also have to understand how we need to assess historical research components and how validity works. So we need to be critical in understanding what the purpose of the assessment is as well; we’re not just producing sweets or candy or something like that.

So, yes, the work you’ve heard described has given us those parameters; it’s given us an understanding of what we need to do to meet particular standards. But it’s also helped us to challenge, or sometimes confirm, our assumptions. Again, going back to the example of historical research, we had a hunch that the skills weren’t linear and that they could perhaps be assessed separately, or indeed holistically. And we figured, based on what we were hearing from our customers, that they weren’t all being assessed at the moment, or not necessarily being assessed in an authentic way. But having that confirmation from our research colleagues really helped us to ensure we were on the right track. And the validity-by-design work gives us, or rather forces us to have, clarity on our intended outcomes and prevents us making really rookie errors in those kinds of areas as well.

And I was struck, as well, when I was just listening to Martin how the framework he’s describing for washback is really similar to some of the other things that we do in product development, like UX research. You know we anticipate how somebody’s going to use something and then we’re confronted with the reality of how they use something once we get a prototype in front of them. And obviously we’re going to be going forward using the framework that Martin’s described to think about how these products actually do influence classroom practice. We’d like to think that we’re building on good teaching and learning practice, now, but the process Martin’s describing will tell us whether we are or not.

And lastly, and this is an ongoing process as you can imagine, the research is critical to understanding and identifying the key issues we need to resolve, be that the elements of teacher training necessary to support our new products and ensure they have validity, or how best to develop accessible assessment criteria. There are lots of key issues; we have long lists of them and we haven’t resolved them all yet, but that’s part of the iterative process.

And I wanted to just stop and kind of link back to something that Juliette said earlier which is that our overall approach for these born-digital assessments is not digital for the sake of it. A lot of the current work on digital ends up focusing on the limitations of on-screen assessment platforms. What can’t they replicate from how we do things currently on paper? And we’ve had the luxury of taking the opposite approach, if you like, which is to say, what are the limitations of our current assessment practice? What are the limitations of our traditional assessment methods? What kind of evidence can’t we collect using pen and paper exams or written course work? And how could digital help with that?

So, what does all this mean in practice? So the first example I’m going to talk to, and then Sarah’s going to pick up another one from computer science, is about what we’re proposing in the field of historical research. So our initial engagement with our customers and our wider stakeholders identified issues with the current claims that we make about how well history A Level prepares learners for independent historical research, you heard that reference in some of the material earlier. And with this problem identified we began developing product propositions to try and address this.

So at the moment our product proposal is an AS Level made up of two components, which we’re looking to align with the current offering to allow learners to take it in combination with current A2 components to deliver a full history A Level if they desire, or as a stand-alone AS Level.

And the first component is conceived of as an on-screen exam using an online source bank to support the assessment of source evaluation. We heard earlier a reference to some practice where that was being used; it’s an idea we’re quite fond of. It creates a level playing field, having sources that candidates are familiar with, and allows us to be clearer about what we’re assessing and how well they’re demonstrating it. One little sliver of a way in which we think digital can help here is that we know candidates annotate sources on the exam papers when they do a source evaluation question in history. But those exam papers are picked up and thrown away at the end of the exam, and only the answer booklet comes back to Cambridge for marking. By having candidates do that annotation in a digital environment, we know we can capture it, and we think there’s real value in being able to capture that evidence and use it. We think we can learn a lot about the processes that candidates are going through by looking at those annotations.
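
To illustrate the kind of evidence capture being described, here is a minimal Python sketch of one possible shape for a captured source annotation. The field names, identifiers and example content are my own assumptions; the actual platform design has not been specified in this detail.

```python
# Illustrative sketch only: one possible record shape for a candidate's
# annotation of an item in an online source bank. All names are hypothetical.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SourceAnnotation:
    candidate_id: str
    source_id: str        # identifier of the item in the online source bank
    highlighted_text: str
    note: str
    created_at: datetime  # timestamp lets us look at the working process

annotation = SourceAnnotation(
    candidate_id="C0001",
    source_id="SRC-42",
    highlighted_text="eyewitness account, written ten years later",
    note="possible reliability issue: time lag",
    created_at=datetime.now(),
)
```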

The second component is the one that a lot of the conversations earlier in the presentation talked to, which is our digitally enabled research component. Alongside Sylvia’s research, other evidence had indicated that we currently only infer evidence of process from final submissions. We don’t really capture evidence of the process of historical research. And we know that if we’re inferring it from a final piece, that can be taught; going back again to washback, we know that people are basically being taught how to look as though they’ve conducted authentic historical research in an essay when that essay is submitted to the exam board.

In the past, and this is not a new issue, it’s a known issue, efforts have been made to supplement the final piece with paper research diaries and the like, which have proved both bureaucratic and a huge overhead for teachers, but also easily gamed by learners in schools. I always reflect back on when I did GCSE art: I did two days of exam conditions in an art room producing my final piece, and then the next two days in my art room with my teacher doing my preparatory sketches, which were submitted to the exam board as if they had been done in the opposite order. So one of our key insights here was that we could use digital environments to collect this kind of longitudinal evidence automatically. If learners are uploading materials, reflecting and engaging with their teachers in a digital environment, then just by basic timestamping we’re capturing what that journey has looked like and evidence of those skills along the way. We know from Sylvia’s work on aspects of historical research that what we want to assess isn’t linear; we want to be able to capture all of that information and assess it holistically.

And so again, just another little sliver of where digital might help: we think there’s an opportunity for teachers to tag particular interactions or materials that students have shared against the mark scheme, and against different levels in the mark scheme, as they go along, so they’re developing a picture. We can then take that evidence from different places on the learning journey rather than it all being taken from the final product, with the potential, therefore, for students to get credit from various different places on their journey in a non-linear way.
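
As a sketch of what timestamped, teacher-tagged evidence from that journey might look like, here is a short Python example. The identifiers, criteria names and levels are hypothetical assumptions for illustration only, not the programme’s actual mark scheme or data model.

```python
# Illustrative sketch only: timestamped pieces of evidence from a research
# journey, tagged by the teacher against mark-scheme criteria and levels.
# All identifiers and criterion names are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class EvidenceItem:
    learner_id: str
    description: str                           # e.g. an upload or a reflection
    timestamp: datetime
    tags: dict = field(default_factory=dict)   # criterion -> level awarded

journey = [
    EvidenceItem("L17", "uploaded annotated bibliography",
                 datetime(2024, 2, 10), {"source selection": 3}),
    EvidenceItem("L17", "reflection on revised research question",
                 datetime(2024, 3, 2), {"framing a question": 4}),
]

# Credit can then be drawn from different points on the journey,
# rather than only from the final submitted essay.
best_per_criterion = {}
for item in journey:
    for criterion, level in item.tags.items():
        best_per_criterion[criterion] = max(level, best_per_criterion.get(criterion, 0))
```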

Now there’s lots more work to do to develop this concept into a real product. The bulk of next year is going to be spent trialling with schools using prototypes, be those end-to-end prototypes or very specific parts of technology that we’re looking to trial.

We’re going to be learning what works and doesn’t work in practice and what elements are key to this product versus the things that are nice to have, the things that aren’t really adding value. And we’ll be collecting evidence of washback along the way. We’ll be working with Vicky and team to ensure that we continue to collect evidence relating to validity, and making sure that our assessment criteria, the things we’re trying to measure, those aspects of historical research that we want to be able to measure are capable of being measured and can actually distinguish between candidates.

So, that’s a little insight into what one of our potential products might look like, or looks like today, but may iterate away in different directions across time. I’m going to hand over to Sarah who’s going to talk about the computer science proposition.

86.40 – Sarah

Are we back on? Yes. Yeah, thanks Ed. So I just want to add a little something for two minutes before we go into the Q&A, about how, similarly, research has impacted on the other born-digital assessments that are under development and about to go into schools (prototypes are about to go into schools to get feedback), and that’s in the area of computer science.

So there are currently three areas we’re looking at within the assessment. One is a fundamental concepts digital test, which also assesses computational thinking. The research that we’ve done, which I mentioned very early on in my massive list, about what progression in computational thinking is, has fed very much into what that assessment looks like and what the assessment criteria in that area look like. The second aspect is a practical programming project, and this is where the work that we did about what collaboration means in this context, and what communication means, comes in. What about resilience? People talk about programmers needing to be resilient and needing to be creative, but what does that really mean? What’s happening currently in those areas, and how is it being assessed? Those bits of research have been applied in that area. For example, we found in relation to creativity in programming that, yes, we can describe it, it’s essential to being a good programmer, but once we start assessing it and describing it at that level of minutiae it will take the creativity out of creativity, so let’s not assess it. So we’re learning things like that along the way.

And also in terms of the programming project, there’s the issue of communication and collaboration. In industry it’s absolutely normal for a coder to take a piece of existing code and adapt it, yet in a current exam that’s cheating. So in this part of the assessment, adapting someone else’s code is absolutely part of the deal, to make it much more authentic to how coders really work in practice. And all of that’s supported by formative assessment and teaching tools.

The assessments that Ed described in historical research, and the computer science assessments, the prototypes are going out to schools in January for feedback, so anyone out there who’s working in a school and wants to get involved in helping shape what those are like, do get in touch with us. I think that’s pretty much it in terms of content from us, as if that wasn’t enough. We do have lots of links to the blogs that we have been putting out over the last year or so, which point to how we’re working and what our products are like. I think Jonathon or Penelope are going to pop those in the chat, wonderful, thank you, so you can access those.

And finally, do get in touch with us if you want to work with us, whether you’re a school or anyone else who wants to help shape these and give us feedback on the assessments. And the webpage that you probably went to to engage with this event will evolve over time into a place where we put outcomes of the research that we’ve done: reports, blogs, all sorts of things that share with you the research that we’re doing. So keep an eye on that space.

Right, we have got some time for Q&A. I know that we’ve got questions coming in from the chat, from those of you who are joining us virtually, and I’ll also open it up in the room as well for any questions for any of us. – We’ve got some coming in from online.

90.33 – Q1

Thanks, yeah, just got a question online. It strikes me that, given the complexity and interdependency of different historical research skills identified in this project, they lend themselves more to some form of non-examination assessment, rather than a conventional timed examination. Is the work you’re undertaking focused purely on high-stakes digital examinations? Or looking more broadly at other forms of digital assessment, such as controlled tasks and course work type activities?

91.02 – Sarah

Very good, I think that’s one for Ed.

91.11 – Ed

Thank you, yeah, that’s a really good question. I mean I think, I’m probably not breaking any confidences by saying that we’re not looking at timed kind of on-screen assessment for that historical research piece, it will be a digital research project. It will be something which will be done over time. It’s probably at least two terms’ worth of work. And we’re going to be absolutely dependent on kind of capturing authentically the teaching and experience, the interaction between the teacher and the learner as well.

We’re probably implying that this is going to be a teacher-marked kind of component as well. But those decisions are still to come, but yeah, absolutely, this is something which is not going to be a three-hour on-screen exam. When we talked to customers they were all telling us that they really value the focus on process and capturing evidence of process, but equally they really value the final product. And particularly, I guess, they were saying in history that’s because they want to see evidence of the historical knowledge and understanding as well, it’s not just about the process, the final piece is important because it’s where you get to bring all of those pieces together and demonstrate that knowledge and understanding, as well as the skills and behaviours that we think are important.

92.47 – Sarah

And I think the same goes for the computer science as well, where the second component is a kind of project-related component, yeah. Simon?

93.04 – Simon Q2

My question was about the next few years in terms of high-stakes exams in the UK system and so on. I remember there was an article a couple of years ago from Saul, our former chief exec, saying that he didn’t envisage high-stakes digital exams happening for at least ten years; I think it was around early Covid that the interview came out. And there was some speculation as to when that would be. I was wondering where you thought that prediction stood, in terms of its accuracy, looking at it today. And also where and when you feel that kind of critical mass of research and validation evidence will give Ofqual, or other regulators around the world, perhaps at ministry level and so on, the confidence to say, “Okay, yes, we can make the switch with confidence, without the inherent risks of such a drastic change.”

93.56 – Sarah

Yes, thank you for the tricky question I was really looking forward to getting. I’ve got loads of things, fireworks, going off in my head and I won’t keep up with all of them. One of them relates to your talking about a switch, which carries a bit of an assumption, I think, that we are lifting and shifting from paper to screen; all the countries that have done that so far have done it much more quickly than countries that are trying to do something a bit more born-digital, digital first. And I know that we’re talking about an ambition of getting live lifted-and-shifted assessments out there in the general qualification space from 2025. That seems very ambitious. I also don’t think we need to think about a shift as a wholesale shift, because I would advocate that there are parts of assessments that are best suited to paper, and some particular skills, areas, mindsets and behaviours that are best suited to digital. So I don’t think a wholesale shift is really going to happen, because I don’t think that’s an appropriate approach.

And, I told you that I’d lose those fireworks that were going off. That was it, the other thing: the relationship with teaching and learning. It’s been said that this is the magic thing that no one’s quite managed to get their hands on yet, the idea that we can work on the assessments and that will do something to teaching and learning. But lots of criticisms of the current assessment system are really criticisms of the current education system, which is all about what we are preparing our learners for. And actually a reform of the wider system is necessary in order to really prepare our learners for the future that they need.

So, you know, in those kinds of regulated contexts, where we’re looking at general qualifications and school systems, we’re a bit beholden to whatever ministries and regulators are doing in terms of reform. So we are getting ready now, by developing what we’re developing, to influence the regulators and to influence ministries, and we are talking with the other exam boards in the UK, for example, about a joint forum at which we can do that. So I think we’re preparing, but generally the biggest barrier to the uptake of digital assessment, the literature says, and all our literature reviews point to this, is a lack of policy steer. And I think that’s really what we want to influence and what we’re waiting for. Another question from our online colleagues?

96.54 – Q3

Yeah, this one’s from David. How confident can you be with the digital literacy of learners? And what aspects of this will you exploit? What are the potential risks of relying on perceived digital literacy?

97.09 – Sarah

Very good question, because I talked about one of the drivers very early on as being that we’re able to embrace the fact that our learners now have digital literacy skills. But I think that comes with a lot of assumptions. And I think there is a particular area of digital literacy that we rely on but cannot assume learners have, and we saw this in the examples from the mock service: touch typing and keyboarding. Actually, other research that I’ve seen, not research we’ve done here, points to learners not having those skills to the extent that we often assume they do. So I think that’s going to be the key area where we need to really understand what we’re expecting of learners, and what construct-irrelevant variance, i.e. unintended demands, we’re bringing in by expecting them to touch type. I think that’s going to be a key area we need to focus on. Anything else from my co-presenters? I’m hogging the stage.

98.23 – Q4

Could we see a divergence between what we offer internationally and domestically? Obviously domestically we are dependent on a lot of regulatory approval and governmental policy, but internationally, if we can make a digital qualification, or part of one, that schools want, could it succeed on its own two feet?

98.42 – Sarah

Yeah, very good point. And actually it’s interesting that, as an organisation that has both UK-facing and internationally facing exam boards in it, the international board has generally reflected what’s happening in the UK. But I think in this context it’s actually the other way round, in that the place to make these developments, design these assessments and try them out is the international audience, because trying to get accreditation in the UK, particularly through our regulator, is going to require a lot of data and a lot of evidence behind things, and we’re lucky in that we have an international audience with a massive appetite for these digital assessments. We can go out there, get data, get feedback, and then that will influence the UK. I think the tables are turned in that context, relative to the way we usually operate.

99.45 – Q5

Yeah so we’ve had a few questions online around the implications of high-stakes digital assessments for equality, equity and inclusion or exclusion, yeah, how are these issues being considered and addressed through the research?

100.04 – Sarah

Very interesting. Well, number one, let me say we’re developing our five accessibility principles, which will support our learners and all of our developers and developments through the process of ensuring that our digital assessments are accessible. But on a more theoretical basis, very early on in our developments we were talking about how, in order to support learners who are less familiar with and have less access to technology, for example, digital would always be an option alongside paper. I think at the time that felt right, but it’s very interesting to think about what having those two parallel approaches does: which is the poor cousin of the other, and at what point does it flip, whereby the learners who are working on paper might be left behind and there’s a detrimental effect? So there are issues around comparability of those two routes, and fairness, and the point I was going to make has gone.

Yeah, and actually what we’re thinking about is that South Africa have taken an approach whereby they are not creating parallel versions of their paper and digital assessments; they are choosing to shift, so the assessment moves from paper to digital and there aren’t two routes. In many ways that seems fairer than having two routes, where digital poverty in particular will have an impact. So there’s a really interesting decision to be made about whether the parallel route is fairer, or whether just saying “no, this one is digital” is fairer. It does rely on the provision of hardware and broadband, and I think in the countries where that’s worked, it’s been where there’s been a ministry or governmental initiative to provide that for learners. And I think we’re quite a long way from that in the UK. So there are lots of factors inputting there. I don’t know if anyone wants to add anything, Ed?

102.37 – Ed

Yeah, I mean I don’t have a huge amount to add to what Sarah said, except to say that we, obviously, as an organisation, are drawn in different directions. We have customers who are telling us they want digital assessment tomorrow, or yesterday in some cases, and we have customers who are telling us they are either not interested in digital assessment, or don’t see that they have the capacity to implement it, for practical, economic and other reasons. And I think we have to, as an organisation, be prepared to tackle the questions about how we serve both sets of customers. It throws up all sorts of awkward questions about comparability and other things, but we can’t afford, I think, to say that we won’t engage with digital assessment, because those customers will go elsewhere, to people who are prepared to engage with it. But equally we can’t be mandating digital assessment, because we know there are broad swathes of our customer base who can’t do it.

And one of the things that I really like about the born-digital assessments is that they step away from one of the most oft-quoted constraints around digital assessment, which is the UK view, if you like: how do I get 300 kids sat in my sports hall all sitting a digital assessment at the same time? I think that’s at least one of the drivers why, when we started looking at what form these born-digital assessments might take, we moved away from the idea of a timed, one-time, final-exam kind of model, not only because there are constraints around how practical it is to run that at volume, but also because digital can offer us so many other ways of capturing evidence and thinking about how we assess skills, behaviours and knowledge and understanding.

104.52 – Martin

Yeah, I think there’s a reflection on washback there as well. If you’re asking teachers to reflect on how an initiative is impacting different types of students within their class, that’s a key way of trying to pick up on differentials at a very formative stage. So you can evaluate that and help to stop it becoming sedimented and implemented at full scale, if you take washback seriously. So I think that’s a very important contribution that that element of research can make on this issue.

105.25 – Sarah

Right, we’re over time. So I’m going to just thank everyone in the room and everyone who’s logged in to join us today for your input and your attention. Thanks very much, and to my co-presenters, and do keep in touch and keep an eye on what we’re up to, because we’ll be sharing a lot of the evidence that we’re using in this context. So I don’t need to go any further. Stop. Thanks everyone, and thanks for a stimulating afternoon.

Developing research-informed digital assessments

25 Nov 2022 (106:09)

Learn how research is influencing the design of digital assessments at Cambridge. The Cambridge Digital High Stakes Assessment Programme is using research to inform assessment design and ensure the quality of new digital assessments.

  • 00:00 - Juliet Wilson - Director of Assessment and Customer Support, Cambridge International
  • 04:26 - Sarah Hughes - Research and Thought Leadership Lead
  • 10:43 - Ed Sutton - Product Manager
  • 13:00 - Sarah Hughes
  • 17:30 - Sylvia Vitello - Senior Research Officer
  • 40:20 - Vicki Crisp - Senior Research Officer
  • 59:00 - Martin Johnson - Senior Researcher
  • 1:30:09 - Q&A

Keynotes and speakers

Our research

Sarah Hughes reported on the evidence base we are using to inform assessment design and evaluate assessment quality.

Validity of digital assessments

Vicki Crisp

We asked - can the Cambridge approach to validation be used to validate our digital assessments? Vicki Crisp described how we have adapted the approach to build validity into digital assessment from the start.


Defining constructs in Historical Research

Sylvia Vitello

Research by Sylvia Vitello and colleagues helped us better understand the constructs in historical research so we can make evidence-based decisions about assessment design. Read Sylvia’s blog about this work.
