I first came across comparative judgement and Chris Wheadon’s No More Marking website about three years ago, when it was very much in its infancy. For some reason, I didn’t recognise its potential; I saw more drawbacks to collaborative assessment than benefits. What I hadn’t properly considered were the significant flaws in existing methods for assessing students’ written work – issues of bias, the illusion of objective evaluation against scoring rubrics, etc. I also didn’t fully appreciate the central premise that underpins comparative judgement: that human beings deal more in relative comparisons than absolute judgements.
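That premise is also what makes the approach computable: each quick pairwise decision is a data point, and a statistical model turns many such decisions into a single scaled rank order (No More Marking describes using a Bradley-Terry-style model for this, though the sketch below is my own illustration, not their implementation, and the script IDs are made up):

```python
from collections import defaultdict

def bradley_terry(pairs, iters=100):
    """Estimate a quality score per script from pairwise judgements.

    pairs: list of (winner, loser) tuples. Uses the classic
    minorisation-maximisation update for the Bradley-Terry model.
    (Scripts with no wins collapse to zero -- real engines handle
    this edge case more carefully.)
    """
    items = {x for pair in pairs for x in pair}
    wins = defaultdict(int)
    matches = defaultdict(int)  # unordered pair -> number of comparisons
    for winner, loser in pairs:
        wins[winner] += 1
        matches[tuple(sorted((winner, loser)))] += 1

    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            denom = 0.0
            for (a, b), n in matches.items():
                if i == a:
                    denom += n / (p[i] + p[b])
                elif i == b:
                    denom += n / (p[i] + p[a])
            new_p[i] = wins[i] / denom if denom else p[i]
        # Normalise so the scale doesn't drift between iterations
        total = sum(new_p.values())
        p = {i: v * len(items) / total for i, v in new_p.items()}
    return p

judgements = [("A", "B"), ("A", "C"), ("A", "B"), ("B", "C"), ("C", "B")]
scores = bradley_terry(judgements)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting the fitted scores gives the rank order of scripts; the more judgements each script accumulates, the more stable that order becomes.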
The significant benefits of using comparative judgement are much more obvious to me now, not just for English, but for other subject areas too. Whilst it is not without its issues (see below), the more I use comparative judgement, and the accompanying assessment tools on the ever-improving No More Marking site, the more I think it can really help increase the reliability of assessing certain pieces of work, as well as make a big difference in reducing teacher workload. There are other potential benefits too, such as opportunities for collaborative professional learning, getting better at understanding what makes a good piece of work, and quickly seeing different strengths and weaknesses across a cohort.
Most of the examples I have read about of schools using comparative judgement tend to focus on the assessment of writing – facets of effective composition, such as control, organisation and style. An obvious example is Daisy Christodoulou’s pioneering work with Chris Wheadon, which is extremely useful in showing how to use comparative judgement at scale, as well as demonstrating how it can lead to greater reliability than teacher judgement and more conventional forms of standardisation. Comparative judgement of small pieces of written work is also at the heart of the FFT’s English Proof of Progress test that many schools, including ours, are using to measure the progress of their KS3 students and to cross reference against their own emerging assessment models.
This is all well and good, and I would imagine that even comparative judgement’s staunchest detractors can see that it has something to offer the process of assessing for things like style and technical accuracy. What I think is less well documented, though, is how comparative judgement can support the assessment of other areas of the English curriculum, such as longer pieces of analytical writing. This is because it’s much harder to use comparative judgement in this way. Yet, within my department, and probably for other secondary school departments too, this is what we are interested in right now: learning how comparative judgement might support the process of marking the ever-increasing number of essays that our students are writing at both GCSE and A Level. Essays that we want to assess reliably and quickly.
Unlike a piece of writing, though, which can be read quickly and judged instinctively for its relative quality and accuracy, I think that analytical responses are much more problematic. For a start, judges must be well versed in the text or texts being written about. This is not an insurmountable hurdle, since many teachers in a department teach the same text, and one would hope that most English teachers are au fait enough with texts on a GCSE syllabus to pass judgement on a piece of analysis. That said, knowledge of the text and knowledge of the focus of the analysis – such as the extent to which contextual links play a role – are much more of a factor in collaborative assessment of reading than of writing, which makes it harder to enlist additional judges and therefore more time-consuming to make comparative judgements.
Trialling Comparative Judgement
We have now used comparative judgement in the English department on three separate occasions, most recently to assess a year 11 literature mock question on Dr Jekyll and Mr Hyde. Whereas last year we focused on experimenting with the process and getting used to marking in such a different way, this year we have increased our use of comparative judgement with the longer-term aim of making it a key component of our overall assessment portfolio. Rather than blindly replacing the old with the new, however, which is certainly tempting when you think you can see the benefits from the outset, we are mindful that we need to tread carefully.
As a result, we have set up a controlled trial to try and get some objective feedback to check against our hunches. The trial essentially consists of splitting our GCSE cohort into two groups. All students will sit five literature assessments throughout the course of the year, with one group having their assessments marked using comparative judgement, and the other through the more traditional method of applying a mark scheme followed by a process of moderation. Using a combination of quantitative and qualitative methods, we hope to ascertain the effect, if any, of using comparative judgement on student learning, but also, more importantly, its impact on teacher workload. Admittedly, such evaluation is flawed, but we hope that it will at least make us better informed when we come to make a decision later on about whether to adopt comparative judgement more widely.
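For the quantitative side, even a very simple comparison of the two arms can be informative. A sketch, using entirely hypothetical per-essay marking times and a basic effect size (none of these numbers are from our trial):

```python
import statistics

# Hypothetical per-essay marking times (minutes) from the two trial arms
cj_times = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3]           # comparative judgement
traditional_times = [6.5, 7.2, 5.9, 6.8, 7.0, 6.4]  # mark scheme + moderation

def cohens_d(a, b):
    """Effect size: difference of means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

d = cohens_d(traditional_times, cj_times)
```

A large positive `d` here would suggest a genuine workload saving; a proper analysis would of course also need significance testing and a measure of marking reliability, not just speed.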
Issues and solutions
The impact of poor handwriting on grade scores is not a new phenomenon. I remember when I was a GCSE exam marker: I would much prefer reading legible scripts and curse the ones I had to spend time deciphering. Obviously, I tried not to let students’ poor handwriting get in the way of making my judgements, but the reality was it probably did, even if the only bias was that the additional time meant I saw flaws more clearly. When you are marking your own students’ essays – as with the usual way we mark our internal assessments – you get used to those students with tricky handwriting, and learn how to decipher their meaning, perhaps unconsciously giving them the benefit of the doubt because you know what they meant.
It’s even harder to avoid handwriting bias with comparative judgement, particularly when you are encouraged to make quick judgements and you are reading lots of scanned photocopied scripts off a computer screen. Poor handwriting was clearly a factor behind some of the anomalous results from our recent session. Several teachers noted how hard it was to properly read some essays, and a deeper examination of the worst offenders showed that the mean time of judgements on them was much longer than on those that were easier on the eye. Most of these essays also scored badly. Conversely, almost all the best essays had the neatest, most legible pen work. On closer inspection, however, a significant number were clearly in the wrong band.
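One practical response is to use the judging metadata that platforms like No More Marking already record. With per-script mean judgement times to hand (the numbers and script IDs below are invented for illustration), scripts that took judges far longer than typical can be flagged for a legibility check before their position in the rank order is trusted:

```python
import statistics

# Hypothetical mean judgement time (seconds) per script, keyed by script ID
judge_times = {"S01": 24, "S02": 31, "S03": 28, "S04": 95,
               "S05": 26, "S06": 88, "S07": 30, "S08": 27}

times = list(judge_times.values())
mean_t = statistics.mean(times)
sd_t = statistics.stdev(times)

# Scripts taking well over the typical time are candidates for a
# legibility check before trusting their rank position
flagged = [s for s, t in judge_times.items() if t > mean_t + sd_t]
```

This won’t prove that handwriting caused an anomaly, but it gives a short, objective list of scripts worth re-reading on paper rather than on screen.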
It would be wrong to suggest that all the anomalies we found after interrogating the ranked order of essays were entirely down to issues of handwriting. There were a number of administrative failures, such as students writing on the wrong part of barcoded paper and some of the scans being uploaded back to front, which gave the impression that some students had not written very much at all, or only in fragments. These are technical issues, and can easily be ironed out the more we get to grips with the approach. That is the whole point of taking things slowly and learning from trial and error.
Aside from issues of handwriting and administration, a number of other anomalies remained. Some of these apparent errors turned out to be completely right: students whom teachers had expected to score highly had not written a good essay, and students who had not really been expected to gain high marks did much better than anticipated. With our usual approach – teachers marking their own classes with some subsequent moderation – I suspect that some of these surprising results would not have been apparent. Other anomalies were just plain wrong, which I would love to illustrate, but our uploaded scripts are no longer available on the new No More Marking website. We still haven’t got to the bottom of why a significant number of these scripts were placed in completely the wrong order or bands. Some error is inevitable, of course, but the question is probably more about whether comparative judgement has created these errors, or whether they were always there and comparative judgement has simply brought them to light.
I hope to be able to answer this question as the year goes on. In the meantime, these are the steps we are taking to address the issues we have identified so far:
- Brief teachers on issues of bias with poor handwriting and the halo effect of neat work
- Emphasise to students the importance of taking care with their handwriting
- Standardised instructions and conditions for all students taking the tests
- Teacher standardisation session using exemplar work from previous session
- Clearer focus on the criteria for judgements
- Previous responses used as anchors in the judging session
- Divide up marking sessions: 1) an initial collaborative judging session to iron out issues, identify interesting or salient features of students’ work and check teacher reliability, etc. 2) independent judging sessions at another time to avoid issues of fatigue and cognitive overload
- Investigate significant anomalies and identify the factors that may have influenced judgements
- Use insights into student work to inform subsequent teaching