‘The Best Laid Plans’: 101 Reasons Why Lessons Go Wrong


I have been teaching for nearly 14 years, and only recently have I come to accept that there are just some things beyond your control.

Entrances and exits

  1. Late from registration
  2. Late from last lesson
  3. Late from PE (standard)
  4. Late from assembly
  5. Late from speaking to another teacher
  6. Doctor’s appointment
  7. Time out card
  8. Toilet pass
  9. PE fixture
  10. Art trip
  11. Intervention

Unwelcome interruption

  1. Fire alarm
  2. Door alarm (sounds like fire alarm)
  3. Car alarm
  4. Tannoy
  5. Someone needs to talk to a student
  6. Someone needs to talk to you
  7. Someone pops in and pops back out but you can’t see who
  8. Someone’s on a learning walk
  9. Two people on a learning walk
  10. Several people on a learning walk
  11. Several people on a learning walk with the head
  12. Several people on a learning walk from another school
  13. Several people on a learning walk from Denmark
  14. Ofsted
  15. Pseudo Ofsted or Ofsted-lite
  16. Student passes wind
  17. You pass wind
  18. Visitor passes wind

Teacher down

  1. Tripping over a wire
  2. Tripping over unusual name – Ha! Ha!
  3. Smacking into a desk
  4. Dropping a book
  5. Dropping a pen
  6. Dropping the clicker

IT failure (high and low tech)

  1. No sound
  2. No visuals
  3. No sound or visuals
  4. An excess of sound
  5. No internet
  6. Internet but no YouTube
  7. YouTube but clip won’t load
  8. YouTube but clip has gone
  9. YouTube, clip there, but blocked
  10. Computer locked from last teacher
  11. Desk locked from last (messy) teacher
  12. Board pen runs out
  13. No board pen
  14. No board rubber
  15. Red or green pens only
  16. No remote

Freudianisms

  1. Accidental double entendre
  2. Accidental rude word
  3. Rude word you never knew was a rude word
  4. Rude word in a text
  5. Rude word in a text you never knew was a rude word
  6. Rude word shouted out
  7. Taboo word in a text you had forgotten
  8. Taboo word in a text you remembered but thought you would discuss

Health and safety

  1. Coffee spillage
  2. Tea spillage
  3. Pen spillage
  4. Someone’s brought something up
  5. Someone’s brought something in
  6. Chair incident
  7. Table incident

Teacher standards

  1. Poor planning
  2. No planning
  3. Over planning
  4. Lost plan
  5. Planning for wrong day
  6. Poor question
  7. Poor example
  8. Poor resource
  9. Poor task
  10. Confusing instruction
  11. Confusing explanation
  12. Generally confusing yourself
  13. You’re tired
  14. You’re hungover
  15. You’re tired and hungover
  16. Students are hungover (sixth form only!)

Pesky kids

  1. No pen
  2. No book
  3. No planner
  4. No homework
  5. Nothing!
  6. Trainers
  7. Earrings
  8. Chewing gum
  9. Fidget spinner (or generational equivalent)

Seasonal

  1. Wet break
  2. Windy break
  3. Too hot
  4. Too cold
  5. Too slippery
  6. Wasps
  7. Bees
  8. Flies
  9. Butterflies
  10. Yes, pigeons!

 

It’s just a bit of fun. None of this has ever happened.

Thanks for reading.

 

 


Show Me: Maximising the Use of Mini Whiteboards in Lessons

Mini whiteboards can be an excellent way to gather information about class ‘understanding’ quickly and efficiently. When used badly, however, they cease to be an effective responsive teaching tool; they get in the way of learning and become a distraction. This post draws upon some of Doug Lemov’s ideas in Teach Like a Champion 2.0 (Show Me – technique no. 5), along with my own experiences, to offer some tips on how to maximise your use of mini whiteboards.

Before the Lesson:

Plan questions in advance

As with most things in life, the better something is planned in advance the more likely it is to be executed successfully later on. In this case, the chances are you will have more success if you map out in advance the questions you are going to ask your students to check understanding. Too often we make the mistake of trying to come up with good questions whilst we teach. Often they are not precise enough to capture the data we need to guide our next steps, or we ask for lengthy responses we cannot possibly see from the front of the class. Well-considered questions avoid this problem and increase our chances of getting the valuable information we need in the moment.

Standardise response format

Format matters. Of all the ideas in Teach Like a Champion, I would say Standardise the Format is one of the most powerful and easiest to implement. I insist that all my students ‘Fill the board’ with their answers so that I can see them clearly when I am scanning the room. It also makes a difference what colour students write in. Blue or black pens have the best chance of being seen and not getting distorted by the play of light from the windows or from the flickering overhead artificial strips.

Standardise show me format

It is not just responses that benefit from being standardised; the format of the reveal does too. I use a simple 3-2-1 ‘show me’, but other instructions can work just as well, as long as they are understood by all and insisted upon in practice. All students should cover their answers once they have written them and raise their boards on the agreed command simultaneously. This approach reduces the likelihood of students being influenced by other people’s responses, which undermines the validity of the check. Wobbling boards in the air is also unhelpful. And very annoying.

During the Lesson:

Insist on agreed formats

There is no point spending time establishing protocols for recording responses and revealing them simultaneously if you don’t enforce them in practice. It is far better to sacrifice a bit of time in the short term getting these basics right, so that in the long term the process becomes so slick you can effortlessly question the whole class and gain immediate feedback on their current understanding.

Scan boards from the front of the class

This probably seems so trivial and self-evident it is not even worth mentioning, but you would be surprised how many times I have seen teachers standing to the side or positioned in front of the first row of desks, where they cannot possibly see all the answers. The whole point is that you scan all the boards as quickly as you can and make a decision about whether to move on or to respond.

Approximate class understanding

As far as I’m aware, there is no hard and fast rule as to what percentage of students need to get the right answer for you to feel secure enough to move on. The obvious answer is 100%, but in reality it doesn’t always work out like that. Depending on the teaching point, you can sometimes correct one or two students’ understanding quickly there and then, but at other times you can spend several minutes trying to clarify something only for one individual to still miss the point. I aim for between 80% and 90%, and then make a beeline for students who got the wrong answer later on in the lesson.


Mini whiteboards are just one of many tools that can help us respond better to students’ needs, but they are largely useless if you don’t think through how to use them and plan accordingly.

Thanks for reading.

 

Why didn’t you tell me? 5 things I wish I had been told sooner

Like many others, there are things I have learned in recent years that it would have been really helpful to have been told about earlier on in my career. Knowing about the relative ineffectiveness of marking stacks of books, the power of retrieval practice and the importance of background knowledge, for instance, would have all helped me be a much better teacher.
But whilst insights like these are crucial to improving learning and managing workload, they are not my focus here. Implementing the principles of retrieval practice, for instance, requires a great deal of strategic thought and collaboration. Instead, I wanted to share a few simple things before the start of the new term that I wish someone had taken me to one side and explained – things I think teachers can take on board relatively easily to improve their teaching.

1. Don’t talk over students whilst they work

Others have written eloquently and in detail about the theoretical reasoning why this is such a bad idea, but in essence it should be pretty obvious to all of us anyway. We can all think of situations where we are trying to concentrate on something and somebody is talking in the background. I hate, for example, the incessant messages given out on trains when you are trying to read. You either ignore the message (and maybe your station) or you get distracted from your book to listen to some tedious automated announcement.

Unless it is critical to the task, once your students are working, just leave them to it. However helpful you might think you are being – clarifying your instructions, giving time warnings, providing further examples, etc. – you are not. You are getting in the way of their learning and being annoying!

2. The whiteboard is your friend: use it!

My handwriting is dreadful. Think a doctor’s scrawl after a twelve-hour shift. Writing on the board was one of the main anxieties I had coming into the profession; PowerPoint seemed ready-made for me. And yet, I have come to realise that the whiteboard is in fact the most underused, underrated and most utterly brilliant tool at our disposal. If it were up to me, I would rip out all the ‘interactive’ boards in my school and replace them with good old-fashioned whiteboards. Relying too heavily on prepared slides restricts our ability to respond to learners’ needs and runs the risk of turning us into presenters.

Whiteboards allow you to do all of the following and more:

  • record your instructions
  • model and exemplify work
  • track the lesson
  • write down key vocabulary
  • provide prompts for writing
  • provide cues for oral contributions
  • break down tricky concepts in stages
  • sketch little diagrams to explain abstract concepts
  • mock up how you want students to present their work

3. Resist the urge to constantly help 

It is soooooo tempting, when you set your class off on a task, to dash from desk to desk to attend to the poor souls who have put their hands up to signal their confusion. I see it all the time: almost as soon as a class has been told what to do, the teacher scours the room, looking for students to ‘help’. It’s almost as if we need to justify ourselves by crouching down next to a desk with a pen in our hand and a battery of examples at the ready.

And yet most of the time, we are probably not really helping at all. At least not in the long term, where we are inadvertently creating a culture of dependency. If students really do need our help immediately after we have set them a task, then either our instructions were unclear or the task we set was too hard. Both are ultimately undesirable, and both warrant something other than manic firefighting, such as repeating instructions to the class or modelling examples for all.

4. Don’t try and squeeze things into the end of a lesson

I really loved Columbo – the scruffy, laconic detective with the dirty mac and the habit of using an apparent aside to checkmate the criminal. The ‘just one more thing’ strategy worked for Columbo, but it has never worked for me, and I doubt it works for you either. You know the situation: there are still a couple of minutes left in the lesson, and you really want to finish your point, or share one more quick example. You think it will help, but it never really does. No one is listening; minds are elsewhere. Less is always more, and the surest way to create a chaotic ending to your lesson is to try and shoehorn in one final task.

5. Try to avoid saying daft things to motivate

Whilst you may be sceptical of some of the more extravagant claims made about Growth Mindset – I know I am – you’d have to be pretty cynical to entirely dismiss the idea that what we say to students and how we say it can have a significant impact on their self-conception. Praising left, right and centre for even the most modest of responses – or even for just responding – cannot help anyone. Lavish praise sets such a low bar for achievement, and from my experience students know they are being patronised. In a similar vein, spur-of-the-moment comments designed to motivate, such as ‘top set students don’t behave like that’ or ‘A grade students really should know this’, are unhelpful and damaging. Be alert to any coded messages in your motivational asides and reprimands.

I did have a much longer list of titbits to share, but I figured I would heed my own advice and stop here.

Thanks for reading.

Quietly confident (thanks to the new A levels!)

 

Obviously, this is an ironic representation. I much prefer white wine!*

Next week my year 13 class sit their first literature exam – two short analytical essays on Hamlet, and a comparison of A Doll’s House and Christina Rossetti’s poetry. For the first time in a long while – perhaps ever – I have not run any one-to-one sessions or taught any additional after-school revision classes. My students have not written hundreds of essays, or emailed me constantly in my holidays with questions or additional work to mark.

And yet, by Jove, I think they are ready.

Obviously, time will tell, and I am aware of the hubris I am inviting by publicly asserting my confidence in their readiness. It may well be that Kris will underperform, or that Rose will not fulfil her potential. In either eventuality, however, I don’t think I will feel any regret about my teaching or the approach that I have taken. They are all ready; I don’t think there is anything more I could have done!

Things have not always been this way, though, and I have not always felt quite so calm at this time of year. There are probably two reasons why I am feeling sanguine. The first is experience. This is my 13th A2 class, and with each passing year I become a little less caught up in the exam season frenzy. I care a great deal about my students, but I care much more about my own children. I do what I can with the time I have available, which has decreased since I became a dad – and I get more tired.

The second, arguably more significant reason for my relative confidence is, believe it or not, down to the linear nature of the new examinations, and, in particular, our school’s decision not to bother with any interim AS exams. For maybe the first time in my career – I had two year 11 classes, a year 12 class and a year 13 group in my NQT year! – I have been able to teach the curriculum properly and with fidelity to the principles of how students learn best.

Most years I pick up exam classes and have the (dubious) pleasure of preparing students for exams in only a few months’ time. There are usually stacks of poems to learn and lots of coursework to get through. What I believe about student learning goes out the window, in favour of short-term performance wins. Even with year 12, I am often unable to teach like a research champion because of the reductive nature of unit assessment.

Last year, I wrote of the joy I was experiencing with the greater freedoms afforded by linearity, and this has only continued since. I have been able to properly embed a range of strategies and for once feel like, along with the reduction in the number of texts on the syllabus, there is enough time to properly explore texts, as well as get meaningfully into contextual factors, different theatrical interpretations and theoretical approaches.

Knowledge

Take Hamlet. Under the previous modular system, in one term there would only be enough time to read the text together once as a class, simultaneously trying to get to grips with characters, events and emerging themes, whilst also analysing key passages and relating ideas to contextual details. Talk about cognitive overload.

This time, and with my present year 12 class too, I have been able to read the play multiple times and to watch several different interpretations. On each sweep, I have been able to focus on particular things: character, plot and basic ideas first time round; close analysis of key scenes the next; wider interpretations and theoretical readings on later sweeps. We finished the course at Easter, and have been revisiting it ever since.

Spacing and Interleaving

As well as being able to return to the texts multiple times, the new linear A level has provided opportunities to space out readings and interleave them with other content. So, for example, after reading Hamlet for plot and character, we were able to study some Rossetti poems and make a start on the coursework. Returning to each set text – with frequent quizzing in between – seems to have strengthened student understanding.

Quizzing

Without the pressure of rushing through lots of content – or worse, missing out swathes – there has been time to build in systematic quizzing. At the start of every lesson I am able to test students on their knowledge and understanding, creating regular retrieval practice as well as opportunities for valuable formative assessment. Crucially, I have had the time to address any misconceptions and explain things again if necessary.

Deliberate Practice

By far the biggest impact the new two-year A Level has had on my teaching is the time it has provided for developing the quality of students’ writing. For quite a while now, I have been delaying getting students to write. Long gone are the days of reading a couple of scenes or a few chapters and then manufacturing an exam-style task just so students get to do an essay. It’s a written subject, so there must be lots of extended writing, right?

Actually, no. As the experience of the last few years has shown me – particularly with my current cohort – endless essay writing does not maketh the literature student. What it does maketh is a mountain of substandard work for the downtrodden teacher who has to then dutifully mark it, often to little or no avail. Whilst they were in year 12, I hardly set my students any essays, focusing instead on developing their knowledge base and engaging in deliberate practice of specific sentence types, such as thesis statements.

Only in the last few months have my class been writing whole essays. What has struck me is how quickly their essays have developed. Usually, it would be quite a while before I would see an uplift in style, argument and depth of analysis, but this year, my students have made much more progress much more quickly. I genuinely think that knowing more about the texts has increased their confidence and allowed them to articulate themselves more coherently. The depth of their arguments is noticeable.

Final word

I don’t want to overplay things. I am certainly not suggesting my students will get extraordinary results because of anything extraordinary that I have done. Some will do very well; some will do as expected; others may end up disappointed. ‘Twas ever thus.

What I think, and hope, is different this time, is that my students will have got their results without having to complete endless mock examinations, come back every week after school for weeks on end, or knock out an unrealistic number of essays. I also think that a lot more of what they have learnt will last beyond the exam, which I am not sure I can say, hand on heart, has always been the case.

More than anything, though, the changes to specification and linearity have meant that I have been able to teach in a way that is efficient and sustainable, for my students and for me. Much of their success will come down to how well they have applied themselves and, of course, to how well things go on the day itself. These things are largely beyond my control, and whilst I will naturally be disappointed for any that underachieve, I will not have any regrets about how well I have prepared them.

I have done my best for other people’s children, without having had to sacrifice valuable time with my own.

This is what teaching should be like for all teachers, whether parents or not.

 

* image taken from: http://www.altonivel.com.mx/42105-13-personajes-que-no-debes-contratar/

 

 

Principles of Great Assessment #3: Reliability

This is the third and final post of my three-part series on the principles of great assessment. In the first post I focused on the principles of assessment design, and in the second on principles relating to issues of fairness and equality. This final post attempts to get to grips with principles relating to reliability and to making assessments provide useful information about student attainment. I have been putting off this post because whilst I recognise how important reliability is in assessment, I know how hard it is to get to grips with, let alone explain to others. I have done my best to synthesise the words and ideas of others. I hope it helps lead to the better use of assessment in schools.

Here are my principles of great assessment 11-16

11. Define standards through questions set

The choice of questions set in an assessment is important, as they ultimately define the standard of expectation, even in cases where the prose descriptors appear secure. Where the rigour of the questions set by different teachers varies, problems occur and inaccurate inferences are likely to be drawn. The following example from Dylan Wiliam, albeit extreme, illustrates this relationship between questions and standards.

Task: add punctuation to the following sentence to make it grammatically correct

John where Paul had had had had had had had had had had had a clearer meaning.

This question could feasibly be set to assess students’ understanding of grammar, in particular their knowledge of how commas and apostrophes are used to clarify meaning, which on the surface seems a relatively tight and definitive statement. Obviously, no right-minded teacher would ever set such an absurdly difficult example, which most of us, including English teachers, would struggle to answer correctly*. But what it highlights is the problems that can arise when teachers deploy their own understanding of the required standards independently.

A teacher setting the above question would clearly have sky-high expectations of their students’ grammatical understanding, or supreme confidence in their own teaching! More realistically, a question assessing students’ grammatical ability would look more like the example below, which requires a far lower level of grammatical understanding.

Task: add punctuation to the following sentence to make it grammatically correct

John went to the beach with his towel his bucket his swimming trunks and his spade.

All this is yet more reason why summative assessments should be standardised. It simply cannot be that the questions some students face demand significantly greater knowledge and understanding than those faced by others who have been taught the same curriculum. The questions used in tests of this nature should be agreed upfront and aligned with the curriculum to remain stable each year. This is, of course, really difficult in practice: teachers may start teaching to the test, and thus invalidate the inferences from the assessment, or the questions set one year may not be of the same standard as those set previously, making year-on-year comparisons difficult.

12. Define standards through exemplar pupil work

As well as defining standards through questions, standards can also be defined through student work. Using examples of work to exemplify standards is far better than defining those same expectations through the abstraction of rubrics. As we have seen, not only do rubrics tend to create artificial distinctions between levels of performance, but the descriptions of these performances are more often than not meaningless in isolation. One person’s notion of detailed and developed analysis can easily be another’s highly sophisticated and insightful evaluation. As Hamlet says to Polonius, they are just ‘words, words, words’. They only mean something when they are applied to examples.

Whether we like it or not, we all carry mental models of what constitutes excellence in our subject. A history teacher knows when she sees a great piece of historical enquiry; she doesn’t need a set of performance descriptors to tell her it demonstrates sound understanding of the important causes and effects explained in a coherent way. She knows excellence because she has seen it before and it looked similar. Perversely, performance descriptors could actually lead her to lower the mark she awards, particularly if the descriptors are too formulaic and reductive, which seems to be the problem with KS2 mark schemes: the work includes all the prescribed functional elements, but the overall piece is not fluent, engaging or ambitious.

Likewise, the same history teacher knows when something has fallen short of what is required because it is not as good as the examples she has seen before that did meet the standard – the ones that shape the mental model she carries of what is good. On their own, rubrics really don’t tell us much, and though we may think they are objective, in reality we are still drawing upon our mental models whenever we make judgements. Even when the performance descriptors appear specific, they are never as specific as an actual question being asked, which ultimately always defines the standard.

If objective judgement using rubrics is a mirage, we are better off spending our time developing mental models of what constitutes the good, the bad and the ugly through exemplar work, rather than on abstract prose descriptors we will only misunderstand. We should also look to shift emphasis towards the kinds of assessment formats that acknowledge the nature of human judgement, namely that all judgements are comparisons of one thing with another (Laming, 2004). In short, we should probably include comparative judgement in our assessment portfolio to draw reliable judgements about student achievement and make the intangible tangible.

13.  Share understanding of different standards of achievement

Standardisation has been a staple of subject meetings for years. In the days of National Curriculum Levels and the National Literacy Strategy, English teachers would pore over numerous examples of levelled reading and writing responses. At GCSE and A level in other subjects, I am sure many department meetings have been given over to discussing relative standards of bits of student work. From my experience, these meetings are often a complete waste of time. Not only do teachers rarely agree on why one piece of writing with poor syntax and grammar should gain a level 5, but we rarely alter our marking after the event anyway. Those that are generous remain generous, and those that are stingier continue to hold back from assigning the higher marks.

The main problem with these kinds of meetings is their reliance on rubrics and performance descriptors, which as we have seen fail to pin down a common understanding of achievement. The other problem is that they fail to acknowledge the fundamental nature of human judgement, namely that we are relativist rather than absolutist in our evaluations. Since we are probably never going to fully agree on standards of achievement, such as the quality of one essay over another, we are probably better off looking at lots of different examples of quality and comparing their relative strengths and weaknesses directly, rather than diluting the process by recourse to nebulous mark schemes.

Out of these kinds of standardisation meetings, with teachers judging a cohort’s work together, can come authentic forms of exemplified student achievement – ones that have been formed by a collective comparative voice, rather than by a well-intentioned individual attempting to reduce the irreducible to a series of simplistic statements. Software like No More Marking is increasingly streamlining the whole process, and the nature of the approach itself lends itself much better to year-on-year standards being maintained with more accuracy. Comparative judgement is not fully formed just yet, but as today’s report into the recent KS2 trial shows, there is considerable promise for the future.

14.  Analyse effectiveness of assessment items

As we have established, a good assessment should distinguish between different levels of attainment across the construct continuum. This means that we would expect an assessment to include questions that most students could answer, and others that only those with the deepest understanding could respond to correctly. Obviously, there will always be idiosyncrasies. Some weaker students sometimes know the answer to more challenging questions, and likewise some stronger students do not always know the answer to the simpler questions. This is the nature of assessing from a wide domain.

What we should be concerned about in terms of making our assessments as valid and reliable as possible, however, is whether, in the main, the items on the test truly discriminate across the construct continuum. A good assessment should contain harder questions that pick out students with stronger knowledge and understanding. If that is not the case, then something probably needs to change, either in the wording of the items or in realigning teacher understanding of what constitutes item difficulty.

How to calculate the difficulty of assessment items:

Step one: rank items in order of perceived difficulty (as best you can!)

Step two: work out the average mark per item by dividing the total marks awarded for each item by the number of students.

Step three: for items worth more than 1 mark, divide the average score per item by the number of marks available for it.

Step four: all item scores should now be a value between 0 and 1. High values indicate the item is relatively accessible whilst low values indicate the item is more difficult.

This is the formula in Excel to identify the average score of an individual item:

=SUM(B3:B8)/(COUNT(B3:B8)*B9)

On an assessment with a large cohort of students we would expect to see a general trend of average scores going down as item difficulty increases, i.e. a lower percentage of students answering them correctly. Whilst it would be normal to expect some anomalies – after all, ranking items on perceived difficulty is not an exact science and is ultimately relative to what students know – any significant variations would probably be worth a closer look.
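If spreadsheets are not your thing, the same calculation is straightforward to script. Below is a minimal sketch in Python of steps two to four, using entirely made-up scores for illustration – nothing about the data or the function name comes from a real assessment.

def item_facility(item_scores, max_marks):
    # Average mark on the item divided by the marks available for it,
    # giving a value between 0 (nobody scored) and 1 (everyone got full marks).
    return sum(item_scores) / (len(item_scores) * max_marks)

# Rows = students, columns = items ordered by perceived difficulty (step one).
scores = [
    [1, 2, 3, 0],   # student A
    [1, 2, 1, 1],   # student B
    [1, 1, 2, 0],   # student C
    [0, 2, 0, 0],   # student D
]
marks_available = [1, 2, 4, 3]   # marks available for each item

for i, marks in enumerate(marks_available):
    column = [row[i] for row in scores]
    print(f"Item {i + 1}: facility = {item_facility(column, marks):.2f}")

On a healthy assessment the printed values should broadly fall as you move through the items, mirroring the trend described above.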

How to calculate item discrimination

There are different ways of measuring the extent to which an item distinguishes between more and less able students. Perhaps the easiest of these uses the discrimination index.

Step One: Select two groups of students from your assessment results – one with higher test scores and one with lower test scores. This can either be a split right down the middle, or a sample at both extremes, such as one group in the top third of total results and one group in the bottom third.

Step Two: Subtract the sum of the scores on the item for the low-scoring group from the sum of the scores for the high-scoring group, then divide the result by the number of students in the high-scoring group multiplied by the marks available for the question.

This is the formula to use in Excel:

=(SUM(B5:B7)-SUM(B8:B10))/(COUNT(B5:B7)*B11)

The discrimination index is essentially the percentage of students in the high test score group who answer the item correctly minus the percentage of students in the low test score group who answer it correctly. It operates on a range between -1 and +1, with values close to +1 indicating the item discriminates well between high and low ability students for the construct being assessed.

Values near zero suggest that the item does not discriminate between high and low ability students, whilst values near -1 suggest that the item is quite often answered correctly by students who do the worst on the assessment as a whole and conversely incorrectly by those who score the best results on the overall assessment. These are therefore probably not great items.
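For completeness, here is the same calculation sketched in Python rather than Excel. The data is invented, the top and bottom thirds are picked from total test scores as in step one, and the arithmetic matches the Excel formula above.

def discrimination_index(item_scores, test_totals, max_marks):
    # Rank students by their total test score, then compare the top third
    # with the bottom third on this one item.
    ranked = sorted(item_scores, key=lambda s: test_totals[s], reverse=True)
    third = len(ranked) // 3
    high, low = ranked[:third], ranked[-third:]
    high_sum = sum(item_scores[s] for s in high)
    low_sum = sum(item_scores[s] for s in low)
    return (high_sum - low_sum) / (len(high) * max_marks)

# Invented marks on a single 3-mark item, plus each student's overall test total.
item_scores = {"A": 3, "B": 2, "C": 2, "D": 1, "E": 1, "F": 0}
test_totals = {"A": 48, "B": 41, "C": 35, "D": 30, "E": 22, "F": 15}

print(discrimination_index(item_scores, test_totals, max_marks=3))
# Roughly +0.67 here: the strongest students do much better on this item.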

15.  Increase assessment reliability (but not at the expense of validity)


Reliability in assessment is about consistency of measurement over time, place and context. The analogy often used is to a pair of weighing scales. When someone steps on a pair of scales, whether in the bathroom or the kitchen, they expect the measurement of their weight to be consistent from one reading to the next, particularly if their diet is constant. This is the same as reliability in assessment: the extent to which a test produces consistent outcomes each time it is sat. In the same way you wouldn’t want your scales to add or take away a few pounds every time you weigh in, you wouldn’t want a test to produce wildly different results every time you sat it, especially if nothing had changed in your weight or your intelligence.

The problem is that it is impossible to create a completely reliable assessment, particularly if we want to assess things that we value, like the quality of extended written responses, which, as we have already discussed, can be very subjective, and we don’t want our students to sit hundreds of hours’ worth of tests. We can increase reliability, but it often comes at a price, such as in terms of validity (assessing the things that we believe represent the construct), or in time, which is finite and can be used for other things, like teaching.

What is reliability?

There are two ways of looking at the reliability of an assessment – the reliability of the test itself, or the reliability of the judgements being made by the judges. Reliability can be calculated by comparing two sets of scores for a single assessment (such as rater scores with comparative judgement) or two scores from two tests that assess the same construct. Once we have these two sets of scores, it is possible to work out how similar the results are using a statistic called the reliability coefficient.

The reliability coefficient is the numerical index used to talk about reliability. It ranges from 0 to 1. A number closer to 1 indicates a high degree of reliability, whereas a low number suggests some error in the assessment design, or more likely one of the factors identified from the Ofqual list below. Reliability is generally considered good or acceptable if the reliability coefficient is in or around .80, though as Rob Coe points out (see below), even national examinations, with all their statistical know-how and manpower, only get as high as 0.93! And that was just the one GCSE subject.

How to identify the reliability of an assessment

There are four main ways to identify the reliability of an assessment, each with their own advantages and disadvantages and each requiring different levels of confidence with statistics and spreadsheets. The four main methods used are:

  • Test–retest reliability
  • Parallel forms reliability
  • Split-half reliability
  • Internal-consistency (Cronbach’s alpha)

Test-retest reliability

This approach involves setting the same assessment with the same students at different points in time, such as at the beginning and end of a term. The correlation between the results that each student gets on each sitting of this same test should provide a reliability coefficient. There are two significant problems with this approach, however. Firstly, there is the problem of sensitivity to instruction. It is likely that students would have learnt something between the first and second administrations of the test, which might invalidate the inferences that can be drawn and threaten any attempt to work out a reliability score.

The other, arguably more significant, issue relates to levels of student motivation. I am guessing that most students would not really welcome sitting the same test on two separate occasions, particularly if the second assessment is soon after the first, which would need to happen in order to reduce threats to validity and reliability. Any changes to how students approach the second assessment will considerably affect the reliability score and probably make the exercise a complete waste of time.

Parallel forms reliability

One way round these problems is to design a parallel forms assessment. This is basically where one assessment is made up of two equal parts (parallel A and parallel B), with the second half (parallel B) performing the function of the second assessment in the test-retest approach outlined above. As with test-retest, correlations between student results from the parallel A and parallel B parts of the test can provide a reliability figure. The problem now is that, in reality, it is difficult to create two sections of an assessment of equal challenge. As we have considered, challenge lies in the choice of a question, and even the very best assessment designers don’t really know how difficult an item is until real students have actually tried answering it.

Split-half reliability

Perhaps the best way to work out the reliability of a class assessment, and the one favoured by Dylan Wiliam, is the split-half reliability model. Rather than waste time attempting the almost impossible – creating two forms of the same assessment of equal difficulty – this approach skirts round the problem by dividing a single assessment in half and treating each half as a separate test.

There are different ways the assessment can be divided in half, such as a straight split down the middle or separating out the odd and even numbered items. Whatever method is used, the reliability coefficient is worked out the same way: by correlating the scores on the two parts and then taking account of the fact that this only relates to half the test by applying the Spearman-Brown formula**. This then provides a reasonable estimate of the reliability of an assessment, which is probably good enough for school-based assessment.

The formula for applying Spearman-Brown in Excel is a little beyond the scope of my understanding. Fortunately, there are a lot of tools available on the Internet that make it possible to work out reliability scores using Spearman-Brown’s formula. The process involves downloading a spreadsheet and then inputting your test scores into cells containing pre-programmed formulas. The best of these is, unsurprisingly, from Dylan Wiliam himself, which is available to download here. Rather handily, Dylan also includes some super clear instructions on how to use the tool. Whilst there are other spreadsheets available elsewhere that perform this and other functions, they are not as clean and intuitive as this one.
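For anyone who prefers code to spreadsheets, here is a minimal sketch of the split-half approach in Python. It is my own illustration, not Dylan Wiliam’s tool: the scores are invented, the test is split into odd and even items, the two half-test totals are correlated, and the Spearman-Brown correction (reliability = 2r / (1 + r)) adjusts for the fact that each half is only half the length of the real test. Note that statistics.correlation needs Python 3.10 or later.

from statistics import correlation

# Rows = students, columns = items (marks per item on a single assessment).
scores = [
    [2, 1, 3, 2, 1, 0],
    [1, 1, 2, 1, 0, 0],
    [3, 2, 4, 3, 2, 1],
    [2, 2, 3, 2, 1, 1],
    [0, 1, 1, 0, 0, 0],
]

# Split the test into odd- and even-numbered items and total each half.
odd_half = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]

# Correlate the two half-test scores, then apply Spearman-Brown.
r = correlation(odd_half, even_half)
reliability = 2 * r / (1 + r)
print(f"half-test correlation = {r:.2f}, estimated reliability = {reliability:.2f}")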

Internal-consistency reliability (Cronbach’s alpha)


At this point, I should point out that I am fast approaching the limits of my understanding in relation to assessment, particularly with regards to the use of statistics. Nevertheless, I think I have managed to get my head around internal-consistency reliability enough to use some of the tools available to work out the reliability of an assessment using Cronbach’s alpha. In statistics, Cronbach’s alpha is used as an estimate of the reliability of a psychometric test. It provides an estimate of internal-consistency reliability and helps to show whether or not all the items in an assessment are assessing the same construct. Unlike the easier to use – and understand – split-half reliability, Cronbach’s alpha looks at the average value of all possible split-half estimates, rather than just the one split you happen to have chosen.

It uses this formula:

α = (k / (k − 1)) × (1 − Σσ²ᵢ / σ²ₓ), where k is the number of items, σ²ᵢ is the variance of the scores on item i, and σ²ₓ is the variance of the total test scores.

If, like most people, you find this formula intimidating and unfathomable, seek out one of the many online spreadsheets set up with Cronbach’s alpha and ready for you to enter your own assessment data into the cells. Probably the most straightforward of these can be found here. It is produced by Professor Glenn Fulcher and it allows you to enter assessment results for any items with a mark of up to 7. There are instructions that tell you what to do, and they are quite easy for the layman to follow.
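Alternatively, if you would rather see the working than trust a pre-programmed spreadsheet, Cronbach’s alpha is only a few lines of Python. This is a minimal sketch with invented scores, using population variance throughout, and it follows the formula above directly.

from statistics import pvariance

def cronbach_alpha(scores):
    # scores: one row per student, one mark per item in each row.
    k = len(scores[0])                       # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented marks: rows = students, columns = items.
scores = [
    [2, 1, 3, 2, 1, 0],
    [1, 1, 2, 1, 0, 0],
    [3, 2, 4, 3, 2, 1],
    [2, 2, 3, 2, 1, 1],
    [0, 1, 1, 0, 0, 0],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")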

16. Make sure everyone understands the limitations of assessment

Given that no school assessment which measures the things we value or involves any element of human judgement is ever likely to be completely reliable, the time has probably come to be more honest about this with the people most impacted by summative tests, namely the students and their parents. The problem is that in reality this is incredibly hard to do. As Rob Coe jokes, can anyone imagine a teacher telling a parent that their child’s progress, say an old NC level 5, is accurate to a degree of plus or minus one level? Most teachers probably haven’t even heard of the standard error of measurement, let alone understand its impact on assessment practice well enough to explain it to a bewildered parent.

The US education system seems rather more advanced than ours in relation to reporting issues of error and uncertainty in assessment to parents. This is a consequence of the Standards for Educational and Psychological Testing (1999). These lay out the extent to which measurement uncertainty must be reported to stakeholders, which US courts follow in their rulings and test administrators account for in their supplementary technical guides.

A 2010 report commissioned by Ofqual into the way assessment agencies in the US report uncertainty information when making public the results of their assessments showed an impressive degree of transparency in relation to sharing issues of test score reliability. Whilst the report notes that parents are not always directly given the information about assessment error and uncertainty, the information is always readily available to those who want it, providing of course they can understand it!

‘Whether in numbers, graphics, or words, and whether on score reports, in interpretive guidelines (sometimes, the concept is explained in an “interpretive guide for parents”), or in technical manuals, the concept of score imprecision is communicated. For tests with items scored subjectively, such as written answers, it is common, too, to report some measure of inter-rater reliability in a technical manual.’

To my knowledge we don’t really have anything like this level of transparency in our system, but I think there are a number of things we can probably learn from the US about how to be smarter about sharing with students and parents the complexity of assessment and the inferences it can and cannot provide us with. I am not suggesting that the example below is realistic for an individual school to replicate, but I like the way that it at least signals the scope for grade variation by including confidence intervals in each of its assessment scores.

[Example of a US score report, with a confidence interval shown around each assessment score]

There is clearly much we need to do to educate ourselves about assessment, and then we may be better placed to educate those who are most affected by the tests that we set.

The work starts now.

*  The answer to the question is: John, where Paul had had ‘had’, had had ‘had had’. ‘Had had’ had had a clearer meaning.

** The Spearman–Brown prediction formula, also known as the Spearman–Brown prophecy formula, is a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length.