About Phil Stock

Assistant Headteacher, Professional Development and Language. Interested in education, spending time with my family and British Military Fitness - all views are my own. @joeybagstock

On Poetry II – Poetry and the poetic

Last year Bob Dylan won the Nobel Prize in Literature for ‘creating new poetic expressions within the great American song tradition.’ Whilst I can appreciate the splendour and immediacy of his lyrics, and the gruff poetic beauty of his rolling voice, I don’t think he is a poet or that his songs should be considered poetry, at least not in terms of poetry written for the page and for private contemplation.

This probably sounds a bit dismissive of Dylan’s craft and shows a lack of respect and appreciation for all he has done for music over the past few decades. I can already hear the knives sharpening from those who believe that Dylan is a poet, which would only intensify were I to question the credentials of artists like Morrissey, Nick Cave or Jarvis Cocker who are also commonly referred to as poets.

The art of these writers is without question; their contribution to culture undeniable. To say the likes of Bob Dylan are not poets is not, though, to denigrate their achievements or to call into question their artistry, but to recognise the difference between song lyrics and poetry. Many of their lyrics are clearly poetic, but they are not really poetry.

Poet Glyn Maxwell has a simple exercise that makes it clear how poetry is fundamentally different from song lyrics. It involves writing out the lyrics of your most cherished song and then reading them bare – just the words on the page. In every instance the effect is striking. As Maxwell observes, ‘if you strip the music off it it dies in the whiteness, can’t breathe there. Without the music there is nothing to mark time, to act for time.’ Great songs need music; great poems do not – they generate their own.

You that build all the bombs

You that hide behind walls

You that hide behind desks

I just want you to know

I can see through your masks

‘Masters Of War’ – Bob Dylan (1963)


Yes, I wish that for just one time

You could stand inside my shoes

You’d know what a drag it is to see you

‘Positively 4th Street’ – Bob Dylan (1965)

This matters to how we approach the teaching of poetry. I’ve often seen students led into poetry through the medium of song. The implication is that poetry cannot be enjoyed on its own terms, only by being brought into the orbit of something more familiar. Nothing wrong with making things relevant, I hear you cry. Well, yes, sometimes. The problem here is the message it sends out about the status of poetry – that it’s just like songs – and the misconceptions about meter it creates further down the line.

Chief among these approaches is the use of rap music. I have witnessed many a lesson in which Eminem or Dre is used to inspire students to study poetry. Aside from the perennial danger of trying to be down with the kids – it never works – there is the danger of misleading students about the nature of poetry, and setting up problems when we want to turn to the technical nitty-gritty of rhythm and rhyme.

If song lyrics get lost in the white wilderness, then rap lyrics disappear altogether – all the energy, anger and delight of the rhythm and rhyme vanishes. Without the beat, there is nothing. The lyrics look daft; they are not strong enough to withstand the encroaching whiteness. Song lyrics, however slight, need some accompaniment, whether a guitar, a beat, or even the voice itself recast as instrument. Poems generate their own music, but songs need rhythms from elsewhere and the presence of the performer.

Look, if you had, one shot, or one opportunity

To seize everything you ever wanted. In one moment

Would you capture it, or just let it slip?


‘Lose Yourself’ – Eminem (2002)

Robert Frost understood this difference between poetry and the poetic. In 1913, when he first met Edward Thomas in Harold Monro’s Poetry Bookshop, he knew he’d come across a genuine poet, even though Thomas had yet to write any verse. Thomas read and wrote prodigiously. By the time the pair met, he had already published some two dozen books and written almost 2,000 commissioned pieces, including a great deal of nature writing.

The next day was the missel-thrush’s and the north-west wind’s. The missel-thrush sat well up in a beech at the wood edge and hailed the rain with his rolling, brief song: so rapidly and oft was it repeated that it was almost one long, continuous song. But as the wind snatched away the notes again and again, or the bird changed his perch, or another answered him or took his place, the music was roving like a hunter’s.

 from In Pursuit of Spring – Edward Thomas (1914)

Thomas wrote poetically, but he didn’t write poems. In his nature writing, Frost saw the potential for Thomas to turn his poetic prose cadences into the music of poetry. He badgered Thomas to take his eye and ear for nature and turn it into verse. The poems came thick and fast, with some 70 or so written in the first six months of 1915. Often Thomas returned to the notebooks he kept from his long walks in the Gloucestershire countryside, or to the published prose pieces that they begat. The result was something fundamentally different. Poetry.

What did the thrushes know? Rain, snow, sleet, hail,

Had kept them quiet as the primroses.

They had but an hour to sing. On boughs they sang,

On gates, on ground; they sang while they changed perches

And while they fought, if they remembered to fight:

So earnest were they to pack into that hour

Their unwilling hoard of song before the moon

Grew brighter than the clouds.

From ‘March’ – Edward Thomas

The poems that Edward Thomas produced before his tragic death from a shell blast on the first day of the Battle of Arras are, in my opinion, both beautiful and brilliant. That is not to say that they are necessarily any more beautiful or brilliant than anything penned by Eminem or Dylan, but simply to recognise that they are different in what they achieve and how they go about achieving it. If we fail to acknowledge this distinction and rely instead on seguing from song lyrics to poetry, we effectively undermine the orientations of both forms.

So let’s not try to call everything that is poetic poetry, or we end up diminishing the rich tapestry of aesthetic expression, whilst devaluing the skill of the poet – the skill to move through word and sound with nothing more than inky black marks on white open space.

Much is poetic; precious little is poetry.

On Poetry I: What is this thing we call a poem?

It was a typical day at university for Professor Stanley Fish. He had just finished teaching his linguistics class. Some of the names of linguists he had been discussing with his students were still on the board when his next class started to arrive for their literature seminar. Fish decided to make one small change between classes. He drew a box round the assignment details and wrote p43 at the top.
The list now looked something like this:

[Screenshot: the boxed list of linguists’ names, with ‘p43’ written above]

Fish’s next move was simple but significant. He told his literature students that there was a religious poem on the board, similar to the ones they had been studying the past few weeks, and he then invited them to interpret its meaning. The students duly obliged and it wasn’t long before they were offering all kinds of interpretations, from initial readings of the poem as a hieroglyph to highly convincing interpretations of the symbolism of the Hebrew names Jacob, Rosenbaum, and Levin.

What Stanley Fish had stumbled on, and what he found on every occasion he repeated the trick, was the reality of how readers tackle the act of interpretation. His little teaching sleight of hand had revealed that readers do not approach literary works as isolated individuals but rather as part of a community of readers. As he writes in Is There a Text in This Class?, ‘it is interpretive communities, rather than either the text or reader, that produce meanings.’

In essence, Fish’s literature students did what literature students do in a classroom situation: they interpret the text put in front of them by looking for allusions and patterns of meaning, regardless of whether they are even there. The more the students interpreted specific parts of the poem, the more they convinced themselves that they had built a coherent sense of its overall meaning. The only problem, of course, was that it was all nonsense. There was no poem and therefore no meaning!

At no point did any of Fish’s students question the validity of the text itself, or whether what they were interpreting was even a poem. Because they were working in the context of a literature class, in the presence of a professor of literature and confronted with what looked like a poem, they assumed it was a poem and without thinking they adopted the rules for interpreting obscure religious verse they had learned – rules they had clearly internalised from years of making inferences about literary texts.

Now, we could lament the way that a bunch of hitherto bright students could be so uncritical in their approach to reading. We could even despair at how cultural relativism has reached such a nadir that a simple list of linguists could be mistaken for a profound religious poem. I think, however, this misses the point. As Fish notes, this is ultimately how we approach reading all texts, literary or not – as a community. Even to interpret a list of linguists as a list requires a shared understanding of the concepts of seriality, hierarchy and subordination. This is the nature of interpreting meaning from text.

I think there are some lessons to draw from Fish’s work in relation to teaching and, more specifically, to curriculum design. The first is to recognise the responsibility we have in selecting the texts we teach. We should make sure that what students will be interpreting has substance, both in terms of its intrinsic value and its utility. Mark Roberts has written about the failure of poems like ‘Tissue’ to do either of these things well. I’ve never taught ‘Tissue’, but as long as I can remember there has always been quite a bit of guff like that in the GCSE anthology, most of it sadly of the contemporary variety.

Don’t get me wrong, I am not against modern verse per se, and I am certainly not suggesting we should avoid all forms of contemporary literature. That said, I don’t think GCSE students should be wasting their time interpreting poems like ‘Tissue’. The funny thing is that most of the students I have taught seem to share a similar view. I always think classes will respond much better to poems like ‘Brendon Gallacher’, ‘Blessing’ and ‘Kid’ but actually when they write about ‘My Last Duchess’ or a Shakespeare sonnet they have much more to say and they say it with much greater conviction.

The second important lesson we can learn from Stanley Fish’s work on interpretative communities relates to the order in which we teach students the poems that we select. I’m guessing that one of the main reasons Fish’s students so readily interpreted a list of linguists as a religious poem was because they were used to seeing poems that looked like that, namely without a clear form or discernible structure – they understood the free verse style that characterises much of the poetry of the last century, and which has dominated the contents of many an anthology since.

Whilst Fish’s students may have mistakenly treated his list of names as a poem, they would probably have understood why a poem that doesn’t rhyme or contain any clear poetic structure could still be considered a poem. They would be familiar with poets who broke with formal conventions, like e.e. cummings, Sylvia Plath and William Carlos Williams, and would have learnt the reasons for these literary developments. In short, they would have in mind some kind of literary chronology, which is perhaps something we should bear in mind when designing the spread of a five-year curriculum.

Perhaps most importantly, I think Fish’s example highlights a need for us to consider how we approach teaching poetry, particularly in a clear and systematic way that builds upon the work of KS2 teachers. I wonder if one of the reasons why Fish’s students were misled by a mere list is that they had never really been encouraged to take a step back when approaching a new text – to appreciate its overall beauty; to consider it at a conceptual or formal level before diving straight in to try to account for it and locate its meaning. Maybe whenever they were presented with a poem at school, they were immediately asked to interpret it or provide some kind of emotional response.

This is all well and good, and I do this kind of thing regularly. This year, however, I have been teaching a year 7 class for the first time in ages, which has given me the opportunity to begin to think through how I might teach things like poetry a little differently, by which I mean to teach students a conceptual appreciation of poetry as well as an emotional and technical understanding. I want them to be able to infer meaning, but also to comment on different forms and how these might be linked to developments in artistic expression and philosophy. A more holistic approach to understanding.

This is obviously hard. It is so tempting to introduce a poem and start to elicit ideas about its meaning, but this might be putting the cart before the horse, particularly with poems where the structural and/or formal features are central to understanding what the poem is trying to achieve. I wonder whether, whilst many of us are reviewing our KS3 assessments, we should recognise that here we have a unique opportunity to influence the workings of literary interpretation from within that interpretative community. There are enough of us, and we have sufficient time, to significantly improve the way we teach our students to read and approach poetry, or indeed any text for that matter.

Who knows, if we got things right from the off, by the time they were in year 11, our students might even be able to understand the difference between a metaphor and a simile.






Principles of Great Assessment #3: Reliability

This is the third and final post in my three-part series on the principles of great assessment. In the first post I focused on principles of assessment design, and in the second on principles relating to issues of fairness and equality. This final post attempts to get to grips with principles relating to reliability and to making assessments provide useful information about student attainment. I have been putting off this post because, whilst I recognise how important reliability is in assessment, I know how hard it is to get to grips with, let alone explain to others. I have tried my best to synthesise the words and ideas of others. I hope it helps lead to the better use of assessment in schools.

Here are my principles of great assessment 11-16

11. Define standards through questions set

The choice of the questions set in an assessment is important, as the questions ultimately define the standard of expectation, even in cases where the prose descriptors appear secure. Where the rigour of the questions set by teachers varies, problems occur and inaccurate inferences are likely to be drawn. The following example from Dylan Wiliam, albeit extreme, illustrates this relationship between questions and standards.

Task: add punctuation to the following sentence to make it grammatically correct

John where Paul had had had had had had had had had had had a clearer meaning.

This question could feasibly be set to assess students’ understanding of grammar, in particular their knowledge of how commas and apostrophes are used to clarify meaning, which on the surface seems a relatively tight and definitive statement. Obviously, no right-minded teacher would ever set such an absurdly difficult example, which most of us, including English teachers, would struggle to answer correctly*. But what it highlights is the problems that can arise when teachers deploy their own understanding of the required standards independently.

A teacher setting the above question would clearly have sky-high expectations of their students’ grammatical understanding, or supreme confidence in their own teaching! More realistically, a question assessing students’ grammatical ability would look like the example below, which requires a far lower level of grammatical understanding.

Task: add punctuation to the following sentence to make it grammatically correct

John went to the beach with his towel his bucket his swimming trunks and his spade.

All this is yet more reason why summative assessments should be standardised. It simply cannot be that the questions some students face demand significantly greater knowledge and understanding than others who have been taught the same curriculum. The questions used in tests of this nature should be agreed upfront and aligned with the curriculum to remain stable each year. This is, of course, in practice really difficult: teachers may start teaching to the test, and thus invalidate the inferences from the assessment, or the question set one year is not of the same standard as the ones previously, thus making year on year comparisons difficult.

12. Define standards through exemplar pupil work

As well as defining standards through questions, standards can also be defined through student work. Using examples of work to exemplify standards is far better than defining those same expectations through the abstraction of rubrics. As we have seen, not only do rubrics tend to create artificial distinctions between levels of performance, but the descriptions of these performances are more often than not meaningless in isolation. One person’s notion of detailed and developed analysis can easily be another’s highly sophisticated and insightful evaluation. As Hamlet tells Polonius, they are just ‘words, words, words’. They only mean something when they are applied to examples.

Whether we like it or not, we all carry mental models of what constitutes excellence in our subject. A history teacher knows when she sees a great piece of historical enquiry; she doesn’t need a set of performance descriptors to tell her it demonstrates sound understanding of the important causes and effects explained in a coherent way. She knows excellence because she has seen it before and it looked similar. Perversely, performance descriptors could actually lead her to lower the mark she awards, particularly if it is too formulaic and reductive, which seems to be the problem with KS2 mark schemes: the work includes all the prescribed functional elements, but the overall piece is not fluent, engaging or ambitious.

Likewise, the same history teacher knows when something has fallen short of what is required because it is not as good as the examples she has seen before that did, the ones that shape the mental model she carries of what is good. On their own rubrics really don’t tell us much, and though we may think they are objective, in reality we are still drawing upon our mental models whenever we make judgements. Even when the performance descriptors appear specific, they are never as specific as an actual question being asked, which ultimately always defines the standard.

If objective judgement using rubrics is a mirage, we are better off spending our time developing mental models of what constitutes the good, the bad and the ugly through exemplar work, rather than misunderstanding abstract prose descriptors. We should also look to shift emphasis towards the kinds of assessment formats that acknowledge the nature of human judgement, namely that all judgements are comparisons of one thing with another (Laming, 2004). In short, we should probably include comparative judgement in our assessment portfolio to draw reliable judgements about student achievement and make the intangible tangible.

13.  Share understanding of different standards of achievement

Standardisation has been a staple of subject meetings for years. In the days of National Curriculum Levels and the National Literacy Strategy, English teachers would pore over numerous examples of levelled reading and writing responses. At GCSE and A level in other subjects, I am sure many department meetings have been given over to discussing the relative standards of bits of student work. From my experience, these meetings are often a complete waste of time. Not only do teachers rarely agree on why one piece of writing with poor syntax and grammar should gain a level 5, but we rarely alter our marking after the event anyway. Those that are generous remain generous, and those that are stingier continue to hold back from assigning the higher marks.

The main problem with these kinds of meeting is their reliance on rubrics and performance descriptors, which as we have seen fail to pin down a common understanding of achievement. The other problem is that they fail to acknowledge the fundamental nature of human judgement, namely that we are relativist rather than absolutist in our evaluation. Since we are probably never going to fully agree on standards of achievement, such as the quality of one essay over another, we are probably better off looking at lots of different examples of quality and comparing their relative strengths and weaknesses directly rather than diluting the process by recourse to nebulous mark schemes.

Out of these kinds of standardisation meetings, with teachers judging a cohort’s work together, can come authentic forms of exemplified student achievement – ones that have been formed by a collective comparative voice, rather than by a well-intentioned individual attempting to reduce the irreducible to a series of simplistic statements. Software like No More Marking is increasingly streamlining the whole process, and the nature of the approach lends itself much better to maintaining year-on-year standards with more accuracy. Comparative judgement is not fully formed just yet, but as today’s report into the recent KS2 trial shows, there is considerable promise for the future.

14.  Analyse effectiveness of assessment items

As we have established, a good assessment should distinguish between different levels of attainment across the construct continuum. This means that we would expect an assessment to include questions that most students could answer, and others that only those with the deepest understanding could answer correctly. Obviously, there will always be idiosyncrasies. Some weaker students will sometimes know the answers to more challenging questions, and likewise some stronger students will not always know the answers to the simpler ones. This is the nature of assessing from a wide domain.

What we should be concerned about in terms of making our assessments as valid and reliable as possible, however, is whether, in the main, the items on the test truly discriminate across the construct continuum. A good assessment should contain harder questions that discriminate students with stronger knowledge and understanding. If that is not the case then something probably needs to change, either in the wording of the items or in realigning teacher understanding of what constitutes item difficulty.

How to calculate the difficulty of assessment items:

Step one: rank items in order of perceived difficulty (as best you can!)

Step two: work out the average mark per item by dividing the total marks awarded for each item by the number of students.

Step three: for items worth more than 1 mark, divide the average score per item by the number of marks available for it.

Step four: all item scores should now have a value between 0 and 1. High values indicate the item is relatively accessible whilst low values indicate the item is more difficult.

This is the formula in Excel to identify the average score of an individual item:

[Screenshot: Excel formula for the average score per item]

On an assessment with a large cohort of students we would expect to see a general trend of average scores going down as item difficulty increases, i.e. a lower percentage of students answering them correctly. Whilst it would be normal to expect some anomalies – after all, ranking items on perceived difficulty is not an exact science and is ultimately relative to what students know – any significant variations would probably be worth a closer look.
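For anyone who prefers a script to a spreadsheet, the four steps above can be sketched in a few lines of Python. The item names, maximum marks and student marks here are invented purely for illustration:

```python
# Hypothetical marks: one list per item, one entry per student (6 students).
results = {
    "Q1": {"max_mark": 1, "marks": [1, 1, 1, 0, 1, 1]},
    "Q2": {"max_mark": 2, "marks": [2, 1, 2, 0, 1, 2]},
    "Q3": {"max_mark": 4, "marks": [3, 1, 0, 0, 2, 1]},
}

def facility(marks, max_mark):
    """Average mark per item, scaled by the marks available to a 0-1 value.
    High values = relatively accessible item; low values = more difficult item."""
    return (sum(marks) / len(marks)) / max_mark

for item, data in results.items():
    print(item, round(facility(data["marks"], data["max_mark"]), 2))
```

With these invented marks, Q1 comes out as the most accessible item (0.83) and Q3 the hardest (0.29), matching the general downward trend we would hope to see as perceived difficulty increases.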

How to calculate item discrimination

There are different ways of measuring the extent to which an item distinguishes between more and less able students. Perhaps the easiest of these uses the discrimination index.

Step One: Select two groups of students from your assessment results – one with higher test scores and one with lower test scores. This can either be a split right down the middle, or sample at both extremes, so one group in the top third of total results, and one group in the bottom third.

Step Two: Subtract the sum of the low-scoring group’s marks on the item from the sum of the high-scoring group’s marks, then divide the result by the number of students in one group multiplied by the marks available for the question.

This is the formula to use in Excel:


[Screenshot: Excel formula for the discrimination index]

The discrimination index is essentially the proportion of students in the high test score group who answer the item correctly minus the proportion of students in the low test score group who do so. It operates on a range between -1 and +1, with values close to +1 indicating that the item discriminates well between high and low ability students for the construct being assessed.

Values near zero suggest that the item does not discriminate between high and low ability students, whilst values near -1 suggest that the item is quite often answered correctly by students who do the worst on the assessment as a whole and conversely incorrectly by those who score the best results on the overall assessment. These are therefore probably not great items.
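Expressed as code, the two-step calculation above looks like this. The group sizes and marks are invented for illustration:

```python
def discrimination_index(high_marks, low_marks, max_mark):
    """Discrimination index for one item: the difference between the
    high-scoring and low-scoring groups' totals on the item, divided by
    (group size x marks available). Assumes both groups are the same size."""
    n = len(high_marks)
    return (sum(high_marks) - sum(low_marks)) / (n * max_mark)

# A 1-mark item answered correctly by 9 of the top 10 students
# but only 3 of the bottom 10 discriminates reasonably well:
print(discrimination_index([1] * 9 + [0], [1] * 3 + [0] * 7, 1))  # 0.6
```

A value of 0.6 here simply reflects the 90% vs 30% success rates of the two groups; an item both groups answered equally often would score near zero.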

15.  Increase assessment reliability (but not at the expense of validity)

Screenshot 2017-05-03 18.45.36

Reliability in assessment is about consistency of measurement over time, place and context. The analogy often used is to a pair of weighing scales. When someone steps on a pair of scales, whether in the bathroom or the kitchen, they expect the measurement of their weight to be consistent from one reading to the next, particularly if their diet is constant. This is the same as reliability in assessment: the extent to which a test produces consistent outcomes each time it is sat. In the same way you wouldn’t want your scales to add or take away a few pounds every time you weigh in, you wouldn’t want a test to produce wildly different results every time you sat it, especially if nothing had changed in your weight or your intelligence.

The problem is that it is impossible to create a completely reliable assessment, particularly if we want to assess things that we value – like the quality of extended written responses, which, as we have already discussed, can be very subjective – and we don’t want our students to sit hundreds of hours’ worth of tests. We can increase reliability, but it often comes at a price, whether in validity (assessing the things that we believe represent the construct) or in time, which is finite and could be used for other things, like teaching.

What is reliability?

There are two ways of looking at the reliability of an assessment – the reliability of the test itself, or the reliability of the judgements being made by the judges. Reliability can be calculated by comparing two sets of scores for a single assessment (such as rater scores with comparative judgement) or two scores from two tests that assess the same construct. Once we have these two sets of scores, it is possible to work out how similar the results are by using a statistic called the reliability coefficient.

The reliability coefficient is the numerical index used to talk about reliability. It ranges from 0 to 1. A number closer to 1 indicates a high degree of reliability, whereas a low number suggests some error in the assessment design, or more likely one of the factors identified in the Ofqual list below. Reliability is generally considered good or acceptable if the reliability coefficient is around 0.80, though as Rob Coe points out (see below), even national examinations, with all their statistical know-how and manpower, only get as high as 0.93! And that was just the one GCSE subject.
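In code, a reliability coefficient of this kind is just the Pearson correlation between the two sets of scores. A minimal sketch, with two invented sets of marks for the same five students:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two paired lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Two sittings of the same test by the same five students:
sitting_1 = [12, 18, 25, 9, 20]
sitting_2 = [14, 17, 24, 11, 19]
print(round(pearson(sitting_1, sitting_2), 2))  # 0.99
```

A coefficient of 0.99 would mean the two sittings rank the students almost identically; in practice school assessments land well below that.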

How to identify the reliability of an assessment?

There are four main ways to identify the reliability of an assessment, each with its own advantages and disadvantages and each requiring different levels of confidence with statistics and spreadsheets. The four main methods used are:

  • Test–retest reliability
  • Parallel forms reliability
  • Split-half reliability
  • Internal-consistency (Cronbach’s alpha)

Test-retest reliability

This approach involves setting the same assessment with the same students at different points in time, such as at the beginning and end of a term. The correlation between the results that each student gets on each sitting of the same test provides a reliability coefficient. There are two significant problems with this approach, however. Firstly, there is the problem of sensitivity to instruction. It is likely that students will have learnt something between the first and second administrations of the test, which might invalidate the inferences that can be drawn and threaten any attempt to work out a reliability score.

The other, arguably more significant, issue relates to levels of student motivation. I am guessing that most students would not really welcome sitting the same test on two separate occasions, particularly if the second assessment is soon after the first, which would need to happen in order to reduce threats to validity and reliability. Any change in how students approach the second assessment will considerably affect the reliability score and probably make the exercise a complete waste of time.

Parallel forms reliability

One way round these problems is to design a parallel forms assessment. This is basically where one assessment is made up of two equal parts (parallel A and parallel B), with the second half (parallel B) performing the function of the second assessment in the test-retest approach outlined above. As with test-retest, correlations between student results from the parallel A and parallel B parts of the test can provide a reliability figure. The problem now is that, in reality, it is difficult to create two sections of an assessment of equal challenge. As we have considered, challenge lies in the choice of a question, and even the very best assessment designers don’t really know how difficult an item is until real students have tried to answer it.

Split-half reliability

Perhaps the best way to work out the reliability of a class assessment, and the one favoured by Dylan Wiliam, is the split-half reliability model. Rather than waste time attempting the almost impossible – creating two forms of the same assessment of equal difficulty – this approach skirts round the problem by dividing a single assessment in half and treating each half as a separate test.

There are different ways the assessment can be divided in half, such as a straight split down the middle or separating out the odd and even numbered items. Whatever method is used, the reliability coefficient is worked out the same way: by correlating the scores on the two parts and then taking account of the fact that this only relates to half the test by applying the Spearman-Brown formula**. This then provides a reasonable estimate of the reliability of an assessment, which is probably good enough for school-based assessment.
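The whole split-half procedure can be sketched in Python: split the items into odd and even halves, correlate the half totals, then apply the Spearman-Brown adjustment (full-test reliability = 2r / (1 + r)). The eight-item test and student marks below are invented for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one list of item marks per student.
    Correlates odd-item totals with even-item totals, then applies
    Spearman-Brown to estimate reliability for the full-length test."""
    odd_totals = [sum(student[0::2]) for student in item_scores]
    even_totals = [sum(student[1::2]) for student in item_scores]
    r = pearson(odd_totals, even_totals)
    return 2 * r / (1 + r)

# Six students' marks on an eight-item test (1 mark per item):
marks = [
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(marks), 2))  # 0.84
```

Note that the half-test correlation here is about 0.72, and it is the Spearman-Brown step that lifts the estimate to 0.84 for the test as a whole.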

The formula for applying Spearman-Brown in Excel is a little beyond the scope of my understanding. Fortunately, there are a lot of tools available on the Internet that make it possible to work out reliability scores using Spearman-Brown’s formula. The process involves downloading a spreadsheet and then inputting your test scores into cells containing pre-programmed formulas. The best of these is, unsurprisingly, from Dylan Wiliam himself, which is available to download here. Rather handily, Dylan also includes some super clear instructions on how to use the tool. Whilst there are other spreadsheets available elsewhere that perform this and other functions, they are not as clean and intuitive as this one.

Internal-consistency reliability (Cronbach’s alpha)

Screenshot 2017-05-03 18.35.26

At this point, I should point out that I am fast approaching the limits of my understanding in relation to assessment, particularly with regards to the use of statistics. Nevertheless, I think I have managed to get my head around internal-consistency reliability enough to use some of the tools available to work out the reliability of an assessment using Cronbach’s alpha. In statistics Cronbach’s alpha is used as an estimate of the reliability of a psychometric test. It provides an estimate of internal consistency reliability and helps to show whether or not all the items in an assessment are assessing the same construct. Unlike the easier to use – and understand – split-half reliability, Cronbach’s alpha looks at the average value of all possible split-half estimates, rather than just the single split actually made.

It uses this formula:

α = (k / (k − 1)) × (1 − Σσᵢ² / σₓ²)

where k is the number of items, σᵢ² is the variance of each individual item and σₓ² is the variance of students’ total scores.

If, like most people, however, you find this formula intimidating and unfathomable, seek out one of the many online spreadsheets set up with Cronbach’s alpha and ready for you to enter your own assessment data into the cells. Probably the most straightforward of these can be found here. It is produced by Professor Glenn Fulcher and it allows you to enter assessment results for any items with a mark of up to 7. There are instructions that tell you what to do, and they are quite easy for the layman to follow.
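For those comfortable with a little code, Cronbach’s alpha is less frightening than the formula suggests. A minimal sketch in Python, using invented marks purely for illustration:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(score_matrix):
    """score_matrix: one row of item marks per student."""
    k = len(score_matrix[0])              # number of items
    items = list(zip(*score_matrix))      # transpose: columns become items
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in score_matrix])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

marks = [  # five students, six items (invented data)
    [2, 3, 2, 3, 2, 3],
    [1, 1, 0, 1, 1, 0],
    [3, 3, 3, 2, 3, 3],
    [0, 1, 1, 0, 0, 1],
    [2, 2, 3, 2, 2, 2],
]
print(round(cronbach_alpha(marks), 2))  # prints 0.96
```

A high alpha here suggests the items are pulling in the same direction – that is, they appear to be assessing the same underlying construct.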

Make sure everyone understands the limitations of assessment

Given that no school assessment which measures the things we value or involves any element of human judgement is ever likely to be completely reliable, the time has probably come to be more honest about this with the people most impacted by summative tests, namely the students and their parents. The problem is that in reality this is incredibly hard to do. As Rob Coe jokes, can anyone imagine a teacher telling a parent that their child’s progress, say an old NC level 5, is accurate to a degree of plus or minus one level? Most teachers probably haven’t even heard about the standard error of measurement, let alone understand its impact on assessment practice enough to explain it to a bewildered parent.
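To give a feel for what that ‘plus or minus’ actually means, the standard error of measurement is simple to calculate once you have a reliability estimate. A rough sketch with invented figures (the score, standard deviation and reliability below are all made up):

```python
def sem(sd, reliability):
    """Standard error of measurement from score SD and test reliability."""
    return sd * (1 - reliability) ** 0.5

score, sd, reliability = 62, 10, 0.8    # invented figures for illustration
margin = 1.96 * sem(sd, reliability)    # approx. 95% confidence band
print(f"{score} ± {margin:.1f} marks")  # prints 62 ± 8.8 marks
```

Even with a respectable reliability of 0.8, a single test score carries an uncertainty band of nearly a grade’s width – which is exactly the point Coe is making.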

The US education system seems rather more advanced than ours in relation to reporting issues of error and uncertainty in assessment to parents. This is a consequence of the Standards for Educational and Psychological Testing (1999). These lay out the extent to which measurement uncertainty must be reported to stakeholders, which US courts follow in their rulings and test administrators account for in their supplementary technical guides.

A 2010 report commissioned by Ofqual into the way assessment agencies in the US report uncertainty information when making public the results of their assessments showed an impressive degree of transparency in relation to sharing issues of test score reliability. Whilst the report notes that parents are not always directly given the information about assessment error and uncertainty, the information is always readily available to those who want it, providing of course they can understand it!

‘Whether in numbers, graphics, or words, and whether on score reports, in interpretive guidelines (sometimes, the concept is explained in an “interpretive guide for parents”), or in technical manuals, the concept of score imprecision is communicated. For tests with items scored subjectively, such as written answers, it is common, too, to report some measure of inter-rater reliability in a technical manual.’

To my knowledge we don’t really have anything like this level of transparency in our system, but I think there are a number of things we can probably learn from the US about how to be smarter with sharing with students and parents the complexity of assessment and the inferences that it can and cannot provide us with. I am not suggesting that the example below is realistic for an individual school to replicate, but I like the way that it at least signals the scope for grade variation by including confidence intervals in each of its assessment scores.

Screenshot 2017-05-03 18.49.39

There is clearly much we need to do to educate ourselves about assessment, and then we may be better placed to educate those who are most affected by the tests that we set.

The work starts now.

*  The answer to the question is: John, where Paul had had ‘had’, had had ‘had had’. ‘Had had’ had had a clearer meaning.

** The Spearman–Brown prediction formula, also known as the Spearman–Brown prophecy formula, is a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length.


Visual Learning: using graphics to teach complex literary terms


I have always tried to pay attention to the way that I present material to my students. Don’t get me wrong, I am not interested in style over substance, and I certainly don’t spend hours labouring away over every resource that I use in class. If there is a quicker, equally effective way of teaching something, then I will take it. I’m not a masochist.

Most of my resources now are paper copy quizzes for retrieval practice and elaboration, many of which have proved very effective at A level. I try to use the board as much as possible, whether to post the all-important learning objective, model writing, record the unfolding of the lesson to ease the pressure on working memories or as a means of explaining tricky ideas or concepts more fully, often with an accompanying visual.

The problem is that I am a terrible artist. Unlike the wonderfully talented Oliver Caviglioli, whose illustrations and generosity are first class, my drawings are sad and pathetic. I would love to be a great illustrator, but I can barely write legibly, let alone draw anything beyond a stick man! I remember a couple of years ago I drew a picture of a horse for a poetry lesson, and the final product looked more like a pregnant camel with IBS than the thoroughbred I’d intended.

Fortunately, in the age of the Internet and Powerpoint (sorry, Jo), I have some pretty decent tools at my disposal to help me to make up for my artistic deficiency. As I have become increasingly aware of the power of combining words and images in boosting student learning, I have spent more of my time thinking about how images, in particular graphical representations, can be used to help with my teaching, such as in my explanations of complex literary concepts.

One of these troublesome concepts that seems to crop up whenever I teach Christina Rossetti’s ‘Goblin Market’ is allegory. ‘Goblin Market’ is a narrative poem with a familiar story: a young girl tempted into sin; her subsequent loss of innocence before salvation through sacrifice. Most people reading it will recognise the allegory of the story of the Fall of Man. There are one or two differences in the poem – there is no Adam, only horny and grotesque Goblin men, and the saviour is a woman, not the Son of God – but the overarching parallels are pretty clear.

The problem is when it comes to explaining the concept of allegory in and of itself to students – in other words outside the context of the specific example – students really struggle. No matter how hard I try to explain allegory clearly, with examples and analogies aplenty, students just don’t seem to fully get it. Now, you might be tempted to say that I should look to hone my explanation. Trust me on this one: I have honed it to within an inch of its life. There is simply no room for any more honing.

So, this year I thought I’d take a different tack and invest a bit of time producing a graphical representation to sit alongside my verbal explanation. I don’t have any hard evidence to show that what I have done has been any more successful than usual. It seems to have made a difference, with more students being able to explain the concept than before, but then again this may well be a case of confirmation bias. Or brighter students. Or chance.

As you can see from below, the slide I have used in the past to explain allegory is pretty contemptible – an overarching definition which I expand and exemplify, with a bulleted breakdown of the two main types, political allegory and the allegory of ideas. There are even a couple of token images thrown in, which I am not really convinced add any real value.

Screenshot 2017-04-22 08.32.40

My next effort is, I think, a real improvement. The graphical representations make the points of comparison in an allegory between Text A (‘Goblin Market’) and Text B (the story of the Fall of Man) much clearer, and they have the added advantage of being able to highlight where the biblical comparison breaks down, in that some pretty big parts of the Bible story are missing in Rossetti’s poem, such as the presence of God.

Screenshot 2017-04-22 08.32.48

I then attempted to flesh out this initial explanation with an amended version of my original effort. This time I added a relational dimension to my diagram which enabled me to visualise the difference between allegory and other related literary concepts, such as fable and parable. The trouble was that whilst I had made some visual links between genres clear, I had lost the power of the previous graphic to embody the workings of allegory itself.

Screenshot 2017-04-22 08.32.56

My final version therefore combines the best elements of my previous attempts, including the graphical embodiment of the concept of allegory, the relational links to other genres and better images to exemplify the different forms of allegory. The visual cues and graphical representations, along with my honed explanation, seem to have been much more successful in shifting my students’ understanding of allegory. At least, I hope that is the case.

Screenshot 2017-04-22 08.33.07

Allegory is not the only literary concept I have attempted to represent graphically in this way. I hope to blog about others in the future, so watch this space.

Thanks for reading.

Principles of Great Assessment #2 Validity and Fairness


This is the second of a three-part series on the principles of great assessment. In my last post I focused on some principles of assessment design. This post outlines the principles that relate to ideas of validity and fairness.* As I have repeatedly stressed, I do not consider myself to be an expert in the field of assessment, so I am more than happy to accept constructive feedback to help me learn and to improve upon the understanding of assessment that we have already developed as a school. My hope is that these posts will help others to learn a bit more about assessment, and for the assessments that students sit to be as purposeful and supportive of their learning as possible.

So, here are my principles of great assessment 6-10.

6. Regularly review assessments in light of student responses

Validity in assessment is extremely important. For Daniel Koretz it is ‘the single most important criterion for evaluating achievement testing.’ Often when teachers talk about an assessment being valid or invalid, they are using the term incorrectly. In assessment validity means something very different to what it means in everyday language. Validity is not a property of a test, but rather of the inferences that an assessment is designed to produce. As Lee Cronbach observes, ‘One validates not a test but an interpretation of data arising from a specified procedure’ (Cronbach, 1971).

There is therefore no such thing as a valid or invalid assessment. A maths assessment with a high reading age might be considered to provide valid inferences for students with a high reading age, but invalid inferences for students with low reading ages. The same test can therefore provide both valid and invalid inferences depending on its intended purpose, which links back to the second assessment principle: the purpose of the assessment must be set and agreed from the outset. Validity is thus specific to particular uses in particular contexts and is not an ‘all or nothing’ judgement but rather a matter of degree and application.

If you understand that validity applies to the inferences that assessments provide, then you should be able to appreciate why it is so important to make sure that an assessment gives as valid inferences about student achievement as possible, particularly when there are significant consequences attached for students taking them, like attainment grouping. There are two main threats to achieving this validity: construct under-representation and construct irrelevance. Construct under-representation refers to when a measure fails to capture important aspects of the construct, whilst construct irrelevance refers to when a measure is influenced by things other than just the construct i.e. the example of high reading age in a maths assessment.

There are a number of practical steps that teachers can take to help reduce these threats to validity and, in turn, to increase the validity of the inferences provided by their assessments. Some are fairly obvious and can be implemented with little difficulty, whilst others require a bit more technical know-how and/or a well-designed systematic approach that provides teachers with the time and space needed to design and review their assessments on a regular basis.

Here are some practical steps educators can take:

Review assessment items collaboratively before a new assessment is sat

Badly constructed assessment items create noise and can lead to students guessing the answer. Where possible, it is therefore worth spending some time and effort upfront, reviewing items in a forthcoming summative assessment before they go live so that any glaring errors around the wording can be amended, and any unnecessary information can be removed. Aside from making that assessment more likely to generate valid inferences, such an approach has the added advantage of training those less confident in assessment design in some of the ways of making assessments better and more fit for purpose. In an ideal world, an important assessment should be piloted first to provide some indication of issues with items, and the likely spread of results across an ability profile. This will not always be possible.

Check questions for cues and contextual nudges

Another closely-linked problem and potential threat to validity is flawed question phrasing that inadvertently reveals the answer, or provides students with enough contextual cueing to narrow down their responses to a particular semantic or grammatical fit. In the example item from a PE assessment below, for instance, the phrasing of the question, namely the grammatical construction of the words and phrases around the gaps, makes anaerobic and aerobic more likely candidates for the correct answer. They are adjectives which precede nouns, whilst the rest of the options are all nouns, and a noun followed by a noun would sound odd to a native speaker. A student might select anaerobic and aerobic, not because they necessarily know the correct answer, but because they sound correct in accordance with the syntactical cues provided. This is a threat to validity in that the inference is perhaps more about grammatical knowledge than understanding of bodily processes.

Example: The PE department have designed an end of unit assessment to check students’ understanding of respiratory systems. It includes the following types of item.

Task: use two of the following words to complete the passage below

Anaerobic, Energy, Circulation, Metabolism, Aerobic 

When the body is at rest this is ______ respiration. As you exercise you breathe harder and deeper and the heart beats faster to get oxygen to the muscles. When exercising very hard, the heart cannot get enough oxygen to the muscles. Respiration becomes _______.

Interrogate questions for construct irrelevance

If the purpose of an assessment has been clearly established from the outset and that assessment has been clearly aligned to the constructs within the curriculum, then a group of subject professionals working together should be able to identify items where things other than the construct are being assessed. Obvious examples are high reading ages that get in the way of assessments of mathematical or scientific ability, but sometimes it might be harder to detect, as with the example below. To some, this item might seem fairly innocuous, but on closer inspection it becomes clear that it is not assessing vocabulary knowledge as purported, but rather spelling ability. Whilst it may be desirable for students to spell words correctly, inferences about word knowledge would not be possible from an assessment with these kinds of items in it.

Example: The English department designs an assessment to measure students’ vocabulary skills. The assessment consists of 40 items like the following:

Task: In all of the ________________ of packing into a new house, Sandra forgot about washing the baby.

  1. Excitement
  2. Excetmint
  3. Excitemant
  4. Excitmint

7. Standardise assessments that lead to important decisions

Teachers generally understand the importance of making sure that students sit final examinations in an exam hall under the same conditions as everyone else taking the test. Mock examinations tend to replicate these conditions, because teachers and school leaders want the inferences provided by them to be as valid and fair as possible. For all manner of reasons, though, this insistence on standardised conditions for test takers is less rigorously adhered to lower down the school, even though some of the decisions based upon such tests in years 7 and 8 arguably carry much more significance for students than any terminal examination.

I know that I have been guilty of not properly understanding the importance of standardising test conditions. On more than one occasion I have set an end of unit or term assessment as a cover activity, thinking that it was ideal work because it would take students the whole lesson to complete and they would need to work in silence. I hadn’t appreciated that assessment is a bit more complicated than that, even for something like an end of unit test. I hadn’t considered, for instance, that it mattered whether students got the full hour, or more likely 50 minutes if it was set by a cover supervisor who had to spend valuable time settling the class. I hadn’t taken on board that it would make a difference if my class sat the assessment in the afternoon, and the class next door completed theirs bright and early in the morning.

It may well be that my students would have scored exactly the same whether or not I was present, whether they sat the test in the morning or in the afternoon, or whether they had 50 minutes or the full hour. The point is that I could not be sure, and that if one or more of my students would have scored significantly higher (or lower) under different circumstances, then their results would have provided invalid inferences about their understanding. If they were then placed in a higher or lower group as a result, or I reported home to their parents some erroneous information about their test scores, which possibly affected their motivation or self-efficacy, then you could suggest that I had acted unethically.

8. Important decisions are made on the basis of more than one assessment

Imagine you are looking to recruit a new head of science. Now imagine the even more unlikely scenario that you have received a strong field of applicants, which, in the current recruitment climate, is a bit of a stretch of the imagination. With such a strong field for such an important post, a school would be unlikely to make any decision on whom to appoint based upon the inferences provided by one single measure, such as an application letter, a taught lesson or an interview. More likely, they would triangulate all these different inferences about the candidate’s suitability for the role when making their decision, and even then cross their fingers that they had made the right choice.

A similar principle is at work when making important decisions on the back of student assessment results, such as which group to place them in the following term, identifying which individuals need additional support or how much, if any, progress to report home to parents. In each of these cases, as with the head of science example, it would be wise to be able to draw upon multiple inferences in order to make a more informed decision. This is not to advocate an exponential increase in the number of tests students sit, but rather to recognise that when the stakes are high, it is important to make sure the information we use is as valid as possible. Cross referencing examinations is one way of achieving this, particularly given the practical difficulties of standardising assessments previously discussed.

9. Timing of assessment is determined by purpose and professional judgement

The purpose of an assessment informs its timing. Whilst this makes perfect sense in the abstract, in practice there are many challenges to making this happen. In Principled Assessment Design, Dylan Wiliam notes how it is relatively straightforward to create assessments which are highly sensitive to instruction if what is taught is not hard to teach and learn. For example, if all I wanted to teach my students in English was vocabulary, and I set up a test that assessed them on the 20 or so words that I had recently taught them, it would be highly likely that the test would show rapid improvements in their understanding of these words. But as we all know, teaching is about much more than just learning a few words. It involves complex cognitive processes and vast webs of interconnected knowledge, all of which take a considerable amount of time to teach, and in turn to assess.


It seems that the distinction between learning and performance is becoming increasingly well understood, though perhaps in terms of curriculum and assessment its widespread application to the classroom is taking longer to take hold. The reality for many established schools is that it is difficult to construct a coherent curriculum, assessment and pedagogical model across a whole school that embraces the full implications of the difference between learning and performance. It is hard enough to get some colleagues to fully appreciate the distinction, and its many nuances, so indoctrinated are they by years of the wrong kind of impetus. Added to this, whilst there is general agreement that assessing performance can be unhelpful and misleading, there is no real consensus on the optimal time to assess for learning. We know that assessing soon after teaching is flawed, but not exactly when to capture longer term learning. Compromise is probably inevitable.

What all this means in practical terms is that schools have to work within their localised constraints, including issues of timetabling, levels of understanding amongst staff and, crucially, the time and resources to enact the theory once it is known and understood. Teacher workload must also be taken into account when deciding upon the timing of assessments, recognising certain pinch points in the year and building a coherent assessment timetable that respects the division between learning and performance, builds in opportunities to respond to (perceived) gaps in understanding and spreads out the emotional and physical demands on staff and students. Not easy, at all.

10. Identify the range of evidence required to support inferences about achievement

Tim Oates’ oft-quoted advice to avoid assessing ‘everything that moves, just the key concepts’ is important to bear in mind, not just for those responsible for assessment, but also for those who design the curricula with which those assessments are aligned. Despite the freedoms afforded by the liberation from levels and the greater autonomy possible with academy status, many of us have still found it hard to narrow down what we teach to what is manageable and most important. We find it difficult in practice to sacrifice breadth in the interests of depth, particularly where we feel passionately that so much is important for students to learn. I know it has taken several years for our curriculum leaders to truly reconcile themselves to the need to strip out some content and focus on teaching the most important material to mastery.

Once these ‘key concepts’ have been isolated and agreed, the next step is to make sure that any assessments cover the breadth and depth required to gain valid inferences about student achievement of them. I think the diagram below, which I used in my previous blog, is helpful in illustrating how assessment designers should be guided by both the types of knowledge and skills that exist within the construct (the vertical axis) and the levels of achievement across each component i.e. the continuum (horizontal axis). This will likely look very different in some subjects, but it nevertheless provides a useful conceptual framework for thinking about the breadth and depth of items required to support valid inferences about levels of attainment of the key concepts.

Screenshot 2017-03-09 16.53.06

My next post, which I must admit I am dreading writing and releasing for public consumption, will focus on trying to articulate a set of principles around the very thorny and complicated area of assessment reliability. I think I am going to need a couple of weeks or so to make sure that I do it justice!

Thanks for reading!


* I am aware the numbering of the principles on the image does not match the numbering in my post. That’s because the image is a draft document.