This is the second of a three-part series on the principles of great assessment. In my last post I focused on some principles of assessment design. This post outlines the principles that relate to ideas of validity and fairness.* As I have repeatedly stressed, I do not consider myself to be an expert in the field of assessment, so I am more than happy to accept constructive feedback to help me learn and to improve upon the understanding of assessment that we have already developed as a school. My hope is that these posts will help others to learn a bit more about assessment, and that the assessments students sit will be as purposeful and supportive of their learning as possible.
So, here are my principles of great assessment 6-10.
6. Regularly review assessments in light of student responses
Validity in assessment is extremely important. For Daniel Koretz it is ‘the single most important criterion for evaluating achievement testing.’ Often when teachers talk about an assessment being valid or invalid, they are using the term incorrectly. In assessment, validity means something very different to what it means in everyday language. Validity is not a property of a test, but rather of the inferences that an assessment is designed to produce. As Lee Cronbach observes, ‘One validates not a test but an interpretation of data arising from a specified procedure’ (Cronbach, 1971).
There is therefore no such thing as a valid or invalid assessment. A maths assessment with a high reading age might be considered to provide valid inferences for students with a high reading age, but invalid inferences for students with low reading ages. The same test can therefore provide both valid and invalid inferences depending on its intended purpose, which links back to the second assessment principle: the purpose of the assessment must be set and agreed from the outset. Validity is thus specific to particular uses in particular contexts and is not an ‘all or nothing’ judgement but rather a matter of degree and application.
If you understand that validity applies to the inferences that assessments provide, then you should be able to appreciate why it is so important to make sure that an assessment gives as valid inferences about student achievement as possible, particularly when there are significant consequences attached for the students taking it, such as attainment grouping. There are two main threats to achieving this validity: construct under-representation and construct irrelevance. Construct under-representation refers to when a measure fails to capture important aspects of the construct, whilst construct irrelevance refers to when a measure is influenced by things other than just the construct, as in the example of a high reading age in a maths assessment.
There are a number of practical steps that teachers can take to help reduce these threats to validity and, in turn, to increase the validity of the inferences provided by their assessments. Some are fairly obvious and can be implemented with little difficulty, whilst others require a bit more technical know-how and/or a well-designed systematic approach that provides teachers with the time and space needed to design and review their assessments on a regular basis.
Here are some practical steps educators can take:
Review assessment items collaboratively before a new assessment is sat
Badly constructed assessment items create noise and can lead to students guessing the answer. Where possible, it is therefore worth spending some time and effort upfront, reviewing items in a forthcoming summative assessment before they go live so that any glaring errors around the wording can be amended, and any unnecessary information can be removed. Aside from making that assessment more likely to generate valid inferences, such an approach has the added advantage of training those less confident in assessment design in some of the ways of making assessments better and more fit for purpose. In an ideal world, an important assessment should be piloted first to provide some indication of issues with items, and the likely spread of results across an ability profile. This will not always be possible.
Check questions for cues and contextual nudges
Another closely linked problem, and another potential threat to validity, is flawed question phrasing that inadvertently reveals the answer, or provides students with enough contextual cueing to narrow down their responses to a particular semantic or grammatical fit. In the example item from a PE assessment below, for instance, the phrasing of the question, namely the grammatical construction of the words and phrases around the gaps, makes anaerobic and aerobic more likely candidates for the correct answer. They are adjectives which precede nouns, whilst the rest of the options are all nouns and would sound odd to a native speaker – a noun followed by a noun. A student might select anaerobic and aerobic, not because they necessarily know the correct answer, but because they sound correct in accordance with the syntactical cues provided. This is a threat to validity in that the inference is perhaps more about grammatical knowledge than understanding of bodily processes.
Example: The PE department have designed an end of unit assessment to check students’ understanding of respiratory systems. It includes the following types of item.
Task: use two of the following words to complete the passage below
Anaerobic, Energy, Circulation, Metabolism, Aerobic
When the body is at rest this is ______ respiration. As you exercise you breathe harder and deeper and the heart beats faster to get oxygen to the muscles. When exercising very hard, the heart cannot get enough oxygen to the muscles. Respiration becomes _______.
Interrogate questions for construct irrelevance
If the purpose of an assessment has been clearly established from the outset and that assessment has been clearly aligned to the constructs within the curriculum, then a group of subject professionals working together should be able to identify items where things other than the construct are being assessed. Obvious examples are high reading ages that get in the way of assessments of mathematical or scientific ability, but sometimes construct irrelevance might be harder to detect, as with the example below. To some, this item might seem fairly innocuous, but on closer inspection it becomes clear that it is not assessing vocabulary knowledge as purported, but rather spelling ability. Whilst it may be desirable for students to spell words correctly, inferences about word knowledge would not be possible from an assessment with these kinds of items in it.
Example: The English department designs an assessment to measure students’ vocabulary skills. The assessment consists of 40 items like the following:
Task: In all of the ________________ of packing into a new house, Sandra forgot about washing the baby.
7. Standardise assessments that lead to important decisions
Teachers generally understand the importance of making sure that students sit final examinations in an exam hall under the same conditions as everyone else taking the test. Mock examinations tend to replicate these conditions, because teachers and school leaders want the inferences provided by them to be as valid and fair as possible. For all manner of reasons, though, this insistence on standardised conditions for test takers is less rigorously adhered to lower down the school, even though some of the decisions based upon such tests in years 7 and 8 arguably carry much more significance for students than any terminal examination.
I know that I have been guilty of not properly understanding the importance of standardising test conditions. On more than one occasion I have set an end of unit or term assessment as a cover activity, thinking that it was ideal work because it would take students the whole lesson to complete and they would need to work in silence. I hadn’t appreciated how assessment is a bit more complicated than that, even for something like an end of unit test. I hadn’t considered, for instance, that it mattered whether students got the full hour, or more likely 50 minutes if it was set by a cover supervisor who had to spend valuable time settling the class. I hadn’t taken on board that it would make a difference if my class sat the assessment in the afternoon, and the class next door completed theirs bright and early in the morning.
It may well be that my students would have scored exactly the same whether or not I was present, whether they sat the test in the morning or in the afternoon, or whether they had 50 minutes or the full hour. The point is that I could not be sure, and that if one or more of my students would have scored significantly higher (or lower) under different circumstances, then their results would have provided invalid inferences about their understanding. If they were then placed in a higher or lower group as a result, or I reported home to their parents some erroneous information about their test scores, which possibly affected their motivation or self-efficacy, then you could suggest that I had acted unethically.
8. Important decisions are made on the basis of more than one assessment
Imagine you are looking to recruit a new head of science. Now imagine the even more unlikely scenario that you have received a strong field of applicants, which, I appreciate, in the current recruitment climate is a bit of a stretch of the imagination. With such a strong field for such an important post, a school would be unlikely to make any decision on whom to appoint based upon the inferences provided by one single measure, such as an application letter, a taught lesson or an interview. More likely, they would triangulate all these different inferences about the candidate’s suitability for the role when making their decision, and even then cross their fingers that they had made the right choice.
A similar principle is at work when making important decisions on the back of student assessment results, such as which group to place students in the following term, which individuals need additional support or how much, if any, progress to report home to parents. In each of these cases, as with the head of science example, it would be wise to be able to draw upon multiple inferences in order to make a more informed decision. This is not to advocate an exponential increase in the number of tests students sit, but rather to recognise that when the stakes are high, it is important to make sure the information we use is as valid as possible. Cross-referencing examinations is one way of achieving this, particularly given the practical difficulties of standardising assessments previously discussed.
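For anyone who likes to see an idea in concrete form, here is a rough Python sketch of what cross-referencing two or more assessments might look like. It is an illustration only, not a description of what we do as a school: the function names, the made-up marks and the disagreement threshold of 1.0 are all assumptions of mine. Each student's mark is converted to a standardised score (how far above or below the cohort average it sits), the scores are averaged across assessments, and any student whose results disagree sharply is flagged for a closer look before a decision is made.

```python
from statistics import mean, pstdev


def z_scores(marks):
    """Standardise one assessment's raw marks, given as {student: mark}."""
    values = list(marks.values())
    average = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # Everyone scored the same, so no student stands out on this paper.
        return {student: 0.0 for student in marks}
    return {student: (mark - average) / spread for student, mark in marks.items()}


def cross_reference(assessments, disagreement=1.0):
    """Combine several assessments and flag students whose results conflict.

    `disagreement` is a hypothetical threshold (in standard deviations);
    it is an assumption for illustration, not a recommended value.
    """
    standardised = [z_scores(marks) for marks in assessments]
    # Only students who sat every assessment can be compared fairly.
    students = set.intersection(*(set(z) for z in standardised))
    summary = {}
    for student in sorted(students):
        scores = [z[student] for z in standardised]
        gap = max(scores) - min(scores)
        summary[student] = {
            "mean_z": mean(scores),
            "gap": gap,
            "review": gap > disagreement,
        }
    return summary


# Made-up marks for two assessments, purely for illustration.
autumn_test = {"Asha": 10, "Ben": 20, "Cara": 30}
spring_test = {"Asha": 30, "Ben": 20, "Cara": 10}

for student, result in cross_reference([autumn_test, spring_test]).items():
    print(student, result)
```

A flag here is a prompt for professional judgement, not a verdict: it simply says the two papers are telling different stories about that student, so neither should be relied upon alone.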
9. Timing of assessment is determined by purpose and professional judgement
The purpose of an assessment informs its timing. Whilst this makes perfect sense in the abstract, in practice there are many challenges to making this happen. In Principled Assessment Design, Dylan Wiliam notes how it is relatively straightforward to create assessments which are highly sensitive to instruction if what is taught is not hard to teach and learn. For example, if all I wanted to teach my students in English was vocabulary, and I set up a test that assessed them on the 20 or so words that I had recently taught them, it would be highly likely that the test would show rapid improvements in their understanding of these words. But as we all know, teaching is about much more than just learning a few words. It involves complex cognitive processes and vast webs of interconnected knowledge, all of which take a considerable amount of time to teach, and in turn to assess.
It seems that the distinction between learning and performance is becoming increasingly well understood, though perhaps in terms of curriculum and assessment its widespread application to the classroom is taking longer to take hold. The reality for many established schools is that it is difficult to construct a coherent curriculum, assessment and pedagogical model across a whole school that embraces the full implications of the difference between learning and performance. It is hard enough to get some colleagues to fully appreciate the distinction, and its many nuances, so indoctrinated are they by years of the wrong kind of impetus. Added to this, whilst there is general agreement that assessing performance can be unhelpful and misleading, there is no real consensus on the optimal time to assess for learning. We know that assessing soon after teaching is flawed, but not exactly when to capture longer term learning. Compromise is probably inevitable.
What all this means in practical terms for schools is that they have to work within their localised constraints, including issues of timetabling, levels of understanding amongst staff and, crucially, the time and resources to enact the theory when known and understood. Teacher workload must also be taken into account when deciding upon the timing of assessments, recognising certain pinch points in the year and building a coherent assessment timetable that respects the division between learning and performance, builds in opportunities to respond to (perceived) gaps in understanding and spreads out the emotional and physical demands for staff and students. Not easy, at all.
10. Identify the range of evidence required to support inferences about achievement
Tim Oates’ oft-quoted advice to avoid assessing ‘everything that moves, just the key concepts’ is important to bear in mind, not just for those responsible for assessment, but also for those who design the curricula with which those assessments are aligned. Despite the freedoms afforded by the liberation of levels and the greater autonomy possible with academy status, many of us have still found it hard to narrow down what we teach to what is manageable and most important. We find it difficult in practice to sacrifice breadth in the interests of depth, particularly where we feel passionately that so much is important for students to learn. I know it has taken several years for our curriculum leaders to truly reconcile themselves to the need to strip out some content and focus on teaching the most important material to mastery.
Once these ‘key concepts’ have been isolated and agreed, the next step is to make sure that any assessments cover the breadth and depth required to gain valid inferences about student achievement of them. I think the diagram below, which I used in my previous blog, is helpful in illustrating how assessment designers should be guided by both the types of knowledge and skills that exist within the construct (the vertical axis) and the levels of achievement across each component, i.e. the continuum (the horizontal axis). This will likely look very different in some subjects, but it nevertheless provides a useful conceptual framework for thinking about the breadth and depth of items required to support valid inferences about levels of attainment of the key concepts.
My next post, which I must admit I am dreading writing and releasing for public consumption, will focus on trying to articulate a set of principles around the very thorny and complicated area of assessment reliability. I think I am going to need a couple of weeks or so to make sure that I do it justice!
Thanks for reading!
* I am aware the numbering of the principles on the image does not match the numbering in my post. That’s because the image is a draft document.