Principles of Great Assessment #3: Reliability

This is the third and final post in my three-part series on the principles of great assessment. In the first post I focused on the principles of assessment design, and in the second on principles relating to issues of fairness and equality. This final post attempts to get to grips with principles relating to reliability and to making assessments provide useful information about student attainment. I have been putting off this post because, whilst I recognise how important reliability is in assessment, I know how hard it is to get to grips with, let alone explain to others. I have done my best to synthesise the words and ideas of others. I hope it helps lead to the better use of assessment in schools.

Here are my principles of great assessment, 11-16.

11. Define standards through questions set

The choice of the questions set in an assessment is important, as they ultimately define the standard of expectation, even in cases where the prose descriptors appear secure. Where there is variation in the rigour of the questions set by teachers, problems occur and inaccurate inferences are likely to be drawn. The following example from Dylan Wiliam, albeit extreme, illustrates this relationship between questions and standards.

Task: add punctuation to the following sentence to make it grammatically correct

John where Paul had had had had had had had had had had had a clearer meaning.

This question could feasibly be set to assess students’ understanding of grammar, in particular their knowledge of how commas and apostrophes are used to clarify meaning, which on the surface seems a relatively tight and definitive statement. Obviously, no right-minded teacher would ever set such an absurdly difficult example, which most of us, including English teachers, would struggle to answer correctly*. But it highlights the problems that can arise when teachers deploy their own understanding of the required standards independently.

A teacher setting the above question would clearly have sky-high expectations of their students’ grammatical understanding, or supreme confidence in their own teaching! More realistically, a question assessing students’ grammatical ability would look more like the example below, which requires a far lower level of grammatical understanding.

Task: add punctuation to the following sentence to make it grammatically correct

John went to the beach with his towel his bucket his swimming trunks and his spade.

All this is yet more reason why summative assessments should be standardised. It simply cannot be that the questions some students face demand significantly greater knowledge and understanding than those faced by others who have been taught the same curriculum. The questions used in tests of this nature should be agreed upfront and aligned with the curriculum to remain stable each year. This is, of course, really difficult in practice: teachers may start teaching to the test, and thus invalidate the inferences from the assessment, or the questions set one year may not be of the same standard as those set previously, making year-on-year comparisons difficult.

12. Define standards through exemplar pupil work

As well as defining standards through questions, standards can also be defined through student work. Using examples of work to exemplify standards is far better than defining those same expectations through the abstraction of rubrics. As we have seen, not only do rubrics tend to create artificial distinctions between levels of performance, but the descriptions of these performances are more often than not meaningless in isolation. One person’s notion of ‘detailed and developed analysis’ can easily be another’s ‘highly sophisticated and insightful evaluation’. As Hamlet says to Polonius, they are just ‘words, words, words’. They only mean something when they are applied to examples.

Whether we like it or not, we all carry mental models of what constitutes excellence in our subject. A history teacher knows when she sees a great piece of historical enquiry; she doesn’t need a set of performance descriptors to tell her it demonstrates sound understanding of the important causes and effects, explained in a coherent way. She knows excellence because she has seen it before and it looked similar. Perversely, performance descriptors could actually lead her to lower the mark she awards, particularly if they are too formulaic and reductive, which seems to be the problem with KS2 mark schemes: the work includes all the prescribed functional elements, but the overall piece is not fluent, engaging or ambitious.

Likewise, the same history teacher knows when something has fallen short of what is required because it is not as good as the examples she has seen before that did meet it – the ones that shape the mental model she carries of what is good. On their own, rubrics really don’t tell us much, and though we may think they are objective, in reality we are still drawing upon our mental models whenever we make judgements. Even when the performance descriptors appear specific, they are never as specific as an actual question being asked, which ultimately always defines the standard.

If objective judgement using rubrics is a mirage, we are better off spending our time developing mental models of what constitutes the good, the bad and the ugly through exemplar work, rather than misunderstanding abstract prose descriptors. We should also look to shift emphasis towards the kinds of assessment formats that acknowledge the nature of human judgement, namely that all judgements are comparisons of one thing with another (Laming, 2004). In short, we should probably include comparative judgement in our assessment portfolio to draw reliable judgements about student achievement and make the intangible tangible.

13.  Share understanding of different standards of achievement

Standardisation has been a staple of subject meetings for years. In the days of National Curriculum Levels and the National Literacy Strategy, English teachers would pore over numerous examples of levelled reading and writing responses. At GCSE and A level in other subjects, I am sure many department meetings have been given over to discussing relative standards of bits of student work. In my experience, these meetings are often a complete waste of time. Not only do teachers rarely agree on why one piece of writing with poor syntax and grammar should gain a level 5, but we rarely alter our marking after the event anyway. Those that are generous remain generous, and those that are stingier continue to hold back from assigning the higher marks.

The main problem with these kinds of meeting is their reliance on rubrics and performance descriptors, which as we have seen fail to pin down a common understanding of achievement. The other problem is that they fail to acknowledge the fundamental nature of human judgement, namely that we are relativist rather than absolutist in our evaluation. Since we are probably never going to fully agree on standards of achievement, such as the quality of one essay over another, we are probably better off looking at lots of different examples of quality and comparing their relative strengths and weaknesses directly rather than diluting the process by recourse to nebulous mark schemes.

Out of these kinds of standardisation meetings, with teachers judging a cohort’s work together, can come authentic forms of exemplified student achievement – ones that have been formed by a collective comparative voice, rather than by a well-intentioned individual attempting to reduce the irreducible to a series of simplistic statements. Software like No More Marking is increasingly streamlining the whole process, and the nature of the approach lends itself much better to maintaining standards from year to year. Comparative judgement is not fully formed just yet, but as today’s report into the recent KS2 trial shows, there is considerable promise for the future.

14.  Analyse effectiveness of assessment items

As we have established, a good assessment should distinguish between different levels of attainment across the construct continuum. This means that we would expect a marks for difficulty assessment to include questions that most students could answer, and others that only those with the deepest understanding could respond to correctly. Obviously, there will always be idiosyncrasies: some weaker students will know the answers to more challenging questions, and likewise some stronger students will not always know the answers to the simpler ones. This is the nature of assessing from a wide domain.

What we should be concerned about in terms of making our assessments as valid and reliable as possible, however, is whether, in the main, the items on the test truly discriminate across the construct continuum. A good assessment should contain harder questions that discriminate between students with stronger and weaker knowledge and understanding. If that is not the case then something probably needs to change, either in the wording of the items or in realigning teacher understanding of what constitutes item difficulty.

How to calculate the difficulty of assessment items:

Step one: rank items in order of perceived difficulty (as best you can!)

Step two: work out the average mark per item by dividing the total marks awarded for each item by the number of students.

Step three: for items worth more than 1 mark, divide the average score per item by the number of marks available for it.

Step four: all item scores should now sit on a scale between 0 and 1. High values indicate the item is relatively accessible, whilst low values indicate the item is more difficult.

This is the formula in Excel to identify the average score on an individual item as a proportion of the marks available:

=SUM(B3:B8)/(COUNT(B3:B8)*B9)
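If you would rather script the calculation than build a spreadsheet, the same facility value can be worked out in a few lines of Python. This is only a minimal sketch with invented marks; the one assumption is that you have each student’s score on each item and the maximum marks available:

# Facility of each item: average score as a proportion of the marks available.
# All marks below are invented purely for illustration.
item_max_marks = {"Q1": 1, "Q2": 2, "Q3": 4}

# Each row is one student's scores on Q1, Q2 and Q3.
student_scores = [
    [1, 2, 3],
    [1, 1, 2],
    [0, 2, 1],
    [1, 0, 0],
]

for col, (item, max_marks) in enumerate(item_max_marks.items()):
    total = sum(row[col] for row in student_scores)
    facility = total / (len(student_scores) * max_marks)
    print(f"{item}: facility = {facility:.2f}")  # 1.00 would mean every student scored full marks

For the invented data above this gives facilities of 0.75, 0.625 and 0.375: a question most students got right, one in the middle, and a harder one.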

On an assessment with a large cohort of students we would expect to see a general trend of average scores going down as item difficulty increases, i.e. a lower percentage of students answering them correctly. Whilst it would be normal to expect some anomalies – after all, ranking items on perceived difficulty is not an exact science and is ultimately relative to what students know – any significant variations would probably be worth a closer look.

How to calculate item discrimination

There are different ways of measuring the extent to which an item distinguishes between more and less able students. Perhaps the easiest of these uses the discrimination index.

Step one: select two groups of students from your assessment results – one with higher test scores and one with lower test scores. This can either be a split right down the middle, or a sample at both extremes: one group from the top third of total results and one from the bottom third.

Step two: subtract the sum of the item scores achieved by the low-scoring group from the sum achieved by the high-scoring group, then divide the result by the number of students in the high-scoring group multiplied by the marks available for the question.

This is the formula to use in Excel:

=(SUM(B5:B7)-SUM(B8:B10))/(COUNT(B5:B7)*B11)
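The same calculation is easy enough to script. Here is a rough Python sketch with invented scores, assuming a simple split into a top half and a bottom half on total test score:

# Discrimination index for a single item, following the steps above.
# All numbers are invented for illustration.
marks_available = 3

# (total test score, score on this item) for each student.
results = [(48, 3), (45, 3), (41, 2), (30, 1), (22, 1), (18, 0)]

# Sort by total test score and split into a high group and a low group.
results.sort(key=lambda r: r[0], reverse=True)
half = len(results) // 2
high_group = [item_score for _, item_score in results[:half]]
low_group = [item_score for _, item_score in results[-half:]]

discrimination = (sum(high_group) - sum(low_group)) / (len(high_group) * marks_available)
print(f"discrimination index = {discrimination:.2f}")  # values near +1 discriminate well

For the invented data this works out as (8 − 2) / (3 × 3), roughly 0.67 – an item that discriminates reasonably well.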

The discrimination index is essentially the percentage of students in the high test score group who answer the item correctly minus the percentage of students in the low test score group who do so. It operates on a range between -1 and +1, with values close to +1 indicating the item discriminates well between high and low ability students for the construct being assessed.

Values near zero suggest that the item does not discriminate between high and low ability students, whilst values near -1 suggest that the item is quite often answered correctly by students who do the worst on the assessment as a whole and conversely incorrectly by those who score the best results on the overall assessment. These are therefore probably not great items.

15.  Increase assessment reliability (but not at the expense of validity)


Reliability in assessment is about consistency of measurement over time, place and context. The analogy often used is to a pair of weighing scales. When someone steps on a pair of scales, whether in the bathroom or the kitchen, they expect the measurement of their weight to be consistent from one reading to the next, particularly if their diet is constant. This is the same as reliability in assessment: the extent to which a test produces consistent outcomes each time it is sat. In the same way you wouldn’t want your scales to add or take away a few pounds every time you weigh in, you wouldn’t want a test to produce wildly different results every time you sat it, especially if nothing had changed in your weight or your intelligence.

The problem is that it is impossible to create a completely reliable assessment, particularly if we want to assess the things that we value, like the quality of extended written responses (which, as we have already discussed, can be very subjective), and we don’t want our students to sit hundreds of hours’ worth of tests. We can increase reliability, but it often comes at a price, either in validity (assessing the things that we believe represent the construct) or in time, which is finite and could be used for other things, like teaching.

What is reliability?

There are two ways of looking at the reliability of an assessment – the reliability of the test itself, or the reliability of the judgements being made by the judges. Reliability can be calculated by comparing two sets of scores for a single assessment (such as rater scores with comparative judgement) or two scores from two tests that assess the same construct. Once we have these two sets of scores, it is possible to work out how similar the results are using a statistic called the reliability coefficient.

The reliability coefficient is the numerical index used to talk about reliability. It ranges from 0 to 1. A number closer to 1 indicates a high degree of reliability, whereas a low number suggests some error in the assessment design, or more likely one of the factors identified in the Ofqual list below. Reliability is generally considered good or acceptable if the reliability coefficient is at or around 0.80, though as Rob Coe points out (see below), even national examinations, with all their statistical know-how and manpower, only get as high as 0.93! And that was just the one GCSE subject.

How to identify the reliability of an assessment

There are four main ways to identify the reliability of an assessment, each with their own advantages and disadvantages and each requiring different levels of confidence with statistics and spreadsheets. The four main methods used are:

  • Test–retest reliability
  • Parallel forms reliability
  • Split-half reliability
  • Internal-consistency (Cronbach’s alpha)

Test-retest reliability

This approach involves setting the same assessment with the same students at different points in time, such as at the beginning and end of a term. The correlation between the results that each student gets on each sitting of this same test should provide a reliability coefficient. There are two significant problems with this approach, however. Firstly, there is the problem of sensitivity to instruction. It is likely that students would have learnt something between the first and second administrations of the test, which might invalidate the inferences that can be drawn and threaten any attempt to work out a reliability score.

The other, arguably more significant, issue relates to levels of student motivation. I am guessing that most students would not really welcome sitting the same test on two separate occasions, particularly if the second assessment is soon after the first, which would need to happen in order to reduce threats to validity and reliability. Any changes to how students approach the second assessment will considerably affect the reliability score and probably make the exercise a complete waste of time.
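If you did run the same test twice, the reliability coefficient itself is just the correlation between the two sets of results, which any spreadsheet – or a few lines of Python – will give you. The scores below are invented:

import numpy as np

# Each student's score on the first and second sitting of the same test (invented data).
first_sitting = [34, 28, 41, 22, 37, 30]
second_sitting = [36, 25, 43, 24, 35, 31]

# The test-retest reliability coefficient is the Pearson correlation between the two sittings.
reliability = np.corrcoef(first_sitting, second_sitting)[0, 1]
print(f"test-retest reliability = {reliability:.2f}")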

Parallel forms reliability

One way round these problems is to design a parallel forms assessment. This is basically where one assessment is made up of two equal parts (parallel A and parallel B), with the second half (parallel B) performing the function of the second assessment in the test-retest approach outlined above. As with test-retest, correlations between student results from the parallel A and parallel B parts of the test can provide a reliability figure. The problem now is that, in reality, it is difficult to create two sections of an assessment of equal challenge. As we have considered, challenge lies in the choice of a question, and even the very best assessment designers don’t really know how difficult an item is until real students have actually tried answering it.

Split-half reliability

Perhaps the best way to work out the reliability of a class assessment, and the one favoured by Dylan Wiliam, is the split-half reliability model. Rather than waste time attempting the almost impossible – creating two forms of the same assessment of equal difficulty – this approach skirts round the problem by dividing a single assessment in half and treating each half as a separate test.

There are different ways the assessment can be divided in half, such as a straight split down the middle or separating out the odd and even numbered items. Whatever method is used, the reliability coefficient is worked out the same way: by correlating the scores on the two parts and then taking account of the fact that each part only relates to half the test by applying the Spearman-Brown formula**. This then provides a reasonable estimate of the reliability of an assessment, which is probably good enough for school-based assessment.

The formula for applying Spearman-Brown in Excel is a little beyond the scope of my understanding. Fortunately, there are a lot of tools available on the Internet that make it possible to work out reliability scores using Spearman-Brown’s formula. The process involves downloading a spreadsheet and then inputting your test scores into cells containing pre-programmed formulas. The best of these is, unsurprisingly, from Dylan Wiliam himself, which is available to download here. Rather handily, Dylan also includes some super clear instructions on how to use the tool. Whilst there are other spreadsheets available elsewhere that perform this and other functions, they are not as clean and intuitive as this one.
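For anyone who would rather see the moving parts than trust a spreadsheet, here is a rough Python sketch of the same calculation. The half-test totals are invented; the only steps are the correlation between the two halves and the Spearman-Brown step-up:

import numpy as np

# Each student's total on the odd-numbered items and on the even-numbered items
# (invented scores for illustration).
odd_half = [14, 11, 18, 9, 16, 12, 7, 15]
even_half = [13, 12, 17, 10, 14, 11, 8, 16]

# Step 1: correlate the two half-test scores.
half_test_r = np.corrcoef(odd_half, even_half)[0, 1]

# Step 2: apply Spearman-Brown to estimate the reliability of the full-length test:
# r_full = 2r / (1 + r).
full_test_reliability = (2 * half_test_r) / (1 + half_test_r)

print(f"half-test correlation  = {half_test_r:.2f}")
print(f"split-half reliability = {full_test_reliability:.2f}")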

Internal-consistency reliability (Cronbach’s alpha)


At this point, I should point out that I am fast approaching the limits of my understanding in relation to assessment, particularly with regards to the use of statistics. Nevertheless, I think I have managed to get my head around internal-consistency reliability enough to use some of the tools available to work out the reliability of an assessment using Cronbach’s alpha. In statistics, Cronbach’s alpha is used as an estimate of the reliability of a psychometric test. It provides an estimate of internal consistency reliability and helps to show whether or not all the items in an assessment are assessing the same construct. Unlike the easier to use – and understand – split-half method, Cronbach’s alpha looks at the average value of all possible split-half estimates, rather than just a single split.

It uses this formula:

α = (k / (k − 1)) × (1 − Σ item variances / variance of total test scores), where k is the number of items in the assessment.

If, like most people, you find this formula intimidating and unfathomable, seek out one of the many online spreadsheets set up with Cronbach’s alpha, ready for you to enter your own assessment data into the cells. Probably the most straightforward of these can be found here. It is produced by Professor Glenn Fulcher and it allows you to enter assessment results for any items with a mark of up to 7. There are instructions that tell you what to do, and they are quite easy for the layman to follow.
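Alternatively, if your results are already sitting in a script rather than a spreadsheet, alpha can be computed directly from its definition. This is a minimal Python sketch with invented item scores:

import numpy as np

# Rows are students, columns are items (invented scores for illustration).
scores = np.array([
    [2, 3, 1, 4],
    [1, 2, 1, 3],
    [3, 3, 2, 4],
    [0, 1, 0, 2],
    [2, 2, 1, 3],
])

k = scores.shape[1]                              # number of items
item_variances = scores.var(axis=0, ddof=1)      # variance of each item across students
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of students' total scores

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")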

16. Make sure everyone understands the limitations of assessment

Given that no school assessment which measures the things we value or involves any element of human judgement is ever likely to be completely reliable, the time has probably come to be more honest about this with the people most impacted by summative tests, namely the students and their parents. The problem is that in reality this is incredibly hard to do. As Rob Coe jokes, can anyone imagine a teacher telling a parent that their child’s progress, say an old NC level 5, is accurate to a degree of plus or minus one level? Most teachers probably haven’t even heard of the standard error of measurement, let alone understand its impact on assessment practice enough to explain it to a bewildered parent.
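The arithmetic behind such a ‘plus or minus’, at least, is not complicated. In classical test theory the standard error of measurement is commonly estimated as SEM = SD × √(1 − reliability), and from that an approximate confidence interval can be put around an individual score. A short Python sketch, with all figures invented:

import math

# Invented figures: a test whose scores have a standard deviation of 12 marks
# and an estimated reliability coefficient of 0.85.
standard_deviation = 12
reliability = 0.85

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
sem = standard_deviation * math.sqrt(1 - reliability)

# Approximate 95% confidence interval around one student's observed score.
observed_score = 63
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem
print(f"SEM = {sem:.1f} marks; 95% interval roughly {low:.0f} to {high:.0f}")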

The US education system seems rather more advanced than ours in relation to reporting issues of error and uncertainty in assessment to parents. This is a consequence of the Standards for Educational and Psychological Testing (1999). These lay out the extent to which measurement uncertainty must be reported to stakeholders, which US courts follow in their rulings and test administrators account for in their supplementary technical guides.

A 2010 report commissioned by Ofqual into the way assessment agencies in the US report uncertainty information when making public the results of their assessments showed an impressive degree of transparency in relation to sharing issues of test score reliability. Whilst the report notes that parents are not always directly given the information about assessment error and uncertainty, the information is always readily available to those who want it, providing of course they can understand it!

‘Whether in numbers, graphics, or words, and whether on score reports, in interpretive guidelines (sometimes, the concept is explained in an “interpretive guide for parents”), or in technical manuals, the concept of score imprecision is communicated. For tests with items scored subjectively, such as written answers, it is common, too, to report some measure of inter-rater reliability in a technical manual.’

To my knowledge we don’t really have anything like this level of transparency in our system, but I think there are a number of things we can probably learn from the US about how to be smarter in sharing with students and parents the complexity of assessment and the inferences it can and cannot provide. I am not suggesting that the example below is realistic for an individual school to replicate, but I like the way that it at least signals the scope for grade variation by including confidence intervals around each of its assessment scores.

[Example US score report showing a confidence interval around each assessment score]

There is clearly much we need to do to educate ourselves about assessment, and then we may be better placed to educate those who are most affected by the tests that we set.

The work starts now.

*  The answer to the question is: John, where Paul had had ‘had’, had had ‘had had’. ‘Had had’ had had a clearer meaning.

** The Spearman–Brown prediction formula, also known as the Spearman–Brown prophecy formula, is a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length.

 

Principles of Great Assessment #1: Assessment Design


This is the first in a short series of posts on our school’s emerging principles of assessment, which are split into three categories – principles of assessment design; principles of ethics and fairness; and principles for improving reliability and validity. My hope in sharing these principles of assessment is to help others develop greater assessment literacy, and to gain constructive feedback on our work to help us improve and refine our model in the future.

In putting together these assessment principles and an accompanying CPD programme aimed at middle leaders, I have drawn heavily on a number of writers and speakers on assessment, notably Dylan Wiliam, Daniel Koretz, Daisy Christodoulou, Rob Coe and Stuart Kime. All of them have a great ability to convey difficult concepts (I only got a C grade in maths, after all) in a clear, accessible and, most importantly, practical way. I would very much recommend following up their work to deepen your understanding of what truly makes great assessment.

1. Align assessments with the curriculum


In many respects, this first principle seems pretty obvious. I doubt many teachers deliberately set out to create and administer assessments that are not aligned with their curriculum. And yet, for a myriad of different reasons, this does seem to happen, with the result that students sit assessments that are not directly sampling the content and skills of the intended curriculum. In these cases the results achieved, and the ability to draw any useful inferences from them, are largely redundant. If the assessment is not assessing the things that were supposed to have been taught, it is almost certainly a waste of time – not only for the students sitting the test, but for the teachers marking it as well.

Several factors can affect the extent to which an assessment is aligned with the curriculum and are important considerations for those responsible for setting assessments. The first is the issue of accountability. Where accountability is unreasonably high and a culture of fear exists, those writing assessments might be tempted to narrow down the focus to cover the ‘most important’ or ‘most visible’ knowledge and skills that drive that accountability. In such cases, assessment ceases to provide any useful inferences about knowledge and understanding.

Assessment can also become detached from the curriculum when that curriculum is not delineated clearly enough from the outset. If there is not a coherent, well-sequenced articulation of the knowledge and skills that students are to learn, then any assessment will always be misaligned, however hard someone tries to make the purpose of the assessment valid. A clear, well structured and shared understanding of the intended curriculum is vital for the enacted curriculum to be successful, and for any assessment of individual and collective attainment to be purposeful.

A final explanation for the divorce of curriculum from assessment is the knowledge and understanding of the person writing the assessment in the first place. To write an assessment that can produce valid inferences requires a solid understanding of the curriculum aims, as well as the most valid and reliable means of assessing them. Speaking for myself, I know that I have got a lot better at writing assessments that are properly aligned with the curriculum the more I have understood the links between the two and how to go about bridging them.

2. Define the purpose of an assessment first

 Depending on how you view it, there are essentially two main functions of assessment. The first, and probably most important, purpose is as a formative tool to support teaching and learning in the classroom. Examples might include a teacher setting a diagnostic test at the beginning of a new unit to find out what students already know so their teaching can be adapted accordingly. Formative assessment, or responsive teaching, is an integral part of teaching and learning and should be used to identify potential gaps in understanding or misconceptions that can be subsequently addressed.

The second main function of assessment is summative. Whereas examination bodies certify student achievement, in the school context the functions of summative assessment might include assigning students to different groupings based upon perceived attainment, providing inferences to support the reporting of progress home to parents, or the identification of areas of underperformance in need of further support. Dylan Wiliam separates out this accountability function from the summative process, calling it the ‘evaluative’ purpose.

Whether the assessment is designed to support summative or formative inferences is not really the point. What matters here is that the purpose or function of the assessment is made clear and that the inferences it is intended to produce are widely understood. In this sense, the function of the assessment determines its form. A class test intended to diagnose student understanding of recently taught material will likely look very different from a larger scale summative assessment designed to draw inferences about whether knowledge and skills have been learnt over a longer period of time. Form therefore follows function.

3. Include items that test understanding across the construct continuum

Many of us think about assessment in the reductive terms of specific questions or units, as if performance on question 1 of Paper 2 was actually a thing worthy of study in and of itself. Assessment should be about approximating student competence in the constructs of the curriculum. A construct can be defined as the abstract conception of a trait or characteristic, such as mathematical or reading ability. Direct constructs are tangible physical traits, like height and weight, which can be measured using verifiable methods and stated units of measurement. Unfortunately for us teachers, most educational assessment assesses indirect constructs that cannot be directly measured by such easily understood units. Instead, they are estimated through questions that we think indicate competence, and that stand in for the thing that we cannot measure directly.

Within many indirect constructs, such as writing or reading ability, there is likely to be a continuum of achievement. So within the construct of reading, for instance, some students will be able to read with greater fluency and/or understanding than others. A good summative assessment therefore needs to differentiate between these differing levels of performance and, through the questions set, define what it means to be at the top, middle or bottom of that continuum. In this light, one of the functions of assessment has to be to estimate the position of learners on a continuum. We need to know this to evaluate the relative impact or efficacy of our curricula, and to understand how our students are progressing within them.


4. Include items that reflect the types of construct knowledge

Some of the assessments we use do not adequately reflect the range of knowledge and skills of the subjects they are assessing. Perhaps the format of terminal examinations has had too much negative influence on the way we think about our subjects and design assessments for them. In my first few years of teaching, I experienced considerable cognitive dissonance between my understanding of English and the way that it was conceived of within the profession. I knew my own education was based on reading lots of books, and then lots more books about those books, but everything I was confronted with as a new teacher – schemes of work, the literacy strategy, the national curriculum, exam papers – led me to believe that I should really be thinking of English in terms of skills like inference, deduction and analysis.

English is certainly not alone here, with history, geography and religious studies all suffering from a similar identity crisis. This widespread misconception of what constitutes expertise and how that expertise is gained probably explains, at least in part, why so many schools have been unable to envisage a viable alternative to levels. Like me, many of the people responsible for creating something new have themselves been infected by errors from the past and have found it difficult to see clearly that one of the big problems with levels was the way they misrepresented the very nature of subjects. And if you don’t fully understand or appreciate what progression looks like in your subject, any assessment you design will be flawed.

Daisy Christodoulou’s Making Good Progress is a helpful corrective, in particular her deliberate practice model of skill acquisition, which is extremely useful in explaining the manner in which different types of declarative and procedural knowledge can go into perfecting a more complex overarching skill. Similarly, Michael Fordham’s many posts on substantive and disciplinary knowledge, and how these might be mapped on to a history progression model are both interesting and instructive. Kris Boulton’s series of posts (inspired by some of Michael’s previous thinking) are also well worth a look. They consider the extent to which different subjects contain more substantive or disciplinary knowledge, and are useful points of reference for those seeking to understand how best to conceive of their subject and, in turn, design assessments that assess the range of underlying forms of knowledge.


5. Use the most appropriate format for the purpose of the assessment

The format of an assessment should be determined by its purpose. Typically, subjects are associated with certain formats. So, in English, essay tasks are quite common, whilst in maths and science, short exercises where there are right and wrong answers are more the norm. But as Dylan Wiliam suggests, although ‘it is common for different kinds of approaches to be associated with different subjects…there is no reason why this should be so.’ Wiliam draws a useful distinction between two modes of assessment: a marks for style approach (English, history, PE, art, etc.), where students gain marks for how well they complete a task, and a degree of difficulty approach (maths, science), where students gain marks for how far they progress in a task. It is entirely possible for subjects like English to employ marks for difficulty assessment tasks, such as multiple choice questions, and maths to set marks for style assessments, as this example of comparative judgement in maths clearly demonstrates.

[Image: an example comparative judgement task in maths]

In most cases, the purpose of assessment in the classroom will be formative and so designed to facilitate improvements to student learning. In such instances, where the final skill has not yet been perfected but is still very much a work in progress, it is unlikely that the optimal interim assessment format will be the same as the final assessment format. For example, a teacher who sets out to teach her students by the end of the year to construct well written, logical and well supported essays is unlikely to set essays every time she wants to infer her students’ progress towards that desired end goal. Instead, she will probably set short comprehension questions to check their understanding of the content that will go into the essay, or administer tests on their ability to deploy sequencing vocabulary effectively. In each of these cases, the assessment reflects the inferences about student understanding the teacher is trying to draw, without confusing or conflating them with other things.

In the next post, I will outline our principles of assessment in relation to ethics and fairness. As I have repeatedly made clear, my intention is to help contribute towards a better understanding of assessment within the profession. I welcome anyone who wants to comment on our principles, or to critique anything that I have written, since this will help me to get a better understanding of assessment myself, and make sure the assessments that we ask our students to sit are as purposeful as possible.

Thanks for reading.

 

 

ResearchED Brighton: inside out not bottom up


I have been to several ResearchEd events, but I have to say that I thought yesterday’s conference in Brighton was the best one, at least in terms of the amount and quality of ideas I took away with me. The high standard of the speakers certainly helped, as did the deliberate decision to make the event more intimate. It really did make a difference to be able to ask questions of the speakers and to share reflections during breaks. Once again, a big well done and thank you to Tom Bennett and Hélène Galdin-O’Shea, and to the University of Brighton hosts for offering up such a splendid and amenable venue.

If previous ResearchED events have been characterised by a bottom up approach to the use of research in schools, today seemed to be more about working from the inside out – a slightly nuanced adjustment to the metaphor of grassroots teacher professional development that I think better captures the way in which inquiry – in all its different guises – helps to grow the individual and, in turn, develop the organisation. However you frame the metaphor of what’s going on in educational circles at the moment, these events sure do beat the stale training days in expensive hotels of yesteryear.

The keynote session was delivered by the charismatic figure of Daniel Muijs. His very pertinent presentation was about the extent to which it is possible to reliably measure teacher effectiveness. Drawing upon a range of international research, including some of his own as well as the large-scale study into measuring teacher effectiveness conducted by the Bill and Melinda Gates Foundation, Muijs outlined the complex issues surrounding evaluating the performance of teachers. It was very clear that whilst for every measure there are advantages to be had, these often come at a considerable cost and lead to many significant undesirable consequences.


Whilst the negative effects of using lesson observation for summative judgements are legion, Muijs did outline some of the ways in which it is possible to make them more effective, particularly if you are willing to invest the time, care and resource necessary to develop a coherent framework, such as the Charlotte Danielson model, and to train observers adequately on how to use it effectively. Even then, for observation to meet adequate standards of reliability and validity, somewhere between 6 and 12 observations per teacher per year are required. I doubt there are many schools up and down the country willing or able to invest that much resource into observing every member of staff throughout the course of the year. The conclusion was that whilst some kind of balance of measures is probably best, this is still far, far from being perfect.

I was glad I stayed in the main hall for the next session, even though that meant missing out on what I later heard was an excellent session by Becky Allen on avoiding some of the pitfalls of testing, tracking and targets. In the main lecture hall Louise Bamfield and Paul Foster introduced the Research Rich Schools website, the result of an initiative from the National College for Teaching and Leadership, which commissioned a group of teaching school alliances to develop a framework research and development tool in collaboration with the RSA. I haven’t had a chance to properly investigate the site yet, but it promises to be an excellent resource, not only for designated Research Leads, but more broadly for teachers and organisations interested in developing their engagement with research and inquiry a stage further. The different levels of emerging, expanding and embedding seem helpful for supporting schools at different phases of development.

The next session was led by Andy Tharby on the ways in which his school, Durrington, have formed a partnership with Brighton University to support their teachers in running robust small-scale research projects. Originally the talk was to be co-presented by Brian Marsh, the school’s ‘critical friend’ from the university and, from what I gathered, a great bloke and fantastic storyteller. Unfortunately, Brian had to pull out at the last minute, but Andy carried on undeterred. Perhaps I am a little biased – I rate Andy’s blog and think he is excellent company – but it was really interesting to learn how his school are building up their engagement with research by matching it at different levels to teacher interest and expertise. Whilst he admits it is still in its embryonic stage, the many benefits of having a professional researcher to support, challenge and guide classroom teachers in conducting their own classroom inquiry were clear.

I don’t usually think of educational conferences in terms of their comedy value, but James Mannion’s presentation was a hoot! A combination of his own humorous and engaging style and the benefits of a smaller, more interactive audience made this session both informative and enjoyable. James has spent the past six months or so working on developing an efficient and meaningful way to bridge the gap between educational research and classroom practice. He believes that ‘all teachers should systematically be engaged with professional inquiry’ and has developed a platform for this to happen. The Praxis pilot platform, ‘launched’ at the previous Research Leads conference in Cambridge, provides an excellent online space for teachers to upload their own research inquiries, where they can then be shared and critiqued by others.

What I particularly like about James’s project is the way in which he has thought extremely carefully about how to make the whole process as efficient and as user-friendly as possible. There is an inquiry planner which follows a helpful format for thinking about and organising small-scale research.

  • Title
  • Context
  • Research Question(s)
  • Brief literature review
  • Avenue of inquiry
  • Research methods (how are you going to collect data?)
  • Findings / analysis
  • Conclusions
  • Evaluation


Whilst I am not fully convinced about the overall aim of getting all teachers to be systematically engaged with professional inquiry (perhaps I simply need to know more about the terms of this statement), I find the sentiment behind it laudable and the effort expended on the project nothing short of remarkable. I can already think of several ways of incorporating James’s platform into the professional inquiry options on offer at my school. James will probably disagree, but I do see value in having a continuum of research options available for classroom teachers to engage with as part of their professional development. For James, the word Praxis, defined by Freire as ‘reflection and action upon the world, in order to transform it’, has much less baggage in educational circles than concepts like Lesson Study, practitioner-led research and disciplined inquiry. I am not so sure, and as Nick Rose pointed out, if anything it contains more of a trace of Marxist ideology. Anyway, for some, the small-scale teacher-friendly Praxis model will be great; for others, models implied by the terms ‘disciplined inquiry’ and ‘lesson study’ may be more appropriate. Perhaps it is all semantics.

My day ended with Nick Rose’s wonderful session on the different research tools he has developed to better facilitate teacher inquiry. In his role as research lead and leader of the coaching programme at his school, Nick has produced a number of excellent resources to support the coaching process and help teachers better understand what is going on in their classrooms. Some of these tools, all of which Nick stressed were for formative purposes only, included a classroom climate log, the use of student surveys and structured prompts to encourage focused self-reflection on targeted areas of professional development.

For me, Nick’s session provided a lovely counterpoint to the findings about lesson observation made in Daniel Muijs’s keynote, namely with regards to the different possibilities afforded to the profession by using observation as a formative practitioner tool rather than a high-stakes judgement mechanism. I liked many of the structured observation protocols Nick has developed on the back of Rob Coe’s work in relation to ‘thinking hard’ about subject content and poor proxies for learning. It was clear how these teaching and learning behaviours could be used as more proximate indicators of learning than the ones more commonly associated with the Ofsted framework, particularly within a supportive coaching framework.

Those of you familiar with Nick’s fantastic blog, Evidence into Practice, will already know that Nick is an astute and incredibly meticulous thinker. His real life presentation style is equally impressive and I came out of his session with my head bursting with ideas. I can’t remember being so intellectually stretched by the complexity and range of ideas on offer in a session before, so when Nick announced at the end that ‘he has only just got started with this work’, I joined with everyone else in spontaneous laughter. Has there ever been such an example of ironic self-deprecation before? Probably not.

This was a wonderful day with wonderful people.

Thank you to all at ResearchED.