Glenn Rowley and Lawrence Ingvarson have a piece in today’s Age, criticising my recent study on teacher effectiveness. It’s not online, so I’ve pasted it over the fold, along with a letter I’ve sent to the Age in response.
Incidentally, Ingvarson’s 2007 report on teacher performance pay is here. It’s worth reading, though I think it gives short shrift to the substantial economics literature on merit pay, omitting to mention excellent research by Victor Lavy, Thomas Dee and David Figlio, for example.
Teacher study fails the test
Glenn Rowley and Lawrence Ingvarson
4 June 2007, The Age
Why a plan to link performance to remuneration is flawed.
MOST parents recognise that good teachers are worth their weight in gold. There is little debate about the need to place more value on teachers’ work, for example, by providing substantial pay rises to teachers as they attain higher standards of performance. This is unlikely to happen, however, unless we become better at evaluating teacher performance in ways that are valid, reliable and fair.
Recent research by Andrew Leigh (“Study reveals teacher skill discrepancy”, The Age, 21/5) has been widely reported as demonstrating that the best teachers can be readily identified, and have been shown to be twice as effective as the worst teachers. The Australian reported that the study “has successfully linked teacher performance with student results, bolstering the Federal Government’s efforts to introduce performance-based pay”.
Dr Leigh stated in the conclusion to his paper that he “has shown how to estimate a measure of teacher performance by using panel data with two test scores per student”.
Readers could be forgiven for thinking that the research opens the way for the “best” teachers to be identified and rewarded on the basis of their students’ test performances, as they were at the end of the 19th century. It does nothing of the sort.
Dr Leigh examined the test scores of three cohorts of Queensland government school students in literacy and numeracy. Each cohort included about 30,000 students – three-quarters of the students who actually took the tests in government schools. The focus of his study has been reported as gains in achievement over two years, but this was not so. Dr Leigh examined the changes in relative positions of classes of students within the overall state results from year 3 to year 5 (two cohorts) and from year 5 to year 7 (one cohort).
Not surprisingly, he found that some classes improved their position within the state results, while others went in the opposite direction. This, of course, was inevitable. For every class that gained in its relative position, another had to go down. This is the nature of relative data.
But what grabbed the headlines was the assertion that the classes that improved their relative position must have had the “good” teachers, while those whose relative position declined must have had the “bad” teachers. If you follow the same logic, you would conclude that Leigh Matthews was not a successful coach of the Brisbane Lions in 2002 and 2003 because he failed to improve the position of his team on the AFL ladder. They finished first in 2001, and no better than that in 2002 and 2003. Many a football fan from the southern states would have been delighted had the Lions’ management used the same reasoning and decided it was time to replace their coach.
Just as the Lions’ failure to improve their ladder position over two years did not show that their coach was a failure, Dr Leigh’s research provides no basis for the identification of effective and ineffective teachers.
First, the research was based on relative measures (where students stood in a statewide ranking). Inevitably, some students’ scores were bound to increase and others to decrease.
If the literacy and numeracy tests were replaced by a coin-tossing test (tossing a coin 10 times and counting the number of heads), some students’ scores would increase, some would decrease, and some would stay the same. Some classes would go up in the rankings, some down, and very few would stay the same. Such is the nature of data.
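The coin-toss point is easy to check with a quick simulation – a purely illustrative sketch, not part of either author’s analysis, with class counts and sizes invented for the example. Each “test” score is the number of heads in 10 tosses, so class rank changes between the two administrations reflect nothing but chance:

```python
import random

random.seed(0)
N_CLASSES, CLASS_SIZE, TOSSES = 50, 25, 10

def coin_test(n_students):
    # Each student's "score" is the number of heads in 10 coin tosses.
    return [sum(random.randint(0, 1) for _ in range(TOSSES))
            for _ in range(n_students)]

def class_ranks(scores_by_class):
    # Rank classes by mean score (rank 1 = highest mean).
    means = [sum(s) / len(s) for s in scores_by_class]
    order = sorted(range(len(means)), key=lambda i: -means[i])
    ranks = [0] * len(means)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# Two administrations of the coin-toss "test" to the same classes.
year1 = [coin_test(CLASS_SIZE) for _ in range(N_CLASSES)]
year2 = [coin_test(CLASS_SIZE) for _ in range(N_CLASSES)]

r1, r2 = class_ranks(year1), class_ranks(year2)
moved = sum(1 for a, b in zip(r1, r2) if a != b)
print(f"{moved} of {N_CLASSES} classes changed rank purely by chance")
```

Nearly every class changes rank between the two administrations even though no class differs from any other, which is the rank-churn the article describes.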
Second, we need to look at the nature of the data used by Dr Leigh. Students were tested in August of one year, and then retested in August two years later. In the time between tests they would have had up to three teachers: one from August to December of the first year; one from January to December of the second year; and another from January to August of the third year. If things go well, who gets the credit? If things go poorly, who gets the blame?
Dr Leigh suggested two approaches to dealing with this (insurmountable) difficulty: to ignore the intervening year altogether, or to create an assumed test score in the intervening year, which lies at the midpoint of the other two tests. He chose the second option “to maximise sample size”. So students were assigned test scores for a year in which they had not taken a test, and their teachers were judged to be effective or ineffective on the basis of how well their students were assumed to have performed on this non-existent test.
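The midpoint interpolation the article describes can be sketched in a few lines – a hypothetical illustration with invented score values, not Dr Leigh’s actual code. Note that by construction the imputed score splits the two-year gain exactly in half, so it adds no information about which year’s teacher contributed what:

```python
def impute_midpoint(score_first, score_last):
    """Assumed intervening-year score: the midpoint of the two observed tests."""
    return (score_first + score_last) / 2

# Hypothetical student: tested in year 3 and year 5, never in year 4.
year3, year5 = 412.0, 468.0
year4 = impute_midpoint(year3, year5)

# By construction, each year's "gain" is exactly half the two-year gain.
gain_year4 = year4 - year3
gain_year5 = year5 - year4
print(year4, gain_year4, gain_year5)  # 440.0 28.0 28.0
```

Whatever the student’s true trajectory, the two single-year gains computed this way are always identical, which is the sense in which the imputed score is mechanical rather than measured.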
Neither of the approaches considered by Dr Leigh tackles the real problem – that the data contains no valid basis for linking students’ achievement growth to the performance of a single teacher. Statistical analysis, no matter how complex, cannot overcome this.
Students learn (and sometimes fail to learn) because of a multitude of factors. A good teacher is vitally important. So are supportive parents, a culture in the school, a community that values and rewards learning, and a school that is provided with the resources it needs to perform its role well. And school learning comes more easily to some than to others. Research that ignores these factors fails to recognise the subtlety of schooling.
Of course, some teachers are more effective than others. There is plenty of rigorous research that shows that students’ achievement growth varies from class to class. It also varies significantly from family to family, neighbourhood to neighbourhood, and, mostly, from student to student. The impact of each of these factors cannot be assessed with the precision necessary to isolate the effect of the teacher. To suggest otherwise is to offer false hope.
While there have been significant advances in our ability to measure educational growth, we are a long way from measures with anything like the reliability of, say, measures of growth in children’s weight or height.
In addition, measures available so far are limited to reading and numeracy. For other areas in the primary and secondary school curriculum there are no measures to which value-added modelling could be applied in judging teacher performance. Valid evaluations of performance need to be based on evidence that covers the full range of a teacher’s responsibilities.
Nobody should be tempted to believe that the approach used in Dr Leigh’s paper can be translated into a viable and legally defensible system for assessing the performance of individual teachers, as claimed.
Such a system, which we believe is overdue, would need to be based instead on a range of direct evidence of a teacher’s capacity to provide quality conditions for his or her students’ learning across the curriculum. These conditions must be consistent with present research and with profession-defined standards – as in any profession. You have to look directly at what students are doing and learning in classrooms to find the valid evidence of “performance” that is needed.
The danger with Dr Leigh’s paper is that it promises much more than it can deliver. It will be interpreted by some as evidence that there is a simple solution to the challenge of linking teachers’ pay to performance.
The past century is littered with merit pay schemes that failed, mainly because proponents did not do the hard work of developing standards that cover the full scope of what effective teachers know and do, nor the hard work of developing valid measures of teacher performance against those standards.
A defensible teacher evaluation scheme must be based on a clear understanding about what is reasonable to hold a teacher accountable for. The appropriate basis for gathering evidence about a teacher’s performance is a set of professional standards that describe the full scope of what a teacher is expected to know and be able to do.
Glenn Rowley and Lawrence Ingvarson are principal research fellows at the Australian Council for Educational Research.
Here’s my letter in response.
I am grateful to Glenn Rowley and Lawrence Ingvarson for taking the time to comment on my paper on teacher effectiveness, but feel that I should correct them on three points. First, my study explicitly adjusted for measurement error, so the conclusion that the most effective ten percent of teachers can teach in half a year what the least effective ten percent can teach in a year is based on substantive differences, not luck. Second, measuring teacher effectiveness with biennial data is not impossible – just more difficult. As my results show, different approaches to this problem do not substantially alter the results. Third, while Rowley and Ingvarson reject merit pay in favour of rewards for “professional standards”, they omit to mention that my results (in common with other value-added studies) have found that teachers with a Masters degree are no more effective in raising students’ test scores.
No merit pay scheme will ever be perfect, but promising results are emerging from places as diverse as Arkansas, London and Tel Aviv. Rather than blocking all attempts at measuring teacher effectiveness, shouldn’t we be open to experimenting with various salary models in different schools, putting the various claims about performance pay to the test?
Dr Andrew Leigh
Australian National University