Debating the merits of merit pay

Glenn Rowley and Lawrence Ingvarson have a piece in today’s Age, criticising my recent study on teacher effectiveness. It’s not online, so I’ve pasted it over the fold, along with a letter I’ve sent to the Age in response.

Incidentally, Ingvarson’s 2007 report on teacher performance pay is here. It’s worth reading, though I think it gives short shrift to the substantial economics literature on merit pay, omitting to mention excellent research by Victor Lavy, Thomas Dee and David Figlio, for example.

Teacher study fails the test  
Glenn Rowley and Lawrence Ingvarson
4 June 2007, The Age

Why a plan to link performance to remuneration is flawed.  

MOST parents recognise that good teachers are worth their weight in gold. There is little debate about the need to place more value on teachers’ work, for example, by providing substantial pay rises to teachers as they attain higher standards of performance. This is unlikely to happen, however, unless we become better at evaluating teacher performance in ways that are valid, reliable and fair.  

Recent research by Andrew Leigh (“Study reveals teacher skill discrepancy”, The Age, 21/5) has been widely reported as demonstrating that the best teachers can be readily identified, and have been shown to be twice as effective as the worst teachers. The Australian reported that the study “has successfully linked teacher performance with student results, bolstering the Federal Government’s efforts to introduce performance-based pay”.  

Dr Leigh stated in the conclusion to his paper that he “has shown how to estimate a measure of teacher performance by using panel data with two test scores per student”.  

Readers could be forgiven for thinking that the research opens the way for the “best” teachers to be identified and rewarded on the basis of their students’ test performances, as they were at the end of the 19th century. It does nothing of the sort.  

Dr Leigh examined the test scores of three cohorts of Queensland government school students in literacy and numeracy. Each cohort included about 30,000 students – three-quarters of the students who actually took the tests in government schools. The focus of his study has been reported as gains in achievement over two years, but this was not so. Dr Leigh examined the changes in relative positions of classes of students within the overall state results from year 3 to year 5 (two cohorts) and from year 5 to year 7 (one cohort).  

Not surprisingly, he found that some classes improved their position within the state results, while others went in the opposite direction. This, of course, was inevitable. For every class that gained in its relative position, another had to go down. This is the nature of relative data.  

But what grabbed the headlines was the assertion that the classes that improved their relative position must have had the “good” teachers, while those whose relative position declined must have had the “bad” teachers. If you follow the same logic, you would conclude that Leigh Matthews was not a successful coach of the Brisbane Lions in 2002 and 2003 because he failed to improve the position of his team on the AFL ladder. They finished first in 2001, and no better than that in 2002 and 2003. Many a football fan from the southern states would have been delighted had the Lions’ management used the same reasoning and decided it was time to replace their coach.  

Just as the Lions’ failure to improve their ladder position over two years did not show that their coach was a failure, Dr Leigh’s research provides no basis for the identification of effective and ineffective teachers.  

First, the research was based on relative measures (where students stood in a statewide ranking). Inevitably, some students’ scores were bound to increase and others to decrease.  

If the literacy and numeracy tests were replaced by a coin-tossing test (tossing a coin 10 times and counting the number of heads), some students’ scores would increase, some would decrease, and some would stay the same. Some classes would go up in the rankings, some down, and very few would stay the same. Such is the nature of data.  

Second, we need to look at the nature of the data used by Dr Leigh. Students were tested in August of one year, and then retested in August two years later. In the time between tests they would have had up to three teachers: one from August to December of the first year; one from January to December of the second year; and another from January to August of the third year. If things go well, who gets the credit? If things go poorly, who gets the blame?  

Dr Leigh suggested two approaches to dealing with this (insurmountable) difficulty: to ignore the intervening year altogether, or to create an assumed test score in the intervening year, which lies at the midpoint of the other two tests. He chose the second option “to maximise sample size”. So students were assigned test scores for a year in which they had not taken a test, and their teachers were judged to be effective or ineffective on the basis of how well their students were assumed to have performed on this non-existent test.  

Neither of the approaches considered by Dr Leigh tackles the real problem – that the data contains no valid basis for linking students’ achievement growth to the performance of a single teacher. Statistical analysis, no matter how complex, cannot overcome this.  

Students learn (and sometimes fail to learn) because of a multitude of factors. A good teacher is vitally important. So are supportive parents, a school culture and a community that value and reward learning, and a school that is provided with the resources it needs to perform its role well. And school learning comes more easily to some than to others. Research that ignores these factors fails to recognise the subtlety of schooling.  

Of course, some teachers are more effective than others. There is plenty of rigorous research that shows that students’ achievement growth varies from class to class. It also varies significantly from family to family, neighbourhood to neighbourhood, and, most of all, from student to student. The impact of each of these factors cannot be assessed with the precision necessary to isolate the effect of the teacher. To suggest otherwise is to offer false hope.  

While there have been significant advances in our ability to measure educational growth, we are a long way from measures with anything like the reliability of, say, measures of growth in children’s weight or height.  

In addition, measures available so far are limited to reading and numeracy. For other areas in the primary and secondary school curriculum there are no measures to which value-added modelling could be applied in judging teacher performance. Valid evaluations of performance need to be based on evidence that covers the full range of a teacher’s responsibilities.  

Nobody should be tempted to believe that the approach used in Dr Leigh’s paper can be translated into a viable and legally defensible system for assessing the performance of individual teachers, as claimed.  

Such a system, which we believe is overdue, would need to be based instead on a range of direct evidence of a teacher’s capacity to provide quality conditions for his or her students’ learning across the curriculum. These conditions must be consistent with present research and with profession-defined standards – as in any profession. You have to look directly at what students are doing and learning in classrooms to find the valid evidence of “performance” that is needed.  

The danger with Dr Leigh’s paper is that it promises much more than it can deliver. It will be interpreted by some as evidence that there is a simple solution to the challenge of linking teachers’ pay to performance.  

The past 100 years are littered with merit pay schemes that failed, mainly because proponents did not do the hard work of developing standards that cover the full scope of what effective teachers know and do, nor the hard work of developing valid measures of teacher performance against those standards.  

A defensible teacher evaluation scheme must be based on a clear understanding about what is reasonable to hold a teacher accountable for. The appropriate basis for gathering evidence about a teacher’s performance is a set of professional standards that describe the full scope of what a teacher is expected to know and be able to do.  

Glenn Rowley and Lawrence Ingvarson are principal research fellows at the Australian Council for Educational Research.  

Here’s my letter in response.

I am grateful to Glenn Rowley and Lawrence Ingvarson for taking the time to comment on my paper on teacher effectiveness, but feel that I should correct them on three points. First, my study explicitly adjusted for measurement error, so the conclusion that the most effective ten percent of teachers can teach in half a year what the least effective ten percent can teach in a year is based on substantive differences, not luck. Second, measuring teacher effectiveness with biennial data is not impossible – just more difficult. As my results show, different approaches to this problem do not substantially alter the results. Third, while Rowley and Ingvarson reject merit pay in favour of rewards for “professional standards”, they omit to mention that my results (in common with other value-added studies) have found that teachers with a Masters degree are no more effective in raising students’ test scores.

No merit pay scheme will ever be perfect, but promising results are emerging from places as diverse as Arkansas, London and Tel Aviv. Rather than blocking all attempts at measuring teacher effectiveness, shouldn’t we be open to experimenting with various salary models in different schools, putting the various claims about performance pay to the test?

Dr Andrew Leigh
Australian National University
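
A postscript for readers who want to see the mechanics: below is a minimal sketch, in Python, of the kind of value-added calculation at issue. The toy data and variable names are mine, not the Queensland dataset, and the paper itself estimates teacher fixed effects in a regression and adjusts them for measurement error, rather than using this simple difference-of-means version.

    # Toy illustration of a two-step value-added calculation.
    # Assumption: each student has a normalised grade 3 and grade 5 score,
    # and a grade 4 score is imputed at the midpoint of the two tests.
    import pandas as pd

    # Hypothetical data: student id, grade 4 teacher, normalised scores.
    df = pd.DataFrame({
        "student":    [1, 2, 3, 4],
        "teacher_g4": ["A", "A", "B", "B"],
        "score_g3":   [-0.5, 0.2, 0.1, 1.0],
        "score_g5":   [0.1, 0.6, -0.2, 0.9],
    })

    # Impute the missing grade 4 score by linear interpolation.
    df["score_g4"] = (df["score_g3"] + df["score_g5"]) / 2

    # Gain attributed to the grade 4 teacher.
    df["gain_g4"] = df["score_g4"] - df["score_g3"]

    # A naive "teacher effect": each class's mean gain relative to the
    # overall mean gain. The paper instead uses regression-based fixed
    # effects with an adjustment for test measurement error.
    effects = df.groupby("teacher_g4")["gain_g4"].mean() - df["gain_g4"].mean()
    print(effects)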


18 Responses to Debating the merits of merit pay

  1. Kevin Cox says:

    By all means run some trials with volunteers. That is, allow teachers to agree to be assessed on their effectiveness through measurement of their students’ performance. You will get many teachers agreeing if their pay cannot drop, and if it is agreed that the records of teacher performance are private to the teacher, are not available for any other purpose, and will be destroyed if the teacher so desires.

    At the same time, run some trials where other models are used – such as the school getting some extra money rather than individuals, or people getting awards rather than money, or students getting prizes. In other words, look a bit more holistically at the problem and test other incentives and forms of recognition as well as personal money incentives.

  2. christine says:

    Can’t run a randomised experiment with volunteers – selection problems, big time. It’d be lovely if someone would run some randomised trials with various different types of incentives, of course, and I’m sure monetary incentives aren’t the be-all and end-all. Any economist working in the area would be thrilled at the idea. So which education department is going to do it?

  3. Sacha says:

    I’m an employee of the organisation that Glenn Rowley and Lawrence Ingvarson are part of – needless to say, these thoughts below are strictly my own.

    One aspect of educational measurement that I’ve become aware of is that the practical problems of obtaining useful data for analysis can be considerable. The practical problems of, say, testing all Yr 5 kids in a single state at the start of a school year and then testing them again at the end of the school year to determine educational growth are great. One would need significant resources to do this, as well as the agreement of school authorities. It would be a very large task even if it covered only a sample of schools.

  4. Sacha says:

    To determine a student’s educational growth, one would need to have a measuring stick and measure the location of each student along this stick at two different times.

    Was this information available?

  5. Andrew Leigh says:

    Sacha1, tests are cheaper than you think. Check out this Hoxby piece, for example.

    Sacha2, the measuring stick was available. Though I normed each test for the purposes of looking at the distribution of teacher value-added, I then went back and looked at the raw tests, which the Queensland authorities say can be compared from year to year.
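
    For the curious, “norming” here just means rescaling each test administration to mean zero and standard deviation one. A toy sketch in Python (my own made-up numbers, not the Queensland data):

        # Toy sketch of norming a test: rescale raw scores within each
        # test year to mean zero and standard deviation one (z-scores).
        import pandas as pd

        raw = pd.DataFrame({
            "year":  [2003, 2003, 2003, 2005, 2005, 2005],
            "score": [410.0, 455.0, 500.0, 430.0, 490.0, 520.0],
        })

        # Normalise within each test administration.
        raw["z"] = raw.groupby("year")["score"].transform(
            lambda s: (s - s.mean()) / s.std()
        )
        print(raw)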

  6. Sacha says:

    Andrew – they’re cheap if you have the funds to do the testing! The organisational effort involved in obtaining useable data is immense. I see the work involved in it at my workplace!

  7. Sacha says:

    Also, constructing good tests is non-trivial and expensive.

    I possibly wasn’t clear – for each student, do you have a number that measures their “educational growth” (however this is defined) between the two tests?

    When a student takes a test in August in Yr X and then another test in August in Yr X+2, there is a period of two years between the tests. There may be developmental factors that mean that educational growth is, on average, non-linear over the two years. Is this the case, and if it is, how has it been factored into the study?

  8. Sacha says:

    Andrew, I hope you haven’t forgotten my question!

  9. Andrew Leigh says:

    Sacha, I only have two data points per student. I’d love to have three, but the Qld database didn’t go that far back. So I do two things – I either assume linearity and impute a score in the non-test year, or I drop the teacher from the non-test year. Both methods produce similar results.
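
    In toy form, the two options look something like this (illustrative numbers only, not the actual estimation code):

        # Two ways of handling the missing middle-year test.
        # Toy data: normalised scores taken two years apart.
        score_g3, score_g5 = 0.25, 0.75

        # Option 1: impute a grade 4 score by linear interpolation, so each
        # of the two intervening teachers is credited with half the gain.
        score_g4 = (score_g3 + score_g5) / 2
        gain_teacher_g4 = score_g4 - score_g3   # 0.25
        gain_teacher_g5 = score_g5 - score_g4   # 0.25

        # Option 2: drop the teacher from the non-test year and attribute
        # the full two-year gain to the tested year, with a smaller sample.
        gain_two_years = score_g5 - score_g3    # 0.5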

  10. Sacha says:

    What happens if you assume non-linearity and create imputed data points for the middle year with that assumption?

  11. Sacha says:

    Creating imputed data points seems like introducing made-up data – is this a reasonable view?

  12. Sacha says:

    Andrew, I’m reading through your paper and thought I’d take the liberty of making some constructive criticisms that strike me:

    “A second complication is that tests are administered just after the middle of the school year (the school year runs from January to December, and the tests are administered in August). In the case of a child who takes tests in the middle of grade 3 and the middle of grade 5, it is therefore possible that the grade 3 teacher contributes to both tests. Under most plausible assumptions, this will introduce only attenuation bias into estimates of the teacher fixed effects terms. To the extent that teachers focus their attention on the test administered in their year, or the test is based on material taught in that grade and the preceding grade, the attenuation bias introduced by using mid-year tests will be smaller than otherwise.”

    In general, the Yr 3 teacher would contribute to a child’s results in the Yr 5 test.

    If the Qld tests are similar to those in other states, the tests will probably be based on the material in the curriculum taught up to the time of the test (or a bit beforehand) so the Yr 5 test would necessarily include Yr 3 material as the Yr 5 material builds on the Yr 3 material.

  13. Sacha says:

    “Setting the standard deviation of the student test score distribution to one gives the teacher fixed effects a straightforward interpretation. For example, a teacher with a fixed effect of one raises her students’ test scores on average by one standard deviation, relative to all other teachers. Naturally, because the average change in student test scores is zero, the average teacher fixed effect is also zero (ie. students of the average teacher maintain their position in the relative student test score distribution).”

    Say I, as a teacher, have a “fixed effect” of 1. Then this is saying that a student who was completely average in Yr 3 would, after my teaching, be one standard deviation along the distribution at the next test. Another student who started at one standard deviation along the distribution would, after my teaching, be two standard deviations along at the next test. What’s the justification that the differences between 0 and 1, and between 1 and 2, are equivalent?

  14. Sacha says:

    By taking the difference between the tests in the middles of grades 3 and 5, wouldn’t any “teacher fixed effect” actually be the combined fixed effect of the (in general) three teachers between the middles of grades 3 and 5?

  15. Sacha says:

    Andrew – if the Qld tests are constructed like many other tests are, the “student ability” scores for students at both tails of the distribution may not be very precise.

  16. Sacha says:

    Just had a thought – why is a teacher’s “fixed effect” determined by looking at the relative changes in a student’s position in the normalised distributions? In general, once you have a common scale for the two tests on which each student has two data points (one for Yr 3 and one for Yr 5), you would imagine that the distribution of student abilities for Yrs 3 and 5 would not have the same standard deviation.

    Wouldn’t it be more informative to calculate for each teacher a distribution of differences in student ability between Yr 3 and Yr 5? Then take the mean, or whatever you want to do.
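
    Something like this toy sketch of the calculation I have in mind (made-up numbers and scale):

        # Toy sketch: for each teacher, the distribution of student gains
        # on a common scale, summarised by its mean and spread.
        import pandas as pd

        df = pd.DataFrame({
            "teacher":    ["A", "A", "B", "B"],
            "ability_y3": [45.0, 52.0, 48.0, 60.0],
            "ability_y5": [55.0, 58.0, 50.0, 70.0],
        })
        df["gain"] = df["ability_y5"] - df["ability_y3"]
        print(df.groupby("teacher")["gain"].agg(["mean", "std"]))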

  17. Sacha says:

    Andrew, I’d suggest talking to a psychometrician about analysing the results of the Qld tests.

  18. Doreen says:

    I follow your argument with interest but feel that you lost the purpose of education towards the end – you started to talk about both teachers and students as if they were machines, not complex, diverse people with lives outside of school. I have watched teachers for 35 years and they do change depending on their personal circumstances – I patiently wait for the dedicated and enthusiastic young teacher who now has a young family of their own to again blossom into a mature and dedicated teacher.

    In terms of the young people in our classes I’d rather see the funding going into ensuring better opportunities and early intervention for those who are slower to develop.

    I also agree that there are a range of intelligences and learning areas and we are selling our education system short to concentrate only on numeracy and literacy.

    Finally, I’m not sure what data you are using to state that teachers with a Masters are not better teachers. These statements are very damaging. Where was the Masters studied? What was studied? At what stage in their teaching life was that study undertaken? I cannot believe that a teacher who chooses to study for their Masters after 10 or so years of teaching will not re-assess a great deal about the work of teachers and be more informed.
