Getting better all the time?

In today’s SMH, senior education bureaucrat (and Sydney University adjunct professor) Paul Brock critiques my study with Chris Ryan. Here’s his oped, and here’s a news story reporting on it. There are a few errors (a. I’m described as the sole author of the study; b. although we look at numeracy data up until 2003, the oped wrongly says “He produces no 1999-2008 data.”; c. although we extensively discuss comparability of tests over time, the oped wrongly implies that we ignore the issue).

But these are relatively small quibbles. What interests me most is Brock’s contention that NSW data can be used to track performance over time. I had always thought that changes in the questions from year to year made this impossible, and that the benchmarking process failed to correct for this (eg. by using item response theory). Does anyone know the degree to which the state testing regimes used over the past decade do in fact permit accurate comparisons over time?

An aside: This is the second time that a Sydney University ed school academic has critiqued a joint study of mine without referring to Chris Ryan (here’s John Hughes, writing with Mercurius Goldstein). Now I’m beginning to wonder – do they really like Chris, or dislike me? Or does it have something to do with the fact that my mother did her PhD in education at Sydney Uni?


10 Responses to Getting better all the time?

  1. Michael W says:

    Yes they do. Item response theory (which, if you accept the theory, does allow a standards referenced approach based on banks of items that have been pre-tested for item difficulty) has been part of the NSW basic skills tests since at least the early to mid 1990s. The tests do allow comparison of results from year to year. This also applies to the NSW HSC from 2001 onwards.

    I am not clear on what you mean by “the benchmarking process failed to correct for this (eg. by using item response theory)”. Do you mean that the failure to use item response theory is a failure to correct? Or that they “failed to correct…by using item response theory” (ie, that using IRT was itself the failure)? Whichever view you intended (and I suppose the former), it is the case that the tests use that theory and purport to allow comparison between cohorts, notwithstanding different test questions.
    I understand this will also apply to the national tests being used for the first time this year.

  2. Verdurous says:

    Never mind Andrew,

    Any publicity is good publicity. Chris Ryan is probably feeling quite jealous.

  3. conrad says:

    1) He’s correct to say the ACER tests are pretty specific about what they measure (if I’m correct in guessing which ACER tests were used; I can’t access the actual document). In addition, if I’m correct, then he’s incorrect about benchmarking. The ACER tests I’m thinking of (the ones where you get a passage of writing and multiple-choice answers) have become _easier_ over time (i.e., simpler grammatical structures were used as time went on), so in fact a comparison between then and now on different tests is worse than it seems.

    2) You could probably do a linguistic/psycholinguistic analysis of the questions if you liked (which wouldn’t need response data), which would give you a different way to validate the tests apart from more statistical measures (i.e., pull out word frequencies, grammatical structures and the like, and compare them across the years). This sort of stuff is very commonly done to look at children’s vocab development and other such things, so it isn’t especially controversial.

    3) If you can dig up the data from some of the international tests the answer to your question might also be via cross-validation with other tests, although I imagine that would be a pain.

    4) It is a very weird argument to accept the NSW benchmarking over the ACER tests (the reasoning might be in the document I can’t access). There are lots of good tests out there that people use that measure all different aspects of literacy. The Woodcock and the NEALE come to my mind as decent measures for grade 3 kids. A really good study would also look at correlated things, since you get non-linear effects from different styles of teaching: some methods give you quick gains for worse later performance and other methods give you slower initial gains but better later performance (that’s known from the whole-word/phonics debate).

  4. Andrew Leigh says:

    Michael W, thanks for that information. Do you know whether the results for each year from 1996-2007 and the common questions are published on the DET website? It would be nice to delve a little further into the issue (eg. to see whether the NSW maths results can be reconciled with the fall in IEA/TIMSS from 1995 to 2003).

    (Also, to answer your question, what I meant to say in the post was that if the tests had changed from year to year, then item response would be a reasonable way of making them comparable.)

  5. Michael W says:

    1) The following seems to be at least one place on the website where a full series of results is listed from 1996 to 2006:

    bstresults.pdf

    There are also reports of the results in each year’s annual reports, which have a more detailed breakdown of the results within each year (eg percentages within skill bands). The skill bands as well as the scores are taken to be comparable from year to year based on the Rasch modelling (item response theory). The reports seem to be presented somewhat differently between some years.
    The reports are found at

    I think there is a departmental unit that would have more information.

    2) Re Conrad’s comment (1): on my understanding of item response theory, even if the absolute difficulty of the questions differs from year to year (ie if the tests are “easier”), this is not a problem so long as there are enough items difficult enough that only a very small number of students, or none, can get them correct. That is because new items are separately tested for difficulty (in another state, or overseas) and ranked against items of known difficulty. The test scores are then determined against a scale that is developed to be common between different tests. So, according to the theory, it should not matter if the average difficulty of items in one test is lower than in another.
    I think that means Conrad’s suggestion in (2) may not be possible, because the results are not raw test results (ie numbers of answers right or wrong) but scores adjusted against the difficulty scale.

    3) As to comparability between the state tests and international testing like TIMSS, the NSW tests occur at years 3, 5, 7, 10 and 12. I think TIMSS occurs at year 9 or 10, so it is perhaps most comparable with the NSW year 10 tests.

    4) The NSW pattern seems to show different degrees of improvement over time in different years: better in primary than in secondary, from memory.
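
    To make point 2) concrete, here is a minimal sketch of the Rasch idea in Python. It is an illustration only, not the procedure actually used for the NSW tests: the item difficulties are made up, and the function names and the bisection solver are assumptions chosen for the example. It shows that once item difficulties sit on a common scale, a student's estimated ability does not depend on whether the test as a whole was easy or hard.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: probability of a correct answer, given ability and
    item difficulty measured on the same (logit) scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, difficulties):
    """Expected number of correct answers on a test with these items."""
    return sum(p_correct(ability, d) for d in difficulties)


def estimate_ability(raw_score, difficulties, lo=-10.0, hi=10.0):
    """Maximum-likelihood ability for a raw score, assuming the item
    difficulties were calibrated beforehand.

    Under the Rasch model the estimate solves
    expected_score(ability) == raw_score. The expected score rises with
    ability, so plain bisection is enough for a sketch.
    """
    if not 0 < raw_score < len(difficulties):
        raise ValueError("no finite estimate for a zero or perfect score")
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item difficulties in logits (not real test data).
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5]
hard_test = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # every item 1.5 logits harder

# The same raw score means different things on the two tests.
ability_easy = estimate_ability(4, easy_test)
ability_hard = estimate_ability(4, hard_test)

# A student of fixed ability is expected to score lower on the hard test,
# but either expected score maps back to the same ability.
true_ability = 0.3
recovered_easy = estimate_ability(expected_score(true_ability, easy_test), easy_test)
recovered_hard = estimate_ability(expected_score(true_ability, hard_test), hard_test)
```

    With these made-up numbers, a raw score of 4 out of 6 maps to a higher scaled ability on the harder test than on the easier one. That is the sense in which, according to the theory, the average difficulty of a given year's test should not matter.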

  6. Andrew Leigh says:

    Michael W: thanks, that’s super-helpful. FYI, TIMSS is year 8 (they want to get kids before the school leaving age), so I guess we’d be wanting to compare it with year 7 numeracy in NSW.

    Again, it would be rather nice to see the actual questions (cf. Paul Brock’s attack on me for not printing the ACER questions in our report!).

  7. conrad says:


    (2) is complementary. You don’t need item results for it, but you do need the items (I should note that these are really factors that should be considered before designing a test of this type). The basic idea is that you pick a set of measures that describe the grammar and that matter for performance. You then see how the questions differ from year to year on these measures. This gives you some idea of whether the test is getting easier or harder (you also don’t need overlapping items from year to year).

    Obvious things to look at are word frequency, mean sentence length, and the amount of embedding in sentences. Less obvious things (to non-linguists!) would be deviations from canonical form and so on.

    For example, we know age of acquisition and word frequency play a big role in early reading and language performance, so if I had a question like this one year:

    1) There was a big dog.

    and the next year

    2) There was a big chinchilla.

    then that would show things are getting harder. I could look at very similar things where chunks of syntax are in favorable or less favorable positions (e.g., I saw a dog in the night vs. I, in the night, saw a dog).

    The basic idea is you choose a bunch of measures which should allow you to understand how the tests are differing quantitatively from one year to the next. If they do differ, then it gives you some idea about how the means should change due to the test rather than the individual.
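
    As a rough sketch of what I mean, here are two of the simpler measures (mean sentence length and mean word frequency) in Python. Everything here is an assumption for illustration: the frequency table is invented, and a real analysis would use a proper corpus-based frequency list and a better tokenizer.

```python
import math
import re

# Hypothetical per-million word frequencies (invented for this example).
WORD_FREQ = {
    "there": 2000.0,
    "was": 5000.0,
    "a": 20000.0,
    "big": 300.0,
    "dog": 100.0,
    "chinchilla": 0.1,
}
UNKNOWN_FREQ = 0.01  # assumed floor for words missing from the table


def tokenize(text):
    """Lower-case the text and pull out runs of letters as words."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Very crude sentence splitter on ., ! and ?"""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def mean_sentence_length(text):
    """Average number of words per sentence (text must be non-empty)."""
    sentences = split_sentences(text)
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def mean_log_frequency(text):
    """Average log10 word frequency; lower values mean rarer words."""
    words = tokenize(text)
    return sum(math.log10(WORD_FREQ.get(w, UNKNOWN_FREQ)) for w in words) / len(words)


year_one = "There was a big dog."
year_two = "There was a big chinchilla."

for label, item in (("year one", year_one), ("year two", year_two)):
    print(label, mean_sentence_length(item), round(mean_log_frequency(item), 2))
```

    The two items have the same sentence length, but the second has a lower mean log frequency, which is the kind of quantitative difference you would track across years.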

  8. Paul Brock says:

    Hi Andrew,

    Generally I thoroughly enjoy your pieces on your blog. You so often hit the right nails on the head and push buttons that need be pushed.

    But I need to reply to a few things you have said on your blog about me and my SMH article. When I heard you on AM and on the various ABC News broadcasts, I heard no mention of Chris Ryan – who is (and, I hope, still remains) a good friend of mine. Maybe I missed something, but it was your name that was prominent in the media dissemination. I never once heard Chris on radio or saw comments attributed to him in the print media. It was you who was most identified with the media blitz that followed your appearance on AM and various ABC News reports – so it was natural for me to mention you in my piece. I certainly was not intending to slight Chris when writing my piece.

    You allege that my following statement “Leigh’s literacy data is from 1975 to 1998. He produces no 1999-2008 data.” is wrong. Andrew, while your numeracy data was across a broader range, your literacy data was 1975 – 1998. I am correct! Perhaps you missed the word “literacy” when you read those two sentences. Clearly, those two sentences – as was my whole article – were about literacy.

    It is rather over the top of you to write “Paul Brock’s attack on me for not printing the ACER questions in our report!” I made no such “attack”: I didn’t do anything of the sort.

    What I did write was “For example, Leigh provides no comparative analysis of evidence that would prove that the degree of difficulty in the questions and the contents of the comprehension pieces used were consistently of the same standards across the 23 years.”. It was the lack of “comparative analysis of evidence” – not any failure to print the ACER questions – that I noted as regrettably absent.

    I have seen Michael Waterhouse’s comments – which should further clarify things for you. In NSW we do use item-response theory and, as Michael put it, “a standards referenced approach based on banks of items that have been pre-tested for item difficulty”. Michael has explained how this methodology enables valid comparisons of test results to be made from year to year.

    I believe that my SMH oped piece was accurate, fair and justified. From the responses that I have received from so many people, it is clear to me that what I wrote needed to be written. I resile from nothing that I wrote in that short piece.

    Next time you and / or Chris are in Sydney, why don’t we have a coffee? I know, from other things that you have written on your blog, that you and I would have more in common than we would have in ‘difference’.


    Paul (Brock)

  9. Andrew Leigh says:

    Paul, thanks for the thoughtful response. Three quick things.
    1. I always do my best in media coverage to mention my coauthor. I think all the print reports mentioned Chris, though some of the newspaper reports may not have done. But my main reason for mentioning it was a nagging suspicion that you may have written an oped about our study without having read it. Please tell me I’m wrong about this.
    2. Our study spends a lot of time talking about the comparative dimension of both the LSAY and TIMSS studies.
    3. A coffee would be delightful.



  10. Paul Brock says:

    Hi Andrew,

    Thanks for your encouraging reply. Yes, I can assure you that I read the full study which you and Chris wrote – and very carefully. Twice.

    My comments about your reliance on LSAY were based on my close reading of the study. Incidentally, along with TIMSS, the 2006 PISA results – which I have studied and discussed with my friend Geoff Masters – are also highly relevant. Perhaps we could include PISA 2006 in our Sydney coffee session.

    As I said in my earlier entry, the reason that I highlighted you was that my piece was written for the SMH – ie the context was a media one, as distinct from an academic one (so to speak). Honestly, I had never heard Chris’ name mentioned once in all of the media attention paid to your joint research. Thus, as far as the media was concerned, it seemed to me that you alone were ‘identified’ as championing the literacy ‘gloom and doom’ agenda. So, cognisant that the media had featured you so prominently, it seemed to me that for the purposes of the media ‘audience’ I should specify you.

    Now that you have my email address let’s set up a date for our Sydney coffee!


