Test Validity

Lado (1961) succinctly summarises validity in this way: “Does a test measure what it is supposed to measure? If it does, it is valid.” Six years earlier Cronbach and Meehl had introduced the ‘trinitarian’ view of validity which was dominant until the 1990s, and which Harmer used his three fingers in an unsuccessful attempt to remember in his 2015 TOBELTA Online Conference talk. Validity was seen as comprising content validity, criterion-related validity (the one which eluded Harmer), and construct validity. Messick (1989) challenged this view by drawing attention to the importance of HOW a test is used, thus shifting perspectives on validity from a property of a test to that of test score interpretation. If we follow Messick, we see validity as a judgement on the adequacy and appropriateness of inferences and actions based on test scores, and this leads to more attention being given to the social consequences of a test. Washback, ethics, administration procedures, the test environment, test-taker characteristics (emotional state, concentration, familiarity with the test task), and, perhaps most importantly, the sorting and gate-keeping roles of a test, are all aspects of validity. Furthermore, score interpretation involves questions of values, and thus the assumption that a test elicits the communicative ability of the test-taker, and then arrives at a “true”, objective assessment of that ability ignores the fact that all assessment is value-laden and ‘truth’ is a relative concept.

In light of all this, we must take a critical look at the use – and misuse – of tests. Large-scale tests are often used by the state or other authorities to ration limited resources and opportunities, and such tests are currently being used all over the world to achieve a wide range of political goals, including curbing immigration and promoting private education. Shohamy (2001) argues that “centralized systems” use externally imposed, standardized, one-shot, high-stakes tests to control educational systems by defining what kind of knowledge is prestigious. Fulcher goes further and suggests that we need to understand the political philosophies which lead to centralised or decentralised types of government, and their associated ways of using tests as policy tools. Since political philosophy is concerned with the balance between the state and the individual, Fulcher argues that “depending on where a political philosophy stands on the cline between the two, we can identify the kind of government likely to be favored, and the kind of society valued. It is my contention that it also explains (and predicts) the uses of tests that we are likely to find” (Fulcher, 2009, p. 5).

Fulcher defines “collectivist societies” as “those in which the identity, life, and value of the individual is determined by membership of the state and its institutions. Decisions are made to benefit the collective and its survival rather than its individual members.” In contrast, “modern individualism” starts from the claim that, as Locke put it, “men are by nature all free, equal, and independent”, that “no one can be subjected to the political power of another without his own consent”, and that there are limits upon the authority of the state, such that laws apply to all equally, that they protect the rights of individuals, and that laws can only be made by the legislature, which must be democratically elected. I’m not entirely happy with Fulcher’s use of these two “isms”, but it’s clear that they don’t equate with left- and right-wing politics, and, anyway, they can certainly be used to examine test use.

Collectivism and Testing

Fulcher argues that in societies that tend towards collectivism, the centralization of both educational systems and testing is a priority. Modern collectives use testing to control the educational system, to select and allocate individuals to roles or tasks that benefit the collective, and to ensure uniformity and standardization. While we might think immediately of countries like North Korea or China in this regard, Fulcher argues that established democracies are not immune from “neocollectivism”: we need look no further than the UK.

Examples of centrally controlled standards-based education systems, with a high level of control over teacher training and school learning, are not hard to find (Brindley, 2008). The clearest example is that of the United Kingdom, which has systematically introduced standards-based testing in an accountability framework that ensures total state control over the national curriculum and national tests, as well as teacher training; even educational staff are rewarded or disciplined based on national league tables (Mansell, 2007). (Fulcher, 2009, p.7).

Fulcher argues that these hyperaccountability policies are pursued by the state in an attempt to improve performance in the global market place; “the educational system is reengineered to deliver the kinds of people who will serve the perceived needs of the economy” (Fulcher, 2009, p.7).

Fulcher goes on to give the Common European Framework of Reference (CEFR) as an example of neocollectivism at the supranational level, claiming that the system is used to control language learning so as to deal with Europe’s weakened position in global markets. Fulcher claims that the CEFR is being used “as a tool for designing curricula, reporting both standards and outcomes on its scales, and for the recognition of language qualifications through linking test scores to levels on the CEFR scales.” He goes on:

We now see stronger evidence for more intrusive collectivist policy emerging in calls for claims of linkage to the CEFR to be approved by a central body (Alderson, 2007), and the removal of the principle of subsidiarity from language education in Europe (Bonnet, 2007). If realized, these changes would lead to unaccountable centralized control of education and qualification recognition across the continent. (Fulcher, 2009, p.8).

Individualism and Testing

Enlightenment individualism claims “the right of each person to be free from control or oppression from a state that acquires too much power and begins to control the lives of citizens” (Fulcher, 2009, p. 9). Fulcher is quick to point out that “this is not a right-wing position” and that “attempts to summarily dismiss individualistic critiques of test use as right-wing reactionism by labeling them “Eurosceptic” (Alderson, 2007, p. 660) …fail to engage with the social consequences of test use and misuse” (p.10).

In societies that lean towards Fulcher’s individualistic political philosophy, the state has little say in what is taught, or how it’s taught, and the role of tests is to promote personal growth, or to provide individuals with new learning opportunities. Fulcher gives these examples of the uses of tests which are in keeping with individualism:

  • The original Binet tests, designed for the sole purpose of identifying children in need of additional help.
  • Diagnostic and classroom testing, loosely defined as “low-stakes formative assessment”. “Its purpose is to act as a way of providing individual learners with feedback that helps them to improve in an ongoing cycle of teaching and learning (Rea-Dickins, 2001). In such a context Dewey’s notion of personal growth as a validity criterion is echoed by current researchers, such as Moss (2003)” (Fulcher, 2009, p.11).
  • Dynamic assessment. “In DA [dynamic assessment], assessment and instruction are a single activity that seeks to simultaneously diagnose and promote learner development by offering learners mediation, a qualitatively different form of support from feedback” (Lantolf & Poehner, 2008a, p. 273).

According to Fulcher, the general characteristics of this “individualistic paradigm” are:

  • Classroom assessment is used to help individuals to develop their own potential.
  • Large-scale, high-stakes tests are used to ensure that individuals acquire the key knowledge and skills they need to innovate in their own lives and participate in democratic societies.
  • Large-scale, high-stakes tests can also provide access to employment through the assessment of critical skills where practicing without those skills would be detrimental to others.
  • Validity is assessed in terms of the success in helping individuals to achieve their goals and develop necessary skills.
  • External systems are never imposed upon teachers.
  • Teachers are involved in defining the knowledge and skills to be taught and assessed, or design their own assessments as part of the learning process.
  • One of the criteria for success is the empowerment of professional educators to make their own judgments and decisions in their own contexts of work.

Large-scale Testing versus Classroom Assessment

In my post about Harmer’s talks on testing, I said that classroom teaching should be 100% test-free, but that there was, surely, some place for testing. When I said that, I had in mind Fulcher’s distinction between the “collectivist” uses of standardized large-scale tests, and the “individualist” classroom assessment. I think there is a place for standardised large-scale tests within the constraints of Fulcher’s individualistic paradigm, when they’re used as an index of proficiency and are intended to give test takers the opportunity to demonstrate their mastery in a range of skills and abilities so as to gain access to further education, jobs and other opportunities. Likewise, from the same perspective, I think classroom assessment is fine when it is used to make decisions about learning and teaching which result in further language proficiency. Standardised large-scale tests should not be used by the state or other authorities to carry out political objectives, and should not influence normal language classroom practice, although, in my opinion, there’s a legitimate place for well-defined exam preparation courses.

The fundamental distinction I want to draw between standardised tests and classroom assessment is the one Fulcher makes between the uses to which the two are put. As a result of these different uses, while standardized tests must be fair to all who take them, classroom assessment need not concern itself with fairness, but can instead concentrate on further growth. While collaboration in a standardized test is labelled ‘cheating’, in the classroom it is valued and praised. In standardized tests the score users are concerned with how meaningful the score is beyond the specific context that generated that score. Thus, score reliability (dependent on consistency of measurement, discrimination between test takers, the length of the test, and the homogeneity of what is tested) is of prime importance. But in a learning environment like the language classroom, we value divergent and conflicting opinion, and we often encourage it by dialogue and debate. “The only meaning we could ascribe to ‘reliability’ would be the extent to which the decisions we make for future growth are more appropriate than inappropriate” (Fulcher and Davidson, 2007, p.7).
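
For readers who want to see how score reliability in a standardized test depends on test length and on the homogeneity of what is tested, here is a minimal sketch in Python. It is not taken from Fulcher or from any testing body. It assumes two textbook formulas, Cronbach's alpha and the Spearman-Brown prediction, as stand-ins for "reliability", and the function names and the tiny score table are made up purely for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Internal-consistency estimate for a table of item scores.

    `scores` is a list of rows, one row per test taker, each row holding
    that person's score on every item (hypothetical data layout).
    """
    k = len(scores[0])  # number of items in the test
    # Variance of each item across test takers
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total scores: this reflects discrimination between test takers
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Predicted reliability if the test were made `length_factor` times longer
    with items of the same kind."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Invented example: four test takers, three items that all rank people identically
homogeneous = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(round(cronbach_alpha(homogeneous), 2))

# Doubling the length of a test whose reliability is 0.5
print(round(spearman_brown(0.5, 2), 2))
```

When every item ranks the test takers in the same way, alpha reaches its maximum of 1, and lengthening a test with similar items raises the predicted reliability. None of this applies to classroom assessment as described above, where consistency of measurement is not the point.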


In the 2015 TOBELTA Online Conference Luke Meddings reiterated his call to give tests a rest, supporting his argument with a few not particularly well-articulated, but nevertheless powerful objections to the over-dominant role that tests play in so many ELT environments. Jeremy Harmer’s response to Meddings was so poor that it provoked me to write a review of it, which in turn provoked Scott Thornbury to say that I should explain my own view. By attempting a brief summary of Fulcher’s views on those two issues, views which I completely agree with, I hope I’ve complied.

One question remains, and that’s the one Rose Bard raised concerning the Pearson Education company. In his IATEFL talk at Harrogate, Harmer said that tests were getting much better and he called the Pearson test of Academic English “bloody wonderful”, citing the “massive research” they’d done in support of this view. Rose begs to differ, and explains why in comments you can find under the now “stripped” post on Harmer. I think this deserves separate treatment, and I invite everybody to help me build a file on Pearson Education, prior to discussing their contribution to language testing and to ELT.


See Fulcher, G. (2009) Test Use and Political Philosophy. Annual Review of Applied Linguistics 29, 3–20 for all references except:

Fulcher, G. and Davidson, F. (2007) Tests in Life and Learning: A deathly dialogue. Educational Philosophy and Theory 40(3), 407–417.


13 thoughts on “Test Validity”

  1. And I thought testing could never be interesting! Thanks, Geoff, for introducing me to Glenn Fulcher’s work and in particular his distinction between collectivist and individualistic testing. I discovered that his article “Test Use and Political Philosophy” can be downloaded here: http://languagetesting.info/features/politics/tupp09.pdf and “Tests in Life and Learning: A deathly dialogue” here: http://languagetesting.info/articles/store/epat.pdf

  3. Hi Richard,
    I should have mentioned in my post that at the end of his 2009 article, Fulcher suggests a compromise, namely effect-driven testing / assessment. You can find more here:
    The Routledge volume on Language Testing and Assessment – http://cw.routledge.com/textbooks/9780415339476/links/1.aspç
    Kim & Davidson (2014) Effect-Driven Test Specifications: Approaches and Development http://onlinelibrary.wiley.com/doi/10.1002/9781118411360.wbcla107/full


    • Thanks Geoff – the idea of effect-driven testing is precisely what I found most interesting on reading (thanks to your blog post) Fulcher’s ‘Test use and political philosophy’. I did mention it at the end of my blog post. I look forward to finding out more about it – the Routledge link seems broken though. Possibly more radical than just a compromise position? I’m imagining something like making tests conform ethically to having a positive? neutral? impact on learners and teachers rather than forcing learners/teachers to conform to them? I’ll be interested to learn how / whether it can be done in practice!


      • Hi Richard,
        Language Testing and Assessment costs a mere U.S. $ 1745.00 – any language teacher should be able to afford that, shouldn’t they?
        The editor, Antony John Kunnan, does have some free chapters from another book he’s edited – The Companion to Language Assessment – on his site: http://www.antonykunnan.com/.


      • Hi Richard,

        I think Glenn Fulcher’s trip into political philosophy is interesting and well-informed, but not really a good way of focusing on how tests are used for political purposes because the 2 ends of the cline don’t correspond to left and right wing politics. His “compromise” – effect-driven testing – has lots of backing (from Bachman and Palmer among others) and this new view of validity is gaining ground fast. There’s an interesting review of the Pearson test of academic English which uses an “assessment use”, or “assessment justification” approach to validity. See here for abstract: http://ltj.sagepub.com/content/29/4/603.extract .


  4. There’s a rich literature on testing if you move outside the confines of language teaching. I’ve particularly enjoyed David Hursh’s ‘High-Stakes Testing and the Decline of Teaching and Learning’ (Rowman & Littlefield, 2008). Chapter 4 of Picciano and Spring’s ‘The Great American Education-Industrial Complex’ (Routledge, 2013) is called ‘Corporate Influences’ and has some interesting information about Pearson. Gordon Stobart’s ‘Testing Times’ (Routledge, 2008) is also worth a read. More readily available, I’d also recommend Pepi Leistyna’s ‘Corporate Testing: Standards, Profits and the Demise of the Public Sphere’ http://www.teqjournal.org/Back%20Issues/Volume%2034/VOL34%20PDFS/34_2/10leistyna-34_2.pdf

