Education Week: Educational Assessments are Valid, Reliable, and Remarkably Predictive
liberal dismissal of the power of these tools is 100% politics, 0% science
This is the third post in (the first annual?) Education Week at freddiedeboer.substack.com. I acknowledge that most any week could be called an education week on this blog.
One of the most frustrating recent developments in American politics is that progressive people have suddenly become deeply skeptical of the whole field of educational assessment and measurement - not its uses or the appropriate role of testing, which are inherently political questions, but whether we know how to accurately measure educational indicators at all. (For example.1) This is most commonly and directly expressed as resistance to the SATs, but opposition to educational assessment writ large has become a key part of liberal cultural identity and is a growing concern in leftish climes, driven in part by affluent parent resistance to state standardized tests that frequently confuses fair questions about the extent of testing and its uses with misguided swipes at their effectiveness in assessing and predicting student ability. I say that this resistance is frustrating because it comes from a toxic combination of ignorance and self-righteousness, will not solve any problems, and involves the constant repetition of demonstrable falsehoods.
To put it simply: liberal skepticism towards educational assessment is totally unjustifiable through reference to the evidence. Liberals speak with utter confidence about the supposed lack of predictive validity of these instruments but present no evidence whatsoever to back up these claims. They constantly repeat canards like “these tests only measure how well you take the tests,” which mostly have no content and to the degree that they do are demonstrably untrue. They attack the tests for revealing racial stratification without acknowledging that literally all educational data (grades, SAT/ACTs, state standardized tests, NAEP, graduation rates, disciplinary rates, and a vast number of ancillary indicators) show racial stratification, which suggests that the tests are more valid rather than less. They pretend to have methodological critiques without being minimally informed about the actual methodologies through which these tests are constructed and validated. And they do all of this in service to a vague and unhelpful egalitarianism and under dubious anti-racist pretenses, when neither broader social equality nor racial justice will actually be served by their efforts.
In order to make my case today, I’m going to frame this as a brief lesson in some of the basic concepts of educational assessment, and use this frame to argue that in fact we have the ability to make remarkably powerful statements about current student ability, and future student performance, with our assessments. Here are some definitions. These are all in my own terms, so you can be sure some people will disagree with how I’ve laid them out here.
Construct & operationalization. An assessment’s construct2 is the attribute, ability, or knowledge it is intended to measure. The construct of a driving test, when you go to get your license, is driving ability - can you safely pilot a car according to the laws and norms that govern the road? This can be more complicated than it appears. A reading test’s construct is, obviously, the ability to read. But there are multiple aspects to being able to read. Comprehension is the obvious one - has the reader absorbed the information and message from a reading passage that the author intended? But speed is also a criterion; if the test subject does in fact comprehend a paragraph, but it takes them three hours to do so, that is relevant to the question of whether they can be said to be a competent reader. Constructs are always contingent upon a given topic’s parameters and are, in many contexts, political. Whether someone knows how to do long division will always be less debatable than whether they know “American history.” But this does not mean that there is no such construct, nor that we can’t do better or worse at measuring it.
Operationalization, if you’ll forgive the hideous term, is how constructs are put into practice in an assessment. When you write a test, you are operationalizing a construct - creating the means through which it will be assessed. Take the construct of “writing,” as (until recently) measured on the SAT. In that case, the construct was operationalized through a 25-minute essay exam, where the test subject wrote a brief essay in response to a prompt, and that essay was then graded by two test raters hired by ETS, according to a set of rating criteria, and scored in a few minutes on a 1-6 scale; that rating was then converted to the 200-800 scale common to the SAT and integrated with your other scores. The construct of writing can also be operationalized via a 250-page dissertation, written over two years and assessed by a four-person committee that reads it over the course of six weeks. So how you operationalize a construct obviously matters a great deal. Whether a given operationalization of a given construct is appropriate is why we talk about validity and reliability, as discussed below.
Which constructs to measure can be a contentious question, as can operationalization of said constructs. We have debates about the curriculum that are, in part, debates about what constructs to measure. These debates can hopefully involve public input through the regular channels of democratic engagement. Few people would doubt, however, that any assessment regime should involve the kind of basic numeracy and literacy testing common to virtually all K-12 standardized exams, the NAEP, and entrance exams like the SAT and similar. Operationalization questions can also have a political dimension, particularly in questions about what constitutes adequate knowledge of US History and what topics belong on a test of such. But the nitty gritty of operationalization has to largely fall in the hands of experts due to the technical details of this process, which are considerable.
Validity. Validity in testing is one of those terms, like “heritable” or “irony,” where no matter how you define it someone will try and pull your card over it. Traditionally, a common definition of a test’s validity has been whether that test measures what it purports to measure. That is, a test of reading is valid in the more traditional sense if it actually provides practically useful information for whether a test-taker can read and to what degree, as defined by the constructs it attempts to measure. Similarly a vocabulary test, in these terms, would be valid if its results could be extrapolated to say meaningful things about the test subject’s overall real-language vocabulary. Because academia demands that all simple definitions be troubled, this way of thinking about validity is now often derided as “face validity3.” In its place, we have validity as a complicated concept with many facets, which test theorists delight in expanding over time, and an understanding that validity is a vector and not a destination - a test can be more or less valid but never summatively valid in a simple sense.
Among the large number4 of identified components of validity are predictive validity, or the degree to which a test score can be used to make accurate predictions of future assessments or events (for example, do driver’s tests predict the odds of getting in a future crash?); concurrent validity5, or whether two or more different assessments correlate with each other (for example, the SAT and ACT are entirely different tests with different sections and scales yet their scores are remarkably highly correlated; useful because independent teams working to measure the same general constructs are less likely to have erred in test construction than one team); discriminant validity, or the ability of an assessment to demonstrate differences between populations that we think should be identifiably different (for example a test that could consistently discriminate between students educated at a college-prep high school and those educated at a vocational high school); content validity, which refers to expert opinion that a test contains content that is appropriate for a given tested domain (for example a panel of American history professors and teachers deciding that a state test does in fact ask the right questions about American history). There are more. So many more.
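For the quantitatively inclined, here’s what the correlational flavors of validity cash out to in practice. This is a minimal sketch in Python with invented scores, not data from any real administration: concurrent validity is commonly estimated as nothing fancier than the correlation between two instruments intended to measure the same construct.

```python
# A minimal sketch of how concurrent validity is often quantified: the
# correlation between two instruments meant to measure the same construct.
# All scores below are invented for illustration.
import numpy as np

# Hypothetical scores for ten students on two different entrance tests
test_a = np.array([1180, 1340, 1020, 1450, 1260, 980, 1390, 1110, 1300, 1230])
test_b = np.array([  25,   30,   21,   33,   27,   19,   31,   23,   29,   26])

# Pearson correlation: values near 1 mean the two tests rank students very
# similarly, which is taken as evidence of concurrent validity
r = np.corrcoef(test_a, test_b)[0, 1]
print(f"concurrent validity estimate (Pearson r): {r:.2f}")
```

The same basic move - correlating scores against a later outcome such as first-year GPA rather than against a second test - is how predictive validity is typically estimated.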
As you can imagine, people argue about this stuff constantly. (This resource contains a definition of concurrent validity that’s totally different from what I’ve encountered in my coursework and reading, but who am I to say if it’s wrong?) And while there are quantitative methods to help guide validity decisions, validity is inherently a somewhat philosophical concept and always requires a certain degree of judgment call. Still, the fact that there is a vast literature about this, and many experts working in this field, should increase your confidence that we know what we’re doing and are getting better at it. And as we’ll discuss in a moment, the average professionally-produced educational assessment in the 21st century provides remarkably useful information for discriminating between different people’s various academic abilities. We have an idea of “knowing how to read.” We recognize that there’s variability within that condition and different subskills worth investigating, but we also know that there is a certain objective quality of being able to read English text and extract understanding from it. We know how to create tests that tell us who can do that - and, given that the Gates Reading Tests first appeared in the mid-1920s, we have for nearly a hundred years. You can introduce a lot of sophistry on Twitter and say things like “but what does it really mean, to be able to read?!?”… but I don’t know why you would.
The fact that liberals have decided that we do not know how to create valid educational assessments does not mean we actually do not know how to create them.
Reliability. Reliability refers to an assessment’s ability to consistently measure its given construct across administrations without deviations that erode its validity. If I give a test that produces one set of results on Friday afternoon but a different set on Sunday night, that test has low reliability; if I give a test to charter school students that provides different results when given to public school students who are of actual equal underlying ability, that test has low reliability. There are limits to how reliable we want a test to be; no one expects the SAT to sort 5 year olds into different tiers of ability with the same accuracy as it sorts 17 year olds, because in order to make those discernments the test would have to be less effective as an assessment of those 17 year olds or prohibitively long. A test’s reliability is thus a function of the definition of its construct. There are many ways to measure reliability in assessments, such as test-retest reliability - does the same test subject perform about the same on the assessment when retested? - and inter-rater reliability - do two people rating a test (such as an essay-writing test) provide the same rating for the same student response? Test creators like ETS invest major resources into measuring the reliability of their instruments, and these statistics have proven to be important evidence when the tests are legally challenged. Reliability is also bound up with the question of norm referencing vs criterion referencing, or the definition of the referent of success, which I have written at length about here.
One common reliability challenge lies in the argument that tests reflect cultural assumptions that are not universally shared among test-takers - that is, that they are culturally biased. These arguments are, unfortunately, typically not rigorous or even well-defined. But efforts are always underway to address them; the SAT has been rewritten multiple times to attempt to remove supposed cultural bias, and IQ instruments like the Raven’s Progressive Matrices tests remove language entirely to attempt to minimize cultural influence. (These efforts have not closed perceived gaps.)
Reliability is more quantitatively-determined and thus less debatable than validity. (Many of the same experts who will never say that a test is valid, rather than more or less valid, are willing to say that a test is reliable.) There are reliability statistics that tell us the degree to which our tests are functioning the way they should be. For example, metrics for measuring inter-rater reliability can tell us the degree to which two raters of assessments agree, not just in terms of frequency of exact matches but strength of average disagreement, direction of disagreement, and degree of agreement attributable to chance; a point-biserial correlation can be used to make sure that performance on individual test items more or less corresponds with performance on the test writ large. (That is, if you have an individual item on which the highest overall scorers perform unusually poorly relative to the lowest, that can be an indicator of a poorly-worded question.) Reliability is considered a necessary precondition of validity, but the converse is not true. A test can be invalid but reliable - it provides the same bad information regardless of administration or test taker. Most professionally-developed tests are remarkably quantitatively reliable.
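To make two of those statistics concrete, here’s a toy illustration in Python. The ratings and item responses are made up, and the implementations are bare-bones versions of the standard formulas, not anything a testing company actually ships: Cohen’s kappa, one common way of expressing inter-rater agreement corrected for chance, and the point-biserial correlation between a single right/wrong item and the total score.

```python
# Toy reliability statistics on invented data: Cohen's kappa for inter-rater
# agreement beyond chance, and a point-biserial item-total correlation.
import numpy as np

# Two raters scoring the same eight essays on a 1-6 scale (hypothetical)
rater1 = np.array([4, 3, 5, 2, 4, 6, 3, 5])
rater2 = np.array([4, 3, 4, 2, 5, 6, 3, 5])

def cohens_kappa(a, b):
    """Observed agreement between two raters, corrected for chance agreement."""
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement implied by each rater's marginal distribution of scores
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.2f}")

# Point-biserial correlation: do examinees who got a given item right (1)
# rather than wrong (0) also tend to score higher on the test as a whole?
item_correct = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])
total_score  = np.array([82, 75, 54, 90, 61, 77, 49, 58, 85, 70])
r_pb = np.corrcoef(item_correct, total_score)[0, 1]
print(f"point-biserial item-total correlation: {r_pb:.2f}")
```

A strongly negative point-biserial on an item - the strongest examinees overall are the ones missing it - is exactly the red flag described above.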
The fact that liberals have decided that we do not know how to create reliable educational assessments does not mean we actually do not know how to create them.
Predictive power. That entrance tests like the SAT and similar are not predictive of future performance has become a matter of holy writ among progressives. I mean that both in the sense that they believe it fervently, as if it were holy writ, and in the sense that it is based on no evidence, like holy writ. In fact those entrance tests, and national diagnostic exams like the NAEP, and state standardized tests of K-12 learning, and even humble in-class content exams are all typically quite predictive. Modern educational and intelligence testing is remarkably predictively powerful and has been for decades. The evidence for this stance is frankly overwhelming.
I have already written repeatedly in this space regarding the predictive power of the SATs, so you can check out those posts for more information. Put simply, the SAT provides strong predictive power not only about college performance but all manner of later academic and life outcomes, including outcomes that only occur decades after testing such as receiving a PhD, holding a patent, or earning tenure at a university. The claim that the SAT is not predictive, or only tells you how well the test-taker takes the test, is simply and unambiguously false. The SAT is predictively valid, and so is the ACT, and the GRE, and the LSAT, and the MCAT, and the GMAT…. We know how to create predictively valid entrance tests. Suddenly deciding that we don’t, by public affirmation, is very strange.
All manner of tests allow us to make strong predictions well into the future. Assessments of early childhood literacy and numeracy provide useful predictive information for academic performance into adulthood. Tests of knowledge of fractions and division given to 10 year olds are remarkably effective predictors of later math skills, even when adjusted for demographic factors like parent income and education, overall academic performance, and working memory. Prior-knowledge tests delivered before a battery of introductory science courses are strongly predictive of student achievement. Tests like the NAEP don’t merely predict more straightforward attributes like reading ability or fluency in arithmetic but more complex constructs like performance in the visual arts6. (Does the NAEP predict college performance, despite that not being its intended function? Probably.) Can we predict 8th grade reading skill with early childhood literacy assessments? Sure. Can we use such early reading skills as effective predictors of future performance even if we use an input variable as coarsely-gradated as low/medium/high group? Sure. Performance on previous academic assessments predicts performance in med school and subject tests predict graduate school achievement. New York exam school entrance tests are highly reliable and reasonably predictive of performance. An underdiscussed element of high-stakes testing is the kind of placement exams used in most community and junior colleges, such as ACCUPLACER and COMPASS; they provide strong predictive information for that use. Could I go on linking to studies that demonstrate that earlier assessments effectively predict later performance? Yes, but I’ll spare you.
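For readers wondering what “predictive even when adjusted for demographic factors” looks like under the hood, here’s a hedged sketch: simulate some data, regress a later outcome on an earlier test score plus a control, and read off the coefficient on the score. Every number, variable name, and effect size here is fabricated purely for illustration; the real studies linked above use far richer models and actual longitudinal data.

```python
# A fabricated illustration of "predicts later performance, net of
# demographics": ordinary least squares with the earlier test score and a
# control variable in the same design matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 500

parent_income_z = rng.normal(size=n)                          # control (standardized)
fraction_test_z = 0.4 * parent_income_z + rng.normal(size=n)  # earlier test score
later_math_z = 0.6 * fraction_test_z + 0.2 * parent_income_z + rng.normal(size=n)

# Design matrix: intercept, earlier test score, control
X = np.column_stack([np.ones(n), fraction_test_z, parent_income_z])
coef, *_ = np.linalg.lstsq(X, later_math_z, rcond=None)

# If the score carried no information beyond family income, this coefficient
# would collapse toward zero; here it recovers the simulated effect (~0.6)
print(f"coefficient on the earlier test score, controlling for income: {coef[1]:.2f}")
```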
As Dochy et al. said in a 2002 paper,
research has shown that prior knowledge is not only the best predictor of learning outcomes in the sense of knowledge acquisition (Dochy et al., 1999), but also in the sense of problem solving (Dresel et al., 1998). Moreover, in forms of cooperative learning and peer teaching, prior knowledge seems to influence learning positively (O’Donnell and Dansereau, 2000). Also using contrasting cases has been shown to help students in generating prior knowledge and facilitating the learning process (Schwartz and Bransford, 1998).
We know which students possess prior knowledge, of course, because of how they have performed on educational assessments. We know how to measure student performance in academic domains in a way that predicts future performance in those domains. We just do, as politically inconvenient as some may find that fact.
Our predictive abilities grow even stronger if we consider IQ tests and other tests designed to measure not specific academic domains but general cognitive ability. IQ tests are remarkably effective at predicting academic performance, even many years down the line, but they tell us much more. They tell us who will succeed at work. They tell us who is more likely to commit a crime. They tell us who will live longer, in part because they tell us who will have healthier eating and exercise habits. They tell us who will do better in all manner of random life situations, such as who will have a more successful career as a trucker. Cognitive ability, IQ, g, however you want to define it - it’s remarkably helpful in a vast range of human domains, and we are reasonably good at measuring it. The notion that this threatens egalitarianism or human rights depends on a false understanding of what kinds of equality we should assume in any humane system. Nobody ever said that we’re all equal in our abilities, not even the most radical egalitarians in history.
The fact that liberals have decided that we do not know how to create predictive educational assessments does not mean we actually do not know how to create them.
None of this should be remotely controversial, yet denying what’s presented here has become very common in progressive spaces despite plenty of work in the national media debunking such myths. Why?
I’ve said it many times: conservatives went crazy, and it ruined liberals. In general, because the other side is so crazy, liberals have come to feel no pressure to be informed, coherent, or internally consistent in their own views, as they rest easy in the assumption that the other side will always be worse. In particular, conservatives have voiced a lot of genuinely anti-science arguments, such as in support of creationism or in denial of climate change, which has created a cultural assumption within liberalism that its members are always on the side of science simply by dint of not being conservatives. This has led to the embrace of flatly unscientific positions among progressives who self-define as those who “follow the science,” such as the widespread denial that obesity has serious negative health consequences; they see themselves as the pro-science side, by definition, so feel no need to critically review such unscientific beliefs. And this results in the condition we see with educational assessment: liberals believe that these tests don’t work for purely political reasons, and because they have come to see their instincts as perfectly in line with science in all circumstances, they feel no compulsion whatsoever to determine whether those instincts actually have any basis in fact. And Trump accelerated all aspects of liberal addiction to culture war, making it that much harder to get them to accept things that they don’t want to hear - like the plain fact that educational testing works.
I talk to liberals about this stuff both online and IRL and say, OK, when you say these tests don’t measure anything, what are you talking about? What’s your specific critique? Why do you believe that tests that are quite predictive don’t predict anything? What do you suppose the point of educational testing is, if not to make observations about current student ability that provide useful predictive information about later student performance? Why do you speak with such utter confidence about topics that you clearly are not informed about, and in expressing opinions that are always at least highly debatable and usually just wrong? They never have a substantive reply. Never. They just get mad. Because they’ve become habituated to the idea that the facts are always on their side. And I don’t think that’s any healthier than a conservative who’s certain that the minimum wage kills jobs because his understanding of Econ 101 tells him so.
I hate to keep repeating myself: educational testing has become unpopular because of what testing reveals, which is deep inequality along racial and class lines. These inequalities are indeed lamentable, but they can’t be wished away by refusing to measure them with the tools we have available to us. Nor does it do any good to insist, as so many now do, that any racial inequalities in testing mean the tests are racist, any more than finding higher levels of lead in Black children’s blood means that lead testing is racist. What’s more, recognizing the power of these tools does not require us to surrender to an endless census testing nightmare in K-12 schools. We can recognize the usefulness of these instruments while still debating what level of testing is appropriate, and while using strategic small-sample tests like the NAEP to reduce the need for mass testing. We can use inferential statistics to understand broad trends and use census testing on a limited basis, in a way that respects teacher autonomy and the emotional wellbeing of our children. But it makes no sense to opportunistically pretend that these tests don’t work, in an effort to deny uncomfortable realities. I recognize that there are always political dimensions to every aspect of these assessments. But to suddenly decide that these tests are invalid based on emotional discomfort with their outcomes does no one any good.
Educational research is hard. Let’s not throw out the part that we’re best at because of short-term political fads. And let’s not pretend that we can preempt difficult conversations about education and our system by refusing to find out who’s actually academically prepared.
“The real intelligence test is whether you wear a mask” is self-parody of the highest order.
The term “construct” is, in a sense, an acknowledgement that what we are measuring in educational testing is not as objective as mass or voltage etc.
Some will say “face validity is one aspect of validity among many” and then proceed to treat face validity as an unsophisticated anachronism.
People love to say “THE FIVE TYPES OF VALIDITY” or “THE NINE TYPES OF VALIDITY” with the voice of God, knowing full well that other people disagree about how many types of validity there are.
Whether concurrent validity, convergent validity, and criterion validity are three separate types of validity or different names for the same thing is the type of debate I enjoy not having an opinion on.
This study notes that “Students who have access to full-time art specialists perform significantly higher than those who do not have educators prepared to teach art.” This is the type of scenario that liberals allege invalidates the outcomes of such measures. But that’s wrong. Of course students with dedicated arts teachers do better in visual art! It would be much more disturbing if they didn’t. The fact that there are equity concerns associated with the outcomes of educational assessments does not invalidate those assessments, just like racial performance gaps don’t invalidate the assessments but rather reveal real-world inequalities - just like they are intended to.
Following yesterday's post, is there a chance that some of this anti-testing spittle on the liberal left is partly redirected anxiety over high-end college admissions? Or even bad-faith attempts to transition towards "softer" metrics like GPA, club membership, and letters of rec because they're easier for mediocre rich kids to game than the SAT?
I have this hunch, purely anecdotal, that the most vociferous anti-testers are rich white parents who publicly scream about helping poor black kids but privately worry more about their bumbling failson with SAT scores far below the median for incoming freshmen at their elite alma mater.
I'd add to this in a few ways. One is to note that the liberal consensus was decidedly pro-testing in the 90s and 2000s, but that the way in which large scale state testing rolled out under NCLB was modeled on Texas - which, FDB pointed out, may not have been the success story it claimed to be.
Another is that the testing became a tool in the hands of various reformers. Clearly it's the teachers unions who are to blame, I mean, look at these test scores! Enter Scott Walker, Chris Christie, Mike Bloomberg. Right alongside those reformers came expanded voucher programs, more charter schools, and pushes against collective bargaining. So, while the testing was finding various inequalities - something that liberals thought would provide evidence for locating and improving instructional practices, thus outcomes - the policies that seemed to result from testing's evidence base threatened teachers and school systems' historic power as well as local control (charters and vouchers being very much outside of local board control in most places). So the liberal intent behind testing was perceived as being turned to somewhat libertarian and market-oriented purposes popular until recently on the right.
Finally, testing came with a veneer of corporate control. As FDB noted in the post kicking off edu week, Gates brought Common Core into being almost single handedly and with surprisingly little resistance (imo this is because, even then, lots of educators were still feeling like testing revealed a problem, so curriculum goals could solve it). The other big thing the Gates Foundation did was roll out curriculum aligned to Common Core in districts around the country and try to implement evidence-based instruction to, well, end educational inequality. It didn't work and Gates more or less lit money on fire for about a decade. But teachers and districts felt burned (especially when the CSAIL report said teachers and districts were the reason Gates' project failed).
So, at least coming from the perspective of the schools and their teachers/aides/admins, nearly two decades of constant changes, instability, and ruthless pay/benefits/job security cuts seemed to be the results of these tests. And the supposed upside, the renaissance of data-driven best-practices and all the other hyphenated buzzwords, failed to materialize. Anyone remember teachers being forced to kneel under desks because their test scores didn't go up and then getting fired when the scores didn't go up again? How about the ATL scantron parties where district leadership just falsified the scores?
The tests themselves are not to blame, obviously. And they are not some warped instrument of white supremacy wielding western colonialist mathematics against black and brown bodies (to borrow the parlance). But they were the justification that schools heard again and again for what turned out to be a whole package of reforms that didn't work but made life in many schools much worse. This was especially true in the schools that were the lowest performers.
So you can see why testing gets lumped in with all the bad stuff. Justified or not, it comes from somewhere and it's not only unpopular because of what the testing reveals but also because of what the testing was used for.