What Do We Do with Education Research?
a troublesome category with bad incentives and contentious subjects
In grad school I knew a real character. I’m not sure of the specific program that housed him, but he was getting his PhD in some sort of combination of math and physics. I don’t know that he ever finished; he was forever griping about the priorities of the field and the politics in his program or lab or whatever. In a previous life he had been an expert in hydrological modeling, but he had decided to move on to more abstract concerns. He talked frequently about higher-dimensional geometry in a way that was totally beyond my understanding, but because he always prefaced it by saying “in layman’s terms” I would nod sagely and signal a vague sort of comprehension. (If it’s hard for you to see the connection between hydrological modeling and higher-dimensional geometry, I know how you feel.) He told me that he had turned down a lucrative offer to do quant work for Wall Street, and while this may sound like an empty boast, knowing him I absolutely believe it. I always tell people not to go to grad school, and I have an immense number of problems with academia, but I dearly miss the opportunity to meet a particular kind of person you just don’t encounter very often. After I graduated we fell out of touch.
In any event, at the time I had become consumed by reading educational research. My longtime girlfriend and I had split up over a year earlier, coursework was over, and I was receding deeper and deeper into the library stacks in preparation for my eventual dissertation. I had by this point become permanently disillusioned with my field, which I still naively referred to as “composition studies” or “writing and rhetoric.” I had chased my interest in the assessment of writing down a quantitative rabbit hole, which meant that I wanted to use a means my peers rejected (statistics) to study a subject they increasingly disdained (papers, essays, text). Thanks to the broken incentives of academia, the field was openly at war with the reason institutions were funding these programs in the first place - teaching students writing, as traditionally defined - in favor of endless dissertations on the semiotics of Harry Potter, how X is really writing (X being anything that was not in fact really writing), the same tired rehashes of Foucault, and case studies of ever-more-tightly-defined identity populations, like “The Rhetoric of Illiterate One-Legged Women of Appalachia.”
I was not sick of qualitative research or humanistic study. But I was sick of the skepticism towards the very concept of new knowledge and the relentless resistance to generalizability, especially given that society funds research largely to derive generalizable principles. Eventually I would sell a book proposal to Harvard UP about why college writing instruction is broken, a book I would never write due to the considerable mental health challenges in that period of my life.[1]
My dissertation advisor was a brilliant and friendly guy with an expansive definition of what counted as discipline-appropriate research, and he encouraged me to follow my dissatisfaction wherever it would take me. I would eventually write a dissertation about a major standardized test of college learning and the socioeconomic conditions driving its adoption. I had wanted to do a more technical analysis - I had spent the prior couple of years in statistics and methods classes and reading test theory and psychometrics - but the test developer would not give me meaningful data to perform a real analysis. Still, I had to read a lot of empirical research about standardized education assessments, and as such I was saturated in the type of research that once existed in my former field but no longer does.[2] Education research felt like a balm. Yes, this was in part because of the statistical focus; I would go to conferences and argue for the relevance and ethics of quantitative research, and while I would often encounter sympathetic individuals, the general reception was always chilly. But it was also nice to consume papers written with a sense of the practical uses of their research, papers that were not afraid to reach conclusions.
Back to my friend. We would occasionally meet at Greyhouse, the much-beloved and cheerfully pretentious coffeehouse near campus, to talk about academia, which meant complaining about how we were the only sensible people in our respective fields. I was telling him about all the ed research I was reading and he said, with his usual tact, “I don’t know why you bother.” He went on to say that he thought education research was the type of research that should largely be abandoned in favor of anecdote and lore-driven practitioner knowledge, with some major exceptions, because the basic empirical landscape was impossible: education research typically involves small effects, big variances, notoriously multivariate research conditions (this is a field where you can literally be confounded by the air conditioning), and convenience samples or otherwise presorted groupings with dubious or nonexistent randomization. My appeals to techniques like multilevel modeling were waved away; there was no way to rescue the signal from all the noise. At the time the p-value crisis was hot in the popular press, and when he said that ed research seemed almost custom-built to produce those kinds of flawed findings, it was hard to disagree.
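His complaint is easy to dramatize. Here is a minimal simulation - every parameter invented for illustration, not drawn from any real study - of a field where true effects are small and variance is big:

```python
# Toy simulation of the "small effects, big variances" trap. All parameters
# are invented for illustration. With a true standardized effect of 0.10 and
# roughly one classroom per condition, most studies find nothing, and the
# studies that do reach p < .05 badly overstate the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.10      # small, as is typical in education research
n_per_group = 30        # roughly one classroom per condition
n_studies = 10_000

n_significant = 0
significant_estimates = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        n_significant += 1
        significant_estimates.append(treated.mean() - control.mean())

print(f"share of studies reaching significance: {n_significant / n_studies:.1%}")
print(f"true effect: {true_effect}; mean magnitude of 'significant' estimates: "
      f"{np.mean(np.abs(significant_estimates)):.2f}")
```

The exaggeration is the cruel part: conditional on clearing the significance bar at this scale, published effect sizes are inflated by construction.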
I’ve thought about that conversation often in the years since. I am torn between my general edunihilism and the persuasiveness of his point, on the one hand, and the sense that an absence of rigorous research couldn’t possibly be better than the flawed research we currently have, on the other. What I am left with is the question of whether it’s possible to meaningfully sort more certainty from less - without treating work that produces less certainty as inherently of lower value than work that produces more - when actual practicing researchers will always have a direct professional incentive to represent their work as more definitive. I also wonder whether any of our findings will be truly generalizable, or whether the remarkable diversity of contexts and student populations found across schooling makes that impossible. And I wonder if teachers will ever really implement pedagogical techniques we find to be more effective, should such techniques exist, when they will often find their own lived experience contrary to what researchers say - to say nothing of the turf wars and culture issues between practitioners and researchers.
It all seems like a mess, to be frank. But I do think there is little choice but to keep going and to try to get a little better over time - while accepting that, for reasons of both method and underlying reality, effect sizes will usually remain small.
To render things in convenient list form: what are the issues with education research?
Methodological and data issues. Small effects, big variance, lots of endogeneity, lots of confounds, available samples that are frequently and systematically dissimilar from the general population, randomization that is difficult or impossible in many contexts and bogus in many others… In sheer analytical terms, this is all quite difficult.
Publication and replication issues. All of the conditions that afflict psychology in its replication/p-value crisis apply to education research, potentially even more damagingly. Very often ed researchers have big ol’ spreadsheets with tons of demographic and school variables that they can quickly correlate with outcome variables like test scores or GPA, which makes data snooping tempting - particularly given that you need to publish to get hired and tenured, and you need a significant finding to get published. (A toy illustration of the multiple-comparisons arithmetic is sketched just after this list.) And unlike psychological research, which frequently has limited real-world valence, the now widely discussed problems of p-hacking and publication bias can have large (and expensive) consequences in ed research, because policymakers draw inferences from the research and then use them to make decisions that deploy a lot of public resources.
Conflicting results facilitate selective reading. Because there are so many conflicting datasets and contradictory studies, you can always build the narrative you want by citing the research that supports your position and ignoring the research that does not.
Institutional capture and optimism bias. Education research is predominantly funded by institutions that are hungry for positive results - positive effects purported to derive from implementable pedagogical or administrative changes that would, supposedly, start to “move the needle.” The increasingly brutal competition in academia for tenure-track lines makes access to grant funding only more vital over time, and the people who control the purse strings don’t want to hear negative results. There are committed pessimists within the ed research world, but very few of them are pre-tenure or otherwise lacking in institutional security. The Gates Foundation, by sheer size alone, disciplines researchers against speaking plainly about negative findings and subtly influences the entirety of the published research record. In a very real sense the dominant ideology of the educational research world simply is the ideology of the foundations, and this is not healthy.
Accurately measured but controversial conclusions. The relationship between SAT scores and socioeconomic status is a classic example: while usually exaggerated, the correlation between SES and SAT scores is real. This is often used as an argument to dismiss the test as invalid. But there is also an SES effect in GPA, graduation rates, state standardized tests, etc., which tells us that rather than being evidence of a flawed test, the correlation reflects the uncomfortable fact that students from wealthier families actually are more college-prepared than students from poorer ones. The reasons for this are complex, but the idea that the test must be inaccurately measuring the intended construct because the outcomes say unpleasant things is obviously wrongheaded. And this dynamic permeates educational research and policy. Consider research showing that, when students are followed longitudinally using the kind of fixed-effects models that can help adjust for the limitations of purely correlational analyses, suspending students from school has weakly but significantly positive effects on their academic outcomes. It’s fair to say many people would not welcome that conclusion. This is, again, consequence-laden in a way the latest stupid fad in psychology research is not. This kind of finding can prompt the kind of controversy that can, in turn, ruin a young career. Education is a sensitive subject, and sensitivity makes clear thinking in research much more difficult.
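Before moving on to remedies, the data-snooping arithmetic from the publication-and-replication item above is worth seeing in runnable form. A toy sweep - all values invented, every predictor pure noise by construction:

```python
# Toy spreadsheet sweep: correlate many school/demographic variables against
# an outcome that is, by construction, unrelated to all of them. At the
# conventional 0.05 threshold, about one in twenty clears the bar anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_schools, n_predictors = 200, 40

X = rng.normal(size=(n_schools, n_predictors))  # "demographic/school variables"
y = rng.normal(size=n_schools)                  # "test scores" - pure noise here

spurious = 0
for j in range(n_predictors):
    r, p = stats.pearsonr(X[:, j], y)
    if p < 0.05:
        spurious += 1

print(f"{spurious} of {n_predictors} null predictors come out 'significant'")
# Expectation: ~40 * 0.05 = 2 publishable-looking false positives per sweep.
```

Two spurious hits per sweep is two potential papers, and nobody ever sees the thirty-eight dead ends.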
What do we do about these problems?
Undertake the (very nascent) research and publication reforms that are happening in other academic domains. Education research cries out for pre-registration, the practice of declaring the comparisons you’re looking for before you begin your research, to reduce (never eliminate) the negative consequences of data snooping/hacking. It would also benefit immensely from data transparency - make as much of your anonymized data and your analysis/math publicly available online as you can. (There is frequently a vague fear of this that I don’t quite understand; as near as I can tell it’s like worry over giving up industry secrets, but once your paper is published… what do you need to keep secret?) But for this to happen, you have to give the researchers a little sugar too: you have to create professional incentives, like giving credit for published null results and reducing the expectation of significant findings in hiring and tenure decisions. Unfortunately, those professional requirements are unlikely to change, which means that the bad incentives for researchers are likely to remain.
Embrace epistemological humility. Let me say upfront that for me this one is, ah, aspirational. I speak with great confidence about educational research results all the time, sometimes when I shouldn’t. The reality is that if we acknowledge that it’s just much harder to find certainty of outcomes in education research than in, say, drug discovery (which itself has major challenges!), then we should be less quick to speak about what we know and quicker to speak about what we think, suspect, or assume. There will inevitably be exceptions based on the stability and size of effects; there is no reason to deny, for example, that racial stratification exists in many educational metrics, as that data is abundant. But we also must acknowledge that there are very few findings on which most people will agree - not just on the finding but on the strength of the evidence. To me, the lack of educational mobility for individual students within the performance spectrum over time could hardly be better supported by the data, but many disagree vociferously. I’m not sure how to resolve this conflict other than to communally agree that everything we’re saying is to some degree provisional. Of course, in this (or I guess any) discursive environment the kind of commitment to limited statements I’m endorsing is a hard sell - we can wax poetic about the virtues of qualification, but if one politician says “teachers/poverty/lead is to blame for bad outcomes” and the other says “the evidence is mixed,” guess who is perceived to win the debate?
Stop using charter school lotteries as randomization mechanisms. I have written at length about the fact that we have no reason to trust that charter school lotteries actually result in genuinely random selection. There are acknowledgments of problems with charter lotteries in the literature, but there are also much-ballyhooed studies that rely on the integrity of charter lotteries when that integrity remains almost entirely uninvestigated by independent arbiters. (A minimal version of the kind of balance check an independent auditor could run is sketched after this list.)
Start foundations and think tanks that do not have the same biases or produce the same pressures. All funding sources will have biases; that’s inevitable. The problem in education is that the organizations funding research so often have the same biases. There’s the above-mentioned optimism bias. There are also issue-specific commonalities, most obviously the overwhelming pro-charter sentiment in our existing institutions. I don’t think there’s anything inherently illegitimate about such a bias at any given institution, but as in so many contexts, what you’d like is a balance of different biases so that we have a better chance of triangulating on the truth. I’ve written before about how the Official Dogma of Education came to be - billionaires do the funding and they have a certain vision of how the world works - but ultimately you can’t reform all of these institutions into having new assumptions and goals. Those things are intrinsic to those places. So what you need is new institutions that represent dramatically different understandings of what education can do and is for. Unfortunately, I have no clever idea about how to fund such institutions other than to hope for billionaires with different priors.
You can be too clever. I can’t possibly defend this sentiment adequately in this space, so I’ll have to write something longer about it at some point, but - I think there is such a thing as research design in education that gets too convoluted, and there are papers that get written because people want to show off their cool new methods. I recognize the desire to engineer a solution to the problems I’ve already laid out, and I’m not at all opposed to new techniques, but at times people have created such intricate procedures that it becomes impossible to generalize the results or to compare the findings to other research. I’m writing a requested piece for publication soonish about the (lack of a) relationship between education expenditures and test scores, and there really is something to be said for the brute force approach of regressing one number on another from huge datasets (a bare-bones version is sketched after this list). As complex as is necessary to do what you need to do - but no more.
Tell the truth. Male students, not female, now need special programs to facilitate their learning and the benefits of affirmative action. The SAT is not an income test. Charter schools do not result in meaningful learning gains compared to traditional publics. “School quality” has no impact on student performance. The racial achievement gap is not merely a wealth gap. Students who perform poorly early in life are overwhelmingly likely to perform poorly later in life. Etc. Each of these is, to some degree and to some people, a statement of profound controversy, primarily because they are seen as inconvenient to the political projects that spring up around education. I am not hypocrite enough to say that we know all of these things to be true. (I am still working on this epistemological humility thing.) But I would say that the preponderance of the evidence is quite clear in each instance. The question is, are people willing to accept conclusions that cut against their social and political desires, especially the bipartisan commitment to pretending that there’s some magic bullet that will someday solve our education problems? Based on my experience of the education research and policy world, the answer is no. But we must try anyway.
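Two of the items above lend themselves to quick sketches. First, the charter lottery balance check - the file and column names here are hypothetical placeholders, not any real dataset:

```python
# Minimal lottery-integrity balance check. File and column names are
# hypothetical. Under genuine randomization, lottery winners and losers
# should be statistically indistinguishable on pre-lottery covariates.
import pandas as pd
from scipy import stats

applicants = pd.read_csv("lottery_applicants.csv")   # hypothetical data
winners = applicants[applicants["won_lottery"] == 1]
losers = applicants[applicants["won_lottery"] == 0]

for covariate in ["prior_test_score", "family_income", "prior_absences"]:
    _, p = stats.ttest_ind(winners[covariate], losers[covariate],
                           nan_policy="omit")
    flag = "  <- imbalance worth investigating" if p < 0.05 else ""
    print(f"{covariate}: p = {p:.3f}{flag}")
```

Balance on observables can’t prove a lottery was fair, but imbalance is strong evidence that it wasn’t, and running the check is cheap.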
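Second, the brute-force expenditure regression, in something close to its entirety - again with hypothetical file and column names, and with statsmodels standing in for whatever tool you prefer:

```python
# The brute-force approach: one outcome, one predictor, huge dataset.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

districts = pd.read_csv("district_data.csv")          # hypothetical data
X = sm.add_constant(districts["per_pupil_spending"])  # intercept + spending
model = sm.OLS(districts["mean_test_score"], X).fit()
print(model.summary())  # the slope and its confidence interval are the story
```

Everything interesting lives in one slope and its confidence interval, which is exactly what makes the result easy to generalize and to compare across datasets and papers.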
That’s a start. The question moving forward is not just what means we will use to investigate educational questions. The bigger question is, what conclusions will we be allowed to draw?
[1] Part of the common resistance to seeking treatment for bipolar disorder lies in the perception among many of us that we are unusually productive while manic, a perception that is hard to shake because it is sometimes correct. My voracious reading at the time was driven in part by a long hypomanic period that would, eventually, lead to me starting on Risperdal Consta, a treatment I would stick with for barely longer than four months.
[2] It’s one of the ironies of my life that my grandfather, a progenitor of the field and one-time president of the National Council of Teachers of English, published empirical findings using polygraph machines to measure student stress levels, evaluations of standardized tests of reading, pedagogical papers on teaching the classics, and poetry. In other words, the types of research he was able to publish and use for professional advancement were far more diverse and far less practically constrained than mine were, despite the fact that the field constantly lauds itself for its interdisciplinarity.
A little off topic, but I want to rebut the idea that the finish line for educational research and reform should be getting everyone into college and an "elite" career. That is economically illiterate. A healthy economy has a wide diversity of work, and not all of those jobs are going to be in the field of computer programming. Blue collar labor is just as necessary to a well-functioning society as white collar labor. It is no failure if somebody doesn't want to go to college--that is a simple and basic diversity that society as a whole should understand and embrace.
What is remarkable to me is the belief that only white collar work should pay well and that blue collar work should naturally pay poverty wages. Why does this paradigm go largely unquestioned and unchallenged? It's factually wrong for the skilled trades, and the worship of concepts like "creativity" and "innovation" has become practically a fetish.
Really great post. “Hard science” folks like your friend will always be skeptical of social science, and a lot of that skepticism is deserved – but these are important questions, and there’s no good alternative, especially because we’re largely talking about quantitative outcomes (GPA, test scores, class rank, college admissions and performance, job salary, etc.). Ethnography can suggest hypotheses, but it can’t tell us what works at scale.
I encountered a lot of the same prejudices against statistics in grad school, mostly from people who specialized in inequality (race, gender, and LGBTQ studies). They would convince themselves it was a hate crime to even categorize people for quantitative analysis. For example, a checkbox for race/ethnicity = oppression because it reduces complex personal identities to a few categories. I would often point out that we wouldn’t even be able to talk about inequality without categories and numbers.
For me, one of the most feasible reforms is increased & improved use of randomization. We need greater tolerance for the inherent unfairness of randomizing interventions in the short term so that we can help everyone in the long term.
Another feasible one is publishing null findings--because this can be accomplished without billionaires. The discipline could come together to support a real journal for null results, or (even better) prestigious journals could dedicate a certain % of each issue to null results. A small group of big names could make this happen if they really wanted to. But they’re probably afraid of younger scholars publishing papers that disprove their pet theories.
Finally, more funding for schools to conduct their own quantitative research (in collaboration with professional researchers). Schools have access to their own student data, the ability to conduct new surveys / assessments, and the desire to find out what actually works in reality. I haven’t looked at the research closely enough to speak to the quality, but the concept behind the U Chicago Education Lab (partnering researchers with school districts) seems promising – at least, it’s better than school districts implementing expensive interventions without a plan to study the results that accounts for the many threats to internal validity.