Date: 2025-07-30

These are only the people James Holmes killed; the 70 that he wounded would not fit comfortably in my newsletter.

I am indebted to my friend Xun for his help in reviewing my methodological interpretation of the offending paper; I would not have had the confidence to publish this post without his knowledge and advice.

Recently, the New York Fed published an astonishingly irresponsible analysis about involuntary psychiatric hospitalization, the practice of forcing someone into mental health treatment if they’re considered a danger to themselves or others. Researchers for the Fed found that patients who they consider “judgment calls” who are involuntarily committed to psychiatric care have significantly worse outcomes in terms of suicidality and later criminality compared to those who are not. These patients are judgment calls because they provoked discordant analysis from doctors who evaluated them; that is, one doctor said that they met the criteria for involuntary treatment, while another said that they did not. The paper argues, based on those outcomes, that involuntary hospitalization actually makes things worse in those cases. According to their findings, people who are hospitalized this way are more likely to die by suicide or overdose and more likely to be charged with violent crimes in the months following their release. I believe that this paper is deeply flawed thanks to basic issues of unbalanced psychiatric severity between groups, issues with conditioning on a collider, a quasi-experimental design that cannot address these issues because its assumptions are untrue, and a sample size that is inadequate given the scarcity of the outcomes of interest (that is, suicide, overdose, and violent crime are rare events in the data set).

Put simply, the study tries to make the causal claim that hospitalization of “judgment call” patients who receive discordant evaluations from different doctors causes worse outcomes. But if the people being hospitalized were sicker to begin with, then any comparison in post-hospitalization outcomes is contaminated by that preexisting difference. The worse outcomes might simply reflect baseline severity, not any harmful effect of hospitalization. It’s a classic confound: if illness severity affects both the treatment (hospitalization) and the outcomes (death, crime), then it’s a confounder, and failing to control for it adequately breaks causal inference. And it’s a problem here because the rates of serious mental illness are so unequal across the study’s treatment categories. The study attempts to use a quasi-experimental design to address this kind of shortcoming, but success depends on two assumptions we have every reason to doubt - that patients are randomly assigned to doctors and that doctors only affect outcomes through the decision to hospitalize, neither of which is convincing. Let’s get into it.

Who Are These People and Who’s Evaluating Their Work?

None of the study’s three authors have any meaningful training in psychiatric medicine. Natalia Emanuel is a microeconomist for the Fed of New York whose “research focuses on understanding the informal labor market, workplace policies that help firms and workers thrive, and child welfare and criminal justice systems.” Pim Welle is an engineer whose background is in using satellite data in environmental science. Valentin Bolotnyy is an economist and fellow with the far-right Hoover Institution, which employs a mix of cold warriors, market fundamentalists, and right-wing libertarians. None of them appears to have any serious exposure to psychiatric training or medicine; given their very impressive academic credentials (credentials I conspicuously don’t have) they almost certainly do not have long-term experience in inpatient care themselves (experience I conspicuously do have). It’s rather remarkable that the Fed of New York would publish such a paper at all, given the lack of relevant experience or expertise among the researchers and the fact that the research is so far outside of their ordinary purview. The paper has not received peer review from a major psychiatric journal. This work is essentially unrefereed.

The Problem with the Quasi-Experimental Design

The authors didn’t run a traditional randomized controlled trial, where you flip a (metaphorical) coin to decide who gets hospitalized and who doesn’t. In fairness, that would be ethically and practically impossible. Instead, they used a quasi-experimental design, which here could be considered a kind of natural experiment. The experiment comes from how patients are assigned to different doctors in emergency rooms. When someone comes into the ER for a psychiatric evaluation, they’re randomly (or semi-randomly…) assigned to one of several doctors. Some doctors are more likely to hospitalize people than others. So, by comparing outcomes for people who just happened to see a more “hospitalize-happy” (strict) doctor versus a less “hospitalize-happy” (lenient) one, the authors try to estimate the effect of being hospitalized.

This is called a “judge IV” design. “IV” stands for “instrumental variable,” a statistical trick used to isolate the causal effect of a treatment when you can’t run a proper experiment. Such designs are useful because there are a lot of scenarios where you just can’t run a true experiment, such as this one. But these designs are also fragile. For this method to work, several things need to be true:

The doctor a given patient sees really is random.

Doctors only affect outcomes through the decision to hospitalize, not in other ways.

In particular, we absolutely need the types of patients seen by lenient doctors to not be meaningfully different from those seen by strict doctors.

These are big assumptions, and if they don’t hold, the results can be biased. And I think five minutes of reflection will demonstrate the problems here.

The authors claim that patients are randomly assigned to doctors based on a triage list, but they also admit that this process isn’t perfectly random. Sometimes specific doctors are assigned on purpose, for example, if a patient is a frequent visitor who’s been seen before, or if they’re a child or elderly and need a specialist. You can imagine all manner of casual, undocumented breaks from this “random” selection of doctors - maybe one doctor prefers patients from particular backgrounds and subtly pushes to see more of them, maybe a triage nurse has a grudge against a particular doctor, maybe another doctor takes strategic lunch breaks to avoid certain patients. Who knows? The Fed researchers certainly don’t. Even small breaks in the asserted randomness of doctors assigned to cases could potentially severely skew the results. If stricter doctors tend to get the most severe cases, or more experienced ones handle the hardest ones, then we can’t assume everything evens out in the wash, and note that this doesn’t have to be a formal policy of the institution for it to happen consistently. Just the fact that doctors have schedules can break this assertion of randomness; if stricter doctors are assigned to a particular time slot for nonrandom reasons (very possible) and particular patients are most likely to get evaluated at particular times of the day/week (almost certain) then hey presto, the whole design breaks and we have no practical ability to assess such tendencies in any robust way.
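The scheduling worry can be made concrete with a toy simulation. This is a minimal sketch with invented numbers, not the paper's data or the authors' code: the shift split, the share of strict doctors per shift, and the risk levels are all assumptions chosen for illustration. Hospitalization is built to have zero effect on the outcome, yet the pooled instrumental-variable (Wald) estimate comes out clearly positive, because night patients are sicker and strict doctors mostly work nights.

```python
import random

def wald_from_rows(rows):
    """Wald (IV) estimate from (z, d, y) rows: reduced form / first stage."""
    n, d_sum, y_sum = [0, 0], [0, 0], [0, 0]
    for z, d, y in rows:
        n[z] += 1
        d_sum[z] += d
        y_sum[z] += y
    reduced_form = y_sum[1] / n[1] - y_sum[0] / n[0]
    first_stage = d_sum[1] / n[1] - d_sum[0] / n[0]
    return reduced_form / first_stage

def simulate_shift_confounding(n=400_000, seed=3):
    """All probabilities below are hypothetical, chosen only for illustration."""
    rng = random.Random(seed)
    rows = {"day": [], "night": []}
    for _ in range(n):
        shift = "night" if rng.random() < 0.5 else "day"
        # z = 1 means a strict doctor; strict doctors mostly work nights.
        z = int(rng.random() < (0.8 if shift == "night" else 0.2))
        # d = 1 means hospitalized; strict doctors hospitalize more often.
        d = int(rng.random() < (0.6 if z else 0.3))
        # Night arrivals are sicker. Hospitalization itself changes nothing.
        risk = 0.05 if shift == "night" else 0.01
        y = int(rng.random() < risk)
        rows[shift].append((z, d, y))
    pooled = wald_from_rows(rows["day"] + rows["night"])
    within = {shift: wald_from_rows(r) for shift, r in rows.items()}
    return pooled, within

if __name__ == "__main__":
    pooled, within = simulate_shift_confounding()
    print("pooled IV estimate:", pooled)
    print("within-shift IV estimates:", within)
```

Within each shift the estimate sits near zero, which is the truth in this toy world. The catch is that a real analyst can only stratify on what was recorded; an undocumented pattern in who sees whom cannot be adjusted away.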

The second assumption is potentially even worse. Of course doctors affect outcomes in ways other than the decision to hospitalize! The whole logic of the study depends on doctors affecting outcomes only by deciding whether to hospitalize a patient or not. But doctors aren’t robots. They might interact differently with patients, give different advice, refer people to different outpatient services, or write different notes that influence how patients are treated later. Some doctors might have special relationships with other practitioners or programs that end up resulting in meaningful differences in treatment. If those things affect outcomes (and they do) then we can’t isolate the effect of hospitalization. We’re also picking up the effect of having a certain kind of doctor. This is what’s called a violation of the exclusion restriction, and it’s a common problem in instrumental variable designs.
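A small simulation shows how little it takes. This is a sketch under stated assumptions, not a reconstruction of the paper: doctor assignment is perfectly random, hospitalization has no effect at all, and the only thing that varies is a hypothetical `direct_doctor_effect`, a bump in risk that comes from seeing a strict doctor through some channel other than hospitalization. All rates are invented.

```python
import random

def wald_with_direct_effect(direct_doctor_effect, n=400_000, seed=1):
    """IV estimate when doctors can affect outcomes outside of hospitalization.

    Hypothetical setup: z = 1 is a strict doctor, assigned by a fair coin.
    Hospitalization has ZERO true effect on the bad outcome.
    """
    rng = random.Random(seed)
    count, hosp, bad = [0, 0], [0, 0], [0, 0]
    for _ in range(n):
        z = rng.randrange(2)                   # truly random assignment
        d = rng.random() < (0.6 if z else 0.3) # strict doctors hospitalize more
        # Risk depends on the doctor directly, never on hospitalization.
        y = rng.random() < 0.03 + direct_doctor_effect * z
        count[z] += 1
        hosp[z] += d
        bad[z] += y
    reduced_form = bad[1] / count[1] - bad[0] / count[0]
    first_stage = hosp[1] / count[1] - hosp[0] / count[0]
    return reduced_form / first_stage

if __name__ == "__main__":
    print("exclusion restriction holds:  ", wald_with_direct_effect(0.0))
    print("one-point direct doctor effect:", wald_with_direct_effect(0.01))
```

With no direct effect the estimate hovers around zero. With a one-percentage-point direct effect, the design attributes all of it to hospitalization and inflates it, because the reduced-form difference is divided by a first stage of only about 0.3.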

If patients who end up with “lenient” or “strict” doctors are systematically different in ways we don’t see in the data then the whole thing collapses. For example, if people with more resources or stronger support systems tend to show up at certain times of day (and therefore see certain doctors), then we might be picking up those differences, not the effect of hospitalization. The authors try to test for this by checking whether patient characteristics like age, race, and insurance status vary across doctors. I find this woefully inadequate as a control. I can think of all sorts of unmeasured things, like severity of symptoms, family dynamics, comorbid disorders, past trauma, etc, that could still bias the results. Indeed, I would argue that demographic characteristics are almost useless to establish homogeneity this way. What matters is psychiatric factors, more than anything, which we don’t have any real information about, and anyway all manner of lurking variables could mess this all up.

The design only works if the complier group (those influenced by the doctor’s preferences) is reasonably balanced, and it’s not.

Unequal Rates of Underlying SMI Are a Big Issue Without Random Assignment to Treatment

The research they’ve produced has issues with highly unequal rates in core variables across groups, which are supposedly addressed by that quasi-experimental design. Let’s start with a simple but very serious issue. The authors of this study compare three groups:

People who were involuntarily hospitalized.

People who were not hospitalized.

A middle group, called “compliers,” who are sort of in-between - people who would have been hospitalized by some doctors but not others.

The researchers then look at these groups to see how they compare to each other in terms of the output variables, which in this study concern the rate of patients dying from suicide or overdose and the rate of patients engaging in future criminal acts. So what’s the problem? The problem is that a key variable, the rate of serious mental illness (SMI), in each of these groups is very different.

In the hospitalized group, nearly 48% have a serious mental illness.

In the non-hospitalized group, only about 29% do.

In the middle group - the “compliers” who are the main focus of the study - it’s about 38%.

That’s a nearly 20-point spread between the hospitalized and non-hospitalized groups. That level of imbalance tells us the groups are fundamentally different in ways that almost certainly affect the outcomes of interest (suicide, overdose, violence). The authors acknowledge some of this heterogeneity (on p. 18, for example) but understate its threat to causal inference.

Imagine you’re trying to figure out whether giving someone antibiotics causes them to feel better. You give the antibiotics to people with severe infections and don’t give them to people with mild colds. Then you check to see who actually gets better. Surprise - the people with the colds recover more often! But that doesn’t mean the antibiotics made people worse or even failed to make them better. It just means that you gave them to the people who were sicker in the first place. This has been such a major part of the (often deliberate) misinterpretation of Covid-19 vaccination data that I’m surprised there isn’t better communal statistical literacy about this. Something like that is what’s going on here. The people who were hospitalized had much more severe mental illnesses than the ones who weren’t. That alone could explain why they had worse outcomes. Even if hospitalization helped, it might not be enough to cancel out how much more unwell they were to begin with. And you can’t wiggle out of this by saying that we’re only paying attention to the compliers group because it’s the selection of that group itself that breaks generalizability.
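The antibiotics logic is easy to check with a toy simulation. This is a minimal sketch with made-up numbers, not anything drawn from the paper: the share of severe cases, the treatment probabilities, and the risk levels are all assumptions. Treatment is built to help (it cuts risk by 30 percent for everyone), and the naive comparison still makes the treated group look worse.

```python
import random

def simulate_severity_confounding(n=200_000, seed=0):
    """Severity drives both treatment and outcome; treatment truly helps."""
    rng = random.Random(seed)
    # counts[(severe, treated)] = [patients, bad outcomes]
    counts = {(s, t): [0, 0] for s in (False, True) for t in (False, True)}
    for _ in range(n):
        severe = rng.random() < 0.4                        # hypothetical share
        treated = rng.random() < (0.8 if severe else 0.2)  # sicker -> treated
        risk = 0.10 if severe else 0.01                    # sicker -> riskier
        if treated:
            risk *= 0.7                                    # treatment HELPS
        bad = rng.random() < risk
        counts[(severe, treated)][0] += 1
        counts[(severe, treated)][1] += bad

    def rate(cells):
        patients = sum(counts[c][0] for c in cells)
        events = sum(counts[c][1] for c in cells)
        return events / patients

    naive_treated = rate([(False, True), (True, True)])
    naive_untreated = rate([(False, False), (True, False)])
    severe_treated = rate([(True, True)])
    severe_untreated = rate([(True, False)])
    return naive_treated, naive_untreated, severe_treated, severe_untreated

if __name__ == "__main__":
    t, u, st, su = simulate_severity_confounding()
    print("naive: treated", t, "vs untreated", u)
    print("severe only: treated", st, "vs untreated", su)
```

Compare like with like (severe patients only) and the benefit reappears. That is only possible here because severity is recorded in the simulation; when it is unmeasured or poorly measured, the naive comparison is all you get.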

Again, the issue is a confounding variable. Some third factor (here severity of illness) could be influencing both the treatment (hospitalization) and the outcome (death or crime). When that happens, it’s really hard to isolate the effect of the treatment itself. Another way to express it is with the term conditioning on a collider. The study’s sample includes only people who were brought in for evaluation under the standard of “danger to self or others.” This is a highly selective and unusual group; they’re already flagged as being in crisis, and someone like a cop, doctor, or family member has decided they might need to be forcibly treated. You could think of the group as being selected based on two broad factors: psychiatric severity (e.g., schizophrenia, suicidality) and behavioral crisis (e.g., violence, aggression, substance use). You don’t have to choose those two selection criteria, mind you, you just have to be aware that they are distinct variables that could each increase the probability of being referred for an involuntary evaluation. That means that even if psychiatric severity and violence are strongly related in the population, they might appear negatively correlated in the evaluated group, just because of how the selection process works. But of course those factors also influence who will later criminally offend or die by suicide.

That’s conditioning on a collider; it’s also a textbook example of Berkson’s paradox, a statistical curio where if you have an artificial cutpoint you can often generate specious negative correlations. A famous example is that, among NBA players, basketball ability and height are not correlated, which is surprising because of what we know about the population of NBA players. (As a group, they are very tall.) But the perceived lack of correlation is an artifact of the artificial cutpoint “among NBA players”; people who are tall but not very good at basketball have still traditionally had a path to the league - think of the Shawn Bradleys, tall stiffs who make rosters simply because of height - while people who are neither tall nor good at basketball don’t make the league and are excluded from the sample. That means that you have a quadrant that cuts against the correlation (tall but not good) while a quadrant that would increase the correlation (short and not good) is systematically eliminated from your sample. In other words, the cutpoint has manufactured the appearance of a lack of correlation. That’s Berkson’s paradox.
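The cutpoint effect can be reproduced in a few lines. This is a toy sketch, not real NBA data: height and skill are simulated standard scores with a positive correlation built in, and "making the league" is a hypothetical rule that the two added together must clear a high bar.

```python
import random

def pearson(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

def simulate_berkson(n=100_000, seed=2, cutoff=3.0):
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        height = rng.gauss(0, 1)
        # Skill is positively related to height in the full population.
        skill = 0.3 * height + rng.gauss(0, 1)
        population.append((height, skill))
    # Hypothetical selection rule: enough height, enough skill, or a mix.
    league = [(h, s) for h, s in population if h + s > cutoff]
    return pearson(population), pearson(league), len(league)

if __name__ == "__main__":
    pop_r, league_r, league_n = simulate_berkson()
    print("correlation in population:", pop_r)
    print("correlation among the selected:", league_r, "n =", league_n)
```

In this setup the correlation is positive in the full population and negative among the selected, because anyone short on one trait had to be long on the other to get in. The evaluated patients in the study are a selected group in the same sense.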

Think about the study design like this. If

Hospitalization decisions are based more on behavioral reasons (e.g., intoxication, domestic violence), and

Those not hospitalized tend to be in crisis for more perceived psychiatric severity (e.g., psychosis),

Then you get:

A hospitalized group with less illness but more outwardly risky behavior and

A non-hospitalized group with more psychiatric illness and possibly less actual behavioral risk.

Even if the assignment to doctors is quasi-random, the condition being evaluated in the first place is not randomly distributed. So later, if the hospitalized group commits more crimes, or has higher suicide rates, you can’t be sure it’s because of hospitalization; the group differences are baked in by the selection process. This is a classic weakness of designs that are not true random experiments.

The Monotonicity Assumption Seems Obviously Suspect

A monotonicity assumption is the assumption that the relationship between two variables consistently moves in one direction. The monotonicity assumption in this study says that if one doctor would hospitalize a patient, then a doctor who is more likely to hospitalize in general would also hospitalize that same patient; doctors more likely to involuntarily commit (stricter doctors) will always hospitalize a patient that would be hospitalized by a doctor less likely to involuntarily commit (more lenient doctors), and more lenient doctors will always release a patient released by stricter doctors. In simple terms, it assumes there’s a clear ranking: overall stricter doctors always hospitalize at least the same people as overall more lenient ones, and overall more lenient doctors always release at least the same people as overall stricter ones.

But in reality, that may not be true and in fact is very likely not true in individual instances. Doctors don’t all agree on what “dangerous” looks like; indeed, this is one of the most longstanding and loudest complaints of people who reject involuntary hospitalization! One doctor might be more likely to hospitalize someone for suicidal thoughts while another is more concerned with aggression. A doctor could be stricter in terms of having a higher overall rate of involuntary hospitalization but could be selectively more lenient based on given behaviors or symptoms. It’s possible that two doctors with the same overall hospitalization rate could be making very different decisions about which types of patients to hospitalize. That breaks the monotonicity assumption and means the study might be mixing together different kinds of treatment decisions, making its conclusions about “the effect of hospitalization” unreliable. This stuff adds up.
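Here is a minimal sketch of what a monotonicity violation looks like, using two hypothetical doctors and invented decision rules. The "lenient" doctor hospitalizes on suicidality, the "strict" doctor on aggression. The strict doctor has the higher overall rate, yet a sizable share of patients would be hospitalized by the lenient doctor and released by the strict one, which is exactly what the assumption rules out.

```python
import random

def count_defiers(n=100_000, seed=4):
    """Share of patients the lenient doctor commits but the strict one releases.

    Scores and thresholds are hypothetical and independent by construction.
    """
    rng = random.Random(seed)
    lenient_hosp = strict_hosp = defiers = 0
    for _ in range(n):
        suicidality = rng.random()
        aggression = rng.random()
        lenient = suicidality > 0.7   # commits about 30% of patients
        strict = aggression > 0.5     # commits about 50% of patients
        lenient_hosp += lenient
        strict_hosp += strict
        # Hospitalized by the "lenient" doctor, released by the "strict" one.
        defiers += lenient and not strict
    return lenient_hosp / n, strict_hosp / n, defiers / n

if __name__ == "__main__":
    lenient_rate, strict_rate, defier_share = count_defiers()
    print("lenient doctor rate:", lenient_rate)
    print("strict doctor rate:", strict_rate)
    print("share violating monotonicity:", defier_share)
```

Overall hospitalization rates, which are what the leniency ranking is built from, cannot reveal this; you would need to know how each doctor would have handled each patient.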

Sampling Issues

The study uses data from one county in Pennsylvania over a 9-year period. That includes about 16,600 people who were evaluated for involuntary hospitalization. On one hand, that’s a good-sized dataset; on the other hand, no one in their right mind would assume that data from one county in Pennsylvania is generalizable to the population of patients evaluated for involuntary treatment everywhere. But, you know, research is hard and I’m not going to make an isolated demand for rigor here. The researchers aren’t violating basic research standards by focusing on this one population, they’re just constrained in the way all researchers are. The bigger problem is that once you start slicing it into subgroups (hospitalized vs. not, compliers vs. non-compliers, suicides vs. not, crime vs. no crime) the actual number of outcomes becomes quite small. In fact, the base rate for some of the outcomes is really low: only about 1% of people in the study die by suicide or overdose within three months and about 3% are charged with a violent crime. When you’re dealing with such small numbers, even a tiny difference can look big in percentage terms. Random variation is very much in play at this scale. Reporting effect sizes is likely to give people an exaggerated sense of the actual human consequences here.
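Some back-of-the-envelope arithmetic shows the scale. The sketch below uses only the rounded figures quoted above (about 16,600 people, about 1% and 3% base rates); the subgroup of 1,000 patients with 10 events is purely hypothetical, chosen to show how wide the uncertainty gets once the data are sliced. The interval is a standard Wilson score interval for a proportion.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for a proportion."""
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

TOTAL_EVALUATED = 16_600                          # rounded figure from the text
expected_deaths = round(TOTAL_EVALUATED * 0.01)   # about 1% base rate
expected_charges = round(TOTAL_EVALUATED * 0.03)  # about 3% base rate

if __name__ == "__main__":
    print("expected suicide/overdose deaths:", expected_deaths)
    print("expected violent-crime charges:", expected_charges)
    # Hypothetical subgroup: 1,000 patients, 10 observed deaths.
    low, high = wilson_interval(10, 1000)
    print("plausible range for the subgroup rate:", low, "to", high)
```

In that hypothetical subgroup, the upper end of the plausible range is more than three times the lower end. A difference between two groups of this size would have to be very large before it could be told apart from noise.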

Meanwhile the outcomes they focus on (being charged with a violent crime or dying by suicide/overdose) are pretty blunt instruments. They don’t capture non-lethal self-harm, threats, drug relapse, or many of the things that might matter clinically. Who knows how many people who weren’t hospitalized and weren’t later arrested nevertheless went on to utterly ruin their lives? Who knows how many people who weren’t hospitalized and didn’t die by suicide or overdose instead went on to live lives of utter despair on the streets? Who knows how many patients might have lost their careers absent the intervention of involuntary treatment, but didn’t? We don’t know, the researchers don’t know, nor do they seem to care. Also, what about people who were helped by hospitalization and didn’t end up in crisis again, but whose improvement isn’t visible in crime or death stats? They disappear from the data. And yet I would argue they’re among the most important patients of all. The study also mostly looks at outcomes in the three months after hospitalization. That’s an awfully short timeframe; the research might catch the short-term chaos, but not the long-term effects. Some people crash after release and then stabilize. Others might struggle for years. A longer view would help.

All of this is why researchers usually say things like “we need replication in other settings” or “these results are suggestive.” But this paper is being interpreted as making the very strong claim that hospitalization causes people to die and commit crimes at higher rates. That’s a big statement to make based on relatively sparse data from a single county when the crosstabs are as small as they are.

The Authors Are Surely Aware That The Paper Will Be Interpreted in Exactly the Way It Shouldn’t Be

I must again highlight what I highlighted in the first paragraph: the authors aren’t actually estimating the effect of involuntary hospitalization in general. They’re estimating the effect for people whose cases are borderline, the “judgment calls.” I must insist on this - the paper does not and cannot comment on the influence of forced hospitalization in general. The authors are saying that among the patients whose doctors disagreed on the question of involuntary treatment, the ones who got hospitalized ended up worse off. The paper literally does not present and cannot present any information about the population of “humans who have been involuntarily committed” and “humans who have not been involuntarily committed,” which of course are infinitely larger and more important populations than the studied sample. The results quite explicitly don’t apply to everyone. People who were obviously a sufficient danger to warrant hospitalization, or obviously not, aren’t part of the story, at all, period. The study is about the grey area in the middle.

And yet I have already received this paper in my inbox a dozen times, sent by people who insist that it undermines the case for any and all involuntary treatment of mental illness. I have been a target of antipsychiatry cultists for eight years now and so I receive a steady stream of this stuff from them. If you think they don’t matter, well, I disagree. But OK, look at Dr. Awais Aftab, a psychiatrist and newsletter author who repeatedly gives antipsychiatry arguments the benefit of the doubt. Here he does exactly what the paper’s authors supposedly don’t want him to do: draw broad conclusions about the process of involuntary commitment and not just about the specific cases where doctors disagree. His headline is literally “Groundbreaking Analysis Upends Our Understanding of Psychiatric Holds.” But the paper’s authors explicitly say that we can’t generalize the finding to all psychiatric holds. Why on earth would a study explicitly and inherently restricted to consideration of patients who have provoked discordant evaluations from different psychiatrists be said to “upend our understanding of psychiatric holds”? The paper’s authors are themselves telling you very directly not to do that!

Go look for reactions to this paper and you’ll see that this is not an unfortunate and irresponsible exception but rather the norm. Most people sharing the study are insisting that it speaks to involuntary commitment in general, when everything about the design restricts its analysis purely to those cases where patients receive discordant recommendations for hospitalization. This two-step about what’s being studied and what’s being concluded from the study has real teeth. The trouble is, policy debates don’t usually happen at the margin. When a city changes its involuntary treatment policy, it’s not just targeting the narrow group of compliers. It’s often sweeping many other patients into new regulations. The grubby business of local politics and public policy simply doesn’t allow for this kind of specificity in analysis; no city councilperson is going to look at this study and say “this has to be restricted only to the clearest edge cases,” they’re just going to interpret it in light of their preexisting position on this issue. So it’s beyond risky to take this study and use it to argue for (or against) broader policies without being clear about what it actually measures.

There Are Victims

At the top you will find a photo of the twelve murdered victims of James Holmes, the Aurora, Colorado mass shooter who was the subject of a mental health evaluation prior to walking into a movie theater and opening fire; he also tried to call a crisis hotline moments before his rampage, in an effort to give himself one last chance to get help, but his call was disconnected. I say the murdered victims to differentiate them from his 70 wounded victims, several of whom were so badly hurt in the attack that their lives were permanently and seriously damaged. A collage of their faces, of course, would be very large indeed. Holmes was evaluated specifically for severe mental illness because multiple people in his life found him to be dangerous. But because the walls we’ve erected around involuntary treatment have grown so tall, he was deemed a judgment call that should be let go. Twelve human beings paid for that decision with their lives.



I could have shown you the faces of the victims of Seung-Hui Cho, the Virginia Tech shooter who was evaluated for mental illness multiple times but deemed not to meet the standard for involuntary treatment and went on to kill 32 people. I could have shown you the faces of the victims of James Huberty, who shot and killed 22 people after a psychiatric intervention failed to identify him as a severely dangerous man. I could have shown you the face of Jessica Short, a nine-year-old girl from my hometown who was hacked to death in front of a horrified crowd of onlookers because a severely schizophrenic man was inexplicably released from his involuntary treatment. I could have shown you the faces of the parents of Milton Douglas Moye, who were murdered by their son, a severely and obviously unwell man who was not involuntarily committed despite ample reason to be. I could have shown you the faces of Catherine and Robert Bardoni. All of these are people who were killed by psychotic patients who should have been involuntarily hospitalized but were not. There are of course thousands more I could refer to, victims of those who should have been forced into treatment.

But I could have gone the other way and shown you a photo of Rebecca Smith, a schizophrenic woman who was too ill to understand the threat of sleeping outside in a brutally cold New York winter and was found frozen to death in a cardboard box before the glacially-slow involuntary commitment process could be completed. Or I could have shown you a photo of Joyce Brown, who was freed from involuntary treatment by the NYCLU, to great fanfare and against the passionate disagreement of her sisters, the women who cared for her the most in her life. Within weeks she was smoking crack and living in filth on the streets again; she died there, on the streets, in poverty and utter despair. Or perhaps it would have been best to send you a photo of Michael Laudor and his pregnant fiancée. He was a model of what schizophrenic people could do and be, right up until he hacked her to death with a kitchen knife. You see, Mr. Laudor was a victim of our refusal to forcibly treat those who need it as well as a victimizer. Every patient is potentially both. Who knows what his life might have been if our society had had the integrity to force him to save his own life?

This stuff is really, really serious. The stakes here are incredibly high. I am just disgusted that an organization with the social status of the New York Fed would publish this work without any meaningful outside refereeing by people with appropriate psychiatric training, when there is a large, coordinated, ruthlessly ideological antipsychiatry movement in this country that seizes on every opportunity to undermine lifesaving psychiatric care. That “lifesaving” is not theoretical for me. I myself have been the beneficiary of the lifesaving potential of establishment psychiatry. Were it not for being forced into treatment, I am very confident that I would no longer be alive. Being forced into treatment that way hurt; everything that has happened to me since then has hurt. I don’t enjoy taking the pills. But they have saved my life, and there’s a little boy bouncing on his mother’s knee right now who, like me, would not exist if it weren’t for the power of involuntary treatment. You’ll forgive me, I hope, if I sometimes go to extremes in arguing my case. I am, after all, mentally ill.

You can read more about why we need to make involuntary treatment easier, not harder, here and here.