The New Digest is pleased to present a guest post by Jack Kieffaber and Matias Mayesh. Mr. Kieffaber received his J.D. from Harvard Law School in 2023. Mr. Mayesh is a student at Harvard Law School. The post can profitably be read in conjunction with an earlier one, “Judge.AI.”
Some judges have used ChatGPT to find out what words mean. And that makes a lot of sense given that ChatGPT is just a giant statistical referendum on the median usage of every single word on the internet; if that’s not ordinary meaning, there isn’t one. But Justice Thomas R. Lee at BYU Law doesn’t care for these “AI apologists”—and he put out a critique of them in October that more people should be talking about.
His crux is that ChatGPT is not “transparent,” “replicable,” or “generalizable”—which are his personal “pillars” of textualism. He notes that GPT doesn’t provide “empirical support for its conclusions,” failing to produce “probabilistic maps” of how it gets from dataset to definition. Not transparent. He laments that GPT runs on a “black box algorithm” that prevents us from seeing its internal processes—and whether it does the same process for identical queries. Not replicable. And he reminds us that GPT essentially acts as a single human being that trains on an unseen dataset that might be unduly influenced by individual GPT employees who provide specific “reinforcement learning.” Not generalizable. Three strikes, it’s out.
These are very fair critiques of AI as a legal tool—and, indeed, of the formalist fetish for mechanizing language. But then Justice Lee does something hysterical: He tries to sell you his AI instead. And that’s where the argument goes off the rails.
Lee.AI is known in the business as corpus linguistics—a tool for academic linguists that Justice Lee and Professor Stephen Mouritsen (also of BYU) conscripted to the formalist front in the 2010s. It works about like this: A handful of professors at Brigham Young University photocopy a bunch of books with words in them—and they categorize the books by type so as to represent a “particular speech community.” Photocopied a bunch of law books? Well, that’s the American legal English speech community. Photocopied a bunch of not-law books? That’s the ordinary American English speech community. And then they note when all the books were written—1788 ordinary American English goes in one folder, 2025 ordinary American English goes in another folder.
When he needs to define a word, the Lee.AI user isolates the relevant speech community—that is, he picks the right year and, in all but five cases, smashes the “ordinary American English” button. He then types the word into the corpus search bar, hits enter, and watches the thing turn up photocopy fragments like a HeinOnline search. Then what? Well, he “develop[s] standards for distinguishing [how] the [words] were used” and “present[s] the results in a clear, open manner.”
Translation: He looks at ‘em. And he says “I think they’re using it this way”; other times he says “I think they’re using it that way.” Then he codes a big matrix that says:
Times I thought they were using it this way: 60%
Times I thought they were using it that way: 40%
The 60% has it; this way wins.
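For anyone who wants that “matrix” spelled out, here is a toy sketch in Python of what the exercise boils down to; the labels and the 6/4 split are invented for illustration and come from no actual corpus study:

```python
from collections import Counter

# Hypothetical coding results: one label per corpus snippet, as a human
# "coder" might assign them after eyeballing each concordance line.
# The labels and the 6/4 split are invented for illustration.
codings = ["this way"] * 6 + ["that way"] * 4

tally = Counter(codings)
total = sum(tally.values())

for sense, count in tally.most_common():
    print(f"Times I thought they were using it {sense}: {count / total:.0%}")

# The "empirical" verdict is just the plurality of the coder's own judgment calls.
winner, _ = tally.most_common(1)[0]
print(f"The {tally[winner] / total:.0%} has it; {winner} wins.")
```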
This might seem like we’re parodying—and we are—but what we just sketched out might actually make more sense than how Lee.AI would have rewritten the Eleventh Circuit’s opinion in US v. DeLeon, 116 F.4th 1260 (11th Cir. 2024). The question there was whether the words “physically restrain” in a statute required physical contact with the restrainee—as opposed to pointing a gun at him and saying “freeze.” Justice Lee’s approach, laid out in his article, was to open up the “Corpus of Contemporary American English” (COCA)—which contains “academic journal articles, blogs, fiction novels, magazine articles, TV/movie scripts, news articles, broadcast speech, and texts from the web” from between “1990 and 2019.”
With The Da Vinci Code and the script from Transformers 2 locked and loaded, Justice Lee then typed in “physically restrain” and hit enter. One hundred seventy-three results appeared. “Two independent coders” then read the first fifty snippets—not even the full 173—and grouped them into three categories: “indicates bodily contact,” “does not indicate bodily contact,” and “indeterminate.” Then they ran the numbers: 9/50 indicated bodily contact, 41/50 were “indeterminate.” Rough start. So they went to a bigger database—the aptly named “iWeb”—and got some five hundred results. The coders once again applied their technical wizardry of reading words and reported that 419 out of 500 instances were indeterminate. Foiled again by words.
Now trying anything and everything to jam the words peg into the math hole, Justice Lee went back to COCA and started searching for the nouns around the words—and picked out all the stuff you could physically restrain a guy with, like “rope,” “chain,” “strap,” etc. That yielded fifty-seven results—seven more than the two coders originally “coded.” Wouldn’t the BYU 1Ls doing Justice Lee’s coding have already seen all the “rope” and “chain” language when they looked at this stuff the first time? Shhhh—corpus linguistics is happening. Lee.AI then, for the coup de grâce, compared those fifty-seven instances against another random word—“threaten,” which comes flying in from left field—and found that “threaten” appeared next to “gun” in 100% of the results. “Restrain,” on the other hand, appeared next to “gun” in only 15% of the results—but appeared next to “other objects” like “rope” and “chain” 85% of the time. The empirical, rigorous mathematics had spoken: “Restrain” was a rope word, not a gun word—which sounds pretty physical to us. Restrain = physical. AI = bad. Eleventh Circuit = wrong. Math.
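If you want the arithmetic laid bare, here is a toy Python sketch of that collocate comparison; the snippet lists below are invented stand-ins, and the 15%, 85%, and 100% figures above come from Justice Lee’s article, not from this code:

```python
# A toy version of the collocate comparison: for each target verb, tally what
# fraction of its (hypothetical) corpus hits sit near "gun" versus near
# instruments like "rope" or "chain." The snippets are invented for illustration.
snippets = {
    "restrain": [
        "restrained him with a rope",
        "restrained her to the bed with a chain",
        "restrained the patient with a strap",
        "restrained the suspect at gunpoint with a gun",
    ],
    "threaten": [
        "threatened the teller with a gun",
        "threatened him with a gun",
    ],
}

def gun_share(lines):
    """Fraction of lines in which the word 'gun' appears."""
    return sum("gun" in line for line in lines) / len(lines)

for verb, lines in snippets.items():
    share = gun_share(lines)
    print(f"{verb}: near 'gun' in {share:.0%} of hits, "
          f"near other objects in {1 - share:.0%} of hits")
```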
Perhaps this is why ChatGPT keeps its reasoning in a black box; better to remain silent and be thought unwise than to open your mouth and leave no doubt. Corpus linguistics is, at bottom, just Google but if the only stuff on Google was the stuff the BYU faculty put on Google. Its 1L “coders” are tasked with doing the same thing an AI does: Taking a target word and running a statistical analysis of which words tend to appear around that word—such that it can guess where to situate that word in various sentences going forward. It’s just that GPT’s calculations happen in a tenth of a second across a hundred trillion words on the largest supercomputer on earth, while the BYU 1L is using the back of an envelope.
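In distributional terms, both the 1L coder and the model are doing some version of the following; this is a minimal sketch with an invented three-sentence “corpus” and an arbitrary context window, differing from the real thing mostly by a dozen orders of magnitude:

```python
from collections import Counter, defaultdict

# Toy corpus and window size are invented for illustration. The distributional
# bet is the same for the BYU coder and the language model: approximate a
# word's "meaning" by counting which words tend to appear near it.
corpus = [
    "the officer physically restrained the suspect with a rope",
    "the robber threatened the teller with a gun",
    "the guard restrained the inmate with a chain",
]
WINDOW = 3  # words of context counted on each side of the target

cooccurrence = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        cooccurrence[word].update(context)

# What tends to sit near "restrained" in this (tiny) speech community?
print(cooccurrence["restrained"].most_common(5))
```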
As for transparency? Lee.AI is only as transparent as the corpus linguist feels like making it on a given day. The paper never tells us where Justice Lee’s “distinguishing” methods came from; the closest he gets to an answer is citing generally to entire books on corpus linguistics that no one will ever buy, much less read. What’s so replicable about an approach where 90% of the corpus is indeterminate and the user has to pull a fast one to cull any meaning from the remaining 10%? And how is Lee.AI more generalizable than ChatGPT when the latter’s dataset is everything (albeit moderated by a human) and the former’s dataset is whatever Justice Lee happened to have on hand (also moderated by human coders)? In the end, we have to call a spade a spade: When it comes to making words into math, corpus linguistics is just AI but worse. And Justice Lee knows that.
So he’s coping—but so are formalists more generally. If you squint, Justice Lee debating the GPT judges over who can calculate words better starts to look like two guys playing competitive Yu-gi-oh. And getting really mad about the Yu-gi-oh rules. And telling one another they don’t understand the rules because they don’t understand the Yu-gi-oh lore. They’ve been larping in the Yu-gi-oh universe for so long—going to anime conventions, dressing up as Yu-gi-oh characters, writing volumes of Yu-gi-oh fanfic—that they’ve forgotten what Yu-gi-oh really is: A thing a Japanese guy came up with one time to make some money.
It doesn’t matter what the Yu-gi-oh rules are—and, from a statebuilding perspective, it doesn’t matter what the words mean either. What matters is that you pick a meaning in advance and stick with it. What Justice Lee really wants, at bottom, is an uber-statute that defines all the words and creates a binary language for statutes to speak in—and he wants you to believe that Brigham Young University has enacted it de facto. And maybe he’s right; maybe Lee.AI is what the formalist state should adopt. We tend to think that a Large Language Model would be a more complete uber-statute that actually has the capacity to assign fixed, binary meanings to every word and replicate those meanings over time. That would be called Judge.AI and it would replace the entire legal profession.
But we can’t forget that there is a third option: That, while the Yu-gi-oh lore doesn’t matter—because it’s made up, like our statutes—the life lore does. That’s morality; that’s religion. Higher law is the prerequisite to words having any greater telos than fixture in time. The trouble is that BYU’s 5G tower of babel can’t reach that higher law, if there is one at all. Neither can OpenAI. And neither can the Eleventh Circuit.
Choose your gods carefully.
"But we can’t forget that there is a third option: That, while the Yu-gi-oh lore doesn’t matter—because it’s made up, like our statutes—the life lore does. That’s morality; that’s religion. Higher law is the prerequisite to words having any greater telos than fixture in time. The trouble is that BYU’s 5G tower of babel can’t reach that higher law, if there is one at all. Neither can OpenAI. And neither can the Eleventh Circuit."
If the statutes are 'made up,' there's nothing wrong with using equally 'made up' tools to figure out what they mean. Ergo, textualism, dictionaries, etc. The fact that dictionaries don't solve every ambiguity doesn't render them useless. Ditto for corpus linguistics.
"Choose your gods carefully."
Or maybe just stop stuffing theological mumbo jumbo into everything.
I am reminded here of Paul de Man’s admonition not to confuse the materiality of the signifier with the materiality of what it signifies: “no one in his right mind will try to grow grapes by the luminosity of the word ‘day,’ but it is very difficult not to conceive the pattern of one’s past and future existence as in accordance with temporal and spatial schemes that belong to fictional narratives and not to this world. This does not mean that fictional narratives are not part of the world and of reality; their impact upon the world may well be all too strong for comfort.” (“The Resistance to Theory,” p. 11)