PSA: ChatGPT is the trailer, not the movie
March 02, 2023
Authored by Casey Flaherty, Baretz+Brunelle partner and LexFusion co-founder
To Whom It May Concern,
You have been provided this link because someone believes it may help.
You may be new to this topic. If so, welcome! Alternatively, you may have contributed to the unhealthy preoccupation with ChatGPT. This does not mean you've said anything factually inaccurate. But you are reinforcing the unhelpful narrative that ChatGPT is the proper point of emphasis—a monomania reflected in articles like Chat GPT for Contract Drafting: AI v. Templates and tweets in the same vein.
It's fine. It happens. In fact, it happens a lot these days. Hence the post. Please do not be offended. The person who directed you here is only trying to advance the discourse. This is all rather new. We're all learning together.
The objective is to raise baseline awareness so we can engage in more productive conversations. The subsequent post is long. But the introductory synopsis is mercifully brief:
- ChatGPT is exciting. But ChatGPT is also explicitly a preview of progress with no source of truth.
- ChatGPT is an application of GPT-3.5, a transformer-based large language model ("LLM"). GPT-3.5 is a raw "foundation model" from Microsoft-backed OpenAI. Many other companies, like Google and Meta, are also investing heavily in foundation models. ChatGPT is one application of one foundation model.
- LLMs like GPT-3.5 can produce impressive results. But, in their raw form, they are not intended to be fit-to-purpose for many tasks, especially in highly specialized domains with little tolerance for error—i.e., there is no reason to expect raw foundation models to generate high-quality legal opinions or contracts without complementary efforts to optimize for those outputs.
- In particular, with no source of truth, LLMs are prone to hallucination, confabulation, stochastic parroting, and generally making shit up, among many other pitfalls. This can be remedied by combining LLMs with sources of truth—e.g., a caselaw database or template library—through methods like retrieval-augmented generation.
- We already have real-world examples of LLMs being enriched through (i) domain-specific data sets, (ii) tuning, including reinforcement learning from human feedback, and (iii) integration into complementary systems that introduce sources of truth to successfully augment human expertise in areas like law, medicine, and software development. A myopic focus on ChatGPT ignores these examples and arbitrarily limits the conversation in unproductive ways.
- We also have real-world examples of failed attempts to leverage LLMs, as well as different applications of the same LLM to the same problem set with material differences in performance level. A powerful LLM is not sufficient. The complementary data, tuning, and tech are often necessary. Be wary of magic premised on the mere presence of an LLM.
- Despite the understandable focus on "Generative AI," the usefulness of LLMs is not limited to generation. LLM-powered applications can perform data extraction, collation, structuring, search, translation, etc. These lay the groundwork for generation but need not be generative to deliver value.
- LLMs will continue to advance at a rapid pace, and there will be many attempts at applying LLMs to legal. Some applications will be bad. Some applications will be good. Cutting through the hype and properly assessing these applications will require work.
- It remains TBD (and fascinating and, for some, frightening) how and when LLMs will prove most applicable to augmenting legal work. The road to product is long, and we are at the front end of an accelerating growth curve.
- No one knows what will happen. We're all making bets. Abstention is a bet.
- Be prepared for unrealized hype, unforced errors, excruciating debates, exciting experimentation, and (the author is betting) real progress. Things will get weird.
- "AI will replace all lawyers" is almost certainly an embarrassingly bad take for the foreseeable future. But "AI will not displace any lawyers because of what ChatGPT currently does poorly" is undoubtedly a bad take today.
The truly TLDR summary: too many lawyers are worried about ChatGPT when they should be excited about CoCounsel. Those are the bones. If you need more, the meat follows.
The author is an LLM bull. But being an LLM bear is totally fine. I take strong positions on the potential application of LLMs to legal service delivery, including:
- This will be done BY you or this will be done TO you. This is happening. I consider these seismic advances to be more akin to email and mobile than some narrow progress within legal tech that lawyers can choose to ignore. Shifts in the general operating environment will make incorporation of this rapidly maturing technology a necessity and require changing many of the ways we currently work.
- In 2023, AI will be capable of producing a first draft of a legal opinion or contract superior to the output of 90% of junior lawyers. "Junior" is responsible for some heavy lifting in that statement. And capacity is not the same as fully productized and widely available—the road to product is long. Yet technical thresholds will be surpassed in ways that should force us to fundamentally rethink workflows, staffing, and training.
A primary source of my confidence is previews from co-conspirators at law departments, law firms, and legaltech companies working on LLM use cases that are not yet public. Some of these friends mock me for not having the courage of my convictions. Compared to them, my predictions are downright conservative.
One senior in-house friend said to me this week, "People will start scrambling. Their place in the value chain is about to be markedly less secure." Another put a similar sentiment in more colorful terms, "The boat is leaving the dock. You can be on it, or you can swim."
I am a relative LLM bull. There are, however, legitimate reasons to be an LLM bear. Many peers I respect default to doubt on this subject. You will find smart, credentialed people on both sides of the debate. This post is not an attempt at persuasion on the inevitability of LLMs, let alone the robot lawyer event horizon.
The LLM doubters may turn out to be right. No one knows. So we place bets. Indeed, after including the "90% of junior lawyers" statement above in the LexFusion Year in Review piece on Legal Evolution, I ended up in a friendly wager with the great Alex Hamilton.

Alex set the parameters of the bet: AI displacing 5% of what lawyers do in contracting within 5 years. I would have taken 3 years and 30% displacement to make it more interesting.
Yet, as bullish as I am (and as much crow as I will eat if LLMs turn out to be Watson 2.0), the prediction I have highest confidence in is rather bearish:
- There will be a flood of garbage products claiming to deliver AI magic. This is a near certainty. Regardless of how useful well-crafted applications powered by LLMs may prove, there will be many applications that are far from well-crafted and are merely attempting to ride the hype train.
We're already seeing this with ChatGPT.
ChatGPT is awesome. If we assess ChatGPT on its own merits, ChatGPT absolutely delivers.
ChatGPT is a "preview of progress." So explained Sam Altman, CEO of OpenAI, the Microsoft-backed startup behind ChatGPT.
The November 30, 2022 release notes for ChatGPT are not cryptic:
- ChatGPT was being released as a "research preview"
- ChatGPT has "no source of truth"
- ChatGPT therefore "sometimes writes plausible-sounding but incorrect or nonsensical answers"
- ChatGPT "is sensitive to tweaks to the input phrasing"
- ChatGPT has issues that "arise from biases in the training data"
- ChatGPT tends to "guess what the user intended" rather than "ask clarifying questions"
- ChatGPT "will sometimes respond to harmful instructions or exhibit biased behavior"
ChatGPT was novel because of the Chat aspect. ChatGPT offers a conversational user interface layered on top of one of OpenAI's foundation models, GPT-3.5. ChatGPT proved an immediate sensation, reaching one million users in five days and one hundred million users within two months.

The hype cycle commenced. The interest drove hype. The hype drove interest. Mass tinkering uncovered all manner of tantalizing use cases. BigLaw partners who have been practicing for 40 years were entering prompts and rightly finding some (not all) "results were nothing short of amazing—especially with the speed. Mere seconds which even the most knowledgeable expert could not hope to match."
Posts, articles, and news coverage on ChatGPT approached ubiquity. Suddenly, everywhere you looked, someone like Larry Summers was signal-boosting themselves with pronouncements like "ChatGPT is a development on par with the printing press, electricity and even the wheel and fire."
Invariably, the backlash followed. Specifically, many legal denizens became avid hate prompters. Hate prompting entails a domain expert inputting a ChatGPT query and then publicly bashing the output. A benign example is a beloved friend texting me ChatGPT's list of notable female CEOs, which included yours truly (never been a CEO; never identified as a woman), and concluding the "results are laughably bad."

Those results were laughably bad. Hate prompts have produced similarly ludicrous results when tasking ChatGPT with all manner of legal work.
But, again, ChatGPT was not built to do any of that well, let alone perfectly. ChatGPT merits experimentation, and people have every reason to explore the possibilities it might presage for incorporating foundation models into fit-to-purpose products. But we should also learn enough about what ChatGPT is, and is not, to avoid being shocked that a raw model released as a research preview with no source of truth sometimes produces plausible-sounding but incorrect or nonsensical answers, especially in highly specialized domains. They told us that on Day One.
The hate prompters will say they are merely responding to the hypists. I submit both are guilty of injecting too much noise into a vital conversation by artificially narrowing the debate to what ChatGPT can and cannot do today. I attempted to address this noise in a two-part series for Legaltech News entitled "The Focus on ChatGPT Is Missing the Forest for the Tree" (Part 1, Part 2). This post is a recitation and extension of that series.
In Part 1, I suggested those fixated on ChatGPT—rather than appropriately treating it as a preview of progress—are re-enacting the eponymous Zoolander's tantrum upon being presented with a scale model for the "Derek Zoolander Center for Kids Who Can't Read Good and Who Wanna Learn to Do Other Stuff Good Too." Lacking any capacity for abstraction, the face-and-body boy confuses the miniature preview for the thing itself, summarily rejecting it: "What is this? A center for ants? ... How can we be expected to teach children to learn how to read ... if they can't even fit inside the building?"

Invoking Zoolander failed to elevate the discourse. The avalanche continued.
About a week after my series, the usually informed and informative Jack Shepherd published Chat GPT for Contract Drafting: AI v. Templates, which treats ChatGPT as a stand-in for AI and then AI as somehow incompatible with templates (it is all there in the title, but you can read the piece for yourself). This was followed a few days later by So How Good is ChatGPT at Drafting Contracts?, which, to be fair, offers all the correct caveats in its conclusion. The cacophony resulted in webinars on Legal Considerations for ChatGPT in Law Firms and articles like As More Law Firms Leverage ChatGPT, Few Have Internal Policies Regarding Its Use—which were necessary but also served to reinforce the monomania.
When Brookings is publishing acontextual tracts like Building Guardrails for ChatGPT, we should forgive casual observers for thinking ChatGPT is the correct focal point. But once you stop being a casual observer and choose to engage in the discourse, you assume a duty to advance it.
So here we are, as I live my motto: if you find yourself screaming into the void, just scream louder (and with a much higher word count).
I am neither a female CEO nor an LLM expert. Since I have the audacity to label Jack and the hate prompters as Zoolanders, I must confess I am more Hansel than JP Prewitt. My technical knowledge does not extend much beyond "the files are IN the computer." (h/t Stephanie Wilkins)
I will not embarrass myself trying to explain LLMs. I commend this article from the famed Stephen Wolfram, as well as this video from the soon-to-be-famed Jacob Beckerman, founder of Macro who did his thesis work in natural language processing.
I will not pretend to have a comprehensive grasp on the players in the space. I suggest this article from Andreessen Horowitz, which, as it happens, just led a $9.3M seed round in Macro (bias/brag alert: Macro is a LexFusion member).

Indeed, while I have endeavored to slowly educate myself, I am not the least bit qualified to educate others on tokens, alignment, retrieval-augmented generation (RAG), DocPrompting, reinforcement learning from human feedback (RLHF), edge models, zero-shot reasoning, multimodal chain-of-thought reasoning, fine-tuning, prompt tuning, prompt engineering, etc.
My super basic take is LLMs recently passed a threshold with language that computers long ago crossed with numbers. This is the convergence of decades of cumulative advances in AI architecture, computing power, and training-data availability. This time is different because LLMs have demonstrated unprecedented flexibility, including emergent abilities.
GPT is the abbreviation for "generative pre-trained transformer." These are foundation models because the pre-training creates the conditions for the models to be tailored to different domains and applications (WARNING: frequently, they still need to be tailored). Seemingly daily, there is yet another paper on a more efficient way to tune models to tasks. This opens up a world of possibilities we're just starting to get a taste of with the likes of ChatGPT, AllSearch, CoCounsel, Copilot, Med-PaLM, Midjourney, and CICERO, among many others.
Like I said, it is a basic take, and you should look elsewhere for deep understanding.
But Richard Susskind did not need a schematic understanding of SMTP, IMAP, POP, or MIME in 1996 to predict email would become the dominant form of communication between lawyers and clients. He merely needed to grasp the jobs email did and recognize that the rise of webmail clients built atop WYSIWYG editors would cross the usability tipping point Ray Tomlinson envisioned after he invented email in 1971.
Tomlinson worked in relative obscurity for decades. At the same time Tomlinson's contribution was finally receiving its just due, Susskind was being labeled "dangerous" and "possibly insane." Some lawyers called for Susskind to be banned from public speaking because he was "bringing the profession into disrepute" due to his willful ignorance of how email undermined client confidentiality. Less than a decade later, a lawyer was laughed out of court for claiming failure to check his email was "excusable neglect."
Susskind did not persuade lawyers to adopt email. Clients did. The world changed. Resistance was futile. But futility took time to become apparent and was the subject of furious, if mostly nonsensical, debate.
With ChatGPT, what was relatively obscure has become an extremely public conversation that dwarfs any previous AI hype cycle (even Watson). My thesis is you do not need a technical background to understand that limiting the terms of the attendant debate to ChatGPT is a disservice to the discourse.
LLM-powered applications are effective in legal. Casetext began experimenting with LLMs years ago to augment the editorial process some of us still call "Shepardizing" despite that trademark belonging to our friends at LexisNexis (another bias/brag alert: Casetext is also a LexFusion member).

The challenge with automating the analysis of subsequent treatment of a judicial decision is the linguistically nuanced ways a court might overturn, question, or reaffirm. Core to the appeal of LLMs is the capacity to handle linguistic nuance.
In the beginning, the LLMs did not work too well on legal text. Eventually, after being trained on massive amounts of legal language and refined through reinforcement learning from human feedback, the models proved so adept that Casetext extended the technology to search.
Parallel Search represents the first true "conceptual search" for caselaw. While we've had "natural language" search for decades, it is, at best, the machine translating keywords into Boolean searches with some additional fuzzy logic and common synonyms. Conceptual search is different in kind, not just degree—identifying conceptual congruence despite no common keywords.
One of the many jokes I have stolen from Casetext co-founder Pablo Arredondo is that Parallel Search could have been called "partner search" because it is the realization of a dream/end of a nightmare. Pablo and I are both former litigators. Every litigation associate is intimately familiar with the terror of a partner exclaiming some variant of "I am sure there is a case that says X."

When searching for "X" does not produce the case that definitely exists, the associate begins the sometimes endless pursuit of typing in potential cognates of X. The associate is attempting to break out of the keyword prison and translate statement X into concepts. In grossly simplified terms, this is what the transformer-based neural networks underpinning Parallel Search have already accomplished: indexing the text of the entire common law in 768-dimensional vector space to surface similarities based on meaning rather than word selection (which is not to say the machine "understands" meaning in the conscious sense, only that it is able to identify parallel word clusters based on meaning instead of verbiage).
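To make the keyword-versus-concept distinction concrete, below is a minimal sketch of embedding-based search using the open-source sentence-transformers library. The model choice, the toy corpus, and the query are my own illustrative assumptions; this shows the general technique, not Casetext's actual implementation. Note that the query shares no meaningful keywords with the passage it retrieves:

```python
# A minimal sketch of embedding-based "conceptual search."
# An illustration of the general technique only, not Parallel Search itself.
from sentence_transformers import SentenceTransformer, util

# An open-source encoder that maps text into 768-dimensional vectors.
model = SentenceTransformer("all-mpnet-base-v2")

# Toy corpus standing in for a real caselaw database.
corpus = [
    "The court held the contract unenforceable for lack of consideration.",
    "Summary judgment was granted; no genuine dispute of material fact existed.",
    "The defendant's motion to suppress the evidence was denied.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# No keyword overlap with the first passage, yet it should rank first
# because the vectors capture meaning rather than word selection.
query = "agreement void since nothing of value was exchanged"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
best = hits[0][0]
print(corpus[best["corpus_id"]], round(best["score"], 3))
```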
Parallel Search is excellent. But judicial decisions are not the only document type containing legal language. Casetext built AllSearch to extend the technology to any corpus of documents—e.g., contracts, brief banks, deposition transcripts, emails, prior art, your document management system, etc.
Who objects to better search for caselaw, contracts, or a DMS? Not every application needs to be totally transformative. Most won't be. Still, better is better. In some instances, the introduction and proper application of LLMs will simply result in a superior version of that which we already do.
Further, while Casetext has used GPT-3 to help rank judge-generated text, Parallel Search and AllSearch were developed entirely in-house. They are also not generative in nature—they only return actually existing caselaw or documents. All three points are critical to thinking through LLM-powered applications in legal:
- GPT-3.5, the LLM underpinning ChatGPT, is only one LLM. LLMs can incorporate or be enriched through domain-specific data. There will be horses for courses.
- LLMs are compatible with sources of truth, like the common law or a document repository. Just because ChatGPT does not have a source of truth does not mean all LLM-powered applications will operate without one.
- Asking ChatGPT to generate items from scratch has been the most prominent form of experimentation. But LLMs have many use cases beyond blank-page generation, including search, synthesis, summarization, translation, collation, categorization, and annotation.
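To illustrate the non-generative point with something tangible, here is a toy sketch of LLM-assisted categorization: classifying a contract clause against a fixed label set. It assumes OpenAI's chat completions API and gpt-3.5-turbo as the model; the labels, prompt, and clause are all my own inventions, not any vendor's method:

```python
# Toy sketch of a non-generative LLM use case: categorizing a contract
# clause against a fixed label set. All details here are illustrative
# assumptions, not any particular vendor's method.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

LABELS = ["indemnification", "limitation of liability",
          "termination", "confidentiality"]

def classify_clause(clause: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # favor consistent labels over creative prose
        messages=[
            {"role": "system",
             "content": "Classify the clause as exactly one of: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            {"role": "user", "content": clause},
        ],
    )
    return resp["choices"][0]["message"]["content"].strip()

print(classify_clause(
    "Neither party shall be liable for indirect, incidental, or "
    "consequential damages arising out of this agreement."
))  # expected output: limitation of liability
```

The model generates text, technically, but the job being done is classification: the output is a label, not a draft.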
Parallel Search and AllSearch are not generative. But CoCounsel is.
Properly calibrated LLM-powered applications are effective in producing legal content. Today, Casetext announced CoCounsel. From the press release:
Today, legal AI company Casetext unveiled CoCounsel, the first AI legal assistant. CoCounsel leverages the latest, most advanced large language model from OpenAI, which Casetext has customized for legal practice, to expertly perform the skills most valuable to legal professionals...
CoCounsel introduces a groundbreaking way of interacting with legal technology. For the first time, lawyers can reliably delegate substantive, complex work to an AI assistant—just as they would to a legal professional—and trust the results...
CoCounsel can perform substantive tasks such as legal research, document review, and contract analysis more quickly and accurately than ever before possible. Most importantly, CoCounsel produces results lawyers can rely on for professional use and keeps customers'—and their clients'—data private and secure.
To tailor general AI technology for the demands of legal practice, Casetext established a robust trust and reliability program managed by a dedicated team of AI engineers and experienced litigation and transactional attorneys. Casetext's Trust Team, which has run every legal skill on the platform through thousands of internal tests, has spent nearly 4,000 hours training and fine-tuning CoCounsel's output based on over 30,000 legal questions. Then, all CoCounsel applications were used extensively by a group of beta testers composed of over four hundred attorneys from elite boutique and global law firms, in-house legal departments, and legal aid organizations, before being deployed. These lawyers and legal professionals have already used CoCounsel more than 50,000 times in their day-to-day work.

In short, the reliability concerns about ChatGPT are both legitimate and addressable. As I understand it (and, if I am wrong, Pablo will correct me once he's done making jokes on MSNBC), CoCounsel incorporates a specialized variant of retrieval-augmented generation (RAG).
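For the curious, here is a minimal sketch of the generic RAG pattern: retrieve relevant passages from a trusted corpus, then instruct the model to answer only from those passages. The toy corpus (note the fictional case names), the naive keyword retriever, and the prompt are all illustrative assumptions; this is the general technique, emphatically not CoCounsel's actual pipeline:

```python
# Minimal sketch of the generic retrieval-augmented generation (RAG)
# pattern. Corpus, retriever, and prompt are illustrative stand-ins;
# this is NOT CoCounsel's pipeline.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Toy "source of truth" with fictional cases, standing in for a real index.
CORPUS = [
    "Hypothetical v. Example (fictional): an email exchange can satisfy "
    "the statute of frauds writing requirement.",
    "Illustrative v. Imaginary (fictional): oral contracts for the sale "
    "of land are unenforceable.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Naive retriever: rank passages by word overlap with the question.
    # A production system would use vector search over an indexed corpus.
    q = set(question.lower().split())
    return sorted(CORPUS, key=lambda p: -len(q & set(p.lower().split())))[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # Grounding step: the model answers from the retrieved excerpts, its
    # source of truth, rather than improvising from training data alone.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer only from the excerpts. "
                        "If they are insufficient, say so."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]

print(answer("Can an email satisfy the statute of frauds?"))
```

The design point is simple: the generation step never has to invent the law, because the retrieval step hands it vetted text to work from.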