Humans vs Machines: Does GenAI Outperform Legal Professionals?

An Everlaw report reveals that GenAI can perform comparably to human legal professionals – or even better.

With knowledge work, and the legal profession in particular, squarely in the sights of those building GenAI-based software applications, the quality and accuracy of the technology have been in the spotlight.

Recent research has analysed the accuracy of some of these products across a variety of legal tasks, and many of you will have undertaken your own tests and projects to determine their quality and trustworthiness.

However, there has been less focus on how AI compares to human performance. One Solomonic customer, Everlaw, a leading eDiscovery software provider, has set out to shed light on recent developments in GenAI-powered tools, especially how they perform against their human counterparts.

Findings from their report, focused on their e-disclosure platform, reveal that generative AI can perform comparably to human legal professionals – or even better in certain cases. The results are promising: they pose key challenges for legal professionals around how current processes will have to change, open new opportunities for quicker review and, of course, raise questions about fees and pricing.

Of course, for legal professionals, the accuracy of data is paramount, which is why both Solomonic and Everlaw combine the power of machine learning with human expertise to glean, clean and analyse data – extracting essential information to create a complete data picture, with humans and machines complementing each other.

A précis of Everlaw’s report is below, and you can find the full report here.


Report: GenAI-Powered Document Review Matches or Outperforms Humans on Real-World Litigation Data

In tests against real-world litigation data, generative AI tools were able to code documents as responsive or not responsive to particular prompts with accuracy that matched, and in one case well exceeded, human performance, according to a new report from Everlaw, a leading ediscovery software provider.

In a series of evaluations, Everlaw tested its Coding Suggestions feature against four real-world datasets initially reviewed by humans as part of the edisclosure phase of litigation. Across the four datasets, Coding Suggestions achieved precision and recall rates that match human document review performance generally and, in one instance where direct comparison was available, surpassed the recall of first-level reviewers by 36%.

Figure: When applied to real-world litigation data, Coding Suggestions achieved precision and recall rates comparable to human performance.

These promising results indicate that much of the first-pass review typically handled by legal professionals can be carried out with the help of generative AI tools, alleviating the burdens of manual workflows.

Results Show Strong Performance in Recall and Precision

The experiment pitted Coding Suggestions, part of the Everlaw AI Assistant suite of generative AI tools, against first-level human reviewers across four edisclosure datasets comprising a total of 7,737 documents, all part of active matters.

Comparing the suggested codes against first-pass review allowed the team to measure performance on both precision (the proportion of documents the AI coded as responsive that truly were responsive) and recall (the proportion of truly responsive documents that the AI identified).

Figure: Coding Suggestions across all datasets, compared to first-pass human review. At the yes and soft yes cutoff, Coding Suggestions achieved precision of 0.67 and recall of 0.89.
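Before turning to the aggregate numbers, a minimal sketch may help make these metrics concrete. The function and labels below are illustrative only – the document IDs and codes are invented, not drawn from Everlaw’s datasets – but they show how precision and recall are computed from AI-suggested codes and first-pass human codes.

```python
# Illustrative sketch: computing precision and recall for document coding.
# The example labels are invented; they are not Everlaw's data.

def precision_recall(suggested: dict, first_pass: dict) -> tuple:
    """Compare AI-suggested codes against first-pass human review.

    Both inputs map document IDs to True (responsive) / False (not responsive).
    """
    true_pos = sum(1 for doc, ai_yes in suggested.items()
                   if ai_yes and first_pass[doc])
    ai_flagged = sum(1 for ai_yes in suggested.values() if ai_yes)
    truly_responsive = sum(1 for yes in first_pass.values() if yes)

    precision = true_pos / ai_flagged       # of the docs the AI flagged, how many were right
    recall = true_pos / truly_responsive    # of the responsive docs, how many were found
    return precision, recall

# Toy example with three documents:
suggested = {"doc1": True, "doc2": True, "doc3": False}
first_pass = {"doc1": True, "doc2": False, "doc3": True}
print(precision_recall(suggested, first_pass))  # (0.5, 0.5)
```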

When aggregated across all four datasets, Coding Suggestions delivered a precision score of 0.67 and a recall score of 0.89. In practical terms, roughly two of every three documents the tool flagged as responsive were in fact responsive, and it surfaced nearly nine in ten of the truly responsive documents. These results are well within the range of human manual review performance.

Furthermore, in one dataset where second-pass review allowed direct comparison between Coding Suggestions and the first-level human reviewers, Coding Suggestions outperformed humans on recall by 36% in relative terms, achieving 0.82 compared to the reviewers’ 0.60.

Some Implications for Litigators

Everlaw’s tests of Coding Suggestions are preliminary but promising, suggesting that these tools can perform at or above the level of first-level human review. There are caveats, of course, but these initial tests show that legal professionals can reasonably rely on Coding Suggestions to help prioritise and classify documents.

As readers begin preparing for the implications of generative AI for their practices, here are a few items to consider:

  • Scale and Operationalisation: As GenAI tools become more integrated into key litigation workflows, legal organisations will need to adjust to take advantage of these technologies. In the context of document review for edisclosure, for example, iterative prompt refinement – which relies more on higher-level fee earners – may supplement or displace existing human review processes, requiring adjustments to standard workflows but delivering significant efficiencies (a sketch of such a loop follows this list).

  • Proportionality and Urgency: The efficiency and speed of generative AI tools may allow for more effective review in cases where document volumes are high but the matter in controversy or client resources may not justify large review teams. Similarly, where timing makes understanding new datasets urgent, these tools open new avenues for quickly prioritising documents.

  • Potential Shifts in Revenue Sources: Generative AI presents both challenges and opportunities to traditional legal business models. Firms should consider potential adjustments to pricing models – for example, by leveraging value-based billing to account for the greater efficiencies offered by GenAI – and the additional revenue streams that may be opened by LLM-powered tools.
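
For illustration only, here is a minimal sketch of what an iterative prompt-refinement loop might look like. The `suggest_code` helper is hypothetical – a stand-in for whatever coding endpoint a GenAI review platform exposes, not Everlaw’s actual API – and the keyword heuristic inside it exists purely so the example runs.

```python
# Hypothetical sketch of iterative prompt refinement for first-pass coding.
# Nothing here is Everlaw's API; all names and logic are illustrative only.

def suggest_code(prompt: str, document: str) -> bool:
    """Stand-in for a GenAI coding call: True means 'responsive'.

    A real system would send the prompt and document to an LLM; this toy
    version just checks whether any prompt keyword appears in the document.
    """
    return any(word.lower() in document.lower() for word in prompt.split())

def refine_prompt(prompt, sample_docs, human_codes, target_recall=0.85):
    """Iterate on a prompt until recall on a human-coded sample is acceptable.

    sample_docs maps doc IDs to text; human_codes maps doc IDs to True/False
    (responsive / not responsive) from first-pass human review.
    """
    while True:
        suggestions = {doc_id: suggest_code(prompt, text)
                       for doc_id, text in sample_docs.items()}
        hits = sum(1 for doc_id, yes in suggestions.items()
                   if yes and human_codes[doc_id])
        relevant = sum(human_codes.values())
        recall = hits / relevant if relevant else 1.0
        if recall >= target_recall:
            return prompt, recall
        # In practice a senior fee earner reviews the missed documents and
        # rewrites the prompt by hand; input() stands in for that step.
        prompt = input(f"Recall {recall:.2f} below target; revise prompt: ")
```

The point of the sketch is the shift in the division of labour: the expensive human effort moves from reading every document to judging a sample and refining the instructions, which is where the efficiencies described above come from.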

To see the full report, including a detailed breakdown of performance, download your copy here.
