Making it easier to verify an AI model’s responses | MIT News


Despite their impressive capabilities, large language models are far from perfect. These artificial intelligence models sometimes “hallucinate” by generating incorrect or unsupported information in response to a query.

Because of this hallucination problem, an LLM’s responses are often verified by human fact-checkers, especially if a model is deployed in a high-stakes environment such as healthcare or finance. However, validation processes typically require people to read through lengthy documents cited by the model, a task so cumbersome and error-prone that it may deter some users from deploying generative AI models in the first place.

To help human validators, MIT researchers have developed an easy-to-use system that allows people to verify an LLM’s answers much faster. Using this tool, called SymGen, an LLM generates responses with citations that point directly to the location in a source document, such as a particular cell in a database.

Users can hover over highlighted portions of the text response to view data the model used to generate that specific word or phrase. At the same time, the unmarked portions show users which sentences need extra attention to check and verify.

“We give people the ability to selectively focus on parts of the text that they should be more concerned about. Ultimately, SymGen can give people more confidence in a model’s responses because they can easily take a closer look to make sure the information is verified,” said Shannon Shen, a graduate student in electrical engineering and computer science and co-lead author of a paper about SymGen.

In a user study, Shen and his collaborators found that SymGen sped up verification time by about 20 percent compared to manual procedures. By making it faster and easier for people to validate a model’s outputs, SymGen can help people identify errors in LLMs deployed in a variety of real-world situations, from generating clinical notes to summarizing financial market reports.

Shen is joined on the paper by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; EECS graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, chairman of the Good Data Initiative; and senior authors David Sontag, professor of EECS, member of the MIT Jameel Clinic, and leader of the Clinical Machine Learning Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Yoon Kim, assistant professor of EECS and member of CSAIL. The research was recently presented at the Conference on Language Modeling.

Symbolic references

To facilitate validation, many LLMs are designed to generate citations that refer to external documents, along with their language-based answers, for users to check. However, these verification systems are usually designed as an afterthought, without taking into account the effort it takes for people to sort through countless citations, Shen says.

“Generative AI aims to reduce the time it takes for the user to complete a task. If you have to spend hours reading through all these documents to make sure the model is saying something reasonable, then it is less helpful to have generations in practice,” says Shen.

The researchers approached the validation problem from the perspective of the people who will do the work.

A SymGen user first provides the LLM with data to reference in its response, such as a table of statistics from a basketball game. Then, instead of immediately asking the model to perform a task, such as generating a game summary from that data, the researchers perform an intermediate step. They prompt the model to generate its response in a symbolic form.

With this prompt, each time the model wants to cite words in its answer, it must write the specific cell from the data table that contains the information it is referencing. For example, if the model wants to cite the phrase “Portland Trailblazers” in its response, it replaces that text with the name of the cell in the data table that contains those words.
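
To make this concrete, a symbolic response might look something like the sketch below. The placeholder syntax, field names, and scores are assumptions made for illustration; the article does not specify SymGen’s exact notation.

```python
# A minimal sketch of the intermediate symbolic form, assuming a
# hypothetical placeholder syntax {game_stats[column]}. The table
# layout, field names, and game data are invented for illustration.
game_stats = {
    "team_home": "Portland Trailblazers",
    "team_away": "Utah Jazz",
    "score_home": "110",
    "score_away": "102",
}

# Instead of writing "Portland Trailblazers" directly, the model emits
# a reference to the table cell that contains those words:
symbolic_response = (
    "The {game_stats[team_home]} beat the {game_stats[team_away]}, "
    "{game_stats[score_home]} to {game_stats[score_away]}."
)
```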

“Because we have this intermediate step where the text is represented in a symbolic format, we can have very fine-grained references. We can say for every single text string in the output that this is exactly the place in the data it corresponds to,” says Torroba Hennigen.

SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s response.

“This way we know it is a verbatim copy, so we know there will be no errors in the part of the text that corresponds to the actual data variable,” Shen adds.
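
Under the same assumed placeholder syntax, the rule-based resolution step could be as simple as a regular-expression substitution that copies each referenced cell verbatim. This is a minimal sketch, not the paper’s implementation:

```python
import re

# Minimal sketch of a rule-based resolver for the assumed
# {game_stats[column]} placeholder syntax; not SymGen's actual code.
game_stats = {
    "team_home": "Portland Trailblazers",
    "score_home": "110",
}

symbolic_response = (
    "The {game_stats[team_home]} scored {game_stats[score_home]} points."
)

def resolve(symbolic: str, table: dict) -> str:
    # Copy each referenced cell verbatim into the response, so every
    # substituted span matches the source data exactly.
    pattern = re.compile(r"\{game_stats\[(\w+)\]\}")
    return pattern.sub(lambda m: table[m.group(1)], symbolic)

print(resolve(symbolic_response, game_stats))
# -> The Portland Trailblazers scored 110 points.
```

Because each substituted span is copied straight from the table, it can also be highlighted and linked back to its source cell, which is what makes the hover-to-verify interface described above possible.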

Streamlining validation

The model is able to generate symbolic responses because of how it is trained. Large language models are fed massive amounts of data from the internet, and some of that data is recorded in a “placeholder format,” where codes stand in for actual values.

When SymGen prompts the model to generate a symbolic response, it leverages this same structure.

“We design the prompt in a specific way to take advantage of the capabilities of the LLM,” Shen adds.
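
For instance, such a prompt might pair the data table with an instruction describing the placeholder convention. This skeleton is hypothetical; the article does not reproduce the paper’s actual prompts.

```python
# Hypothetical prompt skeleton for eliciting symbolic output; the exact
# instructions and examples used in the paper are not reproduced here.
prompt = """You are given a data table named game_stats.
Write a game summary. Whenever the summary uses a value from the table,
do not write the value itself; instead write a placeholder of the form
{game_stats[column]} naming the cell it comes from.

Table:
team_home: Portland Trailblazers
team_away: Utah Jazz
score_home: 110
score_away: 102

Summary:"""
```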

In the user study, most participants said that SymGen made it easier to verify LLM-generated text. They were able to validate the model’s answers about 20 percent faster than with standard manual methods.

However, SymGen is limited by the quality of the source data: the LLM could cite an incorrect variable, and a human verifier could be none the wiser.

In addition, the user must have source data in a structured format, such as a table, to enter into SymGen; currently, the system works only with tabular data.

In the future, the researchers will improve SymGen so that it can process arbitrary text and other forms of data. With that capability, it could, for example, help validate parts of AI-generated summaries of legal documents. They also plan to test SymGen with physicians to explore how it can identify errors in AI-generated clinical summaries.

This work is funded in part by Liberty Mutual and the MIT Quest for Intelligence Initiative.


