Recent findings from Apple researchers have cast doubt on the mathematical prowess of large language models (LLMs), challenging the idea that artificial intelligence (AI) is on the verge of human-like reasoning.
In a test of 20 state-of-the-art LLMs, performance on elementary school math problems plummeted when questions were slightly modified or irrelevant information was added, Apple found. Accuracy dropped by as much as 65.7%, revealing a surprising vulnerability in AI systems when faced with tasks that require robust logical reasoning.
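To illustrate the kind of perturbation involved (this example is constructed for illustration and is not taken from the study itself), consider a problem like: “Liam buys 12 pencils on Monday and 15 on Tuesday. How many pencils does he buy?” The answer is simply 12 + 15 = 27. Appending an irrelevant clause such as “three of the pencils are blue” changes nothing mathematically, yet the researchers report that models often treat such details as operations to perform, for instance subtracting the three and answering 24.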
This weakness could have far-reaching consequences for industries that rely on AI for complex decision-making. Financial institutions in particular may need to reconsider their use of AI for tasks that involve complex calculations or risk assessments.
At the heart of this debate lies the concept of artificial general intelligence (AGI) – the holy grail of AI that could match or surpass human intelligence in various tasks. While some technology leaders are predicting the imminent arrival of AGI, these findings suggest that we may be further away from that goal than previously thought.
“Any real-world application that requires reasoning of the kind that can be definitively verified (or not) is basically impossible for an LLM to get right with any degree of consistency,” Selmer Bringsjord, a professor at Rensselaer Polytechnic Institute, told PYMNTS.
Bringsjord draws a clear line between AI and traditional computing: “What a calculator can do on your smartphone is something an LLM cannot do – because if someone really wants to be sure that the result of a calculation done on an iPhone is correct, it would ultimately and always be possible for Apple to verify or falsify that result.”
Limitations and understanding
Not all experts view the limitations exposed in the Apple study as equally problematic. “The limitations outlined in this study are likely to have minimal impact on real-world applications of LLMs. This is because most real-world applications of LLMs do not require advanced mathematical reasoning,” Aravind Chandramouli, head of AI at data science company Tredence, told PYMNTS.
Potential solutions exist, such as fine-tuning pre-trained models for specific domains or prompt-engineering them toward step-by-step reasoning, as sketched below. Specialized models such as WizardMath and MathGPT, designed for mathematical tasks, could improve AI’s capabilities in areas that require rigorous logical thinking.
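To make the prompt-engineering option concrete, here is a minimal sketch of few-shot prompting for arithmetic word problems. Nothing here is drawn from the Apple study; the `ask_model` stub is a hypothetical placeholder for whatever LLM API an application actually uses:

```python
# Minimal sketch of few-shot prompting for arithmetic word problems.
# The worked examples nudge a general-purpose model toward explicit,
# checkable steps; ask_model is a hypothetical stand-in for an LLM call.

FEW_SHOT_EXAMPLES = """\
Q: A farmer has 12 apples and buys 7 more. How many apples does he have?
A: 12 + 7 = 19. The answer is 19.

Q: Sara reads 15 pages on Monday and twice as many on Tuesday. How many pages in total?
A: Monday: 15. Tuesday: 2 * 15 = 30. Total: 15 + 30 = 45. The answer is 45.
"""

def build_prompt(question: str) -> str:
    """Prepend worked examples so the model imitates their step-by-step format."""
    return f"{FEW_SHOT_EXAMPLES}\nQ: {question}\nA:"

def ask_model(prompt: str) -> str:
    raise NotImplementedError("substitute a real LLM client call here")

if __name__ == "__main__":
    print(build_prompt("Liam buys 12 pencils on Monday and 15 on Tuesday. "
                       "How many pencils does he buy?"))
```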
The debate goes beyond math and focuses on a fundamental question: do these AIs really understand anything? This issue is central to discussions about AGI and machine cognition.
“LLMs have no idea what they are doing. They are simply seeking sub-linguistic patterns in a prompt that are statistically analogous to patterns in their stored data,” Bringsjord said.
Chandramouli said: “While their coherent answers may give the illusion of understanding, the ability to map statistical correlations in data does not mean they truly understand the tasks they are performing.” This insight highlights the challenge of distinguishing between advanced pattern recognition and true understanding in AI systems.
Eric Bravick, CEO of The Lifted Initiative, acknowledged the current limitations but pointed to potential solutions. “Large language models (LLMs) are not equipped to perform mathematical calculations. They don’t understand the math,” he said. However, he suggested that coupling LLMs with specialized AI subsystems could lead to more accurate results.
“When combined with specialized AI subsystems trained in math, they can obtain accurate answers rather than generating them based on their statistical models trained for language production,” Bravick said. Emerging technologies such as retrieval-augmented generation (RAG) systems and multimodal AI could address current limitations in AI reasoning.
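None of the companies quoted here have published this exact design, so what follows is only a minimal sketch of the coupling Bravick describes. A hypothetical `llm_extract_expression` stand-in plays the language model’s role of turning a word problem into a bare expression, while a small deterministic evaluator does the actual arithmetic:

```python
import ast
import operator

# Deterministic arithmetic subsystem: evaluates a plain arithmetic
# expression exactly, instead of letting a language model predict it.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression: str) -> float:
    """Safely evaluate an expression such as '44 + 58 + 2 * 44'."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("only basic arithmetic is allowed")
    return walk(ast.parse(expression, mode="eval").body)

def llm_extract_expression(question: str) -> str:
    # Stand-in for a real model call; hard-coded for demonstration.
    return "44 + 58 + 2 * 44"

def solve(question: str) -> float:
    # Hypothetical pipeline: the LLM translates the word problem into
    # a bare expression; the calculator subsystem does the math.
    return calculate(llm_extract_expression(question))

if __name__ == "__main__":
    print(solve("How many kiwis in total?"))  # -> 190
```

Because the final numbers come from ordinary code rather than token-by-token prediction, the result is checkable in exactly the sense Bringsjord describes.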
An evolving field
The field of AI continues to develop rapidly, with LLMs demonstrating remarkable capabilities in language processing and generation. However, their struggles with logical reasoning and mathematical understanding show that much work is still needed to achieve AGI.
Careful evaluation and testing of AI systems remains critical, especially for high-stakes applications that require reliable reasoning. Researchers and developers can find promising avenues in approaches such as fine-tuning, specialized models, and multimodal AI systems as they work to bridge the gap between current AI capabilities and the envisioned robust, general intelligence.