Increasing Scale of Large Language Models Leads to Decline in Reliability
Recent research suggests that larger language models may be less reliable than previously thought, as their accuracy on simple tasks declines despite enhanced capabilities.
Recent research has cast doubt on the assumption that larger, more powerful large language models (LLMs) are inherently more reliable. The study, led by José Hernández-Orallo at the Polytechnic University of Valencia in Spain, indicates that as LLMs grow and incorporate more human feedback, their reliability in answering simple questions diminishes.
Developers enhance the capabilities of LLMs in two primary ways: scaling up, which adds more training data and computing power, and shaping up, which fine-tunes models on human feedback. Both efforts generally aim to improve the overall performance and utility of AI systems.
Hernández-Orallo and his team assessed the performance of several well-known LLMs, including OpenAI’s series of GPT chatbots, Meta’s LLaMA models, and BLOOM, a model developed by the BigScience research collective. The scope of the study included tasks such as arithmetic problems, anagram solving, geographical queries, scientific challenges, and extracting information from disorganised lists.
The findings revealed that while LLMs showed enhanced performance on complex tasks, such as solving the anagram “yoiirtsrphaepmdhray” to produce “hyperparathyroidism,” their accuracy on straightforward questions did not improve to the same degree. For instance, the models continued to answer basic arithmetic queries such as “What do you get when you add 24427 and 7120?” incorrectly.
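Tasks of this kind are easy to verify mechanically, which is what makes the persistent errors notable. The short Python sketch below is an illustration rather than code from the study; it checks the anagram and the sum cited above deterministically.

```python
from collections import Counter

def is_anagram(scrambled: str, candidate: str) -> bool:
    """Return True if the two strings contain exactly the same letters."""
    return Counter(scrambled.lower()) == Counter(candidate.lower())

# The anagram cited in the study: scrambled form and its solution.
print(is_anagram("yoiirtsrphaepmdhray", "hyperparathyroidism"))  # True

# The arithmetic example: the exact answer is trivial to compute.
print(24427 + 7120)  # 31547
```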
An additional concern raised by the study is the models’ reluctance to admit their limitations. As the LLMs were scaled up and shaped up, their tendency to avoid answering questions, presumably those they were not confident about, decreased. The result was a higher incidence of incorrect answers rather than straightforward admissions of uncertainty or lack of knowledge.
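This trade-off can be illustrated with a toy tally. The sketch below uses entirely hypothetical response counts, not data from the study, to show how a model that avoids fewer questions, without a matching gain in accuracy, ends up producing more incorrect answers.

```python
from collections import Counter

# Hypothetical outcome labels for ten simple questions per model; "avoidant"
# marks replies such as "I don't know" or a refusal to answer.
outcomes = {
    "smaller model":   ["correct"] * 5 + ["avoidant"] * 4 + ["incorrect"] * 1,
    "scaled-up model": ["correct"] * 5 + ["avoidant"] * 1 + ["incorrect"] * 4,
}

for model, labels in outcomes.items():
    counts = Counter(labels)
    total = len(labels)
    summary = ", ".join(
        f"{k}: {counts[k] / total:.0%}" for k in ("correct", "incorrect", "avoidant")
    )
    print(f"{model}: {summary}")
```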
Hernández-Orallo suggested that the problem lies in the perceived omniscience of LLMs: users place undue trust in these systems, partly because of the way their developers present them. “We rely on and we trust them more than we should,” he said, pointing to the gap between the models’ self-assured outputs and the actual limits of their knowledge.
Carissa Véliz from the University of Oxford added that one of the distinguishing features of human intelligence is the awareness of one’s own ignorance. “Large language models do not know the limits of their own knowledge,” she said, stressing that unlike humans, LLMs have no mechanism to gauge their own certainty about the information they provide.
OpenAI, Meta, and BigScience did not respond to requests for comment.
These findings underscore that adding more data and computational power may not straightforwardly translate into better or more trustworthy AI models. Instead, they highlight an urgent need to address the underlying mechanisms by which these models assess and report their own confidence.
Source: Noah Wire Services