Vishal Narayan

Apple study finds LLMs fall short of basic human reasoning on maths questions



A new study by AI researchers associated with Apple has found that state-of-the-art LLMs, including those from Meta and OpenAI, still lack basic reasoning skills when it comes to understanding maths queries.


The group proposed a benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models. 


Several of their experiments built on GSM8K -- a benchmark widely used to assess models' mathematical reasoning on grade-school-level questions -- revealed that slight changes in the wording of queries can produce significantly different answers, raising questions about the reliability of the models.


"Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark," researchers said.


They hypothesized that the decline in performance was due to current LLMs' inability to use genuine logical reasoning. "Instead, they attempt to replicate the reasoning steps observed in their training data."


The researchers found the LLMs' performance declined by as much as 65 per cent when they added a single clause that appeared relevant to the question but was not.


"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. 


"Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases," it said.


One example that illustrated the issue was a maths problem requiring a genuine understanding of the question.


The team called the task it developed "GSM-NoOp" -- short for "no operation" -- because the seemingly relevant information added to the question had no operational role in the answer.


The query was: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."


Researchers then added another clause: "However, on Sunday, 5 of these kiwis were smaller than average."


The models were then asked how many kiwis Oliver picked in all.


"However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes," researchers observed. 


"We found no evidence of formal reasoning in language models," they then concluded. 


The behaviour of LLMs "is better explained by sophisticated pattern matching", which the study found to be "so fragile, in fact, that [simply] changing names can alter results."


Building human-like cognitive abilities in LLMs remains a critical challenge for the field, they said. 


Image Source: Unsplash

