DeepMind has found a simple way to make language models reason better


Logical reasoning is still a major challenge for language models. DeepMind has found that simply changing the order in which premises are presented can make reasoning tasks easier for them.

A study by Google’s AI division DeepMind shows that the order of the premises in a task has a significant impact on the logical reasoning performance of language models.

The models work best when the premises are presented in the same order in which they are needed to reach the logical conclusion. According to the researchers, this is also true for mathematical problems. The researchers make the systematically generated tests available in the R-GSM benchmark for further investigation.

In this work, we show that the premise order significantly affects LLMs’ performance on reasoning tasks, even when the premise order does not change the underlying task itself. Our comprehensive evaluation demonstrates that LLM tendencies resemble human preference w.r.t. premise order, i.e., LLMs achieve the best performance when the premise order follows the intermediate reasoning steps to solve the problem. Conversely, LLMs face difficulties when the reasoning problem requires the model to read the problem description back-and-forth, resulting in a performance drop of over 30%.

From the paper

If A is B, then B is also A

A premise is a statement or assumption that serves as the basis for an argument or action. In their study, the researchers conducted a systematic investigation of the effects of premise ordering on various AI models.



Focusing on deductive reasoning, they tested the models using tasks that required only the logical inference rule “modus ponens,” i.e., deriving a new true statement from statements already known to be true.

To ensure fair tests across all language models, the researchers filled the rules with fantasy words such as “wozz”, “riff” and “fum”. | Image: Chen et al.

Modus ponens is a form of deductive reasoning in logic. If you have the two statements “If P, then Q” and “P is true”, then you can infer that “Q is true”.
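To make the rule concrete, here is a minimal Python sketch that applies modus ponens repeatedly until nothing new can be derived. It is an illustration only, not code from the study: the function name `forward_chain` and the representation of rules as `(premise, conclusion)` pairs are assumptions, and the fantasy words are borrowed from the figure above.

```python
def forward_chain(facts, rules):
    """Apply modus ponens repeatedly until no new fact can be derived.

    facts: iterable of statements known to be true.
    rules: list of (premise, conclusion) pairs, read as
           "If premise, then conclusion" (hypothetical representation).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            # Modus ponens: from "If P, then Q" and "P", infer "Q".
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


rules = [("wozz", "riff"), ("riff", "fum")]
derived = forward_chain({"wozz"}, rules)
print(sorted(derived))  # ['fum', 'riff', 'wozz']
```

A symbolic procedure like this reaches the same result no matter how the rules are ordered, which is why the order sensitivity observed in language models is notable.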

This form of reasoning is relatively straightforward for humans, but has proven to be a major hurdle for language models. The researchers found that changing the order of the premises can reduce the accuracy of the models by more than 30 percent.

The tests were performed with GPT-3.5 Turbo, GPT-4 Turbo, PaLM 2-L, and Gemini Pro. Among the orderings that deviate from the ground-truth proof order, OpenAI’s GPT models coped best when the premises were presented in exactly the reverse order.
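The kind of reordering described here can be sketched in a few lines of Python. This is a hypothetical illustration, not the researchers' test generator: the function names, the three order labels, and the prompt wording are all assumptions. The logical content of the task stays the same in every variant; only the order of the rules in the prompt changes.

```python
import random


def reorder_premises(rules, order, seed=0):
    """Return the rules in the requested presentation order.

    rules: list of (premise, conclusion) pairs in proof order.
    order: "forward", "backward", or "shuffled" (hypothetical labels).
    """
    if order == "forward":
        return list(rules)
    if order == "backward":
        return list(reversed(rules))
    if order == "shuffled":
        shuffled = list(rules)
        # A seeded generator keeps the permutation reproducible.
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown order: {order}")


def build_prompt(facts, rules, query):
    """Render rules and facts as a plain-text task (assumed template)."""
    lines = [f"If {p}, then {q}." for p, q in rules]
    lines += [f"{fact} is true." for fact in facts]
    lines.append(f"Question: is {query} true?")
    return "\n".join(lines)


proof_order = [("wozz", "riff"), ("riff", "fum"), ("fum", "zib")]
for order in ("forward", "backward", "shuffled"):
    variant = reorder_premises(proof_order, order)
    print(f"--- {order} ---")
    print(build_prompt(["wozz"], variant, "zib"))
```

Sending each variant to a model and comparing accuracy across the orderings would reproduce the basic shape of the experiment, under the assumptions stated above.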

In general, poorer performance was also observed as the number of rules increased. Superfluous premises also confused the models.


