Task contamination could be fooling us, researchers say


A new study suggests that much of the improvement in the performance of large language models in recent years may be due to task contamination.

In a new paper, researchers at the University of California, Santa Cruz, show the possible effects of task contamination on the performance of large language models such as GPT-3 in zero-shot and few-shot tasks.

Task contamination refers to a phenomenon in which an AI model is exposed to examples or data during training that are later used as part of test or evaluation tasks. This can skew the results of zero-shot or few-shot evaluations because the model is not truly “blind” to the tasks – it has already seen similar or identical tasks during training.

In practice, the model may then perform better on certain tasks not because it can learn from few or no examples (as true zero-shot or few-shot learning would require), but because it has already been exposed to similar examples during training. Task contamination thus calls into question the model’s ability to handle new, unfamiliar tasks and may lead to an overestimation of its performance.



Study reveals task contamination in language models

The team looked at different variants of the GPT-3 model series, including GPT-3.5-Turbo, as well as several open language models such as Meta’s Llama, Bloom, Alpaca, and Vicuna.

The researchers found that performance on datasets published before the training data was collected was significantly better than on more recent datasets, which strongly suggests task contamination.
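The before/after comparison can be illustrated with a minimal sketch. The cutoff date, dataset names, and accuracy scores below are invented for illustration and are not taken from the paper:

```python
# Hedged sketch: comparing benchmark scores on datasets released before vs.
# after a model's (assumed) training-data cutoff. All values are made up.
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # hypothetical cutoff, not from the study

results = [
    {"dataset": "A", "released": date(2019, 5, 1), "accuracy": 0.81},
    {"dataset": "B", "released": date(2020, 1, 1), "accuracy": 0.78},
    {"dataset": "C", "released": date(2022, 3, 1), "accuracy": 0.55},
    {"dataset": "D", "released": date(2023, 6, 1), "accuracy": 0.52},
]

def mean_accuracy(rows):
    """Average accuracy over a group of evaluation results."""
    return sum(r["accuracy"] for r in rows) / len(rows)

before = [r for r in results if r["released"] < TRAINING_CUTOFF]
after = [r for r in results if r["released"] >= TRAINING_CUTOFF]

# A large gap in favor of pre-cutoff datasets is the pattern the
# researchers interpret as evidence of task contamination.
print(mean_accuracy(before))  # 0.795
print(mean_accuracy(after))   # 0.535
```

On its own, such a gap is only circumstantial evidence; the study therefore backs it up with direct inspection of training data, as described below.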

The study also included an analysis of open models’ training data and a membership inference attack. By examining the training data and extracting task examples from the models, the researchers found further evidence of task contamination: certain task examples were present in the training data, which could distort the evaluation of the models’ zero- and few-shot abilities.

Using a membership inference attack, the team also checked whether content generated by the models matched examples from the dataset verbatim. A high degree of overlap indicates that the model was trained on that data – and again, the team found evidence of task contamination.
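The core of such a verbatim-overlap check can be sketched in a few lines. This is a simplified illustration, not the authors’ code; the `generations` list stands in for actual model outputs, and the normalization step is an assumption about how one might ignore trivial formatting differences:

```python
# Hedged sketch: fraction of model generations that reproduce a dataset
# example verbatim (modulo case and whitespace). A high rate would be
# consistent with the example having appeared in the training data.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise
    doesn't mask an otherwise verbatim match."""
    return " ".join(text.lower().split())

def exact_match_rate(generations: list[str], dataset: list[str]) -> float:
    """Share of generations found verbatim in the evaluation dataset."""
    if not generations:
        return 0.0
    seen = {normalize(example) for example in dataset}
    hits = sum(normalize(g) in seen for g in generations)
    return hits / len(generations)

# Toy example with hand-written strings (not real model output):
dataset = [
    "The quick brown fox jumps over the lazy dog.",
    "Colorless green ideas sleep furiously.",
]
generations = [
    "the quick brown fox jumps over the lazy dog.",  # verbatim up to case
    "A fox jumped over a dog.",                      # paraphrase, no match
]
print(exact_match_rate(generations, dataset))  # 0.5
```

In practice the attack is more involved (prompting strategies, partial matches, statistical thresholds), but the underlying signal is this kind of suspicious verbatim reproduction.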

The team has not yet investigated GPT-4, but notes that the problem of task contamination is likely to be even greater for models trained with reinforcement learning from human feedback (RLHF).
