GPT-4 fails at simple tasks that humans can easily solve


Researchers from Meta's AI research group (FAIR), Hugging Face, AutoGPT, and GenAI present GAIA (General AI Assistants), a benchmark that measures AI performance on tasks that are easy for humans to solve.

The benchmark is based on the hypothesis that a potential Artificial General Intelligence (AGI) must outperform humans even on tasks that are easy for the average person to solve.

In contrast, the trend in benchmarks is to have AI solve tasks that are difficult for humans or require a high level of technical skill and knowledge, the researchers write.

GAIA aims to accelerate the development of AI systems with human-like capabilities across a broader range of tasks. The research team expects that solving the GAIA benchmark will mark a milestone in the development of AI.



466 everyday tasks

GAIA consists of 466 questions that require basic skills such as reasoning, dealing with different modalities, navigating the Internet, and using tools (e.g., Internet search). The questions are designed to be challenging for AI systems, but conceptually simple for humans.

Sample questions at different levels. | Image: Mialon et al.

Questions are divided into three levels of difficulty based on the number of steps required to solve the question and the number of different tools required.

Level 1 questions usually require at most one tool and no more than five steps. Level 2 questions usually take more steps, roughly five to ten, and require a combination of tools. Level 3 questions require a system that can perform sequences of actions of any length, use any number of tools, and have general access to the world.
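The level scheme can be read as a simple rule over two numbers. The following Python sketch is only an illustration: the function name `gaia_level` and the hard cut-offs are assumptions derived from the "usually" thresholds described above, not an official assignment procedure from the benchmark.

```python
def gaia_level(num_steps: int, num_tools: int) -> int:
    """Estimate a difficulty level from step and tool counts.

    Assumption: the article's approximate thresholds are treated as hard
    cut-offs here purely for illustration.
    """
    if num_steps < 0 or num_tools < 0:
        raise ValueError("counts must be non-negative")
    # Level 1: at most one tool and no more than five steps.
    if num_steps <= 5 and num_tools <= 1:
        return 1
    # Level 2: up to roughly ten steps, typically combining several tools.
    if num_steps <= 10:
        return 2
    # Level 3: longer action sequences with any number of tools.
    return 3


if __name__ == "__main__":
    for steps, tools in [(3, 1), (8, 3), (20, 5)]:
        print(steps, tools, gaia_level(steps, tools))
```

A question that needs three steps but two different tools lands in Level 2 under this reading, because it exceeds the one-tool limit of Level 1.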

Image: Mialon et al.

GAIA attempts to avoid the pitfalls of current AI assessment methods by being easy to interpret, non-manipulable, and simple to use. The answers are fact-based, concise, and unambiguous, allowing for easy, quick, and objective assessment. The questions are designed to be answered zero-shot, which simplifies the evaluation.
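Because the answers are short and unambiguous, scoring can in principle be reduced to a string comparison after light normalization. The Python sketch below shows one way this might look; the helper names (`normalize`, `is_correct`, `accuracy`) and the normalization rules are assumptions for illustration, not the benchmark's actual scoring code.

```python
import re

# Plain integers or decimals, e.g. "42", "-3.5" (thousands separators are stripped first).
_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    text = re.sub(r"\s+", " ", answer.strip().lower())
    return text.rstrip(".")


def _as_number(text: str):
    """Return a float if the text is a plain number, otherwise None."""
    cleaned = text.replace(",", "")
    return float(cleaned) if _NUMBER.fullmatch(cleaned) else None


def is_correct(prediction: str, truth: str) -> bool:
    """Compare numerically when both answers are numbers, else as normalized strings."""
    p, t = normalize(prediction), normalize(truth)
    p_num, t_num = _as_number(p), _as_number(t)
    if p_num is not None and t_num is not None:
        return p_num == t_num
    return p == t


def accuracy(predictions: list[str], truths: list[str]) -> float:
    """Fraction of predictions that match their reference answer."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("need equally sized, non-empty answer lists")
    return sum(is_correct(p, t) for p, t in zip(predictions, truths)) / len(truths)


if __name__ == "__main__":
    print(accuracy(["Paris", "1,000", "blue"], ["paris.", "1000", "red"]))
```

In this sketch, two of the three sample answers match after normalization, so the reported accuracy is two thirds. No human grader or model-based judge is needed, which is the practical benefit of fact-based answers.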

GPT-4 fails at simple tasks

In the first evaluation, even advanced AI systems such as GPT-4 with plugins struggled with the GAIA benchmark.


In 2022, 37.7% of people over the age of 25 in the U.S. held a bachelor’s degree. Whether an academic degree has a significant impact on a person’s ability to perform the tasks in the benchmark is an open question.

Standard search engine falls behind in GAIA benchmark

The research team also sees potential for using LLMs as a search engine replacement: the study notes that while human web searches can provide direct text results for Level 1 questions, from which the correct answer can be inferred, they are less effective for more complex Level 2 and Level 3 queries.

In this case, a human web search would be slower than a typical LLM assistant because the user would have to sift through the initial search results. However, this assessment does not take into account the reliability and accuracy of the search result, which is the real problem with LLMs as a search engine replacement.

Back in September 2023, a study showed that language models cannot generalize the simple logical conclusion “A is B” to “B is A,” demonstrating a “fundamental failure in logical reasoning.”
