Google Gemini Pro falls behind free ChatGPT, says study


A recent study by Carnegie Mellon University (CMU) shows that Google’s latest large language model, Gemini Pro, lags behind GPT-3.5 and far behind GPT-4 in benchmarks.

The results contradict the information provided by Google at the Gemini presentation. They highlight the need for neutral benchmarking institutions or processes.

Gemini Pro loses out to GPT-3.5 in benchmarks

Google DeepMind’s Gemini is the latest in a series of major language models. The Gemini team claims that the “Ultra” version, due out early next year, will outperform GPT-4 on various tasks. But Google has already fiddled with the presentation of Ultra’s benchmark results.

Google also claims that Gemini Pro, which is now available and powers the Bard chatbot, is comparable to or better than OpenAI’s GPT-3.5. However, the CMU study found that Gemini Pro performed worse than OpenAI’s GPT-3.5 Turbo on every benchmark tested at the time of the study.



Benchmark discrepancies

Some discrepancies may be due to Google’s protection mechanisms, which caused the model to not answer some questions in the MMLU assessment. These missing answers were scored as incorrect for each model.
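The effect of that scoring rule can be illustrated with a minimal sketch. The function name, the toy answer key, and the toy predictions below are hypothetical and are not taken from the CMU study; the sketch only assumes that an unanswered question is represented as `None` and counted as wrong.

```python
from typing import Optional, Sequence


def accuracy_with_refusals(
    predictions: Sequence[Optional[str]], answers: Sequence[str]
) -> float:
    """Return the fraction of correct answers.

    A prediction of None (the model gave no answer) counts as incorrect,
    mirroring the scoring rule described in the article.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        1 for pred, gold in zip(predictions, answers)
        if pred is not None and pred == gold
    )
    return correct / len(answers)


# Hypothetical toy data: four multiple-choice questions, one left unanswered.
gold = ["A", "C", "B", "D"]
preds = ["A", None, "B", "A"]

score = accuracy_with_refusals(preds, gold)

# For comparison only: accuracy over the questions that were actually answered.
answered = [(p, g) for p, g in zip(preds, gold) if p is not None]
answered_only_score = sum(p == g for p, g in answered) / len(answered)
```

In this toy case the model scores 2 of 4 under the rule above, but 2 of 3 if unanswered questions were excluded instead, which shows how blocked responses alone can lower a reported score.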

However, the researchers also found that Gemini Pro performed worse in the area of basic mathematical reasoning, which is required for tasks in formal logic and elementary mathematics.

In terms of subject categories, Gemini Pro only outperformed GPT-3.5 in Security Studies and High School Microeconomics. It trailed in all other categories.

Image: CMU, Akter et al.

Google reported Gemini Pro’s MMLU 5-Shot and Chain of Thought (CoT) scores as 71.8 and 79.13, respectively, while the CMU researchers reported 64.1 and 60.6, respectively. The Big Bench Hard benchmark score reported by Google was 75.0, while the CMU researchers found it to be 65.6. These are significant differences, the origin of which is still unclear.

Google achieved significantly higher benchmark scores with Gemini Pro than the CMU researchers. | Image: Google DeepMind
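To put the quoted figures side by side, the following short sketch computes the gap between Google's reported scores and the CMU results for each benchmark. The numbers are the ones cited in this article; the variable names are illustrative only.

```python
# (Google-reported score, CMU-measured score) as quoted in the article.
reported = {
    "MMLU 5-shot": (71.8, 64.1),
    "MMLU CoT": (79.13, 60.6),
    "Big Bench Hard": (75.0, 65.6),
}

# Gap in points between the two reports, rounded to two decimals.
gaps = {
    name: round(google - cmu, 2)
    for name, (google, cmu) in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: Google's figure is {gap} points higher")
```

By this arithmetic the gaps are roughly 8 points for MMLU 5-shot, 19 points for MMLU with Chain of Thought, and 9 points for Big Bench Hard.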

Need for neutral model benchmarking

The results of the study show that the exclusive use of self-reported benchmarks from large companies is not a reliable measure of LLM performance.


best-rated model in the chatbot arena. This shows that benchmarks are only of limited value.
