Study finds ChatGPT gets science wrong more often than you think
Washington State University professor Mesut Cicek and his research team repeatedly tested ChatGPT by giving it hypotheses taken from scientific papers. The goal was to see if the AI could correctly determine whether each claim was supported by research or not; in other words, whether it was true or false.
In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency.
When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive. The AI performed only about 60% better than chance, a level closer to a low D than to strong reliability.
The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency. Even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.
The findings, published in the Rutgers Business Review, highlight the importance of using caution when relying on AI for important decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet demonstrate a matching level of conceptual understanding.
The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, performance remained similar across both versions. After adjusting for random chance, which gives a 50% probability of a correct answer, the AI's effectiveness was only about 60% above chance in both years.
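The chance adjustment described above can be illustrated with a short sketch. It assumes the standard correction for guessing on a true/false task, (accuracy - chance) / (1 - chance); the article does not spell out the researchers' exact formula, so the function name and the formula itself are assumptions rather than the study's method. Under that assumption, the 80% raw accuracy reported for 2025 works out to roughly 60% above chance.

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Share of the gap between guessing and a perfect score that was closed.

    Assumed formula (not confirmed by the article):
        (accuracy - chance) / (1 - chance)
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


# Raw accuracies reported in the article for each test year.
for year, accuracy in [(2024, 0.765), (2025, 0.80)]:
    score = chance_adjusted(accuracy)
    print(f"{year}: raw {accuracy:.1%}, chance-adjusted {score:.0%}")
```

A score of 0% on this scale would mean the answers were no better than coin flips, and 100% would mean every answer was correct.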
Science Daily