AI systems can recognize mental states more accurately than humans in testing

22nd May, 2024

Humans are intricate beings with complex ways of communicating, and psychologists have developed numerous tests to evaluate our ability to interpret and understand one another. Recent research published in Nature Human Behaviour reveals that some large language models (LLMs) now match, or even surpass, humans on tasks that assess the ability to track mental states, known as "theory of mind."

This advancement doesn’t imply AI systems can truly understand human emotions, but it shows significant progress in models performing tasks thought to be uniquely human. Researchers applied the same systematic methods used in human theory of mind tests to LLMs to explore their successes and shortcomings.

More capable AI models could appear more empathetic and useful in interactions. OpenAI and Google recently introduced advanced AI assistants, GPT-4o and Astra, designed to respond more naturally. However, it's crucial to remember that these abilities are not genuinely human, even if they seem so.

Cristina Becchio, a neuroscience professor at the University Medical Center Hamburg-Eppendorf, cautions, “We naturally attribute mental states and intentions to entities without minds. The risk of attributing theory of mind to LLMs is significant.”

Theory of mind is essential for social and emotional intelligence, enabling us to infer intentions and empathize with others. Most children develop these skills between ages three and five.

Researchers tested two families of LLMs: OpenAI's GPT-3.5 and GPT-4, and three versions of Meta's Llama 2. The models were given tasks designed to test human theory of mind, including recognizing false beliefs, detecting faux pas, and understanding implied meanings, and their results were compared with scores from 1,907 human participants.

Five types of tests were conducted: the hinting task, the false-belief task, recognizing faux pas, understanding strange stories, and comprehending irony. Each model took each test 15 times in separate chats, so that every response was produced independently, and the responses were scored using the same criteria applied to the human participants.
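The protocol described above can be sketched in a few lines: pose the same test item repeatedly in fresh sessions, score each response in isolation, and compare the model's average to the human baseline. This is a minimal illustration, not the study's actual harness; the `ask` and `score` functions and the toy false-belief item are hypothetical stand-ins.

```python
import statistics
from typing import Callable, List

def run_trials(ask: Callable[[str], str], prompt: str,
               score: Callable[[str], int], n_trials: int = 15) -> List[int]:
    """Pose one test item in n_trials independent sessions.

    Each call to `ask` stands in for a fresh chat, so no context
    carries over between trials. Each response is scored on its own
    (1 = correct, 0 = incorrect), mirroring how human answers were graded.
    """
    return [score(ask(prompt)) for _ in range(n_trials)]

def compare_to_humans(trial_scores: List[int], human_mean: float) -> str:
    """Summarize model accuracy against a human-average baseline."""
    model_mean = statistics.mean(trial_scores)
    if model_mean > human_mean:
        return "above human average"
    if model_mean < human_mean:
        return "below human average"
    return "at human average"

# Toy stand-ins: a "model" that answers a classic false-belief item
# correctly 12 times out of 15, and a scorer that checks for the
# expected answer ("basket", where Sally originally left her marble).
responses = iter(["basket"] * 12 + ["box"] * 3)
ask = lambda prompt: next(responses)
score = lambda reply: int("basket" in reply)

scores = run_trials(ask, "Where will Sally look for her marble?", score)
print(compare_to_humans(scores, human_mean=0.75))  # "above human average"
```

Scoring each of the 15 chats independently, rather than letting earlier answers condition later ones, is what makes the model's per-test distribution comparable to scores pooled across many human participants.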

Results showed GPT models often performed at or above human averages in tasks involving indirect requests, misdirection, and false beliefs. GPT-4 excelled in irony, hinting, and strange stories tests. Conversely, Llama 2 models performed below the human average, except for the largest model, which outperformed humans in recognizing faux pas scenarios. GPT models struggled with faux pas due to their reluctance to draw conclusions without sufficient information.

“These models aren’t demonstrating human-like theory of mind,” says a researcher. “But they show competence in making mentalistic inferences and reasoning about minds.”

The LLMs’ strong performance could be attributed to their training on well-established psychological tests, suggests Maarten Sap, an assistant professor at Carnegie Mellon University. “When administering a false-belief test to a child, they likely haven’t encountered that exact test before, but language models might have.”

Understanding LLMs remains a challenge. Research like this helps clarify what these models can and cannot do, says Tomer Ullman, a cognitive scientist at Harvard University. However, outperforming humans in theory of mind tests doesn’t equate to AI possessing a human-like theory of mind.

“I’m not anti-benchmark, but we’re reaching the limits of their usefulness,” Ullman states. “How these models pass benchmarks is not in a human-like manner.”