AI has a better vocabulary than humans

2024.09.11.

Researchers at ELTE PPK studied the verbal intelligence of artificial intelligence-based software. The research showed that the machines outperform even native speakers with a doctoral degree in terms of vocabulary.

Large language models (LLMs) are a type of artificial intelligence able to generate content similar to texts produced by humans. These models – such as the popular ChatGPT – have revolutionised the world of AI and can now imitate human capabilities quite realistically.

Experts have long wondered how intelligent machines are compared to humans. Classic comparative tests (such as the Turing test), however, cannot differentiate between individual performances: they can only measure how well machines mimic universal aspects of human cognition, such as communication skills. The study of human intelligence, in contrast, focuses on individual differences – IQ itself is a relative indicator that compares our performance to that of others.

According to some researchers, the best way to measure the intelligence of AI is to have the models perform psychometric tests originally designed for humans. In light of this, Kristóf Kovács, senior research fellow at ELTE PPK and head of the Cognitive Abilities Lab, and Balázs Klein, who works with test platforms, compared the verbal intelligence of two AI applications (ChatGPT and Bing) with the performance of more than 9000 humans.

In the study, they used a computer-based adaptive vocabulary test in which the models had to choose, from a list of nine words, the two that were closest to each other in meaning. In adaptive testing, an algorithm selects items from a question bank so that the difficulty level always stays close to the subject's estimated ability, which gives a more accurate result than a test with fixed questions.
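For readers curious about the mechanics, the snippet below is a minimal sketch of adaptive item selection under a simple one-parameter (Rasch) model. The item bank, the ability-update rule and the fixed test length are illustrative assumptions for the sketch, not the actual procedure used in the study.

```python
import math
import random

# Illustrative sketch of computerized adaptive testing (CAT).
# Assumptions: a Rasch (1-parameter logistic) model, a hand-made
# item bank, a simple shrinking-step ability update, and a fixed
# number of items. None of this is taken from the study itself.

def p_correct(ability, difficulty):
    """Rasch model: probability that the subject answers correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pick_item(item_bank, ability, used):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate (the most informative item under the Rasch model)."""
    candidates = [i for i in range(len(item_bank)) if i not in used]
    return min(candidates, key=lambda i: abs(item_bank[i] - ability))

def run_adaptive_test(item_bank, true_ability, n_items=10, step=0.5):
    ability = 0.0          # start from an average ability estimate
    used = set()
    for _ in range(n_items):
        item = pick_item(item_bank, ability, used)
        used.add(item)
        # Simulate the subject's (or model's) answer to this item.
        correct = random.random() < p_correct(true_ability, item_bank[item])
        # Nudge the estimate up after a correct answer, down otherwise,
        # shrinking the step so the estimate settles over time.
        ability += step if correct else -step
        step *= 0.9
    return ability

if __name__ == "__main__":
    random.seed(0)
    bank = [d / 4.0 for d in range(-12, 13)]   # difficulties from -3 to +3
    print(run_adaptive_test(bank, true_ability=1.5))
```

Because each item is chosen near the current estimate, the test homes in on the subject's level with far fewer questions than a fixed-form test would need.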

Both language models performed exceptionally well on the test: they did better than 95% of the human subjects (outperforming 19 out of 20 people) and scored higher than native speakers with a doctoral degree. If these machines were humans, they would be considered outstanding talents, the authors pointed out, adding that AI applications are likely to soon possess a better vocabulary than 100% of humans.

Despite their outstanding performance, however, the machines made some mistakes. They gave different answers to 42% of the repeated questions – something that does not happen with human subjects. Occasionally, they also “hallucinated”: they answered with words that were not among the offered options. (This happened even after they had previously answered the same question correctly, so it was not a case of not knowing the answer.) These errors, however, point not to shortcomings in the software but to the limitations of applying psychometric tests designed for humans to artificial intelligence.

Looking at the results, we might wonder how to distinguish AI-generated content from human-written texts, given that the machines are already capable of such high verbal performance. The researchers advise caution when communication seems too sophisticated rather than too simplistic: an AI may well be working with a more extensive vocabulary than we are.