Voltaire Staff

Old ELIZA beats GPT-3.5 in Turing test




A Turing test recently conducted by AI researchers saw ELIZA, a 1960s chatbot, outperform GPT-3.5 with nearly double its success rate.


In a recent study published on arXiv in October, researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the Department of Cognitive Science at UC San Diego) built a website, turingtest.live, to organise an online version of the Turing test and see how well GPT-4 could convince people it was human when given different prompts.


Through the website, participants acted as interrogators, chatting with different AI programmes, including GPT-4, GPT-3.5, and ELIZA, one of the earliest chatbots. In some rounds, the researchers randomly paired two human participants with each other, so an interrogator could be questioning either a person or an AI.


The study involved 652 participants, leading to 1,810 interactions. After excluding certain games, such as repeated AI matches or instances where the participants knew each other, 1,405 interactions were analysed.


Surprisingly, the decades-old programme ELIZA convinced interrogators it was human in 27 per cent of its games, outperforming GPT-3.5, which managed only 14 per cent. GPT-4 did better, with a 41 per cent success rate, second only to real humans.


Shocking as the result may seem, some AI researchers warned against reading too much into the drubbing handed to the much-vaunted chatbot.


Princeton computer science professor Arvind Narayanan said in a post on X that the results were in line with ChatGPT's design, which deliberately limits its human-likeness.


Another researcher, Ethan Mollick, wrote, "I think the fact that GPT-3.5 loses to ELIZA is not that surprising when you read the paper. OpenAI has considered impersonation risk to be a real concern, and has RLHF'ed to ensure ChatGPT doesn't try to pass as human. ELIZA very much is designed to pass using our psychology."
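ELIZA's famous DOCTOR script relied on simple keyword matching and pronoun "reflection", turning a user's own statement back on them as a therapist-like question. The following minimal Python sketch illustrates that general technique; the rules shown are illustrative stand-ins, not Weizenbaum's original script.

```python
import random
import re

# Pronoun swaps used to mirror the user's statement back at them.
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "I", "your": "my", "yours": "mine", "are": "am",
}

# Keyword patterns paired with response templates, in the spirit of
# ELIZA's DOCTOR script (illustrative examples, not the originals).
RULES = [
    (re.compile(r"i need (.*)", re.I),
     ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (re.compile(r"i am (.*)", re.I),
     ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (re.compile(r"(.*) mother(.*)", re.I),
     ["Tell me more about your mother."]),
    (re.compile(r"(.*)", re.I),
     ["Please, go on.", "How does that make you feel?"]),
]

def reflect(fragment: str) -> str:
    """Swap first- and second-person words so the reply mirrors the user."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def respond(user_input: str) -> str:
    """Return the first matching rule's template, filled with reflected text."""
    for pattern, templates in RULES:
        match = pattern.match(user_input)
        if match:
            template = random.choice(templates)
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please, go on."

if __name__ == "__main__":
    print(respond("I am worried about my exams"))
    # e.g. "Why do you think you are worried about your exams?"
```

Because every reply is built from the interrogator's own words, the exchange exploits the human tendency to read meaning and empathy into the responses, which is the psychological effect Mollick alludes to.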


Throughout the interactions, interrogators commonly relied on tactics such as small talk and questions about general knowledge and current events.


Some of the more effective strategies included using a language other than English, discussing time or current events, and directly accusing the respondent of being an AI.


Interrogators judged whether they were talking to a human or a machine based on the replies they received. Interestingly, the research revealed that people relied more on linguistic style and emotional cues than on assessments of intelligence, flagging replies that were too formal or too informal, lacked a personal touch, or seemed generic.


Additionally, the study showed that a person's level of education or familiarity with large language models (LLMs) did not significantly affect their ability to detect AI.


British mathematician and computer scientist Alan Turing proposed the Turing test, originally called "the imitation game", in 1950. The test aims to determine whether a machine can hold a conversation indistinguishable from a human's.
