Zining Zhu Helps Developers and Users Navigate 'Adversarial Helpfulness' in Artificial Intelligence
Stevens computer science researcher and his team are exploring the unreliability of AI models
Artificial intelligence platforms such as ChatGPT and Notion show great promise in facilitating research and problem-solving. However, even when their answers seem accurate, they can be disarmingly unreliable.
Zining Zhu, assistant professor in the Department of Computer Science at Stevens Institute of Technology, has taken on the challenge of understanding the reasons for this fallibility and offering advice to avoid being misled. The project team also includes Stevens Ph.D. student Shashidhar Reddy Javaji ’27, who earned his master’s degree in computer science last year from the University of Massachusetts; Rohan Ajwani, a recent University of Toronto graduate with a master’s degree in computer engineering; and Frank Rudzicz, an associate computer science professor at the Vector Institute.
Their groundbreaking paper on the accuracy and limitations of large language models (LLMs), "LLM-Generated Black-Box Explanations Can Be Adversarially Helpful," will be published and presented as a poster at a machine learning workshop at the 2024 Conference on Neural Information Processing Systems (NeurIPS) in Vancouver.
Artificial intelligence isn't always as smart as it seems
Large language models can be powerful tools for problem-solving and knowledge generation. The trouble is, their correct and incorrect explanations can be equally convincing, leading people—and other LLMs—to trust these mistakes. Zhu and his team describe this phenomenon as "adversarial helpfulness."
"One day, I made a typo when entering a problem into an LLM, and to my surprise, it still explained the problem smoothly," Zhu recalled. "That’s when I realized that the 'helpfulness' of these tools could work against us. Further testing proved this effect was prevalent. It’s alarming—whether you’re a professional making high-stakes decisions, a researcher attempting to solve a scientific challenge, or a child seeking to learn about the world."
Javaji then documented how disturbingly often LLMs create misleading but believable explanations by reframing questions, projecting unwarranted confidence and cherry-picking evidence to support incorrect answers.
"I am thrilled to work on this innovative and impactful research," Javaji said, "leveraging our cutting-edge resources at Stevens to advance our understanding of LLM limitations and develop more reliable AI systems for a better future."
Zhu and his team continued their investigation by designing a task involving graph analysis, a known weak spot for these models: finding alternative paths between nodes in a graph. Even though the models struggled to solve these basic assignments, they produced incorrect answers with full confidence, further exposing the limits of their logical reasoning.
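The paper does not publish code for this probe, but a minimal Python sketch shows how easily such a task can be checked programmatically. Everything below is illustrative: the graph, the node labels and the find_simple_paths function are hypothetical examples, not the team's actual benchmark.

# Illustrative sketch (not the paper's benchmark): enumerate all simple paths
# between two nodes so an LLM's claimed "alternative path" can be verified.
from typing import Dict, List

# Hypothetical undirected graph, given to the model as an adjacency list.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def find_simple_paths(graph, start, goal, path=None):
    """Depth-first enumeration of all simple (cycle-free) paths from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for neighbor in graph.get(start, []):
        if neighbor not in path:  # skip nodes already on the current path
            paths.extend(find_simple_paths(graph, neighbor, goal, path))
    return paths

if __name__ == "__main__":
    # Ground truth for this toy graph: A-B-D-E and A-C-D-E.
    for p in find_simple_paths(GRAPH, "A", "E"):
        print(" -> ".join(p))
    # An LLM asked for an "alternative path" can be checked against this list;
    # in the team's experiments, the models often gave wrong routes with full confidence.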
The researchers also discovered that "black-box" explanations, the common practice of giving an LLM only a problem and an answer and asking it to justify that answer without revealing the reasoning behind it, further clouded the integrity of the responses.
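To make the black-box setup concrete, here is a hedged illustration of the kind of prompt involved; the wording and the build_blackbox_prompt helper are invented for this article rather than taken from the paper.

# Illustrative only: a black-box explanation prompt hands the model a question
# and an answer (possibly wrong) and asks it to justify that answer, without
# exposing any reasoning used to reach it.
def build_blackbox_prompt(question: str, answer: str) -> str:
    # Hypothetical template, not the wording used in the paper.
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Explain why this answer is correct."
    )

if __name__ == "__main__":
    # Even when the supplied answer is wrong, an LLM prompted this way
    # will often produce a fluent, convincing justification.
    print(build_blackbox_prompt(
        "Which planet is closest to the Sun?",
        "Venus",  # deliberately incorrect
    ))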
Building trust for a safer future
As AI tools evolve to function like enhanced search engines that can deliver comprehensive explanations, how can developers and users counteract adversarial helpfulness?
"Ask LLMs to explain multiple perspectives," Zhu recommended. "Seek transparency in the AI’s decision-making process. Above all, fact-check the outputs."
This isn’t just an academic exercise. As AI tools become increasingly integral to education, professional decision-making, and daily life, ensuring their accuracy and reliability is crucial.
"Our research aligns with Stevens Institute’s mission to address critical societal challenges through technological innovation," Zhu noted. "I hope this work contributes to making AI tools safer and more effective for everyone."