
As businesses race to adopt chatbots for workflows, customer support and other functions, they’re often finding out the hard way that some chatbots have inherent flaws. Artificial intelligence, by definition, can learn, and it’s not always learning positive things. The most famous example is Tay, a Twitter chatbot Microsoft released in 2016. Less than 24 hours after its release, Tay, absorbing and learning from Twitter content, began repeating offensive material, including racist and misogynist phrases.
Appen Limited, which provides high-quality data for the AI lifecycle, recently announced the launch of two new products designed to help customers launch high-performing large language models (LLMs) whose responses are “helpful, harmless and honest,” reducing bias and toxicity, the company said.
To do this, the company developed two new features. The first, AI Chat Feedback, empowers domain experts to assess a live multi-turn conversation, reviewing, rating and rewriting each response. The second, Benchmarking, is a solution designed to help customers evaluate model performance across dimensions such as accuracy and toxicity.
Appen’s Benchmarking tool addresses a decision businesses face while under pressure to enter the AI market quickly: how to choose the right LLM for a specific enterprise application. Model selection has strategic implications for many dimensions of an application, including user experience, ease of maintenance and profitability. With the Benchmarking solution, customers can evaluate the performance of various models along commonly used or fully custom dimensions. Combined with a curated crowd of Appen’s AI Training Specialists, the tool evaluates performance along demographic dimensions of interest such as gender, ethnicity and language. A configurable dashboard enables efficient comparison of multiple models across those dimensions.
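The side-by-side comparison such a dashboard performs can be sketched in a few lines. This is a minimal illustration, not Appen's product: the model names and scores are made-up placeholders, and the only assumption is that some dimensions (like toxicity) are better when lower.

```python
# Placeholder benchmark scores for two hypothetical models.
scores = {
    "model_a": {"accuracy": 0.91, "toxicity": 0.08, "helpfulness": 0.85},
    "model_b": {"accuracy": 0.88, "toxicity": 0.02, "helpfulness": 0.90},
}

def best_model(scores, dimension, lower_is_better=False):
    """Return the model name that wins on one dimension."""
    pick = min if lower_is_better else max
    return pick(scores, key=lambda m: scores[m][dimension])

# Print a tiny comparison table, one winner per dimension.
for dim in ["accuracy", "toxicity", "helpfulness"]:
    winner = best_model(scores, dim, lower_is_better=(dim == "toxicity"))
    print(f"{dim:>12}: {winner}")
```

In practice no single model wins everywhere, which is why a dashboard view across dimensions, rather than one aggregate score, supports the selection decision.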
“As AI Chatbots grow more advanced, the stakes are higher for enterprises to get them right before they’re released into the world, or they risk harmful biases and dangerous responses that could have long-term impacts on the business,” said Appen CEO Armughan Ahmad. “Appen’s new evaluation products provide our customers with an essential trust layer that ensures they are releasing AI tools that are truly helpful and not harmful to the public. This trust layer is backed by robust datasets and processes that have proven effective in our 27 years of AI training work, and a team of over a million human experts who are attending to the nuances of the data.”
Edited by Greg Tavarez