This article is part of a Fortune Special Report on Artificial Intelligence.
In October, Google announced the biggest change to the way its search engine works in five years. Given its centrality to Google’s business, the tech giant doesn’t tinker with its search algorithm lightly. But the new algorithm added capabilities Google had been trying to achieve for years without success.
Thanks to the overhaul, the algorithm finally understands how prepositions, such as “for” and “to,” alter meaning. A search for “2019 brazil traveler to usa need a visa” no longer returns, as it did previously, irrelevant results about Brazilian visa requirements for U.S. visitors. Searching “Can you get medicine for someone at the pharmacy” now returns results specifically related to picking up another person’s prescription—not just having one filled in general. Google says the new algorithm improves the results returned for 1 in 10 of its English language searches. That may not sound like much, until you realize Google handles 63,000 searches every second.
This big leap forward was made possible by revolutionary developments in a branch of A.I. called natural language processing (or NLP for short). NLP refers to software that can manipulate and to some degree “understand” language. (The extent to which the mathematical models that underpin NLP equate to human language “understanding” remains hotly contested). The current A.I. boom, which has been underway now for about a decade, was initially sparked by breakthroughs in computer vision—software that can classify and manipulate images. Scientists have tried to apply many of the same machine learning techniques to language, with impressive results in a few areas, like translation. But for the most part, despite the appearance of digital assistants like Siri and Alexa, progress in NLP had seemed plodding and incremental.
Over the past 18 months, though, computer scientists have made huge strides in creating algorithms with unprecedented abilities at a variety of language tasks. What’s more, these new algorithms are making the leap from the lab and into real products at a breakneck pace—already changing the way tech’s biggest players, and many other businesses, operate. The NLP revolution promises better search engines and smarter chatbots and digital assistants. It could lead to systems that automatically analyze—and maybe even compose—legal documents and medical records. But some believe the NLP revolution will do far more. They think better language understanding just might be the key to unlocking more human-like, or even superhuman, artificial general intelligence.
Ashish Vaswani doesn’t like to take credit for sparking the NLP revolution. In fact, when I catch up with the self-effacing 40-year-old computer scientist at an A.I. conference in Vancouver in December, he is reluctant to speak unless I also interview the rest of his research team. “This isn’t about me,” says Vaswani, who works for Google’s A.I. lab, Google Brain. His modesty notwithstanding, Vaswani has played a key role in the advancement of NLP. In 2017, he was the lead author on a research paper that transformed the entire field. Vaswani proposed and tested a new design for a neural network—a type of machine learning loosely based on the human brain. There are many different ways of configuring these networks. Vaswani and his team called their novel configuration, appropriately enough, a transformer.
Existing A.I. algorithms were pretty good at predicting the next datapoint in a sequence, but they had a critical weakness: they were lousy at long sequences, especially if the next datapoint depended heavily on variables that occurred much earlier. This is frequently the case in language, where, for example, the correct conjugation of a verb or gender of a pronoun at the end of a sentence can depend on a subject that occurs at the start of the sentence, or even several sentences back. The transformer, to a large degree, solved this problem.
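The transformer’s core trick, called self-attention, lets every word in a sequence weigh its relationship to every other word directly, so a pronoun at the end of a sentence can draw on a subject far behind it in a single step. The sketch below is an illustrative toy in Python/NumPy, not Google’s implementation; real transformers add learned query, key, and value projections and multiple attention heads, which are omitted here.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence.

    x: array of shape (seq_len, d) -- one vector per word.
    Every position attends to every other position directly,
    so a word at the end can use a word at the start without
    the signal decaying through intermediate steps.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # pairwise similarities, (seq_len, seq_len)
    # Softmax each row into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x  # each output vector mixes the whole sequence

# Ten "word" vectors; position 9 can attend to position 0 in one step.
seq = np.random.default_rng(0).normal(size=(10, 4))
out = self_attention(seq)
print(out.shape)  # (10, 4)
```

Because every pair of positions is compared directly, the distance between two related words no longer matters—which is exactly the long-range dependency problem the older sequence models struggled with.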
The following year, Vaswani’s colleague at Google Brain, Jacob Devlin, led another team that took Vaswani’s transformer and trained it on the relationship between words in a massive dataset: 3.3 billion words from 11,000 English-language books as well as Wikipedia. The Google Brain researchers further honed the algorithm by teaching it to accurately predict a missing word in a sentence and, if given a sentence, to accurately identify the next sentence from two possible choices. Once trained, they tried the algorithm on a series of language ability tests. It came close to human performance on most of them, even when faced with tasks for which it had never been explicitly trained.
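The first of those training tasks—predicting a hidden word from the visible context on both sides of it—is often called masked language modeling, and its setup can be sketched in a few lines. The toy below only shows the masking step, with an invented sentence and a hypothetical `mask_tokens` helper; the real system masks tokens across billions of words and predicts them with a neural network.

```python
import random

def mask_tokens(words, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the words with a [MASK] placeholder.

    Returns the masked sentence and the hidden answers. During training,
    the model is scored on how well it recovers each hidden word from
    the visible words on BOTH sides of it (hence "bidirectional").
    """
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, w in enumerate(words):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            answers[i] = w  # remember the ground truth for scoring
        else:
            masked.append(w)
    return masked, answers

sentence = "the doctor got into her car and drove to the hospital".split()
masked, answers = mask_tokens(sentence, mask_rate=0.3)
print(masked)
```

Because no human labeling is needed—the text itself supplies the answers—this objective scales to arbitrarily large corpora, which is what made training on 3.3 billion words practical.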
Devlin and his team called the new algorithm BERT. (Short for Bidirectional Encoder Representations from Transformers, the acronym continued an in-joke among NLP researchers of naming algorithms after characters from Jim Henson’s oeuvre. There’s ELMo, Grover, Big BIRD, two ERNIEs, KERMIT and more. Collectively, they are referred to as “Muppetware.”) Google not only published its research, but also open-sourced the algorithm, allowing anyone to download it and then fine-tune it for their own specific purposes. That has spawned a wave of BERT-based innovations.
“That is literally the moment that changed this company,” John Bohannon, director of science at San Francisco technology startup Primer, says of BERT’s publication. Primer makes software that analyzes large datasets for customers that include big law firms, financial firms, and intelligence agencies. Difficult problems Primer once had—such as teaching a system to determine whom the pronouns “he” and “she” refer to in a sentence when the primary noun isn’t present—BERT can now handle with only a modicum of additional training. With just 1,000 labeled training examples, Bohannon says, it is now possible to achieve 85% accuracy on many business-specific NLP tasks, something that previously would have taken 10 times as much data. With BERT as a backbone, he says, Primer is working to create software that accurately summarizes complex documents, a Himalayan goal that has stumped NLP researchers for years.
That improvement to Google’s search algorithm in October? That was BERT. Weeks later, Microsoft announced it was using BERT to power its Bing search engine too. At LinkedIn, search results are now categorized using a smaller version of BERT called LiBERT that the company created and calibrated on its own data. It has helped increase engagement metrics from search results—such as connecting to a person on the professional network or applying for a job—by 3% overall, and clickthrough rates on online help center query results by 11%, says Ananth Sankar, the company’s principal staff engineer.
Like mechanics tuning a stock car, Facebook’s engineers also modified BERT—changing its training regimen and objective, and training it on more data for longer. The result is a model Facebook calls RoBERTa. And it has set RoBERTa on one of the company’s thorniest problems: content moderation. In 2018, Facebook was forced to admit its social network had been used to incite ethnic violence against Rohingya Muslims in Myanmar. Part of the problem: the company didn’t have enough people who spoke Burmese to screen the volume of content being posted for hate speech and disinformation. And, at the time, it couldn’t turn to existing machine translation algorithms because such systems required a large body of text in a target language to train effectively. Burmese is what language experts call a “low resource language”—relatively few examples of translated Burmese text are available in digital form.
RoBERTa, however, offered a solution. Facebook took the algorithm and, instead of having it learn the statistical map of just one language, tried having it learn multiple languages simultaneously. By doing this across many languages, the algorithm builds up a statistical image of what “hate speech” or “bullying” looks like in any language, Srinivas Narayanan, Facebook’s head of applied A.I. research, says. That means Facebook can now use automatic content monitoring tools for a number of languages, including relatively low-resource ones such as Vietnamese. Burmese may be next. The company says the new techniques were a big reason it was able, in just six months last year, to increase by 70% the amount of harmful content it automatically blocked from being posted.
While Facebook and LinkedIn focused on paring BERT down to make it more efficient, other labs took transformers and scaled them up. OpenAI, the San Francisco-based A.I. research company, decided to see what would happen if—rather than using Wikipedia and a dataset of thousands of books to train its model, as Google did with BERT—it scraped 8 million pages from the Internet. Where the largest version of BERT has 340 million parameters, OpenAI’s system has 1.5 billion. And the company only stopped there because that was the biggest algorithm it could fit on a single server, says Dario Amodei, OpenAI’s vice president of research. The result was GPT-2: an algorithm that can write several paragraphs of novel and mostly coherent prose from a human-authored prompt of a few sentences. (GPT-2 has been controversial, largely because OpenAI initially withheld public release of the full-scale version of the model over concerns people might use it for malicious purposes, like automatically generating fake news stories. It later reversed that decision.)
Algorithms like GPT-2 could point the way towards much more fluent chatbots and digital assistants, with big potential implications for customer relationship management and sales, says Richard Socher, Salesforce’s chief scientist. “At some point, maybe we can automate certain parts of the conversation fully,” he says. Salesforce’s A.I. lab has created one of the largest language models published so far. Called CTRL, it is slightly larger than GPT-2 and gives users the ability to more easily control the genre and style of text the algorithm writes (hence the name). As with BERT, Socher says one of the big benefits of CTRL is that a company can take the pre-trained model and, with very little data of its own, tune it to its exact business needs. “Even with a couple thousand examples, it will still get better,” he says.
People have used these large language models in surprising ways. Microsoft took a version of GPT-2 and tuned it on lines of software code from GitHub, the code repository service it now owns. The result, called IntelliCode, works like the autocomplete function in Gmail or Microsoft Office, only for code. OpenAI used the same underlying transformer as GPT-2 but trained it on music instead of text, creating MuseNet, an A.I. that generates four-minute compositions for as many as 10 different instruments.
BERT and GPT-2 are creeping closer to what computer scientist Alan Turing first proposed in the 1950s as his test of whether a machine should be considered “intelligent.” (Turing said a machine should be considered intelligent if a person couldn’t tell if it were a machine or a person based on its written responses to questions.) And they may continue to make progress. “I don’t think we’ve seen the limits of what transformers can do for conversations and dialogue,” says Jerome Pesenti, the vice president for A.I. at Facebook.
But, as OpenAI’s Amodei readily admits, today’s NLP software is still very far from perfect. “The things it writes are not at the point where they are indistinguishable from a human,” he says of GPT-2. And Bohannon says that there are critical aspects of language that these transformer-based models don’t capture. One notable example: negation. “They don’t understand the word ‘not,’” he says. They also can’t follow logical chains that humans find trivial, Bohannon says. And computer scientists are still exploring the extent to which these large algorithms actually understand things like grammar.
Another big problem with BERT and its offspring: because they are pre-trained on lots of books, many of them written decades ago, they bake in historical biases, particularly around gender. Ask BERT to fill in the missing pronoun in the sentence, “The doctor got into ____ car,” and the A.I. will answer, “his” not “her.” Feed GPT-2 the prompt, “My sister really liked the color of her dress. It was ___” and the only color it is likely to use to complete the thought is “pink.”
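The mechanism behind such bias is easy to illustrate: if the training text pairs “doctor” with “his” far more often than with “her,” a model that predicts the most probable filler simply reproduces the skew. The sketch below is a deliberately crude toy—its corpus counts are invented, and the frequency-based `fill_blank` helper stands in for a real language model’s probability estimates—but the failure mode it shows is the same.

```python
from collections import Counter

# Invented counts standing in for a large corpus of older books,
# where "doctor" co-occurs with "his" far more often than "her".
corpus = (
    ["the doctor got into his car"] * 9 +
    ["the doctor got into her car"] * 1
)

def fill_blank(corpus, prefix, suffix):
    """Pick the most frequent word seen between prefix and suffix.

    A stand-in for a language model choosing its highest-probability
    completion: whatever dominates the training data wins.
    """
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(1, len(words) - 1):
            if words[i - 1] == prefix and words[i + 1] == suffix:
                counts[words[i]] += 1
    return counts.most_common(1)[0][0]

print(fill_blank(corpus, "into", "car"))  # 'his' -- the corpus skew wins
```

Nothing in the model is “deciding” that doctors are men; it is faithfully mirroring the statistics of the text it was fed, which is why biased training data yields biased completions.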
Gary Marcus, an emeritus professor of cognitive psychology at New York University, who is a frequent critic of deep learning approaches like those that underpin BERT and GPT-2, says the models, despite being trained on such large datasets, struggle to keep track of numerical quantities and solve simple word problems, and exhibit little common sense. “The underlying representations are actually very superficial,” he says.
Interestingly, both Marcus and Amodei agree that NLP progress is critical if scientists are ever going to create so-called artificial general intelligence, or AGI. (That is the sort of human-like or superhuman intelligence that can perform a range of tasks.) And they think so for exactly the same reasons. Amodei says OpenAI wanted to create GPT-2 in the first place because it is interested in creating a better way for humans to interface with machines using natural language. That is important, Amodei says, so a human could help teach a future machine intelligence what to do—and, just as critically, what not to do. Marcus says that because so much of the world’s knowledge exists in written form, any AGI would have to be able to read and understand what it was reading.
But the two disagree on how to get there. Amodei is convinced that larger and larger deep learning systems will be an important component in solving NLP, while Marcus thinks a hybrid approach that combines something like deep learning with rule-based, symbol manipulation, will be necessary.
David Ferrucci, the former IBM researcher who built and led the team that created Watson, the system that famously beat a human champion on the game show Jeopardy!, has now dedicated himself to trying to build this kind of hybrid system. He is the founder and chief executive of Elemental Cognition, a Connecticut-based startup that was initially funded by hedge fund billionaire Ray Dalio’s Bridgewater Associates. It is trying to create human-like A.I. through a natural language understanding and dialogue system.
Elemental’s software is not a single algorithm. It is built from several disparate components—including pre-trained transformer-based language models, which it uses to extract information from texts and power its chatbot interface. Elemental’s system can take a simple story and then, through the chatbot, ask a series of questions about it, which a human has to answer. The questions ensure the software has correctly extracted the subject and action of the story. But critically, Ferrucci says, the primary objective is to get the software to learn how the world works, including causation, motivation, time and space. These answers are then encoded symbolically, in a rule-based form. “It is building causal models and logical interpretations of what it is reading,” says Ferrucci.
Ferrucci says Elemental’s software performs well on a category of difficult NLP tests, called Winograd schemas, that are designed to see how well NLP systems grasp logic and common sense, and that it has performed much better than the transformer-based algorithms do on their own. But Elemental has yet to publish these results.
Of course, a big drawback of Ferrucci’s approach, even if it proves to work (and that is still a big if), is that depending on human instructors and dialogue is slow. “We are trying to figure out how to scale this,” he says.
Ferrucci is convinced he is on the right track, however. The problem with most NLP research today, he says, is that it is essentially trying to reverse engineer language to get at the underlying concepts that generate it. But as Wittgenstein and other philosophers of language have long pointed out, language is fundamentally ambiguous. Two people don’t have the exact same representation of something in their head, even though they may use the same words for it. “Humans don’t even agree on most concepts,” says Ferrucci. “That is why you actually need dialogues, to establish a common interpretation.”
Simply building ever larger statistical models of language is unlikely to ever yield conceptual understanding, he says. Conversation, on the other hand, just might.
No longer lost in translation
Here’s a quick look at a few of the algorithms powering the revolution:
Based on a new type of neural network called a “transformer,” developed at Google Brain, BERT is the algorithm behind the improved Google search. (Short for Bidirectional Encoder Representations from Transformers, the acronym continues an in-joke among NLP researchers of naming algorithms after Jim Henson’s creations: There’s also ELMo, Grover, Big BIRD, two ERNIEs, and a KERMIT. Together, they’re known as “muppetware.”)
Google not only published its BERT research but also open-sourced the algorithm. That quickly spawned a wave of BERT-based innovations. Microsoft is now using BERT to power its Bing search engine. At LinkedIn, search results are now more efficiently categorized using a stripped-down version of BERT called LiBERT. And Facebook has created a new algorithm called RoBERTa, designed to better identify hate speech and bullying, including in languages, such as Burmese, for which there is less digital material to study.
San Francisco A.I. startup OpenAI trained GPT-2, an NLP system with 1.5 billion parameters (versus the 340 million of the largest version of BERT), on text scraped from 8 million Internet pages. The resulting algorithm can write several paragraphs of mostly coherent prose from a human-authored prompt of a few sentences—and could point the way to more fluent digital assistants.
Salesforce’s A.I. lab has created CTRL, one of the largest language models published so far. Slightly larger than GPT-2, it gives users the ability to more easily control the genre and style of text the algorithm writes (hence the name).
As with BERT, one of the big benefits of CTRL is that a company can take the pretrained model and, with very little data of its own, tune it to its business needs. “Even with a couple thousand examples, it will still get better,” says Salesforce chief scientist Richard Socher.
A version of this article appears in the February 2020 issue of Fortune.
More from Fortune’s special report on A.I.:
—Inside big tech’s quest for human-level A.I.
—Medicine by machine: Is A.I. the cure for the world’s ailing drug industry?
—Facebook wants better A.I. tools. But superintelligent systems? Not so much.
—A.I. in China: TikTok is just the beginning
—A.I. is transforming HR departments. Is that a good thing?