This post is by Grammarly research scientists Dimitris Alikaniotis and Vipul Raheja.
Over the past ten years, Grammarly has focused on building one of the world’s best writing assistants. One major aspect of this is grammatical error correction (GEC). It’s a core component of achieving our mission of improving lives by improving communication—from helping people craft a mistake-free social media post to supporting them in perfecting a cover letter for a dream job.
To this end, we have made tremendous progress toward building an AI-powered system that achieves state-of-the-art performance in correcting grammatical errors in English writing. Progress in GEC research has historically relied on training large machine learning models on vast quantities of annotated data. As part of our pursuit to improve the performance of GEC systems, we have also researched models that do not rely on these large annotated datasets but instead leverage language-modeling methods of natural language processing (NLP).
This is also the subject of a new paper from the Grammarly research team that appeared at the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at the Association for Computational Linguistics (ACL) conference in Florence, Italy.
An informal background on language modeling
Let us begin with a simple linguistic fact:
- The cat sat on the mat.
- The cat sit on the mat.
- On the mat the cat sat.
- On cat mat the the sat.
Many people would be able to infer the intended meaning of these four sentences: there is a cat that’s sitting on a mat. Speakers of English would find the first option the most plausible, however. Why is this? The answer is not trivial—philosophers, linguists, psychologists, and cognitive scientists have long debated the answer.
Let’s first take the simplest approach and offer that English, like all human languages, follows some grammatical rules. These include:
- The object of the sentence follows the verb.
- The subject and the verb must agree in number.
- The article comes before the noun.
Students of English tend to learn these rules very early on, either through explicit instruction or simple induction from being exposed to the language, and follow them throughout their lives. Rules of this kind are essential pieces of language and can tell us a lot about its evolution, its connection to other languages, and its learnability.
Though languages do tend to follow such rules, let’s take a step back and look at the problem of the cat on the mat from another perspective. Consider the following thought experiment, in which we presume we don’t know how to write or speak English but do have access to thousands of books written in English. In this scenario, we might notice that words ending in -s tend to go before the word sit, while words without -s tend to go before sits. Because we don’t have any direct knowledge of English, we’re not aware of the rules governing this behavior. Even so, we can extract some statistical patterns that might help us construct the correct sentence. These might be:
- Words without -s are more probable with sits or sat.
- There is a low probability of the word the following another the.
These statistical properties of language are what we call “shallow,” in the sense that they take into account superficial regularities. In other words, they do not “understand” the meaning of the sentence but rather reflect some of its statistical properties. Even such shallow properties can effectively cover quite a few rules.
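To make these “shallow” patterns concrete, here is a minimal sketch in Python (our own illustration, not code from the paper) that counts adjacent word pairs in a toy corpus and uses nothing but those counts to judge which wording is more plausible:

```python
from collections import Counter

# A toy "corpus"; in practice this would be thousands of books.
corpus = (
    "the cat sits on the mat . the cats sit on the mat . "
    "the dog sits on the rug . the dogs sit on the rug ."
).split()

# Count adjacent word pairs (bigrams) -- a purely "shallow" statistic
# that knows nothing about subjects, verbs, or agreement.
bigram_counts = Counter(zip(corpus, corpus[1:]))

# The counts alone suggest which verb form goes with which noun:
print(bigram_counts[("cats", "sit")])   # seen in the corpus
print(bigram_counts[("cats", "sits")])  # never seen -> less plausible
```

No rule about subject-verb agreement appears anywhere in this sketch; the regularity simply falls out of the counts.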
What we did
Would it be possible to identify statistical patterns that could help us correct usage? It turns out that such patterns are quite powerful tools for distinguishing between correct and incorrect usage. In our paper, we explore, from an engineering point of view, how to do so accurately.
Let’s begin by examining a recent trend in natural language processing: what are called transformer models. The name might sound like something out of a sci-fi movie, but in our research it refers to models that go through enormous amounts of text and distill that linguistic knowledge into statistical patterns. They might, for instance, be given the following sentence with a blank:
- The cat sat on the __.
The transformer model will try to fill in the blank with the most appropriate word.
The model does not need to know what mat means, but based on the data it has seen and the statistical patterns induced, it can infer that it’s more probable to have mat in the blank than Roomba.
The task in itself is simple. Imagine you have an excellent memory, have gone over large amounts of text, and can remember exactly how many times you saw the sequence “sat on the mat.” That count will be far higher than the number of times you saw “sat on the Roomba.”
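As an illustration, a pretrained masked language model can rank candidate words for the blank. The snippet below assumes the Hugging Face transformers library and the public bert-base-uncased model; it is a convenient stand-in, not necessarily the setup used in our paper:

```python
from transformers import pipeline

# A pretrained masked language model scores candidate fillers for the blank.
fill_blank = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_blank("The cat sat on the [MASK]."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')
```

Words like mat should come out with far higher scores than Roomba, purely on the strength of the statistical patterns the model has distilled.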
The transformer models have another advantage, which makes them all the more powerful: they are capable of generalizing. Imagine that they have seen the word cat and the word dog appearing in similar sentences. Does this make the sentence “The dog sat on the mat” just as probable? According to transformers, it sure does.
Now the question becomes whether we could exploit such models to avoid having to enumerate all the possible rules in language. Could we do this for rules we cannot describe—or even use it for understudied languages for which grammatical rules have not been thoroughly documented? What we show in our paper is that we are indeed able to do so with high accuracy by merely taking into account different wordings of the same sentence:
- The cat sat on the mat.
- The cat sits on the mat.
- The cats sit on the mat.
- The cats sat on the mat.
- ….
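The sketch below shows one way to compare such rewordings: score each candidate with a left-to-right language model and keep the most probable one. It assumes the Hugging Face transformers library and the public GPT-2 model, and the helper name sentence_log_prob is ours; the scoring details in the paper differ in places:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Approximate total log-probability of a sentence under the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels makes the model return the mean
        # per-token negative log-likelihood as its loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item() * inputs["input_ids"].size(1)

candidates = [
    "The cat sat on the mat.",
    "The cat sit on the mat.",
    "The cats sit on the mat.",
    "The cats sat on the mat.",
]
for sentence in sorted(candidates, key=sentence_log_prob, reverse=True):
    print(sentence)  # the grammatical variants should rank highest
```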
Before we move on to our key findings, though, the critical reader might raise two valid questions. First, how does the transformer know that sit and sits are related? Regarding this question, we will admit that, at this point in our research, we help the model out by including a large, automatically generated database of English words so the model knows that sit, sitting, sat, and sits are related.
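In spirit, that database acts as a lookup table from a word to its related forms. The tiny dictionary below is a hypothetical stand-in for the real, automatically generated resource:

```python
# Hypothetical stand-in for the automatically generated database of
# English word forms described above.
INFLECTIONS = {
    "sit": {"sit", "sits", "sat", "sitting"},
    "cat": {"cat", "cats"},
}

def related_forms(word: str) -> set:
    """Return all forms related to a word, including the word itself."""
    for forms in INFLECTIONS.values():
        if word in forms:
            return forms
    return {word}

print(related_forms("sat"))  # {'sit', 'sits', 'sat', 'sitting'}
```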
A reader might also ask how we look for the most probable sentence. Do we look at all the possible alternatives? To this question, we would answer that we can’t enumerate every alternative, as that would be too time-consuming. In practice, we scan the sentence from left to right, beginning to end, and make the best possible choice at each point. This means that a later pass may be needed to revisit problematic choices, but it’s a trade-off between accuracy and speed.
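Putting the pieces together, the greedy left-to-right scan can be sketched as follows. It reuses the sentence_log_prob and related_forms helpers from the sketches above and is an illustration of the idea, not our production system:

```python
def correct_greedily(sentence: str) -> str:
    """Scan left to right, committing to the most probable form of each word."""
    words = sentence.split()
    for i, word in enumerate(words):
        # Try every related form of the current word and keep the one that
        # makes the whole sentence most probable under the language model.
        words[i] = max(
            related_forms(word),
            key=lambda form: sentence_log_prob(
                " ".join(words[:i] + [form] + words[i + 1:])
            ),
        )
    return " ".join(words)

print(correct_greedily("The cats sits on the mat ."))
```

Because each position is decided once and never revisited, the scan stays fast; that is precisely the accuracy-for-speed trade-off described above.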
What we found
What we find is that transformers making such local decisions are able to reach high levels of accuracy, comparable not only to customized systems that follow hard-coded English rules but also to deep neural networks specifically designed for this task. What’s more, because such systems only compare a few sentences at a time, they represent a scalable solution that can deliver corrections much faster than other systems.
While the results of this method are promising, does this mean we no longer need rules or sophisticated systems? The answer is, of course, no. This methodology should in no way be considered a panacea for systems that aim to automatically correct ungrammatical sentences.
The best performance can be achieved using a hybrid approach that combines multiple methods: custom-made rules, deep neural networks, and language models, among others. So far, transformer models cannot capture errors as nuanced as those handled by more sophisticated methods. (For example, what happens if we need to introduce a new word into the sentence?) Still, despite these early-stage shortcomings, transformer models can capture linguistic usage at scale and help such systems deliver the best corrections possible.
Looking forward
Here at Grammarly, we are committed to improving communication, and a big part of this is correcting grammatical errors. By using cutting-edge technology, Grammarly continues to develop new ways to provide the highest-quality corrections to our millions of users.
Dimitris Alikaniotis presented this research at the 57th Annual Meeting of the Association for Computational Linguistics in Florence, Italy, which took place from July 28 to August 2, 2019. The accompanying research paper, “The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction,” will be published in the proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA).