This article was co-written by Analytical Linguists Lena Nahorna and Lily Ng.
When working with language data to develop machine learning (ML) models, there is a movement toward prioritizing data quality over data quantity. The idea is that your model is only as good as the data you use for training and evaluation. And to build a high-quality labeled dataset, it’s essential to have a great annotation process.
In annotation, humans label or transform data inputs into so-called “gold data” that informs what machine learning practitioners are trying to model. For example, to create a gold dataset that will be used to build a model capable of correcting grammatical mistakes, annotators might be asked to identify the grammatical mistakes in a wide range of sample sentences. Regardless of the annotation task, there are certain repeatable strategies and processes that are useful if you’re a data science team member, computational linguist, ML practitioner, or researcher.
In this article, we’ll offer best practices for teams that use annotation to create high-quality datasets, focusing particularly on complex, subjective tasks using textual data. We’ll cover choosing the right data, working with annotators, running quality control, and more. If you invest in annotation and think through each step, you’ll almost certainly get better data as a result—so you can save time and budget in the long run, and build better models.
Getting ready for annotation
Before annotation can begin, the data science team will need to prepare the data and design the annotation task. There are no right answers here—you’ll be guided by your machine learning use case and requirements. A thorough, hands-on approach will help you stay one step ahead of potential issues and nuances that could come up during annotation.
Sampling: Where should you get your data from?
Sampling has a huge impact on your annotation task and model. To prevent overfitting, you’ll want to think through how to sample across different times, locations, and contexts.
Your data should reflect the domains where your model will be applied: if you’re building a natural language model to assist users on web and mobile, you should sample communications from both. From there, decide between representative sampling and even sampling across domains, and decide how much noise you’re willing to tolerate in your dataset. Representative sampling is the right choice if you want to train or evaluate your model on the natural distribution of the data; if that doesn’t surface enough interesting or relevant data points, sample more intentionally (for example, evenly across domains).
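To make the difference concrete, here is a minimal sketch of both strategies in Python. The `items` list and its `domain` field are hypothetical stand-ins for however your data is stored, not part of any particular library:

```python
import random
from collections import defaultdict

def representative_sample(items, k, seed=0):
    """Uniform random sample: preserves the dataset's natural domain distribution."""
    return random.Random(seed).sample(list(items), k)

def even_sample(items, k, seed=0):
    """Sample roughly k items, split evenly across domains (e.g., web vs. mobile)."""
    by_domain = defaultdict(list)
    for item in items:
        by_domain[item["domain"]].append(item)
    rng = random.Random(seed)
    per_domain = max(1, k // len(by_domain))
    sample = []
    for domain_items in by_domain.values():
        sample.extend(rng.sample(domain_items, min(per_domain, len(domain_items))))
    return sample
```

Even sampling trades faithfulness to the real-world distribution for coverage of rare domains, which is often exactly what you want when the rare domains are the interesting ones.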
Preprocessing: How do you need to clean and filter your data?
To save time and resources, preprocess your dataset to exclude anything that doesn’t need to be annotated, like duplicated or noisy data (such as URLs or garbled text). But remember to keep any relevant metadata (like domain, origin of data, and index or character offset) attached to the source, so you can properly handle and evaluate your data down the line.
Finally, consider your desired level of granularity for the data items that will be annotated. Depending on your model’s specifications, you may need to split up paragraphs into sentences or sentences into tokens.
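As a rough illustration, the sketch below (plain Python, with hypothetical field names like `source_id`) deduplicates records, strips URLs as one kind of noise, and splits each text into sentence-level items while keeping metadata and character offsets attached:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(records):
    """Filter noise and duplicates, then emit sentence-level items with metadata attached."""
    seen, items = set(), []
    for rec in records:  # rec: {"text": ..., "domain": ..., "source_id": ...}
        text = URL_RE.sub("", rec["text"]).strip()
        if not text or text in seen:  # skip empty or duplicated texts
            continue
        seen.add(text)
        # Naive sentence split; a real pipeline would use a proper segmenter (e.g., spaCy).
        offset = 0
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if not sentence:
                continue
            start = text.find(sentence, offset)
            items.append({
                "sentence": sentence,
                "domain": rec["domain"],
                "source_id": rec["source_id"],
                "char_offset": start,  # lets you map annotations back to the source text
            })
            offset = start + len(sentence)
    return items
```

The exact cleaning rules will depend on your data; the important part is that each small item still points back to where it came from.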
Designing your annotation task
Text annotation tasks fall into three categories:
- Labeling, where annotators apply one or more categories to the inputs
- Transformation, where annotators edit or rewrite the inputs
- Generation, where annotators create new text based on the inputs or from scratch
Regardless of the type of task, the user experience matters. If your task is designed in a simple, clear way and your annotators have a good experience, the end result will be a higher-quality dataset. These principles will help your annotators have a better time working on your task:
- Clarify your guidelines. Avoid complex terms, ambiguities, or inconsistencies. Describe the motivation for your task, and provide examples and counterexamples of what you’re looking for. (Tip: Perform version control for your guidelines, so you can track changes that inevitably occur and associate data with different sets of guidelines.)
- Keep your task simple. Breaking down your task into smaller steps will help your annotators stay focused and aligned. You may want to have more than one round of annotation, each focused on a single subtask.
- Prioritize efficiency. Annotator time is valuable, and you should only gather annotations when necessary. One way to save time is to prelabel data programmatically using a heuristic or an existing model. Another approach is active learning, where the model flags the examples it’s least certain about during training, and only those inputs are sent for annotation (see the sketch after this list).
- Make your interface work for annotators, not slow them down. Ideally, annotators should be able to go back to review and change their annotations (which can help raise the quality of your dataset). They should be able to rely on the keyboard more than the mouse, which is more ergonomic, and the layout should minimize scrolling and clicking. Finally, annotators should be able to easily reference the task guidelines.
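To make the efficiency ideas above a bit more concrete, here is a minimal sketch of confidence-thresholded prelabeling and uncertainty-based active learning. It assumes a scikit-learn-style classifier and text vectorizer; the names `model`, `vectorizer`, and both helper functions are placeholders rather than any specific library’s API:

```python
import numpy as np

def select_for_annotation(model, vectorizer, unlabeled_texts, budget=100):
    """Uncertainty sampling: pick the items the model is least confident about."""
    probs = model.predict_proba(vectorizer.transform(unlabeled_texts))
    confidence = probs.max(axis=1)  # probability of the model's top label for each item
    most_uncertain = np.argsort(confidence)[:budget]
    return [unlabeled_texts[i] for i in most_uncertain]

def prelabel(model, vectorizer, texts, threshold=0.95):
    """Programmatic prelabeling: auto-apply labels the model is very confident about,
    so annotators review or correct them instead of labeling from scratch."""
    probs = model.predict_proba(vectorizer.transform(texts))
    top_labels = probs.argmax(axis=1)
    return [
        {"text": text, "prelabel": int(label) if conf >= threshold else None}
        for text, label, conf in zip(texts, top_labels, probs.max(axis=1))
    ]
```

In both cases the goal is the same: spend human attention on the items where it adds the most value.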
Working with annotators
Once you’re ready to begin, there are a few best practices to consider during the annotation phase itself.
Annotation stages
Rather than jumping into a large-scale annotation right away, begin with a smaller pilot to catch design flaws or ambiguous guidelines. After incorporating feedback from your pilot annotators, you can move on to large-scale annotation. Throughout, quality control and thoughtful communication with annotators will help you navigate any road bumps.
It’s important to close with a retrospective session, reflecting on what went well and what didn’t so you can optimize future annotation rounds. This is especially valuable if you plan to iterate on your annotation task.
Communication
Great communication with annotators is essential. When you introduce the task, explain the context and highlight the impact on your project—it’s motivating for someone to know that their effort makes a difference. Check in frequently: If your annotation spans several weeks, you might reach out on Mondays and invite feedback and questions. Consider creating an FAQ document as well, to compile all decisions and clarifications.
Annotation is important work that humans make possible. And annotators are partners in the model design, rather than just a resource to make use of. Express your gratitude, and most importantly, treat annotators ethically and provide fair pay.
Overfitting and how to avoid it
In working with annotators, you can mitigate overfitting in a couple of ways. First, assign multiple annotators to the task and collect multiple annotations for each item (one way to combine those annotations is sketched below); this is especially important for text annotation, since language can be highly ambiguous and subjective. Second, avoid priming annotators by making sure you don’t focus your communication on one aspect of the task over another.
Finally, it’s important to take the annotation results with a grain of salt. You might be thrilled if your evaluation dataset indicates that your model has very high performance—but keep in mind that if you change the pool of annotators, your model’s performance could also change. (Luckily, that’s something you can mitigate by following best practices with your new annotators!)
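Once you have multiple annotations per item, you also need a policy for combining them. Here is a minimal majority-vote sketch; the `annotations` dict mapping item IDs to per-annotator labels is a hypothetical format, not a standard one:

```python
from collections import Counter

def aggregate_labels(annotations):
    """Resolve each item by majority vote; flag ties for adjudication instead of guessing."""
    resolved, needs_adjudication = {}, []
    for item_id, labels in annotations.items():  # labels: one label per annotator
        top_label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:
            resolved[item_id] = top_label
        else:
            needs_adjudication.append(item_id)  # route to an expert or a discussion round
    return resolved, needs_adjudication
```

Routing ties to adjudication, rather than picking a winner arbitrarily, keeps genuinely ambiguous items from silently skewing your gold data.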
Assessing annotation quality
Your relationship with your annotators can make all the difference in the quality of the annotations. But how do you analyze annotation quality? There are a few principles to consider before, during, and after your annotation process.
Defining quality
Before you begin collecting annotations, set a quality goal for your gold dataset. Consider what level of quality you need to meet to achieve success, based on a similar previous annotation task or an expert benchmark.
To measure quality, you should use both automatic and manual metrics. For example, a common automatic quality metric for labeled data is interannotator agreement (IAA). High IAA means annotators understand the guidelines and are aligned on how to apply the guidelines to the data, but it doesn’t mean that the data itself is high quality. And a high IAA isn’t always possible, especially for language tasks that can be very subjective; lower IAA can provide an important signal about the task or the underlying data. Automatic metrics are a good start, but you’ll always want to manually examine the data to understand the story behind those metrics.
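For two annotators assigning categorical labels to the same items, Cohen’s kappa is one common IAA statistic, and scikit-learn provides it out of the box. (The labels below are toy data; for more than two annotators, or annotators who didn’t all see the same items, a statistic like Krippendorff’s alpha is a better fit.)

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same six items (toy example).
annotator_a = ["error", "ok", "error", "ok", "ok", "error"]
annotator_b = ["error", "ok", "ok",    "ok", "ok", "error"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 is perfect agreement; 0 is chance-level agreement
```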
When to do quality control
Ideally, you should use the manual and automatic metrics described above to perform two kinds of quality control (QC): concurrent QC and postannotation QC.
Concurrent QC should be your top priority, as it allows you to address issues as annotation is happening or in the next round. Based on the results, you can clean your data, update your guidelines, or add to your FAQ doc for annotators.
Postannotation QC will help you understand your upper bound for model quality since your model will only be as good as your annotations. You can also use these QC results as part of a retrospective for future iterations of the task or for similar tasks in the future.
Summary
We’ve covered the following best practices in this guide for teams that are undertaking dataset annotation:
1. Sampling and preprocessing your data to create a quality starting point and reduce overhead for annotators
2. Designing your annotation task to be simple and efficient
3. Performing annotation in stages with time for reflection and correction
4. Creating clear channels of communication with annotators and emphasizing the importance of their work
5. Ensuring that the data isn’t overfitted to one perspective
6. Performing concurrent and postannotation quality control
In conclusion, we hope these best practices help you strengthen your relationship with your annotators and build high-quality datasets. Do you like working on these kinds of machine learning and natural language processing problems? Grammarly is hiring! Check out our open roles here.