Poor Inter-Annotator Agreement: what to do?

🍃tagtog
May 17, 2019
Photo by Franki Chamaki

By Jorge Campos

Some weeks ago we rolled out a feature to track the quality of your datasets using the Inter-Annotator Agreement (IAA).

If you have labeled data and different people (or ML systems) have collaborated to label the same subsets of data (e.g. 4 subject-matter experts separately annotate the same subset of legal contracts), you can compare these annotations to get an idea of their quality. If all your annotators make the same annotations independently (high IAA), it means your guidelines are clear and your annotations are most likely correct.

Note that a high IAA doesn’t strictly mean the annotations are correct. It only indicates that the annotators are following the guidelines with a similar understanding.

Inter-Annotator Agreement matrix for a single task. It contains the scores between pairs of users. For example, Vega and Joao agree on 87% of the cases; Vega and Gerard on 47%. This visualization provides an overview of the agreement among annotators and also helps find weak spots. In this example we can see that Gerard is not aligned with the rest of the annotators (25%, 47%, 35%, 18%). Training might be required to align him with the guidelines and the rest of the team. On the top left we find the annotation task name and the average agreement (59.30%). All these metrics measure the IAA in F1.
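
tagtog computes and visualizes these scores for you, but if you want a feel for what is behind such a matrix, here is a minimal Python sketch of pairwise F1 agreement over exact-match entity annotations. The annotator names and span tuples are made up, and exact matching is an assumption; tagtog's internal metric may be computed differently.

```python
from itertools import combinations

def pairwise_f1(ann_a, ann_b):
    """F1 agreement between two annotators' annotation sets.

    Each set holds hashable annotations, e.g. (doc_id, start, end, label)
    tuples for entity spans. Treating one annotator as "gold" and the other
    as "prediction", F1 reduces to 2*|A ∩ B| / (|A| + |B|), which is symmetric.
    """
    if not ann_a and not ann_b:
        return 1.0
    overlap = len(ann_a & ann_b)
    return 2 * overlap / (len(ann_a) + len(ann_b))

def iaa_matrix(annotations):
    """Pairwise F1 scores and their average for a dict {annotator: set of annotations}."""
    scores = {}
    for a, b in combinations(sorted(annotations), 2):
        scores[(a, b)] = pairwise_f1(annotations[a], annotations[b])
    average = sum(scores.values()) / len(scores) if scores else 1.0
    return scores, average

# Toy example with exact-match entity spans: (doc_id, start, end, label)
annotations = {
    "Vega":   {("doc1", 0, 10, "PARTY"), ("doc1", 42, 55, "DATE")},
    "Joao":   {("doc1", 0, 10, "PARTY"), ("doc1", 42, 55, "DATE")},
    "Gerard": {("doc1", 0, 10, "PARTY"), ("doc1", 60, 70, "AMOUNT")},
}
scores, average = iaa_matrix(annotations)
for pair, f1 in scores.items():
    print(pair, f"{f1:.0%}")
print("average IAA:", f"{average:.2%}")
```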

How to prevent a poor IAA?

There may be several reasons why your annotators do not agree on the annotation tasks. It is important to identify the causes and mitigate these risks as soon as possible. If you find yourself in such a scenario, we recommend reviewing the following:

  • Guidelines are key. If a large group of annotators does not agree on a specific annotation task, your guidelines for that task are probably not clear enough. Provide representative examples for different scenarios, discuss boundary cases, and remove ambiguity. If it makes sense (e.g. for parts of a system or schemes), attach pictures to the guidelines.
  • Try to be specific. If annotation tasks are too broadly defined or ambiguous, there is room for different interpretations and, eventually, disagreement. On the other hand, very rich and granular tasks can be difficult to annotate accurately. Depending on the scope of your project, find the best trade-off between highly specific annotations and an affordable annotation effort.
  • Test reliability. Before annotating large amounts of data, it is good to make several assessments with a sample of the data. Once the team members have annotated this sample, check the IAA and improve your guidelines or train your team accordingly (see the sketch after this list).
  • Train. Make sure you appropriately train the members joining the annotation project. If you find annotators who do not agree with most of the team, investigate the reasons, evolve your guidelines, and train them further.
  • Check how heterogeneous your data is. If your data/documents differ considerably from each other, either in complexity or structure, more effort will be required to stabilize the agreement. In that case, it is recommended to split the data into homogeneous groups.
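
To make the "test reliability" and "train" points concrete: once you have pairwise scores like the ones above, a small helper (hypothetical, not part of tagtog) can average each annotator's agreement with the rest of the team and flag anyone below a threshold as a candidate for a guideline review or extra training. The threshold and the scores below are illustrative, and "Ana" is a made-up fourth annotator; they only loosely mirror the matrix in the screenshot.

```python
def flag_outliers(pair_scores, threshold=0.5):
    """Average each annotator's pairwise F1 scores and flag anyone below a threshold.

    `pair_scores` maps (annotator_a, annotator_b) pairs to their F1 agreement,
    e.g. the dictionary produced by the iaa_matrix() sketch above.
    """
    per_annotator = {}
    for (a, b), f1 in pair_scores.items():
        per_annotator.setdefault(a, []).append(f1)
        per_annotator.setdefault(b, []).append(f1)
    return {
        name: sum(vals) / len(vals)
        for name, vals in per_annotator.items()
        if sum(vals) / len(vals) < threshold
    }

# Illustrative scores only (not the real matrix values)
pair_scores = {
    ("Vega", "Joao"): 0.87,
    ("Vega", "Gerard"): 0.47,
    ("Joao", "Gerard"): 0.25,
    ("Vega", "Ana"): 0.80,
    ("Joao", "Ana"): 0.75,
    ("Gerard", "Ana"): 0.35,
}
print(flag_outliers(pair_scores, threshold=0.5))  # only Gerard is flagged (~0.36)
```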

Bear in mind: labeling data is an iterative process. As in Agile: inspect and adapt.

👏👏👏 if you like the post and want to share it with others! 💚🧡
