LLMs as Data Annotators: How Close Are We to Human Performance?

May 8, 2025
Clarity Bot

A comprehensive study examines how large language models (LLMs) perform as data annotators in natural language processing, particularly for Named Entity Recognition, and finds that their performance is increasingly comparable to that of human annotators.

Abstract

Purpose

This research investigates how Large Language Models (LLMs)—including GPT-4, Qwen, and LLaMA—can be leveraged to automate data annotation for Named Entity Recognition (NER) tasks through an advanced retrieval-augmented generation (RAG) method.

Demonstrates the capability of LLMs to perform high-quality automated NER annotation
Introduces a novel RAG approach that selects relevant contextual examples using embeddings
Aims to reduce dependency on costly, manual annotation processes

Scope

The study addresses key challenges in scalable NLP model development by evaluating the annotation accuracy and efficiency of LLMs on more demanding and underexplored datasets.

Efficiency Gains: Provides a less resource-intensive method for producing annotated datasets
Annotation Quality: Enhances labeling precision through context-aware generation
Scalability: Offers a viable solution for large-scale NLP tasks in specialized domains
Dataset Diversity: Expands LLM evaluation to include complex, domain-specific data
Research Impact: Contributes to broader adoption of automated annotation in industrial and academic NLP workflows

This work is relevant to researchers, engineers, and practitioners seeking to streamline NLP model development while maintaining accuracy and domain relevance.

Summary

Human-Level Annotation with LLMs is Becoming a Reality

The study investigates whether LLMs can match human annotators in terms of data labeling quality across four datasets of varying complexity. The evaluation focuses on Named Entity Recognition (NER), one of the most foundational tasks in NLP. The models used include proprietary and open-source LLMs ranging in size from 7 billion to 70 billion parameters. Human annotations serve as the benchmark, and results from LLMs are compared using multiple strategies, such as zero-shot prompting, in-context learning (ICL), and retrieval-augmented generation (RAG).

Limitations of In-Context Learning

In traditional in-context learning, relevant examples are manually included within a prompt to help the model understand the task. However, the manual selection process introduces inefficiencies and inconsistencies, especially when the selected examples are not semantically aligned with the input text. This leads to a decline in model performance. Additionally, inconsistency in structured outputs from decoder-based LLMs often breaks the token-label alignment essential for NER tasks.
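The alignment problem can be made concrete with a small check. The sketch below assumes, purely for illustration, that the model is asked to reply with a JSON list of [token, label] pairs. The output format, the label set, and the function name parse_aligned_labels are assumptions and are not taken from the study.

```python
import json

# Hypothetical label set for illustration; real datasets define their own.
ALLOWED_LABELS = {"O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"}


def parse_aligned_labels(tokens, raw_output):
    """Return one label per input token, or None if the output is unusable.

    Assumes the model was asked to reply with a JSON list of
    [token, label] pairs (an assumed format, not the study's).
    """
    try:
        pairs = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # the reply is not valid JSON
    if not isinstance(pairs, list) or len(pairs) != len(tokens):
        return None  # tokens were dropped, merged, or invented
    labels = []
    for token, pair in zip(tokens, pairs):
        if not (isinstance(pair, list) and len(pair) == 2):
            return None  # entry is not a [token, label] pair
        out_token, label = pair
        if out_token != token:
            return None  # the model altered the token text
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            return None  # unknown or malformed label
        labels.append(label)
    return labels


tokens = ["Alice", "visited", "Paris"]
good = '[["Alice", "B-PER"], ["visited", "O"], ["Paris", "B-LOC"]]'
bad = '[["Alice", "B-PER"], ["visited Paris", "O"]]'  # two tokens merged

print(parse_aligned_labels(tokens, good))
print(parse_aligned_labels(tokens, bad))
```

A rejected reply could be retried or discarded. Which policy fits best depends on the pipeline and is not specified here.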

Retrieval-Augmented Data Annotation for Higher Accuracy

To overcome these shortcomings, the study proposes a RAG-based method. This technique automatically retrieves examples similar to the input text using an embedding model: either OpenAI’s high-capacity embeddings or the more lightweight SentenceTransformer. The retrieved examples are then placed in the prompt as in-context demonstrations, helping the model generate more accurate annotations. The process improves annotation quality, particularly for datasets that vary in structure and complexity.
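The retrieval step can be sketched in a few lines. This is a minimal illustration, not the study's implementation: toy_embed is a bag-of-words stand-in for a real embedding model, and the example pool, the prompt wording, and all function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector.

    A real pipeline would call an embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pool, k=2):
    """Return the k annotated examples most similar to the query sentence."""
    q = toy_embed(query)
    ranked = sorted(
        pool, key=lambda ex: cosine(q, toy_embed(ex["text"])), reverse=True
    )
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt from the retrieved examples."""
    parts = ["Label the named entities in the sentence."]
    for ex in examples:
        parts.append(f"Sentence: {ex['text']}\nEntities: {ex['entities']}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)


# Hypothetical annotated pool, invented for illustration.
pool = [
    {"text": "Alice flew to Paris last week", "entities": "Alice=PER; Paris=LOC"},
    {"text": "The meeting starts at noon", "entities": "(none)"},
    {"text": "Bob flew to Berlin on Monday", "entities": "Bob=PER; Berlin=LOC"},
]

query = "Carol flew to Madrid yesterday"
examples = retrieve_examples(query, pool, k=2)
print(build_prompt(query, examples))
```

Here the two travel sentences are retrieved and the unrelated one is left out, which is the effect the method relies on. Swapping toy_embed for a stronger embedding model changes only the similarity scores, not the overall flow.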

Comprehensive Testing with Complex Datasets

While most existing research focuses on popular datasets like CoNLL-2003 and WNUT-17, this study also includes underrepresented but more complex datasets such as GUM and SKILLSPAN. These datasets reflect real-world linguistic challenges, including ambiguous soft-skill entities and diverse domain-specific named entities. The study shows that expanding evaluation to these datasets is essential for a more holistic understanding of LLM capabilities.

Performance Comparison Across LLMs and Strategies

Extensive empirical results highlight the performance trade-offs between various LLMs and strategies. Notably, a well-optimized 7B model such as Qwen2.5, paired with strong embeddings, can perform nearly as well as much larger models, such as 70B-parameter models and GPT-4. RAG-based annotation with OpenAI embeddings provides the best performance, coming within 1–3% of human annotation levels on benchmark datasets like CoNLL-2003 and WNUT-17.

Embeddings and Context Size Matter

Embedding quality plays a pivotal role in performance. High-quality embeddings notably benefit LLMs in handling datasets with complex annotations. Moreover, larger context sizes generally increase accuracy, although each model has a saturation point beyond which more examples do not yield proportional benefits.

Trade-off Between LLM Size and Efficiency

Surprisingly, the study finds that smaller models equipped with the right context and embeddings can be nearly as effective as much larger models, offering a cost-effective solution, especially for institutions with computational constraints. These findings challenge the notion that only larger models guarantee better performance.

Evaluation Framework Using Fine-Tuned RoBERTa

To assess the quality of LLM-generated annotations, the study fine-tunes a RoBERTa model on the LLM-annotated dataset and evaluates it against a human-annotated test set. The F1 score is used as the primary evaluation metric. Because the score measures how well a model trained on LLM labels reproduces human labels, it indicates how reliable the annotations are in real-world training pipelines.
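The scoring step can be illustrated as follows. The sketch assumes an entity-level, micro-averaged F1 over BIO-tagged spans, which is a common convention for NER. The source does not say which F1 variant the study uses, and the fine-tuning step itself is omitted. The function names and the example tags are hypothetical.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) spans.

    Stray I- tags that do not continue an open span of the same type
    are ignored in this sketch.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.add((start, i, etype))  # close the open entity
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]  # open a new entity
    return spans


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(round(entity_f1(gold, pred), 3))  # 0.667
```

In this example precision is 1.0 and recall is 0.5, so the F1 is 2/3.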

Statistical Validation Confirms Robust Results

The study applies rigorous statistical tests, including Friedman and Conover post hoc tests, to confirm the significance of differences between models. Results reveal that GPT-4o with OpenAI embeddings ranks highest, while some 7B models statistically match the performance of 70B models, supporting the claim that size is not the sole determinant of accuracy.
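For readers unfamiliar with the Friedman test, the sketch below computes its statistic from a table of per-dataset scores. The scores are made-up placeholder numbers, not results from the study. The version shown omits the tie correction, and the Conover post hoc step is not covered.

```python
def average_ranks(values):
    """Rank values from best (1) to worst, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1  # extend over a run of tied values
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square statistic, without tie correction.

    scores[i][j] is the score of method j on dataset i.
    Returns the statistic and the mean rank of each method.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return stat, [r / n for r in rank_sums]


# Placeholder F1 scores (4 datasets x 3 methods), invented for illustration.
scores = [
    [0.90, 0.85, 0.80],
    [0.70, 0.72, 0.60],
    [0.55, 0.50, 0.45],
    [0.40, 0.38, 0.30],
]

stat, mean_ranks = friedman_statistic(scores)
print(stat)        # 6.5
print(mean_ranks)  # [1.25, 1.75, 3.0]
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of methods. In practice a library routine such as scipy.stats.friedmanchisquare would normally be used instead of a hand-written version.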

Key Use Cases: HR and Medical Domains

The researchers also note the strong applicability of LLM annotation in domains like Human Resources and medicine, where fine-grained, high-quality annotations are particularly valuable. Automating annotations in these fields can significantly reduce manual effort, cost, and privacy risks.

Future Directions

Although the RAG-based approach shows excellent promise, further work is needed to enhance the quality of embeddings for abstract and ambiguous entity types. Extending this methodology to other NLP tasks—in addition to NER—and exploring more advanced retrieval techniques could further improve annotation accuracy and efficiency.

This research strengthens the case for using LLMs in automated annotation pipelines and offers a roadmap for large institutions aiming to optimize their data-processing workflows with minimal human intervention while maintaining high standards of annotation quality.

Resource

Read more in "LLMs as Data Annotators: How Close Are We to Human Performance" by Muhammad Uzair Ul Haq, Davide Rigoni, and Alessandro Sperduti.

