Abstract
Purpose
This research investigates how Large Language Models (LLMs)—including GPT-4, Qwen, and LLaMA—can be leveraged to automate data annotation for Named Entity Recognition (NER) tasks through an advanced retrieval-augmented generation (RAG) method.
- Demonstrates the capability of LLMs to perform high-quality automated NER annotation
- Introduces a novel RAG approach that selects relevant contextual examples using embeddings
- Aims to reduce dependency on costly, manual annotation processes
Scope
The study addresses key challenges in scalable NLP model development by evaluating the annotation accuracy and efficiency of LLMs on more demanding and underexplored datasets.
- Efficiency Gains: Provides a less resource-intensive method for producing annotated datasets
- Annotation Quality: Enhances labeling precision through context-aware generation
- Scalability: Offers a viable solution for large-scale NLP tasks in specialized domains
- Dataset Diversity: Expands LLM evaluation to include complex, domain-specific data
- Research Impact: Contributes to broader adoption of automated annotation in industrial and academic NLP workflows
This work is relevant to researchers, engineers, and practitioners seeking to streamline NLP model development while maintaining accuracy and domain relevance.
The study investigates whether LLMs can match human annotators in terms of data labeling quality across four datasets of varying complexity. The evaluation focuses on Named Entity Recognition (NER), one of the most foundational tasks in NLP. The models used include proprietary and open-source LLMs ranging in size from 7 billion to 70 billion parameters. Human annotations serve as the benchmark, and results from LLMs are compared using multiple strategies, such as zero-shot prompting, in-context learning (ICL), and retrieval-augmented generation (RAG).
In traditional in-context learning, relevant examples are manually included within a prompt to help the model understand the task. However, the manual selection process introduces inefficiencies and inconsistencies, especially when the selected examples are not semantically aligned with the input text. This leads to a decline in model performance. Additionally, inconsistency in structured outputs from decoder-based LLMs often breaks the token-label alignment essential for NER tasks.
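One way to guard against the alignment problem is to validate every model response before accepting it. The sketch below is illustrative only: the output format (a JSON list of [token, label] pairs), the label set, and the function name `parse_aligned_labels` are assumptions made for this example, not details taken from the study.

```python
import json

# Hypothetical label set for illustration (CoNLL-style BIO tags).
ALLOWED_LABELS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"}


def parse_aligned_labels(tokens, llm_output):
    """Return one label per input token, or None if the output is unusable.

    Assumes the model was asked to reply with a JSON list of
    [token, label] pairs, which is an illustrative format only.
    """
    try:
        pairs = json.loads(llm_output)
    except json.JSONDecodeError:
        return None  # reply is not valid JSON
    if not isinstance(pairs, list) or len(pairs) != len(tokens):
        return None  # tokens were dropped, merged, or invented
    labels = []
    for token, pair in zip(tokens, pairs):
        if not (isinstance(pair, list) and len(pair) == 2):
            return None
        out_token, label = pair
        if out_token != token or not isinstance(label, str) or label not in ALLOWED_LABELS:
            return None  # token text changed or label outside the schema
        labels.append(label)
    return labels


tokens = ["Alice", "joined", "Acme", "Corp"]
good = '[["Alice", "B-PER"], ["joined", "O"], ["Acme", "B-ORG"], ["Corp", "I-ORG"]]'
bad = '[["Alice", "B-PER"], ["joined", "O"], ["Acme Corp", "B-ORG"]]'

print(parse_aligned_labels(tokens, good))  # ['B-PER', 'O', 'B-ORG', 'I-ORG']
print(parse_aligned_labels(tokens, bad))   # None
```

In the second reply the model merged two tokens into one, so the sequence lengths no longer match and the response is rejected rather than silently misaligned.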
To overcome these shortcomings, the study proposes a RAG-based method. This technique automatically retrieves examples similar to the input text using an embedding model, either OpenAI’s high-capacity embeddings or the more lightweight SentenceTransformer. The retrieved examples are then placed in the prompt as in-context demonstrations, helping the model generate more accurate annotations. The approach improves annotation quality, particularly for datasets that vary in structure and complexity.
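The retrieval step can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the example pool, prompt wording, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector.

    A real pipeline would call an embedding model here instead.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pool, k=2, embed=toy_embed):
    """Return the k annotated examples most similar to the query sentence."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Place the retrieved examples before the sentence to annotate."""
    lines = ["Label the named entities in the sentence.", ""]
    for ex in examples:
        lines += [f"Sentence: {ex['text']}", f"Entities: {ex['entities']}", ""]
    lines += [f"Sentence: {query}", "Entities:"]
    return "\n".join(lines)


# Hypothetical pool of already-annotated sentences.
pool = [
    {"text": "Paris is the capital of France.", "entities": "Paris=LOC; France=LOC"},
    {"text": "Apple hired Tim Cook as chief executive.", "entities": "Apple=ORG; Tim Cook=PER"},
    {"text": "The river floods every spring.", "entities": "(none)"},
]

query = "Google hired Sundar Pichai as chief executive."
examples = retrieve_examples(query, pool, k=1)
prompt = build_prompt(query, examples)
print(prompt)
```

Here the sentence about Apple is retrieved because it shares the most vocabulary with the query. Swapping `toy_embed` for a dense embedding model would let the same ranking logic capture semantic rather than purely lexical similarity.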
While most existing research focuses on popular datasets like CoNLL-2003 and WNUT-17, this study also includes underrepresented but more complex datasets such as GUM and SKILLSPAN. These datasets reflect real-world linguistic challenges, including ambiguous soft-skill entities and diverse domain-specific named entities. The study shows that expanding evaluation to these datasets is essential for a more holistic understanding of LLM capabilities.
Extensive empirical results highlight the performance trade-offs between various LLMs and strategies. Notably, a well-optimized 7B model, such as Qwen2.5, paired with strong embeddings, can perform nearly as well as larger 70B models. RAG-based annotation with OpenAI embeddings provides the best performance, even reaching within 1–3% of human annotation levels on benchmark datasets like CoNLL-2003 and WNUT-17.
Embedding quality plays a pivotal role in performance. High-quality embeddings notably benefit LLMs in handling datasets with complex annotations. Moreover, larger context sizes generally increase accuracy, although each model has a saturation point beyond which more examples do not yield proportional benefits.
Surprisingly, the study finds that smaller models equipped with the right context and embeddings can be nearly as effective as much larger models, offering a cost-effective solution, especially for institutions with computational constraints. These findings challenge the notion that only larger models guarantee better performance.
To assess the quality of LLM-generated annotations, the study fine-tunes a RoBERTa model on the LLM-annotated dataset and evaluates it against a human-annotated test set. The F1 Score is used as the primary evaluation metric. This approach reinforces the reliability of LLM annotations in real-world training pipelines.
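For readers unfamiliar with how F1 is typically computed for NER, here is a minimal sketch of an entity-level, micro-averaged F1 over BIO-tagged sequences. The source does not specify which F1 variant was used, so exact span matching is an assumption here, and the tag sequences are invented for illustration.

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 where a prediction counts only on an exact span match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: human (gold) tags versus model predictions.
gold = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"], ["B-LOC", "B-LOC"]]
print(round(entity_f1(gold, pred), 3))  # 0.667
```

In this toy case two of three predicted spans are correct and two of three gold spans are found, so precision and recall are both 2/3.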
The study applies rigorous statistical tests, including Friedman and Conover post hoc tests, to confirm the significance of differences between models. Results reveal that gpt-4o with OpenAI embeddings ranks highest, while some 7B models statistically match the performance of 70B models, supporting the claim that size is not the sole determinant of accuracy.
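To make the Friedman test concrete, the sketch below computes its statistic from per-dataset rankings. The scores are invented purely for illustration and are not results from the study; the function names are hypothetical, and the formula omits the tie correction that statistical libraries usually apply.

```python
def average_ranks(values):
    """Rank values from 1 (highest score), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[i][j] is the score of configuration j on dataset i.

    Returns (chi-square statistic without tie correction,
    mean rank per configuration).
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, [r / n for r in rank_sums]


# Made-up F1 scores: 4 datasets (rows) x 3 model configurations (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.70, 0.72, 0.60],
    [0.55, 0.50, 0.45],
    [0.65, 0.60, 0.58],
]
stat, mean_ranks = friedman_statistic(scores)
print(stat, mean_ranks)  # 6.5 [1.25, 1.75, 3.0]
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used, followed by a post hoc test to identify which pairs of configurations differ.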
The researchers also note the strong applicability of LLM annotation in domains like Human Resources and medicine, where fine-grained, high-quality annotations are particularly valuable. Automating annotations in these fields can significantly reduce manual effort, cost, and privacy risks.
Although the RAG-based approach shows excellent promise, further work is needed to enhance the quality of embeddings for abstract and ambiguous entity types. Extending this methodology to other NLP tasks—in addition to NER—and exploring more advanced retrieval techniques could further improve annotation accuracy and efficiency.
This research strengthens the case for using LLMs in automated annotation pipelines and offers a roadmap for large institutions aiming to optimize their data-processing workflows with minimal human intervention while maintaining high standards of annotation quality.
Read more in LLMs as Data Annotators: How Close Are We to Human Performance by Muhammad Uzair Ul Haq, Davide Rigoni, Alessandro Sperduti