LLM-powered active learning for cost-effective text classification
Abstract
This thesis presents an active learning framework powered by large language models (LLMs) for cost-effective text classification, addressing the risk of LLM annotation errors while balancing annotation quality against model accuracy. Our methodology combines human and LLM annotations using uncertainty sampling and confidence scoring. Starting from a small labeled seed set, the model iteratively selects the most informative data points for annotation, reducing labeling costs while maximizing performance. To simulate real-world scenarios, a dynamically updated proxy validation set mirrors the distribution of the unlabeled pool, enabling reliable performance estimation throughout training. The Performance Improvement Cost Ratio (PICR) is introduced as an objective stopping criterion that weighs accuracy gains against annotation costs. Additionally, role-based prompting improves annotation quality, yielding a scalable framework adaptable to diverse text classification tasks. Experimental results demonstrate that the proposed approach achieves human-comparable performance at reduced cost, underscoring its potential for practical applications.
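The core selection-and-stopping loop described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: least-confidence scoring is assumed as the uncertainty measure, and the `picr` function below uses a hypothetical definition of PICR (accuracy gain per unit of annotation cost), since the abstract does not give the exact formula.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Select the k most uncertain unlabeled points for annotation.

    probs: (n_samples, n_classes) predicted class probabilities.
    Uses least-confidence scoring: 1 - max class probability (an
    assumed uncertainty measure, not necessarily the thesis's).
    """
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(uncertainty)[-k:]

def picr(perf_gain, round_cost):
    """Hypothetical PICR: accuracy gained per unit of annotation
    cost in the current round. Stop when it falls below a threshold."""
    return perf_gain / round_cost

# Toy round: pick the 2 most uncertain of 4 unlabeled points.
probs = np.array([
    [0.95, 0.05],   # confident
    [0.55, 0.45],   # uncertain
    [0.60, 0.40],   # uncertain
    [0.90, 0.10],   # confident
])
selected = uncertainty_sample(probs, k=2)

# Stopping check: a 0.2% accuracy gain at a cost of 10 annotation
# units falls below an assumed PICR threshold of 0.001, so stop.
stop = picr(perf_gain=0.002, round_cost=10.0) < 0.001
```

In a full loop, the selected points would be routed to an LLM or human annotator depending on confidence, the classifier retrained, and PICR recomputed on the proxy validation set each round.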