Contextual topics: advancing text segmentation through pre-trained models and contextual keywords

Date

2024-09-01

Abstract

Text Segmentation (TS) is a Natural Language Processing (NLP) task that aims to divide bodies of text into topical, semantically coherent blocks. It can play an important role in creating structured, searchable text-based representations after digitizing paper-based documents. Traditionally, TS has been approached with sub-optimal feature engineering and heuristic modelling. In this work, we explore novel supervised training procedures that pair a labeled text corpus with a deep neural network for improved predictions. Results are evaluated with the Pk and WindowDiff metrics and show performance improvements beyond any previous unsupervised TS system evaluated on similar datasets. The proposed system uses Bidirectional Encoder Representations from Transformers (BERT) as an encoding mechanism that feeds several downstream layers ending in a classification output layer, and it shows promise for further gains with future iterations of BERT.

It is also found that infusing sentence embeddings with unsupervised features, such as those gathered from Latent Dirichlet Allocation (LDA), yields results comparable to current state-of-the-art (SOTA) TS systems. In addition, these LDA-derived features allow the proposed system to generalize better than previous supervised systems in the space. Furthermore, it is shown that text augmentation with novel language models such as Generative Pre-trained Transformers (GPT) can multiply the training data while continuing to improve performance. Although the proposed systems are supervised in nature, they expose a tunable threshold variable that lets the system predict segments more frequently or more sparingly, further bolstering their practical usability.

Because competition in the supervised TS space is increasing, competitive systems often come from larger research companies with more available resources (e.g., Google and Meta). Unsupervised TS, by contrast, has been relatively unexplored, since building a generalizable TS system without supervision is much more challenging. To this end, strong word and sentence embeddings are used to create an unsupervised TS system called “Coherence” that blends pre-trained models with unsupervised features, generalizes across various datasets, and achieves competitive results in the space. Since Coherence is unsupervised, inference is quick and requires no upfront investment; the technique can be applied to a new domain without fine-tuning.
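To make the evaluation protocol concrete, here is a minimal sketch of computing the Pk and WindowDiff metrics with NLTK's reference implementations. The boundary strings below are hypothetical examples, not data from the thesis.

```python
# Segmentations are encoded as boundary strings, where '1' marks a
# sentence that ends a segment. Lower scores are better for both metrics.
from nltk.metrics.segmentation import pk, windowdiff

reference  = "0001000100"   # gold boundaries (hypothetical example)
hypothesis = "0010000100"   # predicted boundaries

# Pk: probability that two positions roughly k apart are misclassified
# as belonging to the same (or different) segments.
print("Pk:", pk(reference, hypothesis))

# WindowDiff: penalizes boundary-count mismatches within a sliding
# window of size k; stricter than Pk about near-miss boundaries.
k = 3
print("WindowDiff:", windowdiff(reference, hypothesis, k))
```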
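The supervised architecture described above (BERT encoder, downstream layers, classification output, tunable threshold) could look roughly like the following PyTorch sketch. The layer sizes, checkpoint name, and threshold value are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch: BERT encodes each sentence; a small feed-forward head
# produces a boundary probability; a threshold turns probabilities into
# segment-boundary decisions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BoundaryClassifier(nn.Module):
    def __init__(self, extra_dim=0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(768 + extra_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, input_ids, attention_mask, extra_feats=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]    # [CLS] sentence representation
        if extra_feats is not None:          # e.g., an LDA topic mixture
            cls = torch.cat([cls, extra_feats], dim=-1)
        return torch.sigmoid(self.head(cls)).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BoundaryClassifier()
enc = tokenizer(["The match ended in a draw.", "Inflation rose again."],
                padding=True, return_tensors="pt")
probs = model(enc["input_ids"], enc["attention_mask"])

# The tunable threshold from the abstract: raise it to predict segments
# more sparingly, lower it to predict them more frequently.
threshold = 0.5
boundaries = probs > threshold
```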
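One plausible reading of the LDA infusion step is sketched below: a per-sentence topic mixture from gensim's LdaModel is concatenated onto a dense sentence embedding before classification. The toy corpus, topic count, and stand-in embedding are assumptions for illustration.

```python
# Infusing a sentence embedding with unsupervised LDA topic features.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np

sentences = [["stocks", "fell", "sharply"],
             ["the", "team", "won", "the", "final"],
             ["markets", "rallied", "on", "earnings"]]
dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

def topic_vector(sent_bow, num_topics=2):
    # Dense topic-mixture vector for one sentence.
    vec = np.zeros(num_topics)
    for topic_id, weight in lda.get_document_topics(
            sent_bow, minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

sentence_embedding = np.random.rand(768)   # stand-in for a BERT vector
infused = np.concatenate([sentence_embedding, topic_vector(bow[0])])
```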
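The GPT-based augmentation could be sketched as sampling additional in-domain sentences from a generative model to multiply the training data. GPT-2 via the transformers pipeline stands in here for whichever GPT variant the thesis actually used; the prompt and sampling settings are assumptions.

```python
# Generating extra in-domain training sentences from a seed sentence.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
seed = "The central bank announced a new interest rate policy."
augmented = generator(seed, max_new_tokens=30, num_return_sequences=3,
                      do_sample=True, top_p=0.95)
for out in augmented:
    print(out["generated_text"])
```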
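Finally, an unsupervised, Coherence-style segmenter can be approximated by embedding sentences with a pre-trained model and placing a boundary wherever adjacent-sentence cosine similarity dips below a threshold, which illustrates why no fine-tuning is needed at inference time. The model name, example sentences, and threshold are illustrative; the thesis's exact scoring may differ.

```python
# Unsupervised boundary detection from adjacent-sentence similarity dips.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The home team scored twice in the first half.",
    "Their striker has now scored ten goals this season.",
    "Meanwhile, inflation figures were released this morning.",
    "Prices rose faster than analysts expected.",
]
emb = model.encode(sentences, convert_to_tensor=True)

threshold = 0.3   # the only knob to adjust per domain at inference time
for i in range(len(sentences) - 1):
    if cos_sim(emb[i], emb[i + 1]).item() < threshold:
        print(f"Segment boundary after sentence {i + 1}")
```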
