Contextual topics: advancing text segmentation through pre-trained models and contextual keywords

dc.contributor.advisor: Vargas Martin, Miguel
dc.contributor.advisor: Makrehchi, Masoud
dc.contributor.author: Maraj, Amit
dc.date.accessioned: 2024-10-21T20:44:26Z
dc.date.available: 2024-10-21T20:44:26Z
dc.date.issued: 2024-09-01
dc.description.abstract: Text Segmentation (TS) is a Natural Language Processing (NLP) task that aims to divide paragraphs and bodies of text into topically coherent, semantically aligned blocks. It can play an important role in creating structured, searchable text-based representations of digitized paper documents. Traditionally, TS has been approached with suboptimal feature engineering and heuristic modelling. In this work, we explore novel supervised training procedures that pair a labeled text corpus with a deep neural model for improved predictions. Results are evaluated with the Pk and WindowDiff metrics and show performance improvements beyond any previous unsupervised TS system evaluated on similar datasets. The proposed system uses Bidirectional Encoder Representations from Transformers (BERT) as an encoding mechanism feeding several downstream layers and a final classification output layer, and shows promise for further gains with future iterations of BERT. It is also found that infusing sentence embeddings with unsupervised features, such as those gathered from Latent Dirichlet Allocation (LDA), yields results comparable to current state-of-the-art (SOTA) TS systems. In addition, the unsupervised features derived from LDA allow the proposed system to generalize better than previous supervised systems in the space. Furthermore, it is shown that using novel language models such as Generative Pre-trained Transformers (GPT) for text augmentation can multiply the training data while continuing to improve performance. Although the proposed systems are supervised in nature, they expose a tunable threshold variable that lets the system predict segments more frequently or more sparingly, further bolstering their practical usability. Because competition in the supervised TS space is increasing, competitive systems often come from larger research companies with more available resources (e.g., Google, Meta). Unsupervised TS, by contrast, has been relatively unexplored, since building a generalizable TS system without labels is much more challenging. To this end, strong word and sentence embeddings are used to create an unsupervised TS system called “Coherence”, which blends the best of pre-trained models and unsupervised features into a system that generalizes across various datasets while achieving competitive results. Since Coherence is unsupervised, inference is fast and requires no upfront investment (i.e., the technique can be applied to a new domain without fine-tuning). Illustrative sketches of the techniques described here appear after the record metadata below.
dc.identifier.uri: https://ontariotechu.scholaris.ca/handle/10155/1860
dc.language.iso: en
dc.subject.other: Text Segmentation
dc.subject.other: Natural Language Processing
dc.subject.other: Topic segmentation
dc.subject.other: Word embeddings
dc.subject.other: Sentence embeddings
dc.title: Contextual topics: advancing text segmentation through pre-trained models and contextual keywords
dc.type: Dissertation
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Ontario Institute of Technology
thesis.degree.name: Doctor of Philosophy (PhD)
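
The abstract sketches a supervised pipeline: BERT encodes each sentence, several downstream layers end in a classification output, and a tunable threshold controls how often segment boundaries are predicted. The dissertation's exact architecture is not reproduced in this record, so the following is only a minimal sketch of that kind of pipeline, assuming PyTorch and the Hugging Face transformers library; the model name bert-base-uncased, the head sizes, and the classify_boundaries helper are illustrative assumptions, not the author's implementation.

```python
# Minimal sketch: BERT sentence encodings -> small classifier head ->
# per-sentence boundary probability, thresholded into segment decisions.
# Model choice, head sizes, and threshold value are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BoundaryClassifier(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size  # 768 for bert-base
        # "Several downstream layers with a final classification output layer."
        self.head = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the sentence vector
        return torch.sigmoid(self.head(cls)).squeeze(-1)  # boundary probability

def classify_boundaries(sentences, model, tokenizer, threshold=0.5):
    """Return True for sentences predicted to end a segment.

    Raising `threshold` predicts boundaries more sparingly; lowering it
    predicts them more frequently (the tunable knob noted in the abstract)."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        probs = model(batch["input_ids"], batch["attention_mask"])
    return (probs >= threshold).tolist()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BoundaryClassifier().eval()  # untrained here, so outputs are arbitrary
sentences = ["The reactor uses a molten salt loop.",
             "Cooling is handled by a secondary circuit.",
             "In other news, the league finals start Friday."]
print(classify_boundaries(sentences, model, tokenizer, threshold=0.5))
```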
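
The abstract also describes infusing sentence embeddings with unsupervised LDA features. One straightforward reading is concatenating a dense sentence vector with the sentence's LDA topic distribution; the sketch below illustrates that idea, with gensim for LDA and sentence-transformers standing in for the BERT encoder. The toy corpus, topic count, and model names are assumptions, not the dissertation's setup.

```python
# Sketch: fuse a dense sentence embedding with the LDA topic mixture
# by concatenation, producing one feature vector per sentence.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sentence_transformers import SentenceTransformer

sentences = [
    "The reactor uses a molten salt loop.",
    "Cooling is handled by a secondary circuit.",
    "The league finals start on Friday.",
    "The home team has won five straight games.",
]
tokens = [s.lower().strip(".").split() for s in sentences]

# Fit a small LDA model over the (toy) corpus.
dictionary = Dictionary(tokens)
bows = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

def fused_vector(sentence: str, bow) -> np.ndarray:
    """Concatenate the dense sentence embedding with the LDA topic mixture."""
    emb = encoder.encode(sentence)                    # shape (384,)
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_vec = np.array([p for _, p in topics])      # shape (num_topics,)
    return np.concatenate([emb, topic_vec])           # shape (386,)

print(fused_vector(sentences[0], bows[0]).shape)
```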
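
For the GPT-based text augmentation the abstract mentions, the record gives no procedure; one plausible, heavily hedged reading is paraphrasing labeled sentences so each training example is multiplied while its segment label is reused. The sketch assumes the openai Python package (>= 1.0), an OPENAI_API_KEY in the environment, and an illustrative prompt; none of this is the author's method.

```python
# Sketch: multiply training data by paraphrasing sentences with a GPT
# model, copying each original sentence's segment label onto its copies.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment(sentence: str, n: int = 2) -> list[str]:
    """Return n paraphrases of `sentence`, keeping its topic unchanged
    so that the original segmentation label still applies."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        n=n,
        messages=[
            {"role": "system",
             "content": "Paraphrase the user's sentence. Keep the topic and meaning unchanged."},
            {"role": "user", "content": sentence},
        ],
    )
    return [choice.message.content for choice in resp.choices]

print(augment("The reactor uses a molten salt loop."))
```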
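
Finally, results are reported under Pk and WindowDiff. Both slide a window of size k over the reference and hypothesis segmentations and count disagreements; below is a minimal sketch from their standard definitions (Beeferman et al., 1999 for Pk; Pevzner and Hearst, 2002 for WindowDiff), using the common boundary-string convention where '1' marks a sentence that ends a segment. NLTK ships equivalent functions in nltk.metrics.segmentation (pk, windowdiff).

```python
# Pk and WindowDiff over boundary strings such as "0010010".
# Lower is better for both metrics.

def pk(ref: str, hyp: str, k: int | None = None) -> float:
    """Pk: probability that a window's two ends are wrongly judged to be
    in the same (or different) segments."""
    if k is None:
        # Convention: half the mean reference segment length.
        k = max(1, round(len(ref) / (2 * max(1, ref.count("1")))))
    windows = len(ref) - k + 1
    errors = sum(
        ("1" in ref[i:i + k]) != ("1" in hyp[i:i + k])
        for i in range(windows)
    )
    return errors / windows

def windowdiff(ref: str, hyp: str, k: int) -> float:
    """WindowDiff: fraction of windows whose boundary *counts* differ,
    penalizing near-misses that Pk can overlook."""
    windows = len(ref) - k + 1
    return sum(
        ref[i:i + k].count("1") != hyp[i:i + k].count("1")
        for i in range(windows)
    ) / windows

reference  = "00100010000"
hypothesis = "00010010000"   # one boundary shifted by one sentence
print(f"Pk         = {pk(reference, hypothesis, k=3):.3f}")
print(f"WindowDiff = {windowdiff(reference, hypothesis, k=3):.3f}")
```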

Files

Original bundle

Name: Maraj_Amit.pdf
Size: 1.68 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.89 KB
Description: Item-specific license agreed upon to submission