Evaluating Text Pre-Processing Strategies for Clinical Document Classification with BERT
Authors:
Sarah Miller
Serge Sharoff
Geoffrey Hall
Prabhu Arumugam
Keywords: BERT; Clinical Text; Natural Language Processing; Text classification
Abstract:
In many Natural Language Processing (NLP) tasks, Bidirectional Encoder Representations from Transformers (BERT) and BERT-based techniques have produced state-of-the-art results. However, this increase in performance comes with a caveat: the model can only process a limited amount of input text. Few studies discuss the constraints of BERT's input length in the context of clinical documents, and as a result, little is known about how effective BERT is in this setting. To overcome these constraints, we investigate techniques for modifying the input text size of pathology report documents. Using various BERT variants, we evaluate these approaches and examine the relative importance of domain-specific versus generic vocabulary training. We demonstrate that BERT models trained on domain-specific text outperform models with standard, generic vocabularies. When classifying a set of variable-length pathology report texts, BERT's standard truncation approach, which discards text beyond the maximum input length, performs as well as more sophisticated text pre-processing techniques.
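For illustration, a minimal sketch of the standard truncation approach described in the abstract is shown below, using the Hugging Face transformers library. The checkpoint name, label count, and 512-token limit are assumptions made for the example, not details taken from the paper.

```python
# Illustrative sketch only: checkpoint, label count, and max_length are assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # a domain-specific BERT variant could be substituted here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

report_text = "..."  # a variable-length pathology report document

# Standard truncation: tokens beyond the model's maximum input length
# (512 for BERT-base) are simply discarded before classification.
inputs = tokenizer(
    report_text,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```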
Pages: 41 to 46
Copyright: Copyright (c) IARIA, 2024
Publication date: March 10, 2024
Published in: AIHealth 2024, The First International Conference on AI-Health
ISBN: 978-1-68558-136-7
Location: Athens, Greece
Dates: from March 10, 2024 to March 14, 2024