Evaluating Text Pre-Processing Strategies for Clinical Document Classification with BERT
Authors:
Sarah Miller
Serge Sharoff
Geoffrey Hall
Prabhu Arumugam
Keywords: BERT; Clinical Text; Natural Language Processing; Text classification
Abstract:
In many Natural Language Processing (NLP) tasks, Bidirectional Encoder Representations from Transformers (BERT) and BERT-based techniques have produced state-of-the-art results. However, this increase in performance comes with a caveat: the model can only process a limited amount of input text. Few studies discuss the constraints of BERT's input length in the context of clinical documents, and as a result, little is known about how effective BERT is in this setting. To overcome these constraints, we investigate techniques for modifying the input text size of pathology report documents. Using various BERT variants, we evaluate these approaches and examine the relative importance of domain-specific versus generic vocabulary training. We demonstrate that BERT models trained on domain-specific text outperform models with standard, generic vocabularies. When classifying a set of variable-length pathology report texts, BERT's standard truncation approach, which discards text beyond the maximum input length, performs as well as more sophisticated text pre-processing techniques.
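For illustration, a minimal sketch of the standard truncation approach described in the abstract is shown below, using the Hugging Face transformers library. The checkpoint name, label count, and 512-token limit are assumptions made for the example, not details taken from the paper.

```python
# Illustrative sketch only: checkpoint, label count, and max_length are assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # a domain-specific BERT variant could be substituted here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

report_text = "..."  # a variable-length pathology report document

# Standard truncation: tokens beyond the model's maximum input length
# (512 for BERT-base) are simply discarded before classification.
inputs = tokenizer(
    report_text,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```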
Pages: 41 to 46
Copyright: Copyright (c) IARIA, 2024
Publication date: March 10, 2024
Published in: AIHealth 2024, The First International Conference on AI-Health
ISBN: 978-1-68558-136-7
Location: Athens, Greece
Dates: from March 10, 2024 to March 14, 2024