COLLA 2022, The Twelfth International Conference on Advanced Collaborative Networks, Systems and Applications


Automated Data Pre-processing for Machine Learning based Analyses

Authors:
Akshay Paranjape
Praneeth Katta
Markus Ohlenforst

Keywords: AutoML; Pre-processing; Feature Engineering; Feature Generation; Feature selection; Sampling

Abstract:
Data pre-processing is crucial for machine learning (ML) analysis, as the quality of the data can strongly influence model performance. In recent years, a large body of literature has addressed performance enhancement, for example through AutoML libraries for tabular datasets; the field of data pre-processing, however, has not seen major advancement. AutoML libraries and baseline models like Random Forest are known for their easy-to-use implementations, with data cleaning and categorical encoding as the only required steps. In this paper, we investigate some advanced pre-processing steps, such as feature engineering, feature selection, target discretization, and sampling, for analyses on tabular datasets. Furthermore, we propose an automated pipeline for these advanced pre-processing steps, which are validated using Random Forest as well as AutoML libraries. The proposed pre-processing pipeline can also be used with any ML-based algorithm and can be bundled into a Python package. The pipeline also includes a novel sampling method, “Bin-Based sampling”, which can be used for general-purpose data sampling. The validity of these pre-processing methods has been assessed on OpenML datasets using appropriate metrics such as KL-divergence, accuracy score, and R2 score. Experimental results show significant performance improvement when modeling with baseline models such as Random Forest and marginal improvements when modeling with AutoML libraries.
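
The abstract names KL-divergence as a metric for judging a sample but does not describe the Bin-Based sampling algorithm itself. The sketch below is therefore not the authors' method. It is a generic illustration, under our own assumptions, of how a sample drawn per equal-width bin could be compared with the full data using KL-divergence. The function names (`bin_index`, `histogram`, `stratified_bin_sample`, `kl_divergence`), the choice of 10 bins, and the 10% sampling fraction are all hypothetical.

```python
import math
import random
from collections import defaultdict


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to an equal-width bin index in [0, n_bins - 1]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value falls into the last bin


def histogram(values, lo, hi, n_bins):
    """Return relative bin frequencies of the values over equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v, lo, hi, n_bins)] += 1
    total = len(values)
    return [c / total for c in counts]


def stratified_bin_sample(values, fraction, n_bins=10, seed=0):
    """Draw roughly `fraction` of the values from each equal-width bin.

    Assumption: a plain per-bin random draw, used here only for illustration.
    """
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    bins = defaultdict(list)
    for v in values:
        bins[bin_index(v, lo, hi, n_bins)].append(v)
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        # Keep at least one value so every non-empty bin stays represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Synthetic data stands in for a numeric column of a tabular dataset.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sample = stratified_bin_sample(data, fraction=0.1)

lo, hi = min(data), max(data)
p = histogram(data, lo, hi, 10)    # distribution of the full data
q = histogram(sample, lo, hi, 10)  # distribution of the sample
divergence = kl_divergence(p, q)
print(f"sample size: {len(sample)}, KL(full || sample): {divergence:.5f}")
```

A divergence close to zero indicates that the sample preserves the binned distribution of the full data; larger values indicate distortion introduced by the sampling step.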

Pages: 1 to 8

Copyright: Copyright (c) IARIA, 2022

Publication date: May 22, 2022

Published in: conference

ISSN: 2308-4227

ISBN: 978-1-61208-976-8

Location: Venice, Italy

Dates: from May 22, 2022 to May 26, 2022