COLLA 2022, The Twelfth International Conference on Advanced Collaborative Networks, Systems and Applications
Automated Data Pre-processing for Machine Learning based Analyses
Authors:
Akshay Paranjape
Praneeth Katta
Markus Ohlenforst
Keywords: AutoML; Pre-processing; Feature Engineering; Feature Generation; Feature selection; Sampling
Abstract:
Data pre-processing is crucial for machine learning (ML) analysis, as data quality strongly influences model performance. In recent years, a large body of literature has addressed performance enhancement, for example through AutoML libraries for tabular datasets; the field of data pre-processing, however, has not seen major advancement. AutoML libraries and baseline models such as Random Forest are known for being easy to use, with data cleaning and categorical encoding as the only required steps. In this paper, we investigate advanced pre-processing steps, namely feature engineering, feature selection, target discretization, and sampling, for analyses on tabular datasets. Furthermore, we propose an automated pipeline for these advanced pre-processing steps, which is validated using Random Forest as well as AutoML libraries. The proposed pre-processing pipeline can also be used with any ML-based algorithm and can be bundled into a Python package. The pipeline also includes a novel sampling method, “Bin-Based sampling”, which can be used for general-purpose data sampling. The validity of these pre-processing methods has been assessed on OpenML datasets using appropriate metrics such as KL-divergence, accuracy score, and R2 score. Experimental results show significant performance improvement when modeling with baseline models such as Random Forest and marginal improvements when modeling with AutoML libraries.
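As an illustration of how KL-divergence can serve as a validity metric for a sampling step, the following minimal sketch compares the histogram of a numeric column with the histogram of a sample drawn from it. It is not the paper's Bin-Based sampling method, which the abstract does not describe; the sample here is a plain uniform random sample, and the synthetic data, the equal-width bins, the add-one smoothing, and the helper names `histogram` and `kl_divergence` are all assumptions made for the example.

```python
import math
import random


def histogram(values, edges):
    """Return smoothed relative frequencies of values over equal-width bins."""
    n_bins = len(edges) - 1
    width = edges[-1] - edges[0]
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin index; the maximum falls into the last bin.
        idx = min(int((v - edges[0]) / width * n_bins), n_bins - 1)
        counts[idx] += 1
    # Add-one smoothing (an assumption) keeps every bin non-zero,
    # so the logarithm in the KL-divergence is always defined.
    total = len(values) + n_bins
    return [(c + 1) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Synthetic stand-in for one numeric column of a tabular dataset.
rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Plain uniform random sample without replacement (not Bin-Based sampling).
sample = rng.sample(population, 1_000)

# Shared bin edges, so both histograms are defined on the same support.
n_bins = 20
lo, hi = min(population), max(population)
edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]

p = histogram(population, edges)
q = histogram(sample, edges)
kl = kl_divergence(p, q)
print(f"KL(population || sample) = {kl:.4f}")
```

A value close to zero suggests that the sample preserves the distribution of the column; larger values indicate that the sample has drifted from the original data.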
Pages: 1 to 8
Copyright: Copyright (c) IARIA, 2022
Publication date: May 22, 2022
Published in: conference
ISSN: 2308-4227
ISBN: 978-1-61208-976-8
Location: Venice, Italy
Dates: from May 22, 2022 to May 26, 2022