Automated Data Cleaning Techniques Using Machine Learning Algorithms in Big Data Pipelines
Abstract
In today's data-driven landscape, the integrity of insights derived from big data is crucial for informed decision-making. This paper explores automated data cleaning techniques that use machine learning algorithms to address common data quality issues such as missing values, duplicate records, and inconsistencies. By analyzing supervised, unsupervised, and semi-supervised learning approaches, we demonstrate the efficacy of these techniques in improving data quality management within big data pipelines. Our findings indicate that machine learning not only automates data cleaning but also improves its precision and efficiency, making it a valuable tool for organizations that want to make full use of their data.
