Abstract:
In the era of data-driven decision-making, data must be fit for its intended purpose. The success of machine learning (ML) models hinges on the quality of the datasets used during training. Real-world datasets, however, are often riddled with imperfections such as label noise, outliers, missing values, and inconsistencies across features, all of which degrade model performance and generalization. Traditional data-cleaning frameworks, while effective in specific scenarios, struggle to adapt to dynamic data patterns, multi-modal formats, and resource-constrained environments because of their domain-specific design. Most were developed for structured numeric or textual data and are therefore inadequate for the distinct challenges of multimedia formats. Even existing multimedia data-cleaning models remain domain-specific, which highlights the need for an adaptive solution that improves data quality in image-centric applications. This study introduces the Intelligent Image Forensic Analyzer Layer (iFAL), integrated into a CNN framework, a novel approach that enables adaptive, efficient, and robust data cleaning through iFAL-learned features. The study systematically evaluates iFAL-CNN’s capacity to address these challenges across diverse datasets by integrating multi-modal features, extracting detailed dataset features missed by most of the prior frameworks outlined here, and generalizing both within and across domains. The approach redefines automated data purification paradigms, offering a scalable algorithm for modern ML pipelines. Conventional rule-based models such as AutoClean (2019), CleanNet (2020), DCN-Clean (2021), and PurifiCNN (2022) are constrained by static heuristics, a single-modality focus, and reliance on noise-distribution assumptions. In contrast, the iFAL-CNN architecture overcomes these limitations by leveraging metadata analysis, error-level analysis, and authentication accuracy metrics.
Designed to generalize across multimedia datasets, iFAL-CNN achieves state-of-the-art performance in accuracy, efficiency, and adaptability. Experiments on a static dataset, comparing several data-cleaning frameworks, yielded the following accuracies (gain over raw data, in percentage points, in parentheses): Raw Data: 85.2%; AutoClean: 86.5% (+1.3); CleanNet: 87.8% (+2.6); Noise2Self: 88.1% (+2.9); DCN-Clean: 88.4% (+3.2); PurifiCNN: 89.3% (+4.1); and iFAL-CNN (proposed): 90.7% (+5.5), the best performer. These results demonstrate that incorporating hybrid feature extraction, metadata integrity verification, and adaptive error detection mechanisms enables superior cleansing, validation, and preparation of noisy large-scale image datasets for downstream machine learning tasks. The proposed iFAL-CNN offers a highly scalable, forensic-aware, and data-efficient purification framework that significantly enhances both dataset integrity and model learning stability. Future research should prioritize Advanced Deep Learning Architectures (ADLA), such as Transformer-CNN hybrids with self-attention mechanisms, to extend these advancements into dynamic and virtual reality contexts.
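
The gains quoted in parentheses are differences, in percentage points, between each framework's accuracy and the raw-data baseline. As a minimal sketch, the snippet below recomputes them from the accuracies stated in this abstract; the variable and function names are illustrative assumptions, not part of the iFAL-CNN implementation.

```python
# Accuracies (%) as reported in the abstract; the names below are illustrative.
REPORTED_ACCURACY = {
    "Raw Data": 85.2,
    "AutoClean": 86.5,
    "CleanNet": 87.8,
    "Noise2Self": 88.1,
    "DCN-Clean": 88.4,
    "PurifiCNN": 89.3,
    "iFAL-CNN": 90.7,
}


def gains_over_baseline(accuracy, baseline="Raw Data"):
    """Return each framework's gain over the baseline, in percentage points."""
    base = accuracy[baseline]
    return {
        name: round(value - base, 1)
        for name, value in accuracy.items()
        if name != baseline
    }


gains = gains_over_baseline(REPORTED_ACCURACY)
best = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain} points")
    print(f"Largest gain: {best}")
```

Rounding to one decimal place matches the precision of the reported figures and avoids floating-point residue in the differences.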