SSN filtering method with pre-trained models for entity matching in data washing machine
Abstract
Entity Resolution (ER) is a vital process in data integration and quality improvement that identifies and links records referring to the same real-world entity. As data volumes and diversity grow, traditional ER methods struggle with limited scalability, poor data quality, and sparse or inconsistent records. To address these limitations, this research introduces the Proof-of-Concept Data Washing Machine (DWM), developed under the Data Life Cycle and Curation research theme of the National Science Foundation's Data Analytics that are Robust and Trusted (NSF DART) program. The DWM automates the detection and correction of data quality errors through unsupervised entity resolution. The study advances ER by replacing traditional rule-based approaches with machine learning (ML) and deep learning techniques, particularly in the linking process. Deep learning models such as Bidirectional Encoder Representations from Transformers (BERT) and its variants are employed to enhance similarity scoring within Cluster ER methods. Integrated into the DWM framework, these models use attention mechanisms to generate reference embeddings and compute similarity score vectors. The research also optimizes candidate pair reduction during the ER blocking process to improve efficiency. A novel method for handling sensitive data, such as Social Security Numbers (SSNs), is proposed to streamline pair reduction in the linking stage. A comparative analysis of the Linking_with_ML and SSN_Filtering_with_ML methods across diverse file types shows that SSN_Filtering_with_ML achieves higher precision while maintaining a balanced trade-off between precision and recall. These findings highlight its robustness and accuracy in entity matching: it improves the DWM’s record linkage accuracy while reducing unnecessary comparisons.
This research contributes to advancing data quality practices, enabling better decision-making across organizations by providing scalable and efficient solutions for complex entity resolution challenges.
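The abstract does not spell out the SSN filtering rule, so the following Python is only a minimal sketch of one plausible reading: a candidate pair is dropped before ML similarity scoring when both records carry a well-formed SSN and the two values differ. The rule itself, the `ssn` field name, and the helpers `normalize_ssn` and `ssn_filter` are illustrative assumptions, not the DWM implementation.

```python
import re
from itertools import combinations


def normalize_ssn(raw):
    """Return the 9-digit SSN string, or None if missing or malformed."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits if len(digits) == 9 else None


def ssn_filter(pairs, records):
    """Keep pairs unless both SSNs are well formed and differ (assumed rule)."""
    kept = []
    for a, b in pairs:
        ssn_a = normalize_ssn(records[a].get("ssn"))
        ssn_b = normalize_ssn(records[b].get("ssn"))
        if ssn_a is not None and ssn_b is not None and ssn_a != ssn_b:
            continue  # conflicting SSNs: skip the ML comparison
        kept.append((a, b))
    return kept


# Hypothetical records that a blocking step placed in the same block.
records = {
    "r1": {"name": "Mary Smith", "ssn": "123-45-6789"},
    "r2": {"name": "M. Smith", "ssn": "123456789"},
    "r3": {"name": "Mary Smyth", "ssn": "987-65-4321"},
    "r4": {"name": "Mary Smith", "ssn": ""},
}

block_pairs = list(combinations(sorted(records), 2))
kept_pairs = ssn_filter(block_pairs, records)
print(f"{len(block_pairs)} candidate pairs, {len(kept_pairs)} kept for scoring")
```

In this sketch, records with a missing or malformed SSN are never used to reject a pair, so sparse records still reach the similarity model. This is a conservative choice that favors recall and may differ from the method evaluated in the paper.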
Copyright (c) 2025 Author(s)

This work is licensed under a Creative Commons Attribution 4.0 International License.