SSN filtering method with pre-trained models for entity matching in data washing machine

  • Bushra Sajid, Department of Computer Science, The University of Arkansas at Little Rock, AR 72204, USA
  • Ahmed Abu-Halimeh, Department of Information Science, The University of Arkansas at Little Rock, AR 72204, USA
  • John R. Talburt, Department of Information Science, The University of Arkansas at Little Rock, AR 72204, USA
Article ID: 4609
Keywords: data quality; machine learning; entity resolution; filtering method

Abstract

Entity Resolution (ER) is a vital process in data integration and quality improvement, aimed at identifying and linking records that refer to the same real-world entity. As data volumes and diversity grow, traditional ER methods struggle with scalability, poor data quality, and sparse or inconsistent records. To address these limitations, this research introduces the Proof-of-Concept Data Washing Machine (DWM), developed under the National Science Foundation Data Analytics that are Robust and Trusted (NSF DART) Data Life Cycle and Curation research theme, which automates the detection and correction of data quality errors through unsupervised entity resolution. The study advances ER by replacing traditional rule-based approaches with machine learning (ML) and deep learning techniques, particularly in the linking process. Deep learning models such as Bidirectional Encoder Representations from Transformers (BERT) and its variants are employed to enhance similarity scoring within Cluster ER methods. Integrated into the DWM framework, these models use attention mechanisms to generate reference embeddings and compute similarity score vectors. The research also optimizes candidate pair reduction during the ER blocking process to improve efficiency, and proposes a novel method for handling sensitive data, such as Social Security Numbers (SSNs), to streamline pair reduction in the linking stage. A comparative analysis of the Linking_with_ML and SSN_Filtering_with_ML methods across diverse file types shows that SSN_Filtering_with_ML achieves higher precision while maintaining a balanced trade-off between precision and recall. These findings highlight the method’s robustness and accuracy in entity matching: it improves the DWM’s capacity for accurate record linkage while reducing unnecessary comparisons. This research contributes to advancing data quality practices, enabling better decision-making across organizations by providing scalable and efficient solutions for complex entity resolution challenges.
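
The abstract does not spell out the SSN filtering rule, so the following is only a minimal sketch of how SSN-based pair reduction ahead of similarity scoring might look. It assumes that a candidate pair is dropped when both records carry a valid SSN and the two values disagree, and that pairs with a missing SSN are kept for scoring. The record layout and the names `normalize_ssn`, `ssn_filter`, and `similarity` are hypothetical. The scorer uses the standard library's `difflib.SequenceMatcher` as a stand-in for the BERT-based embedding similarity described above, purely to keep the example self-contained.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize_ssn(raw):
    """Return the 9 digits of an SSN, or None if it is missing or malformed."""
    digits = "".join(ch for ch in (raw or "") if ch.isdigit())
    return digits if len(digits) == 9 else None


def ssn_filter(pairs, records):
    """Assumed rule: drop a pair only when both SSNs are valid and differ."""
    kept = []
    for a, b in pairs:
        ssn_a = normalize_ssn(records[a].get("ssn"))
        ssn_b = normalize_ssn(records[b].get("ssn"))
        if ssn_a is not None and ssn_b is not None and ssn_a != ssn_b:
            continue  # conflicting SSNs: skip the expensive comparison
        kept.append((a, b))
    return kept


def similarity(rec_a, rec_b):
    """Stand-in scorer (character-level ratio), NOT the paper's BERT model."""
    return SequenceMatcher(None, rec_a["text"].lower(), rec_b["text"].lower()).ratio()


# Hypothetical toy records; in the DWM these would come from a blocking step.
records = {
    "r1": {"text": "John Smith 12 Oak St", "ssn": "123-45-6789"},
    "r2": {"text": "Jon Smith 12 Oak Street", "ssn": "123456789"},
    "r3": {"text": "John Smith 98 Elm Ave", "ssn": "987-65-4321"},
    "r4": {"text": "J. Smith 12 Oak St", "ssn": ""},
}

candidates = list(combinations(sorted(records), 2))  # all 6 pairs in one block
filtered = ssn_filter(candidates, records)           # pairs surviving the filter
scores = {pair: similarity(records[pair[0]], records[pair[1]]) for pair in filtered}

for pair, score in scores.items():
    print(pair, round(score, 2))
```

In this toy block, the filter removes the two pairs whose SSNs conflict (`r1`/`r3` and `r2`/`r3`), so only four of the six candidate pairs reach the scorer. Under this assumed rule the precision gain comes from never linking records with conflicting identifiers, while keeping pairs with a missing SSN protects recall.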

Published
2025-03-25
How to Cite
Sajid, B., Abu-Halimeh, A., & Talburt, J. R. (2025). SSN filtering method with pre-trained models for entity matching in data washing machine. AI Insights, 1(1), 4609. https://doi.org/10.18282/aii4609
Section
Article