Deduplication¶

Datamatch can not only match entities from different datasets but also deduplicate records from a single dataset. We can reuse the DBLP-ACM dataset 1 from Tutorial to demonstrate this capability. Again download and unzip if you haven’t:

unzip DBLP-ACM.zip -d DBLP-ACM

Load data¶

In [1]: import pandas as pd
   ...: from datamatch import (
   ...:     ThresholdMatcher, StringSimilarity, NoopIndex, ColumnsIndex
   ...: )
   ...: 

In [2]: dfa = pd.read_csv('DBLP-ACM/ACM.csv')
   ...: # set `id` as the index for this frame, this is important because the library
   ...: # relies on the frame's index to tell which row is which.
   ...: dfa = dfa.set_index('id', drop=True)
   ...: # make sure titles are all in lower case
   ...: dfa.loc[:, 'title'] = dfa.title.str.strip().str.lower()
   ...: # split author names by comma and sort them before joining them back with comma
   ...: dfa.loc[dfa.authors.notna(), 'authors'] = dfa.loc[dfa.authors.notna(), 'authors']\
   ...:     .map(lambda x: ', '.join(sorted(x.split(', '))))
   ...: dfa
   ...: 
Out[2]: 
                                                    title                                            authors                                              venue  year
id                                                                                                                                                                   
304586  the wasa2 object-oriented workflow management ...                    Gottfried Vossen, Mathias Weske     International Conference on Management of Data  1999
304587  a user-centered interface for querying distrib...                  Isabel F. Cruz, Kimberly M. James     International Conference on Management of Data  1999
304589  world wide database-integrating the web, corba...  Athman Bouguettaya, Boualem Benatallah, James ...     International Conference on Management of Data  1999
304590           xml-based information mediation with mix  Amarnath Gupta, Bertram Lud&#228;scher, Chaita...     International Conference on Management of Data  1999
304582  the ccube constraint object-oriented database ...  Alexander Brodsky, Jia Chen, Paval A. Exarkhop...     International Conference on Management of Data  1999
...                                                   ...                                                ...                                                ...   ...
672977          dual-buffering strategies in object bases                     Alfons Kemper, Donald Kossmann                              Very Large Data Bases  1994
950482                                    guest editorial  Philip A. Bernstein, Raghu Ramakrishnan, Yanni...  The VLDB Journal &mdash; The International Jou...  2003
672980  graphdb: modeling and querying graphs in datab...                           Ralf Hartmut G&#252;ting                              Very Large Data Bases  1994
945741  review of the data warehouse toolkit: the comp...                              Alexander A. Anisimov                                 ACM SIGMOD Record   2003
672979     bulk loading into an oodb: a performance study               Janet L. Wiener, Jeffrey F. Naughton                              Very Large Data Bases  1994

[2294 rows x 4 columns]

In [3]: dfb = pd.read_csv('DBLP-ACM/DBLP2.csv', encoding='latin_1')
   ...: # here we do the same cleaning step
   ...: dfb = dfb.set_index('id', drop=True)
   ...: dfb.loc[:, 'title'] = dfb.title.str.strip().str.lower()
   ...: dfb.loc[dfb.authors.notna(), 'authors'] = dfb.loc[dfb.authors.notna(), 'authors']\
   ...:     .map(lambda x: ', '.join(sorted(x.split(', '))))
   ...: dfb
   ...: 
Out[3]: 
                                                                            title                                            authors                      venue  year
id                                                                                                                                                                   
journals/sigmod/Mackay99        semantic integration of environmental models f...                                    D. Scott Mackay              SIGMOD Record  1999
conf/vldb/PoosalaI96            estimation of query-result distribution and it...             Viswanath Poosala, Yannis E. Ioannidis                       VLDB  1996
conf/vldb/PalpanasSCP02         incremental maintenance for non-distributive a...  Hamid Pirahesh, Richard Sidle, Roberta Cochran...                       VLDB  2002
conf/vldb/GardarinGT96          cost-based selection of path expression proces...  Georges Gardarin, Jean-Robert Gruser, Zhao-Hui...                       VLDB  1996
conf/vldb/HoelS95               benchmarking spatial join operations with spat...                          Erik G. Hoel, Hanan Samet                       VLDB  1995
...                                                                           ...                                                ...                        ...   ...
journals/tods/KarpSP03          a simple algorithm for finding frequent elemen...  Christos H. Papadimitriou, Richard M. Karp, Sc...  ACM Trans. Database Syst.  2003
conf/vldb/LimWV03               sash: a self-adaptive histogram set for dynami...        Jeffrey Scott Vitter, Lipyeow Lim, Min Wang                       VLDB  2003
journals/tods/ChakrabartiKMP02  locally adaptive dimensionality reduction for ...  Eamonn J. Keogh, Kaushik Chakrabarti, Michael ...  ACM Trans. Database Syst.  2002
journals/sigmod/Snodgrass01                                       chair's message                               Richard T. Snodgrass              SIGMOD Record  2001
conf/vldb/LiM01                 indexing and querying xml data for regular pat...                          Bongki Moon, Quanzhong Li                       VLDB  2001

[2616 rows x 4 columns]

Next let’s combine two sources into one and pretend that we have a single dataset with records to deduplicate:

In [4]: # The ACM dataset happens to use numbers as id. We need to convert it to
   ...: # str to ensure the final dataset's index has consistent type.
   ...: dfa.index = dfa.index.astype(str)
   ...: df = pd.concat([dfa, dfb])
   ...: df
   ...: 
Out[4]: 
                                                                            title                                            authors                                           venue  year
id                                                                                                                                                                                        
304586                          the wasa2 object-oriented workflow management ...                    Gottfried Vossen, Mathias Weske  International Conference on Management of Data  1999
304587                          a user-centered interface for querying distrib...                  Isabel F. Cruz, Kimberly M. James  International Conference on Management of Data  1999
304589                          world wide database-integrating the web, corba...  Athman Bouguettaya, Boualem Benatallah, James ...  International Conference on Management of Data  1999
304590                                   xml-based information mediation with mix  Amarnath Gupta, Bertram Lud&#228;scher, Chaita...  International Conference on Management of Data  1999
304582                          the ccube constraint object-oriented database ...  Alexander Brodsky, Jia Chen, Paval A. Exarkhop...  International Conference on Management of Data  1999
...                                                                           ...                                                ...                                             ...   ...
journals/tods/KarpSP03          a simple algorithm for finding frequent elemen...  Christos H. Papadimitriou, Richard M. Karp, Sc...                       ACM Trans. Database Syst.  2003
conf/vldb/LimWV03               sash: a self-adaptive histogram set for dynami...        Jeffrey Scott Vitter, Lipyeow Lim, Min Wang                                            VLDB  2003
journals/tods/ChakrabartiKMP02  locally adaptive dimensionality reduction for ...  Eamonn J. Keogh, Kaushik Chakrabarti, Michael ...                       ACM Trans. Database Syst.  2002
journals/sigmod/Snodgrass01                                       chair's message                               Richard T. Snodgrass                                   SIGMOD Record  2001
conf/vldb/LiM01                 indexing and querying xml data for regular pat...                          Bongki Moon, Quanzhong Li                                            VLDB  2001

[4910 rows x 4 columns]

Deduplicate¶

When ThresholdMatcher is given one dataset instead of two, it tries to deduplicate instead of matching:

In [5]: matcher = ThresholdMatcher(ColumnsIndex('year'), {
   ...:     'title': StringSimilarity(),
   ...:     'authors': StringSimilarity(),
   ...: }, df)
   ...: 

All other concepts such as the index and similarity functions apply equally. get_sample_pairs should still be used to review how pairs within certain thresholds look like. But in general, the desired result is clusters instead of pairs because unlike when matching two different datasets, there can be more than two rows that are identified as the same entity. Methods that return clusters are:

get_clusters_within_threshold: returns matching clusters as a multi-index frame. It has the following index levels:
- cluster_idx: cluster number.
- pair_idx: pair number within the cluster. All rows within a cluster are paired up, each pair are then ordered by descending similarity score.
- sim_score: the similarity score.
- row_key: the row index from the input dataset.

In [6]: matcher.get_clusters_within_threshold(0.7).head(30)
Out[6]: 
                                                                                                        title                                            authors                                              venue  year
cluster_idx pair_idx sim_score row_key                                                                                                                                                                                   
0           0        1.0       journals/sigmod/Ozsu02a                                        chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
                               journals/sigmod/Ozsu02b                                        chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
            1        1.0       journals/sigmod/Ozsu02                                         chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
                               journals/sigmod/Ozsu02b                                        chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
            2        1.0       journals/sigmod/Ozsu02                                         chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
                               journals/sigmod/Ozsu02a                                        chair's message                                      M. Tamer Özsu                                      SIGMOD Record  2002
1           0        1.0       journals/sigmod/Liu02                                           editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02c                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
            1        1.0       journals/sigmod/Liu02                                           editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02a                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
            2        1.0       journals/sigmod/Liu02a                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02c                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
            3        1.0       journals/sigmod/Liu02b                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02c                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
            4        1.0       journals/sigmod/Liu02                                           editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02b                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
            5        1.0       journals/sigmod/Liu02a                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
                               journals/sigmod/Liu02b                                          editor's notes                                           Ling Liu                                      SIGMOD Record  2002
2           0        1.0       767131                       a template model for multidimensional inter-tr...   Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/FengYLH02      a template model for multidimensional inter-tr...   Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng                                            VLDB J.  2002
3           0        1.0       767130                       efficient similarity search for market basket ...         Alexandros Nanopoulos, Yannis Manolopoulos  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/NanopoulosM02  efficient similarity search for market basket ...         Alexandros Nanopoulos, Yannis Manolopoulos                                            VLDB J.  2002
4           0        1.0       767129                       speeding up construction of pmr quadtree-based...                    Gisli R. Hjaltason, Hanan Samet  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/HjaltasonS02   speeding up construction of pmr quadtree-based...                    Gísli R. Hjaltason, Hanan Samet                                            VLDB J.  2002
5           0        1.0       767128                       spatial indexing of high-dimensional data base...  Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke...  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/SakuraiYUK02   spatial indexing of high-dimensional data base...  Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke...                                            VLDB J.  2002
6           0        1.0       767098                                  query processing techniques for arrays               Arunprasad P. Marathe, Kenneth Salem  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/MaratheS02                query processing techniques for arrays               Arunprasad P. Marathe, Kenneth Salem                                            VLDB J.  2002
7           0        1.0       767096                       locating and accessing data repositories with ...  Anthony Tomasic, George A. Mihaila, Louiqa Ras...  The VLDB Journal &mdash; The International Jou...  2002
                               journals/vldb/MihailaRT02    locating and accessing data repositories with ...  Anthony Tomasic, George A. Mihaila, Louiqa Ras...                                            VLDB J.  2002

save_clusters_to_excel: saves matching clusters to an Excel file for review. Output is similar to get_clusters_within_threshold.
get_index_clusters_within_thresholds: returns matching clusters as a list. Each cluster is represented by a frozenset of row indices. You’ll want to use this instead of get_index_pairs_within_thresholds.

In [7]: matcher.get_index_clusters_within_thresholds(0.7)[:10]
Out[7]: 
[frozenset({'journals/sigmod/Ozsu02',
            'journals/sigmod/Ozsu02a',
            'journals/sigmod/Ozsu02b'}),
 frozenset({'journals/sigmod/Liu02',
            'journals/sigmod/Liu02a',
            'journals/sigmod/Liu02b',
            'journals/sigmod/Liu02c'}),
 frozenset({'767131', 'journals/vldb/FengYLH02'}),
 frozenset({'767130', 'journals/vldb/NanopoulosM02'}),
 frozenset({'767129', 'journals/vldb/HjaltasonS02'}),
 frozenset({'767128', 'journals/vldb/SakuraiYUK02'}),
 frozenset({'767098', 'journals/vldb/MaratheS02'}),
 frozenset({'767096', 'journals/vldb/MihailaRT02'}),
 frozenset({'767095', 'journals/vldb/Navarro02'}),
 frozenset({'767094', 'journals/vldb/RafieiM02'})]

1: DBLP-ACM dataset by the database group of Prof. Erhard Rahm under the CC BY 4.0