Deduplication¶
Datamatch can not only match entities from different datasets but also
deduplicate records from a single dataset. We can reuse the
DBLP-ACM dataset
1 from Tutorial to
demonstrate this capability. Again download and unzip if you haven’t:
unzip DBLP-ACM.zip -d DBLP-ACM
Load data¶
In [1]: import pandas as pd
...: from datamatch import (
...: ThresholdMatcher, StringSimilarity, NoopIndex, ColumnsIndex
...: )
...:
In [2]: dfa = pd.read_csv('DBLP-ACM/ACM.csv')
...: # set `id` as the index for this frame, this is important because the library
...: # relies on the frame's index to tell which row is which.
...: dfa = dfa.set_index('id', drop=True)
...: # make sure titles are all in lower case
...: dfa.loc[:, 'title'] = dfa.title.str.strip().str.lower()
...: # split author names by comma and sort them before joining them back with comma
...: dfa.loc[dfa.authors.notna(), 'authors'] = dfa.loc[dfa.authors.notna(), 'authors']\
...: .map(lambda x: ', '.join(sorted(x.split(', '))))
...: dfa
...:
Out[2]:
title authors venue year
id
304586 the wasa2 object-oriented workflow management ... Gottfried Vossen, Mathias Weske International Conference on Management of Data 1999
304587 a user-centered interface for querying distrib... Isabel F. Cruz, Kimberly M. James International Conference on Management of Data 1999
304589 world wide database-integrating the web, corba... Athman Bouguettaya, Boualem Benatallah, James ... International Conference on Management of Data 1999
304590 xml-based information mediation with mix Amarnath Gupta, Bertram Ludäscher, Chaita... International Conference on Management of Data 1999
304582 the ccube constraint object-oriented database ... Alexander Brodsky, Jia Chen, Paval A. Exarkhop... International Conference on Management of Data 1999
... ... ... ... ...
672977 dual-buffering strategies in object bases Alfons Kemper, Donald Kossmann Very Large Data Bases 1994
950482 guest editorial Philip A. Bernstein, Raghu Ramakrishnan, Yanni... The VLDB Journal — The International Jou... 2003
672980 graphdb: modeling and querying graphs in datab... Ralf Hartmut Güting Very Large Data Bases 1994
945741 review of the data warehouse toolkit: the comp... Alexander A. Anisimov ACM SIGMOD Record 2003
672979 bulk loading into an oodb: a performance study Janet L. Wiener, Jeffrey F. Naughton Very Large Data Bases 1994
[2294 rows x 4 columns]
In [3]: dfb = pd.read_csv('DBLP-ACM/DBLP2.csv', encoding='latin_1')
...: # here we do the same cleaning step
...: dfb = dfb.set_index('id', drop=True)
...: dfb.loc[:, 'title'] = dfb.title.str.strip().str.lower()
...: dfb.loc[dfb.authors.notna(), 'authors'] = dfb.loc[dfb.authors.notna(), 'authors']\
...: .map(lambda x: ', '.join(sorted(x.split(', '))))
...: dfb
...:
Out[3]:
title authors venue year
id
journals/sigmod/Mackay99 semantic integration of environmental models f... D. Scott Mackay SIGMOD Record 1999
conf/vldb/PoosalaI96 estimation of query-result distribution and it... Viswanath Poosala, Yannis E. Ioannidis VLDB 1996
conf/vldb/PalpanasSCP02 incremental maintenance for non-distributive a... Hamid Pirahesh, Richard Sidle, Roberta Cochran... VLDB 2002
conf/vldb/GardarinGT96 cost-based selection of path expression proces... Georges Gardarin, Jean-Robert Gruser, Zhao-Hui... VLDB 1996
conf/vldb/HoelS95 benchmarking spatial join operations with spat... Erik G. Hoel, Hanan Samet VLDB 1995
... ... ... ... ...
journals/tods/KarpSP03 a simple algorithm for finding frequent elemen... Christos H. Papadimitriou, Richard M. Karp, Sc... ACM Trans. Database Syst. 2003
conf/vldb/LimWV03 sash: a self-adaptive histogram set for dynami... Jeffrey Scott Vitter, Lipyeow Lim, Min Wang VLDB 2003
journals/tods/ChakrabartiKMP02 locally adaptive dimensionality reduction for ... Eamonn J. Keogh, Kaushik Chakrabarti, Michael ... ACM Trans. Database Syst. 2002
journals/sigmod/Snodgrass01 chair's message Richard T. Snodgrass SIGMOD Record 2001
conf/vldb/LiM01 indexing and querying xml data for regular pat... Bongki Moon, Quanzhong Li VLDB 2001
[2616 rows x 4 columns]
Next let’s combine two sources into one and pretend that we have a single dataset with records to deduplicate:
In [4]: # The ACM dataset happens to use numbers as id. We need to convert it to
...: # str to ensure the final dataset's index has consistent type.
...: dfa.index = dfa.index.astype(str)
...: df = pd.concat([dfa, dfb])
...: df
...:
Out[4]:
title authors venue year
id
304586 the wasa2 object-oriented workflow management ... Gottfried Vossen, Mathias Weske International Conference on Management of Data 1999
304587 a user-centered interface for querying distrib... Isabel F. Cruz, Kimberly M. James International Conference on Management of Data 1999
304589 world wide database-integrating the web, corba... Athman Bouguettaya, Boualem Benatallah, James ... International Conference on Management of Data 1999
304590 xml-based information mediation with mix Amarnath Gupta, Bertram Ludäscher, Chaita... International Conference on Management of Data 1999
304582 the ccube constraint object-oriented database ... Alexander Brodsky, Jia Chen, Paval A. Exarkhop... International Conference on Management of Data 1999
... ... ... ... ...
journals/tods/KarpSP03 a simple algorithm for finding frequent elemen... Christos H. Papadimitriou, Richard M. Karp, Sc... ACM Trans. Database Syst. 2003
conf/vldb/LimWV03 sash: a self-adaptive histogram set for dynami... Jeffrey Scott Vitter, Lipyeow Lim, Min Wang VLDB 2003
journals/tods/ChakrabartiKMP02 locally adaptive dimensionality reduction for ... Eamonn J. Keogh, Kaushik Chakrabarti, Michael ... ACM Trans. Database Syst. 2002
journals/sigmod/Snodgrass01 chair's message Richard T. Snodgrass SIGMOD Record 2001
conf/vldb/LiM01 indexing and querying xml data for regular pat... Bongki Moon, Quanzhong Li VLDB 2001
[4910 rows x 4 columns]
Deduplicate¶
When ThresholdMatcher is given one dataset instead of two, it tries to deduplicate instead of matching:
In [5]: matcher = ThresholdMatcher(ColumnsIndex('year'), {
...: 'title': StringSimilarity(),
...: 'authors': StringSimilarity(),
...: }, df)
...:
All other concepts such as the index and similarity functions apply equally.
get_sample_pairs
should still be used to review how pairs within certain thresholds look like.
But in general, the desired result is clusters instead of pairs because unlike
when matching two different datasets, there can be more than two rows that are
identified as the same entity. Methods that return clusters are:
get_clusters_within_threshold
: returns matching clusters as a multi-index frame. It has the following index levels:cluster_idx: cluster number.
pair_idx: pair number within the cluster. All rows within a cluster are paired up, each pair are then ordered by descending similarity score.
sim_score: the similarity score.
row_key: the row index from the input dataset.
In [6]: matcher.get_clusters_within_threshold(0.7).head(30)
Out[6]:
title authors venue year
cluster_idx pair_idx sim_score row_key
0 0 1.0 journals/sigmod/Ozsu02a chair's message M. Tamer Özsu SIGMOD Record 2002
journals/sigmod/Ozsu02b chair's message M. Tamer Özsu SIGMOD Record 2002
1 1.0 journals/sigmod/Ozsu02 chair's message M. Tamer Özsu SIGMOD Record 2002
journals/sigmod/Ozsu02b chair's message M. Tamer Özsu SIGMOD Record 2002
2 1.0 journals/sigmod/Ozsu02 chair's message M. Tamer Özsu SIGMOD Record 2002
journals/sigmod/Ozsu02a chair's message M. Tamer Özsu SIGMOD Record 2002
1 0 1.0 journals/sigmod/Liu02 editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02c editor's notes Ling Liu SIGMOD Record 2002
1 1.0 journals/sigmod/Liu02 editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02a editor's notes Ling Liu SIGMOD Record 2002
2 1.0 journals/sigmod/Liu02a editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02c editor's notes Ling Liu SIGMOD Record 2002
3 1.0 journals/sigmod/Liu02b editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02c editor's notes Ling Liu SIGMOD Record 2002
4 1.0 journals/sigmod/Liu02 editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02b editor's notes Ling Liu SIGMOD Record 2002
5 1.0 journals/sigmod/Liu02a editor's notes Ling Liu SIGMOD Record 2002
journals/sigmod/Liu02b editor's notes Ling Liu SIGMOD Record 2002
2 0 1.0 767131 a template model for multidimensional inter-tr... Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng The VLDB Journal — The International Jou... 2002
journals/vldb/FengYLH02 a template model for multidimensional inter-tr... Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng VLDB J. 2002
3 0 1.0 767130 efficient similarity search for market basket ... Alexandros Nanopoulos, Yannis Manolopoulos The VLDB Journal — The International Jou... 2002
journals/vldb/NanopoulosM02 efficient similarity search for market basket ... Alexandros Nanopoulos, Yannis Manolopoulos VLDB J. 2002
4 0 1.0 767129 speeding up construction of pmr quadtree-based... Gisli R. Hjaltason, Hanan Samet The VLDB Journal — The International Jou... 2002
journals/vldb/HjaltasonS02 speeding up construction of pmr quadtree-based... Gísli R. Hjaltason, Hanan Samet VLDB J. 2002
5 0 1.0 767128 spatial indexing of high-dimensional data base... Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke... The VLDB Journal — The International Jou... 2002
journals/vldb/SakuraiYUK02 spatial indexing of high-dimensional data base... Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke... VLDB J. 2002
6 0 1.0 767098 query processing techniques for arrays Arunprasad P. Marathe, Kenneth Salem The VLDB Journal — The International Jou... 2002
journals/vldb/MaratheS02 query processing techniques for arrays Arunprasad P. Marathe, Kenneth Salem VLDB J. 2002
7 0 1.0 767096 locating and accessing data repositories with ... Anthony Tomasic, George A. Mihaila, Louiqa Ras... The VLDB Journal — The International Jou... 2002
journals/vldb/MihailaRT02 locating and accessing data repositories with ... Anthony Tomasic, George A. Mihaila, Louiqa Ras... VLDB J. 2002
save_clusters_to_excel
: saves matching clusters to an Excel file for review. Output is similar toget_clusters_within_threshold
.get_index_clusters_within_thresholds
: returns matching clusters as a list. Each cluster is represented by a frozenset of row indices. You’ll want to use this instead ofget_index_pairs_within_thresholds
.
In [7]: matcher.get_index_clusters_within_thresholds(0.7)[:10]
Out[7]:
[frozenset({'journals/sigmod/Ozsu02',
'journals/sigmod/Ozsu02a',
'journals/sigmod/Ozsu02b'}),
frozenset({'journals/sigmod/Liu02',
'journals/sigmod/Liu02a',
'journals/sigmod/Liu02b',
'journals/sigmod/Liu02c'}),
frozenset({'767131', 'journals/vldb/FengYLH02'}),
frozenset({'767130', 'journals/vldb/NanopoulosM02'}),
frozenset({'767129', 'journals/vldb/HjaltasonS02'}),
frozenset({'767128', 'journals/vldb/SakuraiYUK02'}),
frozenset({'767098', 'journals/vldb/MaratheS02'}),
frozenset({'767096', 'journals/vldb/MihailaRT02'}),
frozenset({'767095', 'journals/vldb/Navarro02'}),
frozenset({'767094', 'journals/vldb/RafieiM02'})]
- 1
DBLP-ACM dataset by the database group of Prof. Erhard Rahm under the CC BY 4.0