Tutorial¶
First, download the DBLP-ACM dataset
1 to your
working directory and unzip it.
unzip DBLP-ACM.zip -d DBLP-ACM
This dataset contains article names, authors, and years from two different sources, and the perfect matching between them. Our job is to match articles from those two sources such that we can produce a result as close to the provided perfect matching as possible.
Load and clean data¶
Preceding the matching step is the data cleaning and standardization step, which is of great importance. We’re keeping this step as simple as possible:
In [1]: import pandas as pd
...: from datamatch import (
...: ThresholdMatcher, StringSimilarity, NoopIndex, ColumnsIndex
...: )
...:
In [2]: dfa = pd.read_csv('DBLP-ACM/ACM.csv')
...: # set `id` as the index for this frame, this is important because the library
...: # relies on the frame's index to tell which row is which.
...: dfa = dfa.set_index('id', drop=True)
...: # make sure titles are all in lower case
...: dfa.loc[:, 'title'] = dfa.title.str.strip().str.lower()
...: # split author names by comma and sort them before joining them back with comma
...: dfa.loc[dfa.authors.notna(), 'authors'] = dfa.loc[dfa.authors.notna(), 'authors']\
...: .map(lambda x: ', '.join(sorted(x.split(', '))))
...: dfa
...:
Out[2]:
title authors venue year
id
304586 the wasa2 object-oriented workflow management ... Gottfried Vossen, Mathias Weske International Conference on Management of Data 1999
304587 a user-centered interface for querying distrib... Isabel F. Cruz, Kimberly M. James International Conference on Management of Data 1999
304589 world wide database-integrating the web, corba... Athman Bouguettaya, Boualem Benatallah, James ... International Conference on Management of Data 1999
304590 xml-based information mediation with mix Amarnath Gupta, Bertram Ludäscher, Chaita... International Conference on Management of Data 1999
304582 the ccube constraint object-oriented database ... Alexander Brodsky, Jia Chen, Paval A. Exarkhop... International Conference on Management of Data 1999
... ... ... ... ...
672977 dual-buffering strategies in object bases Alfons Kemper, Donald Kossmann Very Large Data Bases 1994
950482 guest editorial Philip A. Bernstein, Raghu Ramakrishnan, Yanni... The VLDB Journal — The International Jou... 2003
672980 graphdb: modeling and querying graphs in datab... Ralf Hartmut Güting Very Large Data Bases 1994
945741 review of the data warehouse toolkit: the comp... Alexander A. Anisimov ACM SIGMOD Record 2003
672979 bulk loading into an oodb: a performance study Janet L. Wiener, Jeffrey F. Naughton Very Large Data Bases 1994
[2294 rows x 4 columns]
In [3]: dfb = pd.read_csv('DBLP-ACM/DBLP2.csv', encoding='latin_1')
...: # here we do the same cleaning step
...: dfb = dfb.set_index('id', drop=True)
...: dfb.loc[:, 'title'] = dfb.title.str.strip().str.lower()
...: dfb.loc[dfb.authors.notna(), 'authors'] = dfb.loc[dfb.authors.notna(), 'authors']\
...: .map(lambda x: ', '.join(sorted(x.split(', '))))
...: dfb
...:
Out[3]:
title authors venue year
id
journals/sigmod/Mackay99 semantic integration of environmental models f... D. Scott Mackay SIGMOD Record 1999
conf/vldb/PoosalaI96 estimation of query-result distribution and it... Viswanath Poosala, Yannis E. Ioannidis VLDB 1996
conf/vldb/PalpanasSCP02 incremental maintenance for non-distributive a... Hamid Pirahesh, Richard Sidle, Roberta Cochran... VLDB 2002
conf/vldb/GardarinGT96 cost-based selection of path expression proces... Georges Gardarin, Jean-Robert Gruser, Zhao-Hui... VLDB 1996
conf/vldb/HoelS95 benchmarking spatial join operations with spat... Erik G. Hoel, Hanan Samet VLDB 1995
... ... ... ... ...
journals/tods/KarpSP03 a simple algorithm for finding frequent elemen... Christos H. Papadimitriou, Richard M. Karp, Sc... ACM Trans. Database Syst. 2003
conf/vldb/LimWV03 sash: a self-adaptive histogram set for dynami... Jeffrey Scott Vitter, Lipyeow Lim, Min Wang VLDB 2003
journals/tods/ChakrabartiKMP02 locally adaptive dimensionality reduction for ... Eamonn J. Keogh, Kaushik Chakrabarti, Michael ... ACM Trans. Database Syst. 2002
journals/sigmod/Snodgrass01 chair's message Richard T. Snodgrass SIGMOD Record 2001
conf/vldb/LiM01 indexing and querying xml data for regular pat... Bongki Moon, Quanzhong Li VLDB 2001
[2616 rows x 4 columns]
Try matching for the first time¶
Here’s a quick primer on how threshold-based classification work. For each pair of records, produce a similarity score (ranges from 0 to 1) between each corresponding field, then combine to produce a final similarity score (also ranges from 0 to 1). You can select different similarity functions for each field depending on their characteristics (see more similarity functions here). Finally which pairs count as matches depends on an arbitrary threshold (for similarity score) that you specify. While this classification method is not the gold standard in any way, it is simple and does not require any training data, which makes it a great fit for many problems. To learn in-depth details, see 2.
You can now start matching data using
ThresholdMatcher
. Notice how
simple it all is, you just need to specify the datasets to match and which
similarity function to use for each field:
In [4]: matcher = ThresholdMatcher(NoopIndex(), {
...: 'title': StringSimilarity(),
...: 'authors': StringSimilarity(),
...: }, dfa, dfb)
...:
And let’s wait… Actually, if you have been waiting for like 5 minutes you can stop it now. We’re comparing 6 million pairs of records so it would help tremendously if only there are some ways to increase performance.
Introducing the index¶
The index (not to be confused with Pandas Index) is a data structure that helps to reduce the number of pairs to be compared. It does this by deriving an indexing key from each record and only attempt to match records that have the same key. Without this technique, matching two datasets with n and m records, respectively, would take n x m detailed comparisons, which is probably infeasible for most non-trivial use cases. To learn more about indexing, see 3. Another technique to reduce the number of pairs but works the opposite way of indexing is filtering.
We have been using NoopIndex
which
is the same as using no index whatsoever. We can do better. Notice how the
year column in both datasets denote the year in which the article was
published. It is very unlikely then that two articles within different years
could be the same. Let’s employ this year column with
ColumnsIndex
:
In [5]: matcher = ThresholdMatcher(ColumnsIndex('year'), {
...: 'title': StringSimilarity(),
...: 'authors': StringSimilarity(),
...: }, dfa, dfb)
...:
Now, this should run for under 1 or 2 minutes. This is not the best performance that we can wring out of this dataset but very good for how little effort it requires.
Select a threshold¶
The ThresholdMatcher
class does
not require a threshold up-front because usually, it is useful to be able to
experiment with different thresholds after the matching is done. Let’s see what
the pairs look like:
In [6]: matcher.get_sample_pairs()
Out[6]:
title authors venue year
score_range pair_idx sim_score row_key
1.00-0.95 0 0.950948 377665 temporal statement modifiers Christian S. Jensen, Michael H. Böhlen, R... ACM Transactions on Database Systems (TODS) 2000
journals/tods/BohlenJS00 temporal statement modifiers Christian S. Jensen, Michael H. Böhlen, Richar... ACM Trans. Database Syst. 2000
1 0.950881 375745 data management: lasting impact on wild, wild,... Reed M. Meseck International Conference on Management of Data 2001
conf/sigmod/Meseck01 data management: lasting impact of the wild wi... Reed M. Meseck SIGMOD Conference 2001
2 0.950630 671031 plan-per-tuple optimization solution - paralle... Felipe Cariño, William O'Connell Very Large Data Bases 1998
conf/vldb/CarinoO98 plan-per-tuple optimization solution - paralle... Felipe Cariño, William O'Connell VLDB 1998
3 0.950350 233341 estimating alphanumeric selectivity in the pre... Bala Iyer, Jeffrey Scott Vitter, P. Krishnan International Conference on Management of Data 1996
conf/sigmod/KrishnanVI96 estimating alphanumeric selectivity in the pre... Balakrishna R. Iyer, Jeffrey Scott Vitter, P. ... SIGMOD Conference 1996
4 0.950198 637418 the ρ operator: discovering and ranking a... Amit Sheth, Kemafor Anyanwu ACM SIGMOD Record 2002
journals/sigmod/AnyanwuS02 the p operator: discovering and ranking associ... Amit P. Sheth, Kemafor Anyanwu SIGMOD Record 2002
0.95-0.90 0 0.901101 191901 from structured documents to novel query facil... M. Scholl, S. Abiteboul, S. Cluet, V. Christop... International Conference on Management of Data 1994
conf/sigmod/ChristophidesACS94 from structured documents to novel query facil... Michel Scholl, Serge Abiteboul, Sophie Cluet, ... SIGMOD Conference 1994
1 0.900908 764214 functional-join processing A. Kemper, D. Kossmann, J. Claussen, R. Braumandl The VLDB Journal — The International Jou... 2000
journals/vldb/BraumandlCKK00 functional-join processing Alfons Kemper, Donald Kossmann, Jens Claußen, ... VLDB J. 2000
2 0.900754 333608 engineering federated information systems: rep... F. Saltor, G. Saake, M. Roantree, R.-D. Kutsch... ACM SIGMOD Record 1999
journals/sigmod/ConradHHKRSS99 engineering federated information systems: rep... Fèlix Saltor, Gunter Saake, Mark Roantree, Ral... SIGMOD Record 1999
3 0.900632 336570 concept based design of data warehouses: the d... C. Quix, D. Calvanese, E. Franconi, M. Jarke, ... International Conference on Management of Data 2000
conf/sigmod/JarkeQCLFLVV00 concept based design of data warehouses: the d... Christoph Quix, Diego Calvanese, Enrico Franco... SIGMOD Conference 2000
4 0.900209 357776 a cost model for query processing in high dime... Christian Böhm ACM Transactions on Database Systems (TODS) 2000
journals/tods/Bohm00 a cost model for query processing in high dime... Christian Böhm ACM Trans. Database Syst. 2000
0.90-0.85 0 0.854710 673136 the clustra telecom database: high availabilit... Øystein Torbjørnsen, Per Holager, Sv... Very Large Data Bases 1995
conf/vldb/HvasshovdTBH95 the clustra telecom database: high availabilit... Per Holager, Svein Erik Bratsberg, Svein-Olaf ... VLDB 1995
1 0.854665 202670 acm multimedia '94 conference workshop on mult... Bhavani Thuraisingham, Bruce Berra, Kingsley N... ACM SIGMOD Record 1995
journals/sigmod/BerraNT95 acm multimedia '94 conference workshop on mult... Bhavani M. Thuraisingham, Kingsley C. Nwosu, P... SIGMOD Record 1995
2 0.854294 765557 synchronization and recovery in a client-serve... A. Biliris, E. Panagos The VLDB Journal — The International Jou... 1997
journals/vldb/PanagosB97 synchronization and recovery in a client-serve... Alexandros Biliris, Euthimios Panagos VLDB J. 1997
3 0.853059 362091 workshop on performance and architecture of we... Krishna Kant, Prasant Mohapatra ACM SIGMOD Record 2000
journals/sigmod/KantM00 workshop on performance and architecture of we... Krishna Kant, Prasant Mohapatra SIGMOD Record 2000
4 0.852737 304574 evolvable view environment (eve): non-equivale... A. J. Lee, A. Koeller, A. Nica, A. Van Wyk, E.... International Conference on Management of Data 1999
conf/sigmod/RundensteinerKZWLLN99 evolvable view environment (eve): non-equivale... Amber van Wyk, Amy J. Lee, Andreas Koeller, An... SIGMOD Conference 1999
0.85-0.80 0 0.805457 604275 transactional information systems: theory, alg... Marc H. Scholl ACM SIGMOD Record 2001
journals/sigmod/Scholl01 transactional information systems - book review Marc H. Scholl SIGMOD Record 2001
1 0.805067 223867 carnot and infosleuth: database technology and... B. Bohrer, C. Tomlinson, C. Unnikrishnan, D. W... International Conference on Management of Data 1995
conf/sigmod/WoelkBJOTU95 carnot and infosleuth: database technology and... C. Unnikrishnan, Christine Tomlinson, Darrell ... SIGMOD Conference 1995
2 0.802181 641001 reminiscences on influential papers Kenneth A. Ross ACM SIGMOD Record 2003
journals/sigmod/RossGR03 reminiscences on influential papers Johannes Gehrke, Jun Rao, Kenneth A. Ross SIGMOD Record 2003
3 0.801827 507340 data management issues in electronic commerce:... Asuman Dogac ACM SIGMOD Record 2002
journals/sigmod/Dogac02 guest editor's introduction Asuman Dogac SIGMOD Record 2002
4 0.801388 309897 semantic integration of semistructured and str... M. Vincini, S. Bergamaschi, S. Castano ACM SIGMOD Record 1999
journals/sigmod/BergamaschiCV99 semantic integration of semistructured and str... Maurizio Vincini, Silvana Castano, Sonia Berga... SIGMOD Record 1999
0.80-0.75 0 0.756679 640999 jim gray speaks out Marianne Winslett ACM SIGMOD Record 2003
journals/sigmod/Winslett03 interview with jim gray Marianne Winslett SIGMOD Record 2003
1 0.756175 304212 mind your vocabulary: query mapping across het... Chen-Chuan K. Chang, Héctor García-M... International Conference on Management of Data 1999
conf/sigmod/ChangG99 mind your vocabulary: query mapping across het... Hector Garcia-Molina, Kevin Chen-Chuan Chang SIGMOD Conference 1999
2 0.754047 565126 distinguished database profiles Marianne Winslett ACM SIGMOD Record 2002
journals/sigmod/Winslett02a david dewitt speaks out Marianne Winslett SIGMOD Record 2002
3 0.753769 253294 infosleuth: agent-based semantic integration o... A. Cichocki, A. Helal, A. Unruh, C. Unnikrishn... International Conference on Management of Data 1997
conf/sigmod/BohrerBBCFHKKMNRRSUUW97 infosleuth: semantic integration of informatio... Abdelsalam Helal, Amy Unruh, Andrzej Cichocki,... SIGMOD Conference 1997
4 0.750850 277629 ensuring consistency in multidatabases by pres... Abraham Silberschatz, Henry F. Korth, Rajeev R... ACM Transactions on Database Systems (TODS) 1998
journals/tods/MehrotraRKS98 ensuring consistency in multidatabases by pres... Sharad Mehrotra ACM Trans. Database Syst. 1998
0.75-0.70 0 0.707107 671838 high-performance and scalability through appli... NaN Very Large Data Bases 2000
conf/vldb/Team00 high-performance and scalability through appli... ? VLDB 2000
1 0.707107 671674 ordering information, conference organizers, p... NaN Very Large Data Bases 2000
conf/vldb/X00 ordering information, conference organizers, p... ? VLDB 2000
2 0.707049 375796 will database researchers have any role in dat... Arnon Rosenthal, Bill Maimone, Jim Donahue, Kl... International Conference on Management of Data 2001
conf/sigmod/Rosenthal01 will database researchers have any role in dat... Arnon Rosenthal SIGMOD Conference 2001
3 0.704427 190649 unix rdbms: the next generation what are the u... Bill Rosenblatt ACM SIGMOD Record 1994
journals/sigmod/Rosneblatt94 unix rdbms: the next generation Bill Rosneblatt SIGMOD Record 1994
4 0.703912 637426 small worlds: the dynamics of networks between... Duncan J. Watts, Jie Wu ACM SIGMOD Record 2002
journals/sigmod/Wu02 small worlds: the dynamics of networks between... Jie Wu SIGMOD Record 2002
This returns a multi-index frame that shows five pairs under each threshold range, ordered by descending similarity score. The purpose is to give you an overview of what the matching records are like under different thresholds. The returned frame has four index levels:
score_range: the range of score. By default, each range has a width of 0.05. You can tweak this value with the
step
argument.pair_idx: the index of each pair within the range. By default, it shows a maximum of 5 pairs within each range. You can tweak this value with the
sample_counts
argument.sim_score: the similarity score of this pair.
row_key: the row index from the input datasets. Usually, the desired output of the matching process is a list of matching pairs, each represented by a tuple of indices from the input datasets. You can get this list with
get_index_pairs_within_thresholds
as will be demonstrated below.
The columns of this frame are the same as the input datasets regardless of whether they were used to compute the similarity score.
To experiment with thresholds, there are more tools at our disposal:
get_all_pairs
: Returns matching pairs as a multi-index frame. It has the following levels:pair_idx: the pair number.
sim_score: the similarity score.
row_key: the row index from the input dataset.
In [7]: matcher.get_all_pairs(0.577).head(30)
Out[7]:
title authors venue year
pair_idx sim_score row_key
0 1.0 767131 a template model for multidimensional inter-tr... Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng The VLDB Journal — The International Jou... 2002
journals/vldb/FengYLH02 a template model for multidimensional inter-tr... Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng VLDB J. 2002
1 1.0 767130 efficient similarity search for market basket ... Alexandros Nanopoulos, Yannis Manolopoulos The VLDB Journal — The International Jou... 2002
journals/vldb/NanopoulosM02 efficient similarity search for market basket ... Alexandros Nanopoulos, Yannis Manolopoulos VLDB J. 2002
2 1.0 767129 speeding up construction of pmr quadtree-based... Gisli R. Hjaltason, Hanan Samet The VLDB Journal — The International Jou... 2002
journals/vldb/HjaltasonS02 speeding up construction of pmr quadtree-based... Gísli R. Hjaltason, Hanan Samet VLDB J. 2002
3 1.0 767128 spatial indexing of high-dimensional data base... Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke... The VLDB Journal — The International Jou... 2002
journals/vldb/SakuraiYUK02 spatial indexing of high-dimensional data base... Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke... VLDB J. 2002
4 1.0 767098 query processing techniques for arrays Arunprasad P. Marathe, Kenneth Salem The VLDB Journal — The International Jou... 2002
journals/vldb/MaratheS02 query processing techniques for arrays Arunprasad P. Marathe, Kenneth Salem VLDB J. 2002
5 1.0 767096 locating and accessing data repositories with ... Anthony Tomasic, George A. Mihaila, Louiqa Ras... The VLDB Journal — The International Jou... 2002
journals/vldb/MihailaRT02 locating and accessing data repositories with ... Anthony Tomasic, George A. Mihaila, Louiqa Ras... VLDB J. 2002
6 1.0 767095 searching in metric spaces by spatial approxim... Gonzalo Navarro The VLDB Journal — The International Jou... 2002
journals/vldb/Navarro02 searching in metric spaces by spatial approxim... Gonzalo Navarro VLDB J. 2002
7 1.0 767094 efficient retrieval of similar shapes Alberto O. Mendelzon, Davood Rafiei The VLDB Journal — The International Jou... 2002
journals/vldb/RafieiM02 efficient retrieval of similar shapes Alberto O. Mendelzon, Davood Rafiei VLDB J. 2002
8 1.0 764200 guest editorial Alon Y. Halevy The VLDB Journal — The International Jou... 2002
journals/vldb/Halevy02 guest editorial Alon Y. Halevy VLDB J. 2002
9 1.0 641274 database indexing for large dna and protein se... Ela Hunt, Malcolm P. Atkinson, Robert W. Irving The VLDB Journal — The International Jou... 2002
journals/vldb/HuntAI02 database indexing for large dna and protein se... Ela Hunt, Malcolm P. Atkinson, Robert W. Irving VLDB J. 2002
10 1.0 641273 views in a large-scale xml repository Dan Vodislav, Pierangelo Veltri, Sophie Cluet,... The VLDB Journal — The International Jou... 2002
journals/vldb/AguileraCMVV02 views in a large-scale xml repository Dan Vodislav, Pierangelo Veltri, Sophie Cluet,... VLDB J. 2002
11 1.0 641272 a formal perspective on the view selection pro... Alon Y. Halevy, Dan Suciu, Rada Chirkova The VLDB Journal — The International Jou... 2002
journals/vldb/ChirkovaHS02 a formal perspective on the view selection pro... Alon Y. Halevy, Dan Suciu, Rada Chirkova VLDB J. 2002
12 1.0 641271 data page layouts for relational databases on ... Anastassia Ailamaki, David J. DeWitt, Mark D. ... The VLDB Journal — The International Jou... 2002
journals/vldb/AilamakiDH02 data page layouts for relational databases on ... Anastassia Ailamaki, David J. DeWitt, Mark D. ... VLDB J. 2002
13 1.0 637435 research in information managment at dublin ci... Alan F. Smeaton, Mark Roantree ACM SIGMOD Record 2002
journals/sigmod/RoantreeS02 research in information managment at dublin ci... Alan F. Smeaton, Mark Roantree SIGMOD Record 2002
14 1.0 637433 an early look at xquery Andrew Eisenberg, Jim Melton ACM SIGMOD Record 2002
journals/sigmod/MeltonE02 an early look at xquery Andrew Eisenberg, Jim Melton SIGMOD Record 2002
save_pairs_to_excel
: Save matching pairs to Excel for reviewing.
After a bit of experimentation, I selected 0.577 as my threshold. Let’s see the result:
In [8]: # this will return each pair as a tuple of index from both datasets
...: pairs = matcher.get_index_pairs_within_thresholds(0.577)
...:
In [9]: # we can construct a dataframe out of it with similar column names
...: # to this dataset's perfect mapping CSV.
...: res = pd.DataFrame(pairs, columns=['idACM', 'idDBLP'])\
...: .set_index(['idACM', 'idDBLP'], drop=False)
...:
In [10]: # load the perfect mapping
....: pm = pd.read_csv('DBLP-ACM/DBLP-ACM_perfectMapping.csv')\
....: .set_index(['idACM', 'idDBLP'], drop=False)
....:
In [11]: total = len(dfa) * len(dfb)
....: total
....:
Out[11]: 6001104
In [12]: sensitivity = len(pm[pm.index.isin(res.index)]) / len(pm)
....: sensitivity
....:
Out[12]: 0.9932553956834532
In [13]: specificity = 1 - len(res[~res.index.isin(pm.index)]) / (total - len(pm))
....: specificity
....:
Out[13]: 0.9999978329288134
The sensitivity and specificity are not perfect, but they’re still great considering how simple this matching script is.
- 1
DBLP-ACM dataset by the database group of Prof. Erhard Rahm under the CC BY 4.0
- 2
Peter Christen. “6.2 Threshold-Based Classification” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 131-133. Springer, 2012.
- 3
Peter Christen. “4.1 Why Indexing?” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 1. Springer, 2012.