Tutorial¶

First, download the DBLP-ACM dataset 1 to your working directory and unzip it.

unzip DBLP-ACM.zip -d DBLP-ACM

This dataset contains article names, authors, and years from two different sources, and the perfect matching between them. Our job is to match articles from those two sources such that we can produce a result as close to the provided perfect matching as possible.

Load and clean data¶

Preceding the matching step is the data cleaning and standardization step, which is of great importance. We’re keeping this step as simple as possible:

In [1]: import pandas as pd
   ...: from datamatch import (
   ...:     ThresholdMatcher, StringSimilarity, NoopIndex, ColumnsIndex
   ...: )
   ...: 

In [2]: dfa = pd.read_csv('DBLP-ACM/ACM.csv')
   ...: # set `id` as the index for this frame, this is important because the library
   ...: # relies on the frame's index to tell which row is which.
   ...: dfa = dfa.set_index('id', drop=True)
   ...: # make sure titles are all in lower case
   ...: dfa.loc[:, 'title'] = dfa.title.str.strip().str.lower()
   ...: # split author names by comma and sort them before joining them back with comma
   ...: dfa.loc[dfa.authors.notna(), 'authors'] = dfa.loc[dfa.authors.notna(), 'authors']\
   ...:     .map(lambda x: ', '.join(sorted(x.split(', '))))
   ...: dfa
   ...: 
Out[2]: 
                                                    title                                            authors                                              venue  year
id                                                                                                                                                                   
304586  the wasa2 object-oriented workflow management ...                    Gottfried Vossen, Mathias Weske     International Conference on Management of Data  1999
304587  a user-centered interface for querying distrib...                  Isabel F. Cruz, Kimberly M. James     International Conference on Management of Data  1999
304589  world wide database-integrating the web, corba...  Athman Bouguettaya, Boualem Benatallah, James ...     International Conference on Management of Data  1999
304590           xml-based information mediation with mix  Amarnath Gupta, Bertram Lud&#228;scher, Chaita...     International Conference on Management of Data  1999
304582  the ccube constraint object-oriented database ...  Alexander Brodsky, Jia Chen, Paval A. Exarkhop...     International Conference on Management of Data  1999
...                                                   ...                                                ...                                                ...   ...
672977          dual-buffering strategies in object bases                     Alfons Kemper, Donald Kossmann                              Very Large Data Bases  1994
950482                                    guest editorial  Philip A. Bernstein, Raghu Ramakrishnan, Yanni...  The VLDB Journal &mdash; The International Jou...  2003
672980  graphdb: modeling and querying graphs in datab...                           Ralf Hartmut G&#252;ting                              Very Large Data Bases  1994
945741  review of the data warehouse toolkit: the comp...                              Alexander A. Anisimov                                 ACM SIGMOD Record   2003
672979     bulk loading into an oodb: a performance study               Janet L. Wiener, Jeffrey F. Naughton                              Very Large Data Bases  1994

[2294 rows x 4 columns]

In [3]: dfb = pd.read_csv('DBLP-ACM/DBLP2.csv', encoding='latin_1')
   ...: # here we do the same cleaning step
   ...: dfb = dfb.set_index('id', drop=True)
   ...: dfb.loc[:, 'title'] = dfb.title.str.strip().str.lower()
   ...: dfb.loc[dfb.authors.notna(), 'authors'] = dfb.loc[dfb.authors.notna(), 'authors']\
   ...:     .map(lambda x: ', '.join(sorted(x.split(', '))))
   ...: dfb
   ...: 
Out[3]: 
                                                                            title                                            authors                      venue  year
id                                                                                                                                                                   
journals/sigmod/Mackay99        semantic integration of environmental models f...                                    D. Scott Mackay              SIGMOD Record  1999
conf/vldb/PoosalaI96            estimation of query-result distribution and it...             Viswanath Poosala, Yannis E. Ioannidis                       VLDB  1996
conf/vldb/PalpanasSCP02         incremental maintenance for non-distributive a...  Hamid Pirahesh, Richard Sidle, Roberta Cochran...                       VLDB  2002
conf/vldb/GardarinGT96          cost-based selection of path expression proces...  Georges Gardarin, Jean-Robert Gruser, Zhao-Hui...                       VLDB  1996
conf/vldb/HoelS95               benchmarking spatial join operations with spat...                          Erik G. Hoel, Hanan Samet                       VLDB  1995
...                                                                           ...                                                ...                        ...   ...
journals/tods/KarpSP03          a simple algorithm for finding frequent elemen...  Christos H. Papadimitriou, Richard M. Karp, Sc...  ACM Trans. Database Syst.  2003
conf/vldb/LimWV03               sash: a self-adaptive histogram set for dynami...        Jeffrey Scott Vitter, Lipyeow Lim, Min Wang                       VLDB  2003
journals/tods/ChakrabartiKMP02  locally adaptive dimensionality reduction for ...  Eamonn J. Keogh, Kaushik Chakrabarti, Michael ...  ACM Trans. Database Syst.  2002
journals/sigmod/Snodgrass01                                       chair's message                               Richard T. Snodgrass              SIGMOD Record  2001
conf/vldb/LiM01                 indexing and querying xml data for regular pat...                          Bongki Moon, Quanzhong Li                       VLDB  2001

[2616 rows x 4 columns]

Try matching for the first time¶

Here’s a quick primer on how threshold-based classification work. For each pair of records, produce a similarity score (ranges from 0 to 1) between each corresponding field, then combine to produce a final similarity score (also ranges from 0 to 1). You can select different similarity functions for each field depending on their characteristics (see more similarity functions here). Finally which pairs count as matches depends on an arbitrary threshold (for similarity score) that you specify. While this classification method is not the gold standard in any way, it is simple and does not require any training data, which makes it a great fit for many problems. To learn in-depth details, see 2.

You can now start matching data using ThresholdMatcher. Notice how simple it all is, you just need to specify the datasets to match and which similarity function to use for each field:

In [4]: matcher = ThresholdMatcher(NoopIndex(), {
   ...:     'title': StringSimilarity(),
   ...:     'authors': StringSimilarity(),
   ...: }, dfa, dfb)
   ...: 

And let’s wait… Actually, if you have been waiting for like 5 minutes you can stop it now. We’re comparing 6 million pairs of records so it would help tremendously if only there are some ways to increase performance.

Introducing the index¶

The index (not to be confused with Pandas Index) is a data structure that helps to reduce the number of pairs to be compared. It does this by deriving an indexing key from each record and only attempt to match records that have the same key. Without this technique, matching two datasets with n and m records, respectively, would take n x m detailed comparisons, which is probably infeasible for most non-trivial use cases. To learn more about indexing, see 3. Another technique to reduce the number of pairs but works the opposite way of indexing is filtering.

We have been using NoopIndex which is the same as using no index whatsoever. We can do better. Notice how the year column in both datasets denote the year in which the article was published. It is very unlikely then that two articles within different years could be the same. Let’s employ this year column with ColumnsIndex:

In [5]: matcher = ThresholdMatcher(ColumnsIndex('year'), {
   ...:     'title': StringSimilarity(),
   ...:     'authors': StringSimilarity(),
   ...: }, dfa, dfb)
   ...: 

Now, this should run for under 1 or 2 minutes. This is not the best performance that we can wring out of this dataset but very good for how little effort it requires.

Select a threshold¶

The ThresholdMatcher class does not require a threshold up-front because usually, it is useful to be able to experiment with different thresholds after the matching is done. Let’s see what the pairs look like:

In [6]: matcher.get_sample_pairs()
Out[6]: 
                                                                                                                title                                            authors                                              venue  year
score_range pair_idx sim_score row_key                                                                                                                                                                                           
1.00-0.95   0        0.950948  377665                                                    temporal statement modifiers  Christian S. Jensen, Michael H. B&#246;hlen, R...       ACM Transactions on Database Systems (TODS)   2000
                               journals/tods/BohlenJS00                                  temporal statement modifiers  Christian S. Jensen, Michael H. Böhlen, Richar...                          ACM Trans. Database Syst.  2000
            1        0.950881  375745                               data management: lasting impact on wild, wild,...                                     Reed M. Meseck     International Conference on Management of Data  2001
                               conf/sigmod/Meseck01                 data management: lasting impact of the wild wi...                                     Reed M. Meseck                                  SIGMOD Conference  2001
            2        0.950630  671031                               plan-per-tuple optimization solution - paralle...              Felipe Cari&#241;o, William O'Connell                              Very Large Data Bases  1998
                               conf/vldb/CarinoO98                  plan-per-tuple optimization solution - paralle...                   Felipe Cariño, William O'Connell                                               VLDB  1998
            3        0.950350  233341                               estimating alphanumeric selectivity in the pre...       Bala Iyer, Jeffrey Scott Vitter, P. Krishnan     International Conference on Management of Data  1996
                               conf/sigmod/KrishnanVI96             estimating alphanumeric selectivity in the pre...  Balakrishna R. Iyer, Jeffrey Scott Vitter, P. ...                                  SIGMOD Conference  1996
            4        0.950198  637418                               the &#961; operator: discovering and ranking a...                        Amit Sheth, Kemafor Anyanwu                                 ACM SIGMOD Record   2002
                               journals/sigmod/AnyanwuS02           the p operator: discovering and ranking associ...                     Amit P. Sheth, Kemafor Anyanwu                                      SIGMOD Record  2002
0.95-0.90   0        0.901101  191901                               from structured documents to novel query facil...  M. Scholl, S. Abiteboul, S. Cluet, V. Christop...     International Conference on Management of Data  1994
                               conf/sigmod/ChristophidesACS94       from structured documents to novel query facil...  Michel Scholl, Serge Abiteboul, Sophie Cluet, ...                                  SIGMOD Conference  1994
            1        0.900908  764214                                                      functional-join processing  A. Kemper, D. Kossmann, J. Claussen, R. Braumandl  The VLDB Journal &mdash; The International Jou...  2000
                               journals/vldb/BraumandlCKK00                                functional-join processing  Alfons Kemper, Donald Kossmann, Jens Claußen, ...                                            VLDB J.  2000
            2        0.900754  333608                               engineering federated information systems: rep...  F. Saltor, G. Saake, M. Roantree, R.-D. Kutsch...                                 ACM SIGMOD Record   1999
                               journals/sigmod/ConradHHKRSS99       engineering federated information systems: rep...  Fèlix Saltor, Gunter Saake, Mark Roantree, Ral...                                      SIGMOD Record  1999
            3        0.900632  336570                               concept based design of data warehouses: the d...  C. Quix, D. Calvanese, E. Franconi, M. Jarke, ...     International Conference on Management of Data  2000
                               conf/sigmod/JarkeQCLFLVV00           concept based design of data warehouses: the d...  Christoph Quix, Diego Calvanese, Enrico Franco...                                  SIGMOD Conference  2000
            4        0.900209  357776                               a cost model for query processing in high dime...                                Christian B&#246;hm       ACM Transactions on Database Systems (TODS)   2000
                               journals/tods/Bohm00                 a cost model for query processing in high dime...                                     Christian Böhm                          ACM Trans. Database Syst.  2000
0.90-0.85   0        0.854710  673136                               the clustra telecom database: high availabilit...  &#216;ystein Torbj&#248;rnsen, Per Holager, Sv...                              Very Large Data Bases  1995
                               conf/vldb/HvasshovdTBH95             the clustra telecom database: high availabilit...  Per Holager, Svein Erik Bratsberg, Svein-Olaf ...                                               VLDB  1995
            1        0.854665  202670                               acm multimedia '94 conference workshop on mult...  Bhavani Thuraisingham, Bruce Berra, Kingsley N...                                 ACM SIGMOD Record   1995
                               journals/sigmod/BerraNT95            acm multimedia '94 conference workshop on mult...  Bhavani M. Thuraisingham, Kingsley C. Nwosu, P...                                      SIGMOD Record  1995
            2        0.854294  765557                               synchronization and recovery in a client-serve...                             A. Biliris, E. Panagos  The VLDB Journal &mdash; The International Jou...  1997
                               journals/vldb/PanagosB97             synchronization and recovery in a client-serve...              Alexandros Biliris, Euthimios Panagos                                            VLDB J.  1997
            3        0.853059  362091                               workshop on performance and architecture of we...                    Krishna Kant, Prasant Mohapatra                                 ACM SIGMOD Record   2000
                               journals/sigmod/KantM00              workshop on performance and architecture of we...                    Krishna Kant, Prasant Mohapatra                                      SIGMOD Record  2000
            4        0.852737  304574                               evolvable view environment (eve): non-equivale...  A. J. Lee, A. Koeller, A. Nica, A. Van Wyk, E....     International Conference on Management of Data  1999
                               conf/sigmod/RundensteinerKZWLLN99    evolvable view environment (eve): non-equivale...  Amber van Wyk, Amy J. Lee, Andreas Koeller, An...                                  SIGMOD Conference  1999
0.85-0.80   0        0.805457  604275                               transactional information systems: theory, alg...                                     Marc H. Scholl                                 ACM SIGMOD Record   2001
                               journals/sigmod/Scholl01               transactional information systems - book review                                     Marc H. Scholl                                      SIGMOD Record  2001
            1        0.805067  223867                               carnot and infosleuth: database technology and...  B. Bohrer, C. Tomlinson, C. Unnikrishnan, D. W...     International Conference on Management of Data  1995
                               conf/sigmod/WoelkBJOTU95             carnot and infosleuth: database technology and...  C. Unnikrishnan, Christine Tomlinson, Darrell ...                                  SIGMOD Conference  1995
            2        0.802181  641001                                             reminiscences on influential papers                                    Kenneth A. Ross                                 ACM SIGMOD Record   2003
                               journals/sigmod/RossGR03                           reminiscences on influential papers          Johannes Gehrke, Jun Rao, Kenneth A. Ross                                      SIGMOD Record  2003
            3        0.801827  507340                               data management issues in electronic commerce:...                                       Asuman Dogac                                 ACM SIGMOD Record   2002
                               journals/sigmod/Dogac02                                    guest editor's introduction                                       Asuman Dogac                                      SIGMOD Record  2002
            4        0.801388  309897                               semantic integration of semistructured and str...             M. Vincini, S. Bergamaschi, S. Castano                                 ACM SIGMOD Record   1999
                               journals/sigmod/BergamaschiCV99      semantic integration of semistructured and str...  Maurizio Vincini, Silvana Castano, Sonia Berga...                                      SIGMOD Record  1999
0.80-0.75   0        0.756679  640999                                                             jim gray speaks out                                  Marianne Winslett                                 ACM SIGMOD Record   2003
                               journals/sigmod/Winslett03                                     interview with jim gray                                  Marianne Winslett                                      SIGMOD Record  2003
            1        0.756175  304212                               mind your vocabulary: query mapping across het...  Chen-Chuan K. Chang, H&#233;ctor Garc&#237;a-M...     International Conference on Management of Data  1999
                               conf/sigmod/ChangG99                 mind your vocabulary: query mapping across het...       Hector Garcia-Molina, Kevin Chen-Chuan Chang                                  SIGMOD Conference  1999
            2        0.754047  565126                                                 distinguished database profiles                                  Marianne Winslett                                 ACM SIGMOD Record   2002
                               journals/sigmod/Winslett02a                                    david dewitt speaks out                                  Marianne Winslett                                      SIGMOD Record  2002
            3        0.753769  253294                               infosleuth: agent-based semantic integration o...  A. Cichocki, A. Helal, A. Unruh, C. Unnikrishn...     International Conference on Management of Data  1997
                               conf/sigmod/BohrerBBCFHKKMNRRSUUW97  infosleuth: semantic integration of informatio...  Abdelsalam Helal, Amy Unruh, Andrzej Cichocki,...                                  SIGMOD Conference  1997
            4        0.750850  277629                               ensuring consistency in multidatabases by pres...  Abraham Silberschatz, Henry F. Korth, Rajeev R...       ACM Transactions on Database Systems (TODS)   1998
                               journals/tods/MehrotraRKS98          ensuring consistency in multidatabases by pres...                                    Sharad Mehrotra                          ACM Trans. Database Syst.  1998
0.75-0.70   0        0.707107  671838                               high-performance and scalability through appli...                                                NaN                              Very Large Data Bases  2000
                               conf/vldb/Team00                     high-performance and scalability through appli...                                                  ?                                               VLDB  2000
            1        0.707107  671674                               ordering information, conference organizers, p...                                                NaN                              Very Large Data Bases  2000
                               conf/vldb/X00                        ordering information, conference organizers, p...                                                  ?                                               VLDB  2000
            2        0.707049  375796                               will database researchers have any role in dat...  Arnon Rosenthal, Bill Maimone, Jim Donahue, Kl...     International Conference on Management of Data  2001
                               conf/sigmod/Rosenthal01              will database researchers have any role in dat...                                    Arnon Rosenthal                                  SIGMOD Conference  2001
            3        0.704427  190649                               unix rdbms: the next generation what are the u...                                    Bill Rosenblatt                                 ACM SIGMOD Record   1994
                               journals/sigmod/Rosneblatt94                           unix rdbms: the next generation                                    Bill Rosneblatt                                      SIGMOD Record  1994
            4        0.703912  637426                               small worlds: the dynamics of networks between...                            Duncan J. Watts, Jie Wu                                 ACM SIGMOD Record   2002
                               journals/sigmod/Wu02                 small worlds: the dynamics of networks between...                                             Jie Wu                                      SIGMOD Record  2002

This returns a multi-index frame that shows five pairs under each threshold range, ordered by descending similarity score. The purpose is to give you an overview of what the matching records are like under different thresholds. The returned frame has four index levels:

score_range: the range of score. By default, each range has a width of 0.05. You can tweak this value with the step argument.
pair_idx: the index of each pair within the range. By default, it shows a maximum of 5 pairs within each range. You can tweak this value with the sample_counts argument.
sim_score: the similarity score of this pair.
row_key: the row index from the input datasets. Usually, the desired output of the matching process is a list of matching pairs, each represented by a tuple of indices from the input datasets. You can get this list with get_index_pairs_within_thresholds as will be demonstrated below.

The columns of this frame are the same as the input datasets regardless of whether they were used to compute the similarity score.

To experiment with thresholds, there are more tools at our disposal:

get_all_pairs: Returns matching pairs as a multi-index frame. It has the following levels:
- pair_idx: the pair number.
- sim_score: the similarity score.
- row_key: the row index from the input dataset.

In [7]: matcher.get_all_pairs(0.577).head(30)
Out[7]: 
                                                                                             title                                            authors                                              venue  year
pair_idx sim_score row_key                                                                                                                                                                                    
0        1.0       767131                        a template model for multidimensional inter-tr...   Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/FengYLH02       a template model for multidimensional inter-tr...   Hongjun Lu, Jeffrey Xu Yu, Jiawei Han, Ling Feng                                            VLDB J.  2002
1        1.0       767130                        efficient similarity search for market basket ...         Alexandros Nanopoulos, Yannis Manolopoulos  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/NanopoulosM02   efficient similarity search for market basket ...         Alexandros Nanopoulos, Yannis Manolopoulos                                            VLDB J.  2002
2        1.0       767129                        speeding up construction of pmr quadtree-based...                    Gisli R. Hjaltason, Hanan Samet  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/HjaltasonS02    speeding up construction of pmr quadtree-based...                    Gísli R. Hjaltason, Hanan Samet                                            VLDB J.  2002
3        1.0       767128                        spatial indexing of high-dimensional data base...  Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke...  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/SakuraiYUK02    spatial indexing of high-dimensional data base...  Haruhiko Kojima, Masatoshi Yoshikawa, Shunsuke...                                            VLDB J.  2002
4        1.0       767098                                   query processing techniques for arrays               Arunprasad P. Marathe, Kenneth Salem  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/MaratheS02                 query processing techniques for arrays               Arunprasad P. Marathe, Kenneth Salem                                            VLDB J.  2002
5        1.0       767096                        locating and accessing data repositories with ...  Anthony Tomasic, George A. Mihaila, Louiqa Ras...  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/MihailaRT02     locating and accessing data repositories with ...  Anthony Tomasic, George A. Mihaila, Louiqa Ras...                                            VLDB J.  2002
6        1.0       767095                        searching in metric spaces by spatial approxim...                                    Gonzalo Navarro  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/Navarro02       searching in metric spaces by spatial approxim...                                    Gonzalo Navarro                                            VLDB J.  2002
7        1.0       767094                                    efficient retrieval of similar shapes                Alberto O. Mendelzon, Davood Rafiei  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/RafieiM02                   efficient retrieval of similar shapes                Alberto O. Mendelzon, Davood Rafiei                                            VLDB J.  2002
8        1.0       764200                                                          guest editorial                                     Alon Y. Halevy  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/Halevy02                                          guest editorial                                     Alon Y. Halevy                                            VLDB J.  2002
9        1.0       641274                        database indexing for large dna and protein se...    Ela Hunt, Malcolm P. Atkinson, Robert W. Irving  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/HuntAI02        database indexing for large dna and protein se...    Ela Hunt, Malcolm P. Atkinson, Robert W. Irving                                            VLDB J.  2002
10       1.0       641273                                    views in a large-scale xml repository  Dan Vodislav, Pierangelo Veltri, Sophie Cluet,...  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/AguileraCMVV02              views in a large-scale xml repository  Dan Vodislav, Pierangelo Veltri, Sophie Cluet,...                                            VLDB J.  2002
11       1.0       641272                        a formal perspective on the view selection pro...           Alon Y. Halevy, Dan Suciu, Rada Chirkova  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/ChirkovaHS02    a formal perspective on the view selection pro...           Alon Y. Halevy, Dan Suciu, Rada Chirkova                                            VLDB J.  2002
12       1.0       641271                        data page layouts for relational databases on ...  Anastassia Ailamaki, David J. DeWitt, Mark D. ...  The VLDB Journal &mdash; The International Jou...  2002
                   journals/vldb/AilamakiDH02    data page layouts for relational databases on ...  Anastassia Ailamaki, David J. DeWitt, Mark D. ...                                            VLDB J.  2002
13       1.0       637435                        research in information managment at dublin ci...                     Alan F. Smeaton, Mark Roantree                                 ACM SIGMOD Record   2002
                   journals/sigmod/RoantreeS02   research in information managment at dublin ci...                     Alan F. Smeaton, Mark Roantree                                      SIGMOD Record  2002
14       1.0       637433                                                  an early look at xquery                       Andrew Eisenberg, Jim Melton                                 ACM SIGMOD Record   2002
                   journals/sigmod/MeltonE02                               an early look at xquery                       Andrew Eisenberg, Jim Melton                                      SIGMOD Record  2002

save_pairs_to_excel: Save matching pairs to Excel for reviewing.

After a bit of experimentation, I selected 0.577 as my threshold. Let’s see the result:

In [8]: # this will return each pair as a tuple of index from both datasets
   ...: pairs = matcher.get_index_pairs_within_thresholds(0.577)
   ...: 

In [9]: # we can construct a dataframe out of it with similar column names
   ...: # to this dataset's perfect mapping CSV.
   ...: res = pd.DataFrame(pairs, columns=['idACM', 'idDBLP'])\
   ...:     .set_index(['idACM', 'idDBLP'], drop=False)
   ...: 

In [10]: # load the perfect mapping
   ....: pm = pd.read_csv('DBLP-ACM/DBLP-ACM_perfectMapping.csv')\
   ....:     .set_index(['idACM', 'idDBLP'], drop=False)
   ....: 

In [11]: total = len(dfa) * len(dfb)
   ....: total
   ....: 
Out[11]: 6001104

In [12]: sensitivity = len(pm[pm.index.isin(res.index)]) / len(pm)
   ....: sensitivity
   ....: 
Out[12]: 0.9932553956834532

In [13]: specificity = 1 - len(res[~res.index.isin(pm.index)]) / (total - len(pm))
   ....: specificity
   ....: 
Out[13]: 0.9999978329288134

The sensitivity and specificity are not perfect, but they’re still great considering how simple this matching script is.

1: DBLP-ACM dataset by the database group of Prof. Erhard Rahm under the CC BY 4.0
2: Peter Christen. “6.2 Threshold-Based Classification” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 131-133. Springer, 2012.
3: Peter Christen. “4.1 Why Indexing?” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 1. Springer, 2012.