Using variations

Sometimes it is useful to derive multiple variations from a single record, try each variation during matching, and keep the highest similarity score. For example, a clerical mistake may swap a person’s first name and last name. To catch such cases, you can produce one additional variation of each record in which the name columns are swapped before matching.

The variator classes do exactly that, sparing you the hassle of adding and rearranging the records yourself. This contrived example demonstrates how to use a variator:

In [1]: import pandas as pd
   ...: from datamatch import ThresholdMatcher, JaroWinklerSimilarity, NoopIndex, Swap
   ...: 

In [2]: df = pd.DataFrame([
   ...:     ['blake', 'lauri'],
   ...:     ['lauri', 'blake'],
   ...:     ['robinson', 'alexis'],
   ...:     ['robertson', 'alexis'],
   ...:     ['haynes', 'terry'],
   ...:     ['terry', 'hayes']
   ...: ], columns=['last', 'first'])
   ...: df
   ...: 
Out[2]: 
        last   first
0      blake   lauri
1      lauri   blake
2   robinson  alexis
3  robertson  alexis
4     haynes   terry
5      terry   hayes

In [3]: # here we use Swap to produce a variation that has first and last swapped
   ...: matcher = ThresholdMatcher(NoopIndex(), {
   ...:     'last': JaroWinklerSimilarity(),
   ...:     'first': JaroWinklerSimilarity()
   ...: }, df, variator=Swap('first', 'last'))
   ...: matcher.get_all_pairs()
   ...: 
Out[3]: 
                                 last   first
pair_idx sim_score row_key                   
0        1.000000  0            blake   lauri
                   1            lauri   blake
1        0.980748  4           haynes   terry
                   5            terry   hayes
2        0.907998  2         robinson  alexis
                   3        robertson  alexis
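
To make the idea of "try each variation and keep the highest score" concrete, here is a minimal pure-Python sketch. It is not how datamatch implements variators: the helper names (`similarity`, `swap_variations`, `best_score`) are hypothetical, `difflib.SequenceMatcher` stands in for Jaro-Winkler, and averaging the per-column scores is an assumption made only for this illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def swap_variations(record: dict, col_a: str, col_b: str) -> list:
    """Return the original record plus a copy with two columns swapped."""
    swapped = dict(record)
    swapped[col_a], swapped[col_b] = record[col_b], record[col_a]
    return [record, swapped]


def best_score(rec_a: dict, rec_b: dict, col_a: str, col_b: str) -> float:
    """Score every pair of variations and keep the highest score."""
    scores = []
    for var_a, var_b in product(
        swap_variations(rec_a, col_a, col_b),
        swap_variations(rec_b, col_a, col_b),
    ):
        # Assumption for this sketch: a pair's score is the plain mean
        # of its per-column similarities.
        per_column = [similarity(var_a[c], var_b[c]) for c in (col_a, col_b)]
        scores.append(sum(per_column) / len(per_column))
    return max(scores)


rec_0 = {'last': 'blake', 'first': 'lauri'}
rec_1 = {'last': 'lauri', 'first': 'blake'}
print(best_score(rec_0, rec_1, 'first', 'last'))  # 1.0
```

Without the swapped variation, the two records would only be compared column by column as written (`blake` against `lauri`), and the pair would score far below 1.0.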