Using variations
Sometimes it is useful to derive multiple variations from a single record and try each variation during matching while retaining the highest similarity score. For example, a person’s first name and last name are sometimes swapped due to clerical mistakes. In that case you might want to produce one additional variation for each record, with the name columns swapped, before matching.
The variator classes do just that while saving you the extra hassle of adding and rearranging the records. This contrived example demonstrates how to use a variator:
In [1]: import pandas as pd
...: from datamatch import ThresholdMatcher, JaroWinklerSimilarity, Swap, NoopIndex
...:
In [2]: df = pd.DataFrame([
...: ['blake', 'lauri'],
...: ['lauri', 'blake'],
...: ['robinson', 'alexis'],
...: ['robertson', 'alexis'],
...: ['haynes', 'terry'],
...: ['terry', 'hayes']
...: ], columns=['last', 'first'])
...: df
...:
Out[2]:
        last   first
0      blake   lauri
1      lauri   blake
2   robinson  alexis
3  robertson  alexis
4     haynes   terry
5      terry   hayes
In [3]: # here we use Swap to produce a variation that has first and last swapped
...: matcher = ThresholdMatcher(NoopIndex(), {
...: 'last': JaroWinklerSimilarity(),
...: 'first': JaroWinklerSimilarity()
...: }, df, variator=Swap('first', 'last'))
...: matcher.get_all_pairs()
...:
Out[3]:
                                 last   first
pair_idx sim_score row_key
0        1.000000  0            blake   lauri
                   1            lauri   blake
1        0.980748  4           haynes   terry
                   5            terry   hayes
2        0.907998  2         robinson  alexis
                   3        robertson  alexis
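To make the idea of "try each variation and keep the highest score" concrete, here is a minimal standalone sketch. It is not datamatch's implementation: `field_sim`, `record_sim`, `swap_variations`, and `best_sim` are hypothetical helpers, `difflib.SequenceMatcher` from the standard library stands in for Jaro-Winkler, and using the lowest field similarity as the record score is an assumption made only for illustration.

```python
from difflib import SequenceMatcher


def field_sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (SequenceMatcher, not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def record_sim(rec_a: dict, rec_b: dict, fields=("last", "first")) -> float:
    """Hypothetical record score: the lowest similarity across the fields."""
    return min(field_sim(rec_a[f], rec_b[f]) for f in fields)


def swap_variations(rec: dict, a: str = "first", b: str = "last") -> list:
    """Return the record itself plus a copy with fields a and b swapped."""
    swapped = dict(rec)
    swapped[a], swapped[b] = rec[b], rec[a]
    return [rec, swapped]


def best_sim(rec_a: dict, rec_b: dict) -> float:
    """Score every combination of variations and keep the highest score."""
    return max(
        record_sim(va, vb)
        for va in swap_variations(rec_a)
        for vb in swap_variations(rec_b)
    )


r0 = {"last": "blake", "first": "lauri"}
r1 = {"last": "lauri", "first": "blake"}

plain = record_sim(r0, r1)   # compares the records as entered
with_swap = best_sim(r0, r1)  # also tries the swapped variations
print(plain, with_swap)
```

In this sketch the swapped variation of the first record is identical to the second record, so `with_swap` reaches the maximum score while `plain` stays lower. A variator spares you from materializing these extra records and deduplicating the resulting pairs yourself.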