Filtering
Sometimes it is easier to state the conditions that make a pair unmatchable than the conditions that make it matchable, which is the approach indexing takes. Filtering also aims to improve matching performance, but it uses conditions to reject pairs rather than to select them. You can use filtering and indexing together, or just one of them.
This example demonstrates how filtering works:
In [1]: import pandas as pd
...: from datamatch import (
...: ThresholdMatcher, JaroWinklerSimilarity, DissimilarFilter, NonOverlappingFilter, NoopIndex
...: )
...:
In [2]: df = pd.DataFrame([
...: ['1', 'john', 'slidell pd', 0, 10],
...: ['2', 'john', 'slidell pd', 10, 20],
...: ['3', 'john', 'slidell pd', 20, 30],
...: ['4', 'john', 'gretna pd', 11, 21],
...: ['5', 'john', 'gretna pd', 0, 7],
...: ['6', 'john', 'gretna pd', 10, 18],
...: ], columns=['uid', 'first', 'agency', 'start', 'end'])
...: df
...:
Out[2]:
  uid first      agency  start  end
0   1  john  slidell pd      0   10
1   2  john  slidell pd     10   20
2   3  john  slidell pd     20   30
3   4  john   gretna pd     11   21
4   5  john   gretna pd      0    7
5   6  john   gretna pd     10   18
In [3]: # we can use multiple filters as demonstrated here
...: matcher = ThresholdMatcher(NoopIndex(), {
...: 'first': JaroWinklerSimilarity()
...: }, df, filters=[
...: DissimilarFilter('agency'),  # reject pairs that share the same agency
...: NonOverlappingFilter('start', 'end')  # reject pairs whose start-end ranges overlap
...: ])
...: matcher.get_all_pairs()
...:
Out[3]:
                            uid first      agency  start  end
pair_idx sim_score row_key
0        1.0       2          3  john  slidell pd     20   30
                   5          6  john   gretna pd     10   18
1        1.0       2          3  john  slidell pd     20   30
                   4          5  john   gretna pd      0    7
2        1.0       1          2  john  slidell pd     10   20
                   4          5  john   gretna pd      0    7
3        1.0       0          1  john  slidell pd      0   10
                   3          4  john   gretna pd     11   21
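To make the rejection conditions concrete, here is a minimal sketch in plain Python that reproduces the four surviving pairs from the output above. It does not use datamatch and is not how the library implements its filters. The helpers `same_agency` and `ranges_overlap` are hypothetical names that only mirror the apparent behavior of `DissimilarFilter` and `NonOverlappingFilter`. Treating ranges as closed, so that touching endpoints count as an overlap, is an assumption inferred from the output, where records 1 (0 to 10) and 6 (10 to 18) are not paired.

```python
from itertools import combinations

# Same records as the DataFrame in the example: (uid, first, agency, start, end)
records = [
    ('1', 'john', 'slidell pd', 0, 10),
    ('2', 'john', 'slidell pd', 10, 20),
    ('3', 'john', 'slidell pd', 20, 30),
    ('4', 'john', 'gretna pd', 11, 21),
    ('5', 'john', 'gretna pd', 0, 7),
    ('6', 'john', 'gretna pd', 10, 18),
]


def same_agency(a, b):
    """Hypothetical reject condition: True when both records share an agency."""
    return a[2] == b[2]


def ranges_overlap(a, b):
    """Hypothetical reject condition: True when the start-end ranges overlap.

    Assumption: ranges are closed, so touching endpoints count as overlap.
    """
    return a[3] <= b[4] and b[3] <= a[4]


# A pair is dropped as soon as any reject condition holds.
reject_conditions = [same_agency, ranges_overlap]

surviving = [
    (a[0], b[0])
    for a, b in combinations(records, 2)
    if not any(reject(a, b) for reject in reject_conditions)
]

print(surviving)
# [('1', '4'), ('2', '5'), ('3', '5'), ('3', '6')]
```

Every name is identical in this data, so all of the pruning comes from the two reject conditions: nine pairs are dropped for sharing an agency or for overlapping ranges (some fail both), leaving only pairs that plausibly describe one person moving between agencies.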
See Filters to find out more.