Filtering

Sometimes it is easier to state which pairs cannot match than which pairs can. Indexing takes the second approach: it uses conditions to select the pairs worth comparing. Filtering takes the first: it also aims to improve matching performance, but it uses conditions to reject pairs instead. You can use filtering and indexing together, or just one of them.

This example demonstrates how filtering works:

In [1]: import pandas as pd
   ...: from datamatch import (
   ...:     ThresholdMatcher, JaroWinklerSimilarity, NoopIndex,
   ...:     DissimilarFilter, NonOverlappingFilter
   ...: )
   ...: 

In [2]: df = pd.DataFrame([
   ...:     ['1', 'john', 'slidell pd', 0, 10],
   ...:     ['2', 'john', 'slidell pd', 10, 20],
   ...:     ['3', 'john', 'slidell pd', 20, 30],
   ...:     ['4', 'john', 'gretna pd', 11, 21],
   ...:     ['5', 'john', 'gretna pd', 0, 7],
   ...:     ['6', 'john', 'gretna pd', 10, 18],
   ...: ], columns=['uid', 'first', 'agency', 'start', 'end'])
   ...: df
   ...: 
Out[2]: 
  uid first      agency  start  end
0   1  john  slidell pd      0   10
1   2  john  slidell pd     10   20
2   3  john  slidell pd     20   30
3   4  john   gretna pd     11   21
4   5  john   gretna pd      0    7
5   6  john   gretna pd     10   18

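In [3]: # multiple filters can be combined; a pair is kept only if no filter rejects it
   ...: matcher = ThresholdMatcher(NoopIndex(), {
   ...:     'first': JaroWinklerSimilarity()
   ...: }, df, filters=[
   ...:     # reject pairs whose 'agency' values are the same
   ...:     DissimilarFilter('agency'),
   ...:     # reject pairs whose start..end ranges overlap
   ...:     NonOverlappingFilter('start', 'end')
   ...: ])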
   ...: matcher.get_all_pairs()
   ...: 
Out[3]: 
                           uid first      agency  start  end
pair_idx sim_score row_key                                  
0        1.0       2         3  john  slidell pd     20   30
                   5         6  john   gretna pd     10   18
1        1.0       2         3  john  slidell pd     20   30
                   4         5  john   gretna pd      0    7
2        1.0       1         2  john  slidell pd     10   20
                   4         5  john   gretna pd      0    7
3        1.0       0         1  john  slidell pd      0   10
                   3         4  john   gretna pd     11   21
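
To make the rejection rules concrete, the filtering above can be approximated in plain Python. This is only a rough sketch and not how datamatch implements its filters: `same_agency`, `ranges_overlap`, and `surviving_pairs` are hypothetical helpers, and the closed-interval overlap test is an assumption inferred from the output, where uid 1 (0 to 10) and uid 6 (10 to 18) are not paired.

```python
from itertools import combinations

# Same records as the DataFrame above: (uid, first, agency, start, end).
records = [
    ("1", "john", "slidell pd", 0, 10),
    ("2", "john", "slidell pd", 10, 20),
    ("3", "john", "slidell pd", 20, 30),
    ("4", "john", "gretna pd", 11, 21),
    ("5", "john", "gretna pd", 0, 7),
    ("6", "john", "gretna pd", 10, 18),
]


def same_agency(a, b):
    """Hypothetical stand-in for DissimilarFilter('agency'): reject pairs sharing an agency."""
    return a[2] == b[2]


def ranges_overlap(a, b):
    """Hypothetical stand-in for NonOverlappingFilter('start', 'end').

    Assumption: intervals are closed, so ranges that merely touch
    (for example 0-10 and 10-18) count as overlapping.
    """
    return a[3] <= b[4] and b[3] <= a[4]


def surviving_pairs(rows, reject_rules):
    """Return the uid pairs that none of the rejection rules apply to."""
    return [
        (a[0], b[0])
        for a, b in combinations(rows, 2)
        if not any(rule(a, b) for rule in reject_rules)
    ]


pairs = surviving_pairs(records, [same_agency, ranges_overlap])
print(pairs)
```

Under these assumptions the sketch yields the uid pairs (1, 4), (2, 5), (3, 5), and (3, 6), the same four pairs as in Out[3], though in a different order. Each rule only ever rejects a pair, so adding more rules can shrink the result but never grow it.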

See Filters to find out more.