Filters

A filter discards pairs from the matching process. It does the opposite of an index, which dictates which pairs can be compared. Both are employed to improve matching performance.

class datamatch.filters.BaseFilter

Base class of all filter classes.

Subclasses should implement the valid() method.

abstract valid(a, b)

Returns true if a pair of records is valid (can be matched).

Parameters
  • a – the first record of the pair

  • b – the second record of the pair

Returns

whether these 2 records can be matched

Return type

bool
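
As a rough illustration of the subclassing contract described above, the sketch below defines a custom filter. To stay runnable without the library installed, it uses a local stand-in for the base class; the stand-in, the MaxGapFilter name, the birth_year field, and the dict-based records are all assumptions made for illustration. In real use one would presumably subclass datamatch.filters.BaseFilter instead.

```python
from abc import ABC, abstractmethod


class BaseFilterSketch(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def valid(self, a, b) -> bool:
        """Return True if the pair (a, b) may be matched."""


class MaxGapFilter(BaseFilterSketch):
    """Hypothetical filter: reject pairs whose values in `col` differ by more than `max_gap`."""

    def __init__(self, col, max_gap):
        self._col = col
        self._max_gap = max_gap

    def valid(self, a, b) -> bool:
        # Keep the pair only when the two values are close enough.
        return abs(a[self._col] - b[self._col]) <= self._max_gap


# Records are modeled as plain dicts here (an assumption made for brevity).
flt = MaxGapFilter("birth_year", max_gap=2)
close_pair = flt.valid({"birth_year": 1980}, {"birth_year": 1981})
far_pair = flt.valid({"birth_year": 1980}, {"birth_year": 1990})
print(close_pair, far_pair)
```

Because valid() is abstract in this sketch, instantiating the base class directly raises a TypeError, which makes a forgotten override fail early.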

class datamatch.filters.DissimilarFilter(col)

Eliminates pairs with the same value for a specific field.

Parameters

col (str) – the column to check
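
The check this filter describes can be sketched as a simple inequality on one field. The function below is not the library's implementation; is_dissimilar, the dept field, and the dict-based records are hypothetical names chosen for the example.

```python
def is_dissimilar(a, b, col):
    """Return True when the two records differ on `col`, i.e. the pair is kept."""
    return a[col] != b[col]


# Made-up records, modeled as plain dicts for brevity.
left = {"name": "record 1", "dept": "A"}
same_dept = {"name": "record 2", "dept": "A"}
other_dept = {"name": "record 3", "dept": "B"}

kept = is_dissimilar(left, other_dept, "dept")      # values differ: pair survives
eliminated = not is_dissimilar(left, same_dept, "dept")  # same value: pair is discarded
print(kept, eliminated)
```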

class datamatch.filters.NonOverlappingFilter(start, end)

Eliminates pairs with overlapping ranges.

This is usually applied to time ranges, to ensure that the records in a pair do not overlap in time.

Both start and end columns must be of the same type and must be comparable.

For example, df[end] < df[start] should produce a boolean series.

Parameters
  • start (str) – the range start column

  • end (str) – the range end column
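
A minimal sketch of a non-overlap test for two ranges follows. It is an illustration, not the library's code: ranges_disjoint, the hired and left field names, and the dict-based records are hypothetical, and treating ranges that merely touch at an endpoint as overlapping (strict comparisons) is an assumption.

```python
from datetime import date


def ranges_disjoint(a, b, start, end):
    """True when a's range ends before b's begins, or begins after b's ends."""
    return a[end] < b[start] or a[start] > b[end]


# Made-up records with a start and an end column of the same comparable type.
first = {"hired": date(2010, 1, 1), "left": date(2012, 12, 31)}
later = {"hired": date(2013, 1, 1), "left": date(2015, 6, 30)}
overlapping = {"hired": date(2012, 6, 1), "left": date(2014, 1, 1)}

pair_kept = ranges_disjoint(first, later, "hired", "left")
pair_dropped = not ranges_disjoint(first, overlapping, "hired", "left")
print(pair_kept, pair_dropped)
```

The test is symmetric in its two records, so the order in which a pair is presented does not change the result.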