Filters

A filter discards pairs from the matching process. An index does the opposite: it dictates which pairs can be compared. Both are used to improve matching performance.

class datamatch.filters.BaseFilter

Base class of all filter classes.

Subclasses should implement the valid() method.

abstract valid(a, b)

Returns True if a pair of records is valid (can be matched).

Parameters
  • a – The first record of the pair.

  • b – The second record of the pair.

Returns

Whether these two records can be matched.

Return type

bool
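As a rough illustration of the subclassing contract, the sketch below defines a custom filter. To stay self-contained it declares a local stand-in for BaseFilter instead of importing datamatch, and it uses plain dicts in place of real records; SameStateFilter and the "state" column are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class BaseFilter(ABC):
    """Local stand-in for datamatch.filters.BaseFilter (an assumption)."""

    @abstractmethod
    def valid(self, a, b):
        """Return True if records a and b can be matched."""


class SameStateFilter(BaseFilter):
    """Hypothetical filter: keep only pairs whose 'state' values agree."""

    def valid(self, a, b):
        return a["state"] == b["state"]


f = SameStateFilter()
left = {"name": "Ann", "state": "LA"}
right = {"name": "Anne", "state": "LA"}
other = {"name": "Ann", "state": "TX"}

same_state = f.valid(left, right)       # pair is kept
different_state = f.valid(left, other)  # pair is discarded
```

With the real library, the subclass would presumably inherit from datamatch.filters.BaseFilter directly and receive the library's own record objects.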

class datamatch.filters.DissimilarFilter(col, ignore_key_error=False)

Eliminates pairs with the same value for a specific field.

Parameters
  • col (str) – The column to check.

  • ignore_key_error (bool) – When set to True, if the column is not found, acts like a no-op filter instead of raising KeyError.
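The behaviour described above can be restated as a small standalone function. This is only a sketch of the documented semantics, not the library's implementation: dissimilar_valid is a hypothetical helper, records are plain dicts, and the "uid" and "agency" columns are made up for the example.

```python
def dissimilar_valid(a, b, col, ignore_key_error=False):
    """Illustrative restatement of the documented behaviour, not library code."""
    try:
        # A pair is valid only when the two values differ.
        return a[col] != b[col]
    except KeyError:
        if ignore_key_error:
            # Column missing: behave as a no-op filter and let the pair through.
            return True
        raise


left = {"uid": "A1", "name": "Ann"}
right = {"uid": "A1", "name": "Anne"}
other = {"uid": "B7", "name": "Ann"}

same_uid = dissimilar_valid(left, right, "uid")       # pair is discarded
different_uid = dissimilar_valid(left, other, "uid")  # pair is kept
missing_col = dissimilar_valid(left, other, "agency", ignore_key_error=True)
```

Without ignore_key_error=True, the last call would raise KeyError because neither record has an "agency" column.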

class datamatch.filters.NonOverlappingFilter(start, end)

Eliminates pairs with overlapping ranges.

This is typically applied to time ranges to ensure that the two records in a pair cover mutually exclusive periods.

Both start and end columns must be of the same type and must be comparable. For example, df[end] < df[start] should produce a boolean series.

Parameters
  • start (str) – The range start column.

  • end (str) – The range end column.
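One way to picture the overlap check is the sketch below. It is an assumption about the semantics rather than the library's code: non_overlapping is a hypothetical helper, records are plain dicts, and ranges that merely touch at an endpoint are treated here as overlapping.

```python
from datetime import date


def non_overlapping(a, b, start, end):
    """Return True when the [start, end] ranges of a and b do not overlap.

    Illustrative only; strict comparison means touching ranges count as overlapping.
    """
    return a[end] < b[start] or b[end] < a[start]


first_job = {"hire": date(2001, 1, 1), "left": date(2005, 6, 30)}
second_job = {"hire": date(2006, 1, 1), "left": date(2010, 12, 31)}
concurrent_job = {"hire": date(2004, 1, 1), "left": date(2008, 1, 1)}

sequential = non_overlapping(first_job, second_job, "hire", "left")
concurrent = non_overlapping(first_job, concurrent_job, "hire", "left")
```

Here sequential is True, so the pair can still be matched, while concurrent is False, so the pair would be eliminated.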