Pairers

A pairer produces pairs of records for matching. Most users never need to work with a pairer directly; the classes are exposed for the sake of customization.

There are two pairers corresponding to two strategies:

  • MatchPairer: Takes two datasets and produces pairs of records
    such that each pair contains one record from each dataset.
    ThresholdMatcher uses this pairer when it is given two datasets.
    It is useful for matching records between two datasets.
  • DeduplicatePairer: Takes one dataset and produces pairs of records
    in which both records come from that dataset. ThresholdMatcher uses
    this pairer when it is given only one dataset. It is useful for
    deduplication tasks.

class datamatch.pairers.BasePairer

Abstract base class for all pairer classes.

Subclasses must implement frame_a, frame_b, and pairs().

frame_a should return the left set of records and frame_b the right set of records, whereas pairs() should produce pairs of records (one from frame_a, one from frame_b).
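
The contract described above can be sketched as follows. This is a minimal illustration, not datamatch's own code: PairerSketch is a hypothetical stand-in for BasePairer (the real class may define more than these three members), and EveryPairSketch is an invented subclass that simply pairs every left row with every right row.

```python
from abc import ABC, abstractmethod
import itertools

import pandas as pd


class PairerSketch(ABC):
    """Hypothetical stand-in for the interface described above (not datamatch's class)."""

    @property
    @abstractmethod
    def frame_a(self) -> pd.DataFrame:
        """Left set of records."""

    @property
    @abstractmethod
    def frame_b(self) -> pd.DataFrame:
        """Right set of records."""

    @abstractmethod
    def pairs(self):
        """Iterator over (left record, right record) pairs."""


class EveryPairSketch(PairerSketch):
    """Pairs every left row with every right row (illustration only)."""

    def __init__(self, dfa: pd.DataFrame, dfb: pd.DataFrame):
        self._dfa = dfa
        self._dfb = dfb

    @property
    def frame_a(self) -> pd.DataFrame:
        return self._dfa

    @property
    def frame_b(self) -> pd.DataFrame:
        return self._dfb

    def pairs(self):
        # iterrows() yields (row index, pandas.Series) tuples, one per row.
        return itertools.product(self._dfa.iterrows(), self._dfb.iterrows())


left = pd.DataFrame({"name": ["Ann", "Bob"]})
right = pd.DataFrame({"name": ["Anne", "Rob", "Cy"]})
pairer = EveryPairSketch(left, right)
pair_count = sum(1 for _ in pairer.pairs())  # 2 left rows x 3 right rows
```

Because all three members are abstract, instantiating PairerSketch directly raises TypeError; only a subclass that implements all of them can be created.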

abstract property frame_a

Returns the left set of records.

Return type

pandas.DataFrame

abstract property frame_b

Returns the right set of records.

Return type

pandas.DataFrame

abstract pairs()

Returns an iterator over pairs of records that should be compared.

Each pair is a tuple of a left record and a right record. Each record is itself a two-element tuple: the row index and the row data, which is a pandas.Series.

Return type

Iterator
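
To make the shape of each pair concrete, the sketch below unpacks pairs in a loop. It does not call datamatch: example_pairs is a hypothetical stand-in built from pandas.DataFrame.iterrows(), which happens to yield records in the same (row index, pandas.Series) form described above.

```python
import itertools

import pandas as pd

people_a = pd.DataFrame({"name": ["Ann", "Bob"]}, index=[1, 2])
people_b = pd.DataFrame({"name": ["Ann", "Cy"]}, index=[7, 8])

# Hypothetical stand-in for the iterator a pairer's pairs() would return.
example_pairs = itertools.product(people_a.iterrows(), people_b.iterrows())

same_name = []
for (idx_a, row_a), (idx_b, row_b) in example_pairs:
    # row_a and row_b are pandas.Series, so fields are read by column name.
    if row_a["name"] == row_b["name"]:
        same_name.append((idx_a, idx_b))
```

Unpacking both records in the loop header keeps the row indices at hand, which is what a caller would typically need in order to report which rows matched.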

class datamatch.pairers.MatchPairer(dfa, dfb, index)

Pairs records from two frames using the provided index.
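
As a rough illustration of pairing across two frames, the sketch below assumes that the index restricts pairing to records sharing a key; the text above does not spell out what the index does, so treat that as an assumption. The function pairs_within_buckets and the "city" key column are hypothetical, and no datamatch index class is used.

```python
import itertools

import pandas as pd


def pairs_within_buckets(dfa, dfb, key):
    """Yield cross-frame record pairs whose rows share the same value in column `key`.

    Hypothetical helper: a plain column name stands in for the index object.
    """
    buckets_b = {k: group for k, group in dfb.groupby(key)}
    for k, group_a in dfa.groupby(key):
        group_b = buckets_b.get(k)
        if group_b is None:
            continue  # no right-hand records in this bucket
        # One record from the left frame, one from the right frame.
        yield from itertools.product(group_a.iterrows(), group_b.iterrows())


dfa = pd.DataFrame({"city": ["NYC", "LA"], "name": ["Ann", "Bob"]})
dfb = pd.DataFrame(
    {"city": ["NYC", "NYC", "SF"], "name": ["Anne", "Al", "Cy"]},
    index=[10, 11, 12],
)
matched_ids = [(a[0], b[0]) for a, b in pairs_within_buckets(dfa, dfb, "city")]
```

Only rows that fall into the same bucket are compared, so the number of pairs stays well below the full cross product when the key is selective.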

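Parameters

  • dfa: The left set of records (a pandas.DataFrame).
  • dfb: The right set of records (a pandas.DataFrame).
  • index: The index used to pair records.

class datamatch.pairers.DeduplicatePairer(df, index)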

Pairs records from a single frame for deduplication.

Because this class is initialized with only a single frame, both frame_a and frame_b return that same frame.
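
A minimal sketch of the single-frame case follows. SingleFramePairerSketch is a hypothetical class, not datamatch's DeduplicatePairer, and it ignores the index entirely. Using itertools.combinations, so that no record is paired with itself and each pair appears once, is an assumption; the text above does not state how the library orders or filters pairs.

```python
import itertools

import pandas as pd


class SingleFramePairerSketch:
    """Hypothetical single-frame pairer (illustration only, index ignored)."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    @property
    def frame_a(self) -> pd.DataFrame:
        return self._df

    @property
    def frame_b(self) -> pd.DataFrame:
        # Same object as frame_a: there is only one frame.
        return self._df

    def pairs(self):
        # Assumption: unordered pairs of distinct rows from the one frame.
        return itertools.combinations(self._df.iterrows(), 2)


records = pd.DataFrame({"name": ["Ann", "Anne", "Bob"]})
sketch = SingleFramePairerSketch(records)
dedup_ids = [(a[0], b[0]) for a, b in sketch.pairs()]
```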

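Parameters

  • df: The set of records to deduplicate (a pandas.DataFrame).
  • index: The index used to pair records.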