Pairers

A pairer produces pairs of records for matching. Most users never need to work with a pairer directly, but the classes are exposed for customization.

There are 2 pairers corresponding to 2 strategies:

  • MatchPairer: Takes 2 datasets and produces pairs of records
    such that each pair contains 1 record from each dataset.
    ThresholdMatcher uses this pairer when it is given 2 datasets.
    It is useful for matching records between 2 datasets.
  • DeduplicatePairer: Takes 1 dataset and produces pairs of records
    drawn only from that dataset. ThresholdMatcher uses this pairer
    when it is given only 1 dataset. It is useful for deduplication
    tasks.
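
The difference between the two strategies can be sketched without the library. The snippet below is only an illustration under simplifying assumptions: plain lists of (row index, name) tuples stand in for datasets, no index is applied, and a record is assumed never to be paired with itself. The helper names match_pairs and dedup_pairs are hypothetical and are not part of datamatch.

```python
from itertools import combinations, product

def match_pairs(left, right):
    """Yield every (left record, right record) pair across two datasets."""
    return product(left, right)

def dedup_pairs(records):
    """Yield every unordered pair of distinct records within one dataset."""
    return combinations(records, 2)

left = [(0, "john smith"), (1, "jane doe")]
right = [(0, "jon smith"), (1, "j. doe"), (2, "ann lee")]

# Matching: 2 left records x 3 right records.
cross = list(match_pairs(left, right))

# Deduplication: pairs drawn from a single dataset of 3 records.
within = list(dedup_pairs(right))
```

With these inputs, cross holds 6 pairs and within holds 3, which shows why pairing without an index grows quickly with dataset size.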

class datamatch.pairers.BasePairer

Abstract base class for all pairer classes

Subclasses must implement the frame_a and frame_b properties and the pairs() method.

frame_a should return the left set of records and frame_b the right set, whereas pairs() should produce pairs of records (one from frame_a, one from frame_b).

abstract property frame_a

Returns the left set of records

Return type

pandas.DataFrame

abstract property frame_b

Returns the right set of records

Return type

pandas.DataFrame

abstract pairs()

Returns an iterator over pairs of records that should be compared

Each pair is a tuple of a left record and a right record. Each record is a tuple of 2 elements: the row index and the row data, which is a pandas.Series.

Return type

Iterator
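
To make the contract concrete, here is a minimal sketch of a class hierarchy with the same shape: two abstract properties and one abstract method. It does not import datamatch. SketchPairer and CrossPairer are hypothetical names, and plain dicts (row index mapped to row data) stand in for pandas.DataFrame and pandas.Series so the example runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from itertools import product

class SketchPairer(ABC):
    """Hypothetical stand-in that mirrors the documented contract."""

    @property
    @abstractmethod
    def frame_a(self):
        """Left set of records."""

    @property
    @abstractmethod
    def frame_b(self):
        """Right set of records."""

    @abstractmethod
    def pairs(self):
        """Iterator over (left record, right record) tuples."""

class CrossPairer(SketchPairer):
    """Pairs every left row with every right row (no index applied)."""

    def __init__(self, rows_a, rows_b):
        # Each dataset is a dict mapping row index -> row data.
        self._a = rows_a
        self._b = rows_b

    @property
    def frame_a(self):
        return self._a

    @property
    def frame_b(self):
        return self._b

    def pairs(self):
        # dict.items() yields (row index, row data) tuples, the same
        # two-element shape described for a record.
        return product(self._a.items(), self._b.items())

pairer = CrossPairer(
    {10: {"name": "john smith"}},
    {20: {"name": "jon smith"}, 21: {"name": "ann lee"}},
)
first_left, first_right = next(iter(pairer.pairs()))
```

Here first_left is the tuple (10, {"name": "john smith"}) and first_right is (20, {"name": "jon smith"}). With real DataFrames, a comparable (row index, Series) shape is what pandas.DataFrame.iterrows() yields.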

class datamatch.pairers.MatchPairer(dfa, dfb, index)

Pairs records from 2 frames using the provided index

Parameters
  • dfa (pandas.DataFrame) – The left dataset

  • dfb (pandas.DataFrame) – The right dataset

  • index (sub-class of BaseIndex) – The index used to divide the datasets into buckets

class datamatch.pairers.DeduplicatePairer(df, index)

Pairs records from a single frame to deduplicate

As this class is only initialized with a single frame, both frame_a and frame_b return this same frame.

Parameters
  • df (pandas.DataFrame) – The dataset to deduplicate

  • index (sub-class of BaseIndex) – The index used to divide the dataset into buckets
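
The index argument limits which records are paired by dividing the data into buckets. As a rough sketch of that idea for the single-frame case, the snippet below groups rows by the value of one column and pairs rows only within the same bucket. The helpers bucket_by and dedup_pairs_in_buckets are hypothetical, dicts stand in for pandas data, and the real BaseIndex interface is not shown here and may work differently.

```python
from collections import defaultdict
from itertools import combinations

def bucket_by(rows, key):
    """Group (row index, row data) records by the value of one column."""
    buckets = defaultdict(list)
    for idx, row in rows.items():
        buckets[row[key]].append((idx, row))
    return buckets

def dedup_pairs_in_buckets(rows, key):
    """Yield pairs of distinct records that fall in the same bucket."""
    for records in bucket_by(rows, key).values():
        yield from combinations(records, 2)

rows = {
    0: {"name": "john smith", "city": "austin"},
    1: {"name": "jon smith", "city": "austin"},
    2: {"name": "ann lee", "city": "boston"},
    3: {"name": "jhon smith", "city": "austin"},
}

# Keep only the row indices of each pair for readability.
pairs = [(a[0], b[0]) for a, b in dedup_pairs_in_buckets(rows, "city")]
```

Bucketing by city leaves 3 candidate pairs, all within the "austin" bucket, instead of the 6 pairs that 4 rows would produce without an index.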