Pairers

A pairer produces pairs of records for matching. Most users never need to work with a pairer directly, but the classes are exposed for customization.

There are two pairers corresponding to two strategies:

MatchPairer: Takes in two datasets and produces pairs of records such that each pair contains one record from one dataset and one record from the other. ThresholdMatcher uses this pairer when it is given two datasets. It is useful for matching records between two datasets.
DeduplicatePairer: Takes in one dataset and produces pairs of records drawn only from that dataset. ThresholdMatcher uses this pairer when it is given only one dataset. It is useful for deduplication tasks.

class datamatch.pairers.BasePairer

Abstract base class for all pairer classes.

Subclasses must implement the frame_a and frame_b properties and the pairs() method.

frame_a should return the left set of records and frame_b the right set, whereas pairs() should produce pairs of records (one from frame_a, one from frame_b).

abstract property frame_a

Returns the left set of records.

Return type: pandas.DataFrame

abstract property frame_b

Returns the right set of records.

Return type: pandas.DataFrame

abstract pairs()

Returns an iterator over pairs of records that should be compared.

Each pair is a tuple of a left record and a right record. Each record is itself a tuple of two elements: the row index and the row data, which is a pandas.Series.

Return type: Iterator
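The interface described above can be sketched with a small, self-contained stand-in. This is only an illustration: PairerSketch and FullCrossPairer are hypothetical names defined here, not part of datamatch, and plain lists of (row index, dict) tuples stand in for the pandas.DataFrame and pandas.Series objects the library works with.

```python
from abc import ABC, abstractmethod
from itertools import product
from typing import Iterator, List, Tuple

# (row index, row data); a dict stands in for the pandas.Series used by the library.
Record = Tuple[int, dict]


class PairerSketch(ABC):
    """Hypothetical stand-in mirroring the documented interface (not datamatch's BasePairer)."""

    @property
    @abstractmethod
    def frame_a(self) -> List[Record]:
        """Left set of records."""

    @property
    @abstractmethod
    def frame_b(self) -> List[Record]:
        """Right set of records."""

    @abstractmethod
    def pairs(self) -> Iterator[Tuple[Record, Record]]:
        """Iterator over (left record, right record) tuples."""


class FullCrossPairer(PairerSketch):
    """Toy pairer that compares every left record with every right record."""

    def __init__(self, rows_a, rows_b):
        self._a = list(enumerate(rows_a))
        self._b = list(enumerate(rows_b))

    @property
    def frame_a(self):
        return self._a

    @property
    def frame_b(self):
        return self._b

    def pairs(self):
        # itertools.product already returns an iterator of (left, right) tuples.
        return product(self._a, self._b)


pairer = FullCrossPairer([{"name": "ann"}], [{"name": "anne"}, {"name": "bob"}])
for (idx_a, row_a), (idx_b, row_b) in pairer.pairs():
    print(idx_a, row_a["name"], idx_b, row_b["name"])
```

Each item yielded by pairs() unpacks into two (index, data) records, which is the shape described for the real method. Declaring frame_a and frame_b as abstract properties means a subclass that omits either one cannot be instantiated.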

class datamatch.pairers.MatchPairer(dfa, dfb, index)

Pairs records from two frames using the provided index.

Parameters

dfa (pandas.DataFrame) – The left dataset.
dfb (pandas.DataFrame) – The right dataset.
index (sub-class of datamatch.indices.BaseIndex) – The index to divide datasets into buckets.
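To make the role of the index concrete, here is a minimal sketch of bucketed pairing across two datasets. It assumes the index behaves like a simple blocking key (records sharing a column value land in the same bucket); how datamatch's index classes actually build buckets is not shown in this section. bucket_by and bucketed_match_pairs are hypothetical helpers, and dicts stand in for pandas rows.

```python
from collections import defaultdict
from itertools import product


def bucket_by(records, key):
    """Group (row index, row data) records by the value of one column."""
    buckets = defaultdict(list)
    for idx, row in records:
        buckets[row[key]].append((idx, row))
    return buckets


def bucketed_match_pairs(records_a, records_b, key):
    """Yield cross-dataset pairs, but only for records in the same bucket."""
    buckets_a = bucket_by(records_a, key)
    buckets_b = bucket_by(records_b, key)
    # Only bucket values present in both datasets can produce pairs.
    for value in buckets_a.keys() & buckets_b.keys():
        yield from product(buckets_a[value], buckets_b[value])


left = [(0, {"city": "NYC", "name": "ann"}), (1, {"city": "LA", "name": "bob"})]
right = [(0, {"city": "NYC", "name": "anne"}), (1, {"city": "SF", "name": "cy"})]

match_pairs = list(bucketed_match_pairs(left, right, "city"))
```

Without bucketing, the two datasets above would produce four candidate pairs. With it, only the two "NYC" records are compared, which is the usual reason for dividing datasets into buckets before matching.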

class datamatch.pairers.DeduplicatePairer(df, index)

Pairs records from a single frame for deduplication.

As this class is initialized with only a single frame, both frame_a and frame_b return that same frame.

Parameters

df (pandas.DataFrame) – The dataset to deduplicate.
index (sub-class of datamatch.indices.BaseIndex) – The index to divide the dataset into buckets.
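A deduplication pairer can be sketched in the same spirit. This sketch assumes that each unordered pair within a bucket is produced once and that a record is never paired with itself. That is a natural choice for deduplication, though this section does not confirm it as the library's behavior. bucketed_dedup_pairs is a hypothetical helper, and dicts stand in for pandas rows.

```python
from collections import defaultdict
from itertools import combinations


def bucketed_dedup_pairs(records, key):
    """Yield pairs of distinct records from one dataset that share a bucket."""
    buckets = defaultdict(list)
    for idx, row in records:
        buckets[row[key]].append((idx, row))
    for rows in buckets.values():
        # combinations(rows, 2) yields each unordered pair exactly once.
        yield from combinations(rows, 2)


people = [
    (0, {"surname": "lee", "first": "jon"}),
    (1, {"surname": "lee", "first": "john"}),
    (2, {"surname": "kim", "first": "sue"}),
    (3, {"surname": "lee", "first": "jonathan"}),
]

dedup_index_pairs = [(a[0], b[0]) for a, b in bucketed_dedup_pairs(people, "surname")]
```

Both sides of every pair come from the same input, which corresponds to frame_a and frame_b being the same frame. The "kim" bucket holds a single record and therefore contributes no pairs.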