Pairers

A pairer produces pairs of records for matching. Most users never need to work with a pairer directly, but the classes are exposed for customization.

There are 2 pairers corresponding to 2 strategies:

MatchPairer: Takes in 2 datasets and produces pairs of records such that each pair contains 1 record from one dataset and 1 record from the other. This pairer is used when ThresholdMatcher is given 2 datasets. It is useful for matching records between 2 datasets.
DeduplicatePairer: Takes in 1 dataset and produces pairs in which both records come from that dataset. This pairer is used when ThresholdMatcher is given only 1 dataset. It is useful for deduplication tasks.
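The difference between the two strategies can be sketched with the standard library alone. This is a conceptual illustration, not datamatch's implementation: the helper names match_pairs and dedup_pairs are hypothetical, plain dicts stand in for pandas.Series rows, and it assumes that deduplication compares each unordered pair of distinct records once.

```python
from itertools import combinations, product

# Toy records as (row index, row data) tuples.
# Plain dicts stand in for pandas.Series (an assumption for illustration).
left = [(0, {"name": "Ann Lee"}), (1, {"name": "Bob Ray"})]
right = [
    (10, {"name": "Anne Lee"}),
    (11, {"name": "Rob Ray"}),
    (12, {"name": "Cy Oh"}),
]


def match_pairs(records_a, records_b):
    """Hypothetical helper: one record from each of two datasets."""
    return product(records_a, records_b)


def dedup_pairs(records):
    """Hypothetical helper: each unordered pair of distinct records
    drawn from a single dataset."""
    return combinations(records, 2)


cross = list(match_pairs(left, right))
within = list(dedup_pairs(right))

print(len(cross))   # 2 * 3 = 6 pairs across the two datasets
print(len(within))  # C(3, 2) = 3 pairs within one dataset
```

In both cases the number of pairs grows quickly with dataset size, which is why the real pairers take an index that restricts comparisons to records in the same bucket.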

class datamatch.pairers.BasePairer

Abstract base class for all pairer classes

Subclasses must implement frame_a, frame_b, and pairs().

frame_a should produce the left set of records and frame_b the right set, whereas pairs() should produce pairs of records (one from frame_a, one from frame_b).

abstract property frame_a

Returns the left set of records

Return type: pandas.DataFrame

abstract property frame_b

Returns the right set of records

Return type: pandas.DataFrame

abstract pairs()

Returns an iterator over pairs of records that should be compared

Each pair is a tuple of a left record and a right record. Each record is a tuple of 2 elements: the row index and the row data, which is a pandas.Series.

Return type: Iterator
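The shape of this interface can be sketched as follows. To keep the example runnable without datamatch or pandas installed, SketchPairer is a hypothetical stand-in that only mirrors the documented members of BasePairer, lists of (row index, dict) tuples stand in for pandas.DataFrame and pandas.Series, and SameIndexPairer is an invented custom strategy. A real custom pairer would presumably subclass datamatch.pairers.BasePairer and return DataFrames from frame_a and frame_b.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Tuple

# (row index, row data); a dict stands in for pandas.Series here.
Record = Tuple[int, Dict[str, Any]]


class SketchPairer(ABC):
    """Hypothetical stand-in mirroring the documented BasePairer members."""

    @property
    @abstractmethod
    def frame_a(self) -> List[Record]:
        """The left set of records."""

    @property
    @abstractmethod
    def frame_b(self) -> List[Record]:
        """The right set of records."""

    @abstractmethod
    def pairs(self) -> Iterator[Tuple[Record, Record]]:
        """Iterate over (left record, right record) tuples."""


class SameIndexPairer(SketchPairer):
    """Hypothetical strategy: pair rows that share the same row index."""

    def __init__(self, rows_a: List[Record], rows_b: List[Record]):
        self._rows_a = rows_a
        self._rows_b = rows_b

    @property
    def frame_a(self) -> List[Record]:
        return self._rows_a

    @property
    def frame_b(self) -> List[Record]:
        return self._rows_b

    def pairs(self) -> Iterator[Tuple[Record, Record]]:
        right_by_index = dict(self._rows_b)
        for idx, row in self._rows_a:
            if idx in right_by_index:
                # Each yielded pair is (left record, right record).
                yield (idx, row), (idx, right_by_index[idx])


pairer = SameIndexPairer(
    [(1, {"name": "a"}), (2, {"name": "b"})],
    [(2, {"name": "B"}), (3, {"name": "c"})],
)
found = list(pairer.pairs())
print(found)  # [((2, {'name': 'b'}), (2, {'name': 'B'}))]
```

Because the three members are abstract, instantiating the base class directly raises TypeError; only a subclass that implements all of them can be created.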

class datamatch.pairers.MatchPairer(dfa, dfb, index)

Pairs records from 2 frames using the provided index

Parameters

dfa (pandas.DataFrame) – The left dataset
dfb (pandas.DataFrame) – The right dataset
index (sub-class of BaseIndex) – The index used to divide the datasets into buckets
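The effect of dividing the datasets into buckets can be sketched as below. This section does not show how a BaseIndex sub-class computes its buckets, so the example assumes the simplest case: a bucket is the set of rows sharing one column value. The helpers bucket_by and bucketed_match_pairs are hypothetical, and dicts again stand in for pandas.Series rows.

```python
from collections import defaultdict
from itertools import product


def bucket_by(records, column):
    """Hypothetical helper: group (row index, row) records by a column value."""
    buckets = defaultdict(list)
    for idx, row in records:
        buckets[row[column]].append((idx, row))
    return buckets


def bucketed_match_pairs(records_a, records_b, column):
    """Yield (left, right) pairs only for records in the same bucket."""
    buckets_a = bucket_by(records_a, column)
    buckets_b = bucket_by(records_b, column)
    for key, rows_a in buckets_a.items():
        if key in buckets_b:  # skip buckets with no counterpart
            yield from product(rows_a, buckets_b[key])


left = [
    (0, {"state": "LA", "name": "Ann Lee"}),
    (1, {"state": "TX", "name": "Bob Ray"}),
]
right = [
    (10, {"state": "LA", "name": "Anne Lee"}),
    (11, {"state": "LA", "name": "Andy Lee"}),
    (12, {"state": "NY", "name": "Cy Oh"}),
]

bucketed = list(bucketed_match_pairs(left, right, "state"))
print([(a[0], b[0]) for a, b in bucketed])  # [(0, 10), (0, 11)]
```

Without bucketing, these inputs would produce 6 candidate pairs; restricting comparisons to the shared "LA" bucket leaves 2.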

class datamatch.pairers.DeduplicatePairer(df, index)

Pairs records from a single frame to deduplicate

As this class is initialized with only a single frame, both frame_a and frame_b return that same frame.

Parameters

df (pandas.DataFrame) – The dataset to deduplicate
index (sub-class of BaseIndex) – The index used to divide the dataset into buckets