Pairers
A pairer produces pairs of records for matching. Most users will never need to touch a pairer directly, but the classes are exposed for customization.
There are two pairers corresponding to two strategies:
MatchPairer
: Takes in two datasets and produces pairs of records such that each pair contains one record from one dataset and one record from the other dataset. This pairer is used when ThresholdMatcher is given two datasets. It is useful for matching records between two datasets.
DeduplicatePairer
: Takes in one dataset and produces pairs of records, with both records in each pair drawn from the input dataset. This pairer is used when ThresholdMatcher is given only one dataset. It is useful for deduplication tasks.
- class datamatch.pairers.BasePairer
Abstract base class for all pairer classes.
Subclasses must implement frame_a, frame_b, and pairs(). frame_a should produce the left set of records, frame_b should produce the right set of records, and pairs() should produce pairs of records (one from frame_a, one from frame_b).
- abstract property frame_a
Returns the left set of records.
- Return type
- abstract property frame_b
Returns the right set of records.
- Return type
- abstract pairs()
Returns an iterator over pairs of records that should be compared.
Each pair is a tuple of a left record and a right record. Each record is a tuple of two elements: the row index and the row data, which is a pandas.Series.
- Return type
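To make the contract concrete, here is a minimal sketch of the same three-member interface written with only the standard library. It does not import datamatch and is not the library's implementation: SketchPairer and CrossPairer are hypothetical names, and plain dicts stand in for pandas.Series rows.

```python
from abc import ABC, abstractmethod
from itertools import product
from typing import Dict, Iterator, List, Tuple

# A record is (row index, row data); a dict stands in for pandas.Series here.
Record = Tuple[int, Dict[str, str]]


class SketchPairer(ABC):
    """Hypothetical stand-in mirroring the frame_a / frame_b / pairs() contract."""

    @property
    @abstractmethod
    def frame_a(self) -> List[Record]:
        """Left set of records."""

    @property
    @abstractmethod
    def frame_b(self) -> List[Record]:
        """Right set of records."""

    @abstractmethod
    def pairs(self) -> Iterator[Tuple[Record, Record]]:
        """Yield (left record, right record) tuples to compare."""


class CrossPairer(SketchPairer):
    """Hypothetical subclass: pairs every left record with every right record."""

    def __init__(self, left: List[Record], right: List[Record]) -> None:
        self._left = left
        self._right = right

    @property
    def frame_a(self) -> List[Record]:
        return self._left

    @property
    def frame_b(self) -> List[Record]:
        return self._right

    def pairs(self) -> Iterator[Tuple[Record, Record]]:
        # Cartesian product: one record from frame_a, one from frame_b.
        return product(self.frame_a, self.frame_b)


left = [(0, {"name": "Ann"}), (1, {"name": "Bob"})]
right = [(10, {"name": "Anne"})]
pairer = CrossPairer(left, right)
pair_list = list(pairer.pairs())
```

Because the two frames are abstract properties, a subclass that omits any of the three members cannot be instantiated, which is what catches an incomplete pairer early.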
- class datamatch.pairers.MatchPairer(dfa, dfb, index)
Pairs records from two frames using the provided index.
- Parameters
dfa (pandas.DataFrame) – The left dataset.
dfb (pandas.DataFrame) – The right dataset.
index (sub-class of datamatch.indices.BaseIndex) – The index to divide datasets into buckets.
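The idea of pairing across two datasets within index buckets can be sketched in plain Python. This is an illustration under stated assumptions, not datamatch's code: bucket_by and match_pairs are hypothetical helpers, a key function stands in for the index, and dicts stand in for pandas.Series rows.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Hashable, Iterator, List, Tuple

Record = Tuple[int, Dict[str, str]]
KeyFunc = Callable[[Dict[str, str]], Hashable]


def bucket_by(records: List[Record], key: KeyFunc) -> Dict[Hashable, List[Record]]:
    """Group records by a bucket key (a hypothetical stand-in for an index)."""
    buckets: Dict[Hashable, List[Record]] = defaultdict(list)
    for record in records:
        buckets[key(record[1])].append(record)
    return buckets


def match_pairs(
    left: List[Record], right: List[Record], key: KeyFunc
) -> Iterator[Tuple[Record, Record]]:
    """Yield cross-dataset pairs, restricted to buckets that share a key."""
    left_buckets = bucket_by(left, key)
    right_buckets = bucket_by(right, key)
    for bucket_key, left_records in left_buckets.items():
        # A bucket with no counterpart on the right produces no pairs.
        yield from product(left_records, right_buckets.get(bucket_key, []))


left = [
    (0, {"name": "Ann", "year": "1990"}),
    (1, {"name": "Bob", "year": "1985"}),
]
right = [
    (10, {"name": "Anne", "year": "1990"}),
    (11, {"name": "Rob", "year": "1985"}),
    (12, {"name": "Cy", "year": "2001"}),
]

matched = [(a[0], b[0]) for a, b in match_pairs(left, right, key=lambda row: row["year"])]
print(matched)  # [(0, 10), (1, 11)]
```

Without bucketing, the two left records and three right records would give six candidate pairs; grouping by year cuts that to two, which is the purpose of dividing the datasets into buckets.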
- class datamatch.pairers.DeduplicatePairer(df, index)
Pairs records from a single frame for deduplication.
As this class is initialized with only a single frame, both frame_a and frame_b return this same frame.
- Parameters
df (pandas.DataFrame) – The dataset to deduplicate.
index (sub-class of datamatch.indices.BaseIndex) – The index to divide datasets into buckets.
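Pairing within a single dataset can be sketched the same way. The snippet below is a conceptual illustration, not datamatch's implementation: dedup_pairs is a hypothetical helper, a key function stands in for the index, and the sketch assumes that a record is never paired with itself and that each unordered pair appears only once.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, Iterator, List, Tuple

Record = Tuple[int, Dict[str, str]]


def dedup_pairs(
    records: List[Record], key: Callable[[Dict[str, str]], Hashable]
) -> Iterator[Tuple[Record, Record]]:
    """Yield pairs of records from one dataset that fall in the same bucket."""
    buckets: Dict[Hashable, List[Record]] = defaultdict(list)
    for record in records:
        buckets[key(record[1])].append(record)
    for bucket in buckets.values():
        # combinations() skips self-pairs and yields each unordered pair once.
        yield from combinations(bucket, 2)


records = [
    (0, {"name": "Ann", "city": "Oslo"}),
    (1, {"name": "Anne", "city": "Oslo"}),
    (2, {"name": "Bob", "city": "Rome"}),
    (3, {"name": "Ann M.", "city": "Oslo"}),
]

candidates = [(a[0], b[0]) for a, b in dedup_pairs(records, key=lambda row: row["city"])]
print(candidates)  # [(0, 1), (0, 3), (1, 3)]
```

The record in the Rome bucket has no partner, so it contributes no pairs.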