Pairers
A pairer produces pairs of records for matching. Most users never need to touch a pairer directly, but the classes are exposed for the sake of customization.
There are 2 pairers corresponding to 2 strategies:
- MatchPairer: Takes in 2 datasets and produces pairs of records such that each pair contains 1 record from one dataset and 1 record from the other dataset. This pairer is used when ThresholdMatcher is given 2 datasets. It is useful for matching records between 2 datasets.
- DeduplicatePairer: Takes in 1 dataset and produces pairs of records, each containing only records from the input dataset. This pairer is used when ThresholdMatcher is given only 1 dataset. It is useful for deduplication tasks.
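The difference between the two strategies can be sketched with plain Python lists. This is only an illustration of the pairing shapes, not the library's implementation: the record values are made up, and the sketch ignores the index, which divides the datasets into buckets.

```python
from itertools import combinations, product

# Toy records (made up for illustration); in datamatch these would be
# rows of a pandas.DataFrame.
left = ["alice", "bob"]
right = ["alicia", "robert", "bob"]

# Matching strategy: each pair takes one record from each dataset.
match_pairs = list(product(left, right))

# Deduplication strategy: each pair takes two distinct records
# from the same dataset.
dedup_pairs = list(combinations(right, 2))

print(len(match_pairs))  # 2 * 3 = 6 candidate pairs
print(dedup_pairs)
```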
- class datamatch.pairers.BasePairer
Abstract base class for all pairer classes.
Sub-classes must implement frame_a(), frame_b(), and pairs(). frame_a() should produce the left set of records, frame_b() should produce the right set of records, whereas pairs() should produce pairs of records (one from frame_a(), one from frame_b()).
- abstract property frame_a
Returns the left set of records
- Return type
- abstract property frame_b
Returns the right set of records
- Return type
- abstract pairs()
Returns an iterator over pairs of records that should be compared
Each pair is a tuple of left record and right record. Each record is a tuple of 2 elements: the row index and the row data, which is a pandas.Series.
- Return type
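The interface described above can be sketched without importing datamatch. In the example below, PairerSketch is a hypothetical stand-in that mirrors the documented members (frame_a, frame_b, pairs()), and AllPairsPairer is an invented subclass that yields every left/right combination. Neither name exists in the library. A real custom pairer would presumably subclass datamatch.pairers.BasePairer instead.

```python
from abc import ABC, abstractmethod
from itertools import product

import pandas as pd


class PairerSketch(ABC):
    """Hypothetical stand-in mirroring the documented BasePairer interface."""

    @property
    @abstractmethod
    def frame_a(self):
        """Left set of records."""

    @property
    @abstractmethod
    def frame_b(self):
        """Right set of records."""

    @abstractmethod
    def pairs(self):
        """Iterator over (left record, right record) tuples."""


class AllPairsPairer(PairerSketch):
    """Hypothetical pairer that yields every left/right combination."""

    def __init__(self, dfa, dfb):
        self._dfa = dfa
        self._dfb = dfb

    @property
    def frame_a(self):
        return self._dfa

    @property
    def frame_b(self):
        return self._dfb

    def pairs(self):
        # DataFrame.iterrows() yields (row index, pandas.Series) tuples,
        # which matches the record shape described for pairs().
        return product(self._dfa.iterrows(), self._dfb.iterrows())


dfa = pd.DataFrame({"name": ["alice", "bob"]}, index=[10, 11])
dfb = pd.DataFrame({"name": ["alicia"]}, index=[20])

pairer = AllPairsPairer(dfa, dfb)
for (idx_a, row_a), (idx_b, row_b) in pairer.pairs():
    print(idx_a, row_a["name"], idx_b, row_b["name"])
```

Each element unpacks as ((index_a, series_a), (index_b, series_b)), so code consuming pairs() can read both the row label and the row data of each side.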
- class datamatch.pairers.MatchPairer(dfa, dfb, index)
Pair records from 2 frames using the provided index
- Parameters
dfa (pandas.DataFrame) – The left dataset
dfb (pandas.DataFrame) – The right dataset
index (sub-class of BaseIndex) – The index to divide datasets into buckets
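To show why dividing the datasets into buckets matters, here is a rough pandas-only sketch of bucketed pairing across 2 frames. It assumes the simplest possible bucketing rule, equality on a single column, and assumes that only records in the same bucket are paired. The helper bucketed_match_pairs and the sample data are hypothetical and are not part of datamatch. The actual bucketing is decided by whichever BaseIndex sub-class is passed in.

```python
from itertools import product

import pandas as pd


def bucketed_match_pairs(dfa, dfb, column):
    """Yield cross-frame pairs whose `column` values are equal.

    Hypothetical helper: stands in for an index that buckets on one column.
    """
    # Map each bucket key to the right-hand rows that fall in that bucket.
    groups_b = {key: grp for key, grp in dfb.groupby(column)}
    for key, grp_a in dfa.groupby(column):
        grp_b = groups_b.get(key)
        if grp_b is None:
            continue  # no right-hand records share this bucket
        yield from product(grp_a.iterrows(), grp_b.iterrows())


dfa = pd.DataFrame({"city": ["NOLA", "BR"], "name": ["alice", "bob"]})
dfb = pd.DataFrame(
    {"city": ["NOLA", "NOLA", "LA"], "name": ["alicia", "al", "rob"]}
)

index_pairs = [
    (idx_a, idx_b)
    for (idx_a, _), (idx_b, _) in bucketed_match_pairs(dfa, dfb, "city")
]
print(index_pairs)
```

Without bucketing, the 2 x 3 frames above would produce 6 candidate pairs. With bucketing on city, only the records that share a city are compared.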
- class datamatch.pairers.DeduplicatePairer(df, index)
Pairs records from a single frame to deduplicate
As this class is only initialized with a single frame, both frame_a and frame_b return this same frame.
- Parameters
df (pandas.DataFrame) – The dataset to deduplicate
index (sub-class of BaseIndex) – The index to divide datasets into buckets
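A comparable sketch for the single-frame case follows. It again assumes that buckets are defined by equality on one column and that pairs are drawn only within a bucket. The helper bucketed_dedup_pairs and the sample data are hypothetical and do not come from the library.

```python
from itertools import combinations

import pandas as pd


def bucketed_dedup_pairs(df, column):
    """Yield pairs of distinct rows of `df` that share a `column` value.

    Hypothetical helper: stands in for an index that buckets on one column.
    """
    for _, bucket in df.groupby(column):
        # combinations() never pairs a row with itself and emits each
        # unordered pair once.
        yield from combinations(bucket.iterrows(), 2)


df = pd.DataFrame(
    {
        "city": ["NOLA", "BR", "NOLA", "NOLA"],
        "name": ["alice", "bob", "alicia", "alyce"],
    }
)

index_pairs = [
    (idx_a, idx_b)
    for (idx_a, _), (idx_b, _) in bucketed_dedup_pairs(df, "city")
]
print(index_pairs)
```

Both members of every pair come from the same frame, which is consistent with frame_a and frame_b returning the same frame in DeduplicatePairer.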