Matchers
A matcher matches records between two datasets, or deduplicates records when it is given only one dataset. It is also the main entry point for this library. Through its constructor, you can configure which index to use, which columns to compare and how to compare them, and so on.
- class datamatch.matchers.ThresholdMatcher(index, scorer, dfa, dfb=None, variator=None, filters=[], show_progress=False)
Matches records by computing a similarity score for each pair and discarding those that fall below a threshold.
This matcher does not require any training data, so it is well suited to small datasets or to cases where training data is not available.
If it is given two datasets, it tries to match records between them. If it is given only one dataset, it attempts to detect duplicates instead.
- Parameters
index (sub-class of datamatch.indices.BaseIndex) – The index to divide the dataset into distinct buckets.
scorer (Callable[[pandas.Series, pandas.Series], float], sub-class of datamatch.scorers.BaseScorer, or dict of similarity classes) – The scorer used to score each pair. If it is a dict, a datamatch.scorers.SimSumScorer is created from that dict and used.
dfa (pandas.DataFrame) – The left dataset to match. Its index must not contain duplicates.
dfb (pandas.DataFrame, optional) – The right dataset to match. Its index must not contain duplicates and its columns must match dfa’s. If this is not given, the matcher attempts to deduplicate dfa instead.
variator (sub-class of datamatch.variators.Variator, optional) – The variator to use.
filters (list of sub-class of datamatch.filters.BaseFilter, optional) – The list of filters to use.
show_progress (bool, optional) – Prints a tqdm progress bar to console during matching, defaults to False.
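The roles of the index, the scorer, and the threshold can be illustrated with a small standard-library sketch. This is not datamatch's implementation: `records`, `bucket_key`, `score`, and `match` are hypothetical names, a plain dict stands in for the DataFrame, and `difflib.SequenceMatcher` stands in for a similarity class.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records keyed by a unique id (stand-in for a DataFrame index).
records = {
    1: {"first": "john", "last": "smith"},
    2: {"first": "jon", "last": "smith"},
    3: {"first": "mary", "last": "jones"},
    4: {"first": "marie", "last": "jones"},
    5: {"first": "john", "last": "jones"},
}


def bucket_key(rec):
    # Stand-in for an index: only records sharing a last name are compared.
    return rec["last"]


def score(a, b):
    # Stand-in for a scorer: mean string similarity over the compared columns.
    cols = ("first", "last")
    return sum(SequenceMatcher(None, a[c], b[c]).ratio() for c in cols) / len(cols)


def match(records, threshold):
    # Divide the records into buckets so that pairs are only formed within a bucket.
    buckets = {}
    for rid, rec in records.items():
        buckets.setdefault(bucket_key(rec), []).append(rid)
    # Score every pair inside each bucket and discard pairs below the threshold.
    pairs = []
    for ids in buckets.values():
        for i, j in combinations(sorted(ids), 2):
            s = score(records[i], records[j])
            if s >= threshold:
                pairs.append((round(s, 3), i, j))
    return sorted(pairs, reverse=True)  # highest score first


pairs = match(records, threshold=0.7)
```

Because records 1 and 3 fall into different buckets, they are never scored against each other, which is the point of an index: it keeps the number of comparisons manageable.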
- get_index_clusters_within_thresholds(lower_bound=0.7, upper_bound=1)
Returns index clusters with similarity scores within the specified thresholds.
- get_clusters_within_threshold(lower_bound=0.7, upper_bound=1, include_exact_matches=True)
Returns all clusters between a lower bound and upper bound.
- Returns
A multi-indexed frame that contains all matched clusters.
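One common way to turn scored pairs into clusters is to treat every pair within the bounds as a link and take the connected components. The sketch below assumes that interpretation and inclusive bounds; neither is confirmed by this page, and `cluster_pairs` and `scored` are hypothetical names rather than part of datamatch.

```python
def cluster_pairs(pairs, lower_bound=0.7, upper_bound=1.0):
    """Group ids linked by pairs whose score lies in [lower_bound, upper_bound]."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link the two ids of every pair whose score is within the bounds.
    for a, b, s in pairs:
        if lower_bound <= s <= upper_bound:
            parent[find(a)] = find(b)

    # Collect ids by their root to obtain the clusters.
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical scored pairs: (left id, right id, similarity score).
scored = [("a", "b", 0.95), ("b", "c", 0.80), ("d", "e", 0.72), ("e", "f", 0.40)]
clusters = cluster_pairs(scored)
```

Note that "a" and "c" end up in the same cluster even though they were never scored against each other directly; they are connected through "b".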
- get_index_pairs_within_thresholds(lower_bound=0.7, upper_bound=1)
Returns index pairs with similarity scores within specified thresholds.
- get_sample_pairs(sample_counts=5, lower_bound=0.7, upper_bound=1, step=0.05, include_exact_matches=True)
Returns samples of record pairs for each range of similarity scores.
- Parameters
sample_counts (int) – The number of samples in each range.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.
step (float) – The width of each range.
include_exact_matches (bool) – Includes pairs with score = 1.0.
- Returns
A multi-indexed frame that contains only the samples for each range.
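The interaction of sample_counts, the two bounds, and step can be sketched in plain Python. The exact range boundaries and ordering used by datamatch are not stated on this page, so the half-open ranges, the separate bucket for exact matches, and the names `sample_by_score_range` and `seed` below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_by_score_range(scored, sample_counts=5, lower_bound=0.7,
                          upper_bound=1.0, step=0.05,
                          include_exact_matches=True, seed=0):
    """Return {(low, high): [pair, ...]} with at most sample_counts pairs per range."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    buckets = defaultdict(list)
    for a, b, s in scored:
        if s < lower_bound or s > upper_bound:
            continue  # outside the requested bounds
        if s == 1.0:
            # Exact matches get their own bucket and can be switched off.
            if include_exact_matches:
                buckets[(1.0, 1.0)].append((a, b, s))
            continue
        # Index of the step-wide range containing s (rounded to dodge float noise).
        k = int(round((s - lower_bound) / step, 9))
        low = round(lower_bound + k * step, 9)
        high = round(min(low + step, upper_bound), 9)
        buckets[(low, high)].append((a, b, s))
    # Highest range first, each range down-sampled to sample_counts pairs.
    return {bounds: rng.sample(items, min(sample_counts, len(items)))
            for bounds, items in sorted(buckets.items(), reverse=True)}


# Hypothetical scored pairs: (left id, right id, similarity score).
scored = [(1, 2, 1.0), (3, 4, 0.97), (5, 6, 0.93), (7, 8, 0.91),
          (9, 10, 0.78), (11, 12, 0.73), (13, 14, 0.72), (15, 16, 0.5)]
samples = sample_by_score_range(scored, sample_counts=1)
```

Sampling a few pairs per range lets a reviewer inspect how match quality degrades as the score drops, which helps when choosing a match threshold.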
- get_all_pairs(lower_bound=0.7, upper_bound=1, include_exact_matches=True)
Returns all matching pairs between a lower bound and an upper bound.
- Returns
A multi-indexed dataframe that contains all matching pairs between the lower bound and the upper bound.
- save_pairs_to_excel(name, match_threshold, sample_counts=5, lower_bound=0.7, step=0.05, include_exact_matches=True)
Saves matching pairs to an Excel file.
This will create an Excel file with 3 sheets:
Sample pairs: sample pairs for each score range, similar to the output of get_sample_pairs().
All pairs: all pairs that score higher than lower_bound, ordered by similarity score.
Decision: the selected threshold and how many pairs are counted as matched.
- Parameters
name (str) – The Excel file to save to.
match_threshold (float) – The score above which a pair is considered a match.
sample_counts (int) – The number of samples per score range in the “Sample pairs” sheet.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
step (float) – The width of each range in the “Sample pairs” sheet.
include_exact_matches (bool) – Includes pairs with score = 1.0.
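The content of the "All pairs" and "Decision" sheets can be approximated without writing any file. The sketch below is an illustration under assumptions, not the library's code: `decision_summary` and the dictionary keys are hypothetical, and a score equal to match_threshold is treated as a match, which this page does not specify.

```python
def decision_summary(scored, match_threshold, lower_bound=0.7):
    """Return (all_pairs, decision) mirroring the 'All pairs' and 'Decision' sheets."""
    # "All pairs": every pair at or above lower_bound, highest score first.
    all_pairs = sorted((p for p in scored if p[2] >= lower_bound),
                       key=lambda p: p[2], reverse=True)
    # "Decision": the chosen threshold and the number of pairs it accepts.
    matched = [p for p in all_pairs if p[2] >= match_threshold]
    decision = {"match_threshold": match_threshold, "matched_pairs": len(matched)}
    return all_pairs, decision


# Hypothetical scored pairs: (left id, right id, similarity score).
scored = [("a", "b", 0.97), ("c", "d", 0.84), ("e", "f", 0.76), ("g", "h", 0.55)]
all_pairs, decision = decision_summary(scored, match_threshold=0.8)
```

Here three pairs clear lower_bound and appear in the listing, while only two of them clear match_threshold and count as matched.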
- save_clusters_to_excel(name, match_threshold, lower_bound=0.7, include_exact_matches=True)
Saves matched clusters to an Excel file.
This will create an Excel file with two sheets:
All clusters: all clusters that score higher than lower_bound, ordered by score.
Decision: selected threshold and how many pairs are counted as matched.
- Parameters
name (str) – The Excel file to save to.
match_threshold (float) – The score above which a pair is considered a match.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
include_exact_matches (bool) – Includes clusters with score = 1.0.