Matchers

A matcher matches records across two datasets, or deduplicates records when it is given only one. It is also the main entry point of this library. Through its constructor you configure which index to use, which columns to compare, how each column should be compared, and so on.

class datamatch.matchers.ThresholdMatcher(index, scorer, dfa, dfb=None, variator=None, filters=[], show_progress=False)

Matches records by computing a similarity score for each pair and discarding those that fall below a threshold.

This matcher does not require any training data, so it is well suited to small datasets or to cases where training data is not available.

If it is given two datasets then it will try to match records between them. If given only one dataset then it attempts to detect duplicates instead.

Parameters

index (sub-class of datamatch.indices.BaseIndex) – The index to divide the dataset into distinct buckets.
scorer (Callable[[pandas.Series, pandas.Series], float] or sub-class of datamatch.scorers.BaseScorer or dict of similarity classes) – The scorer used to score each pair. If it is a dict, a datamatch.scorers.SimSumScorer is created from that dict and used.
dfa (pandas.DataFrame) – The left dataset to match. Its index must not contain duplicates.
dfb (pandas.DataFrame, optional) – The right dataset to match. Its index must not contain duplicates and its columns must match dfa’s. If this is not given, the matcher will attempt to deduplicate dfa instead.
variator (sub-class of datamatch.variators.Variator, optional) – The variator to use.
filters (list of sub-class of datamatch.filters.BaseFilter, optional) – The list of Filters to use.
show_progress (bool, optional) – Prints a tqdm progress bar to console during matching, defaults to False.
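The idea behind threshold matching can be sketched without the library. The snippet below is an illustration only, not datamatch's implementation: `bucket_by` stands in for an index, `score` stands in for a scorer (it uses `difflib.SequenceMatcher` rather than any datamatch similarity class), and the records are made up. It assumes pairs are only compared within the same bucket and that the threshold is inclusive.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records keyed by a unique index (deduplication case: one dataset).
records = {
    1: {"first_name": "john", "last_name": "smith", "agency": "nopd"},
    2: {"first_name": "jon", "last_name": "smith", "agency": "nopd"},
    3: {"first_name": "mary", "last_name": "jones", "agency": "nopd"},
    4: {"first_name": "john", "last_name": "smith", "agency": "brpd"},
}


def bucket_by(records, column):
    """Stand-in for an index: group record ids by the value of one column."""
    buckets = defaultdict(list)
    for idx, rec in records.items():
        buckets[rec[column]].append(idx)
    return buckets


def score(a, b, columns=("first_name", "last_name")):
    """Stand-in scorer: mean string similarity over the compared columns."""
    sims = [SequenceMatcher(None, a[c], b[c]).ratio() for c in columns]
    return sum(sims) / len(sims)


def match(records, column, threshold):
    """Score pairs within each bucket; keep those at or above the threshold."""
    kept = []
    for ids in bucket_by(records, column).values():
        for i, j in combinations(sorted(ids), 2):
            s = score(records[i], records[j])
            if s >= threshold:
                kept.append((s, i, j))
    return sorted(kept, reverse=True)


pairs = match(records, "agency", threshold=0.7)
```

Records 1 and 4 are identical on both name columns but are never compared, because they fall into different buckets. This is the trade-off an index makes: fewer comparisons in exchange for possibly missing matches across buckets.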

get_index_clusters_within_thresholds(lower_bound=0.7, upper_bound=1)

Returns index clusters with similarity scores within the specified thresholds.

Parameters

lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A list of clusters, where each cluster is a set of indices.

Return type

list of frozenset
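One common way to turn matching pairs into clusters is to treat each pair as an edge and take the connected components. The sketch below does this with a small union-find. It is an assumption about how clusters relate to pairs, not a description of the library's internals, and `cluster_pairs` is a hypothetical helper name.

```python
def cluster_pairs(pairs):
    """Group index pairs into clusters (connected components) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return [frozenset(members) for members in clusters.values()]


# Pairs (1, 2) and (2, 5) share index 2, so they merge into one cluster.
clusters = cluster_pairs([(1, 2), (2, 5), (3, 4)])
```

Under this assumption, indices 1 and 5 end up in the same cluster even though they were never matched directly, which is worth keeping in mind when choosing a low lower_bound.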

get_clusters_within_threshold(lower_bound=0.7, upper_bound=1)

Returns all clusters between a lower bound and upper bound.

Parameters

lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A multi-indexed frame that contains all matched clusters.

Return type

pandas.DataFrame

get_index_pairs_within_thresholds(lower_bound=0.7, upper_bound=1)

Returns index pairs with similarity scores within specified thresholds.

Parameters

lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A list of pairs of indices of matching records.

Return type

list of tuple

get_sample_pairs(sample_counts=5, lower_bound=0.7, upper_bound=1, step=0.05)

Returns samples of record pairs for each range of similarity scores.

Parameters

sample_counts (int) – The number of samples in each range.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.
step (float) – The width of each range.

Returns

A multi-indexed frame that contains only the samples for each range.

Return type

pandas.DataFrame
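To make the roles of sample_counts and step concrete, here is a minimal sketch of sampling per score range using plain Python. It is not the library's code: `sample_by_range`, the `seed` argument, the range labels, and the choice to put a score equal to upper_bound in the last range are all assumptions made for illustration, and the result is a dict rather than a multi-indexed frame.

```python
import random


def sample_by_range(scored_pairs, sample_counts=5, lower_bound=0.7,
                    upper_bound=1.0, step=0.05, seed=0):
    """Bin (score, pair) tuples into ranges of width `step`, then sample each bin."""
    rng = random.Random(seed)
    n_bins = max(1, round((upper_bound - lower_bound) / step))
    bins = {i: [] for i in range(n_bins)}
    for score, pair in scored_pairs:
        if not lower_bound <= score <= upper_bound:
            continue  # outside the requested bounds
        # A score equal to upper_bound is placed in the last bin.
        i = min(int((score - lower_bound) / step), n_bins - 1)
        bins[i].append((score, pair))
    samples = {}
    for i, items in bins.items():
        lo = lower_bound + i * step
        label = f"{lo:.2f}-{lo + step:.2f}"
        # Take at most `sample_counts` pairs from each range.
        samples[label] = rng.sample(items, min(sample_counts, len(items)))
    return samples


# Hypothetical scored pairs: (similarity score, (left index, right index)).
scored = [
    (0.72, (1, 2)), (0.73, (3, 4)), (0.74, (5, 6)),
    (0.91, (7, 8)), (0.50, (9, 10)), (1.0, (11, 12)),
]
samples = sample_by_range(scored, sample_counts=2)
```

Reviewing a handful of pairs from each range is a practical way to pick a match threshold: the range where samples stop looking like true matches suggests where the cut-off should be.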

get_all_pairs(lower_bound=0.7, upper_bound=1)

Returns all matching pairs between a lower bound and an upper bound.

Parameters

lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A multi-indexed frame that contains all matching pairs within the bounds.

Return type

pandas.DataFrame

save_pairs_to_excel(name, match_threshold, sample_counts=5, lower_bound=0.7, step=0.05)

Saves matching pairs to an Excel file.

This will create an Excel file with three sheets:

Sample pairs: sample pairs for each score range, similar to output of get_sample_pairs().
All pairs: all pairs that score higher than lower_bound ordered by the similarity score.
Decision: selected threshold and how many pairs are counted as matched.

Parameters

name (str) – The Excel file to save to.
match_threshold (float) – The score above which a pair is considered a match.
sample_counts (int) – The number of samples per score range in the “Sample pairs” sheet.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.
step (float) – The width of each range in the “Sample pairs” sheet.

Return type

None

save_clusters_to_excel(name, match_threshold, lower_bound=0.7)

Saves matched clusters to an Excel file.

This will create an Excel file with two sheets:

All clusters: all clusters that score higher than lower_bound, ordered by score.
Decision: selected threshold and how many pairs are counted as matched.

Parameters

name (str) – The Excel file to save to.
match_threshold (float) – The score above which a pair is considered a match.
lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

Return type

None

print_decision(match_threshold)

Prints number and percentage of matched pairs for selected threshold.

Parameters

match_threshold (float) – The score above which a pair is considered a match.

Return type

None
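The kind of summary described here can be sketched as a count and a percentage over a list of scores. This is an illustration under assumptions, not the library's output format: `decision_summary` is a hypothetical helper, the threshold is treated as inclusive, and the percentage is taken over the scored pairs, which may differ from the denominator the library uses.

```python
def decision_summary(scores, match_threshold):
    """Count the pairs at or above the threshold and their share of all scored pairs."""
    matched = sum(1 for s in scores if s >= match_threshold)
    # Assumption: the denominator is the number of scored pairs.
    percentage = 100 * matched / len(scores) if scores else 0.0
    return matched, percentage


# Hypothetical similarity scores for four candidate pairs.
matched, percentage = decision_summary([0.72, 0.81, 0.93, 0.97], match_threshold=0.9)
print(f"{matched} matched pairs ({percentage:.1f}% of scored pairs)")
```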