Matchers

A matcher matches records between two datasets, or deduplicates records when it is given only one dataset. It is also the main entry point of this library. Through its constructor, you can configure which index to use, which columns to compare, how each column should be compared, and so on.

class datamatch.matchers.ThresholdMatcher(index, scorer, dfa, dfb=None, variator=None, filters=[], show_progress=False)

Matches records by computing a similarity score for each pair and discarding those that fall below a threshold.

This matcher does not require any training data, so it is well suited to small datasets or to cases where training data is not available.

If it is given two datasets then it will try to match records between them. If given only one dataset then it attempts to detect duplicates instead.
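The idea behind threshold matching can be illustrated without the library. The sketch below is not datamatch's implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity function, compares every candidate pair instead of using an index, and the helper name `pairs_above_threshold` is hypothetical. It scores each pair and keeps those at or above a threshold. When only one list is given, it compares the list against itself to find duplicates.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similarity(a, b):
    # Stand-in similarity in [0, 1]; the real library lets you configure this.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairs_above_threshold(names_a, names_b=None, threshold=0.7):
    # Hypothetical helper: returns (index_a, index_b, score) for kept pairs.
    if names_b is None:
        # One dataset: compare each record with every later record (deduplication).
        candidates = [(i, j, names_a[i], names_a[j])
                      for i, j in combinations(range(len(names_a)), 2)]
    else:
        # Two datasets: compare every record in A with every record in B.
        candidates = [(i, j, names_a[i], names_b[j])
                      for i, j in product(range(len(names_a)), range(len(names_b)))]
    scored = [(i, j, similarity(x, y)) for i, j, x, y in candidates]
    # Discard pairs that fall below the threshold.
    return [(i, j, s) for i, j, s in scored if s >= threshold]


dedup = pairs_above_threshold(["John Smith", "Jon Smith", "Mary Jones"])
linked = pairs_above_threshold(["John Smith"], ["John Smith", "Alice Brown"])
```

Here `dedup` holds candidate duplicates within a single list, while `linked` holds matches between two lists.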

Parameters
get_index_clusters_within_thresholds(lower_bound=0.7, upper_bound=1)

Returns index clusters with similarity scores within the specified thresholds.

Parameters
  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A list of clusters, where each cluster is a set of indices.

Return type

list of frozenset
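One way to picture how matched pairs become clusters is to treat each pair as an edge and take the connected components. Whether the library groups pairs exactly this way is an assumption. The function `clusters_from_pairs` below is a hypothetical, standard-library-only sketch that returns the same shape as documented above: a list of frozensets of indices.

```python
def clusters_from_pairs(pairs):
    # Hypothetical helper: union-find over index pairs.
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Collect members under their group representative.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [frozenset(members) for members in groups.values()]


clusters = clusters_from_pairs([(1, 2), (2, 3), (7, 8)])
```

In this sketch, indices 1, 2 and 3 end up in one cluster because the pairs (1, 2) and (2, 3) share index 2, even though (1, 3) was never listed as a pair.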

get_clusters_within_threshold(lower_bound=0.7, upper_bound=1)

Returns all clusters between a lower bound and upper bound.

Parameters
  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A multi-indexed frame that contains all matched clusters.

Return type

pandas.DataFrame

get_index_pairs_within_thresholds(lower_bound=0.7, upper_bound=1)

Returns index pairs with similarity scores within specified thresholds.

Parameters
  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A list of pairs of indices of matching records.

Return type

list of tuple

get_sample_pairs(sample_counts=5, lower_bound=0.7, upper_bound=1, step=0.05)

Returns samples of record pairs for each range of similarity scores.

Parameters
  • sample_counts (int) – The number of samples in each range.

  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

  • step (float) – The width of each range.

Returns

A multi-indexed frame that contains only the samples for each range.

Return type

pandas.DataFrame
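The binning described above can be sketched as follows. This is an illustration under assumptions, not the library's code: the helper `sample_pairs_by_range` and its `seed` argument are hypothetical, ranges are taken as half-open intervals of width `step` starting at `lower_bound`, and a score equal to `upper_bound` is placed in the last range. It returns plain lists rather than a multi-indexed frame.

```python
import math
import random


def sample_pairs_by_range(scored_pairs, sample_counts=5, lower_bound=0.7,
                          upper_bound=1, step=0.05, seed=0):
    # Hypothetical helper. scored_pairs is a list of (pair, score) tuples.
    n_bins = max(1, math.ceil(round((upper_bound - lower_bound) / step, 9)))
    bins = [[] for _ in range(n_bins)]
    for pair, score in scored_pairs:
        if not lower_bound <= score <= upper_bound:
            continue  # outside the requested thresholds
        # Round before flooring to absorb floating-point noise such as 2.9999999.
        idx = math.floor(round((score - lower_bound) / step, 9))
        bins[min(idx, n_bins - 1)].append((pair, score))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Take at most sample_counts pairs from each range.
    return [rng.sample(b, min(sample_counts, len(b))) for b in bins]


scored = [((0, 1), 0.72), ((0, 2), 0.74), ((1, 2), 0.91), ((3, 4), 0.5), ((5, 6), 1.0)]
samples = sample_pairs_by_range(scored, sample_counts=1)
```

Sampling per range rather than over all pairs makes it easier to review borderline scores, which would otherwise be crowded out by whichever range holds the most pairs.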

get_all_pairs(lower_bound=0.7, upper_bound=1)

Returns all matching pairs between a lower bound and an upper bound.

Parameters
  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • upper_bound (float) – The upper threshold above which pairs won’t be included, defaults to 1.

Returns

A multi-indexed frame that contains all matching pairs within the bounds.

Return type

pandas.DataFrame

save_pairs_to_excel(name, match_threshold, sample_counts=5, lower_bound=0.7, step=0.05)

Saves matching pairs to an Excel file.

This will create an Excel file with three sheets:

  • Sample pairs: sample pairs for each score range, similar to output of get_sample_pairs().

  • All pairs: all pairs that score higher than lower_bound ordered by the similarity score.

  • Decision: selected threshold and how many pairs are counted as matched.

Parameters
  • name (str) – The Excel file to save to.

  • match_threshold (float) – The score above which a pair is considered a match.

  • sample_counts (int) – The number of samples per score range in the “Sample pairs” sheet.

  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

  • step (float) – The width of each range in the “Sample pairs” sheet.

Return type

None

save_clusters_to_excel(name, match_threshold, lower_bound=0.7)

Saves matched clusters to an Excel file.

This will create an Excel file with two sheets:

  • All clusters: all clusters that score higher than lower_bound, ordered by score.

  • Decision: selected threshold and how many pairs are counted as matched.

Parameters
  • name (str) – The Excel file to save to.

  • match_threshold (float) – The score above which a pair is considered a match.

  • lower_bound (float) – The lower threshold below which pairs won’t be included, defaults to 0.7.

Return type

None

print_decision(match_threshold)

Prints number and percentage of matched pairs for selected threshold.

Parameters

match_threshold (float) – The score above which a pair is considered a match.

Return type

None
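As a rough illustration of what such a summary involves, the sketch below counts the pairs at or above a threshold and expresses them as a percentage of all scored pairs. The helper `decision_summary` is hypothetical, and both the denominator (all scored pairs) and the inclusive comparison are assumptions; the library may define them differently.

```python
def decision_summary(scores, match_threshold):
    # Hypothetical helper: returns (matched_count, percentage_of_all_pairs).
    matched = sum(1 for s in scores if s >= match_threshold)
    # Guard against an empty score list to avoid dividing by zero.
    percent = 100.0 * matched / len(scores) if scores else 0.0
    return matched, percent


matched, percent = decision_summary([0.95, 0.88, 0.72, 0.65], match_threshold=0.8)
print(f"{matched} matched pairs ({percent:.1f}% of scored pairs)")
```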