Similarities
When given a pair of values, a similarity class produces a similarity score that ranges between 0 and 1. A similarity score of 1 means the two values are completely identical while 0 means there are no similarities.
Note that these classes only compute similarity scores between scalar values or native Python objects such as
datetime.datetime, not the entire row (which is handled by ThresholdMatcher).
- class datamatch.similarities.StringSimilarity
Computes a similarity score between two strings using Levenshtein distance.
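As a rough illustration of how an edit distance can be turned into a score between 0 and 1, the sketch below normalizes the Levenshtein distance by the length of the longer string. This normalization is an assumption made for the example; the exact formula used by StringSimilarity is not stated here, and the helper names levenshtein and string_sim are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def string_sim(a: str, b: str) -> float:
    """Hypothetical normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this normalization, identical strings score 1 and strings that share no characters in any position score 0.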
- class datamatch.similarities.JaroWinklerSimilarity(prefix_weight=0.1)
Similar to StringSimilarity but gives extra weight to common prefixes.
This class is very good at matching people's names, because mistaking the first letter of a person's name should be a rare event.
- Parameters
prefix_weight (float) – The extra weight given to common prefixes. Defaults to 0.1.
- class datamatch.similarities.AbsoluteNumericalSimilarity(d_max)
Computes a similarity score between two numbers, extrapolated from a maximum absolute difference.
The maximum absolute difference d_max (greater than 0) is the largest tolerated difference between two numbers, regardless of their actual values. If the difference between two values a and b is less than d_max, the similarity score is 1.0 - abs(a - b) / d_max. Otherwise, the score is 0.
Implementation follows the strategy for numerical values given in the Data Matching book [1].
- Parameters
d_max (float) – The maximum absolute difference.
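The rule above can be sketched as a plain function. This is a minimal illustration of the stated formula, not the library's source; the function name absolute_numerical_sim is hypothetical, and raising ValueError for a non-positive d_max is an assumption.

```python
def absolute_numerical_sim(a: float, b: float, d_max: float) -> float:
    """Score two numbers by their absolute difference relative to d_max."""
    if d_max <= 0:
        # Assumption: the text only says d_max must be greater than 0.
        raise ValueError("d_max must be greater than 0")
    diff = abs(a - b)
    if diff < d_max:
        return 1.0 - diff / d_max
    return 0.0  # difference is at or beyond the tolerated maximum
```

For example, with d_max = 4 the values 10 and 12 differ by half the tolerated maximum, so the score is 0.5, while 10 and 14 score 0.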
- class datamatch.similarities.RelativeNumericalSimilarity(pc_max)
Computes a similarity score between two numbers, extrapolated from a maximum percentage difference.
This class serves a similar purpose to AbsoluteNumericalSimilarity but is more dependent on the actual values being compared.
The percentage difference pc between two values a and b is defined as abs(a - b) / max(abs(a), abs(b)) * 100.
The maximum percentage difference pc_max (0 < pc_max < 100) is the largest tolerated percentage difference between the two numbers. If pc is less than pc_max, the similarity score is 1.0 - pc / pc_max. Otherwise, the score is 0.
Implementation follows the strategy for numerical values given in the Data Matching book [1].
- Parameters
pc_max (int) – The maximum percentage difference.
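The percentage-based rule can be sketched in the same way. This is an illustration of the formulas above, not the library's source; the function name relative_numerical_sim is hypothetical, and returning 1.0 for equal inputs (which also avoids dividing by zero when both values are 0) is an assumption the text does not cover.

```python
def relative_numerical_sim(a: float, b: float, pc_max: float) -> float:
    """Score two numbers by their percentage difference relative to pc_max."""
    if not 0 < pc_max < 100:
        raise ValueError("pc_max must satisfy 0 < pc_max < 100")
    if a == b:
        # Assumption: equal values are identical; this also covers a == b == 0,
        # where the percentage difference would otherwise divide by zero.
        return 1.0
    pc = abs(a - b) / max(abs(a), abs(b)) * 100
    if pc < pc_max:
        return 1.0 - pc / pc_max
    return 0.0  # percentage difference is at or beyond the tolerated maximum
```

Unlike the absolute variant, the same gap counts for less as the values grow: 100 and 90 differ by 10 percent, while 1000 and 990 differ by only 1 percent.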
- class datamatch.similarities.DateSimilarity(d_max=30)
Computes a similarity score between two dates, extrapolated from a maximum absolute difference in days.
The maximum absolute difference in days d_max is the largest tolerated difference in days between two dates. As with AbsoluteNumericalSimilarity, if two dates a and b are less than d_max days apart, the similarity score is 1 - abs(a - b) / d_max, where the difference is measured in days.
If, however, the dates are d_max or more days apart, two alternative strategies are employed to hedge against typos:
- If the year values are the same but the month and day values are swapped, the similarity score is 0.5.
- As a last resort, each date is written in YYYYMMDD format and the similarity score is computed between the two strings.
Implementation follows the strategy for date/time values given in the Data Matching book [2].
- Parameters
d_max (int) – Dates that are less than this number of days apart have a similarity score of 1 - <difference in days> / d_max. For dates that are further apart, this class uses alternative methods to compute the similarity score, to hedge against typos. Defaults to 30.
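The three-step strategy described for this class can be sketched as follows. This is an illustration under stated assumptions rather than the library's source: the function name date_sim is hypothetical, and because the text does not say which string comparison the last resort uses, difflib.SequenceMatcher from the standard library stands in for it.

```python
from datetime import datetime
from difflib import SequenceMatcher


def date_sim(a: datetime, b: datetime, d_max: int = 30) -> float:
    """Score two dates; fall back to typo heuristics when they are far apart."""
    days_apart = abs(a - b).days
    if days_apart < d_max:
        return 1.0 - days_apart / d_max
    # Typo heuristic 1: same year, month and day swapped.
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return 0.5
    # Typo heuristic 2: compare the YYYYMMDD strings.
    # Assumption: SequenceMatcher is a stand-in for the unspecified string measure.
    return SequenceMatcher(
        None, a.strftime("%Y%m%d"), b.strftime("%Y%m%d")
    ).ratio()
```

For instance, 2020-03-10 and 2020-10-03 are far more than 30 days apart, but since only the month and day are swapped the sketch returns 0.5 instead of a low score.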
- sim(a, b)
Returns a similarity score.
- Parameters
a (datetime.datetime) – The left date.
b (datetime.datetime) – The right date.
- Returns
The similarity score.
- Return type
- [1]
Peter Christen. “5.12 Numerical Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 121-122. Springer, 2012.
- [2]
Peter Christen. “5.13 Date, Age and Time Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 122-123. Springer, 2012.