Similarities
When given a pair of values, a similarity class produces a similarity score that ranges between 0 and 1. A similarity score of 1 means the two values are completely identical while 0 means there are no similarities.
Note that these classes only compute similarity scores between scalar values or native Python objects such as datetime.datetime, not between entire rows (which is handled by ThresholdMatcher).
- class datamatch.similarities.StringSimilarity
Computes a similarity score between two strings using Levenshtein distance.
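As an illustration of how a Levenshtein distance can be turned into a score between 0 and 1, here is a minimal sketch. The function names and the normalisation (dividing by the longer string's length) are assumptions made for this example; the class itself may normalise differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this normalisation, identical strings score 1 and strings with no characters in common score 0, which matches the range described at the top of this page.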
- class datamatch.similarities.JaroWinklerSimilarity(prefix_weight=0.1)
Similar to StringSimilarity but gives extra weight to common prefixes.
This class works well for matching people’s names because a mistake in the first letter of a person’s name should be a rare event.
- Parameters
prefix_weight (float) – The extra weight given to common prefixes, defaults to 0.1.
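To show what the prefix weight does, the sketch below applies the standard Winkler adjustment to an already computed Jaro score. The function name, the `max_prefix` cap of 4 characters, and the idea of passing the Jaro score in are assumptions for illustration only, not a description of this class's internals.

```python
def winkler_adjust(jaro_score: float, a: str, b: str,
                   prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a Jaro score according to the length of the shared prefix."""
    prefix_len = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix_len == max_prefix:
            break
        prefix_len += 1
    # Standard Winkler formula: the bonus shrinks as the Jaro score nears 1.
    return jaro_score + prefix_len * prefix_weight * (1.0 - jaro_score)
```

Two strings that share no prefix keep their Jaro score unchanged, while a longer shared prefix or a larger `prefix_weight` moves the score closer to 1.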
- class datamatch.similarities.AbsoluteNumericalSimilarity(d_max)
Computes similarity score between two numbers, extrapolated from a maximum absolute difference.
Maximum absolute difference d_max (greater than 0) is the maximum tolerated difference between two numbers regardless of their actual values. If the difference between two values a and b is less than d_max, then the similarity score is 1.0 - abs(a - b) / d_max. Otherwise, the score is 0.
The implementation follows the strategy for numerical values given in the Data Matching book [1].
- Parameters
d_max (float) – The maximum absolute difference.
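The formula above can be sketched as a plain function. The name `absolute_numerical_similarity` is hypothetical and only mirrors the description; it is not the class's actual code.

```python
def absolute_numerical_similarity(a: float, b: float, d_max: float) -> float:
    """Score falls linearly from 1 to 0 as abs(a - b) approaches d_max."""
    if d_max <= 0:
        raise ValueError("d_max must be greater than 0")
    diff = abs(a - b)
    if diff < d_max:
        return 1.0 - diff / d_max
    return 0.0
```

For example, with d_max = 5 the values 10 and 12 differ by 2, giving 1.0 - 2 / 5 = 0.6, while 10 and 15 differ by exactly d_max and score 0.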
- class datamatch.similarities.RelativeNumericalSimilarity(pc_max)
Computes similarity score between two numbers, extrapolated from a maximum percentage difference.
This class serves a similar purpose to AbsoluteNumericalSimilarity but is more dependent on the actual values being compared.
Percentage difference pc between two values a and b is defined as abs(a - b) / max(abs(a), abs(b)) * 100.
Maximum percentage difference pc_max (0 < pc_max < 100) is the maximum tolerated percentage difference between the two numbers. If the percentage difference pc is less than pc_max, then the similarity score is 1.0 - pc / pc_max. Otherwise, the score is 0.
The implementation follows the strategy for numerical values given in the Data Matching book [1].
- Parameters
pc_max (int) – The maximum percentage difference.
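A minimal sketch of the percentage-based formula follows. The function name is hypothetical, and returning 1.0 when both values are 0 (where the percentage difference is undefined) is an assumption made here, not documented behaviour.

```python
def relative_numerical_similarity(a: float, b: float, pc_max: float) -> float:
    """Score falls linearly from 1 to 0 as the percentage difference nears pc_max."""
    if not 0 < pc_max < 100:
        raise ValueError("pc_max must satisfy 0 < pc_max < 100")
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # assumption: both values are 0, so treat them as identical
    pc = abs(a - b) / largest * 100
    if pc < pc_max:
        return 1.0 - pc / pc_max
    return 0.0
```

With pc_max = 20, the values 100 and 90 differ by 10 percent and score 0.5, whereas 10 and 9 differ by the same 10 percent and also score 0.5. An absolute threshold would treat those two pairs very differently.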
- class datamatch.similarities.DateSimilarity(d_max=30)
Computes similarity score between two dates, extrapolated from a maximum absolute difference in days.
Maximum absolute difference in days d_max is the maximum tolerated difference in days between two dates. Similar to AbsoluteNumericalSimilarity, if dates a and b are less than d_max days apart, then the similarity score is 1 - (a - b) / d_max, where (a - b) is the absolute difference in days.
If, however, (a - b) >= d_max, then two alternative strategies are employed to hedge against typos:
- If the year values are the same but the month and day values are swapped, then the similarity score is 0.5.
- The last resort is to write each date in YYYYMMDD format and compute the similarity score between the two strings.
The implementation follows the strategy for date/time values given in the Data Matching book [2].
- Parameters
d_max (int) – Dates that are less than this number of days apart will have a similarity score of 1 - <difference in days> / d_max. For dates that are further apart, this class employs alternative methods to compute the similarity score to hedge against typos. This defaults to 30.
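The three strategies described for this class can be sketched as a single function. This is an illustration under stated assumptions, not the class's code: the function name is hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for the string comparison of the last resort, since this page does not say which string similarity is used there.

```python
from datetime import datetime
from difflib import SequenceMatcher


def date_similarity(a: datetime, b: datetime, d_max: int = 30) -> float:
    """Sketch of the day-difference score with two typo fallbacks."""
    days_apart = abs((a - b).days)
    if days_apart < d_max:
        return 1.0 - days_apart / d_max
    # Fallback 1: same year, month and day swapped (e.g. 03/11 vs 11/03).
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return 0.5
    # Fallback 2: compare the YYYYMMDD strings.
    # SequenceMatcher is a stand-in chosen for this sketch.
    return SequenceMatcher(None, a.strftime("%Y%m%d"), b.strftime("%Y%m%d")).ratio()
```

For example, 2020-01-01 and 2020-01-16 are 15 days apart and score 0.5 with the default d_max, while 2020-03-11 and 2020-11-03 are far apart but score 0.5 through the swapped month/day rule.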
- sim(a, b)
Returns a similarity score.
- Parameters
a (datetime.datetime) – The left date.
b (datetime.datetime) – The right date.
- Returns
The similarity score.
- Return type
- [1]
Peter Christen. “5.12 Numerical Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 121-122. Springer, 2012.
- [2]
Peter Christen. “5.13 Date, Age and Time Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 122-123. Springer, 2012.