Similarities

When given a pair of values, a similarity class produces a similarity score that ranges between 0 and 1. A similarity score of 1 means the two values are completely identical while 0 means there are no similarities.

Note that these classes only compute similarity scores between scalar values or native Python objects such as datetime.datetime, not the entire row (which is handled by ThresholdMatcher).

class datamatch.similarities.StringSimilarity

Computes a similarity score between two strings using Levenshtein distance.

sim(a, b)

Returns a similarity score.

Parameters

a (str) – The left string.
b (str) – The right string.

Returns

The similarity score.

Return type

float
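To make the idea of a Levenshtein-based score concrete, here is a minimal pure-Python sketch. It is not the library's code: the helper names `levenshtein` and `levenshtein_sim` are hypothetical, and normalizing the distance by the length of the longer string is an assumption chosen so that the score falls between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string as the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this normalization, identical strings score 1 and strings that share no characters in common positions score 0, which matches the range described at the top of this page.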

class datamatch.similarities.JaroWinklerSimilarity(prefix_weight=0.1)

Similar to StringSimilarity but gives extra weight to common prefixes.

This class is well suited to matching people’s names, because a mistake in the first letter of a person’s name should be a rare event.

Parameters: prefix_weight (float) – The extra weight given to common prefixes, defaults to 0.1.

sim(a, b)

Returns a similarity score.

Parameters

a (str) – The left string.
b (str) – The right string.

Returns

The similarity score.

Return type

float
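The effect of prefix_weight can be illustrated with the textbook Jaro-Winkler formula. The sketch below is a plain-Python rendering of that standard definition, not necessarily how this class is implemented; the function names `jaro` and `jaro_winkler` and the four-character prefix cap are assumptions taken from the usual formulation.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    # Characters match only if they are within this many positions.
    window = max(max(la, lb) // 2 - 1, 0)
    a_hit = [False] * la
    b_hit = [False] * lb
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    a_seq = [a[i] for i in range(la) if a_hit[i]]
    b_seq = [b[j] for j in range(lb) if b_hit[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / la + matches / lb
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score by the length of the common prefix (max 4)."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)
```

Under this formula, two strings that agree on their first few characters score higher than their plain Jaro similarity, which is why a typo later in a name is penalized less than one in the first letter.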

class datamatch.similarities.AbsoluteNumericalSimilarity(d_max)

Computes a similarity score between two numbers, extrapolated from a maximum absolute difference.

The maximum absolute difference d_max (greater than 0) is the largest tolerated difference between two numbers, regardless of their actual values. If the absolute difference between two values a and b is less than d_max, the similarity score is 1.0 - abs(a - b) / d_max. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters: d_max (float) – The maximum absolute difference.

sim(a, b)

Returns a similarity score.

Parameters

a (float or int) – The left number.
b (float or int) – The right number.

Returns

The similarity score.

Return type

float
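The scoring rule for this class can be written out directly from the formula in its description. The following is a sketch of that formula only; the function name `absolute_numerical_sim` is hypothetical, and rejecting a non-positive d_max with an exception is an assumption, since the text only states that d_max must be greater than 0.

```python
def absolute_numerical_sim(a: float, b: float, d_max: float) -> float:
    """Score two numbers by their absolute difference relative to d_max."""
    if d_max <= 0:
        # Assumed behavior: the description only requires d_max > 0.
        raise ValueError("d_max must be greater than 0")
    diff = abs(a - b)
    if diff < d_max:
        return 1.0 - diff / d_max
    return 0.0  # differences of d_max or more are not tolerated
```

For example, with d_max = 4 the pair (10, 12) scores 0.5 whether the values are 10 and 12 or 1000010 and 1000012, because only the absolute gap matters.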

class datamatch.similarities.RelativeNumericalSimilarity(pc_max)

Computes a similarity score between two numbers, extrapolated from a maximum percentage difference.

This class serves a similar purpose to AbsoluteNumericalSimilarity but is more dependent on the actual values being compared.

Percentage difference pc between two values a and b is defined as abs(a - b) / max(abs(a), abs(b)) * 100.

Maximum percentage difference pc_max (0 < pc_max < 100) is the maximum tolerated percentage difference between the two numbers. If the percentage difference pc is less than pc_max then the similarity score is calculated with 1.0 - pc / pc_max. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters: pc_max (int) – The maximum percentage difference.

sim(a, b)

Returns a similarity score.

Parameters

a (float or int) – The left number.
b (float or int) – The right number.

Returns

The similarity score.

Return type

float
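The percentage-difference rule above can be sketched in a few lines. This follows the formulas given in the class description; the function name `relative_numerical_sim` is hypothetical, and returning 1.0 for equal inputs (which also avoids dividing by zero when both values are 0) is an assumption not stated in the text.

```python
def relative_numerical_sim(a: float, b: float, pc_max: float) -> float:
    """Score two numbers by their percentage difference relative to pc_max."""
    if not 0 < pc_max < 100:
        raise ValueError("pc_max must satisfy 0 < pc_max < 100")
    if a == b:
        # Assumption: equal values are identical; also guards the 0/0 case.
        return 1.0
    pc = abs(a - b) / max(abs(a), abs(b)) * 100
    if pc < pc_max:
        return 1.0 - pc / pc_max
    return 0.0  # percentage differences of pc_max or more are not tolerated
```

Unlike the absolute variant, the same gap of 50 gives a score of 0.5 for (200, 150) with pc_max = 50 but a much higher score for (20000, 19950), because the gap is measured relative to the larger value.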

class datamatch.similarities.DateSimilarity(d_max=30)

Computes a similarity score between two dates, extrapolated from a maximum absolute difference in days.

The maximum absolute difference in days d_max is the largest tolerated difference in days between two dates. As with AbsoluteNumericalSimilarity, if the dates a and b are less than d_max days apart, the similarity score is 1 - abs(a - b) / d_max, where a - b is the difference in days.

If, however, abs(a - b) >= d_max, the class employs two alternative strategies to hedge against typos:

If the year values are the same but the month and day values are swapped, the similarity score is 0.5.
The last resort is to write each date in YYYYMMDD format and compute the similarity score between the two strings.

The implementation follows the strategy for date/time values given in the Data Matching book [2].

Parameters: d_max (int) – Dates that are less than this number of days apart have a similarity score of 1 - <difference in days> / d_max. For dates that are further apart, this class employs alternative methods to compute the similarity score to hedge against typos. Defaults to 30.

sim(a, b)

Returns a similarity score.

Parameters

a (datetime.datetime) – The left date.
b (datetime.datetime) – The right date.

Returns

The similarity score.

Return type

float
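The three-stage strategy described for this class can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `date_sim` is hypothetical, and because the text does not say which string measure is used in the last-resort step, `difflib.SequenceMatcher.ratio()` from the standard library stands in for it here.

```python
from datetime import datetime
from difflib import SequenceMatcher


def date_sim(a: datetime, b: datetime, d_max: int = 30) -> float:
    """Score two dates: day gap first, then two typo-tolerant fallbacks."""
    days_apart = abs(a - b).days
    # Stage 1: dates close together are scored by their gap in days.
    if days_apart < d_max:
        return 1.0 - days_apart / d_max
    # Stage 2: same year with month and day swapped, e.g. 03/11 vs 11/03.
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return 0.5
    # Stage 3 (stand-in measure, an assumption): compare YYYYMMDD strings.
    return SequenceMatcher(
        None, a.strftime("%Y%m%d"), b.strftime("%Y%m%d")
    ).ratio()
```

With the default d_max of 30, two dates 15 days apart score 0.5 from the first stage, while 2020-03-11 and 2020-11-03 are far apart in days but still score 0.5 through the swapped month/day rule.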

[1] Peter Christen. “5.12 Numerical Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 121-122. Springer, 2012.
[2] Peter Christen. “5.13 Date, Age and Time Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 122-123. Springer, 2012.