# Similarities

When given a pair of values, a similarity class produces a similarity score that ranges between 0 and 1. A similarity score of 1 means the two values are completely identical while 0 means there are no similarities.

Note that these classes only compute similarity scores between scalar values or native Python objects such as `datetime.datetime`, not the entire row (which is handled by `ThresholdMatcher`).

class datamatch.similarities.StringSimilarity

Computes a similarity score between two strings using Levenshtein distance.

sim(a, b)

Returns a similarity score.

Parameters

a, b – The two strings to compare.

Returns

The similarity score.

Return type

`float`
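To make the idea concrete, here is a minimal, self-contained sketch of a Levenshtein-based similarity. It does not use datamatch itself: the helper names `levenshtein` and `string_similarity` are hypothetical, and the normalization (dividing the distance by the length of the longer string) is an assumption, since the text above does not say how the distance is mapped to the 0 to 1 range.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(string_similarity("kitten", "sitting"))
```

Under this assumed normalization, identical strings score 1 and strings with no characters in common at any position score 0, which matches the range described at the top of this page.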

class datamatch.similarities.JaroWinklerSimilarity(prefix_weight=0.1)

Similar to `StringSimilarity` but gives extra weight to common prefixes.

This class works well for matching people’s names, because an error in the first letter of a person’s name should be rare.

Parameters

prefix_weight (`float`) – The extra weight given to common prefixes, defaults to 0.1.

sim(a, b)

Returns a similarity score.

Parameters

a, b – The two strings to compare.

Returns

The similarity score.

Return type

`float`
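The effect of `prefix_weight` can be illustrated with a textbook Jaro-Winkler implementation. This is a sketch of the standard algorithm, not datamatch's own code (which may delegate to another package); the function names `jaro` and `jaro_winkler` and the cap of four prefix characters are assumptions taken from the usual definition.

```python
def jaro(a: str, b: str) -> float:
    """Standard Jaro similarity between two strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        start = max(0, i - window)
        stop = min(len(b), i + window + 1)
        for j in range(start, stop):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score by the length of the common prefix (max 4)."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


print(jaro_winkler("MARTHA", "MARHTA"))
```

For the classic pair "MARTHA" and "MARHTA", the shared prefix "MAR" lifts the score above the plain Jaro value, which is the behaviour that makes the measure forgiving of typos late in a name.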

class datamatch.similarities.AbsoluteNumericalSimilarity(d_max)

Computes a similarity score between two numbers, extrapolated from a maximum absolute difference.

The maximum absolute difference d_max (greater than 0) is the largest difference tolerated between two numbers, regardless of their actual values. If the difference between two values a and b is less than d_max, the similarity score is `1.0 - abs(a - b) / d_max`. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters

d_max (`float`) – The maximum absolute difference.

sim(a, b)

Returns a similarity score.

Parameters

a, b – The two numbers to compare.

Returns

The similarity score.

Return type

`float`
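The formula above can be sketched as a standalone function. The name `absolute_numerical_similarity` is hypothetical and the argument check is an assumption; only the scoring rule comes from the description.

```python
def absolute_numerical_similarity(a: float, b: float, d_max: float) -> float:
    """Score two numbers by their absolute difference relative to d_max."""
    if d_max <= 0:
        # Assumption: reject invalid thresholds instead of dividing by them.
        raise ValueError("d_max must be greater than 0")
    diff = abs(a - b)
    if diff < d_max:
        return 1.0 - diff / d_max
    return 0.0  # differences of d_max or more are not tolerated


print(absolute_numerical_similarity(10, 12, d_max=10))
print(absolute_numerical_similarity(10, 25, d_max=10))
```

The first call compares values 2 apart with a tolerance of 10, so the score stays high; the second exceeds the tolerance and falls to 0.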

class datamatch.similarities.RelativeNumericalSimilarity(pc_max)

Computes a similarity score between two numbers, extrapolated from a maximum percentage difference.

This class serves a similar purpose to `AbsoluteNumericalSimilarity` but is more dependent on the actual values being compared.

Percentage difference pc between two values a and b is defined as `abs(a - b) / max(abs(a), abs(b)) * 100`.

The maximum percentage difference pc_max (0 < pc_max < 100) is the largest percentage difference tolerated between the two numbers. If the percentage difference pc is less than pc_max, the similarity score is `1.0 - pc / pc_max`. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters

pc_max (`int`) – The maximum percentage difference.

sim(a, b)

Returns a similarity score.

Parameters

a, b – The two numbers to compare.

Returns

The similarity score.

Return type

`float`
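The percentage-based rule can be sketched in the same way. The function name `relative_numerical_similarity` is hypothetical, and the handling of two zeros (treated as identical to avoid dividing by zero) is an assumption that the description does not cover.

```python
def relative_numerical_similarity(a: float, b: float, pc_max: float) -> float:
    """Score two numbers by their percentage difference relative to pc_max."""
    if not 0 < pc_max < 100:
        raise ValueError("pc_max must be between 0 and 100 (exclusive)")
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # assumption: both values are zero, so they are identical
    pc = abs(a - b) / largest * 100
    if pc < pc_max:
        return 1.0 - pc / pc_max
    return 0.0


# Both pairs differ by 10, but the percentage difference is not the same.
print(relative_numerical_similarity(100, 90, pc_max=20))
print(relative_numerical_similarity(1000, 990, pc_max=20))
```

The two calls show the dependence on magnitude: a gap of 10 is a 10% difference between 100 and 90 but only a 1% difference between 1000 and 990, so the second pair scores higher.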

class datamatch.similarities.DateSimilarity(d_max=30)

Computes a similarity score between two dates, extrapolated from a maximum absolute difference in days.

The maximum absolute difference in days d_max is the largest difference in days tolerated between two dates. As with `AbsoluteNumericalSimilarity`, if dates a and b are less than d_max days apart, the similarity score is `1 - abs(a - b) / d_max`, where the difference is measured in days.

If, however, `abs(a - b) >= d_max`, the class employs two alternative strategies to hedge against typos:

• If the year values are the same but the month and day values are swapped, then the similarity score is 0.5.

• The last resort is to write each date in YYYYMMDD format and compute the similarity score between the two strings.

The implementation follows the strategy for date/time given in the Data Matching book [2].

Parameters

d_max (`int`) – Dates that are less than this number of days apart have a similarity score of `1 - <difference in days> / d_max`. For dates that are further apart, this class uses alternative methods to compute the similarity score, to hedge against typos. Defaults to 30.

sim(a, b)

Returns a similarity score.

Parameters

a, b – The two dates to compare.

Returns

The similarity score.

Return type

`float`
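The three-step strategy described for this class can be sketched as follows. The function name `date_similarity` is hypothetical. The description does not say which string measure the last resort uses, so this sketch substitutes `difflib.SequenceMatcher` from the standard library as a stand-in; the real class may score that case differently.

```python
from datetime import date
from difflib import SequenceMatcher


def date_similarity(a: date, b: date, d_max: int = 30) -> float:
    """Score two dates: day difference first, then two typo fallbacks."""
    diff_days = abs((a - b).days)
    if diff_days < d_max:
        return 1.0 - diff_days / d_max
    # Fallback 1: same year, month and day swapped (e.g. 03/11 vs 11/03).
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return 0.5
    # Fallback 2: compare the YYYYMMDD strings.
    # Stand-in measure (assumption): difflib's ratio, not the library's own.
    return SequenceMatcher(
        None, a.strftime("%Y%m%d"), b.strftime("%Y%m%d")
    ).ratio()


print(date_similarity(date(2020, 1, 10), date(2020, 1, 25)))  # 15 days apart
print(date_similarity(date(2020, 3, 11), date(2020, 11, 3)))  # swapped
print(date_similarity(date(2020, 1, 1), date(2021, 1, 1)))    # string fallback
```

The ordering of the checks matters: the cheap day-difference test handles the common case, and the typo fallbacks only run for dates that would otherwise score 0.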

[1] Peter Christen. “5.12 Numerical Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 121-122. Springer, 2012.

[2] Peter Christen. “5.13 Date, Age and Time Comparison.” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 122-123. Springer, 2012.