Similarities

Given a pair of values, a similarity class produces a similarity score between 0 and 1. A score of 1 means the two values are identical, while 0 means they have nothing in common.

Note that these classes only compute similarity scores between scalar values or native Python objects such as datetime.datetime, not the entire row (which is handled by ThresholdMatcher).

class datamatch.similarities.StringSimilarity

Computes a similarity score between two strings using Levenshtein distance.

sim(a, b)

Returns a similarity score

Parameters
  • a (str) – The left string

  • b (str) – The right string

Returns

The similarity score

Return type

float
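To illustrate how a Levenshtein distance can be turned into a score between 0 and 1, here is a minimal, self-contained sketch. It does not use datamatch itself: the helper names levenshtein_distance and levenshtein_similarity are hypothetical, and normalizing by the length of the longer string is an assumption, since the text above does not say how the class scales the distance.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("kitten", "sitting"))  # 1 - 3/7
```

Under this normalization, identical strings score 1.0 and strings of equal length that share no characters in the same position score 0.0.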

class datamatch.similarities.JaroWinklerSimilarity(prefix_weight=0.1)

Similar to StringSimilarity but gives extra weight to common prefixes.

This class works well for matching people’s names because a typo in the first letter of a person’s name should be a rare event.

Parameters

prefix_weight (float) – The extra weight given to common prefixes, defaults to 0.1

sim(a, b)

Returns a similarity score

Parameters
  • a (str) – The left string

  • b (str) – The right string

Returns

The similarity score

Return type

float
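The effect of prefix_weight can be seen in a textbook Jaro-Winkler computation. The sketch below is a plain-Python illustration, not the code datamatch runs: the function names jaro and jaro_winkler are hypothetical, and capping the common prefix at four characters is the usual Winkler convention, assumed here.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix (assumed cap: 4 characters)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The shared prefix "MAR" lifts the score from about 0.944 to about 0.961, which is why a mismatch in the first letters is penalized more than one later in the name.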

class datamatch.similarities.AbsoluteNumericalSimilarity(d_max)

Computes similarity score between two numbers, extrapolated from a maximum absolute difference

The maximum absolute difference d_max (greater than 0) is the largest tolerated difference between two numbers, regardless of their actual values. If the difference between two values a and b is less than d_max, the similarity score is 1.0 - abs(a - b) / d_max. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters

d_max (float) – The maximum absolute difference

sim(a, b)

Returns a similarity score

Parameters
Returns

The similarity score

Return type

float
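The scoring rule for AbsoluteNumericalSimilarity can be written as a standalone function. This is a sketch of the formula as documented above, not the class itself; the function name absolute_numerical_similarity is hypothetical.

```python
def absolute_numerical_similarity(a: float, b: float, d_max: float) -> float:
    """Score two numbers from their absolute difference, as described above."""
    if d_max <= 0:
        raise ValueError("d_max must be greater than 0")
    difference = abs(a - b)
    if difference < d_max:
        return 1.0 - difference / d_max
    return 0.0  # differences of d_max or more are not tolerated


print(absolute_numerical_similarity(10, 12, d_max=4))  # 0.5
print(absolute_numerical_similarity(10, 20, d_max=5))  # 0.0
```

Because only the difference matters, 10 versus 12 and 1000 versus 1002 receive the same score for a given d_max.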

class datamatch.similarities.RelativeNumericalSimilarity(pc_max)

Computes similarity score between two numbers, extrapolated from a maximum percentage difference

This class serves a similar purpose to AbsoluteNumericalSimilarity but is more dependent on the actual values being compared.

Percentage difference pc between two values a and b is defined as abs(a - b) / max(abs(a), abs(b)) * 100.

Maximum percentage difference pc_max (0 < pc_max < 100) is the maximum tolerated percentage difference between the two numbers. If the percentage difference pc is less than pc_max then the similarity score is calculated with 1.0 - pc / pc_max. Otherwise, the score is 0.

The implementation follows the strategy for numerical values given in the Data Matching book [1].

Parameters

pc_max (int) – The maximum percentage difference

sim(a, b)

Returns a similarity score

Parameters
Returns

The similarity score

Return type

float
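The percentage-based rule for RelativeNumericalSimilarity can be sketched the same way. The function name relative_numerical_similarity is hypothetical, and returning 1.0 when both values are zero is an assumption made here to avoid a division by zero; the text above does not cover that case.

```python
def relative_numerical_similarity(a: float, b: float, pc_max: float) -> float:
    """Score two numbers from their percentage difference, as described above."""
    if not 0 < pc_max < 100:
        raise ValueError("pc_max must satisfy 0 < pc_max < 100")
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # assumption: two zeros are treated as identical
    pc = abs(a - b) / largest * 100
    if pc < pc_max:
        return 1.0 - pc / pc_max
    return 0.0


print(relative_numerical_similarity(100, 95, pc_max=10))  # about 0.5
print(relative_numerical_similarity(100, 50, pc_max=10))  # 0.0
```

Unlike the absolute variant, the same gap of 5 scores much lower between 10 and 15 than between 100 and 95, because the percentage difference is larger.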

class datamatch.similarities.DateSimilarity(d_max=30)

Computes similarity score between two dates, extrapolated from a maximum absolute difference in days

The maximum absolute difference in days d_max is the largest tolerated difference in days between two dates. As with AbsoluteNumericalSimilarity, if dates a and b are less than d_max days apart, the similarity score is 1 - abs(a - b) / d_max, where abs(a - b) is measured in days.

If, however, the dates are d_max or more days apart, this class employs two alternative strategies to hedge against typos:

  • If the year values are the same but the month and day values are swapped, then the similarity score is 0.5.

  • As a last resort, each date is written in YYYYMMDD format and the similarity score is computed from
    the Levenshtein distance between the resulting strings.

The implementation follows the strategy for date/time given in the Data Matching book [2].

Parameters

d_max (int) – Dates that are less than this number of days apart have a similarity score of 1 - <difference in days> / d_max. For dates that are further apart, this class employs alternative methods to compute the similarity score to hedge against typos. Defaults to 30.

sim(a, b)

Returns a similarity score

Parameters
Returns

The similarity score

Return type

float
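The three-step strategy described for DateSimilarity can be sketched as a standalone function. This is an illustration under stated assumptions rather than the class's actual code: the names date_similarity and _levenshtein are hypothetical, and the fallback score is assumed to be 1 minus the Levenshtein distance divided by 8 (the length of a YYYYMMDD string), since the text above does not specify the normalization.

```python
from datetime import date


def _levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def date_similarity(a: date, b: date, d_max: int = 30) -> float:
    days_apart = abs((a - b).days)
    # Main rule: linear score for dates less than d_max days apart.
    if days_apart < d_max:
        return 1.0 - days_apart / d_max
    # Fallback 1: same year, month and day swapped (e.g. 03/05 vs 05/03).
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return 0.5
    # Fallback 2: compare the YYYYMMDD strings (assumed normalization by 8).
    a_str = f"{a.year:04d}{a.month:02d}{a.day:02d}"
    b_str = f"{b.year:04d}{b.month:02d}{b.day:02d}"
    return 1.0 - _levenshtein(a_str, b_str) / 8


print(date_similarity(date(2020, 1, 1), date(2020, 1, 16)))  # 0.5
print(date_similarity(date(2020, 3, 5), date(2020, 5, 3)))   # 0.5
print(date_similarity(date(2020, 1, 1), date(2021, 1, 1)))   # 0.875
```

The last call shows why the string fallback helps: the two dates are a full year apart, yet they differ by a single digit, so a likely typo in the year still receives a high score.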

[1]

Peter Christen. “5.12 Numerical Comparison” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 121-122. Springer, 2012.

[2]

Peter Christen. “5.13 Date, Age and Time Comparison” In Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 122-123. Springer, 2012.