Indices

An index divides the data into one or more buckets. Only records in the same bucket are matched against each other. When used correctly, this reduces the number of pairs to compare and can speed up the matching process significantly.
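To make the reduction in pairs concrete, here is a minimal plain-Python sketch. It does not use the library itself: the records are hypothetical, and the "city" field simply plays the role of a bucket key.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; "city" plays the role of the bucket key.
records = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Anne", "city": "Oslo"},
    {"name": "Bob", "city": "Lima"},
    {"name": "Rob", "city": "Lima"},
    {"name": "Cy", "city": "Rome"},
    {"name": "Sy", "city": "Rome"},
]

# Without an index, every record is compared with every other record.
all_pairs = list(combinations(range(len(records)), 2))

# With an index, only records that share a bucket key are compared.
buckets = defaultdict(list)
for i, rec in enumerate(records):
    buckets[rec["city"]].append(i)
bucketed_pairs = [
    pair for rows in buckets.values() for pair in combinations(rows, 2)
]

print(len(all_pairs), len(bucketed_pairs))  # 15 3
```

The saving grows with the data size, but it comes at a cost: two records that land in different buckets are never compared, so the bucket key should be a field that true matches are expected to share.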

class datamatch.indices.BaseIndex

Abstract base class for all index classes.

Subclasses should implement the _key_ind_map() method.

abstract _key_ind_map(df)

Returns a mapping between bucket keys and row indices.

Parameters

df (pandas.DataFrame) – the data to index

Returns

a mapping from each bucket key to all row indices that belong to that bucket. Keys can be anything hashable, but each value must be a list, even when the bucket contains only one row.

Return type

dict
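The shape of the returned mapping can be sketched in plain Python. This is an illustration only, not the library's implementation: key_ind_map_sketch and key_func are hypothetical names, and a list of dicts stands in for the DataFrame rows.

```python
from collections import defaultdict


def key_ind_map_sketch(rows, key_func):
    """Map each bucket key to the list of row indices that produce it."""
    mapping = defaultdict(list)
    for i, row in enumerate(rows):
        # Every value is a list, even when only one row has this key.
        mapping[key_func(row)].append(i)
    return dict(mapping)


# Hypothetical data: bucket rows by the first letter of the name.
rows = [{"name": "Alice"}, {"name": "Adam"}, {"name": "Bea"}]
mapping = key_ind_map_sketch(rows, lambda row: row["name"][0])
print(mapping)  # {'A': [0, 1], 'B': [2]}
```

Note that the single-row bucket 'B' still maps to a list, matching the requirement stated above.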

keys(df)

Returns the set of keys that can be used to retrieve buckets.

Parameters

df (pandas.DataFrame) – the data to index

Returns

a set of bucket keys

Return type

set

bucket(df, key)

Retrieves a bucket given the original data and a bucket key.

Parameters

  • df (pandas.DataFrame) – the original data

  • key (hashable) – the key of the bucket to retrieve

Returns

rows in bucket

Return type

pandas.DataFrame

class datamatch.indices.NoopIndex

Returns all data as a single bucket.

Using this is like using no index at all. It is useful when performance is not a concern (e.g. when there are only a few rows).

class datamatch.indices.ColumnsIndex(cols)

Split data into multiple buckets based on one or more columns.

Parameters

cols (str or list of str) – single column name or list of column names to index
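How column-based bucketing behaves can be sketched in plain Python. This is not the library's code: columns_key_map is a hypothetical helper, a list of dicts stands in for the DataFrame, and the use of a tuple of column values as the bucket key is an assumption made for illustration.

```python
from collections import defaultdict


def columns_key_map(rows, cols):
    # Accept a single column name or a list of names, as the parameter does.
    if isinstance(cols, str):
        cols = [cols]
    mapping = defaultdict(list)
    for i, row in enumerate(rows):
        # Assumption: the bucket key is the tuple of values in the chosen columns.
        mapping[tuple(row[c] for c in cols)].append(i)
    return dict(mapping)


# Hypothetical rows.
rows = [
    {"first": "Ann", "last": "Lee", "year": 1990},
    {"first": "Anne", "last": "Lee", "year": 1990},
    {"first": "Ann", "last": "Lee", "year": 1985},
]
by_last = columns_key_map(rows, "last")
by_last_year = columns_key_map(rows, ["last", "year"])
print(by_last)       # {('Lee',): [0, 1, 2]}
print(by_last_year)  # {('Lee', 1990): [0, 1], ('Lee', 1985): [2]}
```

Adding columns makes buckets smaller and matching faster, but rows that differ in any indexed column will never be compared.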

class datamatch.indices.MultiIndex(indices, combine_keys=False)

Creates bucket keys by combining bucket keys from 2 or more indices.

This has 2 modes of operation:

  • When combine_keys is False: the key sets of all indices are concatenated. This is like OR-ing the keys.

  • When combine_keys is True: the final key set is the cartesian product of all key sets. This is like AND-ing the keys.

Parameters

  • indices (list of BaseIndex subclass) – list of indices to combine

  • combine_keys (bool) – whether the final key set should be the cartesian product of all key sets, defaults to False
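The two modes can be illustrated on bare key sets in plain Python. This sketch covers only the key-set semantics described above, not how the library stores or resolves keys internally, and the key values are hypothetical.

```python
from itertools import product

# Hypothetical key sets produced by two separate indices over the same data.
city_keys = ["Oslo", "Lima"]
year_keys = [1990, 1985]

# combine_keys=False: the key sets are concatenated, so records are compared
# when they share a city bucket OR a year bucket.
or_keys = city_keys + year_keys

# combine_keys=True: the key set is the cartesian product, so records are
# compared only when they share both a city AND a year.
and_keys = list(product(city_keys, year_keys))

print(or_keys)   # ['Oslo', 'Lima', 1990, 1985]
print(and_keys)  # [('Oslo', 1990), ('Oslo', 1985), ('Lima', 1990), ('Lima', 1985)]
```

OR-ing yields fewer but larger buckets and therefore more candidate pairs; AND-ing yields more but smaller buckets and therefore fewer candidate pairs.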