Indices

An index divides the data up into one or more buckets. Only records in the same bucket are then matched against each other. When used correctly, indexing decreases the number of pairs to compare and speeds up the matching process significantly.
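To make the saving concrete, the following is a minimal, library-independent sketch that counts candidate pairs with and without bucketing. The records, the `city` field used as a bucket key, and all variable names are hypothetical. Plain dictionaries stand in for DataFrame rows so that the snippet runs without any dependencies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; "city" plays the role of the bucket key.
records = [
    {"name": "Ann Lee", "city": "Austin"},
    {"name": "Anne Lee", "city": "Austin"},
    {"name": "Bob Roy", "city": "Boston"},
    {"name": "Rob Roy", "city": "Boston"},
    {"name": "Cy Young", "city": "Chicago"},
    {"name": "Si Young", "city": "Chicago"},
]

# Without an index, every record is paired with every other record.
all_pairs = list(combinations(range(len(records)), 2))

# With an index, rows are first grouped into buckets by key...
buckets = defaultdict(list)
for row_id, rec in enumerate(records):
    buckets[rec["city"]].append(row_id)

# ...and pairs are only formed inside each bucket.
bucketed_pairs = [
    pair for rows in buckets.values() for pair in combinations(rows, 2)
]

print(len(all_pairs), len(bucketed_pairs))  # 15 3
```

The trade-off is that two records that fall into different buckets are never compared, so the bucket key should be something that true matches are expected to share.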

class datamatch.indices.BaseIndex

Abstract base class for all index classes.

Subclasses should implement the _key_ind_map() method.

abstract _key_ind_map(df)

Returns a mapping between bucket keys and row indices.

Parameters

df (pandas.DataFrame) – the data to index.

Returns

A mapping from each bucket key to the row indices that belong to that bucket. A key can be any hashable value, but each value must be a list, even when the bucket holds only one row.

Return type

dict
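The shape of the expected return value can be illustrated with a small stand-alone sketch. It does not subclass BaseIndex. The function `key_ind_map`, its `key_func` argument, and the sample rows are hypothetical, and a dict of row index to record stands in for the DataFrame.

```python
from collections import defaultdict


def key_ind_map(rows, key_func):
    """Group row indices by bucket key.

    `rows` maps a row index to a record (a stand-in for a DataFrame);
    `key_func` is a hypothetical helper that derives a hashable key.
    """
    mapping = defaultdict(list)
    for row_id, rec in rows.items():
        mapping[key_func(rec)].append(row_id)
    # Return a plain dict whose values are always lists.
    return dict(mapping)


rows = {
    10: {"last_name": "Smith"},
    11: {"last_name": "Smyth"},
    12: {"last_name": "Jones"},
}

# Bucket by the first letter of the last name.
mapping = key_ind_map(rows, lambda rec: rec["last_name"][0])
print(mapping)  # {'S': [10, 11], 'J': [12]}
```

Note that the bucket for "J" is the list [12] rather than the bare index 12, in line with the requirement that values are always lists.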

keys(df)

Returns a set of keys that could be used to retrieve buckets.

Parameters

df (pandas.DataFrame) – the data to index.

Returns

A set of bucket keys.

Return type

set

bucket(df, key)

Retrieves a bucket given the original data and a bucket key.

Parameters
  • df (pandas.DataFrame) – the original data.

  • key – the bucket key.

Returns

Rows in bucket.

Return type

pandas.DataFrame
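How keys() and bucket() relate to the key-to-rows mapping can be sketched as follows. This is an illustration of the idea under stated assumptions, not the library's implementation. The free functions `keys` and `bucket` are hypothetical, rows are plain dicts rather than a DataFrame, and returning an empty bucket for an unknown key is a choice made for this sketch only.

```python
rows = {
    10: {"last_name": "Smith"},
    11: {"last_name": "Smyth"},
    12: {"last_name": "Jones"},
}

# A mapping of the shape described for _key_ind_map().
mapping = {"S": [10, 11], "J": [12]}


def keys(mapping):
    """Set of bucket keys that can be passed to bucket()."""
    return set(mapping)


def bucket(rows, mapping, key):
    """Rows belonging to one bucket, keyed by row index."""
    # Assumption of this sketch: an unknown key yields an empty bucket.
    return {row_id: rows[row_id] for row_id in mapping.get(key, [])}


s_bucket = bucket(rows, mapping, "S")
print(sorted(keys(mapping)))  # ['J', 'S']
print(list(s_bucket))         # [10, 11]
```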

class datamatch.indices.NoopIndex

Returns all data as a single bucket.

Using this is like using no index at all. Useful when performance is not a concern (e.g. when there are only a few rows).

class datamatch.indices.ColumnsIndex(cols, ignore_key_error=False, index_elements=False)

Split data into multiple buckets based on one or more columns.

Parameters
  • cols (str or list of str) – single column name or list of column names to index.

  • ignore_key_error (bool) – When set to True and a column does not exist in the frame, produce no buckets instead of raising a KeyError.

  • index_elements (bool) – Set this to True when each value in the column to index is a list, and you want to index using the list elements.
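The three parameters can be illustrated with a stand-alone sketch of column-based bucketing. It is not the library's code. The function `columns_key_ind_map` is hypothetical, rows are plain dicts, and two details are assumptions of this sketch: keys are tuples of column values, and with index_elements a row is filed under every combination of list elements.

```python
from collections import defaultdict
from itertools import product


def columns_key_ind_map(rows, cols, ignore_key_error=False, index_elements=False):
    """Sketch of bucketing by column values; not the library's code."""
    cols = [cols] if isinstance(cols, str) else list(cols)
    mapping = defaultdict(list)
    for row_id, rec in rows.items():
        try:
            values = [rec[col] for col in cols]
        except KeyError:
            if ignore_key_error:
                return {}  # missing column: produce no buckets at all
            raise
        if index_elements:
            # Each value is a list; file the row under every combination
            # of elements (assumption of this sketch).
            row_keys = set(product(*values))
        else:
            row_keys = {tuple(values)}
        for key in row_keys:
            mapping[key].append(row_id)
    return dict(mapping)


rows = {
    0: {"city": "Austin", "tags": ["a", "b"]},
    1: {"city": "Austin", "tags": ["b"]},
    2: {"city": "Boston", "tags": ["c"]},
}

by_city = columns_key_ind_map(rows, "city")
by_tag = columns_key_ind_map(rows, "tags", index_elements=True)
missing = columns_key_ind_map(rows, "zip", ignore_key_error=True)
```

Here `by_city` groups rows 0 and 1 together, `by_tag` places row 0 in both the "a" and "b" buckets, and `missing` is empty because the hypothetical `zip` column does not exist.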

class datamatch.indices.MultiIndex(indices, combine_keys=False)

Creates bucket keys by combining bucket keys from two or more indices.

This has two modes of operation:

  • When combine_keys is False: the key sets of all indices are concatenated together. This is like OR-ing the keys.

  • When combine_keys is True: the final key set is the cartesian product of all key sets. This is like AND-ing the keys.

Parameters
  • indices (list of BaseIndex subclass) – list of indices to combine.

  • combine_keys (bool) – whether the final key set should be the cartesian product of all key sets, defaults to False.
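The difference between the two modes can be sketched on plain key-to-rows mappings. This illustrates the OR and AND semantics described above and is not the library's implementation. The function `combine_maps` is hypothetical, and both tagging OR-mode keys with the index position and dropping empty AND-mode buckets are assumptions of this sketch.

```python
from itertools import product


def combine_maps(maps, combine_keys=False):
    """Combine key -> row-index mappings from several indices (sketch)."""
    combined = {}
    if not combine_keys:
        # OR mode: keep every bucket from every index.
        for pos, mapping in enumerate(maps):
            for key, rows in mapping.items():
                # Tagging keys with the index position is an assumption of
                # this sketch, used to keep keys from different indices apart.
                combined[(pos, key)] = list(rows)
        return combined
    # AND mode: one candidate bucket per element of the cartesian product.
    for key_combo in product(*maps):
        shared = set.intersection(
            *(set(mapping[key]) for mapping, key in zip(maps, key_combo))
        )
        if shared:  # dropping empty buckets is also an assumption
            combined[key_combo] = sorted(shared)
    return combined


# Hypothetical mappings produced by two separate indices.
by_city = {"Austin": [0, 1], "Boston": [2, 3]}
by_year = {1990: [0, 2], 1991: [1, 3]}

or_buckets = combine_maps([by_city, by_year])
and_buckets = combine_maps([by_city, by_year], combine_keys=True)
```

In OR mode, rows 0 and 1 share a bucket because they have the same city, and rows 0 and 2 share another because they have the same year. In AND mode, two rows share a bucket only if they agree on both city and year, so each of the four rows ends up alone.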