An index divides the data up into one or more buckets. Only records in the same bucket are then matched against each other. When used correctly, indexing decreases the number of pairs to compare and speeds up the matching process significantly.
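To make the saving concrete, here is a minimal sketch that counts candidate pairs with and without bucketing. It uses plain Python rather than datamatch itself; the records and the choice of "city" as the bucket key are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; only the "city" field is used as a bucket key.
records = [
    {"name": "John Smith", "city": "Austin"},
    {"name": "Jon Smith", "city": "Austin"},
    {"name": "Mary Jones", "city": "Boston"},
    {"name": "Marie Jones", "city": "Boston"},
    {"name": "Ann Lee", "city": "Denver"},
]

# Without an index, every record is compared with every other record.
all_pairs = list(combinations(range(len(records)), 2))

# With an index, rows are grouped by key and compared only within a group.
buckets = defaultdict(list)
for row, rec in enumerate(records):
    buckets[rec["city"]].append(row)

bucketed_pairs = [
    pair for rows in buckets.values() for pair in combinations(rows, 2)
]

print(len(all_pairs), len(bucketed_pairs))
```

The trade-off is that two records that belong together but land in different buckets are never compared, so the bucket key should be a field that true matches are expected to share.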
- class datamatch.indices.BaseIndex
Abstract base class for all index classes.
Subclasses should implement the _key_ind_map method.
- abstract _key_ind_map(df)
Returns a mapping between bucket keys and row indices.
Returns a set of keys that could be used to retrieve buckets.
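The contract described above can be sketched without datamatch. Everything in this example is a hypothetical stand-in: SketchIndex is not the library's BaseIndex, the method name keys is an assumption, and rows are plain dictionaries rather than a data frame.

```python
from abc import ABC, abstractmethod


class SketchIndex(ABC):
    """Hypothetical stand-in for an index base class (not datamatch's code)."""

    @abstractmethod
    def _key_ind_map(self, rows):
        """Return a dict mapping each bucket key to a list of row positions."""

    def keys(self, rows):
        # The bucket keys are simply the keys of the mapping.
        return set(self._key_ind_map(rows))


class FirstLetterIndex(SketchIndex):
    """Buckets rows by the upper-cased first letter of their "name" field."""

    def _key_ind_map(self, rows):
        mapping = {}
        for pos, row in enumerate(rows):
            mapping.setdefault(row["name"][0].upper(), []).append(pos)
        return mapping


rows = [{"name": "alice"}, {"name": "Adam"}, {"name": "bob"}]
index = FirstLetterIndex()
key_map = index._key_ind_map(rows)
bucket_keys = index.keys(rows)
```

In this sketch a subclass only has to supply the key-to-rows mapping; the set of keys is derived from it.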
- class datamatch.indices.NoopIndex
Returns all data as a single bucket.
Using this is equivalent to using no index at all. It is useful when performance is not a concern (e.g. when there are few rows).
- class datamatch.indices.ColumnsIndex(cols, ignore_key_error=False, index_elements=False)
Split data into multiple buckets based on one or more columns.
ignore_key_error (bool) – When set to True and a column does not exist in the frame, produce no buckets instead of raising a KeyError.
index_elements (bool) – Set this to True when each value in the column to index is a list and you want to index by the list elements.
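The effect of the two flags can be illustrated with a small stand-alone function. This is a sketch of the documented behaviour, not datamatch's implementation: bucket_by_columns is a hypothetical helper, rows are plain dictionaries, and combining list elements across several columns with a cartesian product is an assumption.

```python
from collections import defaultdict
from itertools import product


def bucket_by_columns(rows, cols, ignore_key_error=False, index_elements=False):
    """Group row positions by the values found in `cols` (illustrative only)."""
    buckets = defaultdict(list)
    for pos, row in enumerate(rows):
        try:
            values = [row[col] for col in cols]
        except KeyError:
            if ignore_key_error:
                return {}  # a missing column yields no buckets at all
            raise
        if index_elements:
            # Each value is a list; the row joins one bucket per element
            # (per combination of elements when several columns are given).
            keys = set(product(*values))
        else:
            keys = {tuple(values)}
        for key in keys:
            buckets[key].append(pos)
    return dict(buckets)


people = [
    {"name": "Ann", "state": "TX", "tags": ["a", "b"]},
    {"name": "Bob", "state": "TX", "tags": ["b"]},
    {"name": "Cy", "state": "MA", "tags": ["c"]},
]

by_state = bucket_by_columns(people, ["state"])
by_tag = bucket_by_columns(people, ["tags"], index_elements=True)
missing = bucket_by_columns(people, ["zip"], ignore_key_error=True)
```

With index_elements=True, a row can belong to several buckets at once: Ann appears under both "a" and "b", so she is compared with Bob through the shared "b" bucket.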
- class datamatch.indices.MultiIndex(indices, combine_keys=False)
Creates bucket keys by combining bucket keys from two or more indices.
This has two modes of operation:
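When combine_keys is False, the key sets of the indices are concatenated. This is like OR-ing the keys: two records are compared if they share a bucket in any of the indices.
When combine_keys is True, the final key set is the cartesian product of all key sets. This is like AND-ing the keys: two records are compared only if they share a bucket in every index.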
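The two modes can be sketched with plain sets. The bucket contents below are made up, and the sketch only illustrates the OR/AND semantics; how datamatch represents combined keys internally, and whether it drops empty buckets, are assumptions.

```python
from itertools import product

# Hypothetical buckets from two separate indices: key -> set of row positions.
state_buckets = {"TX": {0, 1, 2}, "MA": {3}}
year_buckets = {1990: {0, 1, 3}, 1991: {2}}

# OR mode: keep every bucket from every index side by side.
or_buckets = {("state", key): rows for key, rows in state_buckets.items()}
or_buckets.update({("year", key): rows for key, rows in year_buckets.items()})

# AND mode: one bucket per combination of keys, holding only the rows that
# fall under both keys. Empty combinations are dropped in this sketch.
and_buckets = {}
for (state, s_rows), (year, y_rows) in product(
    state_buckets.items(), year_buckets.items()
):
    shared = s_rows & y_rows
    if shared:
        and_buckets[(state, year)] = shared
```

OR mode is the more permissive of the two, since rows 0 and 3 still meet in the 1990 bucket. AND mode yields smaller buckets and fewer comparisons, because rows must agree on both state and year.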