Welcome to Datamatch’s documentation!

Datamatch is a library that facilitates data matching (also known as entity resolution) and deduplication. One of its core design goals is to be as extensible as possible. Each sub-task is therefore defined as a separate class, which makes it easy to swap components of the same type, or to write your own component that fits your purpose.

For now, the only classification method supported is threshold-based classification (implemented with ThresholdMatcher). However, whatever classification methods are added to this library later, concepts such as Indices and Filters will still apply. The library is therefore reasonably well prepared to be extended to support most data matching use cases.
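
To make the idea of threshold-based classification concrete, here is a minimal sketch in plain Python. It does not use Datamatch's API: the helper names (index_by, similarity, match_pairs), the sample records, and the 0.8 threshold are all hypothetical and chosen only for illustration. The grouping step stands in for the role an index is assumed to play, namely limiting which record pairs are compared at all.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; not taken from the Datamatch documentation.
records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"name": "Jon Smith", "city": "Boston"},
    3: {"name": "Jane Doe", "city": "Boston"},
    4: {"name": "John Smith", "city": "Denver"},
}


def index_by(records, key):
    """Group record ids into buckets that share the same value for `key`."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[rec[key]].append(rec_id)
    return buckets


def similarity(a, b):
    """Return a string similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def match_pairs(records, key, field, threshold):
    """Compare records within each bucket; keep pairs scoring >= threshold."""
    pairs = []
    for ids in index_by(records, key).values():
        for left, right in combinations(sorted(ids), 2):
            score = similarity(records[left][field], records[right][field])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


matches = match_pairs(records, key="city", field="name", threshold=0.8)
for left, right, score in matches:
    print(left, right, round(score, 3))
```

In this sketch, records 1 and 4 have identical names but are never compared, because they fall into different buckets. Swapping the similarity function or the bucketing key leaves the rest of the pipeline unchanged, which is the kind of component swapping the design described above aims for.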