CiviCRM offers a function to check for duplicate contacts, either manually or on user submission. Duplicate matching is very useful for preventing duplicate records from being created when you allow anonymous users to register for events or make donations. Without it, every time an anonymous user did one of those two things, you would get a new record for them in the database.
Depending on your business case, you may have a different idea of what constitutes a duplicate contact, and therefore may need to set different match criteria and thresholds. In this post I will describe a few scenarios where you may want to get really creative about your match criteria and threshold.
It is worth noting that duplicate matching works better now than a few weeks ago thanks to Jim Taylor from Rooty Hollow, who squashed a long-standing and undetected bug in CiviCRM’s duplicate matching that could have resulted in false positives. We discovered the bug through our work together for a mutual client.
You can find support documentation for the Find and Merge Duplicate feature on the CiviCRM Wiki; what follows is supplemental, based on my experience.
Strict Matching and Individuals
In CiviCRM, if you go to Administer > Manage > Find And Merge Duplicates, you will see a series of duplicate matching rules that are identified as either “Strict” or “Fuzzy” in the Level column. Rules that are set as Strict are employed by CiviCRM when an untrusted (anonymous) user is filling out a form. If a submission is identified as a match, information in an existing contact will be changed, so it is important to get Strict rules set properly.
A rule consists of a set of criteria, each with a “score”, plus a threshold. If the threshold is met or crossed when adding up the criteria scores, a match is made and a record is updated. Typical criteria you will want to match on are first name, last name, email address, postal code, city, state, street address, and phone number. However, you may not want to give all of these the same weight, and you may want to be able to achieve a match even if not all of them are included.
By creatively and carefully setting different weights you can achieve a positive match based on partial criteria. For example, let's say you do not have data in each of those fields for all of your contacts, and you don’t expect that someone is going to provide data in all of those fields. To you, a contact is a match if it has the same first name, last name, and postal code AND either email or phone number. Setting the score for the first and last name to 10 each, postal code to 5, email and phone number each to 2, and the threshold to 27 would mean that a match must have the exact same first and last name (20 points), postal code (5 points), and either phone number OR email (2 points each) to reach the 27 point threshold.
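To make the arithmetic concrete, here is a minimal Python sketch of the weighted rule just described. It is not CiviCRM code: the function names, the dictionary layout, and the whitespace and case normalization are all assumptions made for illustration. Only the weights and the threshold come from the example above.

```python
# Weights and threshold from the example above. Everything else in this
# sketch (names, record layout, normalization) is hypothetical.
WEIGHTS = {
    "first_name": 10,
    "last_name": 10,
    "postal_code": 5,
    "email": 2,
    "phone": 2,
}
THRESHOLD = 27


def match_score(existing, submitted, weights=WEIGHTS):
    """Add up the weights of fields that are filled in on both records and equal."""
    score = 0
    for field, weight in weights.items():
        a = existing.get(field)
        b = submitted.get(field)
        # Assumption: ignore surrounding whitespace and letter case.
        if a and b and a.strip().lower() == b.strip().lower():
            score += weight
    return score


def is_duplicate(existing, submitted, weights=WEIGHTS, threshold=THRESHOLD):
    """A match is made when the summed score meets or crosses the threshold."""
    return match_score(existing, submitted, weights) >= threshold
```

With these numbers, names plus postal code alone score 25 and fall short, while adding either a matching email or a matching phone number reaches 27.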
When you start to think about how data gets collected and entered, you might realize that first name is not the best criterion: people have nicknames or shortened names, or sometimes just use their first initial. So if I am in your database as “Greg Heller” because your data entry intern didn’t want to spell out my full name, and then I come to your site to register for an event using my full name, “Gregory Heller”, which is my preference, a match won’t get made.
There is a solution to this. CiviCRM allows you to trim a field, so we could just compare the first initial. However, that might not work in situations where you have a “Robert” who sometimes goes by “Bob”. In this case, you may assign a lower score to the first name (or first initial) and add in another criterion with a higher score, perhaps email, since in most cases it is a fairly unique identifier.
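As a rough illustration of that idea, the sketch below compares only a prefix of the first name and leans on email for most of the score. The rule layout, the weights (3, 10, 10), and the threshold of 20 are hypothetical values chosen for the example; they are not CiviCRM defaults, and this is not how CiviCRM implements trimming internally.

```python
# Each criterion is (field, prefix_length or None, weight).
# All values here are illustrative assumptions, not CiviCRM settings.
RULE = [
    ("first_name", 1, 3),    # compare the first initial only, low weight
    ("last_name", None, 10),
    ("email", None, 10),     # higher weight: usually a fairly unique identifier
]
THRESHOLD = 20


def normalized(value, length):
    """Lower-case and strip the value, then keep only the first `length` characters."""
    value = (value or "").strip().lower()
    return value[:length] if length else value


def score(existing, submitted, rule=RULE):
    """Sum the weights of criteria whose (possibly trimmed) values are equal."""
    total = 0
    for field, length, weight in rule:
        a = normalized(existing.get(field), length)
        b = normalized(submitted.get(field), length)
        if a and b and a == b:
            total += weight
    return total
```

Under these assumed weights, “Greg” and “Gregory” with the same last name and email score 23, and “Robert” and “Bob” with the same last name and email still reach 20, while a shared initial and last name without a matching email stays at 13.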
There are situations where you cannot trust email to be a unique identifier: sometimes husbands and wives share an email, or a parent might sign their child up for an event using the parent’s email address. This is another case where using complex criteria may be the solution.
Strict Matching and Organizations
If you are using CiviCRM to allow people to register an organization for something (an event, a coalition, a mailing list, or a directory), you will want to be very careful about your strict matching criteria. There may be a business case for requiring a human internal to your organization to actually review the submission, which will involve fuzzy matching, discussed below in some more detail. But for the purpose of this explanation, we’ll assume you are letting an untrusted user enter information for an organization and you want to ensure that you do not get duplicates (as much as possible).
Organizations have full names, legal names, acronyms in place of names, multiple locations, and sometimes multiple URLs and phone numbers. All of this poses a challenge when trying to stop the duplicates at the door (so to speak). Setting up criteria such that there are multiple combinations that can get you to the threshold is the key here, but you have to be careful that you don’t get false positives. Organizations sometimes share offices, and in that case they share an address and perhaps a phone number. So neither can be the basis for a score that will get you close to a match on its own.
If the name is exactly the same, we can assume (unlike with an individual) that we have a match. Perhaps. Or maybe it is an organization with different chapters or offices that you do want to keep separate records on. Name is a good place to start. Let's say 20 points. Since we may want to be sure that this is the same location of the organization, let's give postal code 6 points, and the first 6 characters of street address 5 points (thus getting the street number and maybe the first few characters of the street name, but hopefully avoiding tricky suffixes like East (E) or Avenue (Ave)). That adds up to 31 points. If we are comfortable with that being a match, then let's figure out a few other ways to get there that are not likely to result in a false positive. Phone number and URL might be good, at 6 points and 5 points respectively (name plus phone number plus URL also adds up to 31). Now we have a few ways to get to 31 that require some specific criteria that identify the organization and its location, but omitting some of the information, or providing non-matching information (like a different phone number or URL), will not trigger a duplicate.
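When a rule has several routes to the threshold, it can help to list them before saving the rule. The following Python sketch enumerates the smallest sets of criteria that reach 31 using the weights from the paragraph above. The field names and the helper function are hypothetical, and the script only does the arithmetic; it does not talk to CiviCRM.

```python
from itertools import combinations

# Weights and threshold from the organization example above.
# The field names are illustrative labels, not CiviCRM field identifiers.
WEIGHTS = {
    "name": 20,
    "postal_code": 6,
    "street_prefix": 5,  # first 6 characters of the street address
    "phone": 6,
    "url": 5,
}
THRESHOLD = 31


def passing_combinations(weights, threshold):
    """Return the minimal sets of criteria whose weights reach the threshold."""
    fields = sorted(weights)
    passing = []
    # Walk from small sets to large ones so minimal sets are found first.
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            if sum(weights[f] for f in combo) < threshold:
                continue
            # Skip any set that merely extends a smaller passing set.
            if any(set(p) <= set(combo) for p in passing):
                continue
            passing.append(combo)
    return passing


if __name__ == "__main__":
    for combo in passing_combinations(WEIGHTS, THRESHOLD):
        print(sum(WEIGHTS[f] for f in combo), combo)
```

With these weights there are five minimal combinations, and every one of them includes the name. Name plus street prefix plus URL totals only 30, so that combination alone does not produce a match, which is worth knowing before you rely on it.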
Fuzzy Rules for Individuals and Organizations
Fuzzy rules are employed by CiviCRM when a trusted user is using a new contact creation form. The idea is to provide a lower bar for flagging a potential duplicate. The user is then given the choice of whether to create the new record or load the existing one. If you have a really small database, fuzzy rules that match on partial first name and full last name might be sufficient. If you have thousands or more contact records, you may want to add in an additional criterion, like city. Email address is a good idea here, but not as a criterion on its own. I have plenty of email addresses, and often (accidentally) provide two or more to one organization.
Manual Finding and Matching of Duplicates
Fuzzy and Strict rules can both be used when you want to manually cleanse your database of potential duplicates. Often you want a Fuzzy rule that is stricter than the one used by trusted users on forms, but not as strict as the Strict rule. You are given the option of merging or not merging the potential duplicates the system finds.
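Conceptually, a manual cleanse compares every contact against every other contact and hands the pairs that cross the threshold to a human for review. The sketch below shows that idea in Python under stated assumptions: the rule (first three letters of first name, full last name, city), its weights, the threshold, and the record layout are all made up for illustration, and CiviCRM’s own implementation works against the database rather than in-memory lists.

```python
from itertools import combinations

# Illustrative rule: (field, prefix_length or None, weight). Not CiviCRM defaults.
RULE = [
    ("first_name", 3, 5),
    ("last_name", None, 10),
    ("city", None, 5),
]
THRESHOLD = 20


def pair_score(a, b, rule=RULE):
    """Sum the weights of criteria on which two contacts agree."""
    total = 0
    for field, length, weight in rule:
        x = (a.get(field) or "").strip().lower()
        y = (b.get(field) or "").strip().lower()
        if length:
            x, y = x[:length], y[:length]
        if x and y and x == y:
            total += weight
    return total


def candidate_pairs(contacts, rule=RULE, threshold=THRESHOLD):
    """Return (id_a, id_b, score) for each pair at or above the threshold.

    Nothing is merged here; the pairs are only flagged for a person to review.
    """
    found = []
    for a, b in combinations(contacts, 2):
        s = pair_score(a, b, rule)
        if s >= threshold:
            found.append((a["id"], b["id"], s))
    return found
```

A pairwise scan like this grows quadratically with the number of contacts, which is one reason to choose the criteria carefully on a large database.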
I hope that this article is helpful, and welcome any comments, corrections, or other examples of duplicate matching criteria you have used in CiviCRM.