Data De-duplication and Fuzzy Matching

Data Processing / 23 September 16 / by Dale Brett

Duplicates

When collecting names and addresses for direct mail or other communication activities, you might draw on a customer mailing list, a database of prospects, or data purchased from third-party suppliers. Depending on your criteria, once you combine all of your data you may end up with mailing contacts appearing more than once.

The process of removing duplicate entries and creating a unique mailing file is known as de-duplication.

Pattern Matching

As human beings we can easily look at a list of names and addresses and spot duplicates; pattern recognition is built into our brains, and noticing that two entries share the same name and address, even when they are spelled differently or carry slightly different information, is not that difficult.

Human intervention is fine when you only have a small amount of data: we can look through 10, 20, or even 50 names and addresses and remove the offending contact information quickly enough. But what about 10,000 names and addresses? This is where we turn to specialised computer programs to do our dirty work.

In an ideal world the personal information we allow companies to hold about us would be up to date, accurate, and perfectly spelled. Unfortunately, in the real world that's not what happens.

The people who enter your details into computers are human, and sometimes they make mistakes, whether because they couldn't read your handwriting, misheard you on the phone, or simply can't spell very well. Whatever the reason, your details are sometimes captured incorrectly.

The accuracy of de-duplication depends on the accuracy of the data being processed; data that is less than 100% accurate makes de-duplication harder, but not impossible.

Data de-duplication software uses fuzzy matching to make sure that misspellings and other data inaccuracies don't get in the way of matching duplicates.

Software algorithms are applied to values in the data to produce match keys which are then used to identify duplicates instead of the original data values.

For example, we could apply a squashing algorithm that simply removes any spaces or vowels and then makes all the remaining characters uppercase. If we ran this on the address line "123 Sample Road", we'd get "123SMPLRD".

If we ran the same algorithm on the address line "123 Simpl Rod", we'd still get "123SMPLRD".

Software comparing the original data values would not find a match because the values are different, but by comparing the keys instead a match would be found.
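The squashing algorithm described above can be sketched in a few lines of Python (the function name is illustrative, not from any particular de-duplication product):

```python
def squash(value: str) -> str:
    """Build a match key: drop spaces and vowels, uppercase the rest."""
    return "".join(ch for ch in value if ch not in " aeiouAEIOU").upper()

# Both spellings of the address collapse to the same match key,
# so comparing keys finds the duplicate that raw comparison would miss.
print(squash("123 Sample Road"))  # 123SMPLRD
print(squash("123 Simpl Rod"))    # 123SMPLRD
```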

Different de-duplication software will use different algorithms on different data types with varying results of accuracy.

More often than not, more than one data value must match between records before they are considered duplicates. Matching only the first line of an address, for example, wouldn't be good enough: there could be a "123 Sample Road" in Leeds and a "123 Sample Road" in Wakefield. Instead, multiple keys are combined into key sets to produce matching results.
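A key set can be sketched as a tuple of several match keys that must all agree. This minimal Python sketch (field and function names are illustrative) shows why the two "123 Sample Road" addresses in different towns are not treated as duplicates:

```python
def squash(value: str) -> str:
    """Match key: drop spaces and vowels, uppercase the rest."""
    return "".join(ch for ch in value if ch not in " aeiouAEIOU").upper()

def key_set(add1: str, town: str, postcode: str) -> tuple:
    """A key set: every key in the tuple must match for records to be duplicates."""
    return (squash(add1), town.strip().upper(), postcode.replace(" ", "").upper())

leeds = key_set("123 Sample Road", "Leeds", "LS1 1AA")
wakefield = key_set("123 Sample Road", "Wakefield", "WF1 1AA")
print(leeds == wakefield)  # False: same address key, but town and postcode differ
```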


Data De-duplication

At this point it sounds simple: you run your data through the software, which uses fuzzy matching to find duplicate records and remove them so that only a unique data file remains. Job done. But what actually counts as a duplicate record also depends on the level of contextual detail you have in your data.

Consider the following:

Title | Forename | Surname | Add1            | Add2   | Add3 | Town       | County         | Postcode
Mr    | John     | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB
Mr    | John     | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB

It's easy to see that those records are identical and that one of them should be removed, leaving a single entry for John Smith at that address. But what if we had more information to go on?


Title | Forename | Surname | Add1            | Add2   | Add3 | Town       | County         | Postcode | DOB
Mr    | John     | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB  | 01/02/1965
Mr    | John     | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB  | 05/06/1992

If we take the date of birth for those records into account, we now see that this is probably a father and son living at the same address.

We could also have a situation where we have the following:

Forename | Surname | Add1            | Add2   | Add3 | Town       | County         | Postcode
John     | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB
J        | Smith   | 123 Sample Road | Harrow |      | Gronginton | West Yorkshire | GR1 2AB

In this case we could decide that J Smith is likely to be John Smith and remove that record from our data set; however, it could just as easily be Jack Smith, Joan Smith, or Julie Smith.

Because data such as this carries a certain level of ambiguity, we need different levels of de-duplication depending on the situation.
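The initial-versus-forename ambiguity can be made concrete with a small check (a sketch; the function name is illustrative). An initial is compatible with any full name sharing its first letter, which is exactly why the match is uncertain:

```python
def initial_compatible(name_a: str, name_b: str) -> bool:
    """True when one name could be an initialled form of the other."""
    a, b = name_a.strip().upper(), name_b.strip().upper()
    if not a or not b or a[0] != b[0]:
        return False
    # An initial matches any name starting with that letter;
    # two full names must match exactly.
    return len(a) == 1 or len(b) == 1 or a == b

print(initial_compatible("J", "John"))     # True
print(initial_compatible("J", "Joan"))     # True -- the same "J" is ambiguous
print(initial_compatible("John", "Joan"))  # False
```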


Levels of De-duplication

We create the different levels of de-duplication by specifying which data or key fields must match between records. De-duplication can be wide or tight: with wide matching, fewer fields need to match, identifying more duplicates at lower accuracy; with tight matching, more fields must match, identifying duplicates with higher accuracy.

When dealing with data that will be printed and mailed, the widest level of de-duplication is known as household level. At this level we try to identify duplicate households, which means we only want a single recipient at a single address. We can do this by matching only the various address parts of the data records.

The tightest level of de-duplication is known as person level. At this level we try to identify only duplicate individuals in the data, which means we can have more than one person living at the same address. Along with address details, person-level matching will involve the recipient's title, their forename or initial, and any other specific information such as date of birth or, ideally, a unique reference number.
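The two levels can be sketched as two key-set functions over the record fields shown in the tables above (field names and the `squash` helper are illustrative assumptions, not from any specific product). The father-and-son records collide at household level but stay distinct at person level:

```python
def squash(value: str) -> str:
    """Match key: drop spaces and vowels, uppercase the rest."""
    return "".join(ch for ch in value if ch not in " aeiouAEIOU").upper()

def household_key(rec: dict) -> tuple:
    """Widest level: address parts only -- one recipient per address."""
    return (squash(rec["add1"]),
            rec["town"].strip().upper(),
            rec["postcode"].replace(" ", "").upper())

def person_key(rec: dict) -> tuple:
    """Tightest level: address plus personal details such as initial and DOB."""
    return household_key(rec) + (rec["title"].strip().upper(),
                                 rec["forename"][:1].upper(),
                                 rec.get("dob", ""))

father = {"title": "Mr", "forename": "John", "add1": "123 Sample Road",
          "town": "Gronginton", "postcode": "GR1 2AB", "dob": "01/02/1965"}
son = dict(father, dob="05/06/1992")

print(household_key(father) == household_key(son))  # True: same household
print(person_key(father) == person_key(son))        # False: two individuals
```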


Summary

De-duplication of a mailing list is not a straightforward process. Even after creating custom algorithms and key sets, and taking contextual data into account, you can never guarantee that a mailing file will be 100% unique, or that the software didn't drop somebody from the file because their details were too close to another recipient's.
