At a glance
Question
In my dataset I need to identify duplicates across four separate fields. My problem is the data is incomplete and some records may not have entries in all four fields. Any suggestions?
Answer
Excel can help to identify duplicates and possible duplicates. The fact that the data is incomplete means you will have to manually check some records. This is a good example where Excel can’t supply a complete solution, but it can make the job a lot easier and quicker. We will use the data in Figure 1 to demonstrate and explain the technique.
For the purposes of this demonstration I will ignore problems associated with leading and trailing spaces. The TRIM function can remove those types of spaces.
For a duplicate to be identified we need a match in the four fields (columns): First Name, Last Name, Email and DOB (date of birth).
First and Last Name fields are assumed to be complete. Problems are caused by a lack of a DOB and a different or missing email address.
In recent years email addresses have become a default identifier. This works when people use a single email, but many people use different emails for different purposes.
We can see in Figure 1 that rows 2 and 7 are identical in all four fields. Rows 4 and 9 are an example of a missing DOB. Rows 5 and 11 are an example of a missing email address.
To assist in solving our problem we will add four columns to the data to identify different types of duplicates. We can then use those new columns to filter the list to show only the duplicates. The COUNTIFS function can identify duplicates across multiple columns.
Figure 2 shows the results of all the added columns.
Column E will be used to identify perfect matches where all four fields match.
The formula in E2 is: =AND(COUNTA(A2:D2)=4,COUNTIFS(A:A,A2,B:B,B2,D:D,D2,C:C,C2)>1)
This formula will display TRUE if there is a duplication based on all four fields. The AND function is commonly used with an IF function, but it can be used on its own to display TRUE and FALSE.
The AND function will only display TRUE if all the logical tests between its bracket are TRUE. There are two logical tests in this AND function.
The first is: COUNTA(A2:D2)=4
The COUNTA function counts how cells have entries in a range. (Note the COUNT function only counts cells containing numeric entries). The COUNTA result is compared to 4. This will return TRUE if there are four entries in the range. By confirming this we ignore any rows with blanks cells.
The second is: COUNTIFS(A:A,A2,B:B,B2,D:D,D2,C:C,C2)>1
The COUNTIFS function counts how many entries match multiple criteria within multiple ranges. All criteria must be met to return a value. In our case we are checking every cell against the whole column.
Syntax
COUNTIFS(Range_1,Criteria_1,Range_2,Criteria_2…)
Each Range and Criteria are matched together. Ranges tend to be one column wide or one row high. Criteria tends to be a cell reference containing the criteria to search for.
The COUNTIFS formula above counts how many times Fred appears in column A, when Smith is in column B and [email protected] is in column C and 6/9/1962 is in column D.
The result must be at least one. If it appears more than once it is duplicated, so we use >1 at the end of the COUNTIFS to return TRUE if the row is duplicated.
The full formula in column E will return TRUE when there are four entries in the range and those entries are duplicated.
Column F uses the same technique as column E, but only on columns A,B and C. It checks three of the four fields.
Again it checks there are entries in all the cells, three this time. The COUNTIFS is similar but only looks at the first three columns. Given that emails are unique this is probably a safe test of duplication when there are missing entries in the DOB field.
Column G uses the same technique to find duplicates based on columns A, B and D. It also checks three of the four fields. This works if there are missing email addresses by matching the name and DOB.
The formula in cell G2 is: =AND(COUNTA(A2:B2,D2)=3,COUNTIFS(A:A,A2,B:B,B2,D:D,D2)>1)
Note: you can have multiple ranges separated by commas within the COUNTA function brackets as shown above.
Finally we will add a formula to column H that will make it easy to filter those columns that have matches on three out of the four fields.
The formula for cell H2 is: =COUNTIF(F2:G2,TRUE)=1
This will return TRUE if there is a single TRUE in columns F and G. If columns F and G both have TRUE it means it will be a perfect match and column E already provides that result.
All the above formulas have been copied down the columns – see Figure 2
Filtering in Excel
You can add filters to a table by selecting a cell in the table and pressing Ctrl + SHIFT+ L.
Filtering by TRUE in Column E will display the perfect matches as shown in Figure 3 (below).
Column H will allow us to filter so that we display rows that have a TRUE in column F or a TRUE in Column G, but not TRUE in both, see Figure 4.
Without column H you can’t use the filter arrows in row 1 to achieve the filter result shown in Figure 4.
Removing duplicates
Excel has a built-in feature in the Data ribbon to remove duplicates, see Figure 5 (below). In our case it would only have removed row 7 as it was a duplicate of row 2. It keeps the first occurrence and deletes the second and further occurrences.
The danger with this feature is if there are blank cells in the data. For example, if two rows in our example had the same first and last names but had blank emails and blank DOB fields, they would be treated as duplicates and the last one removed. Since first and last name combinations are not unique, this is not ideal.
Neale Blackwood CPA runs A4 Accounting, providing Excel training, webinars and consulting services to various organisations. Questions can be sent to [email protected].