login
deduping in Access 2000 posted: Fri 2012-05-11 16:35:14 tags: tech
Let's say you have an Access 2000 table, one field of which is email addresses, and due to the nature of the data collection, you have some email addresses entered multiple times. And you want to send a message to all your email contacts, but obviously you don't want to send to the same address more than once. So you need to do what's known in the industry as deduplication, or "deduping".

If your data set is small enough, you can do it manually. First you'd query the table:

SELECT tblContacts.email, Count(tblContacts.email) AS [dupe-count]
FROM tblContacts
GROUP BY tblContacts.email
HAVING ((Count(tblContacts.email))>1)
   [AND ((unsubs.email) Is Not Null)];

Armed with this info, you'd then sort your original table by email address, deleting unwanted duplicates on visual identification.

But what if your data set is prohibitively large? Microsoft's website suggests a procedure where you copy the table structure, change the primary key in the copy to accept only unique values in the email field, and append data from the original table to the copy. The database engine will automatically reject duplicate entries... and then their procedure tells you to drop the original table. Woah.

Couple of problems here: first, what if your table isn't just a list of bare email addresses, but a contact list where you haven't necessarily entered an email address in every record? You'd have to start juggling partial data sets to preserve the records with no email address, because the table copy with the primary key redefinition won't accept records with an empty/null email address field.

Second, what if your contact table participates in referential integrity relationships? You'll have to re-key records in each table related to your contact table, or risk losing them. Then you'll have to copy all the referential integrity relationships from the source table to the copy.

So we want a more flexible and generalized solution: one that preserves referential integrity rules because it doesn't require table deletion, and can be adjusted to accomodate records with no email address entered.

Assuming the simplest case, where we don't have to worry about re-keying records in related tables, we can accomplish it with 4 queries. The SQL below assumes that excluding records with Null entries isn't necessary; extending the HAVING clause to exclude Nulls, in accordance with the example SQL above, is left as an exercise to the reader.

The first query SELECTs email addresses where the Count of occurrences is greater than 1. It also grabs the Max(id) related to each email address. Let's call it "01_dupeflag".

SELECT tblContacts.email, Count(tblContacts.email) AS [dupe-count]
FROM tblContacts
GROUP BY tblContacts.email
HAVING ((Count(tblContacts.email))>1);

The second query SELECTs all records containing addresses in our "flagged addresses" query. Let's call it "02_dupes".

SELECT tblContacts.id, tblContacts.email
FROM tblContacts INNER JOIN 01_dupeflag 
   ON tblContacts.email = 01_dupeflag.email
ORDER BY tblContacts.id;

The third query SELECTs all records with "flagged" addresses where the id doesn't match Max(id) from the first query - in essence, inverting the selection within the set of "flagged" records. Let's call it "03_slated" (as in "slated for deletion").

SELECT 02_dupes.id, 01_dupeflag.Last_ID
FROM 02_dupes LEFT JOIN 01_dupeflag 
   ON 02_dupes.id = unsubs_01_dupeflag.Last_ID
WHERE ((unsubs_01_dupeflag.Last_ID) Is Null);

...And our 4th query performs the actual record deletion, call it "04_del_dupes:

DELETE tblContacts.*
FROM tblContacts RIGHT JOIN 03_slated 
   ON tblContacts.id = 03_slated.id;

* * *

The "inverted selection" technique is the same way you'd UPDATE foreign keys in related tables, in a referential-integrity-rules scenario.