I have a table consisting of a record_id, email_address, and a handful of other attributes....
I have an index as follows.. CREATE INDEX "email_record_id" ON "mailing_list" ( "email_address" ASC, "record_id" ASC ) IN "indexes"; I have duplicate records in there -- more than one record per email address. I want to delete all duplicates, saving the one with the largest record ID. This works pretty well...
I'm joining the table to itself on email address, and the record ID bit is what preserves the first record of a set of duplicates, so all the others get marked for deletion. I'll go out on a limb and say it works perfectly if my table has 100,000 records. It runs in about 10-12 seconds. But the problem is that if I run it over a larger set (500,000 records in my next step up), it apparently gets bogged down -- I kill it when we reach 700 seconds. I need to be able to run it over a couple million records. I can quickly identify my dupes (and also the record_id that I want to keep from a set with duplicates) with this...
But while this query identifies the keepers from a group of dupes, it does not identify the keepers where there are no dupes. So... I need a efficient way to pull this together... probably using a GROUP BY instead of JOINing two big record sets. Thanks! p.s. Um, no, I'm not a spammer! This is for opt-in lists! People have opted in multiple times, and we don't want to send them multiple emails! p.p.s.... wait... don't answer this! I just found the answer! I'll be right back. |