I have a table consisting of a record_id, email_address, and a handful of other attributes....
I have an index as follows:

    CREATE INDEX "email_record_id" ON "mailing_list"
        ( "email_address" ASC, "record_id" ASC ) IN "indexes";
I have duplicate records in there -- more than one record per email address. I want to delete all duplicates, saving the one with the largest record ID. This works pretty well...
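The statement itself did not survive in this copy, but the description below is of an UPDATE that flags the losers via a self-join, followed by a DELETE of the flagged rows. A hypothetical reconstruction (the `delete_flag` column and the UPDATE-with-FROM syntax are assumptions, not the original code):

```sql
-- Flag every row for which a row with a larger record_id
-- exists for the same email address (delete_flag is an
-- assumed column; the original name was not shown).
UPDATE mailing_list AS a
   SET a.delete_flag = 1
  FROM mailing_list AS a
       JOIN mailing_list AS b
          ON a.email_address = b.email_address
         AND a.record_id < b.record_id;

-- Then remove everything that was flagged.
DELETE FROM mailing_list
 WHERE delete_flag = 1;
```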
I'm joining the table to itself on email address, and the record ID bit is what preserves the first record of a set of duplicates, so all the others get marked for deletion.
I'll go out on a limb and say it works perfectly if my table has 100,000 records. It runs in about 10-12 seconds.
But the problem is that if I run it over a larger set (500,000 records in my next step up), it gets bogged down -- I killed it when it reached 700 seconds. I need to be able to run it over a couple million records.
I can quickly identify my dupes (and also the record_id that I want to keep from a set with duplicates) with this...
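The query was not preserved here either, but a GROUP BY / HAVING form matches the description (one row per duplicated address, carrying the record_id to keep):

```sql
-- One row per email address that appears more than once,
-- with the record_id that should be kept for that address.
SELECT email_address, MAX(record_id) AS keeper_id
  FROM mailing_list
 GROUP BY email_address
HAVING COUNT(*) > 1;
```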
But while this query identifies the keepers from a group of dupes, it does not identify the keepers where there are no dupes.
So... I need an efficient way to pull this together... probably using a GROUP BY instead of JOINing two big record sets.
p.s. Um, no, I'm not a spammer! This is for opt-in lists! People have opted in multiple times, and we don't want to send them multiple emails!
p.p.s.... wait... don't answer this! I just found the answer! I'll be right back.
asked 29 Mar '10, 22:19
Can you check the performance of the following statement?
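The suggested statement is missing from this copy. Given the follow-up remark about avoiding the flag update, it was presumably a single DELETE along these lines (a sketch, not the answerer's actual statement):

```sql
-- Delete every row whose record_id is not the largest
-- (i.e. not the keeper) for its email address.
DELETE FROM mailing_list
 WHERE record_id NOT IN
       ( SELECT MAX(record_id)
           FROM mailing_list
          GROUP BY email_address );
```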
This would avoid updating the records that need to stay, and having to clear the flag again later on.
answered 30 Mar '10, 10:23
Well, after writing all that up and re-reading it, it sounded a bit like my question of a week or two ago...
And Mark Culp gave the answer there to this question too! Thanks Mark! You rock!
The answer is to use a derived table... and instead of trying to identify the records to delete, just identify the keepers, and delete everything else.
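The accepted statement itself was also lost in this copy. A derived-table form matching the description (identify the keepers, delete everything else) might look like the following -- a sketch assuming the dialect supports a joined DELETE, as SQL Anywhere's `DELETE ... FROM table-expression` does:

```sql
-- The derived table "keepers" has one row per email address,
-- holding the largest record_id; everything that doesn't
-- match a keeper row is deleted.
DELETE FROM mailing_list
  FROM mailing_list
       JOIN ( SELECT email_address,
                     MAX(record_id) AS keeper_id
                FROM mailing_list
               GROUP BY email_address ) AS keepers
          ON mailing_list.email_address = keepers.email_address
 WHERE mailing_list.record_id <> keepers.keeper_id;
```

Because the derived table is built with a single GROUP BY pass, the optimizer avoids the large self-join of the original approach.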
That does it, and it processes my 500,000 test set in 50 seconds. I haven't scaled bigger than that yet.
Update: I just ran this against 1.27 million records in 141 seconds! It seems to scale up nicely!