I have a table consisting of a record_id, email_address, and a handful of other attributes....

CREATE TABLE "mailing_list" (
"record_id" integer NOT NULL DEFAULT autoincrement,
"email_address" long varchar NULL,
"dupe_flag" char(1) NULL,
/* and a bunch more */
PRIMARY KEY ( "record_id" )
);

I have an index as follows.. CREATE INDEX "email_record_id" ON "mailing_list" ( "email_address" ASC, "record_id" ASC ) IN "indexes";

I have duplicate records in there -- more than one record per email address. I want to delete all duplicates, saving the one with the largest record ID. This works pretty well...

UPDATE mailing_list tableA join mailing_list tableB  
    ON tableA.email_address =  TableB.email_address 
      AND TableA.record_id > TableB.record_id   
    set TableB.dupe_flag = 'y';

DELETE FROM mailing_list WHERE dupe_flag = 'Y';

I'm joining the table to itself on email address, and the record ID bit is what preserves the first record of a set of duplicates, so all the others get marked for deletion.

I'll go out on a limb and say it works perfectly if my table has 100,000 records. It runs in about 10-12 seconds.

But the problem is that if I run it over a larger set (500,000 records in my next step up), it apparently gets bogged down -- I kill it when we reach 700 seconds. I need to be able to run it over a couple million records.

I can quickly identify my dupes (and also the record_id that I want to keep from a set with duplicates) with this...

 SELECT  email_address, max(record_id), count(*) as dupe_count  
 FROM mailing_list 
 GROUP BY email_address 
 HAVING dupe_count > 1 

But while this query identifies the keepers from a group of dupes, it does not identify the keepers where there are no dupes.

So... I need a efficient way to pull this together... probably using a GROUP BY instead of JOINing two big record sets.

Thanks!

p.s. Um, no, I'm not a spammer! This is for opt-in lists! People have opted in multiple times, and we don't want to send them multiple emails!

p.p.s.... wait... don't answer this! I just found the answer! I'll be right back.

asked 29 Mar '10, 22:19

Ron%20Hiner's gravatar image

Ron Hiner
880202427
accept rate: 9%


Can you check the performance of the following statement ?

DELETE FROM mailing_list 
WHERE record_id IS NOT IN ( 
       /* derived table of just keepers */   
      SELECT  max(record_id) as keeper  
          FROM mailing_list 
             GROUP BY email_address 
) 

This would avoid the update of records that need to stay. And to clear the flag later on.

Thomas

permanent link

answered 30 Mar '10, 10:23

Thomas%20Duemesnil's gravatar image

Thomas Dueme...
2.5k243560
accept rate: 15%

Thanks T! In my tests, that cuts the deduping time in by 60% in the smaller record sets (100,000 records). Looks like large sets are an even bigger improvement.

(30 Mar '10, 13:44) Ron Hiner

@Ron: Besides the performace improvement, I would prefer Thomas's suggestion as it seems more straightforward IMHO: Directly deleting what is unnecessary instead of marking what has to remain and then deleting just the rest...

(30 Mar '10, 14:06) Volker Barth

Keep it as simple as possible, that's it. And get better readability as a side effect.

(30 Mar '10, 14:35) Reimer Pods

Well, after writing all that up, and re-reading it, it souded a bit like my question of week or two ago...

http://sqlanywhere-forum.sap.com/questions/472/how-can-i-use-a-subquery-to-limit-my-where-clause-in-an-update

And Mark Culp gave the answer there to this question too! Thanks Mark! You rock!

The answer is to use a derived table... and instead of trying to identify the records to delete, just identify the keepers, and delete everything else.

  Update mailing_list  ml 
      set dupe_flag = 'N'
        from ( 
       /* derived table of just keepers */   
      SELECT  email_address, max(record_id) as keeper  
          FROM mailing_list 
             GROUP BY email_address ) dt
        where ml.record_id = dt.keeper ;

DELETE FROM mailing_list WHERE dupe_flag is null;

That does it, and it processes my 500,000 test set in 50 seconds. I haven't scaled bigger than that yet.

Update: I just ran this against 1.27 million records in 141 seconds! It seems to scale up nicely!

Ron

permanent link

answered 29 Mar '10, 22:25

Ron%20Hiner's gravatar image

Ron Hiner
880202427
accept rate: 9%

edited 29 Mar '10, 22:37

3

If you run this type of update more than once, and new records can be added between runs, then you will need to reset your dupe_flag to null before running your update ... set dupe_flag = 'N'. ... because obviously a record that use to be max(record_id) in the last run may no longer be the max when run again.

(29 Mar '10, 23:15) Mark Culp
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×5
×1

question asked: 29 Mar '10, 22:19

question was seen: 1,581 times

last updated: 30 Mar '10, 10:23