I have a tab delimited nvarchar column with many values. The problem is that we sometimes have duplicates in that column. How can I update the column and remove all duplicates?

Today it look like this:

ID--Material--Information--
---------------------------
1--Test1------Engine 12b \x09 Diam 32 \x09 Rear left \x09 Engine 12b
2--Test1------Engine 24b \x09 Diam 32 \x09 Rear right \x09 Diam 32 \x09 Rear right  
3--Test2------Engine 36b \x09 Diam 32 \x09 Rear left \x09 Rear left

I wanna remove duplicates in column informations so it look like this:


ID--Material--Information--
---------------------------
1--Test1------Engine 12b \x09 Diam 32 \x09 Rear left
2--Test1------Engine 24b \x09 Diam 32 \x09 Rear right   
3--Test2------Engine 36b \x09 Diam 32 \x09 Rear left

asked 12 Oct '15, 16:51

Rolle's gravatar image

Rolle
380203041
accept rate: 0%

edited 14 Oct '15, 02:11

Volker%20Barth's gravatar image

Volker Barth
29.9k294446654


Here's a complete sample (FWIW, it's nice if you provide a simple test schema by yourself...):

create table Test
(
    ID int not null default autoincrement primary key,
    Material varchar(30) not null,
    Information varchar(1000) not null
);
insert Test values
   (1, 'Test1', 'Engine 12b \t Diam 32 \t Rear left \t Engine 12b'),
   (2, 'Test1', 'Engine 24b \t Diam 32 \t Rear right \t Diam 32 \t Rear right'),  
   (3, 'Test2', 'Engine 36b \t Diam 32 \t Rear left \t Rear left');

select * from Test;

Now use sa_split_list to build a query that turns the "blanks-and-tab-delimited" column into separate rows (while preserving the order) - note that cross apply is handy to use a column as parameter for a procedure:

select distinct id, line_num, row_value
from Test cross apply sa_split_list(Information, ' \t ')
order by 1, 2;

-- will return
id,line_num,row_value
1,1,Engine 12b
1,2,Diam 32
1,3,Rear left
1,4,Engine 12b
2,1,Engine 24b
2,2,Diam 32
2,3,Rear right
2,4,Diam 32
2,5,Rear right
3,1,Engine 36b
3,2,Diam 32
3,3,Rear left
3,4,Rear left

Now, use that as a derived query (dt1) to remove duplicate values (and still preserving the order, i.e. for duplicate values, the first appearance is preserved):

select id, row_value, min(line_num)
from
   (select distinct id, line_num, row_value
    from Test cross apply sa_split_list(Information, ' \t ')) dt
group by id, row_value
order by 1, 3;

-- will return 
id,row_value,min_line_num
1,Engine 12b,1
1,Diam 32,2
1,Rear left,3
2,Engine 24b,1
2,Diam 32,2
2,Rear right,3
3,Engine 36b,1
3,Diam 32,2
3,Rear left,3

Now, use that again as a derived query dt2 to re-build the list:

select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information
from
   (select id, row_value, min(line_num) as min_line_num
    from
       (select distinct id, line_num, row_value
        from Test cross apply sa_split_list(Information, ' \t ')) dt1
    group by id, row_value) dt2
group by id
order by 1;

-- will return
id,Deduplicated_Information
1,Engine 12b \t Diam 32 \t Rear left
2,Engine 24b \t Diam 32 \t Rear right
3,Engine 36b \t Diam 32 \t Rear left

Now, build a join with the original table to check the results:

select Test.*, Deduplicated_Information
from Test inner join
   (select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information
    from
      (select id, row_value, min(line_num) as min_line_num
       from
          (select distinct id, line_num, row_value
           from Test cross apply sa_split_list(Information, ' \t ')) dt1
       group by id, row_value) dt2
    group by id) dt3
   on Test.ID = dt3.id
order by 1;

-- will return
ID,Material,Information,Deduplicated_Information
1,Test1,Engine 12b \t Diam 32 \t Rear left \t Engine 12b,Engine 12b \t Diam 32 \t Rear left
2,Test1,Engine 24b \t Diam 32 \t Rear right \t Diam 32 \t Rear right,Engine 24b \t Diam 32 \t Rear right
3,Test2,Engine 36b \t Diam 32 \t Rear left \t Rear left,Engine 36b \t Diam 32 \t Rear left

and use that join to update the real table:

update Test
set Information = Deduplicated_Information
from Test inner join
   (select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information
    from
      (select id, row_value, min(line_num) as min_line_num
       from
          (select distinct id, line_num, row_value
           from Test cross apply sa_split_list(Information, ' \t ')) dt1
       group by id, row_value) dt2
    group by id) dt3
   on Test.ID = dt3.id;

select * from Test;

-- will return
ID,Material,Information
1,Test1,Engine 12b \t Diam 32 \t Rear left
2,Test1,Engine 24b \t Diam 32 \t Rear right
3,Test2,Engine 36b \t Diam 32 \t Rear left
permanent link

answered 13 Oct '15, 08:29

Volker%20Barth's gravatar image

Volker Barth
29.9k294446654
accept rate: 32%

edited 13 Oct '15, 08:35

Volker, I am curious. Could you please explain my, why you have decided to use CROSS APPLY? I tested a simpler request, and it works fine as well:

SELECT DISTINCT id, line_num, row_value FROM Test, sa_split_list(Information, ' \t ') ORDER BY 1, 2;

(14 Oct '15, 07:27) Vlad
Replies hidden

For me (using 12.0.1.4301), your statement results in a -824 error ("Illegal reference to correlation name 'Test'")

AFAIK it's necessary to use an APPLY operator or a LATERAL expression if one wants to use a procedure in the FROM clause and has to supply column values from other tables as procedure parameters. Cf. a search for the "lateral" tag:)

(14 Oct '15, 07:45) Volker Barth
Comment Text Removed

Strange. I was testing my query on SA17. Even plans look the same (I used the feature found by Breck that compares plans). The only small difference is the Final plan build time.

(14 Oct '15, 10:29) Vlad

Hm, that seems to be a difference between v12 and above - with the latest v16 build (16.0.0.2178) the comma operator is accepted as with v17.

FWIW, the following semantically equivalent expression with a CROSS JOIN is also rejected in v12 (with the same -824 error) and allowed in v16/v17:

select distinct id, line_num, row_value
from Test cross join sa_split_list(Information, ' \t ')
order by 1, 2;
(20 Oct '15, 06:52) Volker Barth

Could you please explain my, why you have decided to use CROSS APPLY?

After further investigation, I guess my answer is "because it think I understand that construction, and because it works with v12":)

(20 Oct '15, 06:54) Volker Barth

Before giving you this answer, I honestly stole it from this page. I do not want to sound rude, but this was my search request.

Ok, returning to the proposal. Here it is:

SELECT DISTINCT(row_value)
FROM sa_split_list('Engine 12b \x09 Diam 32 \x09 Rear left \x09 Engine 12b', ' \x09 ');

Result:

row_value
-----------
Engine 12b
Diam 32
Rear left

I think, this will be a quite incorrect to do everything on the SQL server. I would suggest you to change your software code so it will use SET collections instead of LIST, when you serialize tokens (or deserialize) to/from the database. Plus, this kind of normalization could be implemented once for the column, so you will never have such question anymore.

At least this is how I would solve this task.

permanent link

answered 13 Oct '15, 03:38

Vlad's gravatar image

Vlad
3367924
accept rate: 0%

1

Just to add:

With LIST() you can turn the distinct rows into the single column "Information" again (though that might violate normalization rules, as Vlad has pointed out).

Based on that, you can certainly combine sa_split_list() and list() to update the columns containing duplicate tokens in one single update statement.

(13 Oct '15, 04:07) Volker Barth

A good suggestion, I forgot to search for the function LIST. I am always thinking that sometimes it is better to fix everything in the application code, rather than to use SQL.

(13 Oct '15, 04:17) Vlad
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×18
×12

question asked: 12 Oct '15, 16:51

question was seen: 430 times

last updated: 20 Oct '15, 07:01