I have a tab delimited nvarchar column with many values. The problem is that we sometimes have duplicates in that column. How can I update the column and remove all duplicates? Today it look like this:
|
Here's a complete sample (FWIW, it's nice if you provide a simple test schema by yourself...): create table Test ( ID int not null default autoincrement primary key, Material varchar(30) not null, Information varchar(1000) not null ); insert Test values (1, 'Test1', 'Engine 12b \t Diam 32 \t Rear left \t Engine 12b'), (2, 'Test1', 'Engine 24b \t Diam 32 \t Rear right \t Diam 32 \t Rear right'), (3, 'Test2', 'Engine 36b \t Diam 32 \t Rear left \t Rear left'); select * from Test; Now use sa_split_list to build a query that turns the "blanks-and-tab-delimited" column into separate rows (while preserving the order) - note that cross apply is handy to use a column as parameter for a procedure: select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \t ') order by 1, 2; -- will return id,line_num,row_value 1,1,Engine 12b 1,2,Diam 32 1,3,Rear left 1,4,Engine 12b 2,1,Engine 24b 2,2,Diam 32 2,3,Rear right 2,4,Diam 32 2,5,Rear right 3,1,Engine 36b 3,2,Diam 32 3,3,Rear left 3,4,Rear left Now, use that as a derived query (dt1) to remove duplicate values (and still preserving the order, i.e. for duplicate values, the first appearance is preserved): select id, row_value, min(line_num) from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \t ')) dt group by id, row_value order by 1, 3; -- will return id,row_value,min_line_num 1,Engine 12b,1 1,Diam 32,2 1,Rear left,3 2,Engine 24b,1 2,Diam 32,2 2,Rear right,3 3,Engine 36b,1 3,Diam 32,2 3,Rear left,3 Now, use that again as a derived query dt2 to re-build the list: select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \t ')) dt1 group by id, row_value) dt2 group by id order by 1; -- will return id,Deduplicated_Information 1,Engine 12b \t Diam 32 \t Rear left 2,Engine 24b \t Diam 32 \t Rear right 3,Engine 36b \t Diam 32 \t Rear left Now, build a join with the original table to check the results: select Test.*, Deduplicated_Information from Test inner join (select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \t ')) dt1 group by id, row_value) dt2 group by id) dt3 on Test.ID = dt3.id order by 1; -- will return ID,Material,Information,Deduplicated_Information 1,Test1,Engine 12b \t Diam 32 \t Rear left \t Engine 12b,Engine 12b \t Diam 32 \t Rear left 2,Test1,Engine 24b \t Diam 32 \t Rear right \t Diam 32 \t Rear right,Engine 24b \t Diam 32 \t Rear right 3,Test2,Engine 36b \t Diam 32 \t Rear left \t Rear left,Engine 36b \t Diam 32 \t Rear left and use that join to update the real table: update Test set Information = Deduplicated_Information from Test inner join (select id, list(row_value, ' \t ' order by min_line_num) as Deduplicated_Information from (select id, row_value, min(line_num) as min_line_num from (select distinct id, line_num, row_value from Test cross apply sa_split_list(Information, ' \t ')) dt1 group by id, row_value) dt2 group by id) dt3 on Test.ID = dt3.id; select * from Test; -- will return ID,Material,Information 1,Test1,Engine 12b \t Diam 32 \t Rear left 2,Test1,Engine 24b \t Diam 32 \t Rear right 3,Test2,Engine 36b \t Diam 32 \t Rear left Volker, I am curious. Could you please explain my, why you have decided to use CROSS APPLY? I tested a simpler request, and it works fine as well: SELECT DISTINCT id, line_num, row_value FROM Test, sa_split_list(Information, ' \t ') ORDER BY 1, 2;
(14 Oct '15, 07:27)
Vlad
Replies hidden
For me (using 12.0.1.4301), your statement results in a -824 error ("Illegal reference to correlation name 'Test'") AFAIK it's necessary to use an APPLY operator or a LATERAL expression if one wants to use a procedure in the FROM clause and has to supply column values from other tables as procedure parameters. Cf. a search for the "lateral" tag:)
(14 Oct '15, 07:45)
Volker Barth
Comment Text Removed
Strange. I was testing my query on SA17. Even plans look the same (I used the feature found by Breck that compares plans). The only small difference is the Final plan build time.
(14 Oct '15, 10:29)
Vlad
Hm, that seems to be a difference between v12 and above - with the latest v16 build (16.0.0.2178) the comma operator is accepted as with v17. FWIW, the following semantically equivalent expression with a CROSS JOIN is also rejected in v12 (with the same -824 error) and allowed in v16/v17: select distinct id, line_num, row_value from Test cross join sa_split_list(Information, ' \t ') order by 1, 2;
(20 Oct '15, 06:52)
Volker Barth
After further investigation, I guess my answer is "because it think I understand that construction, and because it works with v12":)
(20 Oct '15, 06:54)
Volker Barth
|
Before giving you this answer, I honestly stole it from this page. I do not want to sound rude, but this was my search request. Ok, returning to the proposal. Here it is:
Result:
I think, this will be a quite incorrect to do everything on the SQL server. I would suggest you to change your software code so it will use SET collections instead of LIST, when you serialize tokens (or deserialize) to/from the database. Plus, this kind of normalization could be implemented once for the column, so you will never have such question anymore. At least this is how I would solve this task. 1
Just to add: With LIST() you can turn the distinct rows into the single column "Information" again (though that might violate normalization rules, as Vlad has pointed out). Based on that, you can certainly combine sa_split_list() and list() to update the columns containing duplicate tokens in one single update statement.
(13 Oct '15, 04:07)
Volker Barth
A good suggestion, I forgot to search for the function LIST. I am always thinking that sometimes it is better to fix everything in the application code, rather than to use SQL.
(13 Oct '15, 04:17)
Vlad
|