With a database using Windows-1252 as its default CHAR encoding and UTF-8 as its default NCHAR encoding, I'm facing values in an NCHAR column that are not valid in the CHAR charset. (These are probably Polish characters, e.g. "C with acute accent", small ć = U+0107; and yes, I'm aware that using UTF-8 as the CHAR encoding would avoid that problem, but that is not possible in this case.)

Note, I'm not dealing with conversion between client and database here; it's just database-internal conversion via csconvert(). I'm using a query such as

select MyColumn,
   cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252
from MyTable

How do I test whether a conversion via csconvert leads to losses?

I had thought the on_charset_conversion_failure option would be of help, but setting that option to a value other than the default "Ignore" returns SQLCODE -883:

"You attempted to set the on_charset_conversion_failure option to produce either a warning or error report. This setting is not available when the single-byte to single-byte converter is used."

Using csconvert to convert such NCHAR values to CHAR replaces the invalid characters with the substitution character \x1A, as expected. So is the recommended method to check for invalidity to test either

  • for a (known) SUB character via "where MyColumn1252 like '%\x1A%'"
  • or whether original and converted data are unequal, such as "where MyColumn != MyColumn1252"?
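For reference, both variants can be written as plain queries. A sketch, assuming the table and column names from above, and assuming the CHAR charset's substitution character is code point 26 (true for Windows-1252, but not necessarily for every charset):

```sql
-- Sketch: detect NCHAR values that do not survive conversion to the CHAR charset.

-- Variant 1: look for the SUB character (code 26) in the converted value.
select MyColumn
from MyTable
where locate(cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar),
             char(26)) > 0;

-- Variant 2: compare original and converted value for inequality.
select MyColumn
from MyTable
where MyColumn != cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar);
```

Note that Variant 2 compares an NCHAR value with a CHAR value, so an implicit charset conversion happens within the comparison itself.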

And is there a facility to get the "base character" in this case (here a latin "c") via some csconvert with different locales or collation tailoring?

asked 17 Aug, 13:14


Volker Barth

edited 17 Aug, 13:15


The doc says that on_charset_conversion_failure is for conversions between the client & the server, not for conversions occurring within the server. There's a lot of history there but ultimately I don't think there's a built-in, authoritative way to accomplish what you want. You can look for a substitution character but not all character sets have the same substitution character. Your second suggestion is probably best: round-tripping the string. If to_nchar( to_char( nstr ) ) != nstr then the conversion was lossy or the original string had an invalid encoding.
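The round trip can be expressed as a single predicate; a sketch, assuming the table and column names from the question:

```sql
-- Sketch: flag rows where NCHAR -> CHAR -> NCHAR does not reproduce the
-- original value, i.e. the CHAR conversion was lossy (or the original
-- string had an invalid encoding).
select MyColumn
from MyTable
where to_nchar(to_char(MyColumn)) != MyColumn;
```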

I don't think there is a way to get the "base character" either.


answered 17 Aug, 13:47


John Smirnios

Well, I had hoped that, say within the German locale, I could use collation tailoring options that treat a regular c and a c with acute accent as equal (as that would be the typical "real-life workaround"). Thanks for the quick clarification, and the hint with to_char().

(17 Aug, 15:54) Volker Barth

That's a different question from your original "How do I test whether a conversion via csconvert leads to losses?" or "how to get the base character" :)

If you want to compare strings at just the primary level, I think you can use COMPARE( x, y, 'UCA(case=ignore;accent=ignore;punct=primary)' )

You may need to be careful if x & y are different string types. If one is CHAR and the other NCHAR, I have a vague recollection that they might both be converted to CHAR first, which would be lossy; but maybe I'm confusing that with the inference rules for comparison: https://dcx.sap.com/index.html#sqla170/en/html/81f5aac76ce21014ab3ac8ae03b903ae.html. I would hope that they are both promoted to NCHAR.
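If that worry applies, one defensive option is to promote both operands to NCHAR explicitly before comparing, so no implicit (and possibly lossy) demotion to CHAR can occur. A sketch, with hypothetical operands x and y from a hypothetical table MyStrings:

```sql
-- Sketch: force both sides to NCHAR before the UCA comparison so neither
-- operand is implicitly converted to the lossy CHAR charset.
select compare(cast(x as nvarchar), cast(y as nvarchar),
               'UCA(case=ignore;accent=ignore;punct=primary)') as cmp
from MyStrings;
```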

(17 Aug, 16:16) John Smirnios

John, thanks for the further pointer. I had (as usual...) forgotten that all string literals are converted to the CHAR charset, which had made my previous comparisons pointless.

If I use column values or build fitting Unicode strings via UNISTR and \u codes, I get the expected results: locales differ in whether they treat accented characters as separate characters. E.g. "ć" is a separate character in Polish, whereas it is a variant of "c" in German.

begin
   declare str1 nvarchar(30) = unistr('Novi\u0107');
   declare str2 nvarchar(30) = unistr('Novic');
   select str1, str2,
      compare(str1, str2, 'UCA(Locale=pl;Accent=Ignore)') as pl_compare, 
      compare(str1, str2, 'UCA(Locale=de;Accent=Ignore)') as de_compare;
end;

returns: Nović | Novic | 1 | 0 – i.e. the Polish locale treats the strings as different (1), the German one as equal (0).

(18 Aug, 07:25) Volker Barth

Just to add: For the (very rare) occasions of lossy conversions in my case, I now use

  • the test "MyUnicodeColumn != to_nchar(to_char(MyUnicodeColumn))" to detect such cases, and
  • then a "brute-force" check for each substituted character: whether any of the CHAR "base characters" (A-ZÆÐÞa-zæðþ) is treated as equal to the Unicode character when compared via "UCA(Locale=de;Case=respect;Accent=ignore)" - if so, that base character is used instead of the substitute character.

So in my above sample, "Nović" will be converted to "Novic" instead of "Novi␚", as that's the best I can do here IMHO.
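The brute-force lookup described above could be wrapped in a small function. A sketch (the function name is made up; the candidate list is the one from the comment, and the tailoring is the German-locale one used there):

```sql
-- Sketch: map an accented Unicode character to its CHAR "base character" by
-- probing each candidate with an accent-insensitive UCA comparison; falls
-- back to the SUB character when nothing matches.
create or replace function BaseChar(in ch nchar(1))
returns nchar(1)
begin
    declare candidates nvarchar(60) =
        'ABCDEFGHIJKLMNOPQRSTUVWXYZÆÐÞabcdefghijklmnopqrstuvwxyzæðþ';
    declare i int = 1;
    while i <= length(candidates) loop
        if compare(ch, substr(candidates, i, 1),
                   'UCA(Locale=de;Case=respect;Accent=ignore)') = 0 then
            return substr(candidates, i, 1);
        end if;
        set i = i + 1;
    end loop;
    return char(26);    -- SUB character: no base character found
end;
```

With this, BaseChar(unistr('\u0107')) should yield 'c'.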

(19 Aug, 03:54) Volker Barth