With a database using the default Windows-1252 CHAR encoding and the default UTF-8 NCHAR encoding, I'm facing values in an NCHAR column that are not valid in the CHAR charset. (These are probably Polish characters, e.g. "C with acute accent", small ć = U+0107. And yes, I'm aware that using UTF-8 as the CHAR encoding would avoid that problem, but that is not possible in this case.) Note, I'm not dealing with conversion between client and database; it's just database-internal conversion via csconvert(). I'm using a query such as

    select MyColumn,
       cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252
    from MyTable

How do I test whether a conversion via csconvert() leads to losses? I had thought the on_charset_conversion_failure option would be of help, but setting that option to a value besides the default "Ignore" returns SQLCODE -883.
Using csconvert() to convert such NCHAR values to CHAR replaces the invalid characters with the substitution character \x1A, as expected. So is that the recommended method to check for invalidity: either testing the result for the substitution character, or converting the result back to NCHAR and comparing it with the original?
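For illustration, a minimal sketch of the first option, using the placeholder names from above (and assuming \x1A is the substitution character in play here):

    -- flag rows whose NCHAR value cannot be represented in the CHAR charset:
    -- a lossy conversion inserts the substitution character \x1A
    select MyColumn,
       cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252
    from MyTable
    where locate(cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar),
                 '\x1A') > 0;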
And is there a facility to get the "base character" in this case (here a Latin "c") via some csconvert() call with different locales or collation tailoring?
The doc says that on_charset_conversion_failure is for conversions between the client and the server, not for conversions occurring within the server. There's a lot of history there, but ultimately I don't think there's a built-in, authoritative way to accomplish what you want. You can look for a substitution character, but not all character sets have the same substitution character. Your second suggestion is probably best: round-tripping the string. If to_nchar( to_char( nstr ) ) != nstr, then the conversion was lossy or the original string had an invalid encoding. I don't think there is a way to get the "base character" either.

Well, I had hoped that, say within the German locale, I could use collation tailoring options that treat a regular c and a c with acute accent as equal (as that would be the typical "real-life workaround"). Thanks for the quick clarification, and the hint with to_char().
(17 Aug '22, 15:54)
Volker Barth
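A minimal sketch of the round-trip test described above, again assuming the NCHAR column MyColumn in MyTable:

    -- rows where converting to the CHAR charset and back does not reproduce
    -- the original NCHAR value, i.e. the NCHAR -> CHAR conversion was lossy
    -- (or the original string had an invalid encoding)
    select MyColumn
    from MyTable
    where to_nchar(to_char(MyColumn)) <> MyColumn;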
That's a different question from your original "How do I test whether a conversion via csconvert leads to losses?" or "how to get the base character" :) If you want to compare strings at just the primary level, I think you can use COMPARE( x, y, 'UCA(case=ignore;accent=ignore;punct=primary)' ). You may need to be careful if x and y are different string types: if one is CHAR and the other NCHAR, I have a vague recollection that they might both be converted to CHAR first, which would be lossy, but maybe I'm confusing it with the inference rules for comparison: https://dcx.sap.com/index.html#sqla170/en/html/81f5aac76ce21014ab3ac8ae03b903ae.html. I would hope that they are both promoted to NCHAR.
(17 Aug '22, 16:16)
John Smirnios
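A sketch of how such a mixed-type comparison could be made explicit; CharColumn (CHAR) and NCharColumn (NCHAR) are hypothetical column names, and the cast is meant to keep the comparison in the lossless NCHAR charset:

    -- compare at the primary level, ignoring case and accents;
    -- casting the CHAR operand to NVARCHAR first avoids a possibly lossy
    -- implicit NCHAR -> CHAR conversion before the comparison
    select compare(cast(CharColumn as nvarchar), NCharColumn,
                   'UCA(case=ignore;accent=ignore;punct=primary)') as cmp
    from MyTable;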
John, thanks for the further pointer. I had (as usual...) forgotten that all string literals are converted to the CHAR charset, which had made my previous comparisons pointless. If I use column values or build fitting Unicode strings via UNISTR and \u codes, I get the expected results that different locales treat different (accented) characters as separate characters or not. E.g. "ć" is a separate character in Polish, whereas it is a variant of "c" in German.

    begin
       declare str1 nvarchar(30) = unistr('Novi\u0107');
       declare str2 nvarchar(30) = unistr('Novic');
       select str1, str2,
          compare(str1, str2, 'UCA(Locale=pl;Accent=Ignore)') as pl_compare,
          compare(str1, str2, 'UCA(Locale=de;Accent=Ignore)') as de_compare;
    end;

This returns:

    str1    str2    pl_compare  de_compare
    Nović   Novic   1           0
(18 Aug '22, 07:25)
Volker Barth
Just to add: For the (very rare) occasions of lossy conversions in my case, I now use
So in my above sample, "Nović" will be converted to "Novic" instead of "Novi␚", as that's the best I can do here IMHO.
(19 Aug '22, 03:54)
Volker Barth