With a database using Windows-1252 as its default CHAR encoding and UTF-8 as its default NCHAR encoding, I'm facing values in an NCHAR column that are not valid in the CHAR charset. (These are probably Polish characters, e.g. "C with acute accent", small ć = U+0107; and yes, I'm aware that using UTF-8 as the CHAR encoding would avoid that problem, but that is not possible in this case.)

Note, I'm not dealing with conversion between client and database here; it's just database-internal conversion via csconvert(). I'm using a query such as

select MyColumn,
   cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252
from MyTable

How do I test whether a conversion via csconvert leads to losses?

I had thought the on_charset_conversion_failure option would be of help, but setting that option to a value other than the default "Ignore" returns SQLCODE -883:

"You attempted to set the on_charset_conversion_failure option to produce either a warning or error report. This setting is not available when the single-byte to single-byte converter is used."

Using csconvert to convert such NCHAR values to CHAR replaces the invalid characters with the substitution character \x1A, as expected. So is the recommended method to check for invalidity to test either

  • for a (known) SUB character via "where MyColumn1252 like '%\x1A%'"
  • or whether original and converted data are unequal, such as "where MyColumn != MyColumn1252"?
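For reference, both variants can be written as plain queries. A sketch, assuming the table and column names from above, and assuming the CHAR charset's substitution character is code point 26 (true for Windows-1252, but not necessarily for every charset):

```sql
-- Sketch: detect NCHAR values that do not survive conversion to the CHAR charset.

-- Variant 1: look for the SUB character (code 26) in the converted value.
select MyColumn
from MyTable
where locate(cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar),
             char(26)) > 0;

-- Variant 2: compare original and converted value for inequality.
select MyColumn
from MyTable
where MyColumn != cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar);
```

Note that Variant 2 compares an NCHAR value with a CHAR value, so an implicit charset conversion happens within the comparison itself.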

And is there a facility to get the "base character" in this case (here a latin "c") via some csconvert with different locales or collation tailoring?

asked 17 Aug, 13:14


Volker Barth

edited 17 Aug, 13:15


The doc says that on_charset_conversion_failure is for conversions between the client & the server, not for conversions occurring within the server. There's a lot of history there but ultimately I don't think there's a built-in, authoritative way to accomplish what you want. You can look for a substitution character but not all character sets have the same substitution character. Your second suggestion is probably best: round-tripping the string. If to_nchar( to_char( nstr ) ) != nstr then the conversion was lossy or the original string had an invalid encoding.
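The round trip can be expressed as a single predicate; a sketch, assuming the table and column names from the question:

```sql
-- Sketch: flag rows where NCHAR -> CHAR -> NCHAR does not reproduce the
-- original value, i.e. the CHAR conversion was lossy (or the original
-- string had an invalid encoding).
select MyColumn
from MyTable
where to_nchar(to_char(MyColumn)) != MyColumn;
```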

I don't think there is a way to get the "base character" either.


answered 17 Aug, 13:47


John Smirnios

Well, I had hoped that, say within the German locale, I could use collation tailoring options that treat a regular c and a c with acute accent as equal (as that would be the typical "real-life workaround"). Thanks for the quick clarification, and the hint with to_char().

(17 Aug, 15:54) Volker Barth

That's a different question from your original "How do I test whether a conversion via csconvert leads to losses?" or "how to get the base character" :)

If you want to compare strings at just the primary level, I think you can use COMPARE( x, y, 'UCA(case=ignore;accent=ignore;punct=primary)' )

You may need to be careful if x & y are different string types. If one is CHAR and the other NCHAR, I have a vague recollection that they might both be converted to CHAR first, which would be lossy; but maybe I'm confusing that with the inference rules for comparison: https://dcx.sap.com/index.html#sqla170/en/html/81f5aac76ce21014ab3ac8ae03b903ae.html. I would hope that they are both promoted to NCHAR.
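If that worry applies, one defensive option is to promote both operands to NCHAR explicitly before comparing, so no implicit (and possibly lossy) demotion to CHAR can occur. A sketch, with hypothetical operands x and y from a hypothetical table MyStrings:

```sql
-- Sketch: force both sides to NCHAR before the UCA comparison so neither
-- operand is implicitly converted to the lossy CHAR charset.
select compare(cast(x as nvarchar), cast(y as nvarchar),
               'UCA(case=ignore;accent=ignore;punct=primary)') as cmp
from MyStrings;
```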

(17 Aug, 16:16) John Smirnios

John, thanks for the further pointer. I had (as usual...) forgotten that all string literals are converted to the CHAR charset, which had made my previous comparisons pointless.

If I use column values or build fitting Unicode strings via UNISTR and \u codes, I get the expected results: locales differ in whether they treat accented characters as separate characters. E.g. "ć" is a separate character in Polish, whereas it is a variant of "c" in German.

begin
   declare str1 nvarchar(30) = unistr('Novi\u0107');
   declare str2 nvarchar(30) = unistr('Novic');
   select str1, str2,
      compare(str1, str2, 'UCA(Locale=pl;Accent=Ignore)') as pl_compare, 
      compare(str1, str2, 'UCA(Locale=de;Accent=Ignore)') as de_compare;
end;

returns: Nović | Novic | 1 | 0 – i.e. the Polish locale treats the strings as different (1), the German one as equal (0).

(18 Aug, 07:25) Volker Barth

Just to add: For the (very rare) occasions of lossy conversions in my case, I now use

  • the test "MyUnicodeColumn != to_nchar(to_char(MyUnicodeColumn))" to detect such cases, and
  • then a "brute-force" check for each substituted character: whether any of the CHAR "base characters" (A-ZÆÐÞa-zæðþ) is treated as equal to the Unicode character when compared via "UCA(Locale=de;Case=respect;Accent=ignore)" - if so, that base character is used instead of the substitute character.

So in my above sample, "Nović" will be converted to "Novic" instead of "Novi␚", as that's the best I can do here IMHO.
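The brute-force lookup described above could be wrapped in a small function. A sketch (the function name is made up; the candidate list is the one from the comment, and the tailoring is the German-locale one used there):

```sql
-- Sketch: map an accented Unicode character to its CHAR "base character" by
-- probing each candidate with an accent-insensitive UCA comparison; falls
-- back to the SUB character when nothing matches.
create or replace function BaseChar(in ch nchar(1))
returns nchar(1)
begin
    declare candidates nvarchar(60) =
        'ABCDEFGHIJKLMNOPQRSTUVWXYZÆÐÞabcdefghijklmnopqrstuvwxyzæðþ';
    declare i int = 1;
    while i <= length(candidates) loop
        if compare(ch, substr(candidates, i, 1),
                   'UCA(Locale=de;Case=respect;Accent=ignore)') = 0 then
            return substr(candidates, i, 1);
        end if;
        set i = i + 1;
    end loop;
    return char(26);    -- SUB character: no base character found
end;
```

With this, BaseChar(unistr('\u0107')) should yield 'c'.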

(19 Aug, 03:54) Volker Barth