We like to create a database that stores CHARs as "UTF16-LE", but literally ALL (!) collations listed with -l say they are incompatible with "UTF-16LE". So what is the magic trick...?

asked 05 Mar '15, 05:16

mkarg's gravatar image

mkarg
31224
accept rate: 0%


AFAIK, in SQL Anywhere neither CHAR nor NCHAR data types can use UTF-16 to store values. When using UNICODE you are restricted to UTF-8 (which will often need much lesser storage) - in contrast to say, MS SQL Server using UTF-16 for NCHAR data IIRC.

Cf. the qoute from the doc topic "International languages and character sets":

Unicode support SQL Anywhere supports Unicode as follows:

  • Client support for UTF-16 in SQL Anywhere client libraries for ODBC, OLE DB, ADO.NET, and JDBC
  • NCHAR data types for storing Unicode character data in UTF-8
  • CHAR data types can use UTF-8 encoding

(Note: I may be mixing - as usually - the terms "character sets"/"encodings"/"collations"...)

permanent link

answered 05 Mar '15, 07:47

Volker%20Barth's gravatar image

Volker Barth
31.3k312458674
accept rate: 32%

edited 05 Mar '15, 07:48

And remeber to define any char columns in a way like: varchar(5 char) to get 5 chars and not 5 bytes, which might be too small for 5 UTF-8 chars.

(05 Mar '15, 08:02) Martin

Volker is correct. The database does not store data in UTF16. It is supported only on the client side and as an encoding for LOAD, UNLOAD, CSCONVERT, etc. Use a UTF-8 encoding to store Unicode data in the database.

(05 Mar '15, 08:36) John Smirnios
Replies hidden

Why then does dbinit -le list so many UTF16 entries?

SQL Anywhere Initialization Utility Version 11.0.1.3158

SA, IANA, MIME, ICU, JAVA, ASE

ANSI_X3.4-1968, ANSI_X3.4-1968, US-ASCII, US-ASCII, ASCII,
Big5, Big5, Big5, windows-950-2000, Big5, cp950
Big5-HKSCS, , , ibm-1375_P100-2003, MS950_HKSCS, big5hk
EUC-KR, EUC-KR, EUC-KR, ibm-970_P110-1995, EUC-KR, eucksc
EUC-TW, , x-euc-tw, ibm-964_P110-1999, cp964, euccns
Extended_UNIX_Code_Packed_Format_for_Japanese, Extended_UNIX_Code_Packed_Format_
for_Japanese, EUC-JP, ibm-33722_P12A-1999, EUC-JP, eucjis
GB18030, GB18030, , gb18030, ,
GB2312, GB2312, GB2312, ibm-1383_P110-1999, , eucgb
GBK, GBK, , windows-936-2000, GBK, cp936
hp-roman8, hp-roman8, , ibm-1051_P100-1995, , roman8
IBM437, IBM437, , ibm-437_P100-1995, cp437, cp437
IBM850, IBM850, cp850, ibm-850_P100-1995, cp850, cp850
IBM852, IBM852, , ibm-852_P100-1995, cp852, cp852
IBM855, IBM855, , ibm-855_P100-1995, cp855, cp855
IBM856, , cp856, ibm-856_P100-1995, cp856, cp856
IBM857, IBM857, cp857, ibm-857_P100-1995, cp857, cp857
IBM860, IBM860, cp860, ibm-860_P100-1995, cp860, cp860
IBM861, IBM861, cp861, ibm-861_P100-1995, cp861, cp861
IBM862, IBM862, cp862, ibm-862_P100-1995, cp862, cp862
IBM863, IBM863, cp863, ibm-863_P100-1995, cp863, cp863
IBM864, IBM864, cp864, ibm-864_X110-1999, cp864, cp864
IBM865, IBM865, cp865, ibm-865_P100-1995, cp865, cp865
IBM866, IBM866, cp866, ibm-866_P100-1995, cp866, cp866
IBM869, IBM869, cp869, ibm-869_P100-1995, cp869, cp869
IBM949, , , windows-949-2000, windows-949, cp949
IBM950, , , ibm-950_P110-1999, cp950, cp950
ISO-8859-15, ISO-8859-15, ISO-8859-15, ibm-923_P100-1998, ISO-8859-15, iso15
ISO_8859-1:1987, ISO_8859-1:1987, ISO-8859-1, ISO-8859-1, ISO-8859-1, iso_1
ISO_8859-2:1987, ISO_8859-2:1987, iso-8859-2, ibm-912_P100-1995, iso-8859-2, iso
88592
ISO_8859-5:1988, ISO_8859-5:1988, iso-8859-5, ibm-915_P100-1995, iso-8859-5, iso
88595
ISO_8859-6:1987, ISO_8859-6:1987, iso-8859-6, ibm-1089_P100-1995, iso-8859-6, is
o88596
ISO_8859-7:1987, ISO_8859-7:1987, iso-8859-7, ibm-813_P100-1995, iso-8859-7, iso
88597
ISO_8859-8:1988, ISO_8859-8:1988, iso-8859-8, ibm-916_P100-1995, iso-8859-8, iso
88598
ISO_8859-9:1989, ISO_8859-9:1989, iso-8859-9, ibm-920_P100-1995, iso-8859-9, iso
88599
KOI8-R, KOI8-R, KOI8-R, ibm-878_P100-1996, KOI8-R, koi8
macintosh, macintosh, macintosh, macos-0_2-10.2, , mac
macos-29-10.2, , x-mac-centraleurroman, macos-29-10.2, , mac_ee
macos-6-10.2, , x-mac-greek, macos-6-10.2, , macgrk2
macos-7_3-10.2, , x-mac-cyrillic, macos-7_3-10.2, MacCyrillic, mac_cyr
macturk, , x-mac-turkish, macos-35-10.2, , macturk
Shift_JIS, Shift_JIS, Shift_JIS, ibm-943_P15A-2003, , sjis
TIS-620, TIS-620, , ibm-874_P100-1995, cp874, tis620
UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE,
UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, utf16
UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, utf8
windows-1250, windows-1250, , ibm-5346_P100-1998, windows-1250, cp1250
windows-1251, windows-1251, , ibm-5347_P100-1998, windows-1251, cp1251
windows-1252, windows-1252, , ibm-5348_P100-1997, windows-1252, cp1252
windows-1253, windows-1253, , ibm-5349_P100-1998, windows-1253, cp1253
windows-1254, windows-1254, , ibm-5350_P100-1998, windows-1254, cp1254
windows-1255, windows-1255, , ibm-9447_P100-2002, windows-1255, cp1255
windows-1256, windows-1256, , windows-1256-2000, windows-1256, cp1256
windows-1257, windows-1257, , ibm-9449_P100-2002, windows-1257, cp1257
windows-1258, windows-1258, , ibm-5354_P100-1998, windows-1258, cp1258
Windows-31J, Windows-31J, , ibm-943_P15A-2003, cp943c, cp932
windows-874-2000, , , windows-874-2000, windows-874, cp874

C:\projects\$templates\$SA_templates\run\dbinit>PAUSE
Press any key to continue . . .

'

(05 Mar '15, 08:40) Breck Carter

Actually we must store Chinese text AND English / German text in the same database due to the nature of the application. While this is possible with UTF-8, we expect that particularly the asian characters will bloat the .db file with significant overhead due to UTF-8's bias for latin characters. We further expect a performance drawback due to that. So we wanted to benchmark, what happens if UTF-16LE is use -- and asked dbinit -le whether it is supported (which says, it IS supported).

So the question is not "how to store unicode" but "how to use UTF-15"? :-)

(05 Mar '15, 09:19) mkarg
Replies hidden
1

And the answer is that you cannot use UTF-16 in a SQLAnywhere database as the encoding for the CHAR or NCHAR type.

(05 Mar '15, 09:34) John Smirnios

dbinit -le lists all encodings known to SQLAnywhere, not just the encodings that can be used in a database. GB18030 is not supported in databases either because you cannot determine the size of a character from the first byte; however, it is supported as a client charset, and in LOAD, UNLOAD, CSCONVERT.

(05 Mar '15, 09:39) John Smirnios

I guess the output from DBINIT -le(+) leaves some questions on its own... - obviously it shows more collations/encodings than are allowed for with DBINIT.

Why then does dbinit -le list so many UTF16 entries?

UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE,
UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, utf16

There are (just) two, one for "UTF-16BE" and one for "UTF-16LE". In my understanding, each row lists one encoding/collation name in SA naming standard and its appropriate name in 5 other categories (and the first row seems to miss an "ASE" name...).

FWIW: The output seems quite identical for current versions like 12.0.1.4216.

(05 Mar '15, 09:42) Volker Barth
1

As John has told: "The database does not store data in UTF16." That is the answer to your original question, isn't it?

(05 Mar '15, 09:45) Volker Barth

To summarize: SQL Anywhere cannot store any other unicode encoding besides UTF-8, right?

This effectively implies a substantial overhead in .db file size for mixed database which have to store Chinese and Western characters at the same time. Is there a solution available to reduce this overhead or is the effect so small that SAP simply does not see any need to prevent the effect?

(06 Mar '15, 03:33) mkarg
Replies hidden
3

Depending on the mix of Chinese and Western data, there might be storage overhead or there might be storage savings: Western data will take 1 byte per character and Chinese data will be 3 bytes per character. For UTF16, it is (almost) always 2 bytes character for both Western and Chinese. The storage overhead would be greatest when storing only Chinese characters. For a fairly even mix of Western and Chinese, you will break even.

Whether the effect is too small or too large depends on the application, the text stored, the absolute size of the database and, above all, your own particular requirements.

Would UTF16 storage be advantageous for storage and/or performance in some circumstances? Sometimes yes, sometimes no. And, many times, it doesn't matter if one approach outperforms the other when both solutions provide completely adequate space & time performance for the given situation.

I can't say whether SQLA will ever support the use UTF16 as a database storage encoding but it does not currently do so. However, it's entirely possible that SQLAnywhere as it exists today will still completely meet your needs.

(06 Mar '15, 06:51) John Smirnios

ASE does not have endian-specific UTF16-encodings: 'utf16' is always the same endian as the server (or presumably the endian of the client if the client requests 'utf16').

SQLA lets you specify a specific endian and 'utf16' is an alias for the platform endian. I believe you can even refer to 'UTF-16PE' (for Platform Endian) on SQLA as an alias. It's not listed above because it's not the canonical name for any of the categories listed in the chart. It's just an extra alias.

On a big-endian platform, FWIW, the 'utf16' ASE alias would appear on the first line.

(06 Mar '15, 06:59) John Smirnios

Assume my need is to store mostly Chinese with some additional English, are you still convinced there will not be a performance problem?

(06 Mar '15, 08:15) mkarg

Apparently I can't speak for John, but given his response, I don't see that he has stated to be "convinced there will not be a performance problem" in your case.

In my understanding, he has told "it depends" whether UTF-16 may be more performant or not (or correctly: need more or less storage space) than UTF-8, and that even a performance difference may not really be noticeable for users.

As SA does not offer both UTF-8 and UTF-16 for storage, in my understanding, you could attempt to test whether storage and performance do behave worse with UTF-8 compared to using only non-unicode character sets, i.e. to a single-byte charset for western and a fitting chinese character set for chinese data. IMHO, that would be the trade-offs to check for.


(I assume Chinese and Western data will belong to the same tables and columns, so it would not be possible to use CHAR for one and NCHAR for the other - or to use two different databases with different character sets?)

(06 Mar '15, 10:44) Volker Barth
2

Volker is correct: I did not say that there would not be a performance problem in some cases. Some cases will have better space performance when using UTF16 and some will have better space performance when using UTF8. Some cases will have better time performance when using UTF16 and some will have better time performance when using UTF8 (and those are not necessarily the same cases). Whether the performance is a "problem" is completely subjective. Often, either approach yields completely acceptable storage & time performance characteristics for the job at hand. Only you can measure the system and determine if it meets your needs.

(10 Mar '15, 12:32) John Smirnios

I think measuring this makes no sense anymore, as apparently SQL Anywhere cannot store UTF-16 anyways. Either the performance is good with UTF-8 or we have to use a different database product.

(13 Mar '15, 09:34) mkarg
showing 4 of 15 show all flat view
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×12

question asked: 05 Mar '15, 05:16

question was seen: 1,610 times

last updated: 13 Mar '15, 09:34