Preface: This has come up based on this question on UTF encodings...
Both "DBINIT -l" and "DBINIT -le" (and its variant "DBINIT -le+") can be used to list recommended collation sequences or available character set encodings, respectively.
Given the fact that the terms collations/encodings/code pages tend to get messed up (at least by me!), it would be nice if the different outputs of these commands could be explained a bit more.
Are my assumptions correct?
For example, here's the excerpt on 1252 and UTF-16 from all three commands run with 184.108.40.20616:
I guess these are just a few starting questions, I18N is such a huge issue... - I hope John will share his deep insight:)
I'm not sure what deep insight I can provide: your assumptions sound correct to me!
Collations/encodings/code pages all have distinct, precise meanings as do the related concepts of character set & glpyh. Wikipedia can give you the gory details but, in short, "collation" refers to how things are sorted, "encoding" refers to the binary representation of characters, "character set" refers to what logical characters can be represented, a "code page" is essentially a character set and its encoding together, and "glyph" refers to how it appears when printed (kind of like a font). For example, UTF8 is one encoding of the Unicode character set and it can be sorted with UCA (Unicode Collation Algorithm) collation. Often, encoding and character set go hand-in-hand as well. For example, windows-1252 defines both a set of characters and how they are encoded.
SQLA has historically been imprecise when using these terms and that can lead to confusion. For example, all "database collations" in SQLA other than UCA also imply an encoding and a character set. For example, the 1252LATIN1 collation has a well-defined sort order (collation) and the data is always stored in the windows-1252 encoding which can represent the Latin Alphabet character set (plus numerous other characters).
answered 10 Mar '15, 16:17
-ze is used to set the character set used to store character data (ie. encoding) and -z is used to set the collation (ie. for sorting and also used for comparisons generally).
-le and -l map to those concepts roughly as you would expect (and -le+ and -l+ only list a larger set of possible values though some may be historical). Those list possible collations and charsets support by the software and seems to include entries supported by both the server and client library support; not just those supported by the server.
As such, some encodings listed cannot be used as database character sets but client side setting for conversion/translation to the client. For example, any of the 16bit UTF transforms listed there are examples of that; ie. cannot be used to create the database. [I admit it would have been less confusing for dbinit to limit the values shown to just those supported in the creation of the database.]
Some of those can also be used with -zn for NCHAR data but again the only Unicode encoding supported is UTF-8 ... even for that.
For reference Code Pages refer to character sets or encodings, mostly historical for OEM and ANSI encodings, and are often tied to specific combinations or locale/language, one hardware/OS vendor Apple, Microsoft, IBM, HP, ... but they are still in use and will be for the foreseeable future but switching to Unicode leaps over that history; as you already know.
These do show up in the -le output. Expanded mapping to other equivalent names for same show up in the -le+ output [along with a few more additional lines]
For completeness only: I believe -l+ just show deprecated collations in addition to the supported ones listed for -l. You should probably avoid those.
I hope that helps add some clarity to what can be a difficult topic for many.
answered 10 Mar '15, 15:29
Nick Elson S...