AFAIK, SQL Anywhere can handle strings both as binary and character data and cast them accordingly. From the docs:

CHAR, NCHAR, and BINARY data types
SQL Anywhere internals do not distinguish between fixed- and varying-length string types (CHAR, NCHAR, or BINARY).

Sometimes it's handy to cast character data to binary in order to use binary comparison (which can also be done with other means, say collation tailoring and the like).

Question: How do the builtin string functions handle binary arguments?

Problem: The parameter declaration within the documentation does usually just use the word "string" and does not tell the datatype. Datatypes are only specified for return values and some functions return character and/or binary types, for example SUBSTRING:

Returns
* LONG BINARY
* LONG VARCHAR
* LONG NVARCHAR

Other functions do only return character data, such as REPLACE.

Returns
* LONG VARCHAR
* LONG NVARCHAR

Can I assume these functions will cast them to character data (and therefore use collation semantics and the like) when the function does list its parameters as (long) varchar/(long) nvarchar and does not list them as (long) varbinary?

E.g. REPLACE() seems to treat character and binary arguments identically (and follows character semantics, here in a case-insensitive manner because the database is case-insensitive):

select
  replace(cast('abc' as varchar),   cast('A' as varchar),   cast('X' as varchar))   as char_replace,
  replace(cast('abc' as varbinary), cast('A' as varbinary), cast('X' as varbinary)) as bin_replace

 -- returns (and therefore treats 'a' like 'A')
 -- char_replace,bin_replace
 -- Xbc,Xbc

asked 05 Jun '14, 11:25

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

edited 05 Jun '14, 11:30

1

a BinaryReplace() function would be really helpful.

(05 Jun '14, 12:09) Justin Willey

First let me say that I agree that the documentation is a little light on the details of the parameters that are accepted by each function and the behaviour of each function as it relates to the types of input parameters. I have discussed this with the documentation team in the past, and I spent several days last fall generating a method to semi-automatically compiling a report of all input combinations to all of the functions and the resultant outputs (but the information about the exact translation that occurs being inputs and outputs is not generated). The issue is complex and so far we have not found a simple way to represent the necessary information in the documentation.

The answer to the question "How do string functions work with binary data?" is "It depends on which function is being used". I.e. there is not a one-answer-fits-all-functions answer.

There are basically two classes of functions (hence two answers) though:

  • The first class of functions are those that explicitly handle binary data: these include functions like length(), len(), byte_length(), substr(), substring(), byte_substr(), csconvert(), to_char(), to_nchar(), hash(), encrypt(), decrypt(), isencrypted(), compress(), decompress(), base64_encode(), string(), patindex(), ... and maybe some more that I missed? These functions have explicit code that determines if the input is binary or a char type and then does something (that hopefully makes sense to you) based on that information.
  • The second class of functions are those that only handle [n][var]char data as input. These functions assume that the input is in database character collation (or database nchar collation) and therefore processes the [n]char inputs according to the corresponding collation rules (e.g. steps through the strings character-by-character where each 'character' is defined by the character set being used).

When binary input is given where a char type input is expected the binary data is automatically casted to char type (using an identity mapping - i.e. no conversion in the data is performed).

For example, providing binary inputs to REPLACE results in the binary data being treated as character data in the database CHAR character set and characters are replaced one-by-one accordingly. As such what you get out will depend entirely on the character set being used (and of course the inputs). E.g.

begin
   declare @bin long binary;
   declare @fr  long binary;
   declare @to  long binary;

  set @bin = hextobin( '0102030405060708090a0b0c0d0e0f' );

  set @fr = hextobin( '01' );
  set @to = hextobin( '40' );

  select bintohex( @bin ), bintohex( replace( @bin, @fr, @to ) )
end;

The output from the above is:

'0102030405060708090A0B0C0D0E0F','4002030405060708090A0B0C0D0E0F'
and is what I would have expected since I am using a single byte character set.

I would agree that a binary replace (e.g. byte_replace()) function could be useful... although I'm sure a crafty programmer might be able to come up with a solution using the existing capabilities. This is left as an exercise for the reader! ;-)

permanent link

answered 05 Jun '14, 15:03

Mark%20Culp's gravatar image

Mark Culp
24.9k10139297
accept rate: 41%

edited 05 Jun '14, 15:21

1

Many thanks for that comprehensive answer - it is very useful.

This is left as an exercise for the reader! ;-) indeed that's what we have done, but as a user-defined function it's gruesomely slow compared to the built-in equivalent. In occasional usage that's not a problem, but when you are converting millions of rows of data a lot of coffee gets drunk!

(05 Jun '14, 18:33) Justin Willey
Replies hidden
1

Thanks for that thorough information - maybe that could be used to add a separate topic to the docs, linked to "string functions"?

As a side note: So for builtin functions, SA has a concept of overloading/polymorphism whereas user-defined functions and procedures do not support that? Otherwise it would be difficult to understand how the same function (name) can handle data of different types...

(06 Jun '14, 03:43) Volker Barth
Replies hidden

To help understand the business case, for what purpose are you doing binary data manipulation in the server and how often do you do it?

(06 Jun '14, 07:47) Mark Culp

Regarding your side note: Correct, most builtin functions have multiple binding rules and the server runs through the list and picks the first one that fits the input types and then calls the associated routine for that rule.

(06 Jun '14, 07:52) Mark Culp

This normally happens when we are trying to sort out data imported from other systems where there are character translation issues. eg an accented character appears as something else, or the GBP symbol has been replaced with something else.

The problem with using Replace() is that in many cases the collation settings will mean that it can't be used eg (a real example):

update person set notes=replace(notes,'ú','£')

can't be used because every u in the field will also be changed as the collation sequence helpfully reckons that 'ú' is the same as 'u'. One can of course code a UDF to go through every string character by character - which is fine for updating a single value, but en masse it's very slow.

An option on Replace() to ignore the collation would do the trick nicely.

(06 Jun '14, 08:00) Justin Willey
1

In my case, it's usually been an (occasional) attempt to do a binary replace, i.e. when the character semantics would find undesired matches (differening in case/accents (German umlauts) and the like). That has raised the current question...

Whereas it's - well, not easy but doable - to use different kinds of comparisons via collation tailoring and the like, string manipulation with different collations seems hard.

(06 Jun '14, 08:01) Volker Barth
1

Just on the frequency question - it's not something we would expect to use as part of regular production code - it's a bit too scary for that. More something used in migrations, clean-up scripts, imports etc.

(06 Jun '14, 10:40) Justin Willey
showing 2 of 7 show all flat view
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×30
×22
×17
×13

question asked: 05 Jun '14, 11:25

question was seen: 4,177 times

last updated: 06 Jun '14, 10:40