Can we get XML encoding to better escape characters?

Not sure whether this is the right place to say something, but there's been a number of instances where I've had generated XML fail due to illegal characters in the content. The most recent instance was from a SOAP server rejecting the request because the implicit XML encoding that is done with XMLELEMENT() seems to miss a number of required character conversions.

I threw together the following function as a workaround, but I'm glad that this kind of operation doesn't tend to be time-critical or else the level of required complexity is a bit overwhelming.

Otherwise this quickly falls into the category again of "should I really need to do this?"

create or replace function dba.xml_encode(pInput long nvarchar) returns xml
begin // assumes nvarchar is defined as utf-8
  declare @path long varchar;
  declare @nextPos integer;
  declare @intCode smallint;
  declare @nextCode bigint;
  declare @numPoints smallint;
  declare @subPos integer;

  // note: i am making up the "&pct;" entity to move %s out of the way, patched up in the return at the end
  set @path = replace(replace(replace(replace(replace(replace(replace(http_encode(pInput),
    '%26','&amp;'),'%3C','&lt;'),'%3E','&gt;'),'%27','&apos;'),'%22','&quot;'),
    '%1A','&#65533;'),'%25','&pct;');

  // first pass: legitimate character encodings
  set @nextPos = locate(@path, '%');
  while @nextPos <> 0 loop
    set @intCode = hextoint(substring(@path, @nextPos+1, 2));
    set @numPoints = 1;
    set @nextCode = @intCode;
    case
      when @intCode >= 0xF0 then
        set @numPoints = 4;
        set @nextCode = @intCode & 0x07;
      when @intCode >= 0xE0 then
        set @numPoints = 3;
        set @nextCode = @intCode & 0x0F;
      when @intCode >= 0xD0 then
        set @numPoints = 2;
        set @nextCode = @intCode & 0x1F;
      when @intCode >= 0x80 then
        set @numPoints = 0;   // error
    end case;
    case @numPoints
      when 0 then
        set @nextPos = @nextPos + 1; // skip
      when 1 then
        set @path = replace(@path, substring(@path, @nextPos, 3), char(@intCode));
      else
        set @subPos = @nextPos + 3;
        subLoop: while @numPoints > 1 and substring(@path, @subPos, 1) = '%' loop
          set @intCode = hextoint(substring(@path, @subPos+1, 2));
          if @intCode & 0xC0 <> 0x80 then
            leave subLoop;
          end if;
          set @nextCode = (@nextCode * 64) | (@intCode & 0x3F);
          set @numPoints = @numPoints - 1;
          set @subPos = @subPos + 3;
        end loop;
        if @numPoints > 1 then
          set @nextPos = @subPos + 1; // error
        else
          set @path = replace(@path, substring(@path, @nextPos, @subPos-@nextPos),
            '&#' || @nextCode || ';');
        end if;
    end case;
    set @nextPos = locate(@path, '%', @nextPos);
  end loop;

  // second pass: corrupted character encodings
  set @nextPos = locate(@path, '%');
  while @nextPos <> 0 loop
    set @intCode = hextoint(substring(@path, @nextPos+1, 2));
    set @path = replace(@path, substring(@path, @nextPos, 3), '&#' || @intCode || ';');
    set @nextPos = locate(@path, '%', @nextPos);
  end loop;

  return replace(@path,'&pct;','%');
end;

xml product-suggestion encoding

asked 24 May '12, 14:59

Erik Anderson
436●8●13●23
accept rate: 15%

flat view

One Answer:

active answers oldest answers newest answers popular answers

Note that most unicode characters are allowed to be in an XML document without being escaped (with a few exceptions) - see http://en.wikipedia.org/wiki/XML - and therefore I do not understand the purpose of your function?

Could it be that the SOAP server that is consuming the XML that SA is generating is not compliant?

permanent link

answered 24 May '12, 16:55

Mark Culp
24.9k●10●141●297
accept rate: 41%

In the most recent case the SOAP server (Amazon AWS this time around) was complaining ("invalid xml character") about an en-dash (U+2013, latin1 150) being in the document, even with me sticking encoding="utf-8" on the submitting document (although I don't remember whether XML is defined as "long varchar" or "long nvarchar" anymore). Once it was translated to – [edit:escaped the entity] the request went through and I was able to verify that the character matched. Note that I'm not using the internal SOAP function calls, although once it's in XML variables it shouldn't really make a difference, should it?

I have in the past had a lot of problems with SUB (ascii 26) which appears to be illegal in any form in an XML document (no matter how you escape it) and is produced when you attempt to store a character in a field in a charset that does not support the character (I got this a lot in early tests trying to shove chinese into long nvarchar while the pipe was not fully UNICODE-safe). MSXML for instance will stop dead as soon as it sees this character anywhere in the input. I have been translating this to U+FFFD which seems to have a similar meaning but is passed without comment.

(24 May '12, 17:46) Erik Anderson

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

product-suggestion ×113
xml ×29
encoding ×9

question asked: 24 May '12, 14:59

question was seen: 4,232 times

last updated: 24 May '12, 17:47

SAP SQL Anywhere

Can we get XML encoding to better escape characters?

Follow this question

Related questions