The forum will experience an outage sometime between February 10 at 7:00pm EST and February 12 at 11:59 EST for installation of security updates. The actual time and duration of the outage are unknown but attempts will be made to minimize the downtime. We apologize for any inconvenience.

Not sure whether this is the right place to say something, but there's been a number of instances where I've had generated XML fail due to illegal characters in the content. The most recent instance was from a SOAP server rejecting the request because the implicit XML encoding that is done with XMLELEMENT() seems to miss a number of required character conversions.

I threw together the following function as a workaround, but I'm glad that this kind of operation doesn't tend to be time-critical or else the level of required complexity is a bit overwhelming.

Otherwise this quickly falls into the category again of "should I really need to do this?"

create or replace function dba.xml_encode(pInput long nvarchar) returns xml
begin // assumes nvarchar is defined as utf-8
  declare @path long varchar;
  declare @nextPos integer;
  declare @intCode smallint;
  declare @nextCode bigint;
  declare @numPoints smallint;
  declare @subPos integer;

  // note: i am making up the "&pct;" entity to move %s out of the way, patched up in the return at the end
  set @path = replace(replace(replace(replace(replace(replace(replace(http_encode(pInput),
    '%26','&'),'%3C','<'),'%3E','>'),'%27','''),'%22','"'),
    '%1A','�'),'%25','&pct;');

  // first pass: legitimate character encodings
  set @nextPos = locate(@path, '%');
  while @nextPos <> 0 loop
    set @intCode = hextoint(substring(@path, @nextPos+1, 2));
    set @numPoints = 1;
    set @nextCode = @intCode;
    case
      when @intCode >= 0xF0 then
        set @numPoints = 4;
        set @nextCode = @intCode & 0x07;
      when @intCode >= 0xE0 then
        set @numPoints = 3;
        set @nextCode = @intCode & 0x0F;
      when @intCode >= 0xD0 then
        set @numPoints = 2;
        set @nextCode = @intCode & 0x1F;
      when @intCode >= 0x80 then
        set @numPoints = 0;   // error
    end case;
    case @numPoints
      when 0 then
        set @nextPos = @nextPos + 1; // skip
      when 1 then
        set @path = replace(@path, substring(@path, @nextPos, 3), char(@intCode));
      else
        set @subPos = @nextPos + 3;
        subLoop: while @numPoints > 1 and substring(@path, @subPos, 1) = '%' loop
          set @intCode = hextoint(substring(@path, @subPos+1, 2));
          if @intCode & 0xC0 <> 0x80 then
            leave subLoop;
          end if;
          set @nextCode = (@nextCode * 64) | (@intCode & 0x3F);
          set @numPoints = @numPoints - 1;
          set @subPos = @subPos + 3;
        end loop;
        if @numPoints > 1 then
          set @nextPos = @subPos + 1; // error
        else
          set @path = replace(@path, substring(@path, @nextPos, @subPos-@nextPos),
            '&#' || @nextCode || ';');
        end if;
    end case;
    set @nextPos = locate(@path, '%', @nextPos);
  end loop;

  // second pass: corrupted character encodings
  set @nextPos = locate(@path, '%');
  while @nextPos <> 0 loop
    set @intCode = hextoint(substring(@path, @nextPos+1, 2));
    set @path = replace(@path, substring(@path, @nextPos, 3), '&#' || @intCode || ';');
    set @nextPos = locate(@path, '%', @nextPos);
  end loop;

  return replace(@path,'&pct;','%');
end;

asked 24 May '12, 14:59

Erik%20Anderson's gravatar image

Erik Anderson
42181223
accept rate: 15%


Note that most unicode characters are allowed to be in an XML document without being escaped (with a few exceptions) - see http://en.wikipedia.org/wiki/XML - and therefore I do not understand the purpose of your function?

Could it be that the SOAP server that is consuming the XML that SA is generating is not compliant?

permanent link

answered 24 May '12, 16:55

Mark%20Culp's gravatar image

Mark Culp
22.3k9129262
accept rate: 40%

In the most recent case the SOAP server (Amazon AWS this time around) was complaining ("invalid xml character") about an en-dash (U+2013, latin1 150) being in the document, even with me sticking encoding="utf-8" on the submitting document (although I don't remember whether XML is defined as "long varchar" or "long nvarchar" anymore). Once it was translated to &#8211; [edit:escaped the entity] the request went through and I was able to verify that the character matched. Note that I'm not using the internal SOAP function calls, although once it's in XML variables it shouldn't really make a difference, should it?

I have in the past had a lot of problems with SUB (ascii 26) which appears to be illegal in any form in an XML document (no matter how you escape it) and is produced when you attempt to store a character in a field in a charset that does not support the character (I got this a lot in early tests trying to shove chinese into long nvarchar while the pipe was not fully UNICODE-safe). MSXML for instance will stop dead as soon as it sees this character anywhere in the input. I have been translating this to U+FFFD which seems to have a similar meaning but is passed without comment.

(24 May '12, 17:46) Erik Anderson
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×106
×18
×8

question asked: 24 May '12, 14:59

question was seen: 2,189 times

last updated: 24 May '12, 17:47