Not sure whether this is the right place to say something, but there's been a number of instances where I've had generated XML fail due to illegal characters in the content. The most recent instance was from a SOAP server rejecting the request because the implicit XML encoding that is done with XMLELEMENT() seems to miss a number of required character conversions. I threw together the following function as a workaround, but I'm glad that this kind of operation doesn't tend to be time-critical or else the level of required complexity is a bit overwhelming. Otherwise this quickly falls into the category again of "should I really need to do this?" create or replace function dba.xml_encode(pInput long nvarchar) returns xml begin // assumes nvarchar is defined as utf-8 declare @path long varchar; declare @nextPos integer; declare @intCode smallint; declare @nextCode bigint; declare @numPoints smallint; declare @subPos integer; // note: i am making up the "&pct;" entity to move %s out of the way, patched up in the return at the end set @path = replace(replace(replace(replace(replace(replace(replace(http_encode(pInput), '%26','&'),'%3C','<'),'%3E','>'),'%27','''),'%22','"'), '%1A','�'),'%25','&pct;'); // first pass: legitimate character encodings set @nextPos = locate(@path, '%'); while @nextPos <> 0 loop set @intCode = hextoint(substring(@path, @nextPos+1, 2)); set @numPoints = 1; set @nextCode = @intCode; case when @intCode >= 0xF0 then set @numPoints = 4; set @nextCode = @intCode & 0x07; when @intCode >= 0xE0 then set @numPoints = 3; set @nextCode = @intCode & 0x0F; when @intCode >= 0xD0 then set @numPoints = 2; set @nextCode = @intCode & 0x1F; when @intCode >= 0x80 then set @numPoints = 0; // error end case; case @numPoints when 0 then set @nextPos = @nextPos + 1; // skip when 1 then set @path = replace(@path, substring(@path, @nextPos, 3), char(@intCode)); else set @subPos = @nextPos + 3; subLoop: while @numPoints > 1 and substring(@path, @subPos, 1) = '%' loop set @intCode = hextoint(substring(@path, @subPos+1, 2)); if @intCode & 0xC0 <> 0x80 then leave subLoop; end if; set @nextCode = (@nextCode * 64) | (@intCode & 0x3F); set @numPoints = @numPoints - 1; set @subPos = @subPos + 3; end loop; if @numPoints > 1 then set @nextPos = @subPos + 1; // error else set @path = replace(@path, substring(@path, @nextPos, @subPos-@nextPos), '&#' || @nextCode || ';'); end if; end case; set @nextPos = locate(@path, '%', @nextPos); end loop; // second pass: corrupted character encodings set @nextPos = locate(@path, '%'); while @nextPos <> 0 loop set @intCode = hextoint(substring(@path, @nextPos+1, 2)); set @path = replace(@path, substring(@path, @nextPos, 3), '&#' || @intCode || ';'); set @nextPos = locate(@path, '%', @nextPos); end loop; return replace(@path,'&pct;','%'); end; |
Note that most unicode characters are allowed to be in an XML document without being escaped (with a few exceptions) - see http://en.wikipedia.org/wiki/XML - and therefore I do not understand the purpose of your function? Could it be that the SOAP server that is consuming the XML that SA is generating is not compliant? In the most recent case the SOAP server (Amazon AWS this time around) was complaining ("invalid xml character") about an en-dash (U+2013, latin1 150) being in the document, even with me sticking encoding="utf-8" on the submitting document (although I don't remember whether XML is defined as "long varchar" or "long nvarchar" anymore). Once it was translated to – [edit:escaped the entity] the request went through and I was able to verify that the character matched. Note that I'm not using the internal SOAP function calls, although once it's in XML variables it shouldn't really make a difference, should it? I have in the past had a lot of problems with SUB (ascii 26) which appears to be illegal in any form in an XML document (no matter how you escape it) and is produced when you attempt to store a character in a field in a charset that does not support the character (I got this a lot in early tests trying to shove chinese into long nvarchar while the pipe was not fully UNICODE-safe). MSXML for instance will stop dead as soon as it sees this character anywhere in the input. I have been translating this to U+FFFD which seems to have a similar meaning but is passed without comment.
(24 May '12, 17:46)
Erik Anderson
|