Please be aware that the content in SAP SQL Anywhere Forum will be migrated to the SAP Community in June and this forum will be retired.

Is there any good way to remove all HTML tags in a string with sql (preferably without function) that looks like this.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        <p class="csC583D0C8"><span class="csCABEF5ED">Test row 1<br/>Test row2</span></p></body>
</html>

Results should be like this

Test row 1
Test row 2

asked 16 Oct '20, 07:30

Rolle's gravatar image

Rolle
558495161
accept rate: 0%

edited 16 Oct '20, 09:03

Volker%20Barth's gravatar image

Volker Barth
40.2k361550822


AFAIK, as XHTML documents are using XML, you might be better off by trying the builtin OpenXML operator to access the desired document portions. - Note, while I have often used OpenXML for XML documents, I have not tried myself with XHTML.

Here's a starting sample:

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select * -- list all nodes -- from OpenXML(varDoc, '//*') -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') order by ID;

This return the following row:

ID;Node;Name;Text;XmlText;Class
606;'Test row 1';'span';'Test row 1';'Test row 1
Test row2
';'csCABEF5ED'

Apparently, in contrast to normal XML documents where your data would be easily accessable via well-named attributes or elements, here the XML contents of the "span element" would need to be extracted lateron, say via regex_substring and replace, as Justin has suggested. But I hope you get the idea.

FWIW, to list the whole document, simply use the first OpenXML call (i.e. un-comment it).

permanent link

answered 16 Oct '20, 08:11

Volker%20Barth's gravatar image

Volker Barth
40.2k361550822
accept rate: 34%

edited 16 Oct '20, 08:39

Here's a way to extract the text via further regexp_substr and sa_split_list - as so often, I'm impressed by the builtin tools...

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select XmlText, -- search for text within the "span" tag regexp_substr(XmlText, '(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])') as TextExtract, line_num, row_value -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') -- split TextExtract based on br tag outer apply sa_split_list(TextExtract, '
') order by ID, line_num;

returns

XmlText;TextExtract;line_num;row_value
'Test row 1
Test row2
';'Test row 1
Test row2';1;'Test row 1' 'Test row 1
Test row2
';'Test row 1
Test row2';2;'Test row2'
(16 Oct '20, 09:00) Volker Barth

If you can be certain about how your html is formatted, then you can probably do something with REGEXP_SUBSTR() and a basic expression like <[^>]*>. However, even in your example you will first have to process the line breaks - and find a way of removing the CSS info that is not inside a tag.

Actually parsing html with regular expressions is notoriously difficult, and is extensively discussed on StackOverflow and elsewhere. Any approach based on regular expressions is very much a "fudge" and likely to be broken by any unexpected changes to the html document - the right way is to use a proper library.

If you don't know it - regex101.com is a useful resource.

permanent link

answered 16 Oct '20, 07:59

Justin%20Willey's gravatar image

Justin Willey
7.6k137179249
accept rate: 20%

edited 16 Oct '20, 08:03

1

No mention of regular expressions is complete without input from xkcd :)...

(27 Oct '20, 12:08) Breck Carter
Replies hidden
Comment Text Removed

the right way is to use a proper library

Do browsers use "a proper library"? I suspect not, because browsers successfully handle ALL SORTS of improper HTML...

...but that may not matter in this case; if all you have is proper input then you must find some way to deal with it :)

(27 Oct '20, 12:16) Breck Carter
Replies hidden

The great mystery I find with regular expressions, is having finally got one to work properly and think I really understand what is going on, I can come back a day later and not have the slightest idea how it works!

(27 Oct '20, 12:23) Justin Willey

Well, I'd always thought this would be the one and only reference :)

Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee!

(27 Oct '20, 13:09) Volker Barth

Gosh, you'd consider this uncomprehensible?

'(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])'

Unfortunately, it was written almost two weeks ago, if I could only ask the author. :)

(27 Oct '20, 13:12) Volker Barth
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×22
×3

question asked: 16 Oct '20, 07:30

question was seen: 3,427 times

last updated: 27 Oct '20, 13:12