Remove all HTML tags in a string

Is there any good way to remove all HTML tags in a string with sql (preferably without function) that looks like this.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        <p class="csC583D0C8"><span class="csCABEF5ED">Test row 1<br/>Test row2</span></p></body>
</html>

Results should be like this

Test row 1
Test row 2

regexp openxml

asked 16 Oct '20, 07:30

Rolle
558●49●51●61
accept rate: 0%

edited 16 Oct '20, 09:03

Volker Barth
40.2k●361●550●822

flat view

2 Answers:

active answers oldest answers newest answers popular answers

AFAIK, as XHTML documents are using XML, you might be better off by trying the builtin OpenXML operator to access the desired document portions. - Note, while I have often used OpenXML for XML documents, I have not tried myself with XHTML.

Here's a starting sample:

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        Test row 1
Test row2</body>
</html>';
select *
-- list all nodes
-- from OpenXML(varDoc, '//*')
-- filtered on the span node within the body's paragraphs
from OpenXML(varDoc, '//*:body/*:p/*:span')
   with (ID int '@mp:id',
         Node long varchar 'node()',
         Name long varchar '@mp:localname',
         Text long varchar 'text()',
         XmlText long varchar '@mp:xmltext',
         Class long varchar '@class')
order by ID;

This return the following row:

ID;Node;Name;Text;XmlText;Class
606;'Test row 1';'span';'Test row 1';'Test row 1
Test row2';'csCABEF5ED'

Apparently, in contrast to normal XML documents where your data would be easily accessable via well-named attributes or elements, here the XML contents of the "span element" would need to be extracted lateron, say via regex_substring and replace, as Justin has suggested. But I hope you get the idea.

FWIW, to list the whole document, simply use the first OpenXML call (i.e. un-comment it).

permanent link

answered 16 Oct '20, 08:11

Volker Barth
40.2k●361●550●822
accept rate: 34%

edited 16 Oct '20, 08:39

Here's a way to extract the text via further regexp_substr and sa_split_list - as so often, I'm impressed by the builtin tools...

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        Test row 1
Test row2</body>
</html>';
select XmlText,
   -- search for text within the "span" tag
   regexp_substr(XmlText, '(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])') as TextExtract,
   line_num, row_value
-- filtered on the span node within the body's paragraphs
from OpenXML(varDoc, '//*:body/*:p/*:span')
   with (ID int '@mp:id',
      Node long varchar 'node()',
      Name long varchar '@mp:localname',
      Text long varchar 'text()',
      XmlText long varchar '@mp:xmltext',
      Class long varchar '@class')
   -- split TextExtract based on br tag
   outer apply sa_split_list(TextExtract, '
')
order by ID, line_num;

returns

XmlText;TextExtract;line_num;row_value
'Test row 1
Test row2';'Test row 1
Test row2';1;'Test row 1'
'Test row 1
Test row2';'Test row 1
Test row2';2;'Test row2'

(16 Oct '20, 09:00) Volker Barth

flat view

If you can be certain about how your html is formatted, then you can probably do something with REGEXP_SUBSTR() and a basic expression like <[^>]*>. However, even in your example you will first have to process the line breaks - and find a way of removing the CSS info that is not inside a tag.

Actually parsing html with regular expressions is notoriously difficult, and is extensively discussed on StackOverflow and elsewhere. Any approach based on regular expressions is very much a "fudge" and likely to be broken by any unexpected changes to the html document - the right way is to use a proper library.

If you don't know it - regex101.com is a useful resource.

permanent link

answered 16 Oct '20, 07:59

Justin Willey
7.6k●137●179●249
accept rate: 20%

edited 16 Oct '20, 08:03

No mention of regular expressions is complete without input from xkcd :)...

(27 Oct '20, 12:08) Breck Carter

Replies hidden

Comment Text Removed

the right way is to use a proper library

Do browsers use "a proper library"? I suspect not, because browsers successfully handle ALL SORTS of improper HTML...

...but that may not matter in this case; if all you have is proper input then you must find some way to deal with it :)

(27 Oct '20, 12:16) Breck Carter

Replies hidden

The great mystery I find with regular expressions, is having finally got one to work properly and think I really understand what is going on, I can come back a day later and not have the slightest idea how it works!

(27 Oct '20, 12:23) Justin Willey

Well, I'd always thought this would be the one and only reference :)

Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee!

(27 Oct '20, 13:09) Volker Barth

Gosh, you'd consider this uncomprehensible?

'(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])'

Unfortunately, it was written almost two weeks ago, if I could only ask the author. :)

(27 Oct '20, 13:12) Volker Barth

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

openxml ×22
regexp ×3

question asked: 16 Oct '20, 07:30

question was seen: 3,427 times

last updated: 27 Oct '20, 13:12

SAP SQL Anywhere

Remove all HTML tags in a string

Follow this question

Related questions