Is there any good way to remove all HTML tags in a string with sql (preferably without function) that looks like this.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        <p class="csC583D0C8"><span class="csCABEF5ED">Test row 1<br/>Test row2</span></p></body>
</html>

Results should be like this

Test row 1
Test row 2

asked 16 Oct '20, 07:30

Rolle's gravatar image

Rolle
558495161
accept rate: 0%

edited 16 Oct '20, 09:03

Volker%20Barth's gravatar image

Volker Barth
39.7k357545814


AFAIK, as XHTML documents are using XML, you might be better off by trying the builtin OpenXML operator to access the desired document portions. - Note, while I have often used OpenXML for XML documents, I have not tried myself with XHTML.

Here's a starting sample:

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select * -- list all nodes -- from OpenXML(varDoc, '//*') -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') order by ID;

This return the following row:

ID;Node;Name;Text;XmlText;Class
606;'Test row 1';'span';'Test row 1';'Test row 1
Test row2
';'csCABEF5ED'

Apparently, in contrast to normal XML documents where your data would be easily accessable via well-named attributes or elements, here the XML contents of the "span element" would need to be extracted lateron, say via regex_substring and replace, as Justin has suggested. But I hope you get the idea.

FWIW, to list the whole document, simply use the first OpenXML call (i.e. un-comment it).

permanent link

answered 16 Oct '20, 08:11

Volker%20Barth's gravatar image

Volker Barth
39.7k357545814
accept rate: 34%

edited 16 Oct '20, 08:39

Here's a way to extract the text via further regexp_substr and sa_split_list - as so often, I'm impressed by the builtin tools...

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select XmlText, -- search for text within the "span" tag regexp_substr(XmlText, '(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])') as TextExtract, line_num, row_value -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') -- split TextExtract based on br tag outer apply sa_split_list(TextExtract, '
') order by ID, line_num;

returns

XmlText;TextExtract;line_num;row_value
'Test row 1
Test row2
';'Test row 1
Test row2';1;'Test row 1' 'Test row 1
Test row2
';'Test row 1
Test row2';2;'Test row2'
(16 Oct '20, 09:00) Volker Barth

If you can be certain about how your html is formatted, then you can probably do something with REGEXP_SUBSTR() and a basic expression like <[^>]*>. However, even in your example you will first have to process the line breaks - and find a way of removing the CSS info that is not inside a tag.

Actually parsing html with regular expressions is notoriously difficult, and is extensively discussed on StackOverflow and elsewhere. Any approach based on regular expressions is very much a "fudge" and likely to be broken by any unexpected changes to the html document - the right way is to use a proper library.

If you don't know it - regex101.com is a useful resource.

permanent link

answered 16 Oct '20, 07:59

Justin%20Willey's gravatar image

Justin Willey
7.6k136179248
accept rate: 19%

edited 16 Oct '20, 08:03

1

No mention of regular expressions is complete without input from xkcd :)...

(27 Oct '20, 12:08) Breck Carter
Replies hidden
Comment Text Removed

the right way is to use a proper library

Do browsers use "a proper library"? I suspect not, because browsers successfully handle ALL SORTS of improper HTML...

...but that may not matter in this case; if all you have is proper input then you must find some way to deal with it :)

(27 Oct '20, 12:16) Breck Carter
Replies hidden

The great mystery I find with regular expressions, is having finally got one to work properly and think I really understand what is going on, I can come back a day later and not have the slightest idea how it works!

(27 Oct '20, 12:23) Justin Willey

Well, I'd always thought this would be the one and only reference :)

Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee!

(27 Oct '20, 13:09) Volker Barth

Gosh, you'd consider this uncomprehensible?

'(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])'

Unfortunately, it was written almost two weeks ago, if I could only ask the author. :)

(27 Oct '20, 13:12) Volker Barth
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×21
×3

question asked: 16 Oct '20, 07:30

question was seen: 3,106 times

last updated: 27 Oct '20, 13:12