Is there any good way to remove all HTML tags in a string with sql (preferably without function) that looks like this.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        <p class="csC583D0C8"><span class="csCABEF5ED">Test row 1<br/>Test row2</span></p></body>
</html>

Results should be like this

Test row 1
Test row 2

asked 16 Oct, 07:30

Rolle's gravatar image

Rolle
558434960
accept rate: 0%

edited 16 Oct, 09:03

Volker%20Barth's gravatar image

Volker Barth
36.9k343505765


AFAIK, as XHTML documents are using XML, you might be better off by trying the builtin OpenXML operator to access the desired document portions. - Note, while I have often used OpenXML for XML documents, I have not tried myself with XHTML.

Here's a starting sample:

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select * -- list all nodes -- from OpenXML(varDoc, '//*') -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') order by ID;

This return the following row:

ID;Node;Name;Text;XmlText;Class
606;'Test row 1';'span';'Test row 1';'Test row 1
Test row2
';'csCABEF5ED'

Apparently, in contrast to normal XML documents where your data would be easily accessable via well-named attributes or elements, here the XML contents of the "span element" would need to be extracted lateron, say via regex_substring and replace, as Justin has suggested. But I hope you get the idea.

FWIW, to list the whole document, simply use the first OpenXML call (i.e. un-comment it).

permanent link

answered 16 Oct, 08:11

Volker%20Barth's gravatar image

Volker Barth
36.9k343505765
accept rate: 34%

edited 16 Oct, 08:39

Here's a way to extract the text via further regexp_substr and sa_split_list - as so often, I'm impressed by the builtin tools...

create or replace variable varDoc xml;
set varDoc = 
'
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>
        </title>
        <style type="text/css">
            .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt}
            .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;}
        </style>
    </head>
    <body>
        

Test row 1
Test row2

</body> </html>'; select XmlText, -- search for text within the "span" tag regexp_substr(XmlText, '(?<=[<]span class="' || Class || '"[>]).*(?=[<]/span[>])') as TextExtract, line_num, row_value -- filtered on the span node within the body's paragraphs from OpenXML(varDoc, '//*:body/*:p/*:span') with (ID int '@mp:id', Node long varchar 'node()', Name long varchar '@mp:localname', Text long varchar 'text()', XmlText long varchar '@mp:xmltext', Class long varchar '@class') -- split TextExtract based on br tag outer apply sa_split_list(TextExtract, '
') order by ID, line_num;

returns

XmlText;TextExtract;line_num;row_value
'Test row 1
Test row2
';'Test row 1
Test row2';1;'Test row 1' 'Test row 1
Test row2
';'Test row 1
Test row2';2;'Test row2'
(16 Oct, 09:00) Volker Barth

If you can be certain about how your html is formatted, then you can probably do something with REGEXP_SUBSTR() and a basic expression like <[^>]*>. However, even in your example you will first have to process the line breaks - and find a way of removing the CSS info that is not inside a tag.

Actually parsing html with regular expressions is notoriously difficult, and is extensively discussed on StackOverflow and elsewhere. Any approach based on regular expressions is very much a "fudge" and likely to be broken by any unexpected changes to the html document - the right way is to use a proper library.

If you don't know it - regex101.com is a useful resource.

permanent link

answered 16 Oct, 07:59

Justin%20Willey's gravatar image

Justin Willey
7.3k128165240
accept rate: 20%

edited 16 Oct, 08:03

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×18
×3

question asked: 16 Oct, 07:30

question was seen: 75 times

last updated: 16 Oct, 09:03