Is there any good way to remove all HTML tags in a string with sql (preferably without function) that looks like this. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title> </title> <style type="text/css"> .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt} .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;} </style> </head> <body> <p class="csC583D0C8"><span class="csCABEF5ED">Test row 1<br/>Test row2</span></p></body> </html> Results should be like this Test row 1 Test row 2 |
AFAIK, as XHTML documents are using XML, you might be better off by trying the builtin OpenXML operator to access the desired document portions. - Note, while I have often used OpenXML for XML documents, I have not tried myself with XHTML. Here's a starting sample: create or replace variable varDoc xml; set varDoc = ' <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title> </title> <style type="text/css"> .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt} .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;} </style> </head> <body> This return the following row: ID;Node;Name;Text;XmlText;Class
606;'Test row 1';'span';'Test row 1';'Test row 1
Apparently, in contrast to normal XML documents where your data would be easily accessable via well-named attributes or elements, here the XML contents of the "span element" would need to be extracted lateron, say via regex_substring and replace, as Justin has suggested. But I hope you get the idea. FWIW, to list the whole document, simply use the first OpenXML call (i.e. un-comment it). Here's a way to extract the text via further regexp_substr and sa_split_list - as so often, I'm impressed by the builtin tools... create or replace variable varDoc xml; set varDoc = ' <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title> </title> <style type="text/css"> .csC583D0C8{text-align:left;text-indent:0pt;margin:12pt 0pt 12pt 0pt} .csCABEF5ED{color:#282832;background-color:transparent;font-family:Arial;font-size:10pt;font-weight:bold;font-style:normal;} </style> </head> <body> returns XmlText;TextExtract;line_num;row_value 'Test row 1
(16 Oct '20, 09:00)
Volker Barth
|
If you can be certain about how your html is formatted, then you can probably do something with REGEXP_SUBSTR() and a basic expression like <[^>]*>. However, even in your example you will first have to process the line breaks - and find a way of removing the CSS info that is not inside a tag. Actually parsing html with regular expressions is notoriously difficult, and is extensively discussed on StackOverflow and elsewhere. Any approach based on regular expressions is very much a "fudge" and likely to be broken by any unexpected changes to the html document - the right way is to use a proper library. If you don't know it - regex101.com is a useful resource. 1
No mention of regular expressions is complete without input from xkcd :)...
(27 Oct '20, 12:08)
Breck Carter
Replies hidden
Comment Text Removed
Do browsers use "a proper library"? I suspect not, because browsers successfully handle ALL SORTS of improper HTML... ...but that may not matter in this case; if all you have is proper input then you must find some way to deal with it :)
(27 Oct '20, 12:16)
Breck Carter
Replies hidden
The great mystery I find with regular expressions, is having finally got one to work properly and think I really understand what is going on, I can come back a day later and not have the slightest idea how it works!
(27 Oct '20, 12:23)
Justin Willey
Well, I'd always thought this would be the one and only reference :)
(27 Oct '20, 13:09)
Volker Barth
Gosh, you'd consider this uncomprehensible?
Unfortunately, it was written almost two weeks ago, if I could only ask the author. :)
(27 Oct '20, 13:12)
Volker Barth
|