Clean HTML¶
HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based tag and attribute removal, selective content stripping, and empty-node pruning. Returns text or HTML depending on the mode.
Behavior¶
Three explicit modes with different outputs:
| Mode | How to trigger | Returns | Description |
|---|---|---|---|
| A) text-only | No parameters provided (all None) | str (plain text) | Extracts text, skips script-supporting tags, inserts spaces. |
| B) structural clean | At least one flag is True | str (serialized HTML) | Removes/unwraps per flags and serializes sanitized HTML. |
| C) text+preserve | Parameters present and all are False | str (text with preserved markup) | Extracts text but preserves groups explicitly set False. |
Note
When deleting nodes between adjacent text nodes, the cleaner inserts one space to avoid word concatenation.
In Mode B the serializer uses quote_attr_values="always" for stable diffs.
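The space-insertion rule can be illustrated with the standard library alone. The sketch below is not TextWizard's implementation: the `TextOnly` class and `text_only` helper are hypothetical names, the skipped set is assumed to be just `script` and `style`, and it joins every text chunk with a space, which is coarser than the rule in the note.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text, skipping the content of <script> and <style>."""

    # Assumption: a stand-in for the tags the real cleaner skips.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside a skipped tag.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_only(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with one space keeps words apart where a tag used to sit.
    return " ".join(parser.chunks)


print(text_only("<p>Hello<script>x()</script>world</p>"))  # Hello world
```

Without the join, the two text nodes around the removed `<script>` would run together as `Helloworld`.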
Parameters¶
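| Parameter | Description |
|---|---|
| | (str) Raw HTML input (passed as the first positional argument in the examples). |
| remove_script | (bool \| None) Drop executable tags (e.g. <script>). |
| remove_metadata_tags | (bool \| None) Drop metadata tags. |
| | (bool \| None) Drop flow content (layout + phrasing). |
| remove_sectioning_tags | (bool \| None) Drop sectioning elements (e.g. <section>, <article>, <aside>, <nav>). |
| remove_heading_tags | (bool \| None) Drop headings (<h1>-<h6>). |
| | (bool \| None) Drop phrasing (inline) elements. |
| remove_embedded_tags | (bool \| None) Drop embedded content (e.g. <img>, <iframe>, <embed>, <video>, <audio>). |
| remove_interactive_tags | (bool \| None) Drop interactive elements (e.g. <button>, <input>, <select>). |
| | (bool \| None) Drop palpable elements (a broad set). |
| remove_comments | (bool \| None) Remove comments. |
| remove_doctype | (bool \| None) Remove the doctype. |
| remove_specific_attributes | (str \| list \| None) Remove attributes by name or wildcard (e.g. "on*", "data-*"). |
| | (str \| list \| None) Unwrap tags by name or wildcard (children are lifted into parent). |
| remove_empty_tags | (bool \| None) Prune empty nodes after edits. |
| remove_content_tags | (str \| list \| None) Keep tag but drop inner content. |
| remove_tags_and_contents | (str \| list \| None) Remove tag and its entire content. |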
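The three tag-level operations (unwrap, keep the tag but drop its content, remove the tag with its content) are easy to confuse. The following is a minimal sketch of the difference using `xml.etree.ElementTree` on a well-formed fragment. It is not how TextWizard is implemented: `unwrap`, `strip_content`, `remove_with_content`, and the helpers are hypothetical names, and the sketch does no wildcard matching and no space insertion.

```python
import xml.etree.ElementTree as ET

SRC = "<div>a<span>b<i>c</i>d</span>e<script>x()</script>f</div>"


def _find(root, tag):
    """Return (parent, index, child) for the first child named tag, else None."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.tag == tag:
                return parent, index, child
    return None


def _append_text(parent, index, text):
    """Attach text just before position index among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(root, tag):
    """Drop the tag itself, lifting its text and children into the parent."""
    while (hit := _find(root, tag)) is not None:
        parent, index, child = hit
        parent.remove(child)
        _append_text(parent, index, child.text)
        for offset, grandchild in enumerate(list(child)):
            parent.insert(index + offset, grandchild)
        _append_text(parent, index + len(child), child.tail)


def strip_content(root, tag):
    """Keep the tag (and its attributes) but empty it."""
    for el in list(root.iter(tag)):
        el.text = None
        del el[:]


def remove_with_content(root, tag):
    """Delete the tag and everything inside it, keeping the text that follows it."""
    while (hit := _find(root, tag)) is not None:
        parent, index, child = hit
        parent.remove(child)
        _append_text(parent, index, child.tail)


def show(operation, tag):
    root = ET.fromstring(SRC)
    operation(root, tag)
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


print(show(unwrap, "span"))                 # <div>ab<i>c</i>de<script>x()</script>f</div>
print(show(strip_content, "script"))        # <div>a<span>b<i>c</i>d</span>e<script></script>f</div>
print(show(remove_with_content, "script"))  # <div>a<span>b<i>c</i>d</span>ef</div>
```

Unwrapping keeps everything inside the tag, stripping keeps only the tag, and removal keeps neither.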
Parameter semantics¶
None → flag unset. If all are None ⇒ Mode A.
True → request removal/operation ⇒ Mode B.
False → request preservation ⇒ Mode C (text output that preserves those groups; remove_comments=False and remove_doctype=False also preserve them).
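The three rules above amount to a small decision function. The sketch below is an illustration only: `pick_mode` is a hypothetical helper, not part of TextWizard, and treating a list or string argument like True is an assumption inferred from the examples, where passing only a tag list returns HTML.

```python
def pick_mode(**params):
    """Return 'A', 'B', or 'C' for a set of clean_html-style keyword arguments."""
    given = [value for value in params.values() if value is not None]
    if not given:
        return "A"  # nothing set: text-only
    if any(value is not False for value in given):
        # At least one True. Assumption: a tag or attribute list counts as well.
        return "B"  # structural clean, HTML out
    return "C"      # only False values: text with preservation


print(pick_mode())                                              # A
print(pick_mode(remove_script=True, remove_comments=False))     # B
print(pick_mode(remove_heading_tags=False))                     # C
```

A single True is enough to switch to HTML output, even when other flags are set to False.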
Tag groups reference¶
Flag |
Tags affected |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Examples¶
Mode A — text only¶
import textwizard as tw
txt = tw.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)
Output
Hello
Mode B — structural clean (HTML out)¶
Drop scripts, metadata, embeds; strip attributes; prune empties
import textwizard as tw
html = """
<html><head>
<title>x</title><meta charset="utf-8">
<link rel="preload" href="x.css"><script>evil()</script>
</head>
<body>
<article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article>
<!-- comment -->
</body></html>
"""
out = tw.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],  # "on*" matches onclick and other event handlers
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)
Output
<html> <body> <article><h1>Title</h1><p>hello</p></article> </body></html>
Wildcards and unwrap vs hard remove
import textwizard as tw
html = """
<div id="hero" data-track="x">
<svg viewBox="0 0 10 10"><circle r="5"/></svg>
<p class="k" onclick="hack()">Hello</p>
<iframe src="a.html"></iframe>
</div>
"""
test = tw.clean_html(
    html,
    remove_tags_and_contents=["iframe", "template"],  # delete these tags with everything inside
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_empty_tags=True,
)
print(test)
Output
<html><body><div> <p class="k">Hello</p> </div> </body></html>
Content stripping vs tag deletion
import textwizard as tw
html = """
<article>
<script>track()</script>
<style>p{}</style>
<pre>code stays</pre>
<noscript>fallback</noscript>
</article>
"""
keep_tags_drop_content = tw.clean_html(
    html,
    remove_content_tags=["script", "style"],  # keep <script>/<style> but empty them
)
print(keep_tags_drop_content)
Output
<html><head></head><body><article> <script></script> <style></style> <pre>code stays</pre> <noscript>fallback</noscript> </article> </body></html>
Sectioning, headings, flow
import textwizard as tw
html = "<section><h1>T</h1><div><address>X</address><p>Body</p></div></section>"
out = tw.clean_html(
    html,
    remove_sectioning_tags=True,  # drop <section>/<article>/<aside>/<nav>
    remove_heading_tags=True,     # drop <h1>-<h6>
)
print(out)
Output
<html><head></head><body></body></html>
Interactive and embedded
import textwizard as tw
html = """
<button id="b" disabled>Click</button>
<img src="logo.png" alt="Logo">
<video src="v.mp4"></video>
"""
out = tw.clean_html(
    html,
    remove_interactive_tags=True,  # button, input, select
    remove_embedded_tags=True,     # img, iframe, embed, video, audio
    remove_specific_attributes=["id"],
    remove_empty_tags=True,
)
print(out)  # empty string: nothing is left after the removals
Mode C — text with preservation¶
Preserve sectioning + headings + comments
import textwizard as tw
html = "<article><h1>T</h1><p>Body</p><!-- c --></article>"
txt = tw.clean_html(
    html,
    remove_sectioning_tags=False,  # keep <article>
    remove_heading_tags=False,     # keep <h1>
    remove_comments=False,         # keep the comment
)
print(txt)
Output
<article><h1>T</h1>Body<!-- c --></article>
Preserve images, plain text elsewhere
import textwizard as tw
html = '<p>A<img src="a.png" alt="A">B</p>'
txt = tw.clean_html(
    html,
    remove_embedded_tags=False,  # keep <img>
)
print(txt)
Output
A<img src="a.png" alt="A">B
Returns¶
Mode A: str, plain text.
Mode B: str, serialized HTML.
Mode C: str, text with selected tags/comments/doctype preserved inline.
Operational notes¶
Prefer targeted flags to preserve semantics. Use broad switches only for aggressive sanitization.
Wildcards:
- Attributes: "on*" for event handlers, "data-*", "aria-*"
- Tags: exact names, lists, or patterns like "*ads*".
When the DOM becomes empty after removals, returns "".
The serializer may add <html><body>… wrappers to ensure a well-formed tree (Mode B).
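To make the wildcard patterns concrete, here is a minimal sketch of shell-style matching with the standard library's `fnmatch`. It assumes the patterns behave like glob patterns; how TextWizard actually matches them is not stated here, and `drop_attributes` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def drop_attributes(attrs, patterns):
    """Return a copy of attrs without the names matching any pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return {
        name: value
        for name, value in attrs.items()
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    }


attrs = {"id": "k", "class": "k", "onclick": "x()", "data-track": "x"}
print(drop_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'k'}
print(fnmatchcase("sidebar-ads-top", "*ads*"))          # True
```

Exact names and patterns can be mixed in one list, since a name without `*` only matches itself.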
Errors¶
Invalid input type → ValueError.
Malformed markup is normalized rather than rejected when possible.
See also¶
beutifull_html — Pretty-print and normalize HTML formatting
HTML → Markdown — Convert HTML to Markdown
TextWizard — Overview and quick start