==========
Clean HTML
==========

HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based *tag* and *attribute* removal, selective content stripping, and empty-node pruning. Returns **text** or **HTML** depending on the mode.

Behavior
========

Three explicit modes with different outputs:

+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **Mode**                | **How to trigger**                       | **Returns**           | **Description**                                              |
+=========================+==========================================+=======================+==============================================================+
| **A) text-only**        | No parameters provided (all ``None``)    | ``str`` (plain text)  | Extracts text, skips script-supporting tags, inserts spaces. |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **B) structural clean** | At least one flag is ``True``            | ``str`` (HTML)        | Removes/unwraps per flags and serializes sanitized HTML.     |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **C) text+preserve**    | Parameters present and all are ``False`` | ``str`` (text+markup) | Extracts text but **preserves** groups explicitly set False. |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+

.. note::
   When deleting nodes between adjacent text nodes, the cleaner inserts **one space** to avoid word concatenation.

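
The trigger rules in the mode table can be read as a small decision function. The sketch below only illustrates those rules and is not the library's implementation: ``select_mode`` is a hypothetical helper, and it assumes every parameter is a plain ``bool`` or ``None`` (list-valued parameters are not modeled).

.. code-block:: python

   from typing import Optional


   def select_mode(**flags: Optional[bool]) -> str:
       """Return "A", "B", or "C" following the trigger rules in the mode table."""
       provided = [value for value in flags.values() if value is not None]
       if not provided:
           return "A"  # nothing set: plain-text extraction
       if any(value is True for value in provided):
           return "B"  # at least one True flag: sanitized HTML
       return "C"      # only False flags: text with preserved groups


   print(select_mode())                                             # A
   print(select_mode(remove_script=True, remove_comments=False))    # B
   print(select_mode(remove_heading_tags=False))                    # C

Under this reading, a single ``True`` flag is enough to switch the return value from text to HTML, so callers that mix ``True`` and ``False`` flags should expect Mode B output.
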
In Mode B, the serializer uses ``quote_attr_values="always"`` so that attribute quoting is consistent and diffs stay stable.

Parameters
==========

+---------------------+----------------------------------------+
| **Parameter**       | **Description**                        |
+=====================+========================================+
| ``text``            | (*str*) Raw HTML input.                |
+---------------------+----------------------------------------+
| ``remove_script``   | (*bool | None*) Drop executable tags.  |
+---------------------+----------------------------------------+

Examples
========

**Text-only extraction (no parameters)**

.. code-block:: python

   import textwizard as tw

   # With no flags set, the call runs in Mode A and returns plain text.
   txt = tw.clean_html("<p>Hello</p><script>x=1</script>")
   print(txt)

**Output**

.. code-block:: text

   Hello

Mode B — structural clean (HTML out)
------------------------------------

**Drop scripts, metadata, embeds; strip attributes; prune empties**

.. code-block:: python

   import textwizard as tw

   html = """ x

   Title

   hello

   """
   out = tw.clean_html(
       html,
       remove_script=True,
       remove_metadata_tags=True,
       remove_embedded_tags=True,
       remove_specific_attributes=["id", "on*"],
       remove_empty_tags=True,
       remove_comments=True,
       remove_doctype=True,
   )
   print(out)

**Output**

.. code-block:: html

   Title

   hello

**Wildcards and unwrap vs hard remove**

.. code-block:: python

   import textwizard as tw

   html = """

   Hello

   """
   test = tw.clean_html(
       html,
       remove_tags_and_contents=["iframe", "template"],
       remove_specific_attributes=["id", "data-*", "on*"],
       remove_empty_tags=True,
   )
   print(test)

**Output**

.. code-block:: html

   Hello
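
One way to picture the wildcard patterns used above is shell-style globbing over attribute names. The sketch below assumes ``fnmatch`` semantics and case-insensitive matching, neither of which the source confirms; ``strip_attributes`` is a hypothetical helper, not part of the library.

.. code-block:: python

   from fnmatch import fnmatchcase
   from typing import Dict, List


   def strip_attributes(attrs: Dict[str, str], patterns: List[str]) -> Dict[str, str]:
       """Return a copy of ``attrs`` without the names matched by any pattern."""
       lowered = [pattern.lower() for pattern in patterns]
       return {
           name: value
           for name, value in attrs.items()
           if not any(fnmatchcase(name.lower(), pattern) for pattern in lowered)
       }


   attrs = {"id": "x", "data-k": "v", "onclick": "x()", "class": "note"}
   print(strip_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'note'}

Under these assumed semantics, a prefix pattern such as ``"on*"`` matches every attribute whose name starts with ``on``, not only event handlers, so narrow patterns are the safer choice.
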

**Content stripping vs tag deletion**

.. code-block:: python

   import textwizard as tw

   html = """

   code stays

   """
   keep_tags_drop_content = tw.clean_html(
       html,
       remove_content_tags=["script", "style"],  # keep the tags, drop their content
   )
   print(keep_tags_drop_content)

**Output**

.. code-block:: html

   code stays
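
The difference between emptying an element and deleting it can be sketched with the standard library's ``html.parser``. This is an illustration under stated assumptions, not the library's implementation: ``TagFilter`` and ``filter_html`` are hypothetical names, comments and the doctype are simply dropped, and the re-emitted markup is not a security-grade sanitizer.

.. code-block:: python

   from html import escape
   from html.parser import HTMLParser

   VOID = {"area", "base", "br", "col", "embed", "hr", "img",
           "input", "link", "meta", "source", "track", "wbr"}


   class TagFilter(HTMLParser):
       """Re-emit HTML while emptying or deleting selected elements."""

       def __init__(self, empty_tags=(), delete_tags=()):
           super().__init__()
           self.empty_tags = set(empty_tags)    # keep the tag, drop what is inside
           self.delete_tags = set(delete_tags)  # drop the tag and what is inside
           self.out = []
           self.skipping = []  # stack of (tag, emit_end_tag) for suppressed elements

       def handle_starttag(self, tag, attrs):
           if tag in VOID:
               # Void elements never get an end tag, so treat them as self-closing.
               self.handle_startendtag(tag, attrs)
               return
           targeted = tag in self.empty_tags or tag in self.delete_tags
           if self.skipping:
               if targeted:
                   self.skipping.append((tag, False))  # nested target: stay silent
               return
           if targeted:
               keep_shell = tag not in self.delete_tags
               if keep_shell:
                   self.out.append(self.get_starttag_text())
               self.skipping.append((tag, keep_shell))
           else:
               self.out.append(self.get_starttag_text())

       def handle_startendtag(self, tag, attrs):
           if not self.skipping and tag not in self.delete_tags:
               self.out.append(self.get_starttag_text())

       def handle_endtag(self, tag):
           if self.skipping:
               if tag == self.skipping[-1][0]:
                   _, emit_end = self.skipping.pop()
                   if emit_end:
                       self.out.append(f"</{tag}>")
               return
           self.out.append(f"</{tag}>")

       def handle_data(self, data):
           if not self.skipping:
               self.out.append(escape(data, quote=False))


   def filter_html(html, empty_tags=(), delete_tags=()):
       parser = TagFilter(empty_tags, delete_tags)
       parser.feed(html)
       parser.close()
       return "".join(parser.out)


   sample = "<div><script>x=1</script><p>code stays</p></div>"
   print(filter_html(sample, empty_tags=["script", "style"]))
   # <div><script></script><p>code stays</p></div>
   print(filter_html(sample, delete_tags=["script", "style"]))
   # <div><p>code stays</p></div>

The first call mirrors the idea behind ``remove_content_tags`` (the element survives, its body does not); the second mirrors ``remove_tags_and_contents``.
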

**Sectioning, headings, flow**

.. code-block:: python

   import textwizard as tw

   html = "T X Body"
   out = tw.clean_html(
       html,
       remove_sectioning_tags=True,  # drop sectioning tags
       remove_heading_tags=True,     # drop heading tags
   )
   print(out)

**Interactive and embedded**

.. code-block:: python

   import textwizard as tw

   html = """ """
   out = tw.clean_html(
       html,
       remove_interactive_tags=True,  # button, input, select
       remove_embedded_tags=True,     # img, iframe, embed, video, audio
       remove_specific_attributes=["id"],
       remove_empty_tags=True,
   )
   print(out)  # "" (empty string)

Mode C — text with preservation
-------------------------------

**Preserve sectioning + headings + comments**

.. code-block:: python

   import textwizard as tw

   html = "T Body"
   txt = tw.clean_html(
       html,
       remove_sectioning_tags=False,
       remove_heading_tags=False,
       remove_comments=False,
   )
   print(txt)

**Output**

.. code-block:: html

   T
   Body
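
The preservation idea behind Mode C can be approximated with ``html.parser``: emit text everywhere, but pass selected tags and comments through unchanged. This is a rough sketch under assumptions, not the library's algorithm: ``PreservingText`` and ``text_with_preserved`` are hypothetical names, and the sketch neither skips script content nor inserts the separating space described in the Behavior note.

.. code-block:: python

   from html.parser import HTMLParser


   class PreservingText(HTMLParser):
       """Collect text, keeping only the tags and comments asked for."""

       def __init__(self, keep_tags=(), keep_comments=False):
           super().__init__()
           self.keep_tags = set(keep_tags)
           self.keep_comments = keep_comments
           self.out = []

       def handle_starttag(self, tag, attrs):
           if tag in self.keep_tags:
               self.out.append(self.get_starttag_text())

       def handle_startendtag(self, tag, attrs):
           # Self-closing tags are emitted once, without a separate end tag.
           self.handle_starttag(tag, attrs)

       def handle_endtag(self, tag):
           if tag in self.keep_tags:
               self.out.append(f"</{tag}>")

       def handle_comment(self, data):
           if self.keep_comments:
               self.out.append(f"<!--{data}-->")

       def handle_data(self, data):
           self.out.append(data)


   def text_with_preserved(html, keep_tags=(), keep_comments=False):
       parser = PreservingText(keep_tags, keep_comments)
       parser.feed(html)
       parser.close()
       return "".join(parser.out)


   sample = "<section><h1>T</h1><!-- note --><p>Body</p></section>"
   print(text_with_preserved(sample, keep_tags=["section", "h1"], keep_comments=True))
   # <section><h1>T</h1><!-- note -->Body</section>
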
**Preserve images, plain text elsewhere**

.. code-block:: python

   import textwizard as tw

   html = 'AB'
   txt = tw.clean_html(
       html,
       remove_embedded_tags=False,  # keep images
   )
   print(txt)

**Output**

.. code-block:: html

   AB

Returns
=======

- Mode A: ``str`` plain text.
- Mode B: ``str`` serialized HTML.
- Mode C: ``str`` text with selected tags/comments/doctype preserved inline.

Operational notes
=================

- Prefer targeted flags to preserve semantics. Use broad switches only for aggressive sanitization.
- Wildcards:

  - Attributes: ``"on*"`` for event handlers, ``"data-*"``, ``"aria-*"``.
  - Tags: exact names, lists, or patterns like ``"ads*"``.

- When the DOM becomes empty after removals, the function returns ``""``.
- The serializer may add ``…`` wrappers to ensure a well-formed tree (Mode B).

Errors
======

- Invalid input type → ``ValueError``.
- Malformed markup is normalized rather than rejected when possible.

See also
========

- :doc:`beutifull_html` — Pretty-print and normalize HTML formatting
- :doc:`html_to_markdown` — Convert HTML to Markdown
- :doc:`intro` — Overview and quick start