==========
Clean HTML
==========

HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based *tag* and *attribute* removal, selective content stripping, and empty-node pruning. Returns **text** or **HTML** depending on the mode.

Behavior
========

Three explicit modes with different outputs:

+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **Mode**                | **How to trigger**                       | **Returns**           | **Description**                                              |
+=========================+==========================================+=======================+==============================================================+
| **A) text-only**        | No parameters provided (all ``None``)    | ``str`` (plain text)  | Extracts text, skips script-supporting tags, inserts spaces. |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **B) structural clean** | At least one flag is ``True``            | ``str`` (HTML)        | Removes/unwraps per flags and serializes sanitized HTML.     |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+
| **C) text+preserve**    | Parameters present and all are ``False`` | ``str`` (text+markup) | Extracts text but **preserves** groups explicitly set False. |
+-------------------------+------------------------------------------+-----------------------+--------------------------------------------------------------+

.. note::
   When deleting nodes between adjacent text nodes, the cleaner inserts **one space** to avoid word concatenation.
   In Mode B the serializer uses ``quote_attr_values="always"`` for stable diffs.

Parameters
==========

+-------------------+----------------------------------------+
| **Parameter**     | **Description**                        |
+===================+========================================+
| ``text``          | (*str*) Raw HTML input.                |
+-------------------+----------------------------------------+
| ``remove_script`` | (*bool | None*) Drop executable tags.  |
+-------------------+----------------------------------------+

.. code-block:: python

   print(txt)

**Output**

.. code-block:: text

   Hello

Mode B — structural clean (HTML out)
------------------------------------

**Drop scripts, metadata, embeds; strip attributes; prune empties**

.. code-block:: python

   import textwizard as tw

   html = """ x

   Title

   hello

   """
   out = tw.clean_html(
       html,
       remove_script=True,
       remove_metadata_tags=True,
       remove_embedded_tags=True,
       remove_specific_attributes=["id", "on*"],
       remove_empty_tags=True,
       remove_comments=True,
       remove_doctype=True,
   )
   print(out)

**Output**

.. code-block:: html

   Title

   hello

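The call above sets several flags to ``True``, which the Behavior table maps to Mode B. As a rough illustration of that selection rule, here is a minimal sketch. ``select_mode`` is a hypothetical helper, not part of textwizard, and treating any truthy value (such as a non-empty list) as an active flag is an assumption:

.. code-block:: python

   def select_mode(**flags):
       """Classify a call by its flags, following the Behavior table.

       Hypothetical helper for illustration only; not textwizard's code.
       """
       # Parameters left at None count as "not provided".
       provided = {name: value for name, value in flags.items() if value is not None}
       if not provided:
           return "A"  # text-only
       # Assumption: any truthy value (True, non-empty list) activates Mode B.
       if any(bool(value) for value in provided.values()):
           return "B"  # structural clean
       return "C"  # text+preserve: parameters present, all False


   print(select_mode())                                          # A
   print(select_mode(remove_script=True, remove_comments=False)) # B
   print(select_mode(remove_script=False))                       # C

Keeping ``None`` distinct from ``False`` is what lets Mode C exist: ``False`` is an explicit request to preserve a group, while ``None`` means the caller said nothing about it.
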
**Wildcards and unwrap vs hard remove**

.. code-block:: python

   import textwizard as tw

   html = """

   Hello

   """
   test = tw.clean_html(
       html,
       remove_tags_and_contents=["iframe", "template"],
       remove_specific_attributes=["id", "data-*", "on*"],
       remove_empty_tags=True,
   )
   print(test)

**Output**

.. code-block:: html

   Hello

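The patterns ``"data-*"`` and ``"on*"`` above read like shell-style wildcards. One plausible way to match attribute names against such patterns is the standard library's ``fnmatch`` module. The sketch below assumes that semantics and is not textwizard's actual implementation; ``matches_any``, ``filter_attributes``, and the sample attributes are hypothetical:

.. code-block:: python

   from fnmatch import fnmatchcase


   def matches_any(name, patterns):
       """Return True if ``name`` matches at least one shell-style pattern."""
       # HTML attribute names are case-insensitive, so compare in lowercase.
       lowered = name.lower()
       return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


   def filter_attributes(attrs, patterns):
       """Return a copy of ``attrs`` without the names matched by ``patterns``."""
       return {key: value for key, value in attrs.items() if not matches_any(key, patterns)}


   # Hypothetical attribute set for a single element.
   attrs = {"id": "x", "data-k": "v", "onclick": "x()", "class": "box"}
   kept = filter_attributes(attrs, ["id", "data-*", "on*"])
   print(kept)  # {'class': 'box'}

Because ``fnmatchcase`` matches the whole name, ``"on*"`` catches ``onclick`` but leaves a name such as ``icon`` alone.
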
**Content stripping vs tag deletion**

.. code-block:: python

   import textwizard as tw

   html = """
   code stays
   """
   keep_tags_drop_content = tw.clean_html(
       html,
       remove_content_tags=["script", "style"],  # keep the tags, drop their content
   )

**Output**

.. code-block:: html

   code stays

**Sectioning, headings, flow**

.. code-block:: python

   import textwizard as tw

   html = "T X Body"
   out = tw.clean_html(
       html,
       remove_sectioning_tags=True,  # drop / /