========= Clean XML ========= XML cleanup with namespace stripping, tag/attribute pruning (with wildcards), whitespace normalization, and duplicate-sibling removal. If **no flags** are provided, the function returns the **concatenated text content** of the XML. If **any flag** is set, it returns **serialized XML** (no XML declaration). Parameters ==================== .. list-table:: :header-rows: 1 :widths: 28 72 * - **Parameter** - **Description** * - ``text`` - (*str | bytes*) Raw XML input. * - ``remove_comments`` - (*bool | None*) Remove ````. Preserves spacing by inserting one space when needed. * - ``remove_processing_instructions`` - (*bool | None*) Remove ````. Preserves spacing like above. * - ``remove_cdata_sections`` - (*bool | None*) Unwrap CDATA by decoding entity text content. * - ``remove_empty_tags`` - (*bool | None*) Drop elements with no children and no text (root never dropped). * - ``remove_namespaces`` - (*bool | None*) Strip element and attribute namespace prefixes and ``xmlns`` declarations. * - ``remove_duplicate_siblings`` - (*bool | None*) Keep only the first identical serialized sibling element. * - ``collapse_whitespace`` - (*bool | None*) Collapse runs of whitespace in text nodes to a single space. * - ``remove_specific_tags`` - (*str | list | None*) **Delete** matching elements entirely. Wildcards supported. Patterns can be local names (``"price"``) or qnames (``"ns:price"``); matching uses local-name. * - ``remove_content_tags`` - (*str | list | None*) Keep the element but **remove its children and text**. Wildcards supported. * - ``remove_attributes`` - (*str | list | None*) Delete attributes by name. Supports wildcards and qname matching (e.g. ``"xml:*"``, ``"data-*"``, ``"id"``, ``"xlink:href"``). * - ``remove_declaration`` - (*bool | None*) **No-op** in current implementation (output never includes XML declaration). * - ``normalize_entities`` - (*bool | None*) **No-op** in current implementation (entity normalization occurs only with ``remove_cdata_sections``). - **No flags provided** → returns **plain text** joined from all text nodes (whitespace collapsed between nodes). - **At least one flag provided** → applies transformations and returns **XML string** without XML declaration. - **Booleans**: ``True`` → apply operation. ``False``/``None`` → skip. - **Lists/values** (tags, attributes): ``None`` → skip. Non-empty value/list → apply for each pattern. - **Text-only mode**: when **all** parameters are ``None`` the function returns plain text. Examples ======== Text-only mode (no flags) ------------------------- .. code-block:: python import textwizard as tw xml = "OneTwo" txt = tw.clean_xml(xml) # no flags print(txt) **Output** .. code-block:: text One Two Remove namespaces + comments + empties -------------------------------------- .. code-block:: python import textwizard as tw xml = "ok" fixed = tw.clean_xml( xml, remove_namespaces=True, remove_empty_tags=True, remove_comments=True, ) print(fixed) **Output** .. code-block:: xml ok Delete specific tags vs keep-tag-drop-content --------------------------------------------- .. code-block:: python import textwizard as tw xml = "mT

Body

" test1 = tw.clean_xml(xml, remove_specific_tags=["meta","title"]) test2 = tw.clean_xml(xml, remove_content_tags=["sec"]) print(test1) print(test2) **Output** .. code-block:: xml

Body

m Wildcard attribute removal (qname and local) -------------------------------------------- .. code-block:: python import textwizard as tw xml = '' out = tw.clean_xml( xml, remove_attributes=["id", "svg:*"], # drop local 'id' and any svg-qualified attributes remove_namespaces=True, ) print(out) **Output** .. code-block:: xml Collapse whitespace and deduplicate siblings -------------------------------------------- .. code-block:: python import textwizard as tw xml = " a b a b c" out = tw.clean_xml(xml, collapse_whitespace=True, remove_duplicate_siblings=True) print(out) **Output** .. code-block:: xml a bc Remove empty tags safely ------------------------ .. code-block:: python import textwizard as tw xml = "1" out = tw.clean_xml(xml, remove_empty_tags=True) print(out) **Output** .. code-block:: xml 1 Returns ======= - **No flags**: ``str`` plain text (concatenated from text nodes). - **Any flag**: ``str`` XML without declaration. Operational notes ================= - Tag patterns match by **local name**; a ``"ns:tag"`` pattern is accepted but matched on local-name. - Attribute patterns can be given as **local** (``"href"``) or **qname** (``"xlink:href"``); wildcards ``*`` and ``?`` supported. - Comments/PI removal preserves word boundaries by inserting a single space when two text nodes would otherwise touch. - If the root becomes empty and ``remove_empty_tags=True``, returns an empty string ``""``. Errors ====== - Invalid XML → recovered when possible; otherwise may raise a parser error. See also ======== - :doc:`clean_html` — HTML cleanup - :doc:`clean_csv` — CSV cleanup