Clean XML¶
XML cleanup with namespace stripping, tag/attribute pruning (with wildcards), whitespace normalization, and duplicate-sibling removal. If no flags are provided, the function returns the concatenated text content of the XML. If any flag is set, it returns serialized XML (no XML declaration).
Parameters¶
Parameter |
Description |
|---|---|
|
(str | bytes) Raw XML input. |
|
(bool | None) Remove |
|
(bool | None) Remove |
|
(bool | None) Unwrap CDATA by decoding entity text content. |
|
(bool | None) Drop elements with no children and no text (root never dropped). |
|
(bool | None) Strip element and attribute namespace prefixes and |
|
(bool | None) Keep only the first identical serialized sibling element. |
|
(bool | None) Collapse runs of whitespace in text nodes to a single space. |
|
(str | list | None) Delete matching elements entirely. Wildcards supported.
Patterns can be local names ( |
|
(str | list | None) Keep the element but remove its children and text. Wildcards supported. |
|
(str | list | None) Delete attributes by name. Supports wildcards and qname matching
(e.g. |
|
(bool | None) No-op in current implementation (output never includes XML declaration). |
|
(bool | None) No-op in current implementation (entity normalization occurs only with |
No flags provided → returns plain text joined from all text nodes (whitespace collapsed between nodes).
At least one flag provided → applies transformations and returns XML string without XML declaration.
Booleans:
True→ apply operation.False/None→ skip.Lists/values (tags, attributes):
None→ skip. Non-empty value/list → apply for each pattern.Text-only mode: when all parameters are
Nonethe function returns plain text.
Examples¶
Text-only mode (no flags)¶
import textwizard as tw
xml = "<root><a>One</a><!-- c --><b>Two</b></root>"
txt = tw.clean_xml(xml) # no flags
print(txt)
Output
One Two
Remove namespaces + comments + empties¶
import textwizard as tw
xml = "<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"
fixed = tw.clean_xml(
xml,
remove_namespaces=True,
remove_empty_tags=True,
remove_comments=True,
)
print(fixed)
Output
<root><b>ok</b></root>
Wildcard attribute removal (qname and local)¶
import textwizard as tw
xml = '<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="10" height="10"><svg:g id="g1"/></svg:svg>'
out = tw.clean_xml(
xml,
remove_attributes=["id", "svg:*"], # drop local 'id' and any svg-qualified attributes
remove_namespaces=True,
)
print(out)
Output
<svg width="10" height="10"><g/></svg>
Collapse whitespace and deduplicate siblings¶
import textwizard as tw
xml = "<r><x> a b </x><x> a b </x><x>c</x></r>"
out = tw.clean_xml(xml, collapse_whitespace=True, remove_duplicate_siblings=True)
print(out)
Output
<r><x>a b</x><x>c</x></r>
Returns¶
No flags:
strplain text (concatenated from text nodes).Any flag:
strXML without declaration.
Operational notes¶
Tag patterns match by local name; a
"ns:tag"pattern is accepted but matched on local-name.Attribute patterns can be given as local (
"href") or qname ("xlink:href"); wildcards*and?supported.Comments/PI removal preserves word boundaries by inserting a single space when two text nodes would otherwise touch.
If the root becomes empty and
remove_empty_tags=True, returns an empty string"".
Errors¶
Invalid XML → recovered when possible; otherwise may raise a parser error.
See also¶
Clean HTML — HTML cleanup
Clean CSV — CSV cleanup