Clean CSV¶

Deterministic CSV cleanup with full dialect control and structural fixes. Removes columns or rows, blanks values by exact match or wildcard, trims whitespace, deduplicates, and drops empty rows/columns. Always returns a CSV str serialized with the chosen dialect.

Parameters¶

Parameter	Description
`text`	(str) Raw CSV input.
`delimiter`	(str) Field separator. Default: `,`.
`quotechar`	(str) Quote character. Default: `"`.
`escapechar`	(str | None) Escape prefix for quotechar. Default: `None`.
`doublequote`	(bool) Double the quotechar to escape inside fields. Default: `True`.
`skipinitialspace`	(bool) Skip spaces right after delimiter. Default: `False`.
`lineterminator`	(str) Line terminator for output. Default: `"\n"`.
`quoting`	(int) One of `csv.QUOTE_MINIMAL`, `csv.QUOTE_ALL`, `csv.QUOTE_NONE`, or `csv.QUOTE_NONNUMERIC`.
`remove_columns`	(str | int | list) Columns to drop by name (requires header) or 0-based index.
`remove_row_index`	(int | list) Row indices to drop (0-based over the parsed rows).
`remove_values`	(str | int | list) Values to blank out. Supports wildcards `*` and `?`.
`remove_duplicates_rows`	(bool) Remove duplicate records (exact row match). Default: `False`.
`trim_whitespace`	(bool) Strip leading/trailing whitespace inside fields. Default: `False`.
`remove_empty_columns`	(bool) Drop columns that are empty after cleaning. Default: `False`.
`remove_empty_rows`	(bool) Drop rows with all-empty fields after cleaning. Default: `False`.
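
The dialect parameters share their names with the formatting parameters of Python's standard `csv` module. Assuming `clean_csv` gives them the same meaning (the parameter names suggest this, but it is not stated here), their effect on serialization can be previewed with the standard library alone. The `serialize` helper below is hypothetical and not part of textwizard.

```python
import csv
import io

rows = [["id", "note"], ["1", 'say "hi"; bye']]

def serialize(rows, **fmt):
    """Write rows with csv.writer using the given formatting parameters."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerows(rows)
    return buf.getvalue()

# QUOTE_MINIMAL (the csv module default) quotes only fields that need it;
# doublequote=True (also the default) escapes an embedded quote by doubling it.
minimal = serialize(rows, delimiter=";", lineterminator="\n")

# QUOTE_ALL wraps every field in quotechar.
quoted = serialize(rows, delimiter=";", lineterminator="\n", quoting=csv.QUOTE_ALL)

print(minimal)
print(quoted)
```

Comparing the two strings shows which fields each `quoting` mode wraps in quotes.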

Note

Row indices are 0-based and count physical rows in the order they are parsed. If your CSV has a header, the header is row index 0 and the first data row is index 1.
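
A minimal sketch of this indexing rule, using the standard `csv` module rather than textwizard itself: the header is simply the first parsed row, so an index of 0 refers to it.

```python
import csv
import io

csv_data = "h1,h2\nx1,y1\nx2,y2\n"

# Parse into physical rows; the header is just the first one.
rows = list(csv.reader(io.StringIO(csv_data)))
for index, row in enumerate(rows):
    print(index, row)

# Dropping index 0 therefore removes the header, not the first data row.
drop = {0}
kept = [row for i, row in enumerate(rows) if i not in drop]
print(kept)
```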

Examples¶

Basic cleanup: drop columns, redact values, trim, dedupe¶

import textwizard as tw

csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)

Output

name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York

Drop by index vs by name¶

import textwizard as tw

csv_data = "A,B,C\n1,2,3\n4,5,6\n"
by_name  = tw.clean_csv(csv_data, remove_columns=["B"])
by_index = tw.clean_csv(csv_data, remove_columns=[1])  # same column
print(by_name)
print(by_index)

Output

A,C
1,3
4,6

A,C
1,3
4,6

Remove specific rows¶

import textwizard as tw

csv_data = "h1,h2\nx1,y1\nx2,y2\nx3,y3\n"
out = tw.clean_csv(csv_data, remove_row_index=[2])  # drops "x2,y2"
print(out)

Output

h1,h2
x1,y1
x3,y3

Custom dialect: semicolon delimiter, quote all¶

import csv

import textwizard as tw

csv_data = "id;name;note\n1;Alice;\"a; b\""
out = tw.clean_csv(
    csv_data,
    delimiter=";",
    quoting=csv.QUOTE_ALL,
    lineterminator="\n",
)
print(out)

Output

"id";"name";"note"
"1";"Alice";"a; b"

Wildcard examples¶

import textwizard as tw

csv_data = "name,email\nJohn,john.doe@example.com\nJane,jane@corp.com\n"
out = tw.clean_csv(
    csv_data,
    remove_values=["*@example.com", "Jane"],  # blanks fields matching these patterns
)
print(out)

Output

name,email
John,
,jane@corp.com
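
The matching rule can be approximated with the standard `fnmatch` module. This is a sketch under assumptions, not the library's implementation: it assumes `*` matches any run of characters, `?` matches exactly one character, the pattern must match the whole cell, and matching is case-sensitive. `blank_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def blank_matches(rows, patterns):
    """Return rows with every cell that matches any pattern replaced by ''."""
    # Note: fnmatch also gives [seq] a special meaning, which the
    # documentation above does not mention for remove_values.
    return [
        ["" if any(fnmatchcase(cell, p) for p in patterns) else cell for cell in row]
        for row in rows
    ]

rows = [
    ["name", "email"],
    ["John", "john.doe@example.com"],
    ["Jane", "jane@corp.com"],
]
blanked = blank_matches(rows, ["*@example.com", "Jan?"])
print(blanked)
```

Matching cells are blanked in place, so the row and column counts stay unchanged.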

Headerless CSV: drop by index and trim¶

import textwizard as tw

csv_data = "  a , 1 , x \n  b , 2 , y \n"
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    trim_whitespace=True,
    remove_columns=[2],       # drop third column
)
print(out)

Output

a,1
b,2

Deduplicate only¶

import textwizard as tw

csv_data = "k,v\nA,1\nA,1\nB,2\n"
out = tw.clean_csv(csv_data, remove_duplicates_rows=True)
print(out)

Output

k,v
A,1
B,2
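
One way to picture exact-row deduplication is an order-preserving pass that keeps the first occurrence of each row, which is consistent with the outputs shown in these examples. The `dedupe_rows` helper below is a hypothetical sketch, not textwizard's code.

```python
def dedupe_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [["k", "v"], ["A", "1"], ["A", "1"], ["B", "2"]]
deduped = dedupe_rows(rows)
print(deduped)
```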

Returns¶

`str`: the cleaned CSV, serialized with the specified dialect.

Operational notes¶

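Column name matching requires a header row. Otherwise, use indices.
`remove_values` blanks matching cells; it does not drop rows. Supports the wildcards `*` and `?`.
`remove_empty_columns` and `remove_empty_rows` run after the other operations, so rows and columns emptied by earlier steps are also dropped.
Deduplication compares complete rows as serialized under the chosen dialect, so two rows are duplicates only if every field matches.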
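
The ordering note matters because blanking can create empty rows. The sketch below illustrates this on headerless data with two hypothetical helpers, `blank` and `drop_empty`. How the library treats a column that has a header name but no data is not specified here, so the sketch avoids that case.

```python
def blank(rows, value):
    """Replace every cell equal to value with an empty string."""
    return [["" if cell == value else cell for cell in row] for row in rows]

def drop_empty(rows):
    """Drop all-empty rows, then columns that are empty in every remaining row."""
    rows = [row for row in rows if any(cell != "" for cell in row)]
    if not rows:
        return rows
    width = max(len(row) for row in rows)
    keep = [i for i in range(width) if any(i < len(r) and r[i] != "" for r in rows)]
    return [[r[i] if i < len(r) else "" for i in keep] for r in rows]

rows = [["a", "x", ""], ["secret", "", ""], ["b", "y", ""]]

# Documented order: blank first, then drop what became empty.
after = drop_empty(blank(rows, "secret"))

# Reversed order: the blanked row survives as an all-empty row.
before = blank(drop_empty(rows), "secret")

print(after)
print(before)
```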

Errors¶

An invalid dialect or malformed CSV may raise parsing errors.
An unknown `quoting` constant raises `ValueError`.
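
A defensive call might look like the sketch below. It assumes that parsing errors surface as `csv.Error`, which is not confirmed here. The `clean_or_none` wrapper is hypothetical and takes the cleaning function as an argument, so it does not depend on textwizard.

```python
import csv

def clean_or_none(cleaner, text, **options):
    """Call cleaner(text, **options); return None if it raises a CSV-related error."""
    try:
        return cleaner(text, **options)
    except (csv.Error, ValueError) as exc:
        # csv.Error is an assumption; ValueError is the documented case.
        print(f"CSV cleanup failed: {exc}")
        return None

# Usage with the real function would look like:
#   import textwizard as tw
#   out = clean_or_none(tw.clean_csv, raw_text, quoting=csv.QUOTE_ALL)
```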
