Clean CSV

Deterministic CSV cleanup with full dialect control and structural fixes. Removes columns or rows, blanks values by exact match or wildcard, trims whitespace, deduplicates, and drops empty rows/columns. Always returns a CSV str serialized with the chosen dialect.

Parameters

Parameter

Description

text

(str) Raw CSV input.

delimiter

(str) Field separator. Default: ,

quotechar

(str) Quote character. Default: ".

escapechar

(str | None) Escape prefix for quotechar. Default: None.

doublequote

(bool) Double the quotechar to escape inside fields. Default: True.

skipinitialspace

(bool) Skip spaces right after delimiter. Default: False.

lineterminator

(str) Line terminator for output. Default: "\n".

quoting

(int) One of csv.QUOTE_MINIMAL | QUOTE_ALL | QUOTE_NONE | QUOTE_NONNUMERIC.

remove_columns

(str | int | list) Columns to drop by name (requires header) or 0-based index.

remove_row_index

(int | list) Row indices to drop (0-based over the parsed rows).

remove_values

(str | int | list) Values to blank out. Supports wildcards * and ?.

remove_duplicates_rows

(bool) Remove duplicate records (exact row match). Default: False.

trim_whitespace

(bool) Strip leading/trailing whitespace inside fields. Default: False.

remove_empty_columns

(bool) Drop columns that are empty after cleaning. Default: False.

remove_empty_rows

(bool) Drop rows with all-empty fields after cleaning. Default: False.

Note

Row indexing is 0-based over the physical row order as parsed. If your CSV has a header, the header is row index 0.

Examples

Basic cleanup: drop columns, redact values, trim, dedupe

import textwizard as tw

csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)

Output

name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York

Drop by index vs by name

import textwizard as tw

csv_data = "A,B,C\n1,2,3\n4,5,6\n"
by_name  = tw.clean_csv(csv_data, remove_columns=["B"])
by_index = tw.clean_csv(csv_data, remove_columns=[1])  # same column
print(by_name); print(by_index)

Output

A,C
1,3
4,6

A,C
1,3
4,6

Remove specific rows

import textwizard as tw

csv_data = "h1,h2\nx1,y1\nx2,y2\nx3,y3\n"
out = tw.clean_csv(csv_data, remove_row_index=[2])  # drops "x2,y2"
print(out)

Output

h1,h2
x1,y1
x3,y3

Normalize dialect: semicolon → comma, quote all

import textwizard as tw, csv

csv_data = "id;name;note\n1;Alice;\"a; b\""
out = tw.clean_csv(
    csv_data,
    delimiter=";",
    quoting=csv.QUOTE_ALL,
    lineterminator="\n",
)
print(out)

Output

id;name;note
1;Alice;"a; b"

Wildcard Examples

import textwizard as tw

csv_data = "name,email\nJohn,john.doe@example.com\nJane,jane@corp.com\n"
out = tw.clean_csv(
    csv_data,
    remove_values=["*@example.com", "Jane"],  # blanks fields matching these patterns
)
print(out)

Output

name,email
John,
,jane@corp.com

Headerless CSV: drop by index and trim

import textwizard as tw

csv_data = "  a , 1 , x \n  b , 2 , y \n"
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    trim_whitespace=True,
    remove_columns=[2],       # drop third column
)
print(out)

Output

a,1
b,2

Deduplicate only

import textwizard as tw

csv_data = "k,v\nA,1\nA,1\nB,2\n"
out = tw.clean_csv(csv_data, remove_duplicates_rows=True)
print(out)

Output

k,v
A,1
B,2

Returns

str — cleaned CSV serialized with the specified dialect.

Operational notes

  • Column name matching requires a header row. Otherwise use indices.

  • remove_values blanks matching cells (does not drop rows). Supports wildcards * and ?.

  • remove_empty_columns/remove_empty_rows run after other operations.

  • Deduplication compares full serialized rows with dialect normalization.

Errors

  • Invalid dialect or malformed CSV may raise parsing errors.

  • Unknown quoting constant → ValueError.

See also