Clean CSV
Deterministic CSV cleanup with full dialect control and structural fixes.
Removes columns or rows, blanks values by exact match or wildcard, trims whitespace, deduplicates, and drops empty rows/columns. Always returns a CSV str serialized with the chosen dialect.
Parameters
| Parameter | Description |
|---|---|
| | (str) Raw CSV input. |
| delimiter | (str) Field separator. Default: |
| quotechar | (str) Quote character. Default: |
| | (str \| None) Escape prefix for quotechar. Default: |
| | (bool) Double the quotechar to escape inside fields. Default: |
| | (bool) Skip spaces right after delimiter. Default: |
| lineterminator | (str) Line terminator for output. Default: |
| quoting | (int) One of the csv.QUOTE_* constants, e.g. csv.QUOTE_ALL. |
| remove_columns | (str \| int \| list) Columns to drop by name (requires header) or 0-based index. |
| remove_row_index | (int \| list) Row indices to drop (0-based over the parsed rows). |
| remove_values | (str \| int \| list) Values to blank out. Supports wildcards * and ?. |
| remove_duplicates_rows | (bool) Remove duplicate records (exact row match). Default: |
| trim_whitespace | (bool) Strip leading/trailing whitespace inside fields. Default: |
| remove_empty_columns | (bool) Drop columns that are empty after cleaning. Default: |
| remove_empty_rows | (bool) Drop rows with all-empty fields after cleaning. Default: |
Note
Row indexing is 0-based over the physical row order as parsed.
If your CSV has a header, the header is row index 0.
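To see which index each row will have before choosing values for remove_row_index, the rows can be enumerated with the standard library. This is a minimal sketch that only uses csv and io; the helper name index_rows is hypothetical and not part of TextWizard.

```python
import csv
import io


def index_rows(text, delimiter=","):
    """Return (index, row) pairs in parsed order; a header, if present, is index 0."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return list(enumerate(reader))


csv_data = "h1,h2\nx1,y1\nx2,y2\nx3,y3\n"
for idx, row in index_rows(csv_data):
    print(idx, row)
```

With this input the header is listed at index 0 and the row x2,y2 at index 2.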
Examples
Basic cleanup: drop columns, redact values, trim, dedupe
import textwizard as tw

csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""

out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)
Output
name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York
Drop by index vs by name
import textwizard as tw

csv_data = "A,B,C\n1,2,3\n4,5,6\n"

by_name = tw.clean_csv(csv_data, remove_columns=["B"])
by_index = tw.clean_csv(csv_data, remove_columns=[1])  # same column
print(by_name)
print(by_index)
Output
A,C
1,3
4,6
A,C
1,3
4,6
Remove specific rows
import textwizard as tw

csv_data = "h1,h2\nx1,y1\nx2,y2\nx3,y3\n"

out = tw.clean_csv(csv_data, remove_row_index=[2])  # drops "x2,y2" (the header is index 0)
print(out)
Output
h1,h2
x1,y1
x3,y3
Set the dialect: semicolon delimiter, quote all
import csv

import textwizard as tw

csv_data = "id;name;note\n1;Alice;\"a; b\""

out = tw.clean_csv(
    csv_data,
    delimiter=";",
    quoting=csv.QUOTE_ALL,
    lineterminator="\n",
)
print(out)
Output
"id";"name";"note"
"1";"Alice";"a; b"
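The call above passes a single delimiter. To re-serialize semicolon-delimited text with commas, one option outside TextWizard is the standard library csv module. The sketch below is an assumption-laden illustration, not the library's method; convert_delimiter is a hypothetical helper name.

```python
import csv
import io


def convert_delimiter(text, src=";", dst=","):
    """Re-serialize CSV text from one delimiter to another, quoting every field."""
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=src)
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


csv_data = 'id;name;note\n1;Alice;"a; b"'
print(convert_delimiter(csv_data))
```

Reading with one delimiter and writing with another keeps the quoted field a; b intact as a single value, because the split happens at the parser level rather than by string replacement.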
Wildcard Examples
import textwizard as tw

csv_data = "name,email\nJohn,john.doe@example.com\nJane,jane@corp.com\n"

out = tw.clean_csv(
    csv_data,
    remove_values=["*@example.com", "Jane"],  # blanks fields matching these patterns
)
print(out)
Output
name,email
John,
,jane@corp.com
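The wildcard behaviour can be approximated with shell-style globbing from the standard library. This sketch assumes that * and ? behave like fnmatch patterns and that matching is case-sensitive; neither is confirmed here for TextWizard, and blank_matching is a hypothetical helper. Note that fnmatch also interprets [seq] character classes, which may differ from the library.

```python
import fnmatch


def blank_matching(rows, patterns):
    """Replace any cell that matches one of the glob patterns with an empty string."""
    return [
        ["" if any(fnmatch.fnmatchcase(cell, p) for p in patterns) else cell for cell in row]
        for row in rows
    ]


rows = [
    ["name", "email"],
    ["John", "john.doe@example.com"],
    ["Jane", "jane@corp.com"],
]
for row in blank_matching(rows, ["*@example.com", "Jane"]):
    print(",".join(row))
```

The rows keep their length: matching cells become empty strings and no row is dropped.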
Headerless CSV: drop by index and trim
import textwizard as tw

csv_data = " a , 1 , x \n b , 2 , y \n"

out = tw.clean_csv(
    csv_data,
    delimiter=",",
    trim_whitespace=True,
    remove_columns=[2],  # drop third column
)
print(out)
Output
a,1
b,2
Deduplicate only
import textwizard as tw

csv_data = "k,v\nA,1\nA,1\nB,2\n"

out = tw.clean_csv(csv_data, remove_duplicates_rows=True)
print(out)
Output
k,v
A,1
B,2
Returns
str — cleaned CSV serialized with the specified dialect.
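Because the result is a plain string, it can be parsed back into records for further processing. The sketch below uses a literal stand-in for the returned value (copied from the "Deduplicate only" example and assuming "\n" line endings) so that it runs without TextWizard installed.

```python
import csv
import io

# Stand-in for the string returned by clean_csv; the "\n" terminator is an assumption.
cleaned = "k,v\nA,1\nB,2\n"

# DictReader treats the first row as the header and yields one dict per record.
records = list(csv.DictReader(io.StringIO(cleaned)))
print(records)
```

If a non-default delimiter or quote character was used for cleaning, the same settings would need to be passed to the reader.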
Operational notes
Column name matching requires a header row. Otherwise use indices.
remove_values blanks matching cells (does not drop rows). Supports wildcards * and ?.
remove_empty_columns / remove_empty_rows run after other operations.
Deduplication compares full serialized rows with dialect normalization.
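The last two notes can be illustrated with a standard-library sketch: rows are compared by their serialized form, and empty rows are dropped only afterwards. Keeping the first occurrence of a duplicate is an assumption, and both helper names are hypothetical; this is not TextWizard's implementation.

```python
import csv
import io


def serialize(row, **fmt):
    """Serialize one row to a CSV line using the given csv.writer formatting options."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **fmt).writerow(row)
    return buf.getvalue()


def dedupe_then_drop_empty(rows, **fmt):
    """Keep the first occurrence of each serialized row, then drop all-empty rows."""
    seen = set()
    unique = []
    for row in rows:
        key = serialize(row, **fmt)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Empty-row removal runs last, after deduplication.
    return [row for row in unique if any(cell != "" for cell in row)]


rows = [["k", "v"], ["A", "1"], ["A", "1"], ["", ""], ["B", "2"]]
print(dedupe_then_drop_empty(rows))
```

Comparing serialized lines rather than raw input text means two rows that differ only in optional quoting in the source are treated as equal once parsed.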
Errors
Invalid dialect or malformed CSV may raise parsing errors.
Unknown quoting constant → ValueError.
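A caller that wants to fail early can validate the quoting value before passing it on. The following is a minimal sketch of such a pre-check; check_quoting and KNOWN_QUOTING are hypothetical names, and the set lists only the four long-standing csv constants (newer Python versions define more).

```python
import csv

# The four classic constants; newer Python versions define additional ones.
KNOWN_QUOTING = {csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE}


def check_quoting(quoting):
    """Raise ValueError for values that are not known csv quoting constants."""
    if quoting not in KNOWN_QUOTING:
        raise ValueError(f"unknown quoting constant: {quoting!r}")
    return quoting


try:
    check_quoting(99)
except ValueError as exc:
    print(f"rejected: {exc}")
```

The same try/except shape applies when calling the cleaning function directly: catch ValueError for a bad quoting value and handle parsing errors separately.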
See also
Clean HTML — HTML cleanup
Clean XML — XML cleanup