=========
Clean CSV
=========

Deterministic CSV cleanup with full dialect control and structural fixes.  
Removes columns or rows, blanks values by exact match or wildcard, trims whitespace, deduplicates, and drops empty rows/columns. Always returns a CSV ``str`` serialized with the chosen dialect.


Parameters
====================

.. list-table::
   :header-rows: 1
   :widths: 26 74

   * - **Parameter**
     - **Description**
   * - ``text``
     - (*str*) Raw CSV input.
   * - ``delimiter``
     - (*str*) Field separator. Default: ``,``
   * - ``quotechar``
     - (*str*) Quote character. Default: ``"``.
   * - ``escapechar``
     - (*str | None*) Escape prefix for quotechar. Default: ``None``.
   * - ``doublequote``
     - (*bool*) Double the quotechar to escape inside fields. Default: ``True``.
   * - ``skipinitialspace``
     - (*bool*) Skip spaces right after delimiter. Default: ``False``.
   * - ``lineterminator``
     - (*str*) Line terminator for output. Default: ``"\n"``.
   * - ``quoting``
     - (*int*) One of ``csv.QUOTE_MINIMAL | QUOTE_ALL | QUOTE_NONE | QUOTE_NONNUMERIC``.
   * - ``remove_columns``
     - (*str | int | list*) Columns to drop by **name** (requires header) or **0-based index**.
   * - ``remove_row_index``
     - (*int | list*) Row indices to drop (0-based over the parsed rows).
   * - ``remove_values``
     - (*str | int | list*) Values to blank out. Supports wildcards ``*`` and ``?``.
   * - ``remove_duplicates_rows``
     - (*bool*) Remove duplicate records (exact row match). Default: ``False``.
   * - ``trim_whitespace``
     - (*bool*) Strip leading/trailing whitespace inside fields. Default: ``False``.
   * - ``remove_empty_columns``
     - (*bool*) Drop columns that are empty after cleaning. Default: ``False``.
   * - ``remove_empty_rows``
     - (*bool*) Drop rows with all-empty fields after cleaning. Default: ``False``.

.. note::
   **Row indexing** is 0-based over the physical row order as parsed.  
   If your CSV has a header, the header is row index ``0``.

Examples
========

Basic cleanup: drop columns, redact values, trim, dedupe
--------------------------------------------------------

.. code-block:: python

   import textwizard as tw

   csv_data = """id,name,age,city,salary
   1,John,30,New York,50000
   2,Jane,25,,40000
   3,,35,Los Angeles,60000
   4,Mark,45,,70000
   5,Sarah,40,New York,
   1,John,30,New York,50000
   """
   out = tw.clean_csv(
       csv_data,
       delimiter=",",
       remove_columns=["id", "salary"],
       remove_values=["John", "50000"],
       trim_whitespace=True,
       remove_empty_columns=True,
       remove_empty_rows=True,
       remove_duplicates_rows=True,
   )
   print(out)

**Output**

   .. code-block:: text

    name,age,city
    ,30,New York
    Jane,25,
    ,35,Los Angeles
    Mark,45,
    Sarah,40,New York

Drop by index vs by name
------------------------

.. code-block:: python

   import textwizard as tw

   csv_data = "A,B,C\n1,2,3\n4,5,6\n"
   by_name  = tw.clean_csv(csv_data, remove_columns=["B"])
   by_index = tw.clean_csv(csv_data, remove_columns=[1])  # same column
   print(by_name); print(by_index)

**Output**

   .. code-block:: text

    A,C
    1,3
    4,6
    
    A,C
    1,3
    4,6

Remove specific rows
------------------------------

.. code-block:: python

   import textwizard as tw

   csv_data = "h1,h2\nx1,y1\nx2,y2\nx3,y3\n"
   out = tw.clean_csv(csv_data, remove_row_index=[2])  # drops "x2,y2"
   print(out)

**Output**

   .. code-block:: text

    h1,h2
    x1,y1
    x3,y3

Normalize dialect: semicolon → comma, quote all
-----------------------------------------------

.. code-block:: python

   import textwizard as tw, csv

   csv_data = "id;name;note\n1;Alice;\"a; b\""
   out = tw.clean_csv(
       csv_data,
       delimiter=";",             
       quoting=csv.QUOTE_ALL,   
       lineterminator="\n",
   )
   print(out)

**Output** 

   .. code-block:: text

    id;name;note
    1;Alice;"a; b"

Wildcard Examples
---------------------------------

.. code-block:: python

   import textwizard as tw

   csv_data = "name,email\nJohn,john.doe@example.com\nJane,jane@corp.com\n"
   out = tw.clean_csv(
       csv_data,
       remove_values=["*@example.com", "Jane"],  # blanks fields matching these patterns
   )
   print(out)

**Output**

   .. code-block:: text

    name,email
    John,
    ,jane@corp.com

Headerless CSV: drop by index and trim
--------------------------------------

.. code-block:: python

   import textwizard as tw

   csv_data = "  a , 1 , x \n  b , 2 , y \n"
   out = tw.clean_csv(
       csv_data,
       delimiter=",",
       trim_whitespace=True,
       remove_columns=[2],       # drop third column
   )
   print(out)

**Output** 

   .. code-block:: text

      a,1
      b,2

Deduplicate only
----------------

.. code-block:: python

   import textwizard as tw

   csv_data = "k,v\nA,1\nA,1\nB,2\n"
   out = tw.clean_csv(csv_data, remove_duplicates_rows=True)
   print(out)

**Output**

   .. code-block:: text

      k,v
      A,1
      B,2

Returns
=======

``str`` — cleaned CSV serialized with the specified dialect.

Operational notes
=================

- Column name matching requires a header row. Otherwise use indices.  
- ``remove_values`` blanks matching cells (does not drop rows). Supports wildcards ``*`` and ``?``.  
- ``remove_empty_columns``/``remove_empty_rows`` run **after** other operations.  
- Deduplication compares full serialized rows with dialect normalization.

Errors
======

- Invalid dialect or malformed CSV may raise parsing errors.  
- Unknown quoting constant → ``ValueError``.

See also
========

- :doc:`clean_html` — HTML cleanup
- :doc:`clean_xml` — XML cleanup