================ Text similarity ================ Compute a similarity score between two strings using one of three measures **Cosine ,Jaccard, Levenshtein**. Returns a single **float in [0.0, 1.0]** (``1.0`` ≡ identical). Parameters ========== - ``a`` (``str``): First text. - ``b`` (``str``): Second text. - ``method`` (``"cosine" | "jaccard" | "levenshtein"``; default ``"cosine"``) Returns ======= - ``float`` – similarity score in the range **0.0 – 1.0**. Methods ======= - **cosine** Cosine similarity on **unigram + bigram** TF vectors built from lowercase word tokens. If either vector is all zeros (no tokens), the score is ``0.0``. - **jaccard** Jaccard index on **sets** of lowercase word tokens: ``J(A, B) = |A ∩ B| / |A ∪ B|`` (``0`` if both sets are empty). - **levenshtein** Normalised edit-distance similarity: ``sim = 1 − dist(a, b) / max(len(a), len(b))`` Exact match → ``1.0``; completely different strings → closer to ``0.0``. Uses a memory-efficient dynamic programming routine. Examples ======== Basic usage ----------- .. code-block:: python import textwizard as tw s1 = tw.text_similarity("kitten", "sitting", method="levenshtein") s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard") s3 = tw.text_similarity("abc def", "abc xyz", method="cosine") print(s1, s2, s3) **Output (placeholder)** .. code-block:: text 0.5714285714285714 0.6666666666666666 0.33333333333333337 More examples ------------- Cosine with bigrams: .. code-block:: python import textwizard as tw print(tw.text_similarity("deep learning", "learning deep nets", method="cosine")) **Output** .. code-block:: text 0.5163977794943222 Jaccard on short phrases: .. code-block:: python import textwizard as tw print(tw.text_similarity("the quick brown fox", "the quick red fox", method="jaccard")) **Output** .. code-block:: text 0.6 Levenshtein identical / empty: ------------------------------ .. code-block:: python import textwizard as tw print(tw.text_similarity("same", "same", method="levenshtein")) # 1.0 print(tw.text_similarity("", "nonempty", method="levenshtein")) # 0.0 **Output** .. code-block:: text 1.0 0.0 Edge cases & notes ================== - Mixed scripts and punctuation: tokens are extracted with ``\w+``; punctuation is ignored for cosine/jaccard. - Very short texts can yield low cosine/jaccard scores due to sparse tokens; Levenshtein may be more stable. - Complexity: • Cosine/Jaccard – linear in token count. • Levenshtein – ``O(len(a)·len(b))`` time, memory-optimised; suitable for short/medium strings.