Text similarity¶

Compute a similarity score between two strings using one of three measures: cosine, Jaccard, or Levenshtein. Returns a single float in [0.0, 1.0], where 1.0 means the inputs are identical.

Parameters¶

a (str): First text.
b (str): Second text.
method ("cosine" | "jaccard" | "levenshtein"; default "cosine"): Similarity measure to use.

Returns¶

float – similarity score in the range 0.0 – 1.0.

Methods¶

cosine: Cosine similarity on unigram + bigram TF vectors built from lowercase word tokens. If either vector is all zeros (no tokens), the score is 0.0.
jaccard: Jaccard index on sets of lowercase word tokens, J(A, B) = |A ∩ B| / |A ∪ B| (0 if both sets are empty).
levenshtein: Normalised edit-distance similarity, sim = 1 − dist(a, b) / max(len(a), len(b)). An exact match gives 1.0; strings with almost nothing in common score close to 0.0. Uses a memory-efficient dynamic programming routine.

Examples¶

Basic usage¶

import textwizard as tw

s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3)

Output (placeholder)

0.5714285714285714 0.6666666666666666 0.33333333333333337

More examples¶

Cosine with bigrams:

import textwizard as tw
print(tw.text_similarity("deep learning", "learning deep nets", method="cosine"))

Output

0.5163977794943222
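
To see where a value like this comes from, the cosine measure described under Methods can be reproduced by hand. The following is a minimal sketch, not the library's implementation: `ngram_tf` and `cosine_sketch` are hypothetical helper names, and it assumes raw term counts and `\w+` tokenisation.

```python
import math
import re
from collections import Counter


def ngram_tf(text):
    """Count unigrams and bigrams of the lowercase word tokens (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine_sketch(a, b):
    """Cosine similarity of the two count vectors; 0.0 if either one is empty."""
    va, vb = ngram_tf(a), ngram_tf(b)
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no tokens on at least one side
    return dot / (norm_a * norm_b)


# Shared terms: "deep" and "learning". The bigrams differ because word order differs.
# Vector sizes are 3 and 5, so the score is 2 / (sqrt(3) * sqrt(5)).
score = cosine_sketch("deep learning", "learning deep nets")
print(score)
```

Under these assumptions the two texts share only the unigrams "deep" and "learning", which gives 2 / √15 ≈ 0.516 and agrees with the output above.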

Jaccard on short phrases:

import textwizard as tw
print(tw.text_similarity("the quick brown fox", "the quick red fox", method="jaccard"))

Output

0.6
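
The same result can be checked with plain Python sets. This is an illustrative sketch only: `jaccard_sketch` is a hypothetical name, and `\w+` tokenisation on lowercased text is assumed.

```python
import re


def jaccard_sketch(a, b):
    """Jaccard index on sets of lowercase word tokens; 0.0 if both sets are empty."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    union = set_a | set_b
    if not union:
        return 0.0  # both inputs had no tokens
    return len(set_a & set_b) / len(union)


# Intersection: {the, quick, fox}; union adds "brown" and "red", so the ratio is 3 / 5.
score = jaccard_sketch("the quick brown fox", "the quick red fox")
print(score)
```

Because the measure works on sets, repeated words and word order do not affect the score.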

Levenshtein identical / empty:

import textwizard as tw
print(tw.text_similarity("same", "same", method="levenshtein"))  # 1.0
print(tw.text_similarity("", "nonempty", method="levenshtein"))  # 0.0

Output

1.0
0.0
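
For readers who want to trace these values, here is a minimal sketch of the normalised formula using a two-row dynamic programming table. It is one common way to keep memory low and is not necessarily the routine the library uses. `levenshtein_similarity_sketch` is a hypothetical name, and returning 1.0 for two empty strings is an assumption.

```python
def levenshtein_similarity_sketch(a, b):
    """1 - dist(a, b) / max(len(a), len(b)), keeping only two DP rows in memory."""
    if a == b:
        return 1.0  # exact match; assumed to cover two empty strings as well
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string so each row stays short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # deleting i characters turns a[:i] into ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    distance = previous[-1]
    return 1.0 - distance / max(len(a), len(b))


print(levenshtein_similarity_sketch("same", "same"))
print(levenshtein_similarity_sketch("", "nonempty"))
print(levenshtein_similarity_sketch("kitten", "sitting"))
```

With an empty first argument the distance equals the length of the other string, so the ratio is 1 and the similarity is 0.0. For "kitten" and "sitting" the distance is 3 over a maximum length of 7.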

Edge cases & notes¶

Mixed scripts and punctuation: tokens are extracted with \w+; punctuation is ignored for cosine/jaccard.
Very short texts can yield low cosine/jaccard scores because they contain few tokens, so a single differing word changes the score sharply; Levenshtein may be more stable.
Complexity:
• Cosine/Jaccard – linear in token count.
• Levenshtein – O(len(a)·len(b)) time, memory-optimised; suitable for short/medium strings.
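
The tokenisation and short-text notes above can be illustrated with the standard `re` module. This is a sketch under the assumption that tokens are the lowercase `\w+` matches of a Python `str`; `tokenize` is a hypothetical helper, not a function exposed by the library.

```python
import re


def tokenize(text):
    """Lowercase the text and return its \\w+ matches (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())


# Punctuation and whitespace only separate tokens; for str patterns, \w also
# matches non-Latin letters and digits.
tokens = tokenize("Hello, world! Привет-мир 42")
print(tokens)

# The underscore is a word character, so identifiers stay in one piece.
print(tokenize("snake_case"))

# Short texts: two one-word inputs that differ by one letter share no tokens,
# so any token-based score is 0 even though the strings are nearly identical.
overlap = set(tokenize("colour")) & set(tokenize("color"))
print(overlap)
```

In the last case a character-level measure such as Levenshtein still registers that only one edit separates the two words, which is why it can be the steadier choice for very short inputs.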