Text similarity¶
Compute a similarity score between two strings using one of three measures: cosine, Jaccard, or Levenshtein.
Returns a single float in [0.0, 1.0] (1.0 ≡ identical).
Parameters¶
a (str): First text.
b (str): Second text.
method ("cosine" | "jaccard" | "levenshtein"; default "cosine"): Similarity measure to use.
Returns¶
float – similarity score in the range 0.0–1.0.
Methods¶
cosine: Cosine similarity on unigram + bigram TF vectors built from lowercase word tokens. If either vector is all zeros (no tokens), the score is 0.0.
jaccard: Jaccard index on sets of lowercase word tokens: J(A, B) = |A ∩ B| / |A ∪ B| (0 if both sets are empty).
levenshtein: Normalised edit-distance similarity: sim = 1 − dist(a, b) / max(len(a), len(b)). Exact match → 1.0; completely different strings → closer to 0.0. Uses a memory-efficient dynamic programming routine.
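The Levenshtein formula and the "memory-efficient" remark can be illustrated with a two-row dynamic programming sketch. This is not textwizard's actual implementation; the function names `levenshtein_distance` and `levenshtein_similarity` are hypothetical, and returning 1.0 for two empty strings is an assumption (the formula itself would divide by zero there).

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance keeping only two rows of the DP table in memory."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string so each row is as short as possible
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """sim = 1 - dist(a, b) / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


# "kitten" -> "sitting" needs 3 edits over a maximum length of 7.
print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_similarity("kitten", "sitting"))
```

Only two rows of length min(len(a), len(b)) + 1 are alive at any time, so memory stays linear while time remains O(len(a)·len(b)).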
Examples¶
Basic usage¶
import textwizard as tw
s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3)
Output (placeholder)
0.5714285714285714 0.6666666666666666 0.33333333333333337
More examples¶
Cosine with bigrams:
import textwizard as tw
print(tw.text_similarity("deep learning", "learning deep nets", method="cosine"))
Output
0.5163977794943222
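The score above can be checked by hand: the first text yields 3 features (deep, learning, "deep learning"), the second yields 5, and only the two unigrams overlap, giving 2 / (√3 · √5) ≈ 0.516. The sketch below reproduces this under stated assumptions: raw counts are used as term frequencies, tokens come from `\w+` on the lowercased text, and the helper names `ngram_counts` and `cosine_similarity` are hypothetical rather than part of textwizard.

```python
import math
import re
from collections import Counter


def ngram_counts(text: str) -> Counter:
    """Count unigrams and bigrams of lowercase word tokens (assumed tokenisation)."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two count vectors."""
    va, vb = ngram_counts(a), ngram_counts(b)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # documented convention: an all-zero vector scores 0.0
    dot = sum(count * vb[term] for term, count in va.items())
    return dot / (norm_a * norm_b)


print(cosine_similarity("deep learning", "learning deep nets"))
```

Because cosine similarity is invariant to scaling either vector, using raw counts or length-normalised frequencies gives the same score.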
Jaccard on short phrases:
import textwizard as tw
print(tw.text_similarity("the quick brown fox", "the quick red fox", method="jaccard"))
Output
0.6
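The 0.6 follows directly from the set definition: the two phrases share {the, quick, fox} (3 tokens) out of a union of 5. A minimal sketch, assuming `\w+` tokenisation on lowercased text; `jaccard_similarity` is a hypothetical name, not the library's internal function.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| on sets of lowercase word tokens."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    union = set_a | set_b
    if not union:
        return 0.0  # documented convention: 0 if both sets are empty
    return len(set_a & set_b) / len(union)


print(jaccard_similarity("the quick brown fox", "the quick red fox"))
# → 0.6
```

Since sets discard duplicates and order, repeated or reordered words do not change the score.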
Levenshtein identical / empty:
import textwizard as tw
print(tw.text_similarity("same", "same", method="levenshtein")) # 1.0
print(tw.text_similarity("", "nonempty", method="levenshtein")) # 0.0
Output
1.0
0.0
Edge cases & notes¶
Mixed scripts and punctuation: tokens are extracted with \w+; punctuation is ignored for cosine/jaccard.
Very short texts can yield low cosine/jaccard scores due to sparse tokens; Levenshtein may be more stable.
Complexity:
• Cosine/Jaccard – linear in token count.
• Levenshtein – O(len(a)·len(b)) time, memory-optimised; suitable for short/medium strings.
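To see what `\w+` extraction means for mixed scripts and punctuation, the sketch below tokenises a few inputs with Python's `re` module. It assumes the text is lowercased before matching; `word_tokens` is a hypothetical helper, not a textwizard function.

```python
import re
from typing import List


def word_tokens(text: str) -> List[str]:
    """Lowercase the text and keep runs of word characters (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


print(word_tokens("Hello, world! Привет, мир."))
# → ['hello', 'world', 'привет', 'мир']
print(word_tokens("don't stop"))
# → ['don', 't', 'stop']
print(word_tokens("?!..."))
# → []
```

Note that apostrophes split words and that a punctuation-only string produces no tokens, which is the case where cosine and Jaccard fall back to 0.0.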