Text similarity

Compute a similarity score between two strings using one of three measures Cosine ,Jaccard, Levenshtein. Returns a single float in [0.0, 1.0] (1.0 ≡ identical).

Parameters

  • a (str): First text.

  • b (str): Second text.

  • method ("cosine" | "jaccard" | "levenshtein"; default "cosine")

Returns

  • float – similarity score in the range 0.0 – 1.0.

Methods

  • cosine Cosine similarity on unigram + bigram TF vectors built from lowercase word tokens. If either vector is all zeros (no tokens), the score is 0.0.

  • jaccard Jaccard index on sets of lowercase word tokens: J(A, B) = |A B| / |A B| (0 if both sets are empty).

  • levenshtein Normalised edit-distance similarity: sim = 1 dist(a, b) / max(len(a), len(b)) Exact match → 1.0; completely different strings → closer to 0.0. Uses a memory-efficient dynamic programming routine.

Examples

Basic usage

import textwizard as tw

s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3)

Output (placeholder)

0.5714285714285714
0.6666666666666666
0.33333333333333337

More examples

Cosine with bigrams:

import textwizard as tw
print(tw.text_similarity("deep learning", "learning deep nets", method="cosine"))

Output

0.5163977794943222

Jaccard on short phrases:

import textwizard as tw
print(tw.text_similarity("the quick brown fox", "the quick red fox", method="jaccard"))

Output

0.6

Levenshtein identical / empty:

import textwizard as tw
print(tw.text_similarity("same", "same", method="levenshtein"))  # 1.0
print(tw.text_similarity("", "nonempty", method="levenshtein"))  # 0.0

Output

1.0
0.0

Edge cases & notes

  • Mixed scripts and punctuation: tokens are extracted with \w+; punctuation is ignored for cosine/jaccard.

  • Very short texts can yield low cosine/jaccard scores due to sparse tokens; Levenshtein may be more stable.

  • Complexity: • Cosine/Jaccard – linear in token count. • Levenshtein – O(len(a)·len(b)) time, memory-optimised; suitable for short/medium strings.