Text statistics¶
Compute a compact suite of lexical diversity and distribution metrics from a string.
Note
All numeric values are rounded to 3 decimals.
If there are fewer than 2 distinct tokens, zipf.slope and zipf.r2 are NaN.
What it computes¶
entropy – Shannon entropy (bits per token) of the token-frequency distribution.
zipf.slope / zipf.r2 – slope and R² of the linear fit on log10(rank) → log10(freq).
vocab_gini – Gini coefficient of type-frequency inequality (0 = uniform, 1 = maximal inequality).
type_token_ratio – |V| / N, unique types over total tokens.
hapax_ratio – share of tokens that occur exactly once.
simpson_index – Simpson diversity index: 1 − Σ (f_w / N)².
yule_k – Yule’s K (lexical concentration): 10⁴ · (Σ i²·V_i − N) / N².
avg_word_length – average token length (characters), weighted by frequency.
Parameters¶
text (str): Input string to analyze.
Returns¶
dict: A dictionary with the following keys (all floats unless noted):
entropy
zipf (dict) → {"slope": float, "r2": float}
vocab_gini
type_token_ratio
hapax_ratio
simpson_index
yule_k
avg_word_length
Metric definitions¶
Shannon entropy: H = − Σ_w (f_w / N) · log₂(f_w / N)
Zipf fit: linear regression of y = log10(freq) on x = log10(rank), with types ranked by descending frequency; reports slope and R².
Gini: with frequencies v₁ ≤ v₂ ≤ … ≤ v_m, total tokens N, and m types: G = (2 · Σ_{i=1..m} i·v_i) / (m·N) − (m + 1) / m
Type–Token Ratio: TTR = |V| / N
Hapax ratio: (#types with frequency 1) / N
Simpson index: 1 − Σ_w (f_w / N)²
Yule’s K: K = 10⁴ · (Σ_{i=1..M} i²·V_i − N) / N², where V_i = #types with frequency i and M is the highest frequency of any type
Average word length: (Σ_w |w| · f_w) / N
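The count-based definitions above can be sketched in plain Python. This is an illustrative re-implementation, not the library's own code: the function name sketch_metrics is hypothetical, the tokenization follows the text.lower().split() rule mentioned in the Notes, and raising on empty input is an assumption, since the page does not say how empty strings are handled. The Zipf fit is left out here.

```python
import math
from collections import Counter


def sketch_metrics(text: str) -> dict:
    """Illustrative sketch of the count-based metrics (hypothetical helper)."""
    tokens = text.lower().split()      # tokenization as described in the Notes
    if not tokens:
        # Assumption: the documented behaviour for empty input is unknown.
        raise ValueError("need at least one token")

    counts = Counter(tokens)
    n = len(tokens)                    # N: total tokens
    m = len(counts)                    # m = |V|: distinct types
    freqs = sorted(counts.values())    # v_1 <= ... <= v_m, as the Gini formula expects

    entropy = -sum((f / n) * math.log2(f / n) for f in freqs)
    gini = 2 * sum(i * v for i, v in enumerate(freqs, start=1)) / (m * n) - (m + 1) / m
    simpson = 1 - sum((f / n) ** 2 for f in freqs)
    # Summing i^2 * V_i over frequencies i is the same as summing f^2 over types.
    yule_k = 1e4 * (sum(f * f for f in freqs) - n) / n**2
    hapax = sum(1 for f in freqs if f == 1) / n
    avg_len = sum(len(w) * f for w, f in counts.items()) / n

    return {
        "entropy": round(entropy, 3),
        "vocab_gini": round(gini, 3),
        "type_token_ratio": round(m / n, 3),
        "hapax_ratio": round(hapax, 3),
        "simpson_index": round(simpson, 3),
        "yule_k": round(yule_k, 3),
        "avg_word_length": round(avg_len, 3),
    }


print(sketch_metrics("a a a b b c d e f g"))
```

For the input "a a a b b c d e f g" (N = 10, seven types with frequencies 3, 2, 1, 1, 1, 1, 1), working the formulas by hand gives entropy 2.646, Gini 0.229, TTR 0.7, hapax ratio 0.5, Simpson 0.82, and Yule's K 800.0, which the sketch should reproduce.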
Examples¶
Basic usage¶
import textwizard as tw
stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)
Output
{'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}
Single repeated token:
import textwizard as tw
print(tw.analyze_text_statistics("hello hello hello"))
Output
{'entropy': -0.0, 'zipf': {'slope': nan, 'r2': nan}, 'vocab_gini': 0.0, 'type_token_ratio': 0.333, 'hapax_ratio': 0.0, 'simpson_index': 0.0, 'yule_k': 6666.667, 'avg_word_length': 5.0}
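When consuming a result like the one above, note that NaN never compares equal to anything, including itself, so an equality check on zipf.slope silently fails. A minimal sketch of a safer check follows; it uses a hand-copied literal of the documented output rather than calling the library, and the summary variable is purely illustrative.

```python
import math

# Hand-copied from the documented output for "hello hello hello";
# no library call is made in this sketch.
stats = {
    "entropy": -0.0,
    "zipf": {"slope": float("nan"), "r2": float("nan")},
    "vocab_gini": 0.0,
    "type_token_ratio": 0.333,
    "hapax_ratio": 0.0,
    "simpson_index": 0.0,
    "yule_k": 6666.667,
    "avg_word_length": 5.0,
}

slope = stats["zipf"]["slope"]

# `slope == float("nan")` is always False, so use math.isnan instead.
if math.isnan(slope):
    summary = "zipf fit unavailable (fewer than 2 distinct tokens)"
else:
    summary = f"zipf slope {slope}"

print(summary)
```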
Notes¶
Tokenization is deliberately simple (text.lower().split()) to keep metrics stable and fast.
NaN appears for Zipf metrics when fewer than two distinct tokens are present.
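To see why the Zipf metrics need at least two distinct tokens, here is a minimal sketch of the log-log fit using ordinary least squares. It is an assumption about how such a fit could be done, not the library's implementation: sketch_zipf_fit is a hypothetical name, and returning NaN for R² when every type has the same frequency is a choice made for this sketch only. With a single type there is only one (rank, freq) point, so no line can be fitted.

```python
import math
from collections import Counter


def sketch_zipf_fit(text: str) -> dict:
    """Illustrative least-squares fit of log10(freq) on log10(rank)."""
    tokens = text.lower().split()
    # Rank 1 is the most frequent type, rank 2 the next, and so on.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        # One point (or none) cannot define a line.
        return {"slope": math.nan, "r2": math.nan}

    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # For a one-variable linear fit, R^2 equals the squared Pearson correlation.
    # Assumption of this sketch: report NaN when all frequencies are equal.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else math.nan
    return {"slope": round(slope, 3), "r2": round(r2, 3)}


print(sketch_zipf_fit("a a a b b c d e f g"))
print(sketch_zipf_fit("hello hello hello"))
```

On the basic-usage input this fit works out to a slope of about −0.605 and an R² of about 0.838, in line with the documented output; the single-token input takes the early NaN branch.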