Co-occurrence
kenon.cooccurrence.build_cooccurrence_graph(tokens, window=2, stopwords=None, min_weight=0.0)
Build a weighted co-occurrence graph using skip-gram windows.
Each node is a token. Each edge weight is the relative co-occurrence frequency of the two tokens within the specified window.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[Token]` | Flat list of tokens (already lowercased / lemmatised as desired). | required |
| `window` | `int` | Half-width of the skip-gram context window. A window of 2 means each token is paired with the 2 tokens before and 2 after. | `2` |
| `stopwords` | `frozenset[str] \| None` | Tokens to exclude from nodes and edges. | `None` |
| `min_weight` | `float` | Drop edges with weight below this threshold. | `0.0` |
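To make the `window` and `min_weight` semantics concrete, the following is a minimal pure-Python sketch of half-width window pairing. It does not call kenon: `window_pair_weights` is a hypothetical helper, and dividing each pair count by the total number of counted pairs is only one plausible reading of "relative co-occurrence frequency", not necessarily what the library does.

```python
from collections import Counter


def window_pair_weights(tokens, window=2, min_weight=0.0):
    """Hypothetical helper: weight unordered token pairs within a half-width window."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Looking only forward visits each position pair once; the
        # "tokens before" side is covered when the earlier token is `left`.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            right = tokens[j]
            if left == right:
                continue  # skip self-pairs so no self-loops arise
            counts[tuple(sorted((left, right)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: weight = pair count / total number of counted pairs.
    weights = {pair: count / total for pair, count in counts.items()}
    # Keep only pairs whose weight reaches the threshold.
    return {pair: w for pair, w in weights.items() if w >= min_weight}


tokens = ["cat", "sat", "mat", "cat", "mat"]
weights = window_pair_weights(tokens, window=1)
print(weights)
```

With `window=1` only adjacent tokens are paired, so `("cat", "mat")` is counted twice and the other two pairs once each. Raising `min_weight` above the weight of the rarer pairs leaves only the strongest edge.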
Returns:
| Type | Description |
|---|---|
| `SemanticGraph` | A |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If |
Contract
- No self-loops in the returned graph.
- All edge weights are positive.
- Stopword filtering happens before counting.
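The last point has a visible consequence: because stopwords are removed before windows are formed, the tokens on either side of a stopword become neighbours. The sketch below illustrates the difference between filtering before and after pairing. It is independent of kenon, and `adjacent_pairs` is a hypothetical helper introduced only for this illustration.

```python
def adjacent_pairs(tokens, window=1):
    """Hypothetical helper: set of unordered pairs within the half-width window."""
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:  # no self-loops
                pairs.add(frozenset((left, right)))
    return pairs


tokens = ["cat", "on", "the", "mat"]
stopwords = frozenset({"on", "the"})

# Filter first, then pair: "cat" and "mat" end up next to each other.
filtered_first = adjacent_pairs([t for t in tokens if t not in stopwords])

# Pair first, then drop pairs that touch a stopword: the link is lost.
filtered_after = {p for p in adjacent_pairs(tokens) if not (p & stopwords)}

print(filtered_first)
print(filtered_after)
```

Under the documented contract, the first behaviour is the one to expect: a small `window` can still connect content words that were separated by stopwords in the raw text.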
Example
    >>> tokens = ["cat", "sat", "mat", "cat", "mat"]
    >>> g = build_cooccurrence_graph(tokens, window=1)
    >>> g.has_node("cat")
    True
    >>> g["cat"]["sat"]["weight"] > 0
    True
Source code in kenon/cooccurrence.py
kenon.cooccurrence.detect_collocations(tokens, n=2, metric='pmi', top_n=20, min_freq=2)
Detect statistically significant n-grams using NLTK collocation finders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[Token]` | Flat token list. | required |
| `n` | `int` | N-gram size. Supports 2 (bigrams) and 3 (trigrams). | `2` |
| `metric` | `str` | Scoring metric. One of | `'pmi'` |
| `top_n` | `int` | Number of top collocations to return. | `20` |
| `min_freq` | `int` | Minimum frequency filter applied before scoring. | `2` |
Returns:
| Type | Description |
|---|---|
| `list[tuple[str, ...]]` | List of token tuples sorted by score descending. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If |
| `ValueError` | If |
Contract
- Returns at most `top_n` tuples.
- Each tuple has length `n`.
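As a rough illustration of how `min_freq`, PMI scoring and `top_n` interact for bigrams, here is a hand-rolled sketch. The documented function relies on NLTK collocation finders; this version deliberately does not, so its exact scores and tie-breaking may differ. `top_bigrams_by_pmi` is a hypothetical name, and estimating probabilities from raw counts over the token total is an assumption of the sketch.

```python
import math
from collections import Counter


def top_bigrams_by_pmi(tokens, top_n=20, min_freq=2):
    """Hypothetical helper: rank adjacent bigrams by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # the frequency filter runs before scoring
        # PMI = log2(p(w1, w2) / (p(w1) * p(w2))), estimated from counts.
        pmi = math.log2(freq * n / (unigrams[w1] * unigrams[w2]))
        scored.append(((w1, w2), pmi))
    # Highest score first; the slice enforces the "at most top_n" bound.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in scored[:top_n]]


tokens = ["new", "york", "city", "new", "york", "times"] * 10
colls = top_bigrams_by_pmi(tokens, top_n=5)
print(colls)
```

Both contract points hold by construction here: the slice caps the result at `top_n` entries, and every entry is a 2-tuple because only adjacent pairs are counted.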
Example
    >>> tokens = ["new", "york", "city", "new", "york", "times"] * 10
    >>> colls = detect_collocations(tokens, n=2, top_n=5)
    >>> ("new", "york") in colls
    True
Source code in kenon/cooccurrence.py