Graph-based models#
TextRank#
- class perke.unsupervised.graph_based.TextRank(valid_pos_tags: Set[str] | None = None)#
Bases:
Extractor
TextRank keyphrase extractor
This model builds a graph that represents the text. A graph-based ranking algorithm is then applied to extract the most important phrases in the text.
In this implementation, nodes are words of certain parts of speech (nouns and adjectives) and edges represent co-occurrence relations, controlled by the distance between word occurrences (here, a window of 2 words). Nodes are weighted by the TextRank graph-based ranking algorithm in its unweighted variant.
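The graph construction and ranking described above can be sketched in plain Python. This is a minimal illustration, not perke's implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the pre-filtered word list are all assumptions made for the example.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window_size=2):
    """Link two distinct words when they occur within `window_size` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # With window_size=2 only the immediately following word is linked.
        for other in words[i + 1 : i + window_size]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def rank_words(graph, damping=0.85, iterations=50):
    """Unweighted TextRank: every node splits its score evenly among neighbours."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


# Hypothetical input: words already filtered to nouns and adjectives.
words = ['keyphrase', 'extraction', 'graph', 'ranking', 'graph', 'model']
graph = build_cooccurrence_graph(words, window_size=2)
scores = rank_words(graph)
print(max(scores, key=scores.get))
```

In this toy input the word with the most distinct neighbours ends up with the highest score, which is the behaviour the unweighted variant is expected to show.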
Note
Implementation of the TextRank model for keyword extraction described in:
Examples
    from pathlib import Path

    from perke.unsupervised.graph_based import TextRank

    # Define the set of valid part of speech tags to occur in the model.
    valid_pos_tags = {'NOUN', 'ADJ'}

    # 1. Create a TextRank extractor.
    extractor = TextRank(valid_pos_tags=valid_pos_tags)

    # 2. Load the text.
    input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
    extractor.load_text(input=input_filepath, word_normalization_method=None)

    # 3. Build the graph representation of the text and weight the
    #    words. Keyphrase candidates are composed of the 33 percent
    #    highest weighted words.
    extractor.weight_candidates(window_size=2, top_t_percent=0.33)

    # 4. Get the 10 highest weighted candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)

    for i, (weight, keyphrase) in enumerate(keyphrases):
        print(f'{i+1}.\t{keyphrase}, \t{weight}')
- Variables:
graph – The word graph
graph_edges_are_weighted – Whether graph edges are weighted
- __init__(valid_pos_tags: Set[str] | None = None) None #
Initializes TextRank.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.
- select_candidates() None #
Selects candidates using the longest sequences of certain parts of speech.
- weight_candidates(window_size: int = 2, top_t_percent: float | None = None, normalize_weights: bool = False) None #
Tailored candidate weighting method for TextRank. Keyphrase candidates are either composed of the top T percent highest-weighted words, as in the original paper, or extracted using the select_candidates method. Candidates are weighted using the sum of the weights of their words, optionally normalized by candidate length.
- Parameters:
window_size (int) – The size of the window for connecting two words in the graph, defaults to 2.
top_t_percent (float | None) – Percentage of top vertices to keep for phrase generation.
normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.
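The two ideas in this method, keeping only the highest-weighted words and scoring a candidate by summing its word weights, can be sketched as follows. The helper names, the sample weights, and the choice to round the kept count down are assumptions for illustration only; perke's internals may differ.

```python
import math


def top_weighted_words(word_weights, top_t_percent):
    """Return the fraction `top_t_percent` of words with the highest weights."""
    # Rounding down is an assumption of this sketch.
    keep = math.floor(len(word_weights) * top_t_percent)
    ranked = sorted(word_weights, key=word_weights.get, reverse=True)
    return set(ranked[:keep])


def candidate_weight(candidate_words, word_weights, normalize_weights=False):
    """Sum the weights of a candidate's words, optionally divided by its length."""
    weight = sum(word_weights[word] for word in candidate_words)
    if normalize_weights:
        weight /= len(candidate_words)
    return weight


# Hypothetical word weights, as a ranking step might produce them.
word_weights = {'graph': 1.8, 'ranking': 1.2, 'phrase': 1.0,
                'model': 0.9, 'text': 0.6, 'important': 0.5}

kept = top_weighted_words(word_weights, top_t_percent=0.5)
total = candidate_weight(['graph', 'ranking'], word_weights)
mean = candidate_weight(['graph', 'ranking'], word_weights, normalize_weights=True)
```

Without normalization, longer candidates tend to accumulate larger weights simply because they contain more words; dividing by length removes that advantage.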
SingleRank#
- class perke.unsupervised.graph_based.SingleRank(valid_pos_tags: Set[str] | None = None)#
Bases:
TextRank
SingleRank keyphrase extractor
This model is an extension of the TextRank model that uses the number of co-occurrences to weight edges in the graph.
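As a rough illustration of edges weighted by co-occurrence counts, the sketch below counts how often each pair of distinct words appears within a window. The function name and the sample word list are hypothetical, and the words are assumed to be already filtered by part of speech.

```python
from collections import Counter


def count_cooccurrences(words, window_size=10):
    """Count, for each unordered word pair, how often it occurs within the window."""
    weights = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window_size]:
            if other != word:
                # frozenset makes the pair unordered, as in an undirected graph.
                weights[frozenset((word, other))] += 1
    return weights


words = ['graph', 'model', 'graph', 'model', 'text']
edge_weights = count_cooccurrences(words, window_size=2)
```

Each count can then serve as the edge weight in a weighted ranking step, so pairs that co-occur often exchange more score than pairs that co-occur once.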
Note
Implementation of the SingleRank model described in:
Xiaojun Wan and Jianguo Xiao. In proceedings of the NCAI, pages 855–860, 2008.
Examples
    from pathlib import Path

    from perke.unsupervised.graph_based import SingleRank

    # Define the set of valid part of speech tags to occur in the model.
    valid_pos_tags = {'NOUN', 'ADJ'}

    # 1. Create a SingleRank extractor.
    extractor = SingleRank(valid_pos_tags=valid_pos_tags)

    # 2. Load the text.
    input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
    extractor.load_text(input=input_filepath, word_normalization_method=None)

    # 3. Select the longest sequences of nouns and adjectives as
    #    candidates.
    extractor.select_candidates()

    # 4. Weight the candidates using the sum of their words' weights,
    #    which are computed using random walk. In the graph, nodes are
    #    certain parts of speech (nouns and adjectives) that are
    #    connected if they co-occur in a window of 10 words.
    extractor.weight_candidates(window_size=10)

    # 5. Get the 10 highest weighted candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)

    for i, (weight, keyphrase) in enumerate(keyphrases):
        print(f'{i+1}.\t{keyphrase}, \t{weight}')
- Variables:
graph_edges_are_weighted – Whether graph edges are weighted
- __init__(valid_pos_tags: Set[str] | None = None) None #
Initializes SingleRank.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.
- weight_candidates(window_size: int = 10, normalize_weights: bool = False, **kwargs) None #
Weights candidates using the weighted variant of the TextRank formulae. Candidates are weighted by the sum of the weights of their words.
- Parameters:
window_size (int) – The size of the window for connecting two words in the graph, defaults to 10.
normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.
PositionRank#
- class perke.unsupervised.graph_based.PositionRank(valid_pos_tags: Set[str] | None = None)#
Bases:
SingleRank
PositionRank keyphrase extractor
This model is an unsupervised approach to extract keyphrases from scholarly documents that incorporates information from all positions of a word’s occurrences into a biased PageRank.
Note
Implementation of the PositionRank described in:
Corina Florescu and Cornelia Caragea. In proceedings of ACL, pages 1105-1115, 2017.
Examples
    from pathlib import Path

    from perke.unsupervised.graph_based import PositionRank

    # Define the set of valid part of speech tags to occur in the model.
    valid_pos_tags = {'NOUN', 'NOUN,EZ', 'ADJ', 'ADJ,EZ'}

    # Define the grammar for selecting the keyphrase candidates.
    grammar = r"""
    NP:
        {<NOUN>}<VERB>
    NP:
        {<DET(,EZ)?|NOUN(,EZ)?|NUM(,EZ)?|ADJ(,EZ)|PRON><DET(,EZ)|NOUN(,EZ)|NUM(,EZ)|ADJ(,EZ)|PRON>*}
        <NOUN>}{<.*(,EZ)?>
    """

    # 1. Create a PositionRank extractor.
    extractor = PositionRank(valid_pos_tags=valid_pos_tags)

    # 2. Load the text.
    input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
    extractor.load_text(
        input=input_filepath,
        word_normalization_method=None,
        universal_pos_tags=False,
    )

    # 3. Select the noun phrases up to 3 words as keyphrase candidates.
    extractor.select_candidates(grammar=grammar, maximum_length=3)

    # 4. Weight the candidates using the sum of their words' weights,
    #    which are computed using random walk biased with the position
    #    of the words in the text. In the graph, nodes are words (nouns
    #    and adjectives only) that are connected if they co-occur in a
    #    window of 10 words.
    extractor.weight_candidates(window_size=10)

    # 5. Get the 10 highest weighted candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)

    for i, (weight, keyphrase) in enumerate(keyphrases):
        print(f'{i+1}.\t{keyphrase}, \t{weight}')
- Variables:
positions – Dict mapping each normalized word to the sum of its inverse positions
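To make "sum of inverse positions" concrete, here is a small sketch. It assumes 1-based word positions, and the scaling step that turns the sums into a probability distribution follows the PositionRank paper's description; both function names are hypothetical and perke may store or compute these values differently.

```python
from collections import defaultdict


def inverse_position_sums(words):
    """Map each word to the sum of 1/position over all of its occurrences."""
    positions = defaultdict(float)
    for position, word in enumerate(words, start=1):
        positions[word] += 1 / position
    return dict(positions)


def to_bias(positions):
    """Scale the sums so they add up to 1 and can bias a random walk."""
    total = sum(positions.values())
    return {word: value / total for word, value in positions.items()}


words = ['model', 'graph', 'text', 'model', 'graph']
positions = inverse_position_sums(words)
bias = to_bias(positions)
```

Here 'model' occurs at positions 1 and 4, giving 1/1 + 1/4 = 1.25, so words that appear early and often receive the largest bias.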
- __init__(valid_pos_tags: Set[str] | None = None) None #
Initializes PositionRank.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'NOUN,EZ', 'ADJ', 'ADJ,EZ'}.
- select_candidates(grammar: str | None = None, maximum_length: int = 3, **kwargs) None #
Candidate selection heuristic using a syntactic part of speech pattern for noun phrase extraction. Keyphrase candidates are noun phrases that match the regular expression (adjective)*(noun)+, up to a given maximum length in words.
- Parameters:
grammar (str | None) – Grammar defining part of speech patterns of noun phrases, defaults to:
    r"""
    NP:
        {<NOUN>}<VERB>
    NP:
        {<DET(,EZ)?|NOUN(,EZ)?|NUM(,EZ)?|ADJ(,EZ)|PRON><DET(,EZ)|NOUN(,EZ)|NUM(,EZ)|ADJ(,EZ)|PRON>*}
        <NOUN>}{<.*(,EZ)?>
    """
maximum_length (int) – Maximum length in words of the candidate, defaults to 3.
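The simplified pattern (adjective)*(noun)+ mentioned in the description can be illustrated without a chunking library by encoding tags as single characters and scanning them with a regular expression. This is only a sketch of the pattern-plus-length idea: the function name is hypothetical, the default grammar above is more elaborate, and tag variants such as 'NOUN,EZ' are ignored here.

```python
import re


def select_noun_phrases(tagged_words, maximum_length=3):
    """Return word sequences whose tags match ADJ* NOUN+ and fit the length limit."""
    # Encode each tag as one character so a regular expression can scan the sequence.
    codes = ''.join({'ADJ': 'A', 'NOUN': 'N'}.get(tag, 'x') for _, tag in tagged_words)
    candidates = []
    for match in re.finditer(r'A*N+', codes):
        words = [word for word, _ in tagged_words[match.start():match.end()]]
        if len(words) <= maximum_length:
            candidates.append(' '.join(words))
    return candidates


# Hypothetical tagged sentence.
tagged = [('fast', 'ADJ'), ('graph', 'NOUN'), ('ranking', 'NOUN'),
          ('is', 'VERB'), ('useful', 'ADJ')]
phrases = select_noun_phrases(tagged, maximum_length=3)
```

A trailing adjective with no noun after it does not match the pattern, and lowering maximum_length below the phrase length drops the candidate entirely rather than truncating it.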
- weight_candidates(window_size: int = 10, normalize_weights: bool = False, **kwargs) None #
Calculates candidates weights using a biased PageRank.
- Parameters:
window_size (int) – The size of the window for connecting two words in the graph, defaults to 10.
normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.
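A biased PageRank differs from the standard one in that the random jump lands on each word with a word-specific probability rather than uniformly. The sketch below shows this on a tiny weighted, undirected graph; the function name, damping factor, iteration count, and sample values are assumptions for illustration and are not taken from perke.

```python
def biased_pagerank(edges, bias, damping=0.85, iterations=100):
    """Rank nodes of a weighted graph, jumping to each node with probability bias[node].

    edges maps node -> {neighbour: weight}; bias maps node -> probability (sums to 1).
    """
    scores = {node: 1 / len(edges) for node in edges}
    out_weight = {node: sum(edges[node].values()) for node in edges}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) * bias[node]
            + damping * sum(
                scores[nb] * weight / out_weight[nb]
                for nb, weight in edges[node].items()
            )
            for node in edges
        }
    return scores


# Hypothetical symmetric graph: only the bias distinguishes the three words.
edges = {
    'a': {'b': 1, 'c': 1},
    'b': {'a': 1, 'c': 1},
    'c': {'a': 1, 'b': 1},
}
bias = {'a': 0.6, 'b': 0.3, 'c': 0.1}
scores = biased_pagerank(edges, bias)
```

Because the graph is fully symmetric, any difference in the final scores comes from the bias alone, which makes the effect of position information easy to see.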
TopicRank#
- class perke.unsupervised.graph_based.TopicRank(valid_pos_tags: Set[str] | None = None)#
Bases:
Extractor
TopicRank keyphrase extractor.
This model relies on a topical representation of the text. Candidate keyphrases are clustered into topics and used as nodes in a complete graph. A graph-based ranking model is applied to assign a significance weight to each topic. Keyphrases are then generated by selecting a candidate from each of the top ranked topics.
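The final step, taking one candidate from each of the top-ranked topics, can be sketched as below. The function name, the data layout (topics as lists of candidate strings, a parallel list of topic weights, and a dict of first-occurrence offsets), and the sample values are hypothetical; the sketch uses the first-occurring heuristic.

```python
def pick_keyphrases(topics, topic_weights, first_offsets, n=10):
    """Take the earliest-occurring candidate from each of the n best topics."""
    ranked = sorted(range(len(topics)), key=lambda t: topic_weights[t], reverse=True)
    return [min(topics[t], key=first_offsets.get) for t in ranked[:n]]


# Hypothetical clustered candidates with topic weights and first offsets.
topics = [['graph model', 'graph models'], ['keyphrase extraction'], ['text']]
topic_weights = [0.5, 0.3, 0.2]
first_offsets = {'graph model': 12, 'graph models': 4,
                 'keyphrase extraction': 0, 'text': 7}

keyphrases = pick_keyphrases(topics, topic_weights, first_offsets, n=2)
```

Selecting a single representative per topic keeps near-duplicate candidates, such as singular and plural forms, from occupying several places in the output.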
Note
Implementation of the TopicRank model described in:
Adrien Bougouin, Florian Boudin and Béatrice Daille. In proceedings of IJCNLP, pages 543-551, 2013.
Examples
    from pathlib import Path

    from perke.unsupervised.graph_based import TopicRank

    # Define the set of valid part of speech tags to occur in the model.
    valid_pos_tags = {'NOUN', 'ADJ'}

    # 1. Create a TopicRank extractor.
    extractor = TopicRank(valid_pos_tags=valid_pos_tags)

    # 2. Load the text.
    input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
    extractor.load_text(input=input_filepath, word_normalization_method='stemming')

    # 3. Select the longest sequences of nouns and adjectives, that do
    #    not contain punctuation marks or stopwords as candidates.
    extractor.select_candidates()

    # 4. Build topics by grouping candidates with HAC (average linkage,
    #    jaccard distance, threshold of 1/4 of shared normalized words).
    #    Weight the topics using random walk, and select the first
    #    occurring candidate from each topic.
    extractor.weight_candidates(
        threshold=0.74, metric='jaccard', linkage_method='average'
    )

    # 5. Get the 10 highest weighted candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)

    for i, (weight, keyphrase) in enumerate(keyphrases):
        print(f'{i + 1}.\t{keyphrase}, \t{weight}')
- Variables:
graph – The topic graph
topics – List of topics
- __init__(valid_pos_tags: Set[str] | None = None) None #
Initializes TopicRank.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.
- select_candidates() None #
Selects the longest sequences of nouns and adjectives as keyphrase candidates.
- weight_candidates(threshold: float = 0.74, metric: Literal['euclidean', 'seuclidean', 'jaccard'] = 'jaccard', linkage_method: Literal['single', 'complete', 'average'] = 'average', topic_heuristic: Literal['first_occurring', 'frequent'] = 'first_occurring') None #
Candidate ranking using random walk.
- Parameters:
threshold (float) – The minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of the normalized words must overlap.
metric (Literal['euclidean', 'seuclidean', 'jaccard']) – The hierarchical clustering metric, defaults to 'jaccard'. See perke.base.types.HierarchicalClusteringMetric for available methods.
linkage_method (Literal['single', 'complete', 'average']) – The hierarchical clustering linkage method, defaults to 'average'. See perke.base.types.HierarchicalClusteringLinkageMethod for available methods.
topic_heuristic (Literal['first_occurring', 'frequent']) – The heuristic for selecting the best candidate for each topic, defaults to the first occurring candidate. See perke.base.types.TopicHeuristic for available heuristics.
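One reading that reconciles the 0.74 default with "more than 1/4 overlap" is that the value is compared against the Jaccard distance between two candidates' word sets, so that pairs at or below it are grouped. The sketch below works through that arithmetic; the function name and sample candidates are hypothetical, and how perke applies the threshold internally is an assumption here.

```python
def jaccard_distance(first, second):
    """Return 1 minus the share of distinct words that both candidates contain."""
    first, second = set(first), set(second)
    return 1 - len(first & second) / len(first | second)


threshold = 0.74

# One of three distinct words is shared: distance 2/3, within the threshold.
close = jaccard_distance(['graph', 'model'], ['graph', 'rank'])

# One of four distinct words is shared: distance 3/4, just beyond the threshold.
far = jaccard_distance(['graph', 'model'], ['graph', 'rank', 'algorithm'])

grouped = [distance <= threshold for distance in (close, far)]
```

Under this reading, a pair sharing exactly one quarter of its words falls just outside the cut-off, which matches the "more than 1/4" wording.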
MultipartiteRank#
- class perke.unsupervised.graph_based.MultipartiteRank(valid_pos_tags: Set[str] | None = None)#
Bases:
TopicRank
MultipartiteRank keyphrase extractor
This model encodes topical information within a multipartite graph structure. The model represents keyphrase candidates and topics in a single graph and exploits their mutually reinforcing relationship to improve candidate ranking.
Note
Implementation of the MultipartiteRank described in:
Florian Boudin. In proceedings of NAACL, pages 667-672, 2018.
Examples
    from pathlib import Path

    from perke.unsupervised.graph_based import MultipartiteRank

    # Define the set of valid part of speech tags to occur in the model.
    valid_pos_tags = {'NOUN', 'ADJ'}

    # 1. Create a MultipartiteRank extractor.
    extractor = MultipartiteRank(valid_pos_tags=valid_pos_tags)

    # 2. Load the text.
    input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
    extractor.load_text(input=input_filepath, word_normalization_method='stemming')

    # 3. Select the longest sequences of nouns and adjectives, that do
    #    not contain punctuation marks or stopwords as candidates.
    extractor.select_candidates()

    # 4. Build the Multipartite graph and weight candidates using
    #    random walk, alpha controls the weight adjustment mechanism,
    #    see TopicRank for metric, linkage method and threshold
    #    parameters.
    extractor.weight_candidates(
        threshold=0.74,
        metric='jaccard',
        linkage_method='average',
        alpha=1.1,
    )

    # 5. Get the 10 highest weighted candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)

    for i, (weight, keyphrase) in enumerate(keyphrases):
        print(f'{i + 1}.\t{keyphrase}, \t{weight}')
- Variables:
topic_ids – Dict of canonical forms of candidates to topic identifiers
graph – The candidate graph
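In a multipartite graph, nodes are partitioned into groups and edges only run between different groups. Assuming topic_ids maps each candidate to its topic identifier as described above, the edge set of such a candidate graph could be derived as in this sketch; the function name and sample candidates are hypothetical, and edge weights and direction are left out.

```python
from itertools import combinations


def multipartite_edges(topic_ids):
    """Pair up candidates, keeping only pairs that belong to different topics."""
    return [
        (first, second)
        for first, second in combinations(sorted(topic_ids), 2)
        if topic_ids[first] != topic_ids[second]
    ]


# Hypothetical candidates: the two 'graph model' variants share a topic.
topic_ids = {'graph model': 0, 'graph models': 0, 'ranking': 1}
edges = multipartite_edges(topic_ids)
```

Candidates within the same topic are never linked to each other, so they compete for score only through their connections to candidates of other topics.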
- __init__(valid_pos_tags: Set[str] | None = None) None #
Initializes MultipartiteRank.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.
- weight_candidates(threshold: float = 0.74, metric: Literal['euclidean', 'seuclidean', 'jaccard'] = 'jaccard', linkage_method: Literal['single', 'complete', 'average'] = 'average', alpha: float = 1.1) None #
Candidate weight calculation using random walk.
- Parameters:
threshold (float) – The minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of the normalized words must overlap.
metric (Literal['euclidean', 'seuclidean', 'jaccard']) – The hierarchical clustering metric, defaults to 'jaccard'. See perke.base.types.HierarchicalClusteringMetric for available methods.
linkage_method (Literal['single', 'complete', 'average']) – The hierarchical clustering linkage method, defaults to 'average'. See perke.base.types.HierarchicalClusteringLinkageMethod for available methods.
alpha (float) – Hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.