Graph-based models#

TextRank#

class perke.unsupervised.graph_based.TextRank(valid_pos_tags: Set[str] | None = None)#

Bases: Extractor

TextRank keyphrase extractor

This model builds a graph that represents the text. A graph-based ranking algorithm is then applied to extract the phrases that are most important in the text.

In this implementation, nodes are words of certain parts of speech (nouns and adjectives) and edges represent co-occurrence relations, controlled by the distance between word occurrences (here a window of 2 words). Nodes are weighted by the TextRank graph-based weighting algorithm in its unweighted variant.
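As a rough sketch (a toy illustration with hypothetical tokens, not perke's internal code), the graph construction and the unweighted weighting step might look like:

```python
from collections import defaultdict

# Toy tagged text; only NOUN/ADJ words become graph nodes.
tagged = [
    ('linear', 'ADJ'), ('constraints', 'NOUN'),
    ('linear', 'ADJ'), ('equations', 'NOUN'),
    ('strict', 'ADJ'), ('equations', 'NOUN'),
]
valid_pos_tags = {'NOUN', 'ADJ'}

# Edges: undirected co-occurrence within a window of 2 word positions.
window_size = 2
neighbors = defaultdict(set)
for i, (word, pos) in enumerate(tagged):
    if pos not in valid_pos_tags:
        continue
    for j in range(i + 1, min(i + window_size, len(tagged))):
        other, other_pos = tagged[j]
        if other_pos in valid_pos_tags and other != word:
            neighbors[word].add(other)
            neighbors[other].add(word)

# Unweighted TextRank update: S(v) = (1 - d) + d * sum(S(u) / deg(u)).
d = 0.85
scores = {word: 1.0 for word in neighbors}
for _ in range(50):
    scores = {
        v: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in neighbors[v])
        for v in neighbors
    }
```

Better-connected words ('linear', 'equations' above) end up with higher weights than words seen in a single context.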

Note

Implementation of the TextRank model for keyword extraction described in:

Rada Mihalcea and Paul Tarau
In Proceedings of EMNLP, 2004

Examples

from pathlib import Path

from perke.unsupervised.graph_based import TextRank

# Define the set of valid part of speech tags to occur in the model.
valid_pos_tags = {'NOUN', 'ADJ'}

# 1. Create a TextRank extractor.
extractor = TextRank(valid_pos_tags=valid_pos_tags)

# 2. Load the text.
input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
extractor.load_text(input=input_filepath, word_normalization_method=None)

# 3. Build the graph representation of the text and weight the
#    words. Keyphrase candidates are composed of the 33 percent
#    highest weighted words.
extractor.weight_candidates(window_size=2, top_t_percent=0.33)

# 4. Get the 10 highest weighted candidates as keyphrases.
keyphrases = extractor.get_n_best(n=10)

for i, (weight, keyphrase) in enumerate(keyphrases):
    print(f'{i+1}.\t{keyphrase}, \t{weight}')
Variables:
  • graph – The word graph

  • graph_edges_are_weighted – Whether graph edges are weighted

__init__(valid_pos_tags: Set[str] | None = None) None#

Initializes TextRank.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives. I.e. {'NOUN', 'ADJ'}.

select_candidates() None#

Selects candidates using the longest sequences of certain parts of speech.

weight_candidates(window_size: int = 2, top_t_percent: float | None = None, normalize_weights: bool = False) None#

Tailored candidate weighting method for TextRank. Keyphrase candidates are either composed of the T percent highest-weighted words, as in the original paper, or extracted using the select_candidates method. Candidates are weighted using the sum of their words' weights.

Parameters:
  • window_size (int) – The size of window for connecting two words in the graph, defaults to 2.

  • top_t_percent (float | None) – Percentage of top vertices to keep for phrase generation, defaults to None.

  • normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.
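The top-T candidate formation described above might be sketched as follows (hypothetical weights and tokens; the variable names are illustrative, not perke's internals):

```python
# Hypothetical word weights (e.g. from the TextRank graph) and the
# original token sequence.
word_weights = {'linear': 1.3, 'equations': 1.3, 'constraints': 0.7, 'strict': 0.7}
tokens = ['linear', 'constraints', 'linear', 'equations', 'strict', 'equations']

# Keep the top 50% highest-weighted words (i.e. top_t_percent=0.5).
top_t_percent = 0.5
n_keep = int(len(word_weights) * top_t_percent)
kept = set(sorted(word_weights, key=word_weights.get, reverse=True)[:n_keep])

# Keyphrase candidates: maximal contiguous runs of kept words,
# weighted by the sum of their words' weights.
candidates = {}
run = []
for token in tokens + [None]:  # sentinel flushes the last run
    if token in kept:
        run.append(token)
    elif run:
        candidates[' '.join(run)] = sum(word_weights[w] for w in run)
        run = []
```

Here 'linear equations' is produced as a multi-word candidate because both of its words survive the top-T cut and occur contiguously in the text.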

SingleRank#

class perke.unsupervised.graph_based.SingleRank(valid_pos_tags: Set[str] | None = None)#

Bases: TextRank

SingleRank keyphrase extractor

This model is an extension of the TextRank model that uses the number of co-occurrences to weight edges in the graph.
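The difference from TextRank can be sketched as follows (toy data, not perke's code): edge weights count co-occurrences, and each neighbor's contribution in the random walk is scaled by the edge weight over that neighbor's total edge weight, rather than by its plain degree:

```python
from collections import Counter

tokens = ['linear', 'equations', 'linear', 'equations', 'strict', 'inequations']

# Edge weights: number of co-occurrences within a window of 2 tokens.
window_size = 2
edge_weights = Counter()
for i, u in enumerate(tokens):
    for j in range(i + 1, min(i + window_size, len(tokens))):
        if tokens[j] != u:
            edge_weights[frozenset((u, tokens[j]))] += 1

nodes = set(tokens)

def weight(u, v):
    return edge_weights.get(frozenset((u, v)), 0)

# Weighted TextRank update: neighbors contributing through heavier
# edges pass along a larger share of their score.
strength = {u: sum(weight(u, v) for v in nodes) for u in nodes}
d = 0.85
scores = {u: 1.0 for u in nodes}
for _ in range(50):
    scores = {
        v: (1 - d) + d * sum(
            scores[u] * weight(u, v) / strength[u]
            for u in nodes if weight(u, v)
        )
        for v in nodes
    }
```

The repeated 'linear'/'equations' pair accumulates edge weight 3, so both words outrank the words seen only once.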

Note

Implementation of the SingleRank model described in:

Xiaojun Wan and Jianguo Xiao
In proceedings of AAAI, pages 855-860, 2008

Examples

from pathlib import Path

from perke.unsupervised.graph_based import SingleRank

# Define the set of valid part of speech tags to occur in the model.
valid_pos_tags = {'NOUN', 'ADJ'}

# 1. Create a SingleRank extractor.
extractor = SingleRank(valid_pos_tags=valid_pos_tags)

# 2. Load the text.
input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
extractor.load_text(input=input_filepath, word_normalization_method=None)

# 3. Select the longest sequences of nouns and adjectives as
#    candidates.
extractor.select_candidates()

# 4. Weight the candidates using the sum of their words weights that
#    are computed using random walk. In the graph, nodes are certain
#    parts of speech (nouns and adjectives) that are connected if
#    they co-occur in a window of 10 words.
extractor.weight_candidates(window_size=10)

# 5. Get the 10 highest weighted candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

for i, (weight, keyphrase) in enumerate(keyphrases):
    print(f'{i+1}.\t{keyphrase}, \t{weight}')
Variables:

graph_edges_are_weighted – Whether graph edges are weighted

__init__(valid_pos_tags: Set[str] | None = None) None#

Initializes SingleRank.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives. I.e. {'NOUN', 'ADJ'}.

weight_candidates(window_size: int = 10, normalize_weights: bool = False, **kwargs) None#

Weights candidates using the weighted variant of the TextRank formula. Candidates are weighted by the sum of the weights of their words.

Parameters:
  • window_size (int) – The size of window for connecting two words in the graph, defaults to 10.

  • normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.

PositionRank#

class perke.unsupervised.graph_based.PositionRank(valid_pos_tags: Set[str] | None = None)#

Bases: SingleRank

PositionRank keyphrase extractor

This model is an unsupervised approach to extracting keyphrases from scholarly documents. It incorporates information from all positions of a word's occurrences into a biased PageRank.
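The position bias can be sketched in plain Python (a toy illustration; the graph and tokens are hypothetical):

```python
from collections import defaultdict

tokens = ['keyphrase', 'extraction', 'from', 'scholarly', 'documents', 'keyphrase']

# Position bias: each word's weight is the sum of the inverses of its
# occurrence positions (1-indexed), normalized over all words.
bias = defaultdict(float)
for position, word in enumerate(tokens, start=1):
    bias[word] += 1.0 / position
total = sum(bias.values())
bias = {word: b / total for word, b in bias.items()}

# Toy co-occurrence graph over the content words (hand-built here).
neighbors = {
    'keyphrase': {'extraction'},
    'extraction': {'keyphrase', 'scholarly'},
    'scholarly': {'extraction', 'documents'},
    'documents': {'scholarly'},
}

# Biased PageRank: the teleport term (1 - d) is distributed according
# to the position bias instead of uniformly.
d = 0.85
scores = {w: 1.0 / len(neighbors) for w in neighbors}
for _ in range(50):
    scores = {
        v: (1 - d) * bias.get(v, 0.0) + d * sum(
            scores[u] / len(neighbors[u]) for u in neighbors[v]
        )
        for v in neighbors
    }
```

'keyphrase' occurs first (and again later), so it outranks 'documents' even though the two occupy structurally symmetric positions in this toy graph.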

Note

Implementation of the PositionRank model described in:

Corina Florescu and Cornelia Caragea
In proceedings of ACL, pages 1105-1115, 2017

Examples

from pathlib import Path

from perke.unsupervised.graph_based import PositionRank

# Define the set of valid part of speech tags to occur in the model.
valid_pos_tags = {'NOUN', 'NOUN,EZ', 'ADJ', 'ADJ,EZ'}

# Define the grammar for selecting the keyphrase candidates
grammar = r"""
    NP:
        {<NOUN>}<VERB>
    NP:
        {<DET(,EZ)?|NOUN(,EZ)?|NUM(,EZ)?|ADJ(,EZ)|PRON><DET(,EZ)|NOUN(,EZ)|NUM(,EZ)|ADJ(,EZ)|PRON>*}
        <NOUN>}{<.*(,EZ)?>
"""

# 1. Create a PositionRank extractor.
extractor = PositionRank(valid_pos_tags=valid_pos_tags)

# 2. Load the text.
input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
extractor.load_text(
    input=input_filepath,
    word_normalization_method=None,
    universal_pos_tags=False,
)

# 3. Select the noun phrases up to 3 words as keyphrase candidates.
extractor.select_candidates(grammar=grammar, maximum_length=3)

# 4. Weight the candidates using the sum of their word's weights
#    that are computed using random walk biased with the position of
#    the words in the text. In the graph, nodes are words (nouns
#    and adjectives only) that are connected if they co-occur in a
#    window of 10 words.
extractor.weight_candidates(window_size=10)

# 5. Get the 10 highest weighted candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

for i, (weight, keyphrase) in enumerate(keyphrases):
    print(f'{i+1}.\t{keyphrase}, \t{weight}')
Variables:

positions – Dict mapping normalized words to the sums of their inverse occurrence positions

__init__(valid_pos_tags: Set[str] | None = None) None#

Initializes PositionRank.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives. I.e. {'NOUN', 'NOUN,EZ', 'ADJ', 'ADJ,EZ'}.

select_candidates(grammar: str | None = None, maximum_length: int = 3, **kwargs) None#

Candidate selection heuristic using a syntactic part of speech pattern for noun phrase extraction. Keyphrase candidates are noun phrases that match the regular expression (adjective)*(noun)+, up to a given maximum length.

Parameters:
  • grammar (str | None) –

    Grammar defining part of speech patterns of noun phrases, defaults to:

    r"""
    NP:
        {<NOUN>}<VERB>
    NP:
        {<DET(,EZ)?|NOUN(,EZ)?|NUM(,EZ)?|ADJ(,EZ)|PRON><DET(,EZ)|NOUN(,EZ)|NUM(,EZ)|ADJ(,EZ)|PRON>*}
        <NOUN>}{<.*(,EZ)?>
    """
    

  • maximum_length (int) – Maximum length in words of the candidate, defaults to 3.

weight_candidates(window_size: int = 10, normalize_weights: bool = False, **kwargs) None#

Calculates candidates weights using a biased PageRank.

Parameters:
  • window_size (int) – The size of window for connecting two words in the graph, defaults to 10.

  • normalize_weights (bool) – Whether to normalize keyphrase weights by their length, defaults to False.

TopicRank#

class perke.unsupervised.graph_based.TopicRank(valid_pos_tags: Set[str] | None = None)#

Bases: Extractor

TopicRank keyphrase extractor.

This model relies on a topical representation of the text. Candidate keyphrases are clustered into topics and used as nodes in a complete graph. A graph-based ranking model is applied to assign a significance weight to each topic. Keyphrases are then generated by selecting a candidate from each of the top ranked topics.
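The topic-building step can be sketched with a naive average-linkage agglomeration over Jaccard distances (a simplified stand-in for the hierarchical clustering perke delegates to; the candidates below are hypothetical):

```python
# Hypothetical candidates mapped to their normalized word sets.
candidates = {
    'keyphrase extraction': {'keyphrase', 'extraction'},
    'extraction of keyphrases': {'extraction', 'keyphrase'},
    'graph model': {'graph', 'model'},
}

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def average_distance(topic_a, topic_b):
    pairs = [(a, b) for a in topic_a for b in topic_b]
    return sum(jaccard_distance(candidates[a], candidates[b])
               for a, b in pairs) / len(pairs)

# Naive average-linkage agglomeration: repeatedly merge the two
# closest topics until the smallest distance exceeds the threshold
# (0.74 distance corresponds to > 1/4 word-overlap similarity).
threshold = 0.74
topics = [[name] for name in candidates]
while len(topics) > 1:
    (i, j), dist = min(
        (((i, j), average_distance(topics[i], topics[j]))
         for i in range(len(topics))
         for j in range(i + 1, len(topics))),
        key=lambda pair: pair[1],
    )
    if dist > threshold:
        break
    topics[i] += topics.pop(j)
```

The two candidates sharing the same normalized words collapse into one topic, while 'graph model' stays in a topic of its own.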

Note

Implementation of the TopicRank model described in:

Adrien Bougouin, Florian Boudin and Béatrice Daille
In proceedings of IJCNLP, pages 543-551, 2013

Examples

from pathlib import Path

from perke.unsupervised.graph_based import TopicRank

# Define the set of valid part of speech tags to occur in the model.
valid_pos_tags = {'NOUN', 'ADJ'}

# 1. Create a TopicRank extractor.
extractor = TopicRank(valid_pos_tags=valid_pos_tags)

# 2. Load the text.
input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
extractor.load_text(input=input_filepath, word_normalization_method='stemming')

# 3. Select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
extractor.select_candidates()

# 4. Build topics by grouping candidates with HAC (average linkage,
#    jaccard distance, threshold of 1/4 of shared normalized words).
#    Weight the topics using random walk, and select the first
#    occurring candidate from each topic.
extractor.weight_candidates(
    threshold=0.74, metric='jaccard', linkage_method='average'
)

# 5. Get the 10 highest weighted candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

for i, (weight, keyphrase) in enumerate(keyphrases):
    print(f'{i + 1}.\t{keyphrase}, \t{weight}')
Variables:
  • graph – The topic graph

  • topics – List of topics

__init__(valid_pos_tags: Set[str] | None = None) None#

Initializes TopicRank.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives. I.e. {'NOUN', 'ADJ'}.

select_candidates() None#

Selects the longest sequences of nouns and adjectives as keyphrase candidates.

weight_candidates(threshold: float = 0.74, metric: Literal['euclidean', 'seuclidean', 'jaccard'] = 'jaccard', linkage_method: Literal['single', 'complete', 'average'] = 'average', topic_heuristic: Literal['first_occurring', 'frequent'] = 'first_occurring') None#

Candidate ranking using random walk.

Parameters:
  • threshold (float) – The minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of normalized word overlap similarity.

  • metric (Literal['euclidean', 'seuclidean', 'jaccard']) – The hierarchical clustering metric, defaults to 'jaccard'. See perke.base.types.HierarchicalClusteringMetric for available metrics.

  • linkage_method (Literal['single', 'complete', 'average']) – The hierarchical clustering linkage method, defaults to 'average'. See perke.base.types.HierarchicalClusteringLinkageMethod for available methods.

  • topic_heuristic (Literal['first_occurring', 'frequent']) – The heuristic for selecting the best candidate for each topic, defaults to first occurring candidate. See perke.base.types.TopicHeuristic for available heuristics.

MultipartiteRank#

class perke.unsupervised.graph_based.MultipartiteRank(valid_pos_tags: Set[str] | None = None)#

Bases: TopicRank

MultipartiteRank keyphrase extractor

This model encodes topical information within a multipartite graph structure. The model represents keyphrase candidates and topics in a single graph and exploits their mutually reinforcing relationship to improve candidate ranking.
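The multipartite structure can be sketched as follows (hypothetical topics; the full model additionally adjusts edge weights, controlled by alpha, toward candidates occurring early in the document):

```python
# Hypothetical topics (clusters of candidates, e.g. from the
# TopicRank-style clustering step).
topics = [
    ['keyphrase extraction', 'extraction of keyphrases'],
    ['graph model'],
]
topic_of = {c: i for i, topic in enumerate(topics) for c in topic}

# Multipartite structure: candidates are connected only across
# topics, so each topic forms an independent set in the graph.
candidates = [c for topic in topics for c in topic]
edges = {
    (u, v)
    for u in candidates
    for v in candidates
    if topic_of[u] != topic_of[v]
}
```

Because candidates within a topic are never connected, the random walk cannot concentrate weight inside one topic; ranking mass must flow between topics.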

Note

Implementation of the MultipartiteRank model described in:

Florian Boudin
In proceedings of NAACL, pages 667-672, 2018

Examples

from pathlib import Path

from perke.unsupervised.graph_based import MultipartiteRank

# Define the set of valid part of speech tags to occur in the model.
valid_pos_tags = {'NOUN', 'ADJ'}

# 1. Create a MultipartiteRank extractor.
extractor = MultipartiteRank(valid_pos_tags=valid_pos_tags)

# 2. Load the text.
input_filepath = Path(__file__).parent.parent.parent / 'input.txt'
extractor.load_text(input=input_filepath, word_normalization_method='stemming')

# 3. Select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
extractor.select_candidates()

# 4. Build the Multipartite graph and weight candidates using
#    random walk, alpha controls the weight adjustment mechanism,
#    see TopicRank for metric, linkage method and threshold
#    parameters.
extractor.weight_candidates(
    threshold=0.74,
    metric='jaccard',
    linkage_method='average',
    alpha=1.1,
)

# 5. Get the 10 highest weighted candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

for i, (weight, keyphrase) in enumerate(keyphrases):
    print(f'{i + 1}.\t{keyphrase}, \t{weight}')
Variables:
  • topic_ids – Dict mapping canonical forms of candidates to topic identifiers

  • graph – The candidate graph

__init__(valid_pos_tags: Set[str] | None = None) None#

Initializes MultipartiteRank.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives. I.e. {'NOUN', 'ADJ'}.

weight_candidates(threshold: float = 0.74, metric: Literal['euclidean', 'seuclidean', 'jaccard'] = 'jaccard', linkage_method: Literal['single', 'complete', 'average'] = 'average', alpha: float = 1.1) None#

Candidate weight calculation using random walk.

Parameters:
  • threshold (float) – The minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of normalized word overlap similarity.

  • metric (Literal['euclidean', 'seuclidean', 'jaccard']) – The hierarchical clustering metric, defaults to 'jaccard'. See perke.base.types.HierarchicalClusteringMetric for available metrics.

  • linkage_method (Literal['single', 'complete', 'average']) – The hierarchical clustering linkage method, defaults to 'average'. See HierarchicalClusteringLinkageMethod for available methods.

  • alpha (float) – Hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.