Perke’s Base Utils#

Extractor#

class perke.base.extractor.Extractor(valid_pos_tags: Set[str] | None = None)#

Base extractor, provides base functions for all extractors.

Variables:

word_normalization_method – Word normalization method.
sentences – List of sentence objects of the text.
candidates – Dict mapping canonical forms of candidates to candidates. The canonical form of a candidate is the string formed by joining its normalized words.
stopwords – Set of stopwords.
valid_pos_tags – Set of valid part of speech tags.
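
As a rough illustration of how the candidates variable is keyed, the sketch below builds a plain dict indexed by canonical form. The helper name canonical_form, the single-space separator, and the sample words are assumptions made for illustration; they are not taken from Perke's source.

```python
from typing import Dict, List


def canonical_form(normalized_words: List[str]) -> str:
    # Assumption: normalized words are joined with a single space.
    return ' '.join(normalized_words)


# Two surface forms that share the same (made-up) normalized words
occurrences = [
    (['neural', 'networks'], ['neural', 'network']),
    (['neural', 'network'], ['neural', 'network']),
]

# Maps canonical form -> surface forms seen for that candidate
candidates: Dict[str, List[List[str]]] = {}
for words, normalized_words in occurrences:
    candidates.setdefault(canonical_form(normalized_words), []).append(words)
```

Both occurrences land under one key, which is why every occurrence of a candidate shares one list of normalized words.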

__init__(valid_pos_tags: Set[str] | None = None) → None#

Initializes the extractor.

Parameters: valid_pos_tags (Set[str] | None) – Set of valid part of speech tags. Defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.

get_n_best(n: int = 10, remove_redundants: bool = False, normalized: bool = False) → List[Tuple[str, float]]#

Returns the n-best candidates.

Parameters:

n (int) – The number of candidates, defaults to 10.
remove_redundants (bool) – Whether redundant keyphrases are filtered out from the n-best list, defaults to False.
normalized (bool) – Whether to return the normalized words of each candidate instead of the words of its first occurrence, defaults to False.

Returns:

List of (candidate, weight) tuples. Each candidate is either its canonical form or the joined words of its first occurrence.

Return type:

List[Tuple[str, float]]
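
The selection described above can be sketched with plain dicts. This is a minimal sketch, not Perke's implementation: the function name n_best, its two dict arguments, and the ranking by descending weight are assumptions, and the remove_redundants filter is left out.

```python
from typing import Dict, List, Tuple


def n_best(
    weights: Dict[str, float],
    first_occurrences: Dict[str, str],
    n: int = 10,
    normalized: bool = False,
) -> List[Tuple[str, float]]:
    # Assumption: candidates are ranked by descending weight, top n kept
    ranked = sorted(weights, key=weights.get, reverse=True)[:n]
    if normalized:
        # Report the canonical (normalized) form
        return [(form, weights[form]) for form in ranked]
    # Otherwise report the joined words of the first occurrence
    return [(first_occurrences[form], weights[form]) for form in ranked]


# Made-up weights keyed by canonical form
weights = {'neural network': 0.9, 'keyphras extract': 0.7, 'graph': 0.2}
first_occurrences = {
    'neural network': 'neural networks',
    'keyphras extract': 'keyphrase extraction',
    'graph': 'graphs',
}
top_two = n_best(weights, first_occurrences, n=2)
```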

load_text(input: str | Path, word_normalization_method: Literal['stemming', 'lemmatization', None] = 'stemming', universal_pos_tags: bool = True) → None#

Loads the text of a document or string.

Parameters:

input (str | Path) – Input, either raw text or a file path.
word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, defaults to 'stemming'. See perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags (bool) – Whether to use universal part of speech tags or not, defaults to True.
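
Because input may be raw text or a file path, a caller-side dispatch might look like the sketch below. The helper resolve_text and its rule (read the input if it names an existing file, otherwise treat it as text) are assumptions for illustration, not Perke's documented behavior.

```python
from pathlib import Path
from typing import Union


def resolve_text(input: Union[str, Path]) -> str:
    # Assumption: an input naming an existing file is read from disk;
    # anything else is treated as the raw text itself.
    try:
        path = Path(input)
        if path.is_file():
            return path.read_text(encoding='utf-8')
    except OSError:
        # Long raw text can be rejected as a path by the operating system
        pass
    return str(input)
```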

Data Structures#

class perke.base.data_structures.Candidate(all_words: ~typing.List[~typing.List[str]] = <factory>, offsets: ~typing.List[int] = <factory>, all_pos_tags: ~typing.List[~typing.List[str]] = <factory>, normalized_words: ~typing.List[str] = <factory>, weight: float = 0)#

Represents a keyphrase candidate data structure.

Variables:

all_words (List[List[str]]) – Nested list of words, each inner list holds the words of one occurrence of the candidate.
offsets (List[int]) – List of offsets, one per occurrence.
all_pos_tags (List[List[str]]) – Nested list of part of speech tags, each inner list holds the tags of one occurrence of the candidate.
normalized_words (List[str]) – List of normalized words, all occurrences share the same list of normalized words.
weight (float) – Candidate weight in weighting algorithms.

add_occurrence(words: List[str], offset: int, pos_tags: List[str], normalized_words: List[str]) → None#

Adds a new occurrence to the candidate.

Parameters:

words (List[str]) – List of words of the occurrence
offset (int) – The offset of the occurrence
pos_tags (List[str]) – List of part of speech tags assigned to words of the occurrence
normalized_words (List[str]) – List of normalized words of the occurrence

property length: int#

Gets number of normalized words.

Returns: Number of normalized words
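
To make the field layout concrete, here is a stand-in dataclass that mirrors the documented fields and defaults. The class name CandidateSketch and the body of add_occurrence are assumptions; in particular, overwriting normalized_words on each call is a guess based on the note that all occurrences share the same normalized words.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSketch:
    all_words: List[List[str]] = field(default_factory=list)
    offsets: List[int] = field(default_factory=list)
    all_pos_tags: List[List[str]] = field(default_factory=list)
    normalized_words: List[str] = field(default_factory=list)
    weight: float = 0

    def add_occurrence(
        self,
        words: List[str],
        offset: int,
        pos_tags: List[str],
        normalized_words: List[str],
    ) -> None:
        # Per-occurrence data is appended to the parallel lists
        self.all_words.append(words)
        self.offsets.append(offset)
        self.all_pos_tags.append(pos_tags)
        # Assumption: the shared normalized words are simply overwritten
        self.normalized_words = normalized_words

    @property
    def length(self) -> int:
        # Number of normalized words, not number of occurrences
        return len(self.normalized_words)


candidate = CandidateSketch()
candidate.add_occurrence(['neural', 'networks'], 4, ['ADJ', 'NOUN'], ['neural', 'network'])
candidate.add_occurrence(['neural', 'network'], 17, ['ADJ', 'NOUN'], ['neural', 'network'])
```

After the two calls, the candidate holds two occurrences while length stays at two words, since length counts normalized words.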

class perke.base.data_structures.Sentence(words: List[str], pos_tags: List[str], normalized_words: List[str])#

Represents a sentence data structure.

Variables:

words (List[str]) – List of words
pos_tags (List[str]) – List of part of speech tags assigned to words
normalized_words (List[str]) – List of normalized words

property length: int#

Gets number of words.

Returns: Number of words
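
The three lists of a sentence appear to be parallel, one entry per word. The sketch below uses a stand-in class (SentenceSketch, an assumed name) and made-up tags to show how such parallel lists could be filtered against a set like valid_pos_tags. It illustrates the data layout only and is not Perke's candidate selection code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SentenceSketch:
    words: List[str]
    pos_tags: List[str]
    normalized_words: List[str]

    @property
    def length(self) -> int:
        # Number of words in the sentence
        return len(self.words)


sentence = SentenceSketch(
    words=['deep', 'networks', 'learn'],
    pos_tags=['ADJ', 'NOUN', 'VERB'],
    normalized_words=['deep', 'network', 'learn'],
)

# Keep only words whose tag is in the allowed set
valid_pos_tags: Set[str] = {'NOUN', 'ADJ'}
kept = [
    word
    for word, tag in zip(sentence.words, sentence.pos_tags)
    if tag in valid_pos_tags
]
```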

Readers#

class perke.base.readers.Reader(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool)#

Base Reader

Variables:

word_normalization_method – Word normalization method
normalizer – The hazm normalizer instance
stemmer – The hazm stemmer instance
lemmatizer – The hazm lemmatizer instance
pos_tagger – The hazm pos tagger instance

__init__(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool) → None#

Initializes the reader.

Parameters:

word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method. See perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags (bool) – Whether to use universal part of speech tags or not

class perke.base.readers.RawTextReader(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags)#

Bases: Reader

Reader for raw text

Variables: text – Raw text to read sentences from

__init__(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags) → None#

Initializes the reader.

Parameters:

input (str) – Input, either raw text or a file path.
word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method. See perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags – Whether to use universal part of speech tags or not

read() → List[Sentence]#

Reads the input and uses hazm to preprocess.

Returns: List of sentences
Return type: List[Sentence]

Types#

perke.base.types.HierarchicalClusteringLinkageMethod#: alias of Literal['single', 'complete', 'average']

perke.base.types.HierarchicalClusteringMetric#: alias of Literal['euclidean', 'seuclidean', 'jaccard']

perke.base.types.TopicHeuristic#: alias of Literal['first_occurring', 'frequent']

perke.base.types.WordNormalizationMethod#: alias of Literal['stemming', 'lemmatization', None]
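
Literal aliases like these are only checked by static type checkers, so a runtime guard can be useful when a value comes from user input. The sketch below redefines the alias locally to stay self-contained; the helper check_method is hypothetical and not part of Perke.

```python
from typing import Literal, Optional, get_args

# Local copy of the documented alias, redefined so the sketch runs alone
WordNormalizationMethod = Literal['stemming', 'lemmatization', None]


def check_method(method: Optional[str]) -> Optional[str]:
    # get_args returns the allowed literal values, including None
    if method not in get_args(WordNormalizationMethod):
        raise ValueError(f'Unsupported word normalization method: {method!r}')
    return method
```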