Perke’s Base Utils#

Extractor#

class perke.base.extractor.Extractor(valid_pos_tags: Set[str] | None = None)#

Base extractor that provides common functionality for all extractors.

Variables:
  • word_normalization_method – Word normalization method.

  • sentences – List of sentence objects of the text.

  • candidates – Dict mapping the canonical form of each candidate to the candidate itself. The canonical form of a candidate is a string joined from the candidate's normalized words.

  • stopwords – Set of stopwords.

  • valid_pos_tags – Set of valid part of speech tags.
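The role of the canonical form as a dict key can be illustrated with a minimal sketch. It is not Perke's implementation: the helper name canonical_form is hypothetical, and joining with a single space is an assumption, since the description above only says the string is "joined from normalized words".

```python
from typing import Dict, List


def canonical_form(normalized_words: List[str]) -> str:
    """Hypothetical helper: build a lookup key from normalized words."""
    # Assumption: the words are joined with a single space.
    return ' '.join(normalized_words)


# Two occurrences with different surface forms but equal normalized words
occurrences = [
    (['keyphrase', 'extractors'], ['keyphras', 'extractor']),
    (['Keyphrase', 'extractor'], ['keyphras', 'extractor']),
]

# Map each canonical form to the surface forms observed for it
candidates: Dict[str, List[List[str]]] = {}
for words, normalized_words in occurrences:
    key = canonical_form(normalized_words)
    candidates.setdefault(key, []).append(words)
```

Both occurrences end up under the same key, which is why a single candidate can carry several occurrences.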

__init__(valid_pos_tags: Set[str] | None = None) → None#

Initializes the extractor.

Parameters:

valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.

get_n_best(n: int = 10, remove_redundants: bool = False, normalized: bool = False) → List[Tuple[str, float]]#

Returns the n-best candidates.

Parameters:
  • n (int) – The number of candidates, defaults to 10.

  • remove_redundants (bool) – Whether redundant keyphrases are filtered out from the n-best list, defaults to False.

  • normalized (bool) – Whether to return the normalized words of each candidate instead of the words of its first occurrence, defaults to False.

Returns:

List of (candidate, weight) tuples, where candidate is either the canonical form or the joined words of the first occurrence.

Return type:

List[Tuple[str, float]]
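How the three parameters interact can be sketched as follows. This is an illustrative stand-in, not the method's actual code: the function n_best and its weights and first_forms arguments are hypothetical, and the redundancy rule (drop a candidate whose canonical form is contained, on word boundaries, in a higher-ranked one) is an assumption.

```python
from typing import Dict, List, Tuple


def n_best(
    weights: Dict[str, float],
    first_forms: Dict[str, str],
    n: int = 10,
    remove_redundants: bool = False,
    normalized: bool = False,
) -> List[Tuple[str, float]]:
    """Hypothetical sketch of n-best selection over weighted candidates."""
    # Rank canonical forms by weight, highest first
    ranked = sorted(weights, key=lambda form: weights[form], reverse=True)

    if remove_redundants:
        kept: List[str] = []
        for form in ranked:
            # Assumption: a candidate is redundant when its words appear,
            # as whole words, inside an already selected candidate.
            if not any(f' {form} ' in f' {better} ' for better in kept):
                kept.append(form)
        ranked = kept

    best = ranked[:n]
    if normalized:
        # Canonical (normalized) form
        return [(form, weights[form]) for form in best]
    # Joined words of the first occurrence
    return [(first_forms[form], weights[form]) for form in best]


weights = {'neural network': 0.9, 'network': 0.5, 'train data': 0.7}
first_forms = {
    'neural network': 'Neural Networks',
    'network': 'networks',
    'train data': 'training data',
}
top_two = n_best(weights, first_forms, n=2)
filtered = n_best(weights, first_forms, n=3, remove_redundants=True, normalized=True)
```

In this sketch, filtering happens before truncation, so removing a redundant candidate lets a lower-ranked one move up into the n-best list.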

load_text(input: str | Path, word_normalization_method: Literal['stemming', 'lemmatization', None] = 'stemming', universal_pos_tags: bool = True) → None#

Loads the text of a document or string.

Parameters:
  • input (str | Path) – Input, either raw text or a file path.

  • word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, defaults to 'stemming'. See perke.base.types.WordNormalizationMethod for available methods.

  • universal_pos_tags (bool) – Whether to use universal part of speech tags or not, defaults to True.
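One plausible way to tell the two kinds of input apart is by type. The sketch below assumes that a Path object means "read this file" and that a plain string is always raw text. The reference above does not state this rule, and read_input is a hypothetical helper, not part of Perke.

```python
import tempfile
from pathlib import Path
from typing import Union


def read_input(input: Union[str, Path]) -> str:
    """Hypothetical helper: return the text behind a str or Path input."""
    # Assumption: only Path objects are treated as file paths.
    if isinstance(input, Path):
        return input.read_text(encoding='utf-8')
    return input


# The same text supplied once as a file and once as a raw string
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'document.txt'
    path.write_text('sample text', encoding='utf-8')
    from_file = read_input(path)

from_text = read_input('sample text')
```

Dispatching on type avoids guessing whether a string happens to name an existing file.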

Data Structures#

class perke.base.data_structures.Candidate(all_words: List[List[str]] = <factory>, offsets: List[int] = <factory>, all_pos_tags: List[List[str]] = <factory>, normalized_words: List[str] = <factory>, weight: float = 0)#

Represents a keyphrase candidate data structure.

Variables:
  • all_words (List[List[str]]) – Nested list of words, each inner list holds the words of one occurrence of the candidate.

  • offsets (List[int]) – List of offsets of each occurrence.

  • all_pos_tags (List[List[str]]) – Nested list of part of speech tags, each inner list holds the tags of one occurrence of the candidate.

  • normalized_words (List[str]) – List of normalized words, all occurrences have the same list of normalized words.

  • weight (float) – Candidate weight in weighting algorithms.

add_occurrence(words: List[str], offset: int, pos_tags: List[str], normalized_words: List[str]) → None#

Adds a new occurrence to the candidate.

Parameters:
  • words (List[str]) – List of words of the occurrence

  • offset (int) – The offset of the occurrence

  • pos_tags (List[str]) – List of part of speech tags assigned to words of the occurrence

  • normalized_words (List[str]) – List of normalized words of the occurrence

property length: int#

Gets number of normalized words.

Returns:

Number of normalized words
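The fields and methods documented above fit a plain dataclass. The sketch below mirrors the documented signature under the hypothetical name CandidateSketch. The method bodies are assumptions about how the documented behavior could be achieved, not the library's source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSketch:
    """Hypothetical stand-in that mirrors the documented Candidate fields."""

    all_words: List[List[str]] = field(default_factory=list)
    offsets: List[int] = field(default_factory=list)
    all_pos_tags: List[List[str]] = field(default_factory=list)
    normalized_words: List[str] = field(default_factory=list)
    weight: float = 0

    def add_occurrence(
        self,
        words: List[str],
        offset: int,
        pos_tags: List[str],
        normalized_words: List[str],
    ) -> None:
        # Per-occurrence data is appended to the parallel lists
        self.all_words.append(words)
        self.offsets.append(offset)
        self.all_pos_tags.append(pos_tags)
        # All occurrences share the same normalized words, so one copy suffices
        self.normalized_words = normalized_words

    @property
    def length(self) -> int:
        # Number of normalized words
        return len(self.normalized_words)


candidate = CandidateSketch()
candidate.add_occurrence(
    ['neural', 'networks'], 3, ['ADJ', 'NOUN'], ['neural', 'network']
)
candidate.add_occurrence(
    ['Neural', 'network'], 17, ['ADJ', 'NOUN'], ['neural', 'network']
)
```

The default_factory fields correspond to the <factory> defaults in the signature: each instance gets its own empty lists rather than sharing one mutable default.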

class perke.base.data_structures.Sentence(words: List[str], pos_tags: List[str], normalized_words: List[str])#

Represents a sentence data structure.

Variables:
  • words (List[str]) – List of words

  • pos_tags (List[str]) – List of part of speech tags assigned to words

  • normalized_words (List[str]) – List of normalized words

property length: int#

Gets number of words.

Returns:

Number of words
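The three lists of a sentence are parallel: position i in each refers to the same token. The sketch below uses a hypothetical SentenceSketch stand-in to show how that layout lets an extractor keep only the words whose tags are in valid_pos_tags. The filtering step and the sample tokens are illustrative assumptions, not Perke's candidate selection code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SentenceSketch:
    """Hypothetical stand-in that mirrors the documented Sentence fields."""

    words: List[str]
    pos_tags: List[str]
    normalized_words: List[str]

    @property
    def length(self) -> int:
        # Number of words
        return len(self.words)


sentence = SentenceSketch(
    words=['deep', 'networks', 'learn', 'features'],
    pos_tags=['ADJ', 'NOUN', 'VERB', 'NOUN'],
    normalized_words=['deep', 'network', 'learn', 'featur'],
)

# Keep the normalized words whose tag is a noun or an adjective
valid_pos_tags: Set[str] = {'NOUN', 'ADJ'}
kept = [
    normalized
    for normalized, tag in zip(sentence.normalized_words, sentence.pos_tags)
    if tag in valid_pos_tags
]
```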

Readers#

class perke.base.readers.Reader(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool)#

Base Reader

Variables:
  • word_normalization_method – Word normalization method

  • normalizer – The hazm normalizer instance

  • stemmer – The hazm stemmer instance

  • lemmatizer – The hazm lemmatizer instance

  • pos_tagger – The hazm pos tagger instance

__init__(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool) → None#

Initializes the reader.

Parameters:
  • word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, see perke.base.types.WordNormalizationMethod for available methods.

  • universal_pos_tags (bool) – Whether to use universal part of speech tags or not

class perke.base.readers.RawTextReader(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags)#

Bases: Reader

Reader for raw text

Variables:

text – Raw text to read sentences from

__init__(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags) → None#

Initializes the reader.

Parameters:
  • input (str) – Input, either raw text or a file path.

  • word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, see perke.base.types.WordNormalizationMethod for available methods.

  • universal_pos_tags – Whether to use universal part of speech tags or not

read() → List[Sentence]#

Reads the input and uses hazm to preprocess.

Returns:

List of sentences

Return type:

List[Sentence]

Types#

perke.base.types.HierarchicalClusteringLinkageMethod#

alias of Literal['single', 'complete', 'average']

perke.base.types.HierarchicalClusteringMetric#

alias of Literal['euclidean', 'seuclidean', 'jaccard']

perke.base.types.TopicHeuristic#

alias of Literal['first_occurring', 'frequent']

perke.base.types.WordNormalizationMethod#

alias of Literal['stemming', 'lemmatization', None]
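Literal aliases like these are checked by static type checkers only, so an invalid string is not rejected at runtime by the annotation alone. A minimal sketch of a runtime check, using typing.get_args on a locally redefined alias (the check_method helper is hypothetical and not part of Perke):

```python
from typing import Literal, Optional, get_args

# Local copy of the alias documented above
WordNormalizationMethod = Literal['stemming', 'lemmatization', None]


def check_method(method: Optional[str]) -> Optional[str]:
    """Hypothetical helper: reject values outside the Literal alias."""
    allowed = get_args(WordNormalizationMethod)
    if method not in allowed:
        raise ValueError(
            f'Unknown word normalization method {method!r}, '
            f'expected one of {allowed}'
        )
    return method


checked = check_method('lemmatization')
```

Deriving the allowed values from the alias keeps the runtime check in sync with the type annotation.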