Perke’s Base Utils#
Extractor#
- class perke.base.extractor.Extractor(valid_pos_tags: Set[str] | None = None)#
Base extractor, provides base functions for all extractors.
- Variables:
word_normalization_method – Word normalization method.
sentences – List of sentence objects of the text
candidates – Dict mapping canonical forms of candidates to candidates. The canonical form of a candidate is a string joined from the normalized words of the candidate.
stopwords – Set of stopwords
valid_pos_tags – Set of valid part of speech tags.
- __init__(valid_pos_tags: Set[str] | None = None) -> None #
Initializes the extractor.
- Parameters:
valid_pos_tags (Set[str] | None) – Set of valid part of speech tags, defaults to nouns and adjectives, i.e. {'NOUN', 'ADJ'}.
- get_n_best(n: int = 10, remove_redundants: bool = False, normalized: bool = False) -> List[Tuple[str, float]] #
Returns the n-best candidates.
- Parameters:
n (int) – The number of candidates, defaults to 10.
remove_redundants (bool) – Whether redundant keyphrases are filtered out from the n-best list, defaults to False.
normalized (bool) – Whether to get normalized words instead of the words of the first occurring form of the candidate, defaults to False.
- Returns:
List of (candidate, weight) tuples; candidate can be either the canonical form or the joined words of the first occurrence.
- Return type:
List[Tuple[str, float]]
- load_text(input: str | Path, word_normalization_method: Literal['stemming', 'lemmatization', None] = 'stemming', universal_pos_tags: bool = True) -> None #
Loads the text of a document or string.
- Parameters:
input (str | Path) – Input, this can be either raw text or a filepath.
word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, defaults to 'stemming'. See perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags (bool) – Whether to use universal part of speech tags or not, defaults to True.
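The relationship between the candidates mapping and get_n_best can be sketched roughly as follows. This is not Perke's implementation: `ToyCandidate` and `n_best` are hypothetical stand-ins, joining words with a single space is an assumption, the remove_redundants option is left out, and English toy words are used only for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyCandidate:
    """Hypothetical stand-in for a keyphrase candidate (not Perke's class)."""
    all_words: List[List[str]] = field(default_factory=list)
    normalized_words: List[str] = field(default_factory=list)
    weight: float = 0.0


def n_best(
    candidates: Dict[str, ToyCandidate],
    n: int = 10,
    normalized: bool = False,
) -> List[Tuple[str, float]]:
    """Return up to n (candidate, weight) tuples, heaviest first."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: item[1].weight,
        reverse=True,
    )
    result = []
    for canonical_form, candidate in ranked[:n]:
        if normalized:
            # Use the canonical form (the dict key) as the candidate text.
            text = canonical_form
        else:
            # Use the words of the first occurrence (space join is assumed).
            text = ' '.join(candidate.all_words[0])
        result.append((text, candidate.weight))
    return result


# Keys are canonical forms: the normalized words joined into one string.
candidates = {
    'natur languag': ToyCandidate(
        [['natural', 'languages']], ['natur', 'languag'], 0.9
    ),
    'keyphras': ToyCandidate(
        [['keyphrases'], ['keyphrase']], ['keyphras'], 0.4
    ),
}

top_surface = n_best(candidates, n=1)
top_normalized = n_best(candidates, n=2, normalized=True)
```

With normalized left at False the first tuple carries the surface words of the first occurrence; with normalized set to True it carries the dict key instead.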
Data Structures#
- class perke.base.data_structures.Candidate(all_words: List[List[str]] = <factory>, offsets: List[int] = <factory>, all_pos_tags: List[List[str]] = <factory>, normalized_words: List[str] = <factory>, weight: float = 0)#
Represents a keyphrase candidate data structure.
- Variables:
all_words (List[List[str]]) – Nested list of words, each inner list holds the words of one occurrence of the candidate.
offsets (List[int]) – List of offsets, one per occurrence.
all_pos_tags (List[List[str]]) – Nested list of part of speech tags, each inner list holds the tags of one occurrence of the candidate.
normalized_words (List[str]) – List of normalized words, all occurrences have the same list of normalized words.
weight (float) – Candidate weight in weighting algorithms.
- add_occurrence(words: List[str], offset: int, pos_tags: List[str], normalized_words: List[str]) -> None #
Adds a new occurrence to the candidate.
- Parameters:
words (List[str]) – List of words of the occurrence
offset (int) – The offset of the occurrence
pos_tags (List[str]) – List of part of speech tags assigned to words of the occurrence
normalized_words (List[str]) – List of normalized words of the occurrence
- property length: int#
Gets number of normalized words.
- Returns:
Number of normalized words
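To make the shape of this data structure concrete, here is a minimal sketch of a dataclass with the same documented fields. `CandidateSketch` is a hypothetical stand-in, and the body of its add_occurrence method is an assumption about how occurrences accumulate, not Perke's actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSketch:
    """Hypothetical stand-in mirroring the documented Candidate fields."""
    all_words: List[List[str]] = field(default_factory=list)
    offsets: List[int] = field(default_factory=list)
    all_pos_tags: List[List[str]] = field(default_factory=list)
    normalized_words: List[str] = field(default_factory=list)
    weight: float = 0

    def add_occurrence(
        self,
        words: List[str],
        offset: int,
        pos_tags: List[str],
        normalized_words: List[str],
    ) -> None:
        # Per-occurrence data is appended, one entry per occurrence.
        self.all_words.append(words)
        self.offsets.append(offset)
        self.all_pos_tags.append(pos_tags)
        # Assumed: normalized words are shared, so they are simply stored.
        self.normalized_words = normalized_words

    @property
    def length(self) -> int:
        # Number of normalized words, as documented for Candidate.length.
        return len(self.normalized_words)


candidate = CandidateSketch()
candidate.add_occurrence(
    ['neural', 'networks'], 3, ['ADJ', 'NOUN'], ['neural', 'network']
)
candidate.add_occurrence(
    ['neural', 'network'], 17, ['ADJ', 'NOUN'], ['neural', 'network']
)
```

Two surface forms end up under one candidate because they share the same normalized words; the offsets list records where each one occurred.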
- class perke.base.data_structures.Sentence(words: List[str], pos_tags: List[str], normalized_words: List[str])#
Represents a sentence data structure.
- Variables:
words (List[str]) – List of words
pos_tags (List[str]) – List of part of speech tags assigned to words
normalized_words (List[str]) – List of normalized words
- property length: int#
Gets number of words.
- Returns:
Number of words
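The three lists of a sentence are parallel: position i in each list describes the same word. The sketch below uses a hypothetical `SentenceSketch` stand-in, not Perke's class, and the filtering by part of speech tag is only an illustration of how the lists line up, using the default tag set {'NOUN', 'ADJ'} documented for the extractor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceSketch:
    """Hypothetical stand-in mirroring the documented Sentence fields."""
    words: List[str]
    pos_tags: List[str]
    normalized_words: List[str]

    @property
    def length(self) -> int:
        # Number of words, as documented for Sentence.length.
        return len(self.words)


sentence = SentenceSketch(
    words=['deep', 'models', 'learn', 'features'],
    pos_tags=['ADJ', 'NOUN', 'VERB', 'NOUN'],
    normalized_words=['deep', 'model', 'learn', 'featur'],
)

# Keep only the words whose tag at the same index is in the valid set.
valid_pos_tags = {'NOUN', 'ADJ'}
kept_words = [
    word
    for word, tag in zip(sentence.words, sentence.pos_tags)
    if tag in valid_pos_tags
]
```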
Readers#
- class perke.base.readers.Reader(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool)#
Base Reader
- Variables:
word_normalization_method – Word normalization method
normalizer – The hazm normalizer instance
stemmer – The hazm stemmer instance
lemmatizer – The hazm lemmatizer instance
pos_tagger – The hazm pos tagger instance
- __init__(word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags: bool) -> None #
Initializes the reader.
- Parameters:
word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, see perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags (bool) – Whether to use universal part of speech tags or not
- class perke.base.readers.RawTextReader(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags)#
Bases: Reader
Reader for raw text
- Variables:
text – Raw text to read sentences from
- __init__(input: str, word_normalization_method: Literal['stemming', 'lemmatization', None], universal_pos_tags) -> None #
Initializes the reader.
- Parameters:
input (str) – Input, this can be either raw text or filepath.
word_normalization_method (Literal['stemming', 'lemmatization', None]) – Word normalization method, see perke.base.types.WordNormalizationMethod for available methods.
universal_pos_tags – Whether to use universal part of speech tags or not
Types#
- perke.base.types.HierarchicalClusteringLinkageMethod#
alias of Literal['single', 'complete', 'average']
- perke.base.types.HierarchicalClusteringMetric#
alias of Literal['euclidean', 'seuclidean', 'jaccard']
- perke.base.types.TopicHeuristic#
alias of Literal['first_occurring', 'frequent']
- perke.base.types.WordNormalizationMethod#
alias of Literal['stemming', 'lemmatization', None]
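Because these aliases are plain typing.Literal types, their allowed values can be read back at runtime with typing.get_args. The sketch below redefines the alias locally so that it runs without Perke installed; `check_method` is a hypothetical helper, not part of the library.

```python
from typing import Literal, get_args

# Local copy of the documented alias; real code would import it from
# perke.base.types instead of redefining it.
WordNormalizationMethod = Literal['stemming', 'lemmatization', None]


def check_method(method: object) -> None:
    """Raise ValueError if method is not one of the literal values."""
    allowed = get_args(WordNormalizationMethod)
    if method not in allowed:
        raise ValueError(f'Expected one of {allowed}, got {method!r}')


allowed_methods = get_args(WordNormalizationMethod)

check_method('stemming')  # accepted
check_method(None)        # accepted, None is one of the literal values
```

A static type checker catches invalid literals at analysis time; a runtime check like this one is only useful when the value comes from user input or configuration.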