nestor.keyword module
author: Thurston Sexton
class NLPSelect(columns=0, special_replace=None)[source]
Bases: nestor.keyword.Transformer
Extract specified natural language columns from a pd.DataFrame, and combine them into a single series.
- Parameters
columns (int, or list of int or str) – corresponding columns in X to extract, clean, and merge
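For orientation, the column-merge behavior can be sketched in plain pandas; the column names below are hypothetical, and this is not NLPSelect's actual implementation (which also cleans the text):

```python
import pandas as pd

# Toy maintenance work-order data with two free-text columns
# (hypothetical column names).
df = pd.DataFrame({
    "issue": ["Pump leaking", "Motor failed"],
    "action": ["replaced seal", "rewound motor"],
})

# Concatenate the natural-language columns into one lowercase series,
# roughly what NLPSelect(columns=["issue", "action"]) would select and merge.
merged = df["issue"].str.cat(df["action"], sep=" ").str.lower()
```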
class TokenExtractor(**tfidf_kwargs)[source]
Bases: sklearn.base.TransformerMixin
A wrapper for the sklearn TfidfVectorizer class, with utilities for ranking tokens by total tf-idf score and retrieving the learned vocabulary.
- Parameters
tfidf_kwargs (arguments passed to sklearn's TfidfVectorizer) –
Valid options modified here (see the sklearn docs for more options) are:
- input : string {'filename', 'file', 'content'}, default='content'
If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.
Otherwise, the input is expected to be a sequence of string or bytes items, which are analyzed directly.
- ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- stop_words : string {'english'} (default), list, or None
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
- max_features : int or None, default=5000
If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
- smooth_idf : boolean, default=False
Smooth idf weights by adding one to document frequencies, as if an extra document were seen containing every term in the collection exactly once. Prevents zero divisions.
- sublinear_tf : boolean, default=True
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
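Note that the last two defaults (smooth_idf=False, sublinear_tf=True) differ from sklearn's own. A minimal numpy sketch of what each option does to the weights, using toy counts rather than sklearn's full pipeline:

```python
import numpy as np

# Hypothetical statistics: raw term counts in one document, and the
# document frequency of each term over a 4-document corpus.
tf = np.array([0.0, 1.0, 3.0])
n_docs = 4
df_counts = np.array([1, 2, 4])

# sublinear_tf=True: replace nonzero tf with 1 + log(tf).
tf_sub = np.zeros_like(tf)
mask = tf > 0
tf_sub[mask] = 1.0 + np.log(tf[mask])

# smooth_idf=False (the default here) uses the raw document counts;
# smooth_idf=True would add one to both counts, as if an extra document
# contained every term exactly once.
idf_raw = np.log(n_docs / df_counts) + 1.0
idf_smooth = np.log((1 + n_docs) / (1 + df_counts)) + 1.0
```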
fit_transform(self, X, y=None, **fit_params)[source]
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params, and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
- Returns
X_new (numpy array of shape [n_samples, n_features_new]) – Transformed array.
ranks_
Retrieve the rank of each token, for sorting. Uses summed scoring over the TF-IDF for each token, so that \(S_t = \sum_{\text{MWO}} \text{TF-IDF}_t\)
- Returns
ranks (numpy.array)
scores_
Returns the actual scores of tokens, for progress-tracking (unit-normalized).
- Returns
numpy.array
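Both properties reduce to summing a document × token tf-idf matrix over the document axis; a toy numpy sketch with hypothetical values:

```python
import numpy as np

# Hypothetical tf-idf matrix: 3 documents (MWOs) x 4 tokens.
tfidf = np.array([
    [0.2, 0.0, 0.5, 0.1],
    [0.0, 0.4, 0.3, 0.0],
    [0.1, 0.0, 0.6, 0.2],
])

summed = tfidf.sum(axis=0)          # S_t: total score per token
ranks = summed.argsort()[::-1]      # token indices, highest score first
scores = summed / summed.sum()      # unit-normalized, as in scores_
```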
generate_vocabulary_df(transformer, filename=None, init=None)[source]
Helper method to create a formatted pandas.DataFrame and/or a .csv containing the token–tag/alias–classification relationship. Formatted as jargon/slang tokens, the Named Entity classifications, preferred labels, notes, and summed tf-idf scores:

tokens | NE | alias | notes | scores

This is intended to be filled out in Excel or using the Tagging Tool.
- Parameters
transformer (object TokenExtractor) – the (TRAINED) token extractor used to generate the ranked list of vocab.
filename (str, optional) – the file location to read/write a csv containing a formatted vocabulary list
init (str or pd.DataFrame, optional) – file location of csv or dataframe of existing vocab list to read and update token classification values from
- Returns
vocab (pd.DataFrame) – the correctly formatted vocabulary list for token:NE, alias matching
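Assembled by hand, the expected layout looks like the following; the token rows, NE codes, and scores are illustrative assumptions, not nestor output:

```python
import pandas as pd

# Hypothetical vocabulary rows, indexed by token. The NE codes ("I" for
# item, "P" for problem) are assumed for illustration, as are the aliases
# and the summed tf-idf scores.
vocab = pd.DataFrame(
    {
        "NE": ["I", "P"],
        "alias": ["pump", "leak"],
        "notes": ["", ""],
        "scores": [0.42, 0.31],
    },
    index=pd.Index(["pmp", "leaking"], name="tokens"),
)
```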
get_tag_completeness(tag_df)[source]
- Parameters
tag_df (pd.DataFrame) – hierarchical-column DataFrame containing tags
tag_extractor(transformer, raw_text, vocab_df=None, readable=False)[source]
Wrapper for the TokenExtractor to streamline the generation of tags from text. Determines the documents in raw_text that contain each of the tags in vocab_df, using a TokenExtractor transformer object (i.e. the tf-idf vocabulary).
As implemented, this function expects an existing transformer object, though in the future this will be changed to a class-like functionality (e.g. sklearn’s AdaBoostClassifier, etc) which wraps a transformer into a new one.
- Parameters
transformer (TokenExtractor) – instantiated; can be pre-trained
raw_text (pd.Series) – contains jargon/slang-filled raw text to be tagged
vocab_df (pd.DataFrame, optional) – An existing vocabulary dataframe or .csv filename, expected in the format of kex.generate_vocabulary_df().
readable (bool, default False) – whether to return a readable, categorized, comma-separated str format (takes longer)
- Returns
pd.DataFrame – extracted tags for each document, either as a binary indicator (default) or in a readable, categorized, comma-separated str format (readable=True; takes longer)
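The default binary-indicator output can be approximated in plain pandas; this is a sketch of the output shape with invented documents and tags, not nestor's implementation (which matches via the tf-idf vocabulary):

```python
import pandas as pd

# Hypothetical documents and vocabulary tags.
docs = pd.Series(["pump leak at seal", "motor failed", "pump motor noisy"])
tags = ["pump", "motor", "leak"]

# One column per tag: 1 if the token occurs in the document, else 0.
tag_df = pd.DataFrame(
    {t: docs.str.contains(rf"\b{t}\b", regex=True).astype(int) for t in tags}
)
```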
token_to_alias(raw_text, vocab)[source]
Replaces known tokens with their "tag" form, i.e. the aliases in some known vocabulary list.
- Parameters
raw_text (pd.Series) – contains text with known jargon, slang, etc
vocab (pd.DataFrame) – contains aliases keyed on known slang, jargon, etc.
- Returns
pd.Series – new text, with all slang/jargon replaced with unified representations
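One way to sketch the substitution with plain pandas and re, using hypothetical tokens and aliases (not nestor's implementation):

```python
import re
import pandas as pd

raw_text = pd.Series(["pmp is leaking", "replaced pmp seal"])
vocab = pd.DataFrame(
    {"alias": ["pump", "leak"]},
    index=pd.Index(["pmp", "leaking"], name="tokens"),
)

# One alternation over all known tokens; each match is replaced by the
# alias stored in the vocab for that token.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, vocab.index)) + r")\b")
clean = raw_text.str.replace(
    pattern, lambda m: vocab.loc[m.group(0), "alias"], regex=True
)
```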