nestor.keyword module
author: Thurston Sexton
class NLPSelect(columns=0, special_replace=None)[source]
Bases: nestor.keyword.Transformer
Extract specified natural language columns from a pd.DataFrame, and combine them into a single series.
- Parameters
columns (int, or list of int or str) – corresponding columns in X to extract, clean, and merge
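For orientation, the column-merge behavior can be sketched in plain pandas; the column names below are hypothetical, and this is not NLPSelect's actual implementation (which also cleans the text):

```python
import pandas as pd

# Toy maintenance work-order data with two free-text columns
# (hypothetical column names).
df = pd.DataFrame({
    "issue": ["Pump leaking", "Motor failed"],
    "action": ["replaced seal", "rewound motor"],
})

# Concatenate the natural-language columns into one lowercase series,
# roughly what NLPSelect(columns=["issue", "action"]) would select and merge.
merged = df["issue"].str.cat(df["action"], sep=" ").str.lower()
```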
class TokenExtractor(**tfidf_kwargs)[source]
Bases: sklearn.base.TransformerMixin
A wrapper for the sklearn TfidfVectorizer class, with utilities for ranking tokens by total tf-idf score and retrieving the learned vocabulary.
- Parameters
tfidf_kwargs (arguments passed to sklearn's TfidfVectorizer) –
Valid options modified here (see the sklearn docs for more options) are:
- input : string {'filename', 'file', 'content'}, default='content'
If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.
Otherwise, the input is expected to be a sequence of string or bytes items, which are analyzed directly.
- ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- stop_words : string {'english'} (default), list, or None
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
- max_features : int or None, default=5000
If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
- smooth_idf : boolean, default=False
Smooth idf weights by adding one to document frequencies, as if an extra document were seen containing every term in the collection exactly once. Prevents zero divisions.
- sublinear_tf : boolean, default=True
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
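Note that the last two defaults (smooth_idf=False, sublinear_tf=True) differ from sklearn's own. A minimal numpy sketch of what each option does to the weights, using toy counts rather than sklearn's full pipeline:

```python
import numpy as np

# Hypothetical statistics: raw term counts in one document, and the
# document frequency of each term over a 4-document corpus.
tf = np.array([0.0, 1.0, 3.0])
n_docs = 4
df_counts = np.array([1, 2, 4])

# sublinear_tf=True: replace nonzero tf with 1 + log(tf).
tf_sub = np.zeros_like(tf)
mask = tf > 0
tf_sub[mask] = 1.0 + np.log(tf[mask])

# smooth_idf=False (the default here) uses the raw document counts;
# smooth_idf=True would add one to both counts, as if an extra document
# contained every term exactly once.
idf_raw = np.log(n_docs / df_counts) + 1.0
idf_smooth = np.log((1 + n_docs) / (1 + df_counts)) + 1.0
```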
fit_transform(self, X, y=None, **fit_params)[source]
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params, and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
- Returns
X_new (numpy array of shape [n_samples, n_features_new]) – Transformed array.
ranks_
Retrieve the rank of each token, for sorting. Uses summed scoring over the TF-IDF for each token, so that \(S_t = \sum_{\text{MWO}} \text{TF-IDF}_t\)
- Returns
ranks (numpy.array)
scores_
Returns the actual scores of tokens, for progress-tracking (unit-normalized).
- Returns
numpy.array
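Both properties reduce to summing a document × token tf-idf matrix over the document axis; a toy numpy sketch with hypothetical values:

```python
import numpy as np

# Hypothetical tf-idf matrix: 3 documents (MWOs) x 4 tokens.
tfidf = np.array([
    [0.2, 0.0, 0.5, 0.1],
    [0.0, 0.4, 0.3, 0.0],
    [0.1, 0.0, 0.6, 0.2],
])

summed = tfidf.sum(axis=0)          # S_t: total score per token
ranks = summed.argsort()[::-1]      # token indices, highest score first
scores = summed / summed.sum()      # unit-normalized, as in scores_
```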
generate_vocabulary_df(transformer, filename=None, init=None)[source]
Helper method to create a formatted pandas.DataFrame and/or a .csv containing the token–tag/alias–classification relationship. Formatted as jargon/slang tokens, the Named Entity classifications, preferred labels, notes, and summed tf-idf scores:

tokens | NE | alias | notes | scores

This is intended to be filled out in Excel or using the Tagging Tool.
- Parameters
transformer (object TokenExtractor) – the (TRAINED) token extractor used to generate the ranked list of vocab.
filename (str, optional) – the file location to read/write a csv containing a formatted vocabulary list
init (str or pd.DataFrame, optional) – file location of csv or dataframe of existing vocab list to read and update token classification values from
- Returns
vocab (pd.DataFrame) – the correctly formatted vocabulary list for token:NE, alias matching
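Assembled by hand, the expected layout looks like the following; the token rows, NE codes, and scores are illustrative assumptions, not nestor output:

```python
import pandas as pd

# Hypothetical vocabulary rows, indexed by token. The NE codes ("I" for
# item, "P" for problem) are assumed for illustration, as are the aliases
# and the summed tf-idf scores.
vocab = pd.DataFrame(
    {
        "NE": ["I", "P"],
        "alias": ["pump", "leak"],
        "notes": ["", ""],
        "scores": [0.42, 0.31],
    },
    index=pd.Index(["pmp", "leaking"], name="tokens"),
)
```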
get_tag_completeness(tag_df)[source]
- Parameters
tag_df (pd.DataFrame) – hierarchical-column DataFrame containing tags
tag_extractor(transformer, raw_text, vocab_df=None, readable=False)[source]
Wrapper for the TokenExtractor to streamline the generation of tags from text. Determines the documents in raw_text that contain each of the tags in vocab_df, using a TokenExtractor transformer object (i.e. the tf-idf vocabulary).
As implemented, this function expects an existing transformer object, though in the future this will be changed to a class-like functionality (e.g. sklearn’s AdaBoostClassifier, etc) which wraps a transformer into a new one.
- Parameters
transformer (TokenExtractor) – instantiated; can be pre-trained
raw_text (pd.Series) – contains jargon/slang-filled raw text to be tagged
vocab_df (pd.DataFrame, optional) – An existing vocabulary dataframe or .csv filename, expected in the format of kex.generate_vocabulary_df().
readable (bool, default False) – whether to return a readable, categorized, comma-separated str format (takes longer)
- Returns
pd.DataFrame – extracted tags for each document, either as a binary indicator (default) or in a readable, categorized, comma-separated str format (readable=True; takes longer)
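The default binary-indicator output can be approximated in plain pandas; this is a sketch of the output shape with invented documents and tags, not nestor's implementation (which matches via the tf-idf vocabulary):

```python
import pandas as pd

# Hypothetical documents and vocabulary tags.
docs = pd.Series(["pump leak at seal", "motor failed", "pump motor noisy"])
tags = ["pump", "motor", "leak"]

# One column per tag: 1 if the token occurs in the document, else 0.
tag_df = pd.DataFrame(
    {t: docs.str.contains(rf"\b{t}\b", regex=True).astype(int) for t in tags}
)
```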
token_to_alias(raw_text, vocab)[source]
Replaces known tokens with their "tag" form, i.e. the aliases in some known vocabulary list.
- Parameters
raw_text (pd.Series) – contains text with known jargon, slang, etc
vocab (pd.DataFrame) – contains aliases keyed on known slang, jargon, etc.
- Returns
pd.Series – new text, with all slang/jargon replaced with unified representations
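One way to sketch the substitution with plain pandas and re, using hypothetical tokens and aliases (not nestor's implementation):

```python
import re
import pandas as pd

raw_text = pd.Series(["pmp is leaking", "replaced pmp seal"])
vocab = pd.DataFrame(
    {"alias": ["pump", "leak"]},
    index=pd.Index(["pmp", "leaking"], name="tokens"),
)

# One alternation over all known tokens; each match is replaced by the
# alias stored in the vocab for that token.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, vocab.index)) + r")\b")
clean = raw_text.str.replace(
    pattern, lambda m: vocab.loc[m.group(0), "alias"], regex=True
)
```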