nestor.tagtrees module

__author__ = "Thurston Sexton"

get_onehot(df, col, topn=700)

DEPRECATED!

get_relevant(df, col, topn=20)

DEPRECATED!

Parameters

df – a dataframe containing columns of tag assignments (comma-separated str)
col – which column to extract
topn – how many of the most frequent tags to return

Returns

list of (tag, count, numpy.array) tuples

heymann_taxonomy(dist_mat, cent_prog='pr', tau=0.0005, dynamic=False, dotfile=None, verbose=False)

Parameters

dist_mat (pandas.DataFrame) – similarity matrix whose index and columns are both labeled by tags
cent_prog (str) – algorithm to use in calculating node centrality:

pr: PageRank
eig: eigencentrality
btw: betweenness
cls: closeness
tau (float) – similarity threshold for retaining a node
dynamic (bool) – whether to re-calculate centrality after popping each tag
dotfile (str or None) – file location at which to save a .dot file, if any
verbose (bool) – whether to print diagnostic output
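
As a rough illustration of how a Heymann-style taxonomy builder uses these parameters, the sketch below ranks tags by centrality, then attaches each tag to the already-placed tag it is most similar to, falling back to a root node when no similarity exceeds tau. This is not nestor's implementation: toy_heymann, the "ROOT" label, and the example matrix are hypothetical, and centrality is approximated by summed similarity (weighted degree) rather than any of the cent_prog options. The dynamic, dotfile, and verbose options are not modeled.

```python
import numpy as np
import pandas as pd


def toy_heymann(sim, tau=0.0005, root="ROOT"):
    """Return {tag: parent} for a Heymann-style taxonomy (illustrative only)."""
    # Stand-in centrality (assumption): total similarity to all other tags.
    centrality = sim.sum(axis=1) - np.diag(sim.to_numpy())
    order = centrality.sort_values(ascending=False).index

    parents = {}
    placed = []  # tags already in the tree, most central first
    for tag in order:
        parent = root
        if placed:
            candidates = sim.loc[tag, placed]
            best = candidates.idxmax()
            # Attach under the most similar placed tag only if it clears tau.
            if candidates.loc[best] > tau:
                parent = best
        parents[tag] = parent
        placed.append(tag)
    return parents


# Made-up similarity matrix, indexed and named by tags.
labels = ["pump", "seal", "leak", "motor"]
sim = pd.DataFrame(
    [[1.0, 0.6, 0.5, 0.0],
     [0.6, 1.0, 0.4, 0.0],
     [0.5, 0.4, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]],
    index=labels, columns=labels,
)
taxonomy = toy_heymann(sim, tau=0.0005)
```

Here "pump" is the most central tag and becomes a child of the root, "seal" and "leak" attach beneath it, and "motor" (similarity 0 to everything else) falls below tau and is also placed under the root.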

node_adj_mat(tag_df, similarity='cosine', dag=False, pct_thres=None)

Calculate the similarity of tags, in the form of a similarity kernel. Used as input to graph/network methods.

Parameters

tag_df (pandas.DataFrame) – standard Nestor tag occurrence matrix. Multi-column with top-level containing tag classifications (named-entity NE) and 2nd level containing tags. Each row corresponds to a single event (MWO), with binary indicators (1-occurs, 0-does not).
similarity (str) – similarity measure to use:

cosine: cosine similarity (from sklearn.metrics.pairwise)
count: the number of co-occurrences of each tag-tag pair
dag (bool) – by default, adj_mat spans all nodes. If True, return a directed acyclic graph (DAG) instead, useful for things like Sankey diagrams. The current implementation returns a (P) -> (I) -> (S) structure (deletes others).
pct_thres (int or None) – if int, must lie in [0, 100]: the lower percentile at which to threshold edges/adjacency.

Returns

pandas.DataFrame, containing adjacency measures for each tag-tag (row-column) occurrence
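
To make the two similarity options and pct_thres concrete, here is a minimal sketch using NumPy and pandas on a flat (single-level) binary tag matrix. It is an illustration under assumptions, not nestor's code: toy_adjacency is a hypothetical helper, the tag names are invented, cosine similarity is computed by hand rather than through sklearn, and zeroing entries below the percentile is only one plausible reading of the thresholding. The dag option is not modeled.

```python
import numpy as np
import pandas as pd


def toy_adjacency(tags, similarity="cosine", pct_thres=None):
    """Tag-by-tag adjacency from a binary event-by-tag matrix (illustrative only)."""
    x = tags.to_numpy(dtype=float)
    counts = x.T @ x  # entry (i, j) = number of events containing both tags
    if similarity == "count":
        adj = counts
    elif similarity == "cosine":
        # For binary columns, the squared norm equals the tag's frequency.
        # Assumes every tag occurs at least once (otherwise division by zero).
        norms = np.sqrt(np.diag(counts))
        adj = counts / np.outer(norms, norms)
    else:
        raise ValueError(f"unknown similarity: {similarity!r}")
    if pct_thres is not None:
        # Assumed reading: zero out weights below this percentile of the
        # off-diagonal (tag-pair) values.
        pair_weights = adj[np.triu_indices_from(adj, k=1)]
        cutoff = np.percentile(pair_weights, pct_thres)
        adj[adj < cutoff] = 0.0
    return pd.DataFrame(adj, index=tags.columns, columns=tags.columns)


# Four made-up events (rows) and three made-up tags (columns).
tags = pd.DataFrame({
    "pump":  [1, 1, 0, 1],
    "leak":  [1, 1, 0, 0],
    "motor": [0, 0, 1, 1],
})
count_adj = toy_adjacency(tags, similarity="count")
cosine_adj = toy_adjacency(tags, similarity="cosine")
thresholded = toy_adjacency(tags, similarity="count", pct_thres=75)
```

In this toy data "pump" and "leak" co-occur in two events, so their count adjacency is 2 and their cosine similarity is 2/sqrt(3*2). With pct_thres=75 the weaker pump-motor link is zeroed while pump-leak survives.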

tag_df_network(tag_df, **node_adj_kws)

Starting from a multi-column binary tag-occurrence pandas.DataFrame (such as output by the Nestor UI and the nestor.keyword.tag_extractor() method), create a networkx graph, along with node_info and edge_info dataframes for plotting convenience (e.g. in nestor.tagplots).

Parameters

tag_df (pandas.DataFrame) – standard Nestor tag occurrence matrix. Multi-column with top-level containing tag classifications (named-entity NE) and 2nd level containing tags. Each row corresponds to a single event (MWO), with binary indicators (1-occurs, 0-does not).
node_adj_kws – keyword arguments, presumably forwarded to node_adj_mat()

tag_network(adj_mat, column_lvl=0)

Takes in an adjacency matrix (pandas.DataFrame, assumes multi-col/row) and returns a networkx Graph object with those nodes/edge weights.
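
For a feel of the adjacency-matrix-to-graph step, the sketch below uses networkx's from_pandas_adjacency on a small, flat adjacency DataFrame. It sidesteps the multi-level index handling that column_lvl presumably controls, and the matrix values are made up, so treat it as an approximation of the idea rather than a description of what tag_network does internally.

```python
import networkx as nx
import pandas as pd

# Made-up symmetric adjacency matrix over three hypothetical tags.
labels = ["pump", "leak", "motor"]
adj = pd.DataFrame(
    [[0.0, 0.8, 0.4],
     [0.8, 0.0, 0.0],
     [0.4, 0.0, 0.0]],
    index=labels, columns=labels,
)

# Zero entries yield no edge; nonzero entries become the edge "weight" attribute.
graph = nx.from_pandas_adjacency(adj)

# A flat edge table (source, target, weight), loosely in the spirit of edge_info.
edge_info = nx.to_pandas_edgelist(graph)
```

The resulting graph has one node per tag and an edge for each nonzero pair, and the edge table can be handed to a plotting routine.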