nestor.tagtrees module

__author__ = "Thurston Sexton"

get_onehot(df, col, topn=700)[source]

DEPRECATED!

get_relevant(df, col, topn=20)[source]

DEPRECATED!

Parameters
  • df – a dataframe containing columns of tag assignments (comma-separated strings)

  • col – which column to extract

  • topn – how many of the most frequent tags to return

Returns

list of (tag,count,numpy.array) tuples

heymann_taxonomy(dist_mat, cent_prog='pr', tau=0.0005, dynamic=False, dotfile=None, verbose=False)[source]
Parameters
  • dist_mat (pandas.DataFrame) – contains similarity matrix, indexed and named by tags

  • cent_prog (str) – algorithm to use in calculating node centrality

    pr: PageRank
    eig: eigencentrality
    btw: betweenness
    cls: closeness

  • tau (float) – similarity threshold for retaining a node

  • dynamic (bool) – whether to re-calculate centrality after popping every tag

  • dotfile (str or None) – path at which to save a .dot file, or None to skip saving.

  • verbose (bool) – whether to print diagnostic output while building the taxonomy
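
The parameters above suggest the greedy taxonomy construction usually attributed to Heymann and Garcia-Molina: rank tags by centrality, then attach each tag to the most similar tag already placed, provided their similarity exceeds tau. The following is a minimal sketch of that idea, not Nestor's implementation. The function name heymann_sketch, the "ROOT" label, and the toy similarity values are hypothetical, and weighted degree is swapped in for the PageRank default to keep the example dependency-light. The dynamic option (re-computing centrality after each tag) is not shown.

```python
import pandas as pd


def heymann_sketch(sim, tau=0.0005):
    """Return a {child: parent} mapping; "ROOT" is a hypothetical root label."""
    # Stand-in centrality (an assumption): weighted degree, i.e. the sum of
    # each tag's similarities to all other tags, excluding self-similarity.
    centrality = sim.sum(axis=1) - sim.to_numpy().diagonal()
    order = centrality.sort_values(ascending=False).index

    parent = {}
    for tag in order:
        placed = list(parent)  # tags already in the tree
        if placed:
            # Most similar tag among those already placed.
            best = sim.loc[tag, placed].idxmax()
            if sim.loc[tag, best] > tau:
                parent[tag] = best
                continue
        # First tag, or nothing similar enough: attach to the root.
        parent[tag] = "ROOT"
    return parent


# Toy similarity matrix, indexed and named by tags (made-up values).
tags = ["pump", "leak", "seal", "motor"]
sim = pd.DataFrame(
    [[1.0, 0.8, 0.6, 0.0],
     [0.8, 1.0, 0.7, 0.0],
     [0.6, 0.7, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]],
    index=tags,
    columns=tags,
)

taxonomy = heymann_sketch(sim, tau=0.0005)
print(taxonomy)
```

Here "motor" shares no similarity with the other tags, so it falls below tau and is attached directly to the root rather than to another tag.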

node_adj_mat(tag_df, similarity='cosine', dag=False, pct_thres=None)[source]

Calculate the similarity of tags, in the form of a similarity kernel. Used as input to graph/network methods.

Parameters
  • tag_df (pandas.DataFrame) – standard Nestor tag occurrence matrix. Multi-column with top-level containing tag classifications (named-entity NE) and 2nd level containing tags. Each row corresponds to a single event (MWO), with binary indicators (1-occurs, 0-does not).

  • similarity (str) – similarity measure to use

    cosine: cosine similarity (from sklearn.metrics.pairwise)
    count: the number of co-occurrences of each tag-tag pair

  • dag (bool) – by default, adj_mat is computed across all nodes. If True, return a directed acyclic graph (DAG) instead, which is useful for things like Sankey diagrams. The current implementation returns a (P) -> (I) -> (S) structure and deletes all other classifications.

  • pct_thres (int or None) – if an int, a value in [0, 100] giving the lower percentile at which to threshold edges/adjacency.

Returns

pandas.DataFrame, containing adjacency measures for each tag-tag (row-column) occurrence
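
To make the two similarity options concrete, the sketch below computes a co-occurrence count and a cosine similarity from a small binary tag-occurrence matrix using only NumPy and pandas. It illustrates the quantities described above and is not Nestor's code: the toy tags, the variable names, and the way the percentile cutoff is applied (over the off-diagonal pairs) are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 events (rows) x 3 tags, with two-level columns
# (NE classification on top, tag underneath).
cols = pd.MultiIndex.from_tuples(
    [("P", "leak"), ("I", "pump"), ("S", "replace")], names=["NE", "tag"]
)
tag_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0]],
    columns=cols,
)

X = tag_df.to_numpy(dtype=float)

# "count": number of events in which each tag-tag pair co-occurs.
count = pd.DataFrame(X.T @ X, index=cols, columns=cols)

# "cosine": the same dot products, normalised by each tag column's length.
norms = np.linalg.norm(X, axis=0)
cosine = pd.DataFrame((X.T @ X) / np.outer(norms, norms), index=cols, columns=cols)

# Assumed reading of pct_thres: zero out entries below the given lower
# percentile of the off-diagonal (tag-pair) values.
pct_thres = 50
pair_values = count.to_numpy()[np.triu_indices(len(cols), k=1)]
cutoff = np.percentile(pair_values, pct_thres)
thresholded = count.where(count >= cutoff, 0)

print(count)
print(cosine.round(3))
print(thresholded)
```

Both results are square tag-by-tag DataFrames, which matches the row-column layout of the return value described above.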

tag_df_network(tag_df, **node_adj_kws)[source]

Starting from a multi-column binary tag-occurrence pandas.DataFrame (such as that output by the Nestor UI and the nestor.keyword.tag_extractor() method), create a networkx graph, along with node_info and edge_info dataframes for plotting convenience (e.g. in nestor.tagplots).

Parameters
  • tag_df (pandas.DataFrame) – standard Nestor tag occurrence matrix. Multi-column with top-level containing tag classifications (named-entity NE) and 2nd level containing tags. Each row corresponds to a single event (MWO), with binary indicators (1-occurs, 0-does not).

  • node_adj_kws – keyword arguments passed on to node_adj_mat() (e.g. similarity, dag, pct_thres)

tag_network(adj_mat, column_lvl=0)[source]

Takes in an adjacency matrix (pandas.DataFrame, assumes multi-col/row) and returns a networkx Graph object with those nodes/edge weights.
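
As a rough illustration of turning an adjacency DataFrame into a weighted graph, the sketch below uses networkx's from_pandas_adjacency. This is an assumption about one way to do it, not a description of how tag_network is implemented; it uses a flat (single-level) index with made-up tags and does not model the column_lvl argument.

```python
import networkx as nx
import pandas as pd

# Hypothetical symmetric adjacency matrix; the diagonal is zeroed so that
# no self-loops are created.
tags = ["pump", "leak", "seal"]
adj_mat = pd.DataFrame(
    [[0, 2, 0],
     [2, 0, 1],
     [0, 1, 0]],
    index=tags,
    columns=tags,
)

# Non-zero entries become edges, with the entry stored as the "weight"
# edge attribute; zero entries produce no edge.
G = nx.from_pandas_adjacency(adj_mat)

for u, v, w in G.edges(data="weight"):
    print(u, v, w)
```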