Survival Analysis

Mining Excavator dataset case study

[1]:

from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import nestor
from nestor import keyword as kex
import nestor.datasets as dat
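
def set_style():
    # Use the "paper" context (font sizes suited to a figure in a paper) and a serif font.
    # Both go in one sns.set() call: sns.set() resets the context to its default
    # ("notebook"), so a separate set_context("paper") before it would be undone.
    sns.set(context="paper", font="serif")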

    # Make the background white, and specify the specific font family
    sns.set_style("white", {
        "font.family": "serif",
        "font.serif": ["Times", "Palatino", "serif"]
    })
set_style()

/home/tbsexton/anaconda3/envs/nestor-dev/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

[2]:

df = dat.load_excavators()
df.head().style

[2]:

	BscStartDate	Asset	OriginalShorttext	PMType	Cost
0	2004-07-01 00:00:00	A	BUCKET WON'T OPEN	PM01	183.05
1	2005-03-20 00:00:00	A	L/H BUCKET CYL LEAKING.	PM01	407.4
2	2006-05-05 00:00:00	A	SWAP BUCKET	PM01	0
3	2006-07-11 00:00:00	A	FIT BUCKET TOOTH	PM01	0
4	2006-11-10 00:00:00	A	REFIT BUCKET TOOTH	PM01	1157.27

Knowledge Extraction

Import vocabulary from tagging tool

[3]:

# merge and cleanse NLP-containing columns of the data
nlp_select = kex.NLPSelect(columns = ['OriginalShorttext'])
raw_text = nlp_select.transform(df)

[4]:

tex = kex.TokenExtractor()
toks = tex.fit_transform(raw_text)

#Import vocabulary
vocab_path = Path('.')/'support'/'mine_vocab_1g.csv'
vocab = kex.generate_vocabulary_df(tex, init=vocab_path)
tag_df = kex.tag_extractor(tex, raw_text, vocab_df=vocab)

relation_df = tag_df.loc[:, ['P I', 'S I']]
tags_read = kex._get_readable_tag_df(tag_df)
tag_df = tag_df.loc[:, ['I', 'P', 'S', 'U', 'X', 'NA']]

intialized successfully!
intialized successfully!

Quality of Extracted Keywords

[5]:

nbins = int(np.percentile(tag_df.sum(axis=1), 90))
print(f'Docs have at most {nbins} tokens (90th percentile)')

Docs have at most 5 tokens (90th percentile)

[6]:

tags_read.join(df[['OriginalShorttext']]).sample(10)

[6]:

	I	NA	P	S	U	X	OriginalShorttext
5033	engine, light, bay			changeout			Eng bay lights u/s changeout
3813	text, bolts		broken				broken bolts TEXT
2436	line, steel, mcv			reseal			Reseal MCV Steel lines
1963	hyd		error	repair	temp		REPAIR HYDRAULIC TEMP ERROR
2020	pump, hyd, valve	1main		reseal	relief		reseal#1main hyd. pump relief valve.
2105	hose, control, mcv			replace			REPLACE MCV CONTROL HOSE.
4884	right_hand, camera				working		RH CAMERA NOT WORKING
3896	horn, bracket	fit, 2nd		mounting		make	Fit up 2nd horn & make mounting bracket
5009	light, rear, counterweight			replace			Replace rear counterweight lights x 2
2996	lube		fault				lube fault

[7]:

# how many instances of each keyword class are there?
print('named entities: ')
print('I\tItem\nP\tProblem\nS\tSolution')
print('U\tUnknown\nX\tStop Word')
print('total tokens: ', vocab.NE.notna().sum())
print('total tags: ', vocab.groupby("NE").nunique().alias.sum())
vocab.groupby("NE").nunique()

named entities:
I       Item
P       Problem
S       Solution
U       Unknown
X       Stop Word
total tokens:  1767
total tags:  492

[7]:

	NE	alias	notes	score
NE
	1	3	2	766
I	1	317	19	585
P	1	53	6	119
S	1	42	2	95
U	1	68	57	92
X	1	9	1	9

Effectiveness of Tags

The goal, in some sense, is to remove low-occurrence, unimportant information from our data and to form concept conglomerates that allow more useful statistical inferences to be made. As the next plot shows, tags from nestor-gui have no single-occurrence concepts, compared to several thousand among the raw tokens (this is by design, of course). Additionally, high-occurrence concepts that might have had misspellings or synonyms drastically improve their average occurrence rate.
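The effect of merging misspellings and synonyms can be illustrated with a toy example. This is only a sketch: the token list and the `alias` mapping below are hypothetical stand-ins for real work-order text and an annotated vocabulary, and the helper names are not part of nestor.

```python
from collections import Counter

# Hypothetical raw tokens drawn from work-order text (illustrative only).
raw_tokens = [
    "hydraulic", "hyd", "hydralic", "hyd", "bucket", "bucket", "buckt",
    "leak", "leaking", "leaks", "tooth", "teeth", "xj9",
]

# Hypothetical token -> tag alias mapping, standing in for an annotated vocabulary.
# Tokens missing from the mapping are treated as unannotated and dropped.
alias = {
    "hydraulic": "hyd", "hyd": "hyd", "hydralic": "hyd",
    "bucket": "bucket", "buckt": "bucket",
    "leak": "leak", "leaking": "leak", "leaks": "leak",
    "tooth": "tooth", "teeth": "tooth",
}

token_counts = Counter(raw_tokens)
tag_counts = Counter(alias[t] for t in raw_tokens if t in alias)


def singletons(counts):
    """Return the sorted concepts that occur exactly once."""
    return sorted(k for k, v in counts.items() if v == 1)


def mean_count(counts):
    """Average number of occurrences per distinct concept."""
    return sum(counts.values()) / len(counts)


print("singleton tokens:", singletons(token_counts))
print("singleton tags:  ", singletons(tag_counts))
print(f"mean occurrences per token: {mean_count(token_counts):.2f}")
print(f"mean occurrences per tag:   {mean_count(tag_counts):.2f}")
```

In this toy data most raw tokens occur only once, while every tag occurs at least twice and the mean occurrence count per concept rises after aliasing.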

[8]:

# Number of documents in which each raw token appears at least once
cts = (tex._model.transform(raw_text) > 0.).astype(int).toarray().sum(axis=0)

# Raw-token document frequencies, on log-spaced bins
sns.distplot(cts,
             bins=np.logspace(0, 3, 10),
             norm_hist=False,
             kde=False,
             label='Token Frequencies',
             hist_kws={'color': 'grey'})

# Frequencies of the Item, Problem, and Solution tags, on the same bins
sns.distplot(tag_df[['I', 'P', 'S']].sum(),
             bins=np.logspace(0, 3, 10),
             norm_hist=False,
             kde=False,
             label='Tag Frequencies',
             hist_kws={'hatch': '///', 'color': 'dodgerblue'})

plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.xlabel('Tag/Token Frequencies')
plt.ylabel('# Instances')
sns.despine()
plt.savefig('toks_v_tags.png', dpi=300)

../_images/notebooks_survival-analysis_12_0.png

[9]:

# tag-completeness of work-orders?
tag_pct, tag_comp, tag_empt = kex.get_tag_completeness(tag_df)

# with sns.axes_style('ticks') as style:
sns.distplot(tag_pct.dropna(),
             kde=False, bins=nbins,
             kde_kws={'cut':0})
plt.xlim(0.1, 1.0)
plt.xlabel('precision (PPV)')

Tag completeness: 0.94 +/- 0.13
Complete Docs: 4444, or 81.02%
Empty Docs: 48, or 0.88%

[9]:

Text(0.5,0,'precision (PPV)')

../_images/notebooks_survival-analysis_13_2.png
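
The three summary numbers printed above can be read more easily with a small worked example. The sketch below assumes that completeness is the fraction of a document's tokens that received a known (non-`U`, non-`NA`) tag, that a "complete" document has no unannotated tokens, and that an "empty" document has no known tags at all. These definitions and the toy `counts` table are assumptions for illustration; `kex.get_tag_completeness` may define them differently.

```python
import pandas as pd

# Hypothetical per-document token counts by class (rows are work orders).
# I/P/S are known tags, U is "unknown", and NA counts tokens never annotated.
counts = pd.DataFrame(
    {
        "I": [2, 1, 0, 1],
        "P": [1, 0, 0, 0],
        "S": [0, 1, 0, 1],
        "U": [0, 1, 0, 0],
        "NA": [0, 2, 3, 0],
    }
)

total = counts.sum(axis=1)
unresolved = counts[["U", "NA"]].sum(axis=1)

# Assumed definition: fraction of each document's tokens with a known tag.
completeness = 1 - unresolved / total

# Assumed definitions: "complete" means no unannotated tokens (NA == 0);
# "empty" means no token received an I, P, or S tag.
n_complete = int((counts["NA"] == 0).sum())
n_empty = int((counts[["I", "P", "S"]].sum(axis=1) == 0).sum())

print(completeness.round(2).tolist())
print(f"mean completeness: {completeness.mean():.2f}")
print(f"complete: {n_complete}, empty: {n_empty}")
```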

Convergence over time, using `nestor-gui`

As part of the comparison study, an expert spent approximately 60 min annotating 1-grams in nestor-gui, followed by 20 min focusing on 2-grams. Work was saved every 10 min, so we can trace how the distribution in the plot above emerged as the tokens were classified.

[10]:

study_fname = Path('.')/'support'/'vocab_study_results.csv'
study_df = pd.read_csv(study_fname, index_col=0)
study_long = pd.melt(study_df, var_name="time", value_name='PPV').dropna()
study_long['time_val'] = study_long.time.str.replace('min','').astype(float)

sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)}, context='paper')
pal = sns.cubehelix_palette(6, rot=-.25, light=.7)
g = sns.FacetGrid(study_long, col="time", hue="time", aspect=.8, height=2, palette=pal, col_wrap=3)
g.map(sns.distplot, "PPV", kde=False, bins=nbins, vertical=True,
      hist_kws=dict(alpha=1., histtype='stepfilled', edgecolor='w', lw=2))
g.map(plt.axvline, x=0, lw=1.4, clip_on=False, color='k')

# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(.2, 0, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes)
g.map(label, "PPV")

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set( xticks=[], xlabel='')
g.set_axis_labels(y_var='PPV')
g.despine(bottom=True, left=True)
plt.tight_layout()
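
As a numeric complement to the grid of histograms, the same long-format table can be summarized per save point. The following is a minimal sketch on made-up data: the `study_demo` values are invented, and the `"10min"`-style column names are inferred from the `str.replace('min','')` step above rather than taken from the actual study file.

```python
import numpy as np
import pandas as pd

# Hypothetical snapshot table: one column per save point, one row per document.
# Values are invented; NaN marks a document with no score at that save point.
study_demo = pd.DataFrame(
    {
        "10min": [0.2, 0.5, np.nan, 0.4],
        "20min": [0.5, 0.75, 0.5, 0.6],
        "30min": [0.75, 1.0, 0.75, 1.0],
    }
)

# Same reshaping as the cell above: wide -> long, then parse minutes as floats.
long_demo = pd.melt(study_demo, var_name="time", value_name="PPV").dropna()
long_demo["time_val"] = (
    long_demo["time"].str.replace("min", "", regex=False).astype(float)
)

# Per-save-point summary of the PPV distribution.
summary = (
    long_demo.groupby("time_val")["PPV"]
    .agg(["mean", "median", "count"])
    .sort_index()
)
print(summary)
```

A rising mean or median across save points would indicate that annotation effort is still paying off; a plateau would suggest diminishing returns.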