
Text Classification

https://learning.oreilly.com/library/view/text-analytics-with/9781484243541/html/427287_2_En_5_Chapter.xhtml

blue print

Evaluations

https://heartbeat.fritz.ai/introduction-to-machine-learning-model-evaluation-fa859e1b2d7f

Test a model on different data than it was trained on. Divide the data into a training set, a validation set, and a test set.
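A minimal sketch of the three-way split using scikit-learn's train_test_split. The synthetic data set and the 60/20/20 proportions are assumptions chosen for illustration; the notes do not prescribe any particular ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real labelled data set (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remaining 80% into 75% train / 25% validation,
# which gives 60% / 20% / 20% of the original data overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

The validation set is used for tuning choices such as hyperparameters, while the test set is kept aside until the final evaluation.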

k-fold cross-validation. Divide the data into k partitions (usually 5 or 10). In each round, use one partition as the test set and the remaining k-1 partitions as the training set, so each partition is used as test data exactly once and as training data k-1 times.
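The partitioning can be sketched with scikit-learn's KFold. The toy array of 10 samples and k = 5 are assumptions for illustration; the counters only verify how often each sample lands in the test and training sets.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (illustrative data only).
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample appears in the test and training sets.
test_counts = np.zeros(len(X), dtype=int)
train_counts = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kf.split(X):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print(test_counts)   # every sample is tested once
print(train_counts)  # every sample is trained on k-1 times
```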

Evaluations for Classifications

Accuracy: the number of correct predictions divided by the total number of predictions made.

sklearn.metrics.accuracy_score()
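A short usage sketch for accuracy_score. The label lists are made up for illustration; the manual calculation is included only to show what the function computes.

```python
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Library call: fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_pred)

# The same value computed by hand: correct predictions / all predictions.
manual_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(acc, manual_acc)
```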

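The AUC score should be greater than 0.5 (the score of a random classifier) and as close to 1 as possible. A perfect classifier's ROC curve goes straight up the Y axis to the top-left corner and then runs along the top of the plot, parallel to the X axis.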

from sklearn.metrics import roc_auc_score, roc_curve
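A small sketch of both functions. The labels and scores are invented for illustration: one set of scores ranks a positive below a negative, the other separates the classes perfectly.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]

# Imperfect scores: one positive (0.35) is ranked below a negative (0.4).
imperfect_scores = [0.1, 0.4, 0.35, 0.8]
# Perfect scores: every positive is ranked above every negative.
perfect_scores = [0.1, 0.2, 0.8, 0.9]

auc_imperfect = roc_auc_score(y_true, imperfect_scores)
auc_perfect = roc_auc_score(y_true, perfect_scores)

# roc_curve returns false positive rates, true positive rates and thresholds.
fpr, tpr, thresholds = roc_curve(y_true, perfect_scores)

print(auc_imperfect, auc_perfect)
print(list(zip(fpr, tpr)))
```

For the perfect scores the curve passes through the top-left corner (false positive rate 0, true positive rate 1), which is what gives an AUC of 1.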

https://en.wikipedia.org/wiki/F1_score

F1 = 2 / (recall^-1 + precision^-1) = 2 * precision * recall / (precision + recall). The F1 score is the harmonic mean of precision and recall, so it considers both. Precision is the number of correct positive results divided by the total number of predicted positives (true positives and false positives). Recall is the number of correct positive results divided by the number of all relevant samples. Here “relevant samples” means all actual positives (true positives and false negatives).

Note that the relative importance of precision and recall depends on the problem.
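The definitions above can be checked with scikit-learn's precision_score, recall_score and f1_score. The labels are invented so that there are 2 true positives, 1 false positive and 2 false negatives.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Invented labels: TP = 2, FP = 1, FN = 2, TN = 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)

# Harmonic mean computed by hand from the formula in the notes.
f1_manual = 2 / (1 / recall + 1 / precision)

print(precision, recall, f1, f1_manual)
```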

Root Mean Squared Error (RMSE):

Mean Absolute Error (MAE):
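The notes list these two metrics by name only. As a hedged sketch, the block below uses their standard definitions: RMSE is the square root of the mean of squared errors, and MAE is the mean of absolute errors. Both apply to numeric (regression) predictions rather than class labels. The values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up numeric targets and predictions (illustrative only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# MAE: mean of the absolute errors.
mae = mean_absolute_error(y_true, y_pred)

# The same quantities written out with NumPy.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae_manual = np.mean(np.abs(y_true - y_pred))

print(rmse, mae)
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE does.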

Further reading (from Data Science from Scratch, 2nd edition, Chapter 21 and Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing, Chapter 2)

General Data Science bookmark

https://skymind.ai/wiki/word2vec https://heartbeat.fritz.ai/the-7-nlp-techniques-that-will-change-how-you-communicate-in-the-future-part-i-f0114b2f0497

I searched for Word2Vec and found this website. It seems to be a good first stop for most Data Science/AI topics.

Word2Vec

Provided by the gensim library.

Word2Vec is a 2-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus.

It turns text into a numerical form that deep nets can understand. Given enough data, usage, and context, word2vec can make highly accurate guesses about a word’s meaning. These guesses can be used to establish a word’s association with other words.

Out of vocabulary words

https://medium.com/@shabeelkandi/handling-out-of-vocabulary-words-in-natural-language-processing-based-on-context-4bbba16214d5

summarize

https://heartbeat.fritz.ai/extractive-text-summarization-using-neural-networks-5845804c7701