Machine learning comprises a set of data modeling techniques that arise from artificial intelligence and statistics. The models are learned from data and are commonly used for classification and/or description.
Machine learning has rapidly gained prominence in application areas such as bioinformatics, industry, finance, and natural language processing.
The course focuses on the main tools of a classical “data analysis pipeline”: data preprocessing, feature selection, learning scenarios, evaluation and comparison. The techniques are illustrated with powerful machine learning software and applied to different natural language processing problems.
Description of the principal machine learning scenarios. Formalism and description of the data matrix associated with each scenario. Illustrative applications in each scenario. Supervised classification, clustering, “weakly supervised classification” (“positive unlabeled learning”, “learning from label proportions”, “partial labels”, “crowd learning”, etc.).
General purpose techniques and filters for data preprocessing. Software: WEKA
Principal techniques for feature selection. Software: WEKA
Validation of classification models. Using statistical tests to compare classification models. Software: WEKA, R, webpages
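As an illustration of comparing two classification models with a statistical test, the following is a minimal sketch of an exact McNemar test on paired predictions. The choice of McNemar's test is an assumption made for illustration (the course does not state which tests it covers), and the function name `mcnemar_exact` and the toy labels are hypothetical.

```python
import math

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers evaluated on the same cases.

    Returns (b, c, p_value), where b counts cases only classifier A got right
    and c counts cases only classifier B got right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        # The classifiers never disagree in correctness: no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, the discordant cases split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical toy data: true labels and the predictions of two models.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
pred_b = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Only the cases on which exactly one of the two models is correct enter the test, so it fits the situation where both models are validated on the same test instances.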
The “tm” (text-mining) R package. Construct a “document-term” matrix from a corpus by means of text-mining operators. Notebook-tutorial
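To make the idea of a “document-term” matrix concrete, here is a minimal sketch in plain Python. It is not the “tm” package API: the toy corpus, the stop-word list, and the helper names `tokenize` and `document_term_matrix` are all assumptions introduced for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the course's own corpora are not reproduced here.
corpus = [
    "Machine learning models are learned from data.",
    "Text mining builds a matrix from text data.",
    "Deep learning models classify documents.",
]

# Small illustrative stop-word list (an assumption, not the list used by "tm").
STOP_WORDS = {"a", "are", "from", "the"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, dtm = document_term_matrix(corpus)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row holds the term counts of one document, which is the representation that clustering and classification methods can then take as input.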
“The machine learning approach”: based on a previously constructed “document-term” matrix, clustering of terms and classification of documents. The “caret” R package. Notebook-tutorial
First steps using “deep learning” techniques to classify documents. Application of “word2vec” techniques. The “h2o” R package. Notebook-tutorial.