Erasmus Mundus Master in Language
and Communication Technologies (LCT)


University of the Basque Country

Machine Learning

Machine learning comprises a set of data modeling techniques that originate in artificial intelligence and statistics. The models are learned from data and are commonly used for classification and/or description.

Machine learning has rapidly gained prominence in application areas such as bioinformatics, industry, finance, and natural language processing.

The course focuses on the principal tools of a classical “data analysis pipeline”: data preprocessing, feature selection, learning scenarios, evaluation, and comparison. The techniques are illustrated with powerful machine learning software and applied to different natural language processing problems.

Syllabus

The following is a list of the machine learning tools that we will cover in the upcoming sessions:

General terms in the "data science" world: the term "data science", the relation between AI and data science, the term "big data", the Kaggle repository, kdnuggets.com, data science for a better world...
Principal classification scenarios: supervised classification, unsupervised classification (clustering), weakly supervised classification (alternative scenarios).
Semi-supervised classification: usefulness in NLP tasks. Software: the RSSL package in R.
One-class classification and outlier detection: usefulness in NLP tasks. Software: R packages.
Using statistical tests to compare the accuracy of different classifiers. Software: WEKA, R, web pages.
Feature selection techniques. Techniques for selecting a "competitive" subset of original features.
General techniques and filters for data preprocessing. Preprocessing filters for any kind of data: missing data imputation, one-hot encoding, discretization, imbalanced class distributions...
"A short introduction to the tm (text mining) package in R: text processing". How to use text mining operators to construct a proper document-term matrix for further machine learning analysis, starting from raw text such as files, HTML pages, Twitter... A tutorial using R software.
"The machine learning approach: clustering words and classifying documents with R". A tutorial using R software.
"First steps in deep learning for NLP with R's h2o package (+word2vec)". A tutorial using R software.

While the earlier items are oriented toward general data analysis, I consider the last three topics closely suited to your NLP interests. Even so, for those earlier items we will try to use NLP datasets to illustrate the "general purpose" data analysis techniques taught.
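
To give a rough idea of what the document-term matrix mentioned in the tm tutorial item looks like, here is a minimal sketch. It is written in plain Python rather than with R's tm package, so it is not the course's actual tooling. The function names `tokenize` and `document_term_matrix`, the simple lowercase alphabetic tokenizer, and the two example sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


# Hypothetical toy corpus, not taken from the course materials.
docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one term, which is the tabular form that classifiers and clustering methods expect as input. A real pipeline would typically add steps such as stop-word removal, stemming, and term weighting.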
