DeepReading aims to develop transfer learning and deep learning approaches to address the lack of training data and knowledge resources for many NLP tasks and domains, focusing mostly on Spanish, English, Basque, Catalan and Galician. Even though deep learning and modern word embedding representations have improved the state of the art in NLP across tasks and languages, higher-level semantic tasks still require sufficient annotated data for supervised machine learning. For many languages and domains, such corpora are scarce or simply non-existent, leading to much lower results than those obtained for English. This subproject also explores how to apply deep learning techniques to automatically build large-scale lexical knowledge bases from scratch for any language and domain.

The main objectives of the project are:

  • DeepReading will adopt the latest open standards for representing linguistic annotations of both documents and semantic resources. Moreover, all project outcomes will be free, open source, interoperable and prepared for the Big Data and Deep Learning (DL) paradigms. The project will be designed as an open framework that can be easily extended and readily exploited by future initiatives (academia and industry).
  • DeepReading will develop state-of-the-art high-level semantic processing modules and pipelines, using the latest advances in deep learning systems and techniques, to achieve Interoperable Multilingual Semantic Interpretation for English, Spanish, Catalan, Basque and Galician.
  • DeepReading will develop a transfer-learning toolset for deriving cross-lingual embeddings and for projecting annotations and knowledge from one language to another, or from one language in a particular domain to new domains and genres.
  • DeepReading will leverage both large-scale multilingual knowledge bases and new deep neural network techniques to enrich and improve each other.
  • DeepReading will develop robust reasoning and semantic engines tailored to particular sectors and domains.
  • DeepReading will provide advanced deep semantic capabilities to multilingual, domain-oriented, content-enabling systems. These will include storyline extraction, opinion mining, e-learning, summarization and community question answering systems across different domains and genres for the languages covered by the project.
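
As an illustration of the cross-lingual embedding objective listed above, the following is a minimal sketch of one common way to align two monolingual embedding spaces: an orthogonal (Procrustes) mapping learned from a small seed dictionary. It is not taken from the project itself. The function names, the toy random vectors and the placeholder word labels are all assumptions made for the example.

```python
import numpy as np


def learn_orthogonal_mapping(src_vecs, tgt_vecs):
    """Return the orthogonal matrix W minimising ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are assumed to be the
    embeddings of a translation pair from a seed dictionary.
    """
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def nearest_target(word_vec, mapping, tgt_matrix, tgt_words):
    """Map a source vector into the target space and return the
    target word with the highest cosine similarity."""
    projected = word_vec @ mapping
    projected = projected / np.linalg.norm(projected)
    normed = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(normed @ projected))]


# Toy data (hypothetical): the "target language" space is an exact
# rotation of the "source language" space, so the mapping is recoverable.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
tgt = src @ true_rotation
tgt_words = [f"tgt_word_{i}" for i in range(50)]  # placeholder labels

# Learn the mapping from the first 30 pairs only (the seed dictionary).
mapping = learn_orthogonal_mapping(src[:30], tgt[:30])

# Look up a word that was not part of the seed dictionary.
match = nearest_target(src[40], mapping, tgt, tgt_words)
```

In this toy setting the held-out source vector maps back to its own counterpart because the two spaces differ only by a rotation. Real embedding spaces are only approximately related in this way, so retrieval on real data would be noisier and would depend on the quality of the seed dictionary.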