IXA group

http://ixa.si.ehu.es

Talk: Nicolai Erbs. Multilingual acquisition of large scale knowledge resources (2012/02/24)

Speaker: Nicolai Erbs.

Technical University of Darmstadt (Germany)
Visiting the IXA Group (February-June, 2012)

Title: Multilingual acquisition of large scale knowledge resources.
Date: February 24, 2012
Time: 15:00-16:00
Where: Computer Science Faculty, Room 3.2

Abstract

A vast amount of content is produced by many users every day, but due to
the lack of structure, their contribution is often ignored by other
users. This talk presents approaches such as keyphrase extraction and
link discovery, enabling automatic structure generation for texts, thus
making them more readable.

However, the major challenge of disambiguating word senses is not
tackled. Solving this challenge could improve the proposed approaches
significantly. Especially for the task of link discovery, named entity
disambiguation is a fundamental issue.

The talk introduces Wikipedia as a valuable knowledge repository, for it
is full of named entities. Nearly all famous people, and many who are
not quite as famous, have their own Wikipedia articles, which are
heavily interconnected (e.g. two actors who appeared in the same movie).
These interconnections are represented in Wikipedia articles as links
and can be used as input for graph-based named entity disambiguation
systems.
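
As a rough illustration of how link structure can drive disambiguation, the sketch below picks, for each ambiguous mention, the candidate article that shares the most links with the candidates of the other mentions. The link graph, the article names, and the `disambiguate` function are all invented for illustration; this is a toy coherence heuristic, not the system presented in the talk.

```python
from typing import Dict, List, Set

# Hypothetical toy link graph: article -> set of articles it links to.
LINKS: Dict[str, Set[str]] = {
    "Anna Smith (actress)": {"The Long Road (film)"},
    "Anna Smith (politician)": {"City Council"},
    "Tom Lee (actor)": {"The Long Road (film)", "Anna Smith (actress)"},
    "Tom Lee (chemist)": {"Polymer"},
}

def linked(a: str, b: str) -> bool:
    """True if either article links to the other."""
    return b in LINKS.get(a, set()) or a in LINKS.get(b, set())

def disambiguate(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """For each mention, pick the candidate with the most links to the
    candidates of the other mentions (on ties, the first candidate wins)."""
    result = {}
    for mention, options in candidates.items():
        others = [c for m, cs in candidates.items() if m != mention for c in cs]
        result[mention] = max(
            options, key=lambda c: sum(linked(c, o) for o in others)
        )
    return result

mentions = {
    "Anna Smith": ["Anna Smith (politician)", "Anna Smith (actress)"],
    "Tom Lee": ["Tom Lee (chemist)", "Tom Lee (actor)"],
}
print(disambiguate(mentions))
```

Real graph-based systems typically score candidates jointly over the whole graph (for instance with random walks) rather than with this one-step link count.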

BerbaTek project's results and demos.

BerbaTek is a recently finished strategic research project with a duration of three years (2009-2011), funded by the Industry Department of the Basque Government. To carry out the project, a consortium was created, made up of the Elhuyar Foundation, the IXA and Aholab research groups of the UPV/EHU-University of the Basque Country, the technology centre Vicomtech and the foundation Tecnalia Research & Innovation.

Yesterday, the results of this project were presented in a press conference with representatives of the Basque Government. Throughout the BerbaTek project we have created several language tools, resources and some demos to show the potential of integrating language, voice and multimedia technologies when creating applications for the areas that make up the language industry, namely translation, content and teaching. Three demos were presented in the press conference:

  • Automatic dubbing demo. The automatic dubbing of films is still a difficult challenge (different voices, colloquial language, different speeds), but for some types of documentaries (single speaker, voice-over, lip synchronization unnecessary or unimportant) we have built a demo that performs satisfactorily. Given a documentary in Spanish and its transcription (which can be obtained automatically by means of any of the dictation programs for Spanish on the market), Vicomtech-IK4's temporal alignment technology creates a subtitles file, that is, a transcription with time marks for the beginning and end of each sentence. Then, the Matxin MT system, developed by the IXA group, automatically translates the subtitles into Basque, and Aholab's text-to-speech technology produces the synchronized voice. We have successfully applied this demo to the single-speaker sections of the television program Teknopolis produced by Elhuyar. This demo can be seen at work here.
  • Semantic multimedia search engine for science and technology content. This search engine is based on WNTerm, an ontology specialized in science and technology which was created by Elhuyar and IXA. It is a network where scientific and technological terms are semantically related to each other, with subclasses, synonyms, etc. A new augmented version will be presented next month.
  • Personal teacher for language learning. For the field of education, we have created a demo of a personal tutor for language learning. The tutor is a 3D avatar developed by Vicomtech-IK4 that shows emotions, can speak Basque and can understand what is said in Basque, using Aholab's technology. The tutor assists us in various tasks: we can do grammar exercises (verb conjugation, word inflection) and reading comprehension exercises (fill in gaps in a text, choosing from several options) that are created automatically from texts using technology from IXA; we can evaluate our pronunciation, with Aholab technology; and it helps us when writing texts, with inflection of words, writing of numbers or querying dictionaries, by means of technology from IXA and Elhuyar. For the moment this demo runs only locally, but it will be available online by next spring.
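
The gap-filling exercises mentioned for the last demo can be illustrated with a minimal sketch: one word in a sentence is blanked out and offered back among several options. The function name `make_cloze`, the distractor list, and the English sample sentence are hypothetical; the sketch says nothing about how IXA's technology actually selects gaps or distractors.

```python
import random
from typing import List, Tuple

def make_cloze(sentence: str, target: str, distractors: List[str],
               seed: int = 0) -> Tuple[str, List[str]]:
    """Blank out `target` in `sentence` and return the gapped text
    together with shuffled answer options (target plus distractors)."""
    words = sentence.split()
    if target not in words:
        raise ValueError(f"{target!r} does not occur in the sentence")
    gapped = " ".join("____" if w == target else w for w in words)
    options = [target] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so runs are reproducible
    return gapped, options

text, options = make_cloze(
    "The tutor speaks Basque fluently",
    "speaks",
    ["speak", "spoke", "speaking"],
)
print(text)     # the sentence with the target word replaced by a gap
print(options)  # the four answer options in shuffled order
```

A real exercise generator would choose the gap and the distractors automatically, for example from a morphological analyser, rather than taking them as arguments.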

The news has been covered by the media today.

Further information about this project can be found at the BerbaTek project's website.

Talk: M. Cuadros. Multilingual acquisition of large scale knowledge resources

Speaker: Montse Cuadros (Vicomtech)
Title: Multilingual acquisition of large scale knowledge resources.
Date: January 27, 2012
Time: 15:30-16:30
Where: Computer Science Faculty, Room 3.2

Abstract

The main goal of the research presented in this thesis is to devise new
methods and tools to automatically create new semantic relations between
WordNet senses. That is, to accurately increase by automatic means the
knowledge represented in WordNet.

In particular, our research focuses on devising new methods and tools for:

- Acquiring relevant words from general or domain corpora for a
specific WordNet word-sense.

- Identifying the *implicit* word-senses of the acquired relevant
words with respect to an *existing* knowledge base (in particular,
WordNet).

- Empirically evaluating the quality of the resulting *new* semantic
relations in a controlled multilingual evaluation framework.

Talk: Arkaitz Zubiaga. Mining Twitter for real-time trend and information discovery (2011/12/12)

Arkaitz Zubiaga is a researcher in Social Media & Data Mining. He worked at UNED for several years and got his PhD in July 2011 with the thesis Harnessing Folksonomies for Resource Classification. Now, at the end of 2011, he is moving from the NLP&IR group at UNED in Spain to New York to join Queens College of the City University of New York as a post-doctoral research associate. He is visiting us today.

Title: Mining Twitter for real-time trend and information discovery
Where: Computer Science Faculty, Room 3.2
Speaker: Arkaitz Zubiaga, from the NLP&IR Group at UNED in Madrid
Date: December 12 (today!)
Time: 10:00-11:00

Abstract: The emergence of social networking services such as Twitter, Google+, and Facebook has led to new ways of sharing information with interested communities. In recent years, there has been an increasing trend in the use of social media services, not only by end-users but also by all kinds of groups, organizations, and governments. Social streams produce overwhelming amounts of data that cannot be followed entirely by users. Hence, analyzing, mining, and curating those streams in real time can help users follow and discover a wide variety of knowledge related to current affairs. In this talk, I will summarize the main issues of performing real-time analyses of Twitter streams, and I will present our recent research on the characterization of trends and the summarization of events from those streams.

Talk: V. Kordoni. Automated Annotation and Acquisition of Linguistic Knowledge (2011/11/25)

Speaker: Valia Kordoni (LT-Lab DFKI GmbH & Dept. of Computational Linguistics, Saarland University)
Title: Automated Annotation and Acquisition of Linguistic Knowledge for Efficient Multilingual Grammar Engineering.
Date: November 25, 2011
Time: 16:00-18:00
Where: Computer Science Faculty, Room 3.2

Abstract

In this talk, I mainly deal with automated acquisition of linguistic knowledge as a means of enhancing the robustness of lexicalised grammars for real-life applications. The case study I focus on for the greater part of this talk is Multiword Expressions (henceforth MWEs). Specifically, in the first part of the talk I take a closer look at the linguistic properties of MWEs, in particular their lexical, syntactic, and semantic characteristics. The term Multiword Expressions has been used to describe expressions for which the syntactic or semantic properties of the whole expression cannot be derived from its parts (cf. Sag et al., 2002), including a large number of related but distinct phenomena, such as phrasal verbs (e.g., “come along”), nominal compounds (e.g., “frying pan”), institutionalised phrases (e.g., “bread and butter”), and many others. Jackendoff (1997) estimates the number of MWEs in a speaker’s lexicon to be comparable to the number of single words. However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (cf. Sag et al., 2002). For instance, some MWEs are fixed and do not present internal variation, such as “ad hoc”, while others allow different degrees of internal variability and modification, such as “spill beans” (“spill several/musical/mountains of beans”).
With the observations about the linguistic properties of MWEs at hand, I turn in the second part of the talk to methods for the automated acquisition of these properties for robust grammar engineering. To this effect, I first investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing various statistical measures, a procedure which leads to extremely interesting conclusions. I then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo. I conclude that, in terms of language usage, web-generated corpora are fairly similar to more carefully built corpora, like the BNC, indicating that the lack of control and balance of these corpora is probably compensated for by their size.
Then, I show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. To this effect, I first discuss two main approaches commonly employed in NLP for treating MWEs: the words-with-spaces approach, which models an MWE as a single lexical entry and can adequately capture fixed MWEs like “by and large”, and compositional approaches, which treat MWEs by general and compositional methods of linguistic analysis and are able to capture more syntactically flexible MWEs, like “rock boat”, which cannot be satisfactorily captured by a words-with-spaces approach, since this would require lexical entries to be added for all the possible variations of an MWE (e.g., “rock/rocks/rocking this/that/his…boat”). On this basis, I argue that the automatic addition of extracted MWEs to existing linguistic resources improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.
Finally, I also propose that the methods developed for the acquisition of linguistic knowledge in the case of the English MWEs can be tuned to enhance the robustness of lexicalised grammars for languages with richer morphology and freer word order, as is the case of German, and can benefit from gold standard syntactically and semantically annotated corpora. For the (semi-automated) development of such corpora, I briefly show a very simple statistical ranking model which significantly improves treebanking efficiency by directing human annotators to the most relevant linguistic annotation decisions.
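
To make the idea of detecting MWEs from the statistical properties of their component words concrete, here is a minimal sketch of one widely used association measure, pointwise mutual information (PMI), which scores a word pair by how much more often it co-occurs than chance would predict. The abstract does not say which measures were compared, so PMI is only an assumed example; the function name `bigram_pmi` and the mini-corpus are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bigram_pmi(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented toy corpus; a real experiment would use something like the BNC.
corpus = ("the frying pan is hot . the pan is old . "
          "she likes frying pan recipes . the frying pan broke .").split()
scores = bigram_pmi(corpus)

# The compound candidate should score higher than an incidental pair.
print("frying pan:", scores[("frying", "pan")])
print("the pan:   ", scores[("the", "pan")])
```

PMI is known to overrate pairs of rare words, so in practice it is usually combined with a minimum frequency threshold or compared against other measures.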

Talk. Tegau Andrews. An overview of Welsh language technologies. (2011/11/02)

The IXA Group has often collaborated with Bangor University on the development of language technology for less-resourced languages, mainly Basque and Welsh, within the framework of SALTMIL.
Briony Williams, Delyth Prys and Gruff Prys are our Welsh contacts.
Tegau Andrews from Bangor University will be with us next week, and we have scheduled this talk:

Speaker: Tegau Andrews (Bangor University, Wales)
Uned Technolegau Iaith  /  Language Technologies Unit
Prifysgol Bangor     /   Bangor University

When: November 2, Wednesday
Where: Room 3.2
Time: 15:00
Title: From terminology standardization systems to machine translation: An overview of Welsh language technologies

Abstract:

“An endangered language will progress if its speakers can make use of electronic technology”, so postulates Wales-based linguistics professor David Crystal (Language Death, 2000: 141). Welsh, spoken by 20.8% of the population of Wales (Census 2001), is classed as a vulnerable language by UNESCO, yet it is the Welsh Government’s stated aim to make Wales a truly bilingual nation.

This talk will focus on the progress being made in developing language technologies for Welsh speakers. It will range over topics such as Welsh machine translation, computer-aided translation tools, text-to-speech technology, terminology portals and e-learning resources, and present an overview of the work being done at the Terminology and Language Technologies Unit at Bangor University. The aim of such work is to enable and encourage Welsh speakers to use electronic technology in their own language.

Talks. David Martinez and Meladel Mistica. (2011/10/14)

Date: October 14, 2011
Time: 15:00
Where: Computer Science Faculty, Room 3.2

[1]

Title: Word classes in Indonesian: A linguistic reality or a convenient fallacy in natural language processing?
Speaker: Meladel Mistica (Australian National University)
Abstract[1]:
In this talk I will be presenting work on Indonesian (Bahasa Indonesia), and the claim that there is no noun-verb distinction within the language as it is spoken in regions such as Riau and Jakarta. We test this claim for the language as it is written by a variety of Indonesian speakers, using empirical methods traditionally used in part-of-speech induction. In this study we use only morphological patterns that we generate from a pre-existing morphological analyser. We find that once the distribution of the data points in our experiments matches the distribution of the text from which we gather our data, we obtain results that show a significant distinction between the class of nouns and the class of verbs in Indonesian. Furthermore, the study shows promise that word classes may be labelled using morphological features alone, an approach which could be applied to out-of-vocabulary items.

[2]

Title: Text classification of patient reports and event-modifier identification for the biomedical literature
Speaker: David Martinez (NICTA – National ICT Australia)
Abstract[2]:
The first short talk describes the implementation and evaluation of a text classification system of pathology reports for the Royal Melbourne Hospital, which relied on document-level annotations obtained from the medical workflow. We observed that a basic machine learning framework with linguistic features carries the potential to make an impact in their process.
The second talk describes our work on modifiers of biomedical events over the BioNLP-2009 dataset. Our system combines a simple bag-of-words method with two grammar-based approaches, namely the English Resource Grammar and the RASP parser. We interpret the output of the respective parsers via MRS (Minimal Recursion Semantics), and feed them into a machine learner. Our results indicate that grammar-based techniques can enhance the accuracy of methods for detecting event modification.

Talk. Dan Jurafsky. Extracting many kinds of meaning from text and speech. (2011/09/13)

Speaker: Professor Dan Jurafsky (Stanford University).
Date: September 13, 2011
Time: 16:00
Where: Computer Science Faculty

Title: Extracting many kinds of meaning from text and speech.
Abstract:
Understanding natural language, while one of the oldest goals of artificial intelligence, is immensely difficult because language expresses so many kinds of meanings, embedded as it is in the rich social world of humans. In this talk I discuss work in our lab on extracting three kinds of meaning that link to the human world. We show how to learn world knowledge about events and their participants, ‘narrative schemes’ about how the world works, in a purely unsupervised way from large bodies of text. We show a new algorithm for the task of ‘coreference’: deciding when two mentions in a text refer to the same person or organization. Finally, we show how to automatically detect human interpersonal stances from speech and text cues in spoken conversation, detecting whether a speaker is friendly, awkward, or flirtatious. This talk describes joint work with Nate Chambers, Angel Chang, Heeyoung Lee, Chris Manning, Dan McFarland, Yves Peirsman, Karthik Raghunathan, Rajesh Ranganath, and Mihai Surdeanu.
BIO:
Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received a B.A. in Linguistics in 1983 and a Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder. His research focuses on natural language understanding as well as the application of natural language processing to the behavioral and social sciences. Other research interests include the linguistics of Chinese and the linguistics of food. He is the recipient of a MacArthur Fellowship, and is the co-author with Jim Martin of the widely used textbook “Speech and Language Processing”. It was the first book to include in-depth descriptions of both text and speech technology. As teachers and students of language technology, we know this fine book very well.

Talk. Marta Recasens. Coreference: Theory, Annotation, Resolution and Evaluation (2011/06/28)

“Coreference: Theory, Annotation, Resolution and Evaluation”

Speaker: Marta Recasens
CLIC Centre de Llenguatge i Computació (Universitat de Barcelona)
Date: June 28, 2011
Time: 16:15
Where: Computer Science Faculty, Room 2.3

Talk. Atro Voutilainen. Dependency treebank for Finnish (2011/06/08)

Speaker: Atro Voutilainen (University of Helsinki)
Date: June 8, 2011
Time: 11:30
Where: Computer Science Faculty, Room 3.2

Title: Building a dependency treebank and other LRs for Finnish

Abstract

  • Research infrastructure FIN-CLARIN
    • LR web service for R&D
    • corpora, language models, software, open source
    • FIN-CLARIN project
  • FinnTreeBank
    • user needs
    • grammar definition corpus
    • a parsebank with dependency syntactic annotation
  • Tagging and dependency parsing
    • Finnish
    • linguistic modelling
    • tools, technologies
    • modelling methods: experiments, comparisons