Workshop programme:
- 09:30 Opening
Leonardo Lesmo, Università degli Studi di Torino:
The PARLI project.
Rodolfo Delmonte, Università Ca' Foscari di Venezia:
Dependency Treebank Annotation and Null Elements: an experiment with VIT.
State-of-the-art parsers are currently trained on versions of the Penn Treebank converted into
dependency representations, which however do not include null elements. This is done to
facilitate structural learning and to prevent the probabilistic engine from postulating the existence
of deprecated null elements everywhere (see R. Gaizauskas, 1995). As a result, however,
the semantics of the representation used and produced at runtime is inconsistent,
which dramatically reduces its usefulness in real-life applications such as Information Extraction,
Q/A and other semantically driven fields, because it hampers the mapping of a complete logical form.
What systems have come up with instead are “Quasi”-logical forms, or partial logical forms, mapped
directly from the surface representation in dependency structure. We show the most common
problems arising from the conversion and then describe an algorithm that we have implemented
for our converted Italian Treebank and that can be used on any CoNLL-style treebank or
representation to produce an “almost complete”, semantically consistent dependency treebank.
Maria Simi, Università di Pisa,
Cristina Bosco, Università degli Studi di Torino,
Simonetta Montemagni, Istituto Linguistica Computazionale di Pisa:
Towards harmonization and merging of Italian Dependency Treebanks.
The talk describes a methodology for the construction of a Merged Italian Dependency
Treebank (MIDT) starting from the already existing resources TUT and ISST-TANL.
In particular, the effort has focused on a detailed comparative analysis of the
structures annotated in TUT and ISST-TANL and on the harmonization of the annotation
schemes of these resources. The issues raised during the comparison of the annotation
schemes underlying the two treebanks are discussed with a particular emphasis on the
definition of a set of linguistic categories to be used as a bridge between the specific schemes.
As a result of this study we have implemented conversion scripts from TUT and ISST-TANL
to MIDT, obtained a preliminary version of a shared resource in MIDT format, and measured
the performance of the DeSR statistical parser trained on the new MIDT resource.
Manuela Sanguinetti, Università degli Studi di Torino,
Cristina Bosco, Università degli Studi di Torino,
Leonardo Lesmo, Università degli Studi di Torino:
ParTUT and translation shift study.
Parallel corpora, and parallel treebanks in particular, are currently considered among the crucial
resources for a variety of NLP tasks, e.g. machine translation and cross-lingual information
extraction, and for research in the field of translation studies and contrastive linguistics.
In this talk we present ParTUT, an ongoing project for the development of a parallel treebank
for Italian, English and French annotated in the pure dependency format of the Turin
University Treebank. The talk will focus on the study
of translational divergences and their implications for the development of a tool for aligning parallel
parse trees that, benefitting from the linguistic information provided, could properly deal with
such divergences. As a final remark, we will discuss whether and to what extent the specific features of
the TUT representation format may affect the design and implementation of an alignment system.
Cristina Bosco, Università degli Studi di Torino,
Anna Corazza, Università Federico II di Napoli,
Anita Alicante, Università Federico II di Napoli,
Alberto Lavelli, Fondazione Bruno Kessler di Trento:
Evaluation methodologies and Italian word order.
The aim of this talk is to describe the methodology applied in the
PARLI project for the evaluation task.
The contribution focusses in particular on the debate on the
issues raised by Morphologically Rich Languages, and more precisely on
the investigation, in a cross-paradigm perspective, of the influence of
constituent order on the data-driven parsing of one such
language (i.e. Italian). It therefore presents new evidence from
experiments on Italian, a language characterized by a rich verbal
inflection, which leads to a widespread diffusion of the pro-drop
phenomenon and to a relatively free word order. The experiments are
performed by using state-of-the-art data-driven parsers (i.e.
MaltParser and Berkeley parser) and are based on an Italian treebank
available in formats that vary according to two dimensions, i.e. the
paradigm of representation (dependency vs. constituency) and the level
of detail of linguistic information. The aim of this work, however, goes
beyond the results obtained: it also seeks to contribute to the
debate by exploring new methodological perspectives in the evaluation field.
Simona Colombo, Università degli Studi di Torino,
Elisa Corino, Università degli Studi di Torino:
CMC and Corpora: using the web to study language.
The research group at the University of Turin, under the supervision of Professor
Marello and Manuel Barbera, has in recent years created a set of different corpora
for studying different aspects of language.
The corpora developed will be reviewed, illustrating their user interface and their
possible applications in the study of different aspects of language.
The resources analysed will include the NUNC (www.corpora.unito.it), multilingual
corpora, both specialised and general, built by collecting newsgroup material;
VALERE (http://www.progettovalere.org), for the study of the different registers of Italian;
and RIDIRE (http://lablita.dit.unifi.it/projects/RIDIRE), a web corpus divided into semantic domains,
created as a resource for the study of Italian as an L2.
(1) Barbera, M., Corino, E. & Onesti, C. 2008. Corpora e linguistica in rete. Perugia: Guerra Edizioni.
(2) Baroni, M. & Bernardini, M. S. (eds) 2006. Wacky! Working papers on the Web as Corpus. Bologna:
- 12:30 Lunch break
Giuseppe Attardi, Università di Pisa:
Syntactic Dependencies: learning, adapting, annotating.
Dependency annotations allow representing the syntactic structure of sentences in a way that
closely reflects the underlying semantic relations. Dependency trees can be exploited in many
tasks of text analysis, including entity recognition, sentiment analysis, textual entailment, text
classification, relation extraction and machine translation.
High accuracy in parsing can be achieved by training a statistical parser on an annotated corpus.
However, the accuracy decreases when the documents to analyze come from a different domain
than the original training corpus. Therefore the training resources must be extended to cover
a sample as large as possible of the language.
We explored several ways to achieve this: self-training, active learning and crowd sourcing.
We report on our experiments with these techniques. Active learning proved quite effective in
several tasks and competitions, like Evalita and SPLeT. In order to exploit crowd sourcing, we
built a "game with a purpose", called Phratris, that engages users in composing a dependency
tree in a fashion similar to the popular game of Tetris.
We finally describe how to represent dependencies in an enriched search index, which allows very
fast search over documents based on dependency relations present in sentences.
Such an index has been built for the Italian and English Wikipedia allowing for semantic search.
Alessandro Moschitti, Università di Trento:
Fast Prototyping of Natural Language Systems using Kernel Methods.
Building NLP resources is usually rather expensive in terms of time and human labor.
On the one hand, this has led to the definition of methods for cheaply obtaining the desired
annotation, e.g., the Phrase Detectives game to gather coreference resolution data in Italian
and English. On the other hand, machine learning methods that can limit the use of training
data have been studied, especially in the case of highly costly syntactic and semantic annotation.
In this talk, we describe both approaches by focusing on the use of syntactic and semantic
kernels for the design of Natural Language Processing applications. These kernels, used in
Support Vector Machines, enable the design of core NLP systems such as automatic Question
Classification, Semantic Role Labeling and Verb Class categorization. This research has been
partially developed within the PARLI project in cooperation with the University of Rome Tor
Vergata and has led to the design of accurate Italian SRL systems.
Roberto Basili, Università di Roma Tor Vergata,
Danilo Croce, Università di Roma Tor Vergata:
Semantic Role Labeling for Italian.
In the talk we will present the main aspects of the semantic role labeling (SRL) model over
texts in Italian developed during the PARLI project. The corresponding system, based on
Charles Fillmore's frame semantics paradigm, integrates structured learning methodologies
and is strongly based on a semantic tree kernel formulated by the Tor Vergata group in
cooperation with the University of Trento during the project activities. The model achieved
the best SRL performance in the FLaIT task of the EvalIta 2011 evaluation campaign. In the talk,
the architecture of the system, its flexibility in the treatment of SRL for Italian and English as well
as its extensive evaluation will be presented.
Round table: Present and future NLP applications
Chairperson: Giuseppe Attardi;
Gian Piero Oggero, Expert System;
Alessio Bosca, CELI
- 17:00 Closing
Abstract submission and publication
- all project participants are invited to submit abstracts relating to their activities
- the submission deadline is 25 August
- abstracts should be sent by email to email@example.com and firstname.lastname@example.org
- all submitted abstracts will be put online on this site before the workshop
- a volume of post-proceedings is also planned, containing
an extended version of the submitted abstracts.