Resources

The following list includes the resources developed for Italian NLP by the researchers working in this area. Clicking on the name of a resource opens a record with its basic information (in most cases in English, as provided by the authors) and a link to its website.

Information about the systems and resources that took part in the Evalita 2011 evaluation campaign will be made available shortly after the workshop (Rome, 24-25 January 2012).

Reports and suggestions about resources, whether or not they already appear in the list, are welcome and can be submitted through the form in the Segnalazioni (Reports) section of this site.

Lexicons and thesauri
(including dictionaries, Wiktionary and WordNets)

BabelNet

Name	BabelNet
Author(s)	Roberto Navigli, Simone Paolo Ponzetto
Description	BabelNet is a very large multilingual semantic network with millions of concepts obtained from an integration of WordNet and Wikipedia based on an automatic mapping algorithm, and from the translations of the concepts (i.e. English Wikipedia pages and WordNet synsets) based on Wikipedia cross-language links and the output of a machine translation system. BabelNetXplorer is the online interface available on the BabelNet website for accessing the resource.
License and download	free
Link	http://babelnet.org/
Contact	navigli[at]di.uniroma1.it

ItalWordNet

Name	ItalWordNet (Italian WordNet)
Author(s)
Description	ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as both models are similar, except for some new relations for adjectives that were added to ItalWordNet.
License and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=1110
Contact	http://www.elra.info/

Morph-it!

Name	Morph-it!
Author(s)	Marco Baroni, Eros Zanchetta
Description	Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer/morphological analyzer/morphological generator.
License and download	Creative Commons and CC-GNU LGPL
Link	http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
Contact	marco.baroni[at]unitn.it
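
As a rough illustration of how a lexicon of inflected forms could serve as the data source for a lemmatizer, the following Python sketch builds a lookup table from form to lemma. It assumes a three-column, tab-separated layout (form, lemma, features); the sample entries and the helper names load_lexicon and lemmatize are hypothetical and should be checked against the actual distribution before use.

```python
from collections import defaultdict

# Hypothetical sample entries, assuming one "form<TAB>lemma<TAB>features"
# record per line. Not taken from the actual resource.
SAMPLE = (
    "case\tcasa\tNOUN-F:p\n"
    "casa\tcasa\tNOUN-F:s\n"
    "parlo\tparlare\tVER:ind+pres+1+s\n"
)

def load_lexicon(text):
    """Map each inflected form to its list of (lemma, features) analyses."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        lexicon[form].append((lemma, features))
    return lexicon

def lemmatize(lexicon, form):
    """Return the sorted candidate lemmas for a form (empty if unknown)."""
    return sorted({lemma for lemma, _ in lexicon.get(form, [])})

lexicon = load_lexicon(SAMPLE)
print(lemmatize(lexicon, "case"))
```

A form may have several analyses, so the lookup returns every candidate lemma rather than choosing one; disambiguation would need context such as a POS tag.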

MultiWordNet

Name	MultiWordNet
Author(s)	Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri, Manuela Speranza, Massimiliano Bampi, Gabriela Cavaglià, Francesca Filiaci, Bernardo Magnini, Carlo Strapparava, Lisa Zenoniani
Description	MultiWordNet is a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence with the Princeton WordNet synsets, whenever possible, and semantic relations are imported from the corresponding English synsets.
License and download	Free license for academic use; commercial license for other users (contact the authors)
Link	http://multiwordnet.fbk.eu/english/home.php
Contact	manspera[at]fbk.eu

PAROLE

Name	PAROLE-SIMPLE-CLIPS PISA Italian Lexicon
Author(s)
Description	PAROLE-SIMPLE-CLIPS is a four-level, general-purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were carried out in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
License and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=881?
Contact	http://www.elra.info/

WordNet Domains

Name	WordNet Domains
Author(s)	Bernardo Magnini, Emanuele Pianta, Luisa Bentivogli, Christian Girardi, Manuela Speranza, Gabriela Cavaglià, Pamela Forner, Giovanni Pezzulo, Lisa Zenoniani
Description	WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according to the WordNet Domain Hierarchy.
License and download	Free license by registration
Link	http://wndomains.fbk.eu/
Contact	manspera[at]fbk.eu

Terminologies

Ontologies

ItalWordNet

Name	ItalWordNet (Italian WordNet)
Author(s)
Description	ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as both models are similar, except for some new relations for adjectives that were added to ItalWordNet.
License and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=1110
Contact	http://www.elra.info/

SentiWordNet

Name	SentiWordNet
Author(s)	Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
Description	SentiWordNet is a lexical resource in which each WordNet synset is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. A typical use of SentiWordNet is to enrich the text representation in opinion mining applications, adding information on the sentiment-related properties of the terms in the text. It is endowed with a Web-based graphical user interface.
License and download	Free license by registration, for non-profit purposes only
Link	http://swn.isti.cnr.it/
Contact	bentivo[at]fbk.eu
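
To illustrate how the three scores might enrich a text representation, the Python sketch below averages Pos and Neg over the synsets of a term and derives the objectivity score as Obj(s) = 1 - Pos(s) - Neg(s), relying on the fact that the three scores of a synset sum to 1. The in-memory table SYNSET_SCORES, its values and the helper term_scores are hypothetical; they do not reproduce the resource's file format or its actual scores.

```python
# Hypothetical scores: term -> list of (Pos, Neg) pairs, one per synset
# the term belongs to. Values are illustrative, not from the resource.
SYNSET_SCORES = {
    "good": [(0.75, 0.0), (0.5, 0.0)],
    "bad": [(0.0, 0.625)],
}

def term_scores(term, table=SYNSET_SCORES):
    """Average Pos and Neg over a term's synsets and derive Obj.

    Returns a (pos, neg, obj) tuple; an unknown term is treated as
    fully objective, which is an assumption of this sketch.
    """
    synsets = table.get(term)
    if not synsets:
        return (0.0, 0.0, 1.0)
    pos = sum(p for p, _ in synsets) / len(synsets)
    neg = sum(n for _, n in synsets) / len(synsets)
    return (pos, neg, 1.0 - pos - neg)

for word in ["good", "bad", "table"]:
    print(word, term_scores(word))
```

Averaging over synsets ignores word sense; an application with sense-disambiguated input could look up the score of the single relevant synset instead.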

Spoken-language corpora

QALL-ME_benchmark

Name	QALL-ME benchmark
Author(s)	Cabrio E., Kouylekov M., Magnini B., Negri M., Hasler L., Orasan C., Tomas D., Vicedo J. L., Neumann G., Weber C.
Description	The QALL-ME benchmark is a collection of several thousand spoken utterances related to the domain of tourism, provided as both audio files and their corresponding transcriptions, in the four languages involved in the project: English, German, Italian and Spanish. These utterances ask for information about cultural events, accommodation, movies, gastronomy, etc. and have been transcribed according to guidelines set out by the QALL-ME consortium.
License and download	Creative Commons Licence 3.0
Link	http://qallme.fbk.eu/index.php?location=benchmark
Contact

VIT-parlato

Name	Venice Italian Treebank parlato (VIT-parlato)
Author(s)	Rodolfo Delmonte
Description	VIT-parlato is a treebank of constituent tree structures with tagged words. The full version of the resource is made available for free to all interested users. The treebank can be browsed on the website (see link) with a viewer-parser developed in Java, which shows each sentence with its words listed on the left and its structure on the right (the viewer does not work in Mozilla Firefox).
License and download	only for visualization
Link	http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_it2.htm
Contact	delmont[at]unive.it

Written-language corpora

CORIS/CODIS

Name	CORIS/CODIS
Author(s)	Rema Rossini Favretti, Nicola Grandi, Malvina Nissim, Fabio Tamburini, Andrea Zaninello
Description	The CORIS/CODIS corpus of written Italian was started with the purpose of creating a representative and sizeable general reference corpus of written Italian that would be easily accessible and user-friendly. CORIS contains 120 million words and has been updated every three years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format, chosen by virtue of their representativeness of modern Italian.
License and download	free by signing license agreement
Link	http://dslo.unibo.it/coris_eng.html
Contact	rema.rossini[at]unibo.it

CRIPCO

Name	CRIPCO (Cross-document Italian People Coreference Corpus)
Author(s)	Luisa Bentivogli, Christian Girardi, Emanuele Pianta
Description	The CRIPCO corpus is a subset of the news stories published by the local newspaper "L'Adige" from 1999 to 2006, annotated with information about cross-document coreference of persons.
License and download	freely available for research purposes by subscription
Link	http://hlt.fbk.eu/en/CRIPCO/
Contact	manspera[at]fbk.eu

I-CAB

Name	I-CAB (Italian Content Annotation Bank)
Author(s)	B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Romano, C. Girardi, M. Negri
Description	I-CAB is a corpus consisting of news stories taken from the local newspaper "L'Adige", in which semantic information is annotated at different levels: temporal expressions, entities (i.e. persons, organizations, locations, and geo-political entities) and relations between entities (e.g. the affiliation relation connecting a person to an organization).
License and download	freely available for research purposes upon acceptance of a license agreement
Link	http://ontotext.fbk.eu/icab.html
Contact	manspera[at]fbk.eu

ISST

Name	ISST (Italian Syntactic-Semantic Treebank)
Author(s)	S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, R. Delmonte
Description	ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task.
License and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=887
Contact	http://www.elra.info/

itWaC

Name	itWaC
Author(s)	the Wacky bunch
Description	itWaC is a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
License and download	free, on request to the contact address
Link	http://wacky.sslmit.unibo.it/doku.php?id=corporab
Contact	wacky[at]sslmit.unibo.it

la Repubblica

Name	"la Repubblica" corpus
Author(s)	M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni, E. Zanchetta, S. Castagnoli
Description	The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens). The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad-hoc resources), lemmatized (with Morph-it!) and categorized in terms of genre and topic (with SVMLight trained with ad-hoc resources).
License and download	Query service available to registered users
Link	http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
Contact	marco.baroni[at]unitn.it

SWiiT

Name	SWiiT: Semantic WIkipedia for ITalian
Author(s)	Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza
Description	SWiiT is the Italian Wikipedia annotated at five different levels: basic NLP processing (tokenization, sentence splitting and PoS-tagging), entity mentions (person, organization, location and geo-political entities), entity subtypes (not completed), entity co-reference (not completed), and dependency parsing (not completed).
License and download
Link	http://textpro.fbk.eu/resources/SWiiT.html
Contact	manspera[at]fbk.eu

TUT

Name	Turin University Treebank (TUT)
Author(s)	Leonardo Lesmo, Cristina Bosco, Alessandro Mazzei, Vincenzo Lombardo, Livio Robaldo
Description	TUT is a dependency-based treebank developed in parallel with the parsing environment TULE (see tools). It features an annotation based on the major principles of Hudson's Word Grammar, but also employs null elements to represent discontinuous and elliptical structures or pro-drops. It has also been made available in other, constituency-based formats.
License and download	Creative Commons
Link	http://www.di.unito.it/~tutreeb
Contact	bosco[at]di.unito.it

CCG-TUT

Name	Combinatory Categorial Grammar - Turin University Treebank (CCG-TUT)
Author(s)	Johan Bos, Cristina Bosco, Alessandro Mazzei
Description	CCG-TUT, the Italian CCGbank, is generated by a conversion process that takes as input a set of sentences in the TUT dependency format. These are (1) mapped onto constituency trees (i.e. the ConsTUT format), (2) which in turn undergo surgery to become binary trees, and (3) are then mapped into CCG derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated on constituents. In ConsTUT trees each terminal category X corresponds to a node (i.e. word) of a TUT tree, and projects into non-terminal nodes which represent intermediate (Xbar) and maximal (XP) projections of X, according to Xbar theory (for more details and download of TUT in ConsTUT format see TUT). The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms), (2) derivations (pretty printed) and (3) tuples of words, POS and CCG category.
License and download	Creative Commons
Link	http://www.di.unito.it/~tutreeb
Contact	bosco[at]di.unito.it

VIT

Name	Venice Italian Treebank (VIT)
Author(s)	Rodolfo Delmonte
Description	VIT is a treebank of 275,000 tokens organized in several representation layers: the first consists of constituent tree structures with tagged words; the second is derived from the first and consists of dependency structures organized in 8 columns, which describe the morphological and semantic features, the lemma, and the functional marker that differentiates topic/focus; other layers contain the orthographic version with multiwords and in canonical form. The full treebank is available against payment, while smaller free versions will soon be available on the PARLI web site.
License and download	only for visualization
Link	http://www.elda.org/catalogue/en/text/W0040.html
Contact	delmont[at]unive.it

VIT-scritto

Name	Venice Italian Treebank scritto (VIT-scritto)
Author(s)	Rodolfo Delmonte
Description	VIT-scritto is a treebank composed of constituent structure trees with tagged words. The full version is called VIT (see the previous resource in this list). The treebank can be browsed on the website (see link) with a viewer-parser developed in Java, which shows each sentence with its words listed on the left and its structure on the right (the viewer does not work in Mozilla Firefox).
License and download	only for visualization
Link	http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_b.htm
Contact	delmont[at]unive.it

Semantic VIT Fragment

Name	Semantic VIT Fragment
Author(s)	Rodolfo Delmonte
Description	The Semantic VIT Fragment is a portion of the VIT treebank (see above), which includes 511 sentences. Compared with the original VIT resource, the annotation in the Fragment has been improved with the introduction of null elements. This smaller version of the resource is available for free.
License and download	free
Link	http://project.cgm.unive.it/?page_id=200
Contact	delmont[at]unive.it

Wikisents for FrameNet

Name	Wikisents for FrameNet
Author(s)	Claudio Giuliano, Alfio Massimiliano Gliozzo, Carlo Strapparava, Sara Tonelli
Description	A large set of sentences extracted from Wikipedia, to which a frame label has been assigned using a Word Sense Disambiguation approach. Such sentences can either be used to extend the set of sentences already annotated for each frame, after manual validation, or be exploited as training data for frame identification.
License and download	free download
Link	http://hlt.fbk.eu/en/Technology/Wikisents_for_FrameNet
Contact	A

Cross-lingual corpora
(parallel or comparable)

MultiSemCor

Name	MultiSemCor
Author(s)	Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri
Description	MultiSemCor is an English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. The parallel corpus is created by exploiting the SemCor corpus, which is a subset of the English Brown corpus containing about 700,000 running words. In SemCor all the words are tagged by PoS, and more than 200,000 content words are also lemmatized and sense-tagged with reference to the Princeton WordNet lexical database.
License and download	License obtained by registration
Link	http://multisemcor.fbk.eu/index.php
Contact	manspera[at]fbk.eu

Other

Last updated 20 January 2012. Contact: bosco[at]di.unito.it