Resources

The following list includes resources developed for Italian NLP by researchers working in this area. Click on the name of a resource to see basic information about it and a link to its web site, if one exists.

Information about the systems and resources that participated in the Evalita 2011 contest will be made available soon after the workshop (Rome, 24-25 January 2012).

All suggestions and proposals about listed and unlisted resources are welcome, and can be sent using the Suggestions form.

Lexica and thesauri
(including Dictionaries, Wiktionaries, and WordNets)

BabelNet

Name	BabelNet
Author(s)	Roberto Navigli, Simone Paolo Ponzetto
Description	BabelNet is a very large multilingual semantic network with millions of concepts obtained from an integration of WordNet and Wikipedia based on an automatic mapping algorithm, and from the translations of the concepts (i.e. English Wikipedia pages and WordNet synsets) based on Wikipedia cross-language links and the output of a machine translation system. BabelNetXplorer is the online interface available on the BabelNet website for accessing the resource.
Licence and download	free
Link	http://babelnet.org/
Contact	navigli[at]di.uniroma1.it

ItalWordNet

Name	ItalWordNet (Italian WordNet)
Author(s)
Description	ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as the two models are similar except for some new relations, created for adjectives, which were added to ItalWordNet.
Licence and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=1110
Contact	http://www.elra.info/

Morph-it!

Name	Morph-it!
Author(s)	Marco Baroni, Eros Zanchetta
Description	Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer/morphological analyzer/morphological generator.
Licence and download	Creative Commons and CC-GNU LGPL
Link	http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
Contact	marco.baroni[at]unitn.it
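
As a rough illustration of how a lexicon of inflected forms can back a lemmatizer, the sketch below builds an index from inflected forms to their analyses. It assumes a tab-separated layout with one entry per line (inflected form, lemma, morphological features); that layout, the sample entries, the feature labels and the function names are assumptions made for illustration, not a specification of the distributed file.

```python
from collections import defaultdict

# Hypothetical sample in an assumed "form<TAB>lemma<TAB>features" layout.
SAMPLE = (
    "case\tcasa\tNOUN-F:p\n"
    "casa\tcasa\tNOUN-F:s\n"
    "porta\tporta\tNOUN-F:s\n"
    "porta\tportare\tVER:ind+pres+3+s\n"
)


def load_lexicon(text):
    """Index entries by inflected form; a form may have several analyses."""
    index = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        index[form.lower()].append((lemma, features))
    return index


def lemmatize(index, word):
    """Return the sorted candidate lemmas for a word (empty list if unknown)."""
    return sorted({lemma for lemma, _ in index.get(word.lower(), [])})


lexicon = load_lexicon(SAMPLE)
print(lemmatize(lexicon, "porta"))
```

An ambiguous form such as "porta" yields more than one candidate lemma, so a practical lemmatizer would still need POS tags or context to choose among them.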

MultiWordNet

Name	MultiWordNet
Author(s)	Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri, Manuela Speranza, Massimiliano Bampi, Gabriela Cavaglià, Francesca Filiaci, Bernardo Magnini, Carlo Strapparava, Lisa Zenoniani
Description	MultiWordNet is a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence with the Princeton WordNet synsets, whenever possible, and semantic relations are imported from the corresponding English synsets.
Licence and download	Free licence for academic use; commercial licence for other users (contact the authors)
Link	http://multiwordnet.fbk.eu/english/home.php
Contact	manspera[at]fbk.eu

PAROLE

Name	PAROLE-SIMPLE-CLIPS PISA Italian Lexicon
Author(s)
Description	PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were performed in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
Licence and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=881?
Contact	http://www.elra.info/

WordNet Domains

Name	WordNet Domains
Author(s)	Bernardo Magnini, Emanuele Pianta, Luisa Bentivogli, Christian Girardi, Manuela Speranza, Gabriela Cavaglià, Pamela Forner, Giovanni Pezzulo, Lisa Zenoniani
Description	WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according to the WordNet Domain Hierarchy.
Licence and download	Free license by registration
Link	http://wndomains.fbk.eu/
Contact	manspera[at]fbk.eu

Terminologies

Ontologies

ItalWordNet

Name	ItalWordNet (Italian WordNet)
Author(s)
Description	ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as the two models are similar except for some new relations, created for adjectives, which were added to ItalWordNet.
Licence and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=1110
Contact	http://www.elra.info/

SentiWordNet

Name	SentiWordNet
Author(s)	Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
Description	SentiWordNet is a lexical resource in which each WordNet synset is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. A typical use of SentiWordNet is to enrich the text representation in opinion mining applications, adding information on the sentiment-related properties of the terms in the text. It is endowed with a Web-based graphical user interface.
Licence and download	Free license by registration, for non-profit purposes only
Link	http://swn.isti.cnr.it/
Contact	bentivo[at]fbk.eu
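
To illustrate how the three scores might enrich a text representation, the sketch below averages scores over the terms of a text. It relies on the SentiWordNet convention that the three scores of a synset sum to 1, so Obj(s) = 1 - (Pos(s) + Neg(s)). The toy score table, its values and all function names are hypothetical, and a real application would first have to map words to synsets.

```python
# Hypothetical (Pos, Neg) scores keyed by term; real entries are keyed by synset.
TOY_SCORES = {"good": (0.75, 0.0), "bad": (0.0, 0.625), "table": (0.0, 0.0)}


def objectivity(pos, neg):
    """Obj(s) = 1 - (Pos(s) + Neg(s)), since the three scores sum to 1."""
    return 1.0 - (pos + neg)


def text_sentiment(tokens, scores=TOY_SCORES):
    """Average (pos, neg, obj) over the tokens found in the score table."""
    found = [scores[t] for t in tokens if t in scores]
    if not found:
        return (0.0, 0.0, 1.0)  # nothing known: treat the text as objective
    n = len(found)
    pos = sum(p for p, _ in found) / n
    neg = sum(q for _, q in found) / n
    return (pos, neg, objectivity(pos, neg))


print(text_sentiment(["a", "good", "table"]))
```

Averaging is only one possible aggregation; the resulting triple could be appended to a bag-of-words feature vector in an opinion mining pipeline.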

Spoken corpora

QALL-ME benchmark

Name	QALL-ME benchmark
Author(s)	Cabrio E., Kouylekov M., Magnini B., Negri M., Hasler L., Orasan C., Tomas D., Vicedo J. L., Neumann G., Weber C.
Description	The QALL-ME benchmark is a collection of several thousand spoken utterances related to the domain of tourism, comprising both audio files and their corresponding transcriptions, in the four languages involved in the project: English, German, Italian and Spanish. These utterances ask for information about cultural events, accommodation, movies, gastronomy, etc. and have been transcribed according to guidelines set out by the QALL-ME consortium.
Licence and download	Creative Commons Licence 3.0
Link	http://qallme.fbk.eu/index.php?location=benchmark
Contact

VIT-parlato

Name	Venice Italian Treebank parlato (VIT-parlato)
Author(s)	Rodolfo Delmonte
Description	VIT-parlato is a treebank of constituent tree structures with tagged words. The full version of the resource is freely available to all interested users. The content of the treebank can be viewed on the web site (see link) using a viewer-parser developed in Java, which shows the content of each sentence with the words in a list on the left and the structure on the right (the viewer does not work in Mozilla Firefox).
Licence and download	only for visualization
Link	http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_it2.htm
Contact	delmont[at]unive.it

Written corpora

CORIS/CODIS

Name	CORIS/CODIS
Author(s)	Rema Rossini Favretti, Nicola Grandi, Malvina Nissim, Fabio Tamburini, Andrea Zaninello
Description	The corpus of written Italian CORIS/CODIS started with the purpose of creating a representative and sizeable general reference corpus of written Italian which would be easily accessible and user-friendly. CORIS contains 120 million words and has been updated every three years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format chosen by virtue of their representativeness of modern Italian.
Licence and download	free by signing license agreement
Link	http://dslo.unibo.it/coris_eng.html
Contact	rema.rossini[at]unibo.it

CRIPCO

Name	CRIPCO (Cross-document Italian People Coreference Corpus)
Author(s)	Luisa Bentivogli, Christian Girardi, Emanuele Pianta
Description	The CRIPCO corpus is a subset of the news stories published by the local newspaper "L'Adige" from 1999 to 2006 annotated with information about person cross-document coreference.
Licence and download	freely available for research purposes by subscription
Link	http://hlt.fbk.eu/en/CRIPCO/
Contact	manspera[at]fbk.eu

I-CAB

Name	I-CAB (Italian Content Annotation Bank)
Author(s)	B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Romano, C. Girardi, M. Negri
Description	I-CAB is a corpus of news stories taken from the local newspaper "L'Adige", in which semantic information is annotated at different levels: temporal expressions, entities (i.e. persons, organizations, locations, and geo-political entities) and relations between entities (e.g. the affiliation relation connecting a person to an organization).
Licence and download	freely available for research purposes upon acceptance of a license agreement
Link	http://ontotext.fbk.eu/icab.html
Contact	manspera[at]fbk.eu

ISST

Name	ISST (Italian Syntactic-Semantic Treebank)
Author(s)	S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, R. Delmonte
Description	ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task.
Licence and download	Copyright by ELRA
Link	http://catalog.elra.info/product_info.php?products_id=887
Contact	http://www.elra.info/

itWaC

Name	itWaC
Author(s)	the Wacky bunch
Description	itWaC is a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger, and lemmatized using the Morph-it! lexicon.
Licence and download	free, on request to the contact address
Link	http://wacky.sslmit.unibo.it/doku.php?id=corporab
Contact	wacky[at]sslmit.unibo.it

la Repubblica

Name	"la Repubblica" corpus
Author(s)	M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni, E. Zanchetta, S. Castagnoli
Description	The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens). The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad-hoc resources), lemmatized (with Morph-it!) and categorized in terms of genre and topic (with SVMLight trained with ad-hoc resources).
Licence and download	Query service available to registered users
Link	http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
Contact	marco.baroni[at]unitn.it

SWiiT

Name	SWiiT: Semantic WIkipedia for ITalian
Author(s)	Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza
Description	SWiiT is the Italian Wikipedia annotated at five different levels: basic NLP processing (tokenization, sentence splitting and PoS-tagging), entity mentions (person, organization, location and geo-political entities), entity subtypes (not completed), entity co-reference (not completed), dependency parsing (not completed)
Licence and download
Link	http://textpro.fbk.eu/resources/SWiiT.html
Contact	manspera[at]fbk.eu

TUT

Name	Turin University Treebank (TUT)
Author(s)	Leonardo Lesmo, Cristina Bosco, Alessandro Mazzei, Vincenzo Lombardo, Livio Robaldo
Description	TUT is a dependency-based treebank developed in parallel with the parsing environment TULE (see tools). It features an annotation based on the major principles of Hudson's Word Grammar, but also employs null elements to represent discontinuous and elliptical structures and pro-drops. It has also been made available in constituency-based formats.
Licence and download	Creative Commons
Link	http://www.di.unito.it/~tutreeb
Contact	bosco[at]di.unito.it

CCG-TUT

Name	Combinatory Categorial Grammar - Turin University Treebank (CCG-TUT)
Author(s)	Johan Bos, Cristina Bosco, Alessandro Mazzei
Description	A process of conversion from TUT generates CCG-TUT, the Italian CCGbank, taking as input a set of sentences in the TUT format of dependencies. These are (1) mapped onto constituency trees (i.e. ConsTUT format), (2) which in turn undergo surgery to become binary trees, and (3) are then mapped into CCG derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated on constituents. In ConsTUT trees each terminal category X corresponds to a node (i.e. word) of a TUT tree, and projects into non-terminal nodes which represent intermediate (Xbar) and maximal (XP) projections of X, according to Xbar theory (for more details and download of TUT in CONS-TUT format see TUT). The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms), (2) derivations (pretty printed) and (3) tuples of words, POS and CCG category.
Licence and download	Creative Commons
Link	http://www.di.unito.it/~tutreeb/CCG-TUT/
Contact	bosco[at]di.unito.it

VIT

Name	Venice Italian Treebank (VIT)
Author(s)	Rodolfo Delmonte
Description	VIT is a treebank of 275,000 tokens organized in different representation layers: the first consists of constituent tree structures with tagged words; the second is derived from the first and consists of dependency structures organized in 8 columns, which describe the morphological and semantic features, the lemma, and the functional marker that differentiates topic from focus; other layers contain the orthographic version with multiwords and in canonical form. The full treebank is available against payment, while smaller free versions will soon be available on the PARLI web site.
Licence and download	only for visualization
Link	http://www.elda.org/catalogue/en/text/W0040.html
Contact	delmont[at]unive.it

VIT-scritto

Name	Venice Italian Treebank scritto (VIT-scritto)
Author(s)	Rodolfo Delmonte
Description	VIT-scritto is a treebank composed of constituent structure trees with tagged words. The full version is called VIT (see the previous resource in this list). This treebank can be viewed on the web site (see link) using a viewer-parser developed in Java, which shows the content of each sentence with the words in a list on the left and the structure on the right (the viewer does not work in Mozilla Firefox).
Licence and download	only for visualization
Link	http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_b.htm
Contact	delmont[at]unive.it

Semantic VIT Fragment

Name	Semantic VIT Fragment
Author(s)	Rodolfo Delmonte
Description	The Semantic VIT Fragment is a portion of the VIT treebank (see above), which includes 511 sentences. With respect to the original VIT resource, the annotation in the Fragment has been improved with the introduction of null elements. This smaller version of the resource is available for free.
Licence and download	free
Link	http://project.cgm.unive.it/?page_id=200
Contact	delmont[at]unive.it

Wikisents for FrameNet

Name	Wikisents for FrameNet
Author(s)	Claudio Giuliano, Alfio Massimiliano Gliozzo, Carlo Strapparava, Sara Tonelli
Description	A large set of sentences extracted from Wikipedia, to which a frame label has been assigned using a Word Sense Disambiguation approach. Such sentences can be used either to extend the amount of sentences already annotated for each frame through a manual validation, or exploited as training data for frame identification.
Licence and download	free download
Link	http://hlt.fbk.eu/en/Technology/Wikisents_for_FrameNet
Contact

Cross-linguistic corpora
(parallel or comparable)

MultiSemCor

Name	MultiSemCor
Author(s)	Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri
Description	MultiSemCor is an English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. The parallel corpus is created by exploiting the SemCor corpus, which is a subset of the English Brown corpus containing about 700,000 running words. In SemCor all the words are tagged by PoS, and more than 200,000 content words are also lemmatized and sense-tagged with reference to the Princeton WordNet lexical database.
Licence and download	License obtained by registration
Link	http://multisemcor.fbk.eu/index.php
Contact	manspera[at]fbk.eu

Others


Last updated January the 20th 2012, Contact: bosco[at]di.unito.it