Resources

The following list includes resources developed for Italian NLP by researchers working in this area. Click the name of a resource to see basic information about it and a link to its website, if one exists.

Information about the systems and resources that participated in the Evalita 2011 contest will be made available shortly after the workshop (Rome, 24-25 January 2012).

Suggestions and proposals about listed or unlisted resources are welcome and can be sent using the Suggestions form.

Lexica and thesauri
(including Dictionaries, Wiktionaries, and WordNets)
  • BabelNet
    Name BabelNet
    Author(s) Roberto Navigli, Simone Paolo Ponzetto
    Description BabelNet is a very large multilingual semantic network with millions of concepts obtained from an integration of WordNet and Wikipedia based on an automatic mapping algorithm, and from the translations of the concepts (i.e. English Wikipedia pages and WordNet synsets) based on Wikipedia cross-language links and the output of a machine translation system.
    BabelNetXplorer is the online interface available on the BabelNet website for accessing the resource.
    Licence and download free
    Link http://babelnet.org/
    Contact navigli[at]di.uniroma1.it
  • ItalWordNet
    Name ItalWordNet (Italian WordNet)
    Author(s)
    Description ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, since the two models are similar apart from some new relations for adjectives that were added to ItalWordNet.
    Licence and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=1110
    Contact http://www.elra.info/
  • Morph-it!
    Name Morph-it!
    Author(s) Marco Baroni, Eros Zanchetta
    Description Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer/morphological analyzer/morphological generator.
    Licence and download Creative Commons and CC-GNU LGPL
    Link http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
    Contact marco.baroni[at]unitn.it
  • MultiWordNet
    Name MultiWordNet
    Author(s) Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri, Manuela Speranza, Massimiliano Bampi, Gabriela Cavaglià, Francesca Filiaci, Bernardo Magnini, Carlo Strapparava, Lisa Zenoniani
    Description MultiWordNet is a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence with the Princeton WordNet synsets, whenever possible, and semantic relations are imported from the corresponding English synsets.
    Licence and download Free license for academic use; commercial license for other users (on request via the contact address)
    Link http://multiwordnet.fbk.eu/english/home.php
    Contact manspera[at]fbk.eu
  • PAROLE
    Name PAROLE-SIMPLE-CLIPS PISA Italian Lexicon
    Author(s)
    Description PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were performed in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
    Licence and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=881
    Contact http://www.elra.info/
  • WordNet Domains
    Name WordNet Domains
    Author(s) Bernardo Magnini, Emanuele Pianta, Luisa Bentivogli, Christian Girardi, Manuela Speranza, Gabriela Cavaglià, Pamela Forner, Giovanni Pezzulo, Lisa Zenoniani
    Description WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according to the WordNet Domain Hierarchy.
    Licence and download Free license by registration
    Link http://wndomains.fbk.eu/
    Contact manspera[at]fbk.eu
Terminologies
Ontologies
  • ItalWordNet
    Name ItalWordNet (Italian WordNet)
    Author(s)
    Description ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, since the two models are similar apart from some new relations for adjectives that were added to ItalWordNet.
    Licence and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=1110
    Contact http://www.elra.info/
  • SentiWordNet
    Name SentiWordNet
    Author(s) Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
    Description SentiWordNet is a lexical resource in which each WordNet synset is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. A typical use of SentiWordNet is to enrich the text representation in opinion mining applications, adding information on the sentiment-related properties of the terms in the text. It is endowed with a Web-based graphical user interface.
    Licence and download Free license by registration, for non-profit purposes only
    Link http://swn.isti.cnr.it/
    Contact bentivo[at]fbk.eu
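
As a rough illustration of how SentiWordNet-style scores might enrich a text representation, the Python sketch below averages Pos and Neg scores over the tokens of a text. The lexicon entries, their values, the direct word-to-score lookup (SentiWordNet scores synsets, not words) and the names SynsetScores, TOY_LEXICON and sentiment_features are all assumptions made for this example; none of them come from the SentiWordNet distribution. Obj is derived as 1 - (Pos + Neg), since the three scores of a synset sum to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    """Pos and Neg scores of one entry; Obj is derived from them."""
    pos: float
    neg: float

    @property
    def obj(self) -> float:
        # The three scores of a synset sum to 1.
        return 1.0 - (self.pos + self.neg)


# Hypothetical toy lexicon with made-up values, keyed by word for simplicity.
TOY_LEXICON = {
    "good": SynsetScores(pos=0.75, neg=0.0),
    "bad": SynsetScores(pos=0.0, neg=0.625),
    "table": SynsetScores(pos=0.0, neg=0.0),
}


def sentiment_features(tokens):
    """Average the scores of the tokens found in the lexicon.

    Tokens missing from the lexicon are ignored; a text with no known
    tokens is treated as fully objective.
    """
    scored = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    if not scored:
        return {"pos": 0.0, "neg": 0.0, "obj": 1.0}
    n = len(scored)
    return {
        "pos": sum(s.pos for s in scored) / n,
        "neg": sum(s.neg for s in scored) / n,
        "obj": sum(s.obj for s in scored) / n,
    }


features = sentiment_features(["a", "good", "table"])
print(features)
```

The three averaged values could then be appended to whatever feature vector an opinion mining system already uses. A real application would first need to map each word to a synset (for example by word sense disambiguation) before looking up its scores.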
Spoken corpora
  • QALL-ME_benchmark
    Name QALL-ME benchmark
    Author(s) Cabrio E., Kouylekov M., Magnini B., Negri M., Hasler L., Orasan C., Tomas D., Vicedo J. L., Neumann G., Weber C.
    Description The QALL-ME benchmark is a collection of several thousand spoken utterances related to the domain of tourism, provided as both audio files and their corresponding transcriptions, in the four languages involved in the project: English, German, Italian and Spanish. These utterances ask for information about cultural events, accommodation, movies, gastronomy, etc. and have been transcribed according to guidelines set out by the QALL-ME consortium.
    Licence and download Creative Commons Licence 3.0
    Link http://qallme.fbk.eu/index.php?location=benchmark
    Contact
  • VIT-parlato
    Name Venice Italian Treebank parlato (VIT-parlato)
    Author(s) Rodolfo Delmonte
    Description VIT-parlato is a treebank of constituent tree structures with tagged words. The full version of the resource is freely available to all interested users. The treebank can be browsed on the website (see link) with a viewer-parser developed in Java, which shows each sentence with its words listed on the left and its structure on the right (the viewer does not work in Mozilla Firefox).
    Licence and download only for visualization
    Link http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_it2.htm
    Contact delmont[at]unive.it
Written corpora
  • CORIS/CODIS
    Name CORIS/CODIS
    Author(s) Rema Rossini Favretti, Nicola Grandi, Malvina Nissim, Fabio Tamburini, Andrea Zaninello
    Description The corpus of written Italian CORIS/CODIS started with the purpose of creating a representative and sizeable general reference corpus of written Italian which would be easily accessible and user-friendly. CORIS contains 120 million words and has been updated every three years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format chosen by virtue of their representativeness of modern Italian.
    Licence and download free by signing license agreement
    Link http://dslo.unibo.it/coris_eng.html
    Contact rema.rossini[at]unibo.it
  • CRIPCO
    Name CRIPCO (Cross-document Italian People Coreference Corpus)
    Author(s) Luisa Bentivogli, Christian Girardi, Emanuele Pianta
    Description The CRIPCO corpus is a subset of the news stories published by the local newspaper "L'Adige" from 1999 to 2006 annotated with information about person cross-document coreference.
    Licence and download freely available for research purposes by subscription
    Link http://hlt.fbk.eu/en/CRIPCO/
    Contact manspera[at]fbk.eu
  • I-CAB
    Name I-CAB (Italian Content Annotation Bank)
    Author(s) B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Romano, C. Girardi, M. Negri
    Description I-CAB is a corpus of news stories taken from the local newspaper "L'Adige", in which semantic information is annotated at different levels: temporal expressions, entities (i.e. persons, organizations, locations, and geo-political entities) and relations between entities (e.g. the affiliation relation connecting a person to an organization).
    Licence and download freely available for research purposes upon acceptance of a license agreement
    Link http://ontotext.fbk.eu/icab.html
    Contact manspera[at]fbk.eu
  • ISST
    Name ISST (Italian Syntactic-Semantic Treebank)
    Author(s) S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, R. Delmonte
    Description ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task.
    Licence and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=887
    Contact http://www.elra.info/
  • itWaC
    Name itWaC
    Author(s) the Wacky bunch
    Description itWaC is a 2-billion-word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
    Licence and download for free, by contacting the contact address
    Link http://wacky.sslmit.unibo.it/doku.php?id=corporab
    Contact wacky[at]sslmit.unibo.it
  • la Repubblica
    Name "la Repubblica" corpus
    Author(s) M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni, E. Zanchetta, S. Castagnoli
    Description The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens). The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad-hoc resources), lemmatized (with Morph-it!) and categorized in terms of genre and topic (with SVMLight trained with ad-hoc resources).
    Licence and download Query service available to registered users
    Link http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
    Contact marco.baroni[at]unitn.it
  • SWiiT
    Name SWiiT: Semantic WIkipedia for ITalian
    Author(s) Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza
    Description SWiiT is the Italian Wikipedia annotated at five different levels: basic NLP processing (tokenization, sentence splitting and PoS-tagging), entity mentions (person, organization, location and geo-political entities), entity subtypes (not completed), entity co-reference (not completed), dependency parsing (not completed)
    Licence and download
    Link http://textpro.fbk.eu/resources/SWiiT.html
    Contact manspera[at]fbk.eu
  • TUT
    Name Turin University Treebank (TUT)
    Author(s) Leonardo Lesmo, Cristina Bosco, Alessandro Mazzei, Vincenzo Lombardo, Livio Robaldo
    Description TUT is a dependency-based treebank developed in parallel with the parsing environment TULE (see tools). It features an annotation based on the major principles of Hudson's Word Grammar, but also employs null elements to represent discontinuous and elliptical structures or pro-drops. It has also been made available in other formats based on constituency.
    Licence and download Creative Commons
    Link http://www.di.unito.it/~tutreeb
    Contact bosco[at]di.unito.it
  • CCG-TUT
    Name Combinatory Categorial Grammar - Turin University Treebank (CCG-TUT)
    Author(s) Johan Bos, Cristina Bosco, Alessandro Mazzei
    Description A process of conversion from TUT generates CCG-TUT, the Italian CCGbank, taking as input a set of sentences in the TUT format of dependencies. These are (1) mapped onto constituency trees (i.e. ConsTUT format), (2) which in turn undergo surgery to become binary trees, and (3) are then mapped into CCG derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated on constituents. In ConsTUT trees each terminal category X corresponds to a node (i.e. word) of a TUT tree, and projects into non-terminal nodes which represent intermediate (Xbar) and maximal (XP) projections of X, according to Xbar theory (for more details and download of TUT in CONS-TUT format see TUT). The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms), (2) derivations (pretty printed) and (3) tuples of words, POS and CCG category.
    Licence and download Creative Commons
    Link http://www.di.unito.it/~tutreeb/CCG-TUT/
    Contact bosco[at]di.unito.it
  • VIT
    Name Venice Italian Treebank (VIT)
    Author(s) Rodolfo Delmonte
    Description VIT is a treebank of 275,000 tokens composed of different representation layers: the first consists of constituent tree structures with tagged words; the second is derived from the first and consists of dependency structures organized in 8 columns, which describe the morphological and semantic features, the lemma, and the functional marker that differentiates topic/focus; other layers refer to the orthographic version with multiwords and in canonical form. The full treebank is available for a fee, while smaller free versions will soon be available on the PARLI website.
    Licence and download only for visualization
    Link http://www.elda.org/catalogue/en/text/W0040.html
    Contact delmont[at]unive.it
  • VIT-scritto
    Name Venice Italian Treebank scritto (VIT-scritto)
    Author(s) Rodolfo Delmonte
    Description VIT-scritto is a treebank composed of constituent structure trees with tagged words. The full version is called VIT (see the previous resource in this list). The treebank can be browsed on the website (see link) with a viewer-parser developed in Java, which shows each sentence with its words listed on the left and its structure on the right (the viewer does not work in Mozilla Firefox).
    Licence and download only for visualization
    Link http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_b.htm
    Contact delmont[at]unive.it
  • Semantic VIT Fragment
    Name Semantic VIT Fragment
    Author(s) Rodolfo Delmonte
    Description The Semantic VIT Fragment is a portion of the VIT treebank (see above) that includes 511 sentences. Compared with the original VIT resource, the annotation in the Fragment has been improved by introducing null elements. This smaller version of the resource is available for free.
    Licence and download free
    Link http://project.cgm.unive.it/?page_id=200
    Contact delmont[at]unive.it
  • Wikisents for FrameNet
    Name Wikisents for FrameNet
    Author(s) Claudio Giuliano, Alfio Massimiliano Gliozzo, Carlo Strapparava, Sara Tonelli
    Description A large set of sentences extracted from Wikipedia, each assigned a frame label using a Word Sense Disambiguation approach. These sentences can either be used, after manual validation, to extend the number of sentences already annotated for each frame, or be exploited as training data for frame identification.
    Licence and download free download
    Link http://hlt.fbk.eu/en/Technology/Wikisents_for_FrameNet
    Contact
Cross-linguistic corpora
(parallel or comparable)
  • MultiSemCor
    Name MultiSemCor
    Author(s) Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri
    Description MultiSemCor is an English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. The parallel corpus is created by exploiting the SemCor corpus, which is a subset of the English Brown corpus containing about 700,000 running words. In SemCor all the words are tagged by PoS, and more than 200,000 content words are also lemmatized and sense-tagged with reference to the Princeton WordNet lexical database.
    Licence and download License obtained by registration
    Link http://multisemcor.fbk.eu/index.php
    Contact manspera[at]fbk.eu
Others






Last updated 20 January 2012. Contact: bosco[at]di.unito.it