Resources

The following list includes the resources for Italian NLP developed by all the researchers working in this area. Clicking on the name of a resource opens a record with its basic information (in most cases in English, as provided by the authors) and the link to its website.

Information about the systems and resources that took part in the Evalita 2011 evaluation campaign will be made available shortly after the workshop (Rome, 24-25 January 2012).

All reports and suggestions concerning resources, whether or not they are already in the list, are welcome and can be sent using the form in the Reports (Segnalazioni) section of this site.

Lexicons and thesauri
(including dictionaries, Wiktionary and WordNets)
  • BabelNet
    Name BabelNet
    Authors Roberto Navigli, Simone Paolo Ponzetto
    Description BabelNet is a very large multilingual semantic network with millions of concepts obtained from an integration of WordNet and Wikipedia based on an automatic mapping algorithm, and from the translations of the concepts (i.e. English Wikipedia pages and WordNet synsets) based on Wikipedia cross-language links and the output of a machine translation system.
    BabelNetXplorer is the online interface available on the BabelNet website for accessing the resource.
    License and download free
    Link http://babelnet.org/
    Contact navigli[at]di.uniroma1.it
  • ItalWordNet
    Name ItalWordNet (Italian WordNet)
    Authors
    Description ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as the two models are similar except for some new relations for adjectives that were added to ItalWordNet.
    License and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=1110
    Contact http://www.elra.info/
  • Morph-it!
    Name Morph-it!
    Authors Marco Baroni, Eros Zanchetta
    Description Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer, morphological analyzer, or morphological generator.
    License and download Creative Commons and CC-GNU LGPL
    Link http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
    Contact marco.baroni[at]unitn.it
  • MultiWordNet
    Name MultiWordNet
    Authors Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri, Manuela Speranza, Massimiliano Bampi, Gabriela Cavaglià, Francesca Filiaci, Bernardo Magnini, Carlo Strapparava, Lisa Zenoniani
    Description MultiWordNet is a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence with the Princeton WordNet synsets whenever possible, and semantic relations are imported from the corresponding English synsets.
    License and download Free license for academic use; commercial license for others (contact the authors)
    Link http://multiwordnet.fbk.eu/english/home.php
    Contact manspera[at]fbk.eu
  • PAROLE
    Name PAROLE-SIMPLE-CLIPS PISA Italian Lexicon
    Authors
    Description PAROLE-SIMPLE-CLIPS is a four-level, general-purpose lexicon that was elaborated over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description and the extension of the lexical coverage were carried out in the context of the Italian project Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
    License and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=881?
    Contact http://www.elra.info/
  • WordNet Domains
    Name WordNet Domains
    Authors Bernardo Magnini, Emanuele Pianta, Luisa Bentivogli, Christian Girardi, Manuela Speranza, Gabriela Cavaglià, Pamela Forner, Giovanni Pezzulo, Lisa Zenoniani
    Description WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according to the WordNet Domain Hierarchy.
    License and download Free license by registration
    Link http://wndomains.fbk.eu/
    Contact manspera[at]fbk.eu
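As an illustration of how a lexicon of inflected forms such as Morph-it! could serve as a data source for a lemmatizer, the following is a minimal sketch. It assumes a tab-separated layout of form, lemma and features on each line; the sample entries, tag names and function names are hypothetical and chosen only for the example, so the actual distribution file should be checked before relying on this layout.

```python
from collections import defaultdict

# Hypothetical sample in an assumed "form<TAB>lemma<TAB>features" layout.
# The real lexicon file may use a different layout or different tag names.
SAMPLE = (
    "case\tcasa\tNOUN-F:p\n"
    "casa\tcasa\tNOUN-F:s\n"
    "parlo\tparlare\tVER:ind+pres+1+s\n"
)


def load_lexicon(text):
    """Map each inflected form to its list of (lemma, features) analyses."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        lexicon[form].append((lemma, features))
    return dict(lexicon)


def lemmatize(lexicon, form):
    """Return the sorted candidate lemmas for a form (empty list if unknown)."""
    return sorted({lemma for lemma, _ in lexicon.get(form, [])})


LEXICON = load_lexicon(SAMPLE)
print(lemmatize(LEXICON, "case"))
```

A form can have several analyses, so the lookup returns a list of candidate lemmas rather than a single one; choosing among them would require context, for example a POS tagger.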
Terminologies
Ontologies
  • ItalWordNet
    Name ItalWordNet (Italian WordNet)
    Authors
    Description ItalWordNet is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database, ItalWordNet is provided in XML format. However, it remains partially compatible with EuroWordNet, as the two models are similar except for some new relations for adjectives that were added to ItalWordNet.
    License and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=1110
    Contact http://www.elra.info/
  • SentiWordNet
    Name SentiWordNet
    Authors Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
    Description SentiWordNet is a lexical resource in which each WordNet synset is associated with three numerical scores Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. A typical use of SentiWordNet is to enrich the text representation in opinion mining applications, adding information on the sentiment-related properties of the terms in the text. It comes with a Web-based graphical user interface.
    License and download Free license by registration, for non-profit purposes only
    Link http://swn.isti.cnr.it/
    Contact bentivo[at]fbk.eu
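The typical use mentioned for SentiWordNet, enriching a text representation with sentiment scores, can be sketched as follows. This is only an illustration under stated assumptions: the score table is a hypothetical toy keyed by word rather than by synset, the values are invented, and Obj is derived on the assumption that the three scores of an entry sum to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    pos: float
    neg: float

    @property
    def obj(self):
        # Assumption: Pos, Neg and Obj sum to 1 for each entry.
        return 1.0 - self.pos - self.neg


# Hypothetical toy scores keyed by word; the real resource scores synsets.
TOY_SCORES = {
    "good": SynsetScores(pos=0.75, neg=0.0),
    "bad": SynsetScores(pos=0.0, neg=0.625),
    "table": SynsetScores(pos=0.0, neg=0.0),
}


def sentiment_features(tokens, scores):
    """Average Pos, Neg and Obj over the tokens present in the score table."""
    known = [scores[t] for t in tokens if t in scores]
    if not known:
        # No scored token: treat the text as fully objective.
        return {"pos": 0.0, "neg": 0.0, "obj": 1.0}
    n = len(known)
    return {
        "pos": sum(s.pos for s in known) / n,
        "neg": sum(s.neg for s in known) / n,
        "obj": sum(s.obj for s in known) / n,
    }


FEATURES = sentiment_features(["a", "good", "table"], TOY_SCORES)
print(FEATURES)
```

The three averages can be appended to an existing feature vector (for example bag-of-words counts) before training an opinion mining classifier. Mapping words to synsets would in practice require word sense disambiguation, which this sketch leaves out.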
Spoken-language corpora
  • QALL-ME_benchmark
    Name QALL-ME benchmark
    Authors Cabrio E., Kouylekov M., Magnini B., Negri M., Hasler L., Orasan C., Tomas D., Vicedo J. L., Neumann G., Weber C.
    Description The QALL-ME benchmark is a collection of several thousand spoken utterances related to the domain of tourism, both audio files and their corresponding transcriptions, in the four languages involved in the project: English, German, Italian and Spanish. These utterances ask for information about cultural events, accommodation, movies, gastro, etc. and have been transcribed according to guidelines set out by the QALL-ME consortium.
    License and download Creative Commons Licence 3.0
    Link http://qallme.fbk.eu/index.php?location=benchmark
    Contact
  • VIT-parlato
    Name Venice Italian Treebank parlato (VIT-parlato)
    Authors Rodolfo Delmonte
    Description VIT-parlato is a treebank of constituent tree structures with tagged words. The full version of the resource is freely available to all interested users. The treebank can be browsed on the website (see link) using a viewer-parser developed in Java, which shows the content of each sentence with the words in a list on the left and the structure on the right (the viewer does not work in Mozilla Firefox).
    License and download only for visualization
    Link http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_it2.htm
    Contact delmont[at]unive.it
Written-text corpora
  • CORIS/CODIS
    Name CORIS/CODIS
    Authors Rema Rossini Favretti, Nicola Grandi, Malvina Nissim, Fabio Tamburini, Andrea Zaninello
    Description The CORIS/CODIS corpus of written Italian was started with the aim of creating a representative and sizeable general reference corpus of written Italian that would be easily accessible and user-friendly. CORIS contains 120 million words and has been updated every three years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format, chosen for their representativeness of modern Italian.
    License and download free by signing a license agreement
    Link http://dslo.unibo.it/coris_eng.html
    Contact rema.rossini[at]unibo.it
  • CRIPCO
    Name CRIPCO (Cross-document Italian People Coreference Corpus)
    Authors Luisa Bentivogli, Christian Girardi, Emanuele Pianta
    Description The CRIPCO corpus is a subset of the news stories published by the local newspaper "L'Adige" from 1999 to 2006, annotated with information about cross-document coreference of persons.
    License and download freely available for research purposes by subscription
    Link http://hlt.fbk.eu/en/CRIPCO/
    Contact manspera[at]fbk.eu
  • I-CAB
    Name I-CAB (Italian Content Annotation Bank)
    Authors B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Romano, C. Girardi, M. Negri
    Description I-CAB is a corpus consisting of news stories taken from the local newspaper "L'Adige", in which semantic information is annotated at different levels: temporal expressions, entities (i.e. persons, organizations, locations, and geo-political entities) and relations between entities (e.g. the affiliation relation connecting a person to an organization).
    License and download freely available for research purposes upon acceptance of a license agreement
    Link http://ontotext.fbk.eu/icab.html
    Contact manspera[at]fbk.eu
  • ISST
    Name ISST (Italian Syntactic-Semantic Treebank)
    Authors S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, R. Delmonte
    Description ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information. ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task.
    License and download Copyright by ELRA
    Link http://catalog.elra.info/product_info.php?products_id=887
    Contact http://www.elra.info/
  • itWaC
    Name itWaC
    Authors the Wacky bunch
    Description itWaC is a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
    License and download free, by writing to the contact address
    Link http://wacky.sslmit.unibo.it/doku.php?id=corporab
    Contact wacky[at]sslmit.unibo.it
  • la Repubblica
    Name "la Repubblica" corpus
    Authors M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni, E. Zanchetta, S. Castagnoli
    Description The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens). The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad hoc resources), lemmatized (with Morph-it!) and categorized in terms of genre and topic (with SVMLight trained with ad hoc resources).
    License and download Query service available to registered users
    Link http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
    Contact marco.baroni[at]unitn.it
  • SWiiT
    Name SWiiT: Semantic WIkipedia for ITalian
    Authors Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza
    Description SWiiT is the Italian Wikipedia annotated at five different levels: basic NLP processing (tokenization, sentence splitting and PoS-tagging), entity mentions (person, organization, location and geo-political entities), entity subtypes (not completed), entity co-reference (not completed), and dependency parsing (not completed).
    License and download
    Link http://textpro.fbk.eu/resources/SWiiT.html
    Contact manspera[at]fbk.eu
  • TUT
    Name Turin University Treebank (TUT)
    Authors Leonardo Lesmo, Cristina Bosco, Alessandro Mazzei, Vincenzo Lombardo, Livio Robaldo
    Description TUT is a dependency-based treebank developed in parallel with the parsing environment TULE (see tools). Its annotation is based on the major principles of Hudson's Word Grammar, but it also employs null elements to represent discontinuous and elliptical structures or pro-drops. It has also been made available in other, constituency-based formats.
    License and download Creative Commons
    Link http://www.di.unito.it/~tutreeb
    Contact bosco[at]di.unito.it
  • CCG-TUT
    Name Combinatory Categorial Grammar - Turin University Treebank (CCG-TUT)
    Authors Johan Bos, Cristina Bosco, Alessandro Mazzei
    Description CCG-TUT, the Italian CCGbank, is generated by a conversion process that takes as input a set of sentences in the TUT dependency format. These are (1) mapped onto constituency trees (i.e. the ConsTUT format), (2) which in turn undergo surgery to become binary trees, and (3) are then mapped into CCG derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated on constituents. In ConsTUT trees each terminal category X corresponds to a node (i.e. word) of a TUT tree, and projects into non-terminal nodes which represent intermediate (Xbar) and maximal (XP) projections of X, according to Xbar theory (for more details and to download TUT in ConsTUT format, see TUT). The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms), (2) derivations (pretty printed) and (3) tuples of words, POS and CCG category.
    License and download Creative Commons
    Link http://www.di.unito.it/~tutreeb
    Contact bosco[at]di.unito.it
  • VIT
    Name Venice Italian Treebank (VIT)
    Authors Rodolfo Delmonte
    Description VIT is a treebank of 275,000 tokens composed of different representation layers: the first consists of constituent tree structures with tagged words; the second is derived from the first and consists of dependency structures organized in 8 columns, which describe the morphological and semantic features, the lemma, and the functional marker that differentiates topic/focus; other layers refer to the orthographical version with multiwords and in canonical form. The full treebank is available for a fee, while smaller free versions will soon be available on the PARLI web site.
    License and download only for visualization
    Link http://www.elda.org/catalogue/en/text/W0040.html
    Contact delmont[at]unive.it
  • VIT-scritto
    Name Venice Italian Treebank scritto (VIT-scritto)
    Authors Rodolfo Delmonte
    Description VIT-scritto is a treebank composed of constituent structure trees with tagged words. The full version is called VIT (see the previous resource in this list). The treebank can be browsed on the website (see link) using a viewer-parser developed in Java, which shows the content of each sentence with the words in a list on the left and the structure on the right (the viewer does not work in Mozilla Firefox).
    License and download only for visualization
    Link http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_b.htm
    Contact delmont[at]unive.it
  • Semantic VIT Fragment
    Name Semantic VIT Fragment
    Authors Rodolfo Delmonte
    Description The Semantic VIT Fragment is a portion of the VIT treebank (see above) that includes 511 sentences. Compared with the original VIT resource, the annotation in the Fragment has been improved with the introduction of null elements. This smaller version of the resource is available for free.
    License and download free
    Link http://project.cgm.unive.it/?page_id=200
    Contact delmont[at]unive.it
  • Wikisents for FrameNet
    Name Wikisents for FrameNet
    Authors Claudio Giuliano, Alfio Massimiliano Gliozzo, Carlo Strapparava, Sara Tonelli
    Description A large set of sentences extracted from Wikipedia, to which a frame label has been assigned using a Word Sense Disambiguation approach. These sentences can be used either to extend, after manual validation, the set of sentences already annotated for each frame, or as training data for frame identification.
    License and download free download
    Link http://hlt.fbk.eu/en/Technology/Wikisents_for_FrameNet
    Contact A
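The CCG-TUT entry above mentions a step in which constituency trees "undergo surgery to become binary trees". The following is a generic sketch of what tree binarization means, not the conversion procedure used by the CCG-TUT authors: it folds the rightmost children of any node with more than two children under intermediate nodes. The tuple-based tree encoding, the primed intermediate labels and the sample phrase are assumptions made for the example.

```python
# A tree is either a word (str) or a tuple: (label, child1, child2, ...).


def binarize(tree):
    """Right-binarize a tree so that every node has at most two children."""
    if isinstance(tree, str):
        return tree  # a word is a leaf and needs no change
    label, *children = tree
    children = [binarize(child) for child in children]
    while len(children) > 2:
        # Fold the two rightmost children under an intermediate node.
        right = children.pop()
        left = children.pop()
        children.append((label + "'", left, right))
    return (label, *children)


# Hypothetical flat noun phrase with three children.
TREE = ("NP", ("DET", "il"), ("ADJ", "vecchio"), ("N", "libro"))
BINARY = binarize(TREE)
print(BINARY)
```

Right-branching is only one possible choice; a linguistically motivated conversion would typically decide the attachment direction from the head of each constituent.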
Cross-lingual corpora
(parallel or comparable)
  • MultiSemCor
    Name MultiSemCor
    Authors Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri
    Description MultiSemCor is an English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. The parallel corpus is created by exploiting the SemCor corpus, which is a subset of the English Brown corpus containing about 700,000 running words. In SemCor all the words are tagged by PoS, and more than 200,000 content words are also lemmatized and sense-tagged with reference to the Princeton WordNet lexical database.
    License and download License obtained by registration
    Link http://multisemcor.fbk.eu/index.php
    Contact manspera[at]fbk.eu
Other






Last updated 20 January 2012. Contact: bosco[at]di.unito.it