Lexica and thesauri
(including Dictionaries, Wiktionaries, and WordNets)
|
|
-
BabelNet
Name |
BabelNet |
Author(s) |
Roberto Navigli, Simone Paolo Ponzetto |
Description |
BabelNet is a very large multilingual semantic network with millions of concepts obtained from
an integration of WordNet and Wikipedia based on an automatic mapping algorithm, and from the
translations of the concepts (i.e. English Wikipedia pages and WordNet synsets) based on Wikipedia
cross-language links and the output of a machine translation system.
BabelNetXplorer is the online interface available on the BabelNet website for accessing the resource.
|
Licence and download |
free |
Link |
http://babelnet.org/ |
Contact |
navigli[at]di.uniroma1.it |
-
ItalWordNet
Name |
ItalWordNet (Italian WordNet) |
Author(s) |
|
Description |
ItalWordNet is an updated version of the EuroWordNet Italian
database. The ItalWordNet database was produced within a national Italian programme
called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database,
ItalWordNet is provided in XML format. However, it remains partially compatible with
EuroWordNet, as both models are similar, except for some new relations, created for
adjectives, that were added to ItalWordNet.
|
Licence and download |
Copyright by ELRA |
Link |
http://catalog.elra.info/product_info.php?products_id=1110 |
Contact |
http://www.elra.info/ |
-
Morph-it!
Name |
Morph-it! |
Author(s) |
Marco Baroni, Eros Zanchetta |
Description |
Morph-it! is a free morphological resource for the Italian language, a lexicon of
inflected forms with their lemma and morphological features. The lexicon currently contains
505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a
lemmatizer/morphological analyzer/morphological generator.
|
Licence and download |
Creative Commons and CC-GNU LGPL |
Link |
http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica |
Contact |
marco.baroni[at]unitn.it |
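As an illustration of how a lexicon of inflected forms can serve as a data source for a lemmatizer, here is a minimal sketch. It assumes a tab-separated "form, lemma, features" layout; the sample entries, the tag strings, and the function names `build_index` and `lemmas` are hypothetical and should be checked against the actual Morph-it! distribution before use.

```python
from collections import defaultdict

# Hypothetical sample in an assumed "form<TAB>lemma<TAB>features" layout.
# The tag strings are illustrative, not taken from the real resource.
SAMPLE = "\n".join([
    "gatti\tgatto\tNOUN-M:p",
    "gatto\tgatto\tNOUN-M:s",
    "porta\tporta\tNOUN-F:s",
    "porta\tportare\tVER:ind+pres+3+s",
])


def build_index(text):
    """Map each inflected form to its list of (lemma, features) analyses."""
    index = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        index[form.lower()].append((lemma, features))
    return index


def lemmas(index, form):
    """Return the sorted candidate lemmas for a form (empty list if unknown)."""
    return sorted({lemma for lemma, _ in index.get(form.lower(), [])})


index = build_index(SAMPLE)
print(lemmas(index, "porta"))
```

An ambiguous form such as "porta" returns more than one lemma; choosing among them would need context (for example a POS tagger), which this sketch does not attempt.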
-
MultiWordNet
Name |
MultiWordNet |
Author(s) |
Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri,
Manuela Speranza, Massimiliano Bampi, Gabriela Cavaglià, Francesca Filiaci, Bernardo Magnini,
Carlo Strapparava, Lisa Zenoniani
|
Description |
MultiWordNet is a multilingual lexical database in which the Italian WordNet is
strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence
with the Princeton WordNet synsets, whenever possible, and semantic relations are imported
from the corresponding English synsets.
|
Licence and download |
Free licence for academic use; commercial licence for others (contact required) |
Link |
http://multiwordnet.fbk.eu/english/home.php |
Contact |
manspera[at]fbk.eu |
-
PAROLE
Name |
PAROLE-SIMPLE-CLIPS PISA Italian Lexicon |
Author(s) |
|
Description |
PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated
over three different projects. The kernel of the morphological and syntactic lexicons was built
in the framework of the LE-PAROLE project. The linguistic model and the core of the semantic
lexicon were elaborated in the LE-SIMPLE project, while the phonological level of description
and the extension of the lexical coverage were performed in the context of the Italian project
Corpora e Lessici dell'Italiano Parlato e Scritto (CLIPS).
|
Licence and download |
Copyright by ELRA |
Link |
http://catalog.elra.info/product_info.php?products_id=881? |
Contact |
http://www.elra.info/ |
-
WordNet Domains
Name |
WordNet Domains |
Author(s) |
Bernardo Magnini, Emanuele Pianta, Luisa Bentivogli, Christian Girardi, Manuela Speranza,
Gabriela Cavaglià, Pamela Forner, Giovanni Pezzulo, Lisa Zenoniani
|
Description |
WordNet Domains is a lexical resource created in a semi-automatic way by augmenting
WordNet with domain labels. WordNet synsets have been annotated with at least one semantic
domain label, selected from a set of about two hundred labels structured according to the
WordNet Domain Hierarchy.
|
Licence and download |
Free license by registration |
Link |
http://wndomains.fbk.eu/ |
Contact |
manspera[at]fbk.eu |
|
Terminologies
|
|
|
Ontologies
|
|
-
ItalWordNet
Name |
ItalWordNet (Italian WordNet) |
Author(s) |
|
Description |
ItalWordNet is an updated version of the EuroWordNet Italian
database. The ItalWordNet database was produced within a national Italian programme
called SI-TAL. It contains a total of 49,360 synsets. Unlike the EuroWordNet database,
ItalWordNet is provided in XML format. However, it remains partially compatible with
EuroWordNet, as both models are similar, except for some new relations, created for
adjectives, that were added to ItalWordNet.
|
Licence and download |
Copyright by ELRA |
Link |
http://catalog.elra.info/product_info.php?products_id=1110 |
Contact |
http://www.elra.info/ |
-
SentiWordNet
Name |
SentiWordNet |
Author(s) |
Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
|
Description |
SentiWordNet is a lexical resource in which each WordNet synset is associated with three
numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative
the terms contained in the synset are. A typical use of SentiWordNet is to enrich the text
representation in opinion mining applications, adding information on the sentiment-related
properties of the terms in text. It is endowed with a Web-based graphical user interface.
|
Licence and download |
Free license by registration, for non-profit purposes only |
Link |
http://swn.isti.cnr.it/ |
Contact |
bentivo[at]fbk.eu |
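To make the "enrich the text representation" use case concrete, here is a minimal sketch of per-synset scores and a simple polarity feature. The class `SynsetScores`, the function `text_polarity`, and the toy values are hypothetical. The sketch assumes that the three scores of a synset sum to 1, and it simplifies by keying scores on words rather than on disambiguated synsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    """Pos(s) and Neg(s) for one synset; Obj(s) is derived from them."""
    pos: float
    neg: float

    @property
    def obj(self) -> float:
        # Assumption: Pos(s) + Neg(s) + Obj(s) = 1 for every synset.
        return 1.0 - self.pos - self.neg


def text_polarity(tokens, scores):
    """Average Pos(s) - Neg(s) over the tokens that have an entry in scores."""
    found = [scores[t] for t in tokens if t in scores]
    if not found:
        return 0.0
    return sum(s.pos - s.neg for s in found) / len(found)


# Toy values for illustration only, not taken from the real resource.
TOY = {
    "good": SynsetScores(pos=0.75, neg=0.0),
    "bad": SynsetScores(pos=0.0, neg=0.625),
    "table": SynsetScores(pos=0.0, neg=0.0),
}

print(text_polarity(["a", "good", "table"], TOY))
```

A real opinion-mining pipeline would first map each word to a synset (word sense disambiguation) or aggregate over all of a word's synsets; the averaging shown here is only one possible feature.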
|
Spoken corpora
|
|
-
QALL-ME_benchmark
Name |
QALL-ME benchmark |
Author(s) |
Cabrio E., Kouylekov M., Magnini B., Negri M., Hasler L., Orasan C., Tomas D.,
Vicedo J. L., Neumann G., Weber C.
|
Description |
The QALL-ME benchmark is a collection of several thousand spoken utterances
related to the domain of tourism, both audio files and their corresponding transcriptions,
in the four languages involved in the project: English, German, Italian and Spanish.
These utterances ask for information about cultural events, accommodation, movies,
gastronomy, etc. and have been transcribed according to guidelines set out by the QALL-ME
consortium.
|
Licence and download |
Creative Commons Licence 3.0 |
Link |
http://qallme.fbk.eu/index.php?location=benchmark |
Contact |
|
-
VIT-parlato
Name |
Venice Italian Treebank parlato (VIT-parlato) |
Author(s) |
Rodolfo Delmonte
|
Description |
VIT-parlato is a treebank of constituent tree structures with tagged words.
The full version of the resource is freely available to all interested users.
The treebank can be browsed on the web site (see link) with a viewer-parser developed
in Java, which shows the content of each sentence, with the words in a list on the left and
the structure on the right (the viewer does not work in Mozilla Firefox).
|
Licence and download |
only for visualization |
Link |
http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_it2.htm |
Contact |
delmont[at]unive.it |
|
Written corpora
|
|
-
CORIS/CODIS
Name |
CORIS/CODIS |
Author(s) |
Rema Rossini Favretti, Nicola Grandi, Malvina Nissim, Fabio Tamburini, Andrea Zaninello
|
Description |
The corpus of written Italian CORIS/CODIS started with the purpose of creating a
representative and sizeable general reference corpus of written Italian which would be
easily accessible and user-friendly. CORIS contains 120 million words and has been
updated every three years by means of a built-in monitor corpus. It consists of a collection
of authentic and commonly occurring texts in electronic format chosen by virtue of their
representativeness of modern Italian.
|
Licence and download |
free by signing license agreement
|
Link |
http://dslo.unibo.it/coris_eng.html |
Contact |
rema.rossini[at]unibo.it
|
-
CRIPCO
Name |
CRIPCO (Cross-document Italian People Coreference Corpus) |
Author(s) |
Luisa Bentivogli, Christian Girardi, Emanuele Pianta |
Description |
The CRIPCO corpus is a subset of the news stories published by the local
newspaper "L'Adige" from 1999 to 2006 annotated with information about person
cross-document coreference.
|
Licence and download |
freely available for research purposes by subscription
|
Link |
http://hlt.fbk.eu/en/CRIPCO/ |
Contact |
manspera[at]fbk.eu |
-
I-CAB
Name |
I-CAB (Italian Content Annotation Bank) |
Author(s) |
B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli,
L. Romano, C. Girardi, M. Negri
|
Description |
I-CAB is a corpus consisting of news stories taken from the local
newspaper "L'Adige", in which semantic information is annotated at different levels:
temporal expressions, entities (i.e. persons, organizations, locations, and geo-political entities)
and relations between entities (e.g. the affiliation relation connecting a person to an organization).
|
Licence and download |
freely available for research purposes upon acceptance of a license agreement
|
Link |
http://ontotext.fbk.eu/icab.html |
Contact |
manspera[at]fbk.eu |
-
ISST
Name |
ISST (Italian Syntactic-Semantic Treebank) |
Author(s) |
S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli,
F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana,
F. Pianesi, R. Delmonte
|
Description |
ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and
semantic levels of linguistic description. Syntactic annotation is distributed over two different
levels: the constituent structure level and the functional relations level. The fifth level deals
with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads
(nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet
(see ELRA-M0018) is the reference lexical resource used for the sense tagging task.
|
Licence and download |
Copyright by ELRA
|
Link |
http://catalog.elra.info/product_info.php?products_id=887
|
Contact |
http://www.elra.info/ |
-
itWaC
Name |
itWaC |
Author(s) |
the Wacky bunch |
Description |
itWaC is a 2-billion-word corpus constructed from the Web by limiting
the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and
basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and
lemmatized using the Morph-it! lexicon.
|
Licence and download |
free, on request to the contact address |
Link |
http://wacky.sslmit.unibo.it/doku.php?id=corporab |
Contact |
wacky[at]sslmit.unibo.it |
-
la Repubblica
Name |
"la Repubblica" corpus |
Author(s) |
M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston,
M. Mazzoleni, E. Zanchetta, S. Castagnoli |
Description |
The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately
380M tokens). The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad hoc
resources), lemmatized (with Morph-it!) and categorized in terms of genre and topic (with
SVMLight trained with ad hoc resources). |
Licence and download |
Query service available to registered users |
Link |
http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica |
Contact |
marco.baroni[at]unitn.it |
-
SWiiT
Name |
SWiiT: Semantic WIkipedia for ITalian |
Author(s) |
Silvana Marianela Bernaola Biggio, Roberto Zanoli, Manuela Speranza
|
Description |
SWiiT is the Italian Wikipedia annotated at five different levels: basic NLP processing
(tokenization, sentence splitting and PoS-tagging), entity mentions (person, organization,
location and geo-political entities), entity subtypes (not completed), entity co-reference
(not completed), and dependency parsing (not completed).
|
Licence and download |
|
Link |
http://textpro.fbk.eu/resources/SWiiT.html |
Contact |
manspera[at]fbk.eu |
-
TUT
Name | Turin University Treebank (TUT) |
Author(s) | Leonardo Lesmo, Cristina Bosco, Alessandro Mazzei,
Vincenzo Lombardo, Livio Robaldo |
Description | TUT is a dependency-based treebank developed in parallel with
the parsing environment TULE (see tools). It features an annotation
based on the major principles of Hudson's Word Grammar, but also employs null elements to
represent discontinuous and elliptical structures and pro-drop. It has also been made available
in other, constituency-based formats |
Licence and download | Creative Commons |
Link |
http://www.di.unito.it/~tutreeb |
Contact | bosco[at]di.unito.it |
-
CCG-TUT
Name | Combinatory Categorial Grammar - Turin University Treebank (CCG-TUT) |
Author(s) | Johan Bos, Cristina Bosco, Alessandro Mazzei |
Description |
CCG-TUT, the Italian CCGbank, is generated by converting TUT. The input is a set of sentences in
the TUT dependency format. These are (1) mapped onto constituency trees (i.e. the ConsTUT format),
(2) which in turn undergo surgery to become binary trees, and (3) are then mapped into CCG
derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated
on constituents. In ConsTUT trees each terminal category X corresponds to a node (i.e. word) of a
TUT tree, and projects into non-terminal nodes which represent intermediate (Xbar) and maximal
(XP) projections of X, according to Xbar theory (for more details and to download TUT in ConsTUT
format, see TUT). The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms),
(2) derivations (pretty printed) and (3) tuples of words, POS and CCG category. |
Licence and download | Creative Commons |
Link |
http://www.di.unito.it/~tutreeb/CCG-TUT/ |
Contact | bosco[at]di.unito.it |
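Step (2) of the conversion described above turns n-ary constituency trees into binary ones. The sketch below shows one generic way to do this, right-branching binarization with primed intermediate labels. It is not the CCG-TUT authors' procedure, which the description does not detail; the tuple-based tree encoding and the function name `binarize` are assumptions made for illustration.

```python
def binarize(tree):
    """Right-binarize a tree.

    A tree is either a leaf (a string) or a pair (label, [children]).
    Nodes with more than two children get intermediate nodes whose label
    is the parent label followed by a prime (an illustrative convention).
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(child) for child in children]
    while len(children) > 2:
        # Fold the two rightmost children under a new intermediate node.
        right = children.pop()
        left = children.pop()
        children.append((label + "'", [left, right]))
    return (label, children)


# Hypothetical flat tree with three children under one node.
flat = ("S", [("NP", ["Maria"]), ("V", ["legge"]), ("NP", ["libri"])])
print(binarize(flat))
```

Whether to branch left or right, and how to label the intermediate nodes, are design choices that depend on the target grammar; a conversion to CCG would typically decide them using head and argument information.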
-
VIT
Name |
Venice Italian Treebank (VIT) |
Author(s) |
Rodolfo Delmonte
|
Description |
VIT is a treebank of 275,000 tokens organized in several
representation layers: the first consists of constituent tree structures with tagged words;
the second is derived from the first and consists of dependency structures organized
in 8 columns, which describe the morphological and semantic features, the lemma, and
the functional marker that differentiates topic/focus; other layers refer to the
orthographic version with multiwords and in canonical form.
The full treebank is available against payment, while smaller free versions will
soon be available on the PARLI web site.
|
Licence and download |
only for visualization |
Link |
http://www.elda.org/catalogue/en/text/W0040.html |
Contact |
delmont[at]unive.it |
-
VIT-scritto
Name |
Venice Italian Treebank scritto (VIT-scritto) |
Author(s) |
Rodolfo Delmonte
|
Description |
VIT-scritto is a treebank composed of constituent structure trees with tagged words.
The full version is called VIT (see the previous resource in this list).
This treebank can be browsed on the web site (see link) with a viewer-parser developed
in Java, which shows the content of each sentence, with the words in a list on the left and
the structure on the right (the viewer does not work in Mozilla Firefox).
|
Licence and download |
only for visualization |
Link |
http://project.cgm.unive.it/resource/VIT/Browser-VIT/indices/indexparsing_b.htm |
Contact |
delmont[at]unive.it |
-
Semantic VIT Fragment
Name |
Semantic VIT Fragment |
Author(s) |
Rodolfo Delmonte
|
Description |
The Semantic VIT Fragment is a portion of the VIT treebank (see above), which
includes 511 sentences. Compared with the original VIT resource, the annotation in the
Fragment has been improved with the introduction of null elements.
This smaller version of the resource is available for free.
|
Licence and download |
free |
Link |
http://project.cgm.unive.it/?page_id=200 |
Contact |
delmont[at]unive.it |
-
Wikisents for FrameNet
Name |
Wikisents for FrameNet |
Author(s) |
Claudio Giuliano, Alfio Massimiliano Gliozzo, Carlo Strapparava, Sara Tonelli
|
Description |
A large set of sentences extracted from Wikipedia, to which a frame label has
been assigned using a Word Sense Disambiguation approach. These sentences can
either be used, after manual validation, to extend the set of sentences already annotated
for each frame, or be exploited as training data for frame identification.
|
Licence and download |
free download |
Link |
http://hlt.fbk.eu/en/Technology/Wikisents_for_FrameNet |
Contact |
|
|
Cross-linguistic corpora
(parallel or comparable)
|
|
-
MultiSemCor
Name |
MultiSemCor |
Author(s) |
Emanuele Pianta, Luisa Bentivogli, Pamela Forner, Christian Girardi, Marcello Ranieri
|
Description |
MultiSemCor is an English/Italian parallel corpus, aligned at the word level
and annotated with PoS, lemma and word sense. The parallel corpus is created
by exploiting the SemCor corpus, which is a subset of the English Brown corpus
containing about 700,000 running words. In SemCor all the words are tagged by
PoS, and more than 200,000 content words are also lemmatized and sense-tagged
with reference to the Princeton WordNet lexical database.
|
Licence and download |
License obtained by registration |
Link |
http://multisemcor.fbk.eu/index.php |
Contact |
manspera[at]fbk.eu |
|
Others
|
|
|
|