Sense Embedding Models
- Contributor: Loïc Vial (ORCID: 0000-0001-6572-5887, loic-vial)
- Institution/Laboratory: Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG
This dataset contains the sense embedding models, or sense vectors, produced for the article "Sense Embeddings in Knowledge-Based Word Sense Disambiguation" by Loïc Vial, Benjamin Lecouteux and Didier Schwab, published in the proceedings of the 12th International Conference on Computational Semantics (IWCS 2017).
Read me file: readme.txt
This dataset contains the sense embedding models, or sense vectors, produced for the article "Sense Embeddings in Knowledge-Based Word Sense Disambiguation" by Loïc Vial, Benjamin Lecouteux and Didier Schwab, published in the proceedings of the 12th International Conference on Computational Semantics (IWCS 2017).
There are 3 directories, and each directory contains 5 files.
The 3 directories are:
- The directory "words", which contains the original word embedding models that were used to create the sense embedding models.
- The directory "senses", which contains the produced sense embedding models.
- The directory "combined", which contains embedding models holding both the word and the sense vectors.
Each file indicates its origin in its name:
- The files prefixed with "baroni_c" are the context-counting vectors from Baroni et al.'s work "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" (ACL 2014) (http://clic.cimec.unitn.it/composes/semantic-vectors.html).
- The files prefixed with "baroni_p" are the context-predicting vectors from the same work.
- The files prefixed with "deps" originate from Levy and Goldberg's work "Dependency-Based Word Embeddings" (ACL 2014) (https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/).
- The files prefixed with "glove" originate from Pennington et al.'s work "GloVe: Global Vectors for Word Representation" (EMNLP 2014) (https://nlp.stanford.edu/projects/glove/).
- The files prefixed with "word2vec" originate from Mikolov et al.'s work "Distributed Representations of Words and Phrases and their Compositionality" (NIPS 2013) (https://code.google.com/archive/p/word2vec/).
The sense embedding models contain all 206,941 senses of WordNet 3.0, accessible through their sense key (e.g. "sense%1:10:00::").
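As an aside, WordNet sense keys follow the documented grammar lemma%ss_type:lex_filenum:lex_id:head_word:head_id, so a key can be split into its parts with a few lines of Python. The helper below is a hypothetical sketch (not part of this dataset), shown on the example key above:

```python
def parse_sense_key(key):
    """Split a WordNet sense key into its components.

    Grammar: lemma%ss_type:lex_filenum:lex_id:head_word:head_id
    ss_type: 1=noun, 2=verb, 3=adjective, 4=adverb, 5=adjective satellite.
    """
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    pos = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}[ss_type]
    return {
        "lemma": lemma,
        "pos": pos,
        "lex_filenum": lex_filenum,
        "lex_id": lex_id,
        "head_word": head_word,   # empty except for adjective satellites
        "head_id": head_id,
    }
```

For example, parse_sense_key("sense%1:10:00::") yields the lemma "sense", the noun part of speech, lexicographer file 10, and lexical id 00.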
The models use the same binary format as typical word embedding models; for instance, the tools and scripts from the word2vec toolkit (https://github.com/dav/word2vec) can parse them.
In more detail, the format consists of:
1) A string representation of an integer giving the total number of vectors in the model, followed by a space character (hexadecimal value 0x20).
2) A string representation of an integer giving the number of dimensions of each vector, followed by a newline character (hexadecimal value 0x0A).
3) For each vector:
3a) A string giving the word or sense, followed by a space character (hexadecimal value 0x20).
3b) The vector components, stored as consecutive 32-bit floats (one per dimension).
2017-09-27
The size of this dataset is more than 4,000 MB.
Archive files
- baroni_c_combined.bin 997,124,596 KB
- baroni_p_combined.bin 799,099,604 KB
- deps_combined.bin 452,379,830 KB
- word2vec_combined.bin 3,805,853,896 KB
- glove_combined.bin 2,511,255,694 KB
- glove_senses.bin 247,007,678 KB
- baroni_c_senses.bin 408,686,584 KB
- baroni_p_senses.bin 327,847,131 KB
- deps_senses.bin 247,007,678 KB
- word2vec_senses.bin 247,007,678 KB
- deps_words.bin 205,372,163 KB
- word2vec_words.bin 3,558,846,213 KB
- baroni_c_words.bin 588,438,022 KB
- baroni_p_words.bin 471,252,484 KB
- glove_words.bin 2,264,248,027 KB
Related publications
Other metadata
- External Identifiers: -
- Subjects: Computer Science, Linguistics, Mathematics
- Keywords: sense embeddings, sense vectors, word sense disambiguation
- Corresponding tasks: word sense disambiguation
- Encoding data format: word2vec binary format
Vial L., Lecouteux B., Schwab D. (2017). Sense Embedding Models, companion dataset for the IWCS 2017 publication "Sense Embeddings in Knowledge-Based Word Sense Disambiguation" [dataset]. doi:10.18709/PERSCIDO.2017.10.DS117. Published 2017 via Perscido-Grenoble-Alpes.