There are 3 directories, each containing 5 files:
- The directory "words" contains the original word embedding models used to create the sense embedding models.
- The directory "senses" contains the produced sense embedding models.
- The directory "combined" contains embedding models holding both the words and the senses.
Each file indicates its origin in its name:
- The files prefixed with "baroni_c" are the context-counting vectors from Baroni et al., "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" (ACL 2014) (http://clic.cimec.unitn.it/composes/semantic-vectors.html).
- The files prefixed with "baroni_p" are the context-predicting vectors from the same work.
- The files prefixed with "deps" originate from Levy and Goldberg, "Dependency-Based Word Embeddings" (ACL 2014) (https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/).
- The files prefixed with "glove" originate from Pennington et al., "GloVe: Global Vectors for Word Representation" (EMNLP 2014) (https://nlp.stanford.edu/projects/glove/).
- The files prefixed with "word2vec" originate from Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NIPS 2013) (https://code.google.com/archive/p/word2vec/).
The sense embeddings models contain all 206,941 senses from WordNet 3.0, accessible through their sense key (e.g. "sense%1:10:00::").
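A WordNet sense key such as "sense%1:10:00::" follows the fixed pattern lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type 1 = noun, 2 = verb, 3 = adjective, 4 = adverb, 5 = adjective satellite. As an illustrative sketch (the helper name and returned field names are mine, not part of the dataset), such a key can be split into its parts:

```python
# Synset types as defined in the WordNet sense key documentation.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}

def parse_sense_key(key):
    """Split a WordNet sense key like 'sense%1:10:00::' into its components."""
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[int(ss_type)],       # part of speech of the synset
        "lex_filenum": int(lex_filenum),     # lexicographer file number
        "lex_id": int(lex_id),               # distinguishes senses within that file
        "head_word": head_word,              # only set for adjective satellites
        "head_id": head_id,
    }
```

For the example key above, parse_sense_key yields the lemma "sense" with a noun part of speech, lexicographer file 10, and empty head fields (the key is not an adjective satellite).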
The models use the same binary format as typical word embedding models; for instance, the tools and scripts from the word2vec toolkit (https://github.com/dav/word2vec) can parse them.
In more detail, the format consists of:
1) A string representation of an integer denoting the total number of vectors in the model, followed by a space character (hexadecimal value 0x20).
2) A string representation of an integer denoting the number of dimensions of each vector, followed by a newline character (hexadecimal value 0x0A).
3) For each vector:
3a) A string denoting the word or sense, followed by a space character (hexadecimal value 0x20).
3b) The components of the vector, each encoded as a 32-bit float.
File sizes (in bytes):
- baroni_c_combined.bin: 997 124 596
- baroni_p_combined.bin: 799 099 604
- deps_combined.bin: 452 379 830
- glove_combined.bin: 2 511 255 694
- word2vec_combined.bin: 3 805 853 896
- baroni_c_senses.bin: 408 686 584
- baroni_p_senses.bin: 327 847 131
- deps_senses.bin: 247 007 678
- glove_senses.bin: 247 007 678
- word2vec_senses.bin: 247 007 678
- baroni_c_words.bin: 588 438 022
- baroni_p_words.bin: 471 252 484
- deps_words.bin: 205 372 163
- glove_words.bin: 2 264 248 027
- word2vec_words.bin: 3 558 846 213
Subjects: Computer Science, Linguistics, Mathematics
Keywords: sense embeddings, sense vectors, word sense disambiguation
Corresponding tasks: word sense disambiguation
Encoding data format: word2vec binary format
Vial L., Lecouteux B., Schwab D. (2017). Sense Embedding Models, companion datasets for the IWCS 2017 publication "Sense Embeddings in Knowledge-Based Word Sense Disambiguation". [dataset], doi:10.18709/PERSCIDO.2017.10.DS117. Published 2017 via Perscido-Grenoble-Alpes.