Semantic property space

This repository contains preliminary code for investigating to what extent word embedding vectors contain information about semantic properties. The underlying assumption is that the dimensions of word embedding vectors can pick up information about some semantic properties from co-occurrence patterns in natural language. We test this with a binary supervised classification task: embedding vectors serve as input to a classifier that predicts whether a word has a specific semantic property. For training and testing, we use the CSLB semantic property norm dataset collected by Devereux et al. (2014) and our own crowdsourced extensions to it. We compare the results to the performance of a simple nearest-neighbor approach based on full-vector cosine similarity. The full experimental set-up and the dataset extensions are described in the following paper:

[insert reference]

If you make use of the data and/or annotations described in this paper, please also refer to the creators of the CSLB dataset:

Devereux, B.J., Tyler, L.K., Geertzen, J., Randall, B. (2014). The Centre for Speech, Language and the Brain (CSLB) Concept Property Norms. Behavior Research Methods, 46(4), pp 1119-1127. DOI: 10.3758/s13428-013-0420-4.

If you have questions, please contact Pia Sommerauer (pia.sommerauer@live.com) or Antske Fokkens (antske.fokkens@vu.nl). The documentation is still in progress.

Experiments:

learn features with logistic regression: logistic_regression.py
learn features with a neural network: neural_net.py
predict a feature via the nearest neighbors of the centroid of its positive examples: nearest_neighbors.py
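
The nearest-neighbor baseline can be illustrated with a minimal sketch. This is not the code in nearest_neighbors.py: the function names, the toy two-dimensional vectors, and the decision to exclude the positive training words from the ranking are all assumptions made for illustration. How the script itself uses the neighbor range arguments is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_to_centroid(embeddings, positive_words, n_neighbors):
    """Return the n_neighbors words closest to the centroid of the positives.

    Assumptions: out-of-vocabulary positives are skipped, and the positive
    words themselves are excluded from the ranking.
    """
    known = [w for w in positive_words if w in embeddings]
    center = centroid([embeddings[w] for w in known])
    candidates = [w for w in embeddings if w not in known]
    ranked = sorted(
        candidates,
        key=lambda w: cosine(embeddings[w], center),
        reverse=True,
    )
    return ranked[:n_neighbors]


# Toy vectors (hypothetical, not taken from any real model).
toy_embeddings = {
    "apple": [0.9, 0.1],
    "pear": [0.8, 0.2],
    "plum": [0.85, 0.15],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}

predicted = nearest_to_centroid(toy_embeddings, ["apple", "pear"], n_neighbors=1)
print(predicted)
```

With real embeddings, the dictionary would be replaced by a lookup into the loaded model, and n_neighbors would be varied over the range passed on the command line.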

Running experiments

(See example script: example_experiments.sh)

python logistic_regression.py [path_to_model] [model_name] [model_type] [feature]

python neural_net.py [path_to_model] [model_name] [model_type] [feature]

python nearest_neighbors.py [path_to_model] [model_name] [model_type] [neighbors_n_begin] [neighbors_n_end] [neighbors_n_step] [feature]

Data

We used the semantic property norms collected by Devereux et al. (2014) and extended the dataset via a crowdsourcing task. The original data can be downloaded at https://cslb.psychol.cam.ac.uk/propnorms. Our extension of the dataset (including intermediate steps and decisions) will be made available at (insert link).

Instructions:

store positive and negative examples in data/
naming convention:
- [feature]-pos.txt; e.g. fruit_test-pos.txt
- [feature]-neg-all.txt; e.g. fruit_test-neg-all.txt
each line should contain a single word
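
As a rough illustration of this file layout, the following sketch reads the two files for one feature. The helper name load_examples and the choice to skip blank lines are assumptions; the experiment scripts may read the files differently. The demo writes two small files to a temporary directory so that it runs without the real data.

```python
import tempfile
from pathlib import Path


def load_examples(feature, data_dir="data"):
    """Read positive and negative example words for one feature.

    Follows the naming convention [feature]-pos.txt and
    [feature]-neg-all.txt with one word per line.
    Blank lines are skipped (an assumption of this sketch).
    """
    def read_words(path):
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]

    data_dir = Path(data_dir)
    positives = read_words(data_dir / f"{feature}-pos.txt")
    negatives = read_words(data_dir / f"{feature}-neg-all.txt")
    return positives, negatives


# Demo with throwaway files instead of the real data/ directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fruit_test-pos.txt").write_text("apple\npear\n", encoding="utf-8")
    Path(tmp, "fruit_test-neg-all.txt").write_text("car\n\ntruck\n", encoding="utf-8")
    positives, negatives = load_examples("fruit_test", data_dir=tmp)

print(positives, negatives)
```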

Hypotheses

Hypotheses about specific semantic properties formulated by the authors (independently from each other in the first stage and combined in a second stage) can be found in hypotheses/.

Results

Predictions are written to results/[model_name]/[experiment_name]/[parameters]/[feature].txt

e.g. results/word2vec_google_news/nearest_neighbors/100/fruit_test.txt

e.g. results/word2vec_google_news/logistic_regression/default/fruit_test.txt

e.g. results/word2vec_google_news/neural_net/default/fruit_test.txt
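
The path pattern above can be assembled as in the sketch below. The function name prediction_path is hypothetical and only mirrors the pattern results/[model_name]/[experiment_name]/[parameters]/[feature].txt; it is not taken from the repository code.

```python
from pathlib import PurePosixPath


def prediction_path(model_name, experiment_name, parameters, feature):
    """Build results/[model_name]/[experiment_name]/[parameters]/[feature].txt."""
    return PurePosixPath(
        "results", model_name, experiment_name, str(parameters), f"{feature}.txt"
    )


path = prediction_path("word2vec_google_news", "nearest_neighbors", 100, "fruit_test")
print(path)
```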

Evaluation

Evaluation is written to evaluation/.
Precision, recall, F1 per feature excluding out-of-vocabulary words
Train/test folds: Leave-one-out cross-validation
Results of the evaluation are written to evaluation/[feature].txt
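
The evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's evaluation code: labels are encoded as 1 (has the property) and 0, the function names are hypothetical, and scores default to 0.0 when a denominator is zero.

```python
def precision_recall_f1(gold, predicted, vocabulary):
    """Compute precision, recall and F1 for one feature.

    gold and predicted map each word to 1 (has the property) or 0.
    Words that are not in vocabulary are excluded before counting.
    """
    words = [w for w in gold if w in vocabulary and w in predicted]
    tp = sum(1 for w in words if gold[w] == 1 and predicted[w] == 1)
    fp = sum(1 for w in words if gold[w] == 0 and predicted[w] == 1)
    fn = sum(1 for w in words if gold[w] == 1 and predicted[w] == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def leave_one_out(words):
    """Yield (training_words, held_out_word) pairs, one fold per word."""
    for i, held_out in enumerate(words):
        yield words[:i] + words[i + 1:], held_out


# Toy labels (hypothetical). "durian" is out of vocabulary and is ignored.
gold = {"apple": 1, "pear": 1, "car": 0, "truck": 0, "durian": 1}
predicted = {"apple": 1, "pear": 0, "car": 1, "truck": 0}
vocabulary = {"apple", "pear", "car", "truck"}

scores = precision_recall_f1(gold, predicted, vocabulary)
folds = list(leave_one_out(["apple", "pear", "car"]))
print(scores, len(folds))
```

In a leave-one-out run, each word is held out once, the classifier is trained on the remaining words, and the held-out predictions are pooled before the scores are computed.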

Supported models

Word2vec skip-gram in .bin format (model_type: ‘w2v’)
Hyperwords models (Levy, Goldberg & Dagan 2015):
- Word2vec skip-gram (model_type: ‘sgns’)
- PPMI (model_type: ‘ppmi’)
- PPMI reduced with SVD (model_type: ‘svd’)

Instructions for feature implication annotation:

Files: cslb_data/extracted_cslb/feature_implications_annotated_antkse.csv, cslb_data/extracted_cslb/feature_implications_annotated_pia.csv

Annotate:

features which always correlate: 1 (e.g. is_a_mammal - is_an_animal)
features which can, but do not necessarily correlate: m (for maybe)
features which are mutually exclusive: 0
not applicable: na

Direction for implications: The feature in the column implies the feature in the row. This serves two purposes. First, we want to enlarge the datasets of the row features by adding concepts associated with the column features. Second, we want to compile better sets of negative examples by using concepts associated with features that we know cannot apply to the positive examples of the target feature.

Ask yourself: Is something that is [column] also [row]? Answer with: always, sometimes, never, does not apply.

Our target features for more experiments are listed in the rows.
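
One way the annotations could feed into the example sets is sketched below. The data structures, the function name expand_examples, and the features is_a_vehicle and is_big are hypothetical; only the annotation codes and the column-implies-row direction come from the instructions above. Reading the annotated CSV files is left out because their exact layout is not described here.

```python
# Annotation codes and the answers they correspond to.
ANNOTATION_CODES = {
    "1": "always",
    "m": "sometimes",
    "0": "never",
    "na": "does not apply",
}


def expand_examples(target_feature, implications, concepts_by_feature):
    """Collect extra positive and negative candidates for a row feature.

    implications maps (column_feature, row_feature) to an annotation code.
    Code "1": the column always implies the row, so concepts of the column
    feature become positive candidates.
    Code "0": the features are mutually exclusive, so concepts of the column
    feature become negative candidates.
    Codes "m" and "na" contribute nothing in this sketch.
    """
    extra_pos, extra_neg = set(), set()
    for (column, row), code in implications.items():
        if row != target_feature:
            continue
        if code == "1":
            extra_pos.update(concepts_by_feature.get(column, []))
        elif code == "0":
            extra_neg.update(concepts_by_feature.get(column, []))
    return sorted(extra_pos), sorted(extra_neg)


# Toy annotations (hypothetical apart from is_a_mammal and is_an_animal).
implications = {
    ("is_a_mammal", "is_an_animal"): "1",
    ("is_a_vehicle", "is_an_animal"): "0",
    ("is_big", "is_an_animal"): "m",
}
concepts_by_feature = {
    "is_a_mammal": ["dog", "cat"],
    "is_a_vehicle": ["car"],
    "is_big": ["elephant", "house"],
}

extra_positives, extra_negatives = expand_examples(
    "is_an_animal", implications, concepts_by_feature
)
print(extra_positives, extra_negatives)
```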

Crowdsourcing set-up

(to be filled in)

semantic_space_navigation