Semantic property space

This repository contains preliminary code to investigate to what extent word embedding vectors contain information about semantic properties. The underlying assumption is that word embedding dimensions can pick up information about semantic properties from co-occurrence patterns in natural language. We test this by means of a binary supervised classification task in which embedding vectors are used as input to a supervised classifier that predicts whether a word has a specific semantic property or not. For training and testing, we make use of the CSLB semantic property norm dataset collected by Devereux et al. (2014), together with our own crowdsourced extensions to it. We compare the results to the performance of a simple nearest-neighbor approach based on full-vector cosine similarity. The full experimental set-up and dataset extensions are described in the following paper:
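The set-up described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the embeddings are random toy vectors and the `is_a_fruit` property labels are invented for the example.

```python
# Illustrative sketch: predict whether a word has a semantic property,
# using its embedding vector as the input representation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for pre-trained embeddings: word -> 50-dim vector.
vocab = ["apple", "pear", "banana", "hammer", "screwdriver", "wrench"]
embeddings = {w: rng.normal(size=50) for w in vocab}

# Toy binary labels for a hypothetical property "is_a_fruit".
labels = {"apple": 1, "pear": 1, "banana": 1,
          "hammer": 0, "screwdriver": 0, "wrench": 0}

X = np.array([embeddings[w] for w in vocab])
y = np.array([labels[w] for w in vocab])

# Supervised classifier: embedding in, property yes/no out.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))

# Baseline: label a word with the property of its most similar
# training word under full-vector cosine similarity.
def nearest_neighbor_label(word):
    v = embeddings[word]
    sims = {w: np.dot(v, embeddings[w])
            / (np.linalg.norm(v) * np.linalg.norm(embeddings[w]))
            for w in vocab if w != word}
    return labels[max(sims, key=sims.get)]

print(nearest_neighbor_label("apple"))
```

In the actual experiments the toy vectors would be replaced by pre-trained embeddings (e.g. word2vec Google News vectors) and the labels by the CSLB property norms.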

[insert reference]

If you make use of the data and/or annotations described in this paper, please also cite the creators of the CSLB dataset:

Devereux, B.J., Tyler, L.K., Geertzen, J., Randall, B. (2014). The Centre for Speech, Language and the Brain (CSLB) Concept Property Norms. Behavior Research Methods, 46(4), pp 1119-1127. DOI: 10.3758/s13428-013-0420-4.

If you have questions, please contact Pia Sommerauer (pia.sommerauer@live.com) or Antske Fokkens (antske.fokkens@vu.nl). The documentation is still in progress.

Experiments:

Data

We used the semantic property norms collected by Devereux et al. (2014) and extended the dataset via a crowdsourcing task. The original data can be downloaded at https://cslb.psychol.cam.ac.uk/propnorms. Our extension of the dataset (including intermediate steps and decisions) will be made available at (insert link).

Instructions:

Hypotheses

Hypotheses about specific semantic properties formulated by the authors (independently from each other in the first stage and combined in a second stage) can be found in hypotheses/.

Results

Predictions are written to results/[model_name]/[experiment_name]/[parameters]/[feature].txt

e.g.
results/word2vec_google_news/nearest_neighbors/100/fruit_test.txt
results/word2vec_google_news/logistic_regression/default/fruit_test.txt
results/word2vec_google_news/neural_net/default/fruit_test.txt
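The path scheme above can be captured in a small helper. This is a hypothetical convenience function mirroring the described layout, not a function from the repository:

```python
# Build a results path following the scheme
# results/[model_name]/[experiment_name]/[parameters]/[feature].txt
from pathlib import Path

def results_path(model_name, experiment_name, parameters, feature):
    return Path("results") / model_name / experiment_name / str(parameters) / f"{feature}.txt"

print(results_path("word2vec_google_news", "logistic_regression", "default", "fruit_test"))
# -> results/word2vec_google_news/logistic_regression/default/fruit_test.txt
```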

Evaluation

Supported models

Instructions for feature implication annotation:

Files: cslb_data/extracted_cslb/feature_implications_annotated_antkse.csv, cslb_data/extracted_cslb/feature_implications_annotated_pia.csv

Annotate:

Direction of implications: the feature in the column implies the feature in the row. We want to enlarge the datasets of the row features by adding concepts associated with the column features. We also want to compile better sets of negative examples by using concepts associated with features that we know cannot apply to the positive examples of the target feature.

Ask yourself: Is something that is [column] also [row]? Answer with: always, sometimes, never, or does not apply.

Our target features for further experiments are listed in the rows.
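The use of the implication annotations described above can be sketched as follows. The feature names, concept sets, and the `expand_examples` helper are all hypothetical; only the annotation values (always/sometimes/never/does not apply) and the row/column semantics come from the instructions:

```python
# Sketch: an "always" implication (column implies row) lets us add the
# column feature's concepts to the row feature's positive examples;
# a "never" implication lets us use them as reliable negative examples.

# implications[row][column] = annotation value
implications = {
    "is_edible": {"is_a_fruit": "always", "is_a_tool": "never"},
}

# Concepts associated with each feature (toy data).
concepts = {
    "is_a_fruit": {"apple", "pear"},
    "is_a_tool": {"hammer", "wrench"},
}

def expand_examples(target_feature):
    positives, negatives = set(), set()
    for column, answer in implications[target_feature].items():
        if answer == "always":
            positives |= concepts[column]   # safe extra positives
        elif answer == "never":
            negatives |= concepts[column]   # safe negatives
    return positives, negatives

print(expand_examples("is_edible"))
# -> ({'apple', 'pear'}, {'hammer', 'wrench'})
```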

Crowdsourcing set-up

(to be filled in)