More is not always better: balancing sense distributions for all-words Word Sense Disambiguation
This package contains all the data and scripts used to run the experiments presented in:
@InProceedings{C16-1330,
author = "Postma, Marten
and Izquierdo Bevia, Ruben
and Vossen, Piek",
title = "More is not always better: balancing sense distributions for all-words Word Sense Disambiguation",
booktitle = "Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers",
year = "2016",
publisher = "The COLING 2016 Organizing Committee",
pages = "3496--3506",
location = "Osaka, Japan",
url = "http://aclweb.org/anthology/C16-1330"
}
It is designed in a way that allows automatic and full reproducibility of the results. There are only 3 steps required:
- Download this repository to your computer (Linux/OS X)
- Run the installation script
- Run the script for reproducing all the experiments
Table of Contents
- 1) Content of this package
- 1.1) Scripts
- 1.2) Data
- 2) Requirements
- 3) Installation
- 4) Identifiers used for the experiments
- 5) Reproducing all the experiments
- 6) Changes to the IMS system
1) Content of this package
This section explains the meaning of each folder or script contained in this package.
1.1) Scripts
- install.sh
- installation script that will download and install all the required material:
- IMS WSD system http://www.comp.nus.edu.sg/~nlp/software.html
- WordNet 3.0 https://wordnet.princeton.edu/wordnet/download
- SemEval2013 task 12 test dataset https://www.cs.york.ac.uk/semeval2013/task12
- run_all_experiments.sh
- Automatically runs and reproduces all the experiments reported in the paper
- train_ims_paralel.sh
- trains the IMS models for a specific experiment, optionally splitting the work across a given number of subprocesses for efficiency
- train_ims.sh
- trains the IMS models for a set of lemmas
- evaluate_semeval2013.sh
- runs the classification for SemEval2013 using a set of trained models and runs the evaluation using the official scorer
- evaluate_mfs_lfs.py
- Takes an instance output file and, by calling the official scorer, computes the figures separately for the test instances where the Most Frequent Sense applies and for the rest
- semeval2013_to_allwords_format.py
- Converts the XML format of the original SemEval2013 dataset to the allwords format expected by the IMS system
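The split reported by evaluate_mfs_lfs.py can be illustrated with a small sketch. It is not the package's script: it does not call the official scorer, it uses plain accuracy, and it assumes that "Most Frequent Sense applies" means the gold sense of the instance is the most frequent sense of its lemma. The `Instance` record and all sense labels are hypothetical.

```python
from collections import namedtuple

# Hypothetical record (not a format used by the package): instance id,
# gold sense, predicted sense, and the most frequent sense of the lemma.
Instance = namedtuple("Instance", "inst_id gold predicted mfs")


def accuracy(instances):
    """Fraction of instances whose predicted sense equals the gold sense."""
    if not instances:
        return None  # accuracy is undefined on an empty subset
    correct = sum(1 for inst in instances if inst.predicted == inst.gold)
    return correct / len(instances)


def split_mfs_lfs(instances):
    """Partition instances by whether the gold sense is the MFS (assumption)."""
    mfs_cases = [inst for inst in instances if inst.gold == inst.mfs]
    lfs_cases = [inst for inst in instances if inst.gold != inst.mfs]
    return mfs_cases, lfs_cases


# Toy data, invented for illustration only.
toy = [
    Instance("t001", "s1", "s1", "s1"),
    Instance("t002", "s1", "s2", "s1"),
    Instance("t003", "s2", "s1", "s1"),
    Instance("t004", "s3", "s3", "s1"),
]
mfs_cases, lfs_cases = split_mfs_lfs(toy)
print("MFS instances:", len(mfs_cases), "accuracy:", accuracy(mfs_cases))
print("LFS instances:", len(lfs_cases), "accuracy:", accuracy(lfs_cases))
```

Reporting the two subsets separately shows whether a system gains on less frequent senses at the cost of the most frequent ones, which a single overall figure hides.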
1.2) Data
- evaluation
- Folder with the output of the evaluation. For each experiment there are two files (EXP_ID is the experiment ID; the IDs are explained later in this document):
- EXP_ID.out → contains the instance output for the experiment (format used by the official SemEval scorer)
- EXP_ID.figures.txt → contains the evaluation figures obtained by the scripts evaluate_semeval2013.sh and evaluate_mfs_lfs.py
- experiments_data.tgz
- Contains one folder per experiment, named after the experiment ID, with all the training data for that experiment. The format is the one used by the IMS system: for each lemma.pos there are two files, one with all the instances and their contexts, and one with the sense keys for those instances
- models.tgz
- Contains the IMS trained models for all the experiments of the paper
- ims_amended
- Amended IMS files (see section below)
- prop_wn30.original.xml
- Original property file for IMS. It will be automatically modified to point to the location of WordNet 3.0, once downloaded.
- sem2013.lemma_pos.list
- List of lemmas (and pos) for the semeval2013 dataset. It is used to split the training in several batches, as the training for each word expert is independent of the rest
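Because each word expert is trained independently, the lemma list can be divided into batches in any way. The sketch below shows one possible round-robin split. It is an illustration, not the logic of train_ims_paralel.sh; the one-entry-per-line file layout and the example entries are assumptions.

```python
def read_lemma_pos(path):
    """Read one entry per line, skipping blanks (file layout is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def split_into_batches(items, num_proc):
    """Distribute items round-robin over at most num_proc batches."""
    if num_proc < 1:
        raise ValueError("num_proc must be at least 1")
    # Never create more batches than items; keep one batch for an empty list.
    num_batches = min(num_proc, len(items)) or 1
    return [items[start::num_batches] for start in range(num_batches)]


# Hypothetical entries; the real list is read from sem2013.lemma_pos.list.
entries = ["bank.n", "run.v", "card.n", "play.v", "area.n"]
for number, batch in enumerate(split_into_batches(entries, 2)):
    print("batch", number, batch)
```

A round-robin split keeps batch sizes within one item of each other, although training time per lemma can still vary with the amount of training data.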
2) Requirements
These are the requirements before running the installation script:
- Python3 → the main scripts developed by us are written in Python (version 3 required)
- NLTK 3.0 library for Python, which is easy to install: http://www.nltk.org/install.html
- Java8 → This is a requirement of the IMS system that we use in our experiments
3) Installation
To install all the required libraries and resources, just run:
$ . install.sh
The script will:
- Unpack the data and models archives (.tgz files, so you need tar)
- Download IMS, copy the amended files in the correct place and recompile IMS to generate an updated jar file.
- Download WordNet 3.0 and SemEval2013 dataset
- Convert the SemEval2013 dataset to the proper IMS format
4) Identifiers used for the experiments
It is important to note that the convention used to name the experiments differs between the paper and this package. In the paper we used numbers from 1 to 11 to simplify the naming of the experiments, while for the folders and files related to these experiments in this package we used more descriptive names. The table below gives the correspondence between the identifiers used in the paper (mainly in the results section and in Table 1, which contains all the results) and those used in this package.
Paper experiment identifier | Package experiment identifier | Overall Accuracy |
---|---|---|
1 | Bs | 65.60 |
2 | Bps | 66.80 |
3 | Bpsw | 68.90 |
4 | Bsw | 69.30 |
5 | BsAp+lfs | 63.20 |
6 | BsApw+lfs | 62.00 |
7 | BsAw+lfs | 67.50 |
8 | Bpsw+gold+1 | 85.40 |
9 | Bpsw+gold+5 | 80.40 |
10 | Bpsw+gold+1+lowest_freq | 86.80 |
11 | Bpsw+gold+5+lowest_freq | 82.00 |
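For scripting, the correspondence can be written as a small lookup. The sketch below is not part of the package; it combines the table with the evaluation/EXP_ID.out and evaluation/EXP_ID.figures.txt naming described in this document. The identifier for experiment 11 is assumed to follow the pattern of experiment 10, without a file extension.

```python
# Paper experiment number -> package experiment identifier (from the table).
PAPER_TO_PACKAGE = {
    1: "Bs",
    2: "Bps",
    3: "Bpsw",
    4: "Bsw",
    5: "BsAp+lfs",
    6: "BsApw+lfs",
    7: "BsAw+lfs",
    8: "Bpsw+gold+1",
    9: "Bpsw+gold+5",
    10: "Bpsw+gold+1+lowest_freq",
    11: "Bpsw+gold+5+lowest_freq",  # assumed: same pattern as experiment 10
}


def evaluation_files(paper_id, root="evaluation"):
    """Return the (.out, .figures.txt) paths for a paper experiment number."""
    exp_id = PAPER_TO_PACKAGE[paper_id]
    return ("{}/{}.out".format(root, exp_id),
            "{}/{}.figures.txt".format(root, exp_id))


out_file, figures_file = evaluation_files(4)
print(out_file)
print(figures_file)
```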
5) Reproducing all the experiments
You need to make sure that the java version that you are using is 1.8, otherwise the IMS system might not work:
$ java -version
java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)
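The same check can be automated before launching a long run. The following is a minimal sketch, not part of the package: the function names are hypothetical, and it relies only on the fact that `java -version` prints its report to standard error with the version string in double quotes.

```python
import re
import subprocess


def parse_java_version(text):
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', text)
    return match.group(1) if match else None


def is_java8(version):
    """Java 8 reports itself with a version string starting with 1.8."""
    return version is not None and version.startswith("1.8.")


def installed_java_version(java="java"):
    """Run the given java executable and return its version string."""
    result = subprocess.run(
        [java, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # the version report goes to stderr
        universal_newlines=True,
    )
    return parse_java_version(result.stderr)


sample = (
    'java version "1.8.0_73"\n'
    "Java(TM) SE Runtime Environment (build 1.8.0_73-b02)\n"
    "Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)\n"
)
print(parse_java_version(sample), is_java8(parse_java_version(sample)))
```

To check a real installation, pass the executable you plan to set in the JAVA variable, for example `is_java8(installed_java_version("path/to/java1.8/java"))`.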
You will need to set the path to your java executable in 2 scripts:
- train_ims.sh
- Line 4 → JAVA="path/to/java1.8/java"
- evaluate_semeval2013.sh
- Line 16 → JAVA="path/to/java1.8/java"
Once you have set the proper paths, you just need to call the script run_all_experiments.sh, for instance:
$ . run_all_experiments.sh > log.out 2> log.err &
To make it faster, you can split the training for each experiment into several subprocesses. Change the value of the variable NUM_PROC (set to 15 by default) in the run_all_experiments.sh script. You can set it to 1 if you do not want any parallelisation.
The script will run the training and evaluation for each experiment (the data used for each experiment is in the folder "experiments_data"). It will save the trained models to the folder models/EXP_ID and the results of the evaluation to the files evaluation/EXP_ID.out and evaluation/EXP_ID.figures.txt.
You can also call the script “train_ims_paralel.sh” to train the classifiers for a single experiment:
$ train_ims_paralel.sh PATH_TO_DATA PATH_TO_FOLDER NUM_PROC
You can also run the classification and evaluation for a single experiment with the script evaluate_semeval2013.sh, providing the path to the model you want to evaluate and the paths to the .out and .figures.txt files where you want to store the results.
$ evaluate_semeval2013.sh PATH_TO_TRAINED_MODEL file.out > file.figures.txt
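The two single-experiment commands can be assembled for any experiment identifier as sketched below. This is an illustration, not the logic of run_all_experiments.sh: it assumes that PATH_TO_DATA is experiments_data/EXP_ID and that PATH_TO_FOLDER is the output folder models/EXP_ID, following the folder layout described above. The function names are hypothetical.

```python
import subprocess


def build_commands(exp_id, num_proc=15):
    """Return the training command, evaluation command and figures file."""
    data_dir = "experiments_data/{}".format(exp_id)   # assumed PATH_TO_DATA
    model_dir = "models/{}".format(exp_id)            # assumed PATH_TO_FOLDER
    out_file = "evaluation/{}.out".format(exp_id)
    figures_file = "evaluation/{}.figures.txt".format(exp_id)
    train = ["train_ims_paralel.sh", data_dir, model_dir, str(num_proc)]
    evaluate = ["evaluate_semeval2013.sh", model_dir, out_file]
    return train, evaluate, figures_file


def run_experiment(exp_id, num_proc=15):
    """Train, then evaluate, redirecting the figures as `> file` would."""
    train, evaluate, figures_file = build_commands(exp_id, num_proc)
    subprocess.run(train, check=True)
    with open(figures_file, "w", encoding="utf-8") as figures:
        subprocess.run(evaluate, stdout=figures, check=True)


# Dry run: only print the commands, nothing is executed.
train_cmd, eval_cmd, figures_path = build_commands("Bsw", num_proc=4)
print(" ".join(train_cmd))
print(" ".join(eval_cmd), ">", figures_path)
```

Building the argument lists separately from running them makes it easy to inspect the commands before starting a training run.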
6) Changes to the IMS system
Under some circumstances, and with an extended training set, IMS raises an exception. To avoid this we had to make a couple of minor changes to the IMS code. The corrected files can be found in the folder "ims_amended_files". The installation script (install.sh) will take care of copying them to the proper place and recompiling IMS. The changes are:
- CLibLinearLexeltWriter.java
- Line 145:
- Original → if (values.get(i).equals(value))
- Modified → if (values.get(i) != null && values.get(i).equals(value))
- CStatistic.java
- Line 481:
- Original → values[i.getValue()] = i.getKey();
- Modified → the statement is surrounded by a try/catch block