Resource Interoperability between Event-Centric Language Resources

This repository contains an overview of the document intersections of language resources. It is described in the following paper:

Chantal van Son, Oana Inel, Roser Morante, Lora Aroyo and Piek Vossen (2018). Resource Interoperability for Sustainable Benchmarking: The Case of Events. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Language Resources

Currently, the filenames of the following annotated corpora are compared (more corpora may be added in the future):

Note that for all multilingual resources, only the English parts have been taken into account.

For the Penn Treebank and PropBank, I used the ptb package from NLTK to obtain all filenames contained in the corpora. For all other corpora, the analysis is based on the files in the data/filelists folder, which list all files contained in each corpus. Most of these file lists can be downloaded directly from the LDC website (if you are a member of LDC) or are included in the distributions. Some of them were generated by me; others were kindly sent to me by the creators of the corpora.
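To illustrate what comparing filenames across corpora involves, here is a minimal sketch in Python. It is not the code used in this repository: the corpus names, the example filenames, and the `normalize` rule (drop the directory and extension, lowercase the rest) are all assumptions made for illustration. It groups documents by the exact set of corpora they occur in, which is the kind of intersection an UpSet plot displays.

```python
import os
from collections import defaultdict


def normalize(filename):
    """Reduce a path to a bare document ID (assumed convention: no directory, no extension, lowercase)."""
    base = os.path.basename(filename.strip())
    return os.path.splitext(base)[0].lower()


def exclusive_intersections(corpora):
    """Map each frozenset of corpus names to the documents found in exactly those corpora."""
    # First record, for every document ID, which corpora contain it.
    membership = defaultdict(set)
    for corpus, filenames in corpora.items():
        for name in filenames:
            membership[normalize(name)].add(corpus)
    # Then group documents that share the same combination of corpora.
    groups = defaultdict(set)
    for doc, names in membership.items():
        groups[frozenset(names)].add(doc)
    return dict(groups)


# Hypothetical file lists; real ones would be read from data/filelists.
corpora = {
    "CorpusA": ["wsj/00/wsj_0001.mrg", "wsj/00/wsj_0002.mrg", "wsj/00/wsj_0003.mrg"],
    "CorpusB": ["wsj_0002.xml", "wsj_0003.xml", "wsj_0004.xml"],
    "CorpusC": ["WSJ_0003.txt"],
}
result = exclusive_intersections(corpora)

for names, docs in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(names), sorted(docs))
```

Whether two filenames refer to the same document depends on each corpus's naming scheme, so the normalization step would likely need to be adapted per resource.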

Content

This repository contains the following:

UpSet Visualization

The intersections of the corpora can be visualized using UpSet. There are multiple ways to use the UpSet visualization tool:

  1. UpSet Web Version: For an interactive visualization, go to the UpSet Web Version, choose Load Data and paste the following link: https://raw.githubusercontent.com/ChantalvanSon/CorpusComparison/master/data/upset.json
  2. UpSetR Shiny App: For a static visualization, go to the UpSetR Shiny App and upload this document (can be found under data).
  3. UpSetR: Install the UpSetR package in R and create the plot using this document (can be found under data).
  4. pyUpSet: Install pyUpSet in Python to create the plot; this requires some additional pre-processing, and the visualizations are not as polished as those produced by the other methods.
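The kind of pre-processing these tools need can be sketched as follows. This is only an illustration under assumptions: the exact layout of the files under data is not reproduced here, and the corpus names and document IDs are hypothetical. The sketch builds a table with one row per document and one 0/1 column per corpus, which is, as far as I know, the general shape of input that UpSetR works with.

```python
import csv
import io


def membership_rows(corpora):
    """Build a header and rows with one 0/1 membership column per corpus."""
    names = sorted(corpora)
    docs = sorted(set().union(*corpora.values()))
    header = ["document"] + names
    rows = [[doc] + [1 if doc in corpora[n] else 0 for n in names] for doc in docs]
    return header, rows


def to_csv(header, rows):
    """Serialize the membership table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical corpora, each given as a set of normalized document IDs.
corpora = {
    "CorpusA": {"doc1", "doc2"},
    "CorpusB": {"doc2", "doc3"},
}
header, rows = membership_rows(corpora)
table = to_csv(header, rows)
print(table)
```

Writing `table` to a file gives a CSV that can be loaded as a data frame in R or Python before plotting.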
