Datasets
ID | Dataset | URL | Languages | Annotation | Size |
---|---|---|---|---|---|
D1.1 | Annotated subset included in D3.2 | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Target spans, target category, toxic reasoning | 24 threads, 125 comments |
D2.1 | HateXplain Target Spans | github.com/cltl/Target-Spans-Detection | English | Target spans | 3,480 comments |
D1.2 | Reddit data from banned subreddits | See the drive link | English | Toxicity | 1.3 million comments |
D1.3 | Test set selected from D1.2 that reflects inappropriate language spans | github.com/cltl/Target-Spans-Detection | English | Inappropriate language spans, Target spans and category | 498 subthreads, 4-5 comments per subthread (ca. 20K comments) |
D1.4 | Test set from D1.2 that reflects inappropriate language spans and target spans with types | github.com/cltl/Target-Spans-Detection | English | Inappropriate language spans, Target spans and category | 498 subthreads, 4-5 comments per subthread (ca. 20K comments) |
D3.1 | Topic-based comment threads from Reddit communities | github.com/cltl/Reddit_topic_toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Topic, Toxicity | 1.5 million comments |
D3.2 | Comments and threads related to real-world social and political events (an illustrative record is sketched after this table) | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | World events, Toxicity, Sentiment, Emotions | 4.5 million comments |
D3.4 | Toxic reasoning data in English | In Progress | English | Toxic reasoning | |
D3.5 | Toxic reasoning dynamic context data on sample from D3.2 - Expert and ChatGPT | In Progress | English, German, Spanish, Dutch, Turkish, Arabic | Toxic reasoning | 1275 |
D3.6 | Toxic reasoning dynamic context data on full D3.2 | In Progress | English, German, Spanish, Dutch, Turkish, Arabic | Toxic reasoning | 4.5 million comments |
D6 | RefNews-12: news articles | github.com/cltl/refnews | English | Topics and entities | 106,167 documents |
D7 | RADD-Wikidata-5-EN | github.com/cltl/exploiting-ambiguity | English | Ambiguity (De Dicto / De Re) | 500 sentence pairs |
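The repositories above do not spell out a file format here, so the sketch below is purely illustrative: it shows, as a Python dict with hypothetical field names, what one D3.2-style record could look like given the annotation layers listed in the table (world event, toxicity, sentiment, emotions). The identifiers, value ranges, and event name are assumptions, not the actual schema.

```python
# Hypothetical record layout (not the actual D3.2 schema): one annotated
# comment from an event-related thread, carrying the annotation layers
# named in the table above.
example_comment = {
    "thread_id": "t3_example",           # assumed thread identifier
    "comment_id": "c1",
    "language": "en",
    "event": "example political event",  # world-event grounding (illustrative)
    "text": "This is an example comment.",
    "toxicity": 0.12,                    # lexicon-based score, assumed range [0, 1]
    "sentiment": -0.30,                  # polarity, assumed range [-1, 1]
    "emotions": {"anger": 0.05, "fear": 0.02, "joy": 0.40},
}
```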
Models
ID | Model | URL | Languages | Input | Output |
---|---|---|---|---|---|
M2.1 | Target Span Detection | github.com/cltl/Target-Spans-Detection | English | Comment | 0: not part of the target span / 1: beginning of the target span / 2: inside the target span (illustrated after this table) |
M2 | Target Span Detection | github.com/sybmo/MA_thesis | English | Comment | 0: not part of the target span / 1: beginning of the target span / 2: inside the target span |
M3.2 | Lexicon-based toxicity scores | github.com/cltl/Reddit_topic_toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Comment without context | Toxicity score between 0 and 1 (a scoring sketch follows this table) |
M3.2 | Lexicon-based toxicity, sentiment and emotion scores | github.com/cltl/grounding-toxicity/tree/main | English, German, Spanish, Dutch, Turkish, Arabic | Comment without context | Toxicity, sentiment and emotion scores between 0 and 1 |
_ | Probing the representations of named entities in Transformer-based Language Models | github.com/cltl/entity-news | English | News articles | Topic classifications |
_ | Reasoning about Ambiguous Definite Descriptions (pre-trained models only; no fine-tuned models) | github.com/cltl/exploiting-ambiguity | English | Questions about ambiguous statements | Predictions and explanations |
M6 | Context models | In progress (currently being trained at Huawei) | English | Comments in context | Message-level toxicity |
M7 | Cross-domain toxic spans | github.com/sfschouten/toxic-cross-domain | English | Comment | Toxic spans |
_ | A WordNet View on Crosslingual Contextualized Language Models | github.com/cltl/probing-cross-linqual-model | English, German, Dutch | | |
M3.1 | The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages | github.com/cltl/Reddit_topic_toxicity/tree/main | English, German, Spanish, Dutch, Turkish, Arabic | Comment | Toxicity score |
M3.2 | Grounding Toxicity in Real-World Events across Languages | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Comment | Toxicity, sentiment and emotion scores |
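As a reading aid for the M2 / M2.1 output format, here is a minimal sketch pairing a tokenized comment with the 0/1/2 labels described in the table. Only the label semantics come from the table; the comment, the tokenization, and the helper function are invented for illustration.

```python
# Illustration of the M2 / M2.1 output scheme: one label per token, where
# 0 = not part of the target span, 1 = beginning of the target span,
# 2 = inside the target span. The example comment is made up.
tokens = ["Those", "new", "neighbours", "are", "ruining", "everything"]
labels = [1, 2, 2, 0, 0, 0]  # "Those new neighbours" is the predicted target

def extract_target_spans(tokens, labels):
    """Group consecutive 1/2 labels back into target-span strings."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:                # a new span starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == 2 and current:  # continue the open span
            current.append(token)
        else:                         # label 0 closes any open span
            if current:
                spans.append(" ".join(current))
                current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(extract_target_spans(tokens, labels))  # ['Those new neighbours']
```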
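The lexicon-based scoring behind M3.1 / M3.2 is only summarized above; the snippet below is a minimal sketch of one way such a score stays between 0 and 1 (the share of comment tokens that match a toxicity lexicon). The lexicon entries are placeholders and the exact formula used in the repositories may differ.

```python
# Minimal sketch, not the repositories' actual formula: score a comment as the
# fraction of its tokens found in a placeholder toxicity lexicon, which is
# guaranteed to fall between 0 and 1.
TOXIC_LEXICON = {"idiot", "stupid", "trash"}  # placeholder entries

def lexicon_toxicity(comment: str) -> float:
    tokens = [token.strip(".,!?").lower() for token in comment.split()]
    if not tokens:
        return 0.0
    hits = sum(token in TOXIC_LEXICON for token in tokens)
    return hits / len(tokens)

print(lexicon_toxicity("You are an idiot and your idea is trash."))  # ~0.22
```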
Publications
ID | Paper | URL | Languages |
---|---|---|---|
P1.3 | Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language | aclanthology.org/2024.trac-1.11.pdf | English, German, Spanish, Dutch, Turkish, Arabic |
P1.4 | Unveiling Dynamics of Targeting and Inappropriateness in Online | Under review | English |
P1.1 | SeqL at SemEval-2022 Task 11: An Ensemble of Transformer Based Models for Complex Named Entity Recognition Task | aclanthology.org/2022.semeval-1.218 | 11 Languages |
P1.1 | Unknown Script: Impact of Script on Cross-Lingual Transfer | aclanthology.org/2024.naacl-srw | English, Arabic, Amharic |
P2.1 | Cross-domain toxic span detection | cross-domain-toxic-spans-detection | English |
P2.2 | Annotating Targets of Toxic Language at the Span Level | aclanthology.org/2022.trac-1.6 | English |
P2.3 | The Role of Context in Detecting the Target of Hate Speech | aclanthology.org/2022.trac-1.5 | Dutch |
P2.4 | Technical report on the role of discourse context for toxicity classification | | English |
P3.1 | The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages | aclanthology.org/2024.trac-1.1 | English, German, Spanish, Dutch, Turkish, Arabic |
P3.2 | Grounding Toxicity in Real-World Events across Languages | arxiv.org/pdf/2405.13754 | English, German, Spanish, Dutch, Turkish, Arabic |
P3.3 | Reasoning about Ambiguous Definite Descriptions | aclanthology.org/2023.findings-emnlp.296 | English |
P3.4 | Toxic Reasoning on Implicit Hate Speech | In progress | English |
P3.4 | Probing the representations of named entities in Transformer-based Language Models | aclanthology.org/2022.blackboxnlp-1.32 | English |
_ | A WordNet View on Crosslingual Contextualized Language Models | aclanthology.org/2023.gwc-1.2 | English, German, Dutch |
_ | Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting | aclanthology.org/2023.mmnlg-1.1 | Amharic, Dutch, English, German, Hindi, and Spanish |