DreamsLab WP1

Repository for datasets, models, and publications

Datasets

| ID | Dataset | URL | Languages | Annotation | Size |
| --- | --- | --- | --- | --- | --- |
| D1.1 | Included in D3.2 | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Target spans, target category, toxic reasoning | 24 threads, 125 comments |
| D2.1 | HateXplain Target Spans | github.com/cltl/Target-Spans-Detection | English | Target spans | 3,480 comments |
| D1.2 | Reddit data from banned subreddits | Refer to the drive link | English | Toxicity | 1.3 million comments |
| D1.3 | Test set selected from D1.2 that reflects inappropriate language spans | github.com/cltl/Target-Spans-Detection | English | Inappropriate language spans, target spans and category | 498 subthreads, 4-5 comments per subthread (ca. 20K comments) |
| D1.4 | Test set from D1.2 that reflects inappropriate and target spans with types | github.com/cltl/Target-Spans-Detection | English | Inappropriate language spans, target spans and category | 498 subthreads, 4-5 comments per subthread (ca. 20K comments) |
| D3.1 | Topic-based comment threads from Reddit communities | github.com/cltl/Reddit_topic_toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Topic, toxicity | 1.5 million comments |
| D3.2 | Comments and threads related to real-world social and political events | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | World events, toxicity, sentiment, emotions | 4.5 million comments |
| D3.4 | Toxic reasoning data in English | In progress | English | Toxic reasoning | |
| D3.5 | Toxic reasoning dynamic context data on sample from D3.2 - Expert and ChatGPT | In progress | English, German, Spanish, Dutch, Turkish, Arabic | Toxic reasoning | 1,275 |
| D3.6 | Toxic reasoning dynamic context data on full D3.2 | In progress | English, German, Spanish, Dutch, Turkish, Arabic | Toxic reasoning | 4.5 million |
| D6 | RefNews-12: news articles | github.com/cltl/refnews | English | Topics and entities | 106,167 documents |
| D7 | RADD-Wikidata-5-EN | github.com/cltl/exploiting-ambiguity | English | Ambiguity (De Dicto / De Re) | 500 sentence pairs |

Models

| ID | Model | URL | Languages | Input | Output |
| --- | --- | --- | --- | --- | --- |
| M2.1 | Target Span Detection | github.com/cltl/Target-Spans-Detection | English | Comment | 0: not part of the target span / 1: beginning of the target span / 2: inside the target span |
| M2 | Target Span Detection | github.com/sybmo/MA_thesis | English | Comment | 0: not part of the target span / 1: beginning of the target span / 2: inside the target span |
| M3.2 | Lexicon-based toxicity scores | github.com/cltl/Reddit_topic_toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Comment w/o context | Toxicity between 0 and 1 |
| M3.2 | Lexicon-based toxicity, sentiment and emotion scores | github.com/cltl/grounding-toxicity/tree/main | English, German, Spanish, Dutch, Turkish, Arabic | Comment w/o context | Toxicity, sentiment and emotion scores between 0 and |
| _ | Probing the representations of named entities in Transformer-based Language Models | github.com/cltl/entity-news | English | News articles | Topic classifications |
| _ | Reasoning about Ambiguous Definite Descriptions (pre-trained only, no fine-tuned models) | github.com/cltl/exploiting-ambiguity | English | Questions about ambiguous statements | Predictions and explanations |
| M6 | Context models | In training at the Huawei office | English | Comments in context | Message-level toxicity |
| M7 | Cross-domain toxic spans | github.com/sfschouten/toxic-cross-domain | English | Comment | Toxic spans |
| _ | A WordNet View on Crosslingual Contextualized Language Models | github.com/cltl/probing-cross-linqual-model | English, German, Dutch | | |
| M3.1 | The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages | github.com/cltl/Reddit_topic_toxicity/tree/main | English, German, Spanish, Dutch, Turkish, Arabic | Comment | Toxicity score |
| M3.2 | Grounding Toxicity in Real-World Events across Languages | github.com/cltl/grounding-toxicity | English, German, Spanish, Dutch, Turkish, Arabic | Comment | Toxicity, sentiment and emotion scores |
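
The target span detection models listed above output the labels 0 (not part of the target span), 1 (beginning of the target span), and 2 (inside the target span). As a minimal sketch of how such output might be turned into spans, the snippet below assumes one label per token; that granularity, the helper name `decode_spans`, and the example sentence are illustrative assumptions and are not taken from the linked repositories.

```python
from typing import List, Tuple


def decode_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Convert per-token labels (0/1/2) into (start, end) token spans, end exclusive.

    Hypothetical helper: assumes one label per token, which the table does not state.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == 1:
            # A "beginning" label closes any open span and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # An "inside" label with no open span is treated as a span start (assumption).
            if start is None:
                start = i
        else:
            # Label 0 closes any open span.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Made-up example: the first three tokens form the target span.
tokens = ["the", "new", "neighbours", "are", "loud"]
labels = [1, 2, 2, 0, 0]
spans = decode_spans(labels)
target_texts = [" ".join(tokens[s:e]) for s, e in spans]
```

Here `spans` holds token index pairs and `target_texts` holds the corresponding token strings. Treating a stray 2 as the start of a span is one lenient choice; a stricter decoder could discard such labels instead.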

Publications

| ID | Paper | URL | Languages |
| --- | --- | --- | --- |
| P1.3 | Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language | aclanthology.org/2024.trac-1.11.pdf | English, German, Spanish, Dutch, Turkish, Arabic |
| P1.4 | Unveiling Dynamics of Targeting and Inappropriateness in Online | Under review | English |
| P1.1 | SeqL at SemEval-2022 Task 11: An Ensemble of Transformer Based Models for Complex Named Entity Recognition Task | aclanthology.org/2022.semeval-1.218 | 11 languages |
| P1.1 | Unknown Script: Impact of Script on Cross-Lingual Transfer | aclanthology.org/2024.naacl-srw | English, Arabic, Amharic, English |
| P2.1 | Cross-domain toxic span detection | cross-domain-toxic-spans-detection | English |
| P2.2 | Annotating Targets of Toxic Language at the Span Level | aclanthology.org/2022.trac-1.6 | English |
| P2.3 | The Role of Context in Detecting the Target of Hate Speech | aclanthology.org/2022.trac-1.5 | Dutch |
| P2.4 | Technical report on the role of discourse context for toxicity classification | | English |
| P3.1 | The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages | aclanthology.org/2024.trac-1.1 | English, German, Spanish, Dutch, Turkish, Arabic |
| P3.2 | Grounding Toxicity in Real-World Events across Languages | arxiv.org/pdf/2405.13754 | English, German, Spanish, Dutch, Turkish, Arabic |
| P3.3 | Reasoning about Ambiguous Definite Descriptions | aclanthology.org/2023.findings-emnlp.296 | English |
| P3.4 | Toxic Reasoning on implicit hate speech | In progress | English |
| P3.4 | Probing the representations of named entities in Transformer-based Language Models | aclanthology.org/2022.blackboxnlp-1.32 | English |
| _ | A WordNet View on Crosslingual Contextualized Language Models | aclanthology.org/2023.gwc-1.2 | English, German, Dutch |
| _ | Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting | aclanthology.org/2023.mmnlg-1.1 | Amharic, Dutch, English, German, Hindi, and Spanish |