DreamsLab WP1

Repository for datasets, models, and publications

Datasets

ID	Dataset	URL	Languages	Annotation	Size
D1.1	Included in D3.2	github.com/cltl/grounding-toxicity	English, German, Spanish, Dutch, Turkish, Arabic	Target spans, Target category, Toxic reasoning	24 threads, 125 comments
D2.1	HateXplain Target Spans	github.com/cltl/Target-Spans-Detection	English	Target spans	3,480 comments
D1.2	Reddit data from banned subreddits	Refer to the drive link	English	Toxicity	1.3 million comments
D1.3	Test set selected from D1.2 that reflects inappropriate language spans	github.com/cltl/Target-Spans-Detection	English	Inappropriate language spans, Target spans and category	498 subthreads, 4-5 comments per subthread (ca. 20K comments)
D1.4	Test set from D1.2 that reflects inappropriate and target spans with types	github.com/cltl/Target-Spans-Detection	English	Inappropriate language spans, Target spans and category	498 subthreads, 4-5 comments per subthread (ca. 20K comments)
D3.1	Topic-based comment threads from Reddit communities	github.com/cltl/Reddit_topic_toxicity	English, German, Spanish, Dutch, Turkish, Arabic	Topic, Toxicity	1.5 million comments
D3.2	Comments and threads related to real-world social and political events	github.com/cltl/grounding-toxicity	English, German, Spanish, Dutch, Turkish, Arabic	World events, Toxicity, Sentiment, Emotions	4.5 million comments
D3.4	Toxic reasoning dynamic context data on sample from D3.2 - Expert annotations	Pre-release	English & Dutch	Toxic reasoning	921
D3.5	Toxic reasoning dynamic context data on sample from D3.2 - Student annotations	Pre-release	German, Spanish, Turkish, Arabic	Toxic reasoning	1957
D3.6	Toxic reasoning dynamic context data on subset of D3.2 - ChatGPT annotations (train split)	Pre-release	English, German, Spanish, Dutch, Turkish, Arabic	Toxic reasoning	127,472
D6	RefNews-12: news articles	github.com/cltl/refnews	English	Topics and entities	106,167 documents
D7	RADD-Wikidata-5-EN	github.com/cltl/exploiting-ambiguity	English	Ambiguity (De Dicto / De Re)	500 sentence pairs

Models

ID	Model	URL	Languages	Input	Output
M2.1	Target Span Detection	github.com/cltl/Target-Spans-Detection	English	Comment	0: not part of the target span / 1: beginning of the target span / 2: inside the target span
M2	Target Span Detection	github.com/sybmo/MA_thesis	English	Comment	0: not part of the target span / 1: beginning of the target span / 2: inside the target span
M3.2	Lexicon-based toxicity scores	github.com/cltl/Reddit_topic_toxicity	English, German, Spanish, Dutch, Turkish, Arabic	comment w/o context	toxicity between 0 and 1
M3.2	Lexicon-based toxicity, sentiment and emotion scores	github.com/cltl/grounding-toxicity/tree/main	English, German, Spanish, Dutch, Turkish, Arabic	comment w/o context	toxicity, sentiment and emotion scores between 0 and
_	Probing the representations of named entities in Transformer-based Language Models	github.com/cltl/entity-news	English	News articles	Topic classifications
_	Reasoning about Ambiguous Definite Descriptions (pre-trained only; no fine-tuned models)	github.com/cltl/exploiting-ambiguity	English	Questions about ambiguous statements	Predictions and explanations
M6	Context models	Being trained at Huawei office right now	English	Comments in context	Message-level toxicity
M7	Cross-domain toxic spans	github.com/sfschouten/toxic-cross-domain	English	Comment	Toxic spans
_	A WordNet View on Crosslingual Contextualized Language Models	github.com/cltl/probing-cross-linqual-model	English, German, Dutch
M3.1	The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages	github.com/cltl/Reddit_topic_toxicity/tree/main	English, German, Spanish, Dutch, Turkish, Arabic	Comment	Toxicity score
M3.2	Grounding Toxicity in Real-World Events across Languages	github.com/cltl/grounding-toxicity	English, German, Spanish, Dutch, Turkish, Arabic	Comment	Toxicity, sentiment and emotion scores
-	Fine-tuning various models on ChatGPT's and experts' toxic reasoning annotations	In progress	English, German, Spanish, Dutch, Turkish, Arabic	Comments in context	Toxic reasoning schema
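
The Target Span Detection models (M2.1, M2) are listed with the output labels 0 (not part of the target span), 1 (beginning of the target span), and 2 (inside the target span). The sketch below shows one way such labels could be grouped back into spans. It assumes one label per token and a whitespace-joined span text; the function name `decode_target_spans` and the example tokens are hypothetical and do not come from the linked repositories.

```python
from typing import List, Tuple


def decode_target_spans(tokens: List[str], labels: List[int]) -> List[Tuple[int, int, str]]:
    """Group per-token labels into (start, end_exclusive, text) spans.

    Label scheme as listed in the Models table:
    0 = not part of the target span, 1 = beginning, 2 = inside.
    Assumption: tokens and labels are aligned one-to-one.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index of the first token of the span currently open

    for i, label in enumerate(labels):
        if label == 1:
            # A "beginning" label closes any open span and opens a new one.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif label == 2:
            # Lenient choice: an "inside" label with no open span starts one.
            if start is None:
                start = i
        elif label == 0:
            # An "outside" label closes any open span.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
        else:
            raise ValueError(f"unexpected label {label!r} at position {i}")

    # Close a span that runs to the end of the comment.
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


if __name__ == "__main__":
    example_tokens = ["those", "people", "ruin", "the", "whole", "city"]
    example_labels = [1, 2, 0, 0, 0, 0]
    print(decode_target_spans(example_tokens, example_labels))
```

Treating a stray "inside" label as the start of a span is a design choice of this sketch; a stricter decoder could discard such tokens instead.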

Publications

ID	Paper	URL	Languages
P1.3	Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language	aclanthology.org/2024.trac-1.11.pdf	English, German, Spanish, Dutch, Turkish, Arabic
P1.4	Assessing and Refining ChatGPT’s Performance in Identifying Targeting and Inappropriate Language: A Comparative Study	Under review	English
P1.1	SeqL at SemEval-2022 Task 11: An Ensemble of Transformer Based Models for Complex Named Entity Recognition Task	aclanthology.org/2022.semeval-1.218	11 Languages
P1.1	Unknown Script: Impact of Script on Cross-Lingual Transfer	aclanthology.org/2024.naacl-srw	English, Arabic, Amharic
P2.1	Cross-domain toxic span detection	cross-domain-toxic-spans-detection	English
P2.2	Annotating Targets of Toxic Language at the Span Level	aclanthology.org/2022.trac-1.6	English
P2.3	The Role of Context in Detecting the Target of Hate Speech	aclanthology.org/2022.trac-1.5	Dutch
P2.4	Technical report on the role of discourse context for toxicity classification		English
P3.1	The Constant in HATE: Patterns of Toxicity in Reddit across Topics and Languages	aclanthology.org/2024.trac-1.1	English, German, Spanish, Dutch, Turkish, Arabic
P3.2	Grounding Toxicity in Real-World Events across Languages	arxiv.org/pdf/2405.13754	English, German, Spanish, Dutch, Turkish, Arabic
P3.3	Reasoning about Ambiguous Definite Descriptions	aclanthology.org/2023.findings-emnlp.296	English
P3.4	Toxic Reasoning on implicit hate speech	In progress	English
P3.4	Probing the representations of named entities in Transformer-based Language Models	aclanthology.org/2022.blackboxnlp-1.32	English
_	A WordNet View on Crosslingual Contextualized Language Models	aclanthology.org/2023.gwc-1.2	English, German, Dutch
_	Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting	aclanthology.org/2023.mmnlg-1.1	Amharic, Dutch, English, German, Hindi, and Spanish
_	Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study	Under review	English