Beats, bars, and bad words: a comparative analysis of profanity detection in code-switched German rap lyrics

Karin Niederreiter

doi:10.25365/thesis.76297

Title (eng)

Beats, bars, and bad words: a comparative analysis of profanity detection in code-switched German rap lyrics

Author

Karin Niederreiter

Advisor

Dagmar Gromann

Assessor

Dagmar Gromann

Abstract (deu)

The growing number of social media users and the increasing use of offensive language and profanity make it ever more important to develop effective automatic detection systems. This master's thesis examines the challenges and advances in automatic profanity detection, particularly in the low-resource, code-mixed domain. To this end, it assesses how effective it is to integrate English colloquial training data annotated at the word level, in addition to German-language data, when fine-tuning XLM-R to improve profanity detection. Furthermore, the results of both models are compared by means of zero-shot domain transfer in a new domain, namely code-switched German rap lyrics. Compared with the monolingually fine-tuned model, the bilingually fine-tuned model achieves 14.82% higher recall, 17.55% higher precision, and a 16.35% higher F1 score. The bilingually fine-tuned model also detects more profanities in both English and German. This master's thesis thus contributes to the further development of methods in multilingual profanity detection. In addition, a unique dataset annotated at the word level was created in the course of this thesis, which represents a valuable resource for further research in this area. WARNING: This master's thesis contains suggestive and offensive language.

Abstract (eng)

The rapid growth of the social media user base has led to a rise in the use of offensive language and profanity, necessitating effective detection mechanisms. This thesis explores the challenges and advancements in automatic profanity detection, focusing on the limitations of existing methodologies, particularly in addressing the low-resource code-switched domain and evolving language dynamics. It investigates whether integrating word-level annotated colloquial English data alongside German data when fine-tuning XLM-R for token classification improves profanity detection performance. Zero-shot domain transfer is then used to compare the monolingual and the bilingual fine-tuned models on previously unseen data, specifically low-resource code-switched German rap lyrics. The bilingual fine-tuned model outperforms the monolingual one across the reported metrics: in relative terms, it achieves approximately 14.82% higher recall, 17.55% higher precision, and a 16.35% higher F1 score in the zero-shot domain transfer setting. The thesis also highlights the bilingual fine-tuned model’s superior ability to recognize profanities in both German and English, as well as its effectiveness in detecting neologisms, a crucial capability given the constantly evolving nature of natural languages. This research contributes to advancing profanity detection methodologies while addressing critical challenges in cross-lingual and code-switched contexts. It also presents a unique word-level annotated dataset, providing a valuable resource for further research in this domain. WARNING: This thesis contains offensive and profane language.
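
The relative figures quoted in the abstract can be read as follows. This is only a minimal sketch: it assumes that precision, recall, and F1 are computed per word for the profane class and that "relative" means the gain divided by the monolingual baseline. The function names and the toy labels are hypothetical and are not taken from the thesis.

```python
from typing import Sequence, Tuple


def prf1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 for the positive (profane) class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical toy labels (1 = profane word, 0 = other); NOT thesis data.
gold = [1, 0, 0, 1, 1, 0, 1, 0]
mono_pred = [1, 0, 1, 0, 0, 0, 1, 0]  # stand-in for the monolingual model
bi_pred = [1, 0, 1, 1, 0, 0, 1, 0]    # stand-in for the bilingual model

mono_p, mono_r, mono_f1 = prf1(gold, mono_pred)
bi_p, bi_r, bi_f1 = prf1(gold, bi_pred)

print(f"relative recall gain:    {relative_gain(mono_r, bi_r):.2f}%")
print(f"relative precision gain: {relative_gain(mono_p, bi_p):.2f}%")
print(f"relative F1 gain:        {relative_gain(mono_f1, bi_f1):.2f}%")
```

Under this reading, a relative gain of 14.82% in recall means the bilingual model's recall is about 1.1482 times the monolingual model's recall, not 14.82 percentage points higher.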

Keywords (deu)

Fine-Tuning, Deep Learning, Transfer Learning, Code-Switching, Vulgar Language, Natural Language Processing, Machine Learning

Keywords (eng)

Pre-Trained Language Models, Fine-Tuning, Profanity Detection, Hate Speech, Code-Switching, Zero-Shot Domain Transfer, Token Classification, Natural Language Processing, Machine Learning

Subject (deu)

Language processing

Subject (deu)

Artificial intelligence

Type (deu)

Master's thesis

Persistent identifier

https://phaidra.univie.ac.at/o:2079475

DOI

10.25365/thesis.76297

URN

urn:nbn:at:at-ubw:1-21804.02757.104269-2

URI

https://utheses.univie.ac.at/detail/72122

Extent (deu)

13, 101 pages : illustrations

Number of pages

117

Study plan

Joint-Masterstudium Multilingual Technologies

UA 066 587

Association (deu)

Zentrum für Translationswissenschaft

Persistent identifier

https://phaidra.univie.ac.at/o:2080697

Citable links

Persistent identifier
https://phaidra.univie.ac.at/o:2079475
Handle
https://hdl.handle.net/11353/10.2079475
DOI
https://doi.org/10.25365/thesis.76297
URN
https://nbn-resolving.org/nbn:at:at-ubw:1-21804.02757.104269-2
Other links

URI
https://utheses.univie.ac.at/detail/72122
Managed by

u:theses
Details

Uploader

Universitätsbibliothek Wien / u:theses

Object type

Container

Created

18.07.2024 03:00:02