Exercise Session 3

In this session, you will learn how to fine-tune SBERT embeddings and use them in downstream tasks. You will also practice applying SBERT in a variety of business scenarios.
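
As a preview of the fine-tuning step, the sketch below shows one common way to fine-tune an SBERT model on labeled sentence pairs with the sentence-transformers library. The checkpoint name, toy pairs, and hyperparameters are illustrative assumptions, not this session's exact setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: start from a small general-purpose SBERT checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy sentence pairs with similarity labels in [0, 1] (hypothetical data).
train_examples = [
    InputExample(texts=["I hate this group", "This group is awful"], label=0.9),
    InputExample(texts=["I hate this group", "What a sunny day"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss suits labeled pairs; other losses fit other forms of supervision.
train_loss = losses.CosineSimilarityLoss(model)

# One epoch is only a smoke test; real runs need more data and more epochs.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# The fine-tuned encoder can then produce embeddings for downstream tasks.
embeddings = model.encode(["an example comment to embed"])
```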

Hate Speech Dataset

In this digital age, online hate speech on social media networks can incite hate-motivated violence and even crimes against particular groups of people. According to FBI statistics, hate-related attacks on specific groups are at a 16-year high [1]. Because of this, there is a growing need to curb hate speech through automatic detection, which also reduces the burden on moderators. Datasets were obtained from Reddit and from Gab, a forum known for white supremacist content, where human-labeled comments are classified as hate speech [2].

Overview

The dataset used for this project consists of tweets labeled as hate_speech, offensive_language, or neither. Each row contains the following fields (a short loading sketch follows the list):

  • count = number of CrowdFlower (CF) users who coded each tweet (minimum of 3; sometimes more users coded a tweet when CF determined the judgments to be unreliable).
  • hate_speech = number of CF users who judged the tweet to be hate speech.
  • offensive_language = number of CF users who judged the tweet to be offensive.
  • neither = number of CF users who judged the tweet to be neither offensive nor hate speech.
  • class = class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither.
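
For orientation, here is a minimal sketch of loading the dataset with pandas and inspecting the fields described above. The file name labeled_data.csv and the name of the tweet text column are assumptions about how the notebook stores the data.

```python
import pandas as pd

# Hypothetical file name; use the CSV distributed with the session materials.
df = pd.read_csv("labeled_data.csv")

# Columns described above; the tweet text column is assumed to be named "tweet".
print(df[["count", "hate_speech", "offensive_language", "neither", "class"]].head())

# Map the numeric class codes to readable names.
label_names = {0: "hate_speech", 1: "offensive_language", 2: "neither"}
print(df["class"].map(label_names).value_counts())
```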

Tasks

  • First step: Performance comparison of Transformer models and traditional embedding models
  • Second step: SBERT for semantic similarity (patent search using PatentSBERTa); a brief sketch follows this list.
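
A minimal sketch of the patent-search idea from the second step: encode a query and a few candidate abstracts with PatentSBERTa and rank them by cosine similarity. The Hugging Face model id and the toy abstracts below are assumptions for illustration, not the notebook's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face model id for PatentSBERTa; check the notebook for the exact checkpoint.
model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")

# Toy corpus of patent-style abstracts (hypothetical text).
corpus = [
    "A rechargeable lithium-ion battery with a silicon-based anode.",
    "A convolutional neural network for detecting objects in images.",
    "A method for desalinating seawater using reverse osmosis membranes.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "deep learning model for image recognition"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

Exhaustive cosine-similarity search like this is fine for small corpora; larger patent collections would typically call for an approximate nearest-neighbour index.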

Notebooks

Here you will find the notebooks for this session:

Core contents

Classification with various vectorization approaches
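
One plausible shape for the vectorization comparison in this notebook: train the same classifier on TF-IDF features and on SBERT embeddings, then compare held-out accuracy. The texts, labels, and model choices below are placeholders, not the notebook's actual pipeline.

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder texts and class labels (0 = hate speech, 1 = offensive, 2 = neither);
# in the notebook these would come from the labeled tweet dataset.
texts = [
    "I hate those people", "those people disgust me",
    "you are such an idiot", "shut up, loser",
    "have a nice day", "what a lovely morning",
]
labels = [0, 0, 1, 1, 2, 2]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Traditional approach: sparse TF-IDF features.
tfidf = TfidfVectorizer()
clf_tfidf = LogisticRegression(max_iter=1000).fit(tfidf.fit_transform(X_train), y_train)
acc_tfidf = accuracy_score(y_test, clf_tfidf.predict(tfidf.transform(X_test)))

# Transformer approach: dense SBERT embeddings (checkpoint name is an assumption).
sbert = SentenceTransformer("all-MiniLM-L6-v2")
clf_sbert = LogisticRegression(max_iter=1000).fit(sbert.encode(X_train), y_train)
acc_sbert = accuracy_score(y_test, clf_sbert.predict(sbert.encode(X_test)))

print(f"TF-IDF accuracy: {acc_tfidf:.2f} | SBERT accuracy: {acc_sbert:.2f}")
```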

Add-on