Exercise Session 3

In this session, you will learn how to fine-tune SBERT embeddings and use them in downstream tasks. You will also practice applying SBERT in a variety of business scenarios.
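
As a preview of the fine-tuning step, the sketch below shows one common way to fine-tune an SBERT model on labeled sentence pairs with the sentence-transformers library. The checkpoint name, toy pairs, and hyperparameters are illustrative assumptions, not this session's exact setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: start from a small general-purpose SBERT checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy sentence pairs with similarity labels in [0, 1] (hypothetical data).
train_examples = [
    InputExample(texts=["I hate this group", "This group is awful"], label=0.9),
    InputExample(texts=["I hate this group", "What a sunny day"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss suits labeled pairs; other losses fit other forms of supervision.
train_loss = losses.CosineSimilarityLoss(model)

# One epoch is only a smoke test; real runs need more data and more epochs.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# The fine-tuned encoder can then produce embeddings for downstream tasks.
embeddings = model.encode(["an example comment to embed"])
```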

Hate Speech Dataset

In this digital age, online hate speech on social media networks can incite hate-motivated violence and even crimes against particular groups of people. According to FBI statistics, hate-related attacks on specific groups are at a 16-year high [1]. Because of this, there is a growing need to curb hate speech through automatic detection, which also reduces the burden on moderators. Datasets were obtained from Reddit and from Gab, a forum known for white supremacist content, where human-labeled comments are classified as hate speech [2].

Overview

The dataset used for this project consists of tweets labeled as hate_speech, offensive_language, or neither. Each row contains the following fields (a short loading sketch follows the list):

  • count = number of CrowdFlower (CF) users who coded each tweet (minimum of 3; sometimes more users coded a tweet when CF determined the judgments to be unreliable).
  • hate_speech = number of CF users who judged the tweet to be hate speech.
  • offensive_language = number of CF users who judged the tweet to be offensive.
  • neither = number of CF users who judged the tweet to be neither offensive nor hate speech.
  • class = class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither.
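
For orientation, here is a minimal sketch of loading the dataset with pandas and inspecting the fields described above. The file name labeled_data.csv and the name of the tweet text column are assumptions about how the notebook stores the data.

```python
import pandas as pd

# Hypothetical file name; use the CSV distributed with the session materials.
df = pd.read_csv("labeled_data.csv")

# Columns described above; the tweet text column is assumed to be named "tweet".
print(df[["count", "hate_speech", "offensive_language", "neither", "class"]].head())

# Map the numeric class codes to readable names.
label_names = {0: "hate_speech", 1: "offensive_language", 2: "neither"}
print(df["class"].map(label_names).value_counts())
```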

Tasks

  • First step: Performance comparison of Transformer models and traditional embedding models
  • Second step: SBERT for semantic similarity (patent search using PatentSBERTa); a brief sketch follows this list.
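
A minimal sketch of the patent-search idea from the second step: encode a query and a few candidate abstracts with PatentSBERTa and rank them by cosine similarity. The Hugging Face model id and the toy abstracts below are assumptions for illustration, not the notebook's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face model id for PatentSBERTa; check the notebook for the exact checkpoint.
model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")

# Toy corpus of patent-style abstracts (hypothetical text).
corpus = [
    "A rechargeable lithium-ion battery with a silicon-based anode.",
    "A convolutional neural network for detecting objects in images.",
    "A method for desalinating seawater using reverse osmosis membranes.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "deep learning model for image recognition"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

Exhaustive cosine-similarity search like this is fine for small corpora; larger patent collections would typically call for an approximate nearest-neighbour index.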

Notebooks

Here you will find the notebooks for this session:

Core contents

Classification with various vectorization approaches
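
One plausible shape for the vectorization comparison in this notebook: train the same classifier on TF-IDF features and on SBERT embeddings, then compare held-out accuracy. The texts, labels, and model choices below are placeholders, not the notebook's actual pipeline.

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder texts and class labels (0 = hate speech, 1 = offensive, 2 = neither);
# in the notebook these would come from the labeled tweet dataset.
texts = [
    "I hate those people", "those people disgust me",
    "you are such an idiot", "shut up, loser",
    "have a nice day", "what a lovely morning",
]
labels = [0, 0, 1, 1, 2, 2]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Traditional approach: sparse TF-IDF features.
tfidf = TfidfVectorizer()
clf_tfidf = LogisticRegression(max_iter=1000).fit(tfidf.fit_transform(X_train), y_train)
acc_tfidf = accuracy_score(y_test, clf_tfidf.predict(tfidf.transform(X_test)))

# Transformer approach: dense SBERT embeddings (checkpoint name is an assumption).
sbert = SentenceTransformer("all-MiniLM-L6-v2")
clf_sbert = LogisticRegression(max_iter=1000).fit(sbert.encode(X_train), y_train)
acc_sbert = accuracy_score(y_test, clf_sbert.predict(sbert.encode(X_test)))

print(f"TF-IDF accuracy: {acc_tfidf:.2f} | SBERT accuracy: {acc_sbert:.2f}")
```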

Add-on