NOTEBOOKS:
============================================
============================================
Imagine someone who cannot hear and uses sign language to communicate. To understand them, you need to learn the specific signs they use to express words and ideas.
The signs represent the meaning of words, but they aren’t the words themselves—just symbols that convey the message.
Similarly, computers don’t understand human language directly. Instead, we translate text into a format they understand—like sign language for machines. This ‘machine sign language’ is a numerical representation of text, called ‘vectors’ or ‘embeddings.’
Each word, phrase, or sentence is converted into a series of numbers that capture its meaning.
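As a toy illustration (the three-dimensional vectors below are made up by hand, not produced by any real model), words with related meanings end up with vectors that point in similar directions, which we can check with cosine similarity:

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.2],
    "dog": [0.8, 0.2, 0.3],
    "car": [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related meanings
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower: unrelated meanings
```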
============================================
============================================
Something that is easily understood by humans is not always easily understood by computers—and vice versa! Adding an extra word to a sentence can change its main message, but that change is not always preserved when the text is converted into a numerical representation.
Languages are full of synonyms and words that have several meanings. Humans pick the correct meaning from context.
Punctuation marks can change the meaning of the sentence.
============================================
============================================
The difference between analyzing a large corpus vs. multiple short texts
One Big Text or Corpus of Texts:
Many Short Texts:
============================================
==========================================
==========================================
In the next few slides, we will look at different methods used in NLP:
============================================
Bag of Words (BoW): What is it?
Steps:
1. Tokenize the input sentence into individual tokens (words).
2. Remove stop words from the tokenized list if required.
3. Create a dictionary (vocabulary) of each unique word.
4. Represent each text as a vector of word counts or frequencies.
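A minimal sketch of these steps in plain Python (the example sentences and the tiny stop-word list are made up for illustration):

```python
from collections import Counter

texts = [
    "The cat sat on the mat",
    "The dog sat on the log",
]
stop_words = {"the", "on"}  # tiny illustrative stop-word list

# 1-2. Tokenize (lowercase, split on whitespace) and drop stop words.
tokenized = [
    [w for w in text.lower().split() if w not in stop_words]
    for text in texts
]

# 3. Build the vocabulary: one entry per unique word.
vocab = sorted({w for tokens in tokenized for w in tokens})

# 4. Represent each text as a vector of word counts over the vocabulary.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[w] for w in vocab])

print(vocab)    # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```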
============================================
TF-IDF (Term Frequency–Inverse Document Frequency): A statistical measure used to evaluate how important a word is to a document within a corpus.
Term Frequency (TF): Measures how often a word appears in a document.
Inverse Document Frequency (IDF): Measures how common or rare a word is across all documents.
Advantages:
Example: With BoW, a word like "the" might be very frequent, but it is not important. With TF-IDF, words like "science" or "research" in an article about technology get a higher score because they are less frequent across the corpus but more important for the specific document.
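A minimal sketch of TF-IDF in plain Python, using one common formulation (TF = count of the word in the document divided by the document length; IDF = log of the number of documents divided by the number of documents containing the word). The tiny corpus is made up for illustration, and libraries such as scikit-learn apply additional smoothing:

```python
import math
from collections import Counter

docs = [
    "the science of research advances the technology",
    "the cat sat on the mat",
    "the research paper discusses science",
]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# IDF: log(total documents / documents containing the word)
vocab = {w for tokens in tokenized for w in tokens}
idf = {w: math.log(n_docs / sum(w in tokens for tokens in tokenized)) for w in vocab}

def tf_idf(tokens):
    """TF-IDF scores for one document: term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {w: (counts[w] / len(tokens)) * idf[w] for w in counts}

scores = tf_idf(tokenized[2])
# "the" appears in every document, so its score is 0.0;
# words unique to this document, like "paper", score highest.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```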
============================================
Word2Vec: A neural network-based model (a shallow, two-layer network) that transforms words into continuous, dense vectors (embeddings) that capture semantic meaning.
Advantages:
Limitations:
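A minimal sketch of training word embeddings, assuming the gensim library is installed (the toy corpus and parameters are illustrative; real models are trained on far larger corpora):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens. Real training uses millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# vector_size: embedding dimensionality; window: context size; min_count: ignore rare words.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=42)

print(model.wv["cat"][:5])           # first 5 dimensions of the dense vector for "cat"
print(model.wv.most_similar("cat"))  # words whose vectors are closest to "cat"
```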
============================================
| Feature | Word2Vec | TF-IDF | BoW |
|---|---|---|---|
| Representation | Dense vector embeddings | Weighted word frequencies | Word counts |
| Context Awareness | Yes, considers surrounding words | No, treats words independently | No, treats words independently |
| Semantic Meaning | Captures semantic relationships | No, weights words only by importance across the corpus | No, only counts word frequencies |
| Dimensionality | Low (dense vectors) | High (sparse matrix) | High (sparse matrix) |
| Common Use Cases | Similarity, analogy, NLP tasks | Text classification, relevance scoring | Text classification, simple tasks |