ARTIFICIAL INTELLIGENCE – DEEP LEARNING – TEXT REPRESENTATIONS IN NLP (1): ONE-HOT ENCODING

We are going to explore input representations: the process of transforming a text corpus into the different input formats a machine learning model expects. In this article we study one-hot encoding.

ONE-HOT ENCODING

When we use one-hot encoding on text:

  1. We create a vocabulary (all unique words).
  2. Each word gets an index.
  3. Each word becomes a vector of size m (vocabulary size).
  4. A sentence becomes a matrix.

[Image: one-hot encoding of text. Source: https://www.kdnuggets.com/2019/10/introduction-natural-language-processing.html]

Each sentence becomes a matrix of shape:

(n, m)

Where:

  • n = number of words (tokens) in the sentence.
  • m = size of the vocabulary.

Different sentences → different numbers of words → different n.

For example, with a vocabulary of m = 9 words:

  • Sentence 1 (4 words) → (4, 9)
  • Sentence 2 (7 words) → (7, 9)

The shapes are different → this is the problem for ML models.

Example sentences:

s1: «I love AI»
s2: «AI is very powerful»

Vocabulary

Unique words:

{«ai», «i», «is», «love», «powerful», «very»}

Assign indices (alphabetical order):

vocabulary = {
«ai»: 0,
«i»: 1,
«is»: 2,
«love»: 3,
«powerful»: 4,
«very»: 5
}

Vocabulary size = 6 (m = 6)
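The vocabulary construction above can be reproduced in a few lines of Python (a minimal sketch: lower-casing and whitespace splitting stand in for real tokenization):

```python
# Build a vocabulary from the two example sentences.
sentences = ["I love AI", "AI is very powerful"]

# Tokenize: lower-case and split on whitespace.
tokens = [s.lower().split() for s in sentences]

# Collect the unique words and assign indices in alphabetical order.
unique_words = sorted({w for sent in tokens for w in sent})
vocabulary = {w: i for i, w in enumerate(unique_words)}

print(vocabulary)       # {'ai': 0, 'i': 1, 'is': 2, 'love': 3, 'powerful': 4, 'very': 5}
print(len(vocabulary))  # 6  →  m = 6
```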

One-hot vectors

Each word becomes a vector of length 6:

«ai»        [1,0,0,0,0,0]
«i»         [0,1,0,0,0,0]
«is»        [0,0,1,0,0,0]
«love»      [0,0,0,1,0,0]
«powerful»  [0,0,0,0,1,0]
«very»      [0,0,0,0,0,1]
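These vectors can be produced with a small helper function (a self-contained sketch; the vocabulary is the one built above):

```python
vocabulary = {"ai": 0, "i": 1, "is": 2, "love": 3, "powerful": 4, "very": 5}

def one_hot(word, vocab):
    """Return a vector of len(vocab) zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

print(one_hot("love", vocabulary))  # [0, 0, 0, 1, 0, 0]
print(one_hot("ai", vocabulary))    # [1, 0, 0, 0, 0, 0]
```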

Sentence matrices

Sentence 1: «I love AI»

Words:

[«i», «love», «ai»]

Matrix:

[
[0,1,0,0,0,0],  «i»
[0,0,0,1,0,0],  «love»
[1,0,0,0,0,0]   «ai»
]

Shape:

(3, 6)

Sentence 2: «AI is very powerful»

Words:

[«ai», «is», «very», «powerful»]

Matrix:

[
[1,0,0,0,0,0],  «ai»
[0,0,1,0,0,0],  «is»
[0,0,0,0,0,1],  «very»
[0,0,0,0,1,0]   «powerful»
]

Shape:

(4, 6)
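Both matrices, and their (n, m) shapes, can be reproduced by stacking one one-hot row per token (a minimal sketch using plain lists):

```python
vocabulary = {"ai": 0, "i": 1, "is": 2, "love": 3, "powerful": 4, "very": 5}

def one_hot(word, vocab):
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

def sentence_matrix(sentence, vocab):
    """One row per token → a matrix of shape (n, m)."""
    return [one_hot(w, vocab) for w in sentence.lower().split()]

m1 = sentence_matrix("I love AI", vocabulary)
m2 = sentence_matrix("AI is very powerful", vocabulary)

print(len(m1), len(m1[0]))  # 3 6  →  shape (3, 6)
print(len(m2), len(m2[0]))  # 4 6  →  shape (4, 6)
```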

When we build a vocabulary from text, we include every unique word we see.

If we keep adding more text (a bigger training dataset), we keep discovering new words, so the vocabulary keeps growing.

The problem is that, with one-hot encoding, each word is represented by a vector whose length equals the vocabulary size:

  • If the vocabulary has 1,000 words → vectors have length 1,000.
  • If it grows to 100,000 words → vectors have length 100,000.

So the vectors become very long.

Another problem is sparsity.

In one-hot vectors:

  • Almost everything is 0
  • Only one position is 1

Example:

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …]

This means the data is sparse (mostly zeros).

Problems with that:

  • Wastes memory.
  • Slows down computation.
  • Harder for models to learn efficiently.
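The scale of the waste is easy to quantify (a toy calculation; the word index 42 is just a hypothetical example):

```python
m = 100_000       # vocabulary size
vector = [0] * m  # one-hot vector for some word...
vector[42] = 1    # ...at a hypothetical index

zeros = vector.count(0)
print(zeros)      # 99999
print(zeros / m)  # 0.99999 → 99.999% of the storage carries no information
```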

A common solution is to use characters instead of words.

Instead of treating whole words as tokens, we use characters:

  • Words: «this», «sentence», «number» → thousands of possibilities.
  • Characters: «a»–«z» → only about 26 (plus punctuation).

Much smaller vocabulary!

Using characters is better because:

  • Vocabulary size stays small.
  • Vectors are shorter.
  • Less sparsity.
  • More efficient.
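A character-level vocabulary fits in two lines (a sketch restricted to the 26 lower-case letters; punctuation and digits could be added the same way):

```python
import string

# Character-level vocabulary: 'a' → 0, 'b' → 1, ..., 'z' → 25.
char_vocab = {c: i for i, c in enumerate(string.ascii_lowercase)}
print(len(char_vocab))  # 26

def one_hot_char(ch, vocab):
    vector = [0] * len(vocab)
    vector[vocab[ch]] = 1
    return vector

# The word "ai" becomes a tiny (2, 26) matrix; no word-level vocabulary needed.
matrix = [one_hot_char(c, char_vocab) for c in "ai"]
print(len(matrix), len(matrix[0]))  # 2 26
```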

When we convert text into numbers (for Machine Learning), we usually turn each sentence into a matrix.

  • Each row = one word (or character).
  • So, the number of rows = the length of the sentence.

Not all sentences have the same length:

  • “This is short” → 3 words → 3 rows.
  • “This is a much longer sentence” → 6 words → 6 rows.

So their matrices look like:

Sentence 1 → shape (3 rows × vector size)
Sentence 2 → shape (6 rows × vector size)

That means: different shapes.

Most Machine Learning models (like standard neural networks) expect:

All inputs to have the same shape.

Think of it like a factory machine:

  • It expects boxes of the same size.
  • If one box is bigger or smaller → the machine breaks or fails.

So:

  • Different sentence lengths → different matrix sizes.
  • The model gets confused → there is a problem.

RNNs (Recurrent Neural Networks) are special because:

They read data step by step (one word at a time). Instead of processing the whole sentence at once, they do:

word 1 → word 2 → word 3 → …

So:

  • They can handle different sentence lengths.
  • They don’t need fixed-size inputs in the same way.

But there is still a limitation:

Even with RNNs:

When we train models, we use batches (groups of sentences).

Example:

Batch = 3 sentences

For efficiency, all sentences in the same batch must still have: the same shape.

We usually fix this by:

  • Padding (adding zeros to shorter sentences).
  • or Truncating (cutting longer sentences).

Example:

«Hi» → [Hi, PAD, PAD]
«Hello world» → [Hello, world, PAD]

Now all sentences have the same length.
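Padding a batch to the length of its longest sentence can be sketched like this (token lists instead of matrices, to keep it readable):

```python
PAD = "PAD"

def pad_batch(batch):
    """Pad every sentence in the batch to the length of the longest one."""
    max_len = max(len(sent) for sent in batch)
    return [sent + [PAD] * (max_len - len(sent)) for sent in batch]

batch = [["Hi"], ["Hello", "world"], ["AI", "is", "powerful"]]
print(pad_batch(batch))
# [['Hi', 'PAD', 'PAD'], ['Hello', 'world', 'PAD'], ['AI', 'is', 'powerful']]
```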

Another way to solve the problem is:

Fix a maximum length (l)

We choose a fixed number, for example:

l = 300 characters

This means: Every sentence must have exactly 300 positions.

Truncating (cutting long sentences)

If a sentence is too long:

«This movie was absolutely amazing and full of unexpected moments…»

We cut it to the first 300 characters:

  • Keep the beginning.
  • Remove the rest.

Padding (extending short sentences)

If a sentence is too short:

«Great movie»

We add empty values (PAD tokens) until it reaches length 300.

Example:

[«Great», «movie», PAD, PAD, PAD, …]
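Truncating and padding to a fixed maximum length combine into one small function (a sketch; l = 5 here instead of 300, so the output stays readable):

```python
PAD = "PAD"

def fit_to_length(tokens, l):
    """Truncate to the first l positions, or pad with PAD tokens up to length l."""
    return tokens[:l] + [PAD] * max(0, l - len(tokens))

# Too short → padded.
print(fit_to_length(["Great", "movie"], 5))
# ['Great', 'movie', 'PAD', 'PAD', 'PAD']

# Too long → truncated, keeping the beginning.
print(fit_to_length(["a", "b", "c", "d", "e", "f", "g"], 5))
# ['a', 'b', 'c', 'd', 'e']
```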

Important note about padding:

Those extra PAD tokens are fake data. We don’t want the model to learn from them.

Masking (in Keras)

In frameworks like Keras: We can mask the padding tokens.

This means:

  • The model ignores PAD values.
  • They do not affect the loss.
  • They do not influence learning.
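In Keras this is typically done with the `Masking` layer or `Embedding(..., mask_zero=True)`. Conceptually, masking just means PAD positions are dropped before the loss is averaged; a minimal pure-Python sketch of that idea (not the Keras implementation):

```python
PAD = "PAD"

def masked_mean_loss(per_token_losses, tokens):
    """Average the loss only over real tokens, ignoring PAD positions."""
    kept = [loss for loss, tok in zip(per_token_losses, tokens) if tok != PAD]
    return sum(kept) / len(kept)

losses = [0.2, 0.4, 9.9, 9.9]            # the PAD positions carry garbage values
tokens = ["Great", "movie", PAD, PAD]
print(masked_mean_loss(losses, tokens))  # ≈ 0.3 — the 9.9 values never enter the loss
```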

For tasks like sentiment analysis (e.g., positive/negative text):

  • Most important information is usually at the beginning.
  • So using the first 300 characters is often enough.

Even so, one-hot encoding is not ideal: it is too sparse and inefficient. Modern models use embeddings instead.


References:

“An Overview for Text Representations in NLP”, by jiawei hu. Published on Medium.

“Generalized Language Models”, by Lilian Weng.

© Yolanda Muriel. Licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0).
