Machine Learning Foundations: Part 9 - Using the Sequencing APIs

23 June 2020

In part 8: Introduction to Natural Language Processing, we looked at how you can tokenize words with simple APIs. This allowed you to turn words into numbers or tokens so that they can be more easily represented in a computer's memory. It's the first step in processing language. The next step, which we'll look at in this part, is to turn sentences into sequences of tokens. And we'll explore the tools that make this very simple to do in TensorFlow. So let's get started.

Here's the code that we were looking at last time where our sentences are represented as elements in an array.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)

We use a tokenizer to turn the words in those sentences into numeric tokens, and we can then inspect those tokens by looking at the tokenizer's word_index property. But there are a few differences from last time. First, I've added another sentence to the corpus so that we get a variety of sentence lengths. All of the others were four words long, so I've added one that has seven words. Additionally, I've added a call to tokenizer.texts_to_sequences, which does the hard work of turning the array of sentences into arrays of tokens. As the tokenizer already did the job of tokenizing the words, it's nice that it can handle the sequencing, too. But this is just the first step in what you'll need to do when you're preparing text for NLP.

The results of this, if we print out the word index and the sequences, will look like this.

Word Index =  {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
Sequences =  [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

We have a lot more words now, and we can see how they've been added to the index.

Remember, the word index is ordered by how frequently each word appears in the corpus. So my is the most common word, then love, et cetera.

We can also see that the new words we've added, such as amazing, think, is, and do, are in the index. And the sentences have been encoded into numeric sequences. So, for example, 4, 2, 1, 3 is our first sentence. If you substitute the associated words back in for these numbers, you'll see I love my dog.
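If you want to check a sequence by hand, a minimal sketch like the following inverts the word index and decodes a sequence back into text. The index_word dictionary and decode_sequence helper aren't part of the Tokenizer API; they're just illustrative.

# Invert the word index (token -> word) and decode a sequence back into text
index_word = {index: word for word, index in word_index.items()}

def decode_sequence(seq):
    return ' '.join(index_word[token] for token in seq)

print(decode_sequence(sequences[0]))   # i love my dog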

So think about this in machine learning terms for a moment. All along, we've had data that we've trained neural networks on. And then we showed those neural networks new data with a view to having the network predict what that data is.

In the case of pictures, for example, we had images of horses and humans. And after the network was trained on lots of images of these, we'd show it a new one. And then it would tell us if it thought it was seeing a horse or a human.

With NLP, we take a similar approach. We'll train a network with lots of sentences, and those sentences will be labeled. Later, you'll see a data set of headlines, some of which are sarcastic and some of which are normal. And we'll train a network on those. But think about what happens with a trained neural network when you show it new data. It tries to use what it knows, what it's been trained on, to understand that new data. In the case of words, the network is trained on the words in the corpus. But what happens when you want to show it new data and have it predict from that? The new data will need to be encoded with the same tokens as the training data. And it will have to be sequenced in the same way with the same rules.

So let's go back to our code and see what this might look like. While we haven't trained a neural network yet, we have started to prepare imaginary training data. Using the tokenizer, we were able to get tokens for the words in our corpus and create sequences out of our text.

So imagine now that we've used this to train a network. And we want the network to understand this test text -- I really love my dog and my dog loves my manatee. We should use the same tokenizer that we used on the training set because we want to have the same tokens.

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

For example, the token used for dog should be the same when you train the network and then later on when you're using and testing it. So if we were to encode these sentences into tokens and sequences, we'd get this result.

Test Sequence =  [[4, 2, 1, 3], [1, 3, 1]]

I really love my dog would become 4, 2, 1, 3, which, if you do substitution, would be recognized as I love my dog. Not bad. But my dog loves my manatee would become 1, 3, 1, which is my dog my. And it's really lost all meaning. The network will not be able to understand the sentence reasonably because words like loves and manatee just aren't in the corpus.
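A quick, illustrative way to see which words are the problem is to check each word of the test sentence against the word index (lowercasing first, since that's what the tokenizer does by default):

# Find the words in the test sentence that the tokenizer never saw during fitting
missing = [w for w in 'my dog loves my manatee'.lower().split() if w not in word_index]
print(missing)   # ['loves', 'manatee']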

You might think a simple solution to this would be to add the test data to the training data so that the training data contains words like loves and manatee. But that isn't really feasible. You can't retrain a network every time it encounters a word it hasn't seen before. And you would, of course, end up overfitting, where the network is only able to parse data that it's previously seen. There's no simple solution to this. But there is a simple thing that you can do to begin solving it. We'll see that next.

Here, I've updated the code, and I've added a new option on the tokenizer.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print("\nWord Index = " , word_index)

# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

The oov_token parameter allows you to specify a token for Out Of Vocabulary words. Choose a string you don't expect to see in the corpus, like the "<OOV>" I've used here, and pass it to the tokenizer.

Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Now there's a new entry in the token list, with <OOV> being number one. It'll always be number one regardless of how many times it's used, so the rest of the tokens, from number two onwards, are in order of frequency. And the sequencing of the sentences will use it. So now our two test sentences are encoded as 5, 1, 3, 2, 4 and 2, 4, 1, 2, 1. If you swap the words back in, you'll get I <OOV> love my dog and my dog <OOV> my <OOV>, which is a small step in the right direction.
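You can confirm this decoding with the same inverse-index trick from earlier; again, index_word is just an illustrative helper, not part of the Tokenizer API.

# Decode the test sequences; out-of-vocabulary words now show up as <OOV>
index_word = {index: word for word, index in word_index.items()}
for seq in test_seq:
    print(' '.join(index_word[token] for token in seq))

# i <OOV> love my dog
# my dog <OOV> my <OOV>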

One important thing when training neural networks is to make the input shape of your data uniform. With images, you saw that we resized them all to the same dimensions so that the network could be fed inputs of a consistent size to produce a prediction. With language, we generally have to do the same thing. There are exceptions with something called ragged tensors, but I'm not going to be covering those here.

Given that our sentences are different lengths, we can get them to uniform lengths using something called padding. We'll see that next.

Here's the updated code to handle padding.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)

Let's step through it piece by piece, and we'll explore what's new.

First, we import the pad_sequences API from tf.keras. Now we can pass our sequences to pad_sequences to get back a padded set of sequences. If we print them out, we'll see that the sequences have been padded for us so that they're now all the same length.

Padded Sequences:
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]

They're also no longer lists of comma-separated values, but rows in a single array that can be fed to the network as one tensor. So 5, 3, 2, 4 becomes 0, 0, 0, 5, 3, 2, 4, with a bunch of zeros at the beginning. And our longest sentence, 8, 6, 9, 2, 4, 10, 11, fits in without any padding zeros.

You saw that the first sentence was padded with zeros at the beginning, and that's the default behavior. If you want the zeros at the end instead, you can just pass padding='post'.

The other default behavior is that each padded sequence is the length of the longest sequence, so the longest sequence gets no padding at all. You can change that length using the maxlen parameter.

padded = pad_sequences(sequences, padding='post', maxlen=5)

And here, I've set it to five. You might wonder what happens to the tokens in sequences longer than five. Well, they're going to be truncated. And with the truncating parameter, you can specify whether they're cut off at the end of the sequence using 'post' or at the beginning using 'pre'.
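For example, here's a minimal sketch that pads at the end and also truncates at the end; the printed matrix assumes the same four sentences and <OOV> tokenizer as above.

# Pad short sequences at the end and cut long ones off at the end, capping length at 5
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded)

# [[ 5  3  2  4  0]
#  [ 5  3  2  7  0]
#  [ 6  3  2  4  0]
#  [ 8  6  9  2  4]]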

You've now seen the steps involved, not just in tokenizing all of the words in your sentences, but also in sequencing them into arrays of tokens and using padding to get those arrays into the same shape and size.

Up to now, we've just been using hard-coded sentences to experiment with tokenizing and padding. Before you can train a neural network, you're going to need to read in text from a data source. In the next part, you'll see how to do that, reading thousands of news headlines, tokenizing and sequencing them. After that, you'll be ready to start building your first models that understand language. 

Next: Part 10 - Using NLP to build a sarcasm classifier
