Machine Learning Foundations: Part 8 - Tokenization for Natural Language Processing

Alex Alex 18 June 2020
Previous Part 7 - Image augmentation and overfitting

Up to now, you've learned how machine learning works and explored examples in computer vision by doing image classification, including understanding concepts such as convolutional neural networks for feature identification, and image augmentation to avoid overfitting, making your networks that little bit smarter.

We're now going to switch gears, and we'll take a look at natural language processing. In this part, we'll take a look at how a computer can represent language, and that's words and sentences, in a numeric format that can then later be used to train neural networks. This process is called tokenization. So let's get started.

Consider this word. It's the English word LISTEN, and it consists of six letters. We're used to reading it based on the sounds and putting those sounds together to form a word. But how can a computer understand this word? Well, one way, as computers deal better with numbers than they do with letters, is to assign a number to each letter. A common coding format is called ASCII, where common letters and symbols are encoded into the values from 0 to 127, with 8-bit extensions of it using the values up to 255. It's useful in that only one byte is needed to store the value for a letter, but it has been superseded by later encodings such as Unicode in order to give access to characters beyond that range, in particular international characters. But for the purposes of illustration, we can stick with ASCII, where, for example, the letter L is 76, I is 73, and so on. So we now have the word LISTEN encoded into six bytes, one for each letter.
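To make the letter-by-letter encoding concrete, here is a minimal sketch in plain Python. It uses the built-in ord function to look up each letter's code; the variable names are just for illustration.

```python
# Encode each letter of a word as its ASCII code (one value per letter).
word = "LISTEN"
codes = [ord(letter) for letter in word]
print(codes)  # [76, 73, 83, 84, 69, 78]

# The same six values stored as raw bytes, one byte per letter.
encoded = word.encode("ascii")
print(len(encoded))  # 6
```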

This is a perfectly valid encoding, and often when you use neural networks, you'll see character encoding or subword encoding. These make things a little bit more complicated, so in these tutorials I'm going to use word-based encoding, and not the letter-based encoding that we just saw.

Why would I do this? One reason is that if we treat a word as a set of numbers without taking the order of those numbers into account, two words with very different, sometimes almost opposite, meanings can look identical. LISTEN and SILENT, for example, are made up of exactly the same letters. Thus, if we want to use character-based encoding, a computer can't tell the difference between these two words unless we have a sequence model. And that's a little bit more complicated than we need to look into right now.
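The LISTEN and SILENT problem is easy to check for yourself. This small sketch, again in plain Python, compares the two words' letter codes with and without regard to order.

```python
listen = [ord(letter) for letter in "LISTEN"]
silent = [ord(letter) for letter in "SILENT"]

# Ignoring order (comparing the sorted codes), the two words look identical.
print(sorted(listen) == sorted(silent))  # True

# Only the sequence of the codes tells them apart.
print(listen == silent)  # False
```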

So let's consider a different encoding, and that's a word-based one. That way each of these words can be represented by a single number, and each number will be different. There's also a nice hidden advantage to this, which we'll see in a moment.

So consider this sentence: "I love my dog." It's a pretty straightforward one. If I encode based on words, I can come up with an arbitrary encoding. Say the word "I" is number one, and then "love," "my," and "dog" become two, three, and four, respectively. If I were to encode another sentence, for example, "I love my cat," the words "I love my" already have numbers, so I can just use one, two, and three again for them. And I can create a new number for "cat," which I'll say is number five. So now my sentences are 1 2 3 4, and 1 2 3 5. What's interesting here is that now that the words are gone and only the tokens for the words are used, we can begin to tell that there's a similarity between the sentences. So maybe we're beginning to get a glimpse of what it might look like to have sentences turned into numbers, yet maintain some kind of meaning.
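The hand-numbering described above can be sketched in a few lines of plain Python. The helper name encode_sentences is made up for this illustration; it simply gives each new word the next free number in order of first appearance, which is not how the TensorFlow tokenizer assigns its numbers.

```python
def encode_sentences(sentences):
    """Give each new word the next free number, in order of first appearance."""
    word_to_token = {}
    encoded = []
    for sentence in sentences:
        tokens = []
        for word in sentence.split():
            if word not in word_to_token:
                word_to_token[word] = len(word_to_token) + 1
            tokens.append(word_to_token[word])
        encoded.append(tokens)
    return word_to_token, encoded


word_to_token, encoded = encode_sentences(["I love my dog", "I love my cat"])
print(word_to_token)  # {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)        # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

The two encoded sentences differ only in their last number, which is the similarity the text points out.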

The process I just outlined is called tokenization, and it's an inherent part of doing natural language processing, or NLP. TensorFlow gives you APIs that help you to achieve this very simply. We'll take a look at them next.

Here's all the code that you would need to tokenize the sentences I showed earlier.

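from tensorflow.keras.preprocessing.text import Tokenizer

# A tiny corpus: two sentences, five unique words
sentences = [
    'I love my dog',
    'I love my cat',
]

# Cap on the number of words we want to care about
tokenizer = Tokenizer(num_words=100)
# Build the word index from the corpus
tokenizer.fit_on_texts(sentences)
# Dictionary that maps each word to its token
word_index = tokenizer.word_index
print(word_index)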

We can break it down and go through it line by line. The tokenizer tools are part of the TensorFlow Keras libraries, and they're in the preprocessing namespace. So make sure you import these.

I'm going to hard-code the sentences into an array. Now, while this is a super simple corpus -- there are just two sentences and five unique words -- this design pattern can work for much bigger sets of data. You'll soon be working with tens of thousands of sentences with thousands of unique words. And it's all pretty much the same code. So don't worry right now if this looks a little bit too simplistic.

You can then create a tokenizer by creating an instance of the Tokenizer class and initializing it with parameters. One of these is the num_words parameter, which specifies the maximum number of words that you want to care about.

There are only five unique words here, so it doesn't really make a difference. But with larger sets of text, it can. You'll commonly encounter bodies of text with many thousands of unique words in them, and lots of these words may be used only once or twice. By specifying the number of words that you care about in your tokenizer, you get an easy way to filter those out. The tokenizer assigns tokens to words based on how commonly they are used in the corpus. So the most common word will be at index one, the next most common word will be at index two, and so on.
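To see how ranking by frequency makes rare words easy to drop, here is a minimal sketch in plain Python. It is not the tokenizer's actual implementation: rank_words and keep_most_common are hypothetical helpers, the three-sentence corpus is made up, and the cutoff rule (keep indexes below num_words) is my understanding of how the Keras cap behaves, so treat it as an assumption.

```python
from collections import Counter


def rank_words(sentences):
    """Number words from 1 upward, most frequent first (ties keep first-seen order)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def keep_most_common(word_index, num_words):
    """Keep only words whose index is below num_words (assumed cutoff rule)."""
    return {word: i for word, i in word_index.items() if i < num_words}


corpus = ["the cat sat", "the cat ran", "the dog barked"]
word_index = rank_words(corpus)
print(word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4, 'dog': 5, 'barked': 6}
print(keep_most_common(word_index, num_words=3))
# {'the': 1, 'cat': 2}
```

Because the rare words sit at the high end of the index, a single cutoff is enough to drop them. As far as I know, the real tokenizer still lists every word in its word index and applies the cap only later, when text is converted to sequences of tokens.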

To get the tokenizer to do its job, you call fit_on_texts and pass it your corpus of text. In this case, it's our simple array of sentences. To see the word index that the tokenizer created, you can just read the word_index attribute. This gives you a dictionary of key-value pairs where the key is the word, and the value is the token for that word. And then you can just print this to inspect it.

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

When you print it out, don't rely on the order of the entries, but keep an eye on the values. Like I said earlier, the most common words get the lowest indexes. In this case, "I," "love," and "my" each appear twice, while "dog" and "cat" both appear once. So "I," "love," and "my" are the lower indexed words -- one, two, and three -- and "dog" and "cat" are the higher indexed ones, four and five.

So what if we expand our sentences and add some more content, like maybe "You love my Dog!" with an exclamation mark? Note that exclamation mark. The default behavior of the tokenizer is to strip punctuation out. This can be overridden, but we'll keep the default for now. It also makes all of the words lowercase, so "Dog" will be treated in the same way as "dog." The tokens will now look like this.

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Notice that they've moved around a little. "Love" is now the number one token because it's the most-used word, and "my," which is used just as often, follows at number two. Also notice that "dog" lost its exclamation mark. There's only one token in here for "dog," and it represents both usages of the word, despite the exclamation mark being on the second one. And we've added a new word, "you," because it was first used in the new sentence that was added to the corpus.
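If you'd like to convince yourself of where those numbers come from, the following plain Python sketch reproduces the word index above without TensorFlow. It is an approximation, not the tokenizer's own code: normalize is a hypothetical helper, and string.punctuation stands in for the tokenizer's filter list, which I have not matched character for character.

```python
import string
from collections import Counter


def normalize(sentence):
    """Lowercase the text and replace punctuation with spaces before splitting."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return sentence.lower().translate(table).split()


sentences = ["I love my dog", "I love my cat", "You love my Dog!"]
counts = Counter(word for s in sentences for word in normalize(s))

# Most frequent word gets index 1; ties keep the order of first appearance.
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

print(normalize("You love my Dog!"))  # ['you', 'love', 'my', 'dog']
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

"Love" and "my" each occur three times, "I" and "dog" twice, and "cat" and "you" once, which is why the numbering comes out the way it does.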

So I hope you found that pretty straightforward, despite the underlying power in the tokenizer.

In this part, you got your first taste of NLP using tokenization, where you were able to take sentences and have the words encoded into tokens.

Next Part 9 - Using the Sequencing APIs
