In the context of NLP/NER, what are stride and overlapping tokens when chunking texts?
Both stride and overlapping tokens are related to the process of splitting large documents into smaller, more manageable chunks. This is typically done to accommodate the maximum sequence length constraint imposed by models like GPT-3 or BERT (e.g., 512 tokens for BERT).
Stride: This parameter determines how many tokens adjacent chunks share while breaking the document into smaller pieces. (Hugging Face tokenizers use "stride" for this overlap, not for the step size of the sliding window.) For example, if you have a stride of 50 tokens and a maximum sequence length of 200 tokens, the first chunk will consist of tokens 1-200, the second chunk of tokens 151-350, the third chunk of tokens 301-500, and so on, with each chunk repeating the last 50 tokens of the previous one.
"Overlapping tokens when chunking the texts": This concept refers to the fact that when you break a large document into smaller chunks using a stride, you create some overlapping tokens between adjacent chunks. These overlaps help ensure that important information is not lost at the boundaries of the chunks and that the model can better understand the context.
The overlapping tokens help mitigate potential issues arising from splitting sentences or phrases across chunks, which could lead to misinterpretation or loss of context for the NLP model. By having a sufficient overlap between the chunks, you give the model a better chance to identify entities, relationships, or other semantic information that might span across chunk boundaries.
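To make this concrete, here is a minimal word-level chunker; a toy sketch, not a real tokenizer: it treats each whitespace-separated word as one token and follows the Hugging Face convention that the stride is the number of overlapping tokens.

```python
def chunk_tokens(tokens, max_len, stride):
    # Split tokens into windows of max_len tokens, where adjacent
    # windows share `stride` tokens (overlap convention).
    assert 0 <= stride < max_len
    step = max_len - stride                  # how far each new window advances
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):   # last window reached the end
            break
        start += step
    return chunks

tokens = "John Doe works at XYZ Corporation".split()
print(chunk_tokens(tokens, max_len=4, stride=2))
# [['John', 'Doe', 'works', 'at'], ['works', 'at', 'XYZ', 'Corporation']]
```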
Example
The specific need for overlapping tokens and DOC_STRIDE in an NER project is to maintain context and improve the performance of the model in identifying named entities even when they span across the boundaries of the chunks.
Consider an example where you are working on an NER project to extract named entities, such as person names, organization names, and locations from a large document. Let's say the document contains the following text:
"John Doe works at XYZ Corporation. The headquarters of XYZ Corporation is located in New York City."
Assume that the maximum sequence length constraint of your NLP model is 5 tokens (treating each word as one token for simplicity). Without using overlapping tokens and DOC_STRIDE, you might split the text into the following chunks:
"John Doe works at"
"XYZ Corporation. The"
"headquarters of XYZ"
"Corporation is located"
"in New York City."
In this scenario, "John Doe" survives intact in the first chunk, but both occurrences of "XYZ Corporation" are split across chunk boundaries, as is "New York City". This can negatively impact the NER model's ability to recognize these entities, because no single chunk contains a complete mention.
Now, let's introduce overlapping tokens by using a DOC_STRIDE of 2 tokens, so that each new chunk starts 3 tokens (maximum length minus stride) after the previous one. The text will be split into the following chunks:
"John Doe works at XYZ"
"works at XYZ Corporation. The"
"Corporation. The headquarters of"
"headquarters of XYZ Corporation"
"XYZ Corporation is located in"
"is located in New York"
"in New York City."
By using overlapping tokens with a DOC_STRIDE of 2, every named entity now appears in its entirety in at least one chunk: "XYZ Corporation" in the second, third, and fourth chunks, and "New York City" in the last one. This gives the NER model a better chance of recognizing each entity from its full context.
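The same toy chunker reproduces both splits above (again a sketch with whitespace "tokens"; DOC_STRIDE is interpreted, as in Hugging Face, as the number of overlapping tokens):

```python
def chunk_tokens(tokens, max_len, stride):
    step, chunks, start = max_len - stride, [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return chunks

text = ("John Doe works at XYZ Corporation. The headquarters of "
        "XYZ Corporation is located in New York City.")
tokens = text.split()  # 17 word-level "tokens"

for chunk in chunk_tokens(tokens, max_len=5, stride=0):
    print(" ".join(chunk))  # no overlap: "XYZ Corporation" is cut in two
print("---")
for chunk in chunk_tokens(tokens, max_len=5, stride=2):
    print(" ".join(chunk))  # overlap of 2: every entity appears whole somewhere
```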
What is the stride param in the Hugging Face tokenizer?
In Hugging Face tokenizers, the stride parameter controls the overlap between adjacent windows during tokenization.
When a text sequence is longer than the maximum input length allowed by the model, it needs to be divided into smaller segments or windows. These windows are then tokenized separately and fed into the model for inference. The stride parameter determines the number of tokens of overlap between adjacent windows.
For example, suppose you have a text sequence of 100 tokens and you want to tokenize it using a maximum sequence length of 50 and a stride of 25.
The sequence would be split into the following windows:
First window: spans from the 1st token to the 50th token (length 50)
Second window: spans from the 26th token to the 75th token (length 50)
Third window: spans from the 51st token to the 100th token (length 50)
In this case, there would be a total of three windows of length 50 each, with an overlap of 25 tokens (i.e., the stride) between adjacent windows.
By default, the stride is 0, which means adjacent windows share no tokens. You can set it to a positive value (smaller than the maximum sequence length) to control how many tokens adjacent windows have in common.
Increasing the stride increases the overlap between adjacent windows, which preserves more context at window boundaries but also increases the number of windows required to cover the sequence. Decreasing the stride reduces the number of windows, and therefore the computation, but leaves less context around each boundary.
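Here is what this looks like with an actual Hugging Face tokenizer; a minimal sketch where bert-base-cased is just an example checkpoint, and note that special tokens such as [CLS] and [SEP] count toward max_length:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = "a long document " * 100             # any text longer than one window

encoding = tokenizer(
    text,
    max_length=50,                   # window size, including special tokens
    truncation=True,
    stride=25,                       # tokens shared by adjacent windows
    return_overflowing_tokens=True,  # return every window, not just the first
)

print(len(encoding["input_ids"]))    # number of windows produced
for ids in encoding["input_ids"]:
    print(tokenizer.decode(ids))     # adjacent windows repeat 25 tokens
```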
An example of how "increasing the stride will increase the overlap between adjacent windows, which increases the total number of windows required to tokenize the sequence."
Certainly, here’s an example to illustrate how increasing the stride increases the overlap between adjacent windows and the total number of windows required to tokenize a sequence.
Let’s assume you have a text sequence of 100 tokens and you want to tokenize it using a maximum sequence length of 50. Now, consider two different stride values: 0 (no overlap) and 25 (a 25-token overlap).
Stride = 0 (smaller stride)
Windows:
First window: spans from the 1st token to the 50th token (length 50)
Second window: spans from the 51st token to the 100th token (length 50)
In this case, there is no overlap between adjacent windows, and only 2 windows are required to tokenize the sequence.
Stride = 25 (larger stride)
Windows:
First window: spans from the 1st token to the 50th token (length 50)
Second window: spans from the 26th token to the 75th token (length 50)
Third window: spans from the 51st token to the 100th token (length 50)
In this case, there is an overlap of 25 tokens between adjacent windows, and a total of 3 windows are required to tokenize the sequence.
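The window counts follow from simple arithmetic: each window after the first advances by (max length - stride) tokens, so covering n tokens takes roughly 1 + ceil((n - max_len) / (max_len - stride)) windows. A quick sketch, ignoring special tokens:

```python
import math

def num_windows(n_tokens, max_len, stride):
    # First window covers max_len tokens; each later one advances by step.
    if n_tokens <= max_len:
        return 1
    step = max_len - stride
    return 1 + math.ceil((n_tokens - max_len) / step)

print(num_windows(100, 50, stride=0))   # 2 windows, no overlap
print(num_windows(100, 50, stride=25))  # 3 windows, 25-token overlap
print(num_windows(100, 50, 25) * 50)    # 150 tokens processed vs. 100 original
```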
=====================
Impact on performance: The choice of the stride parameter has implications for the model's performance.
With a larger stride value, the model processes more overlapping tokens, which can lead to higher accuracy since the context around window boundaries is better preserved.
However, this also increases the computational cost, as more windows, and therefore more tokens overall, need to be processed. On the other hand, a smaller stride value (down to 0) reduces the computational cost but might result in lower accuracy due to less context preservation.
An example of how "with a larger stride value, the model processes more overlapping tokens, which can lead to higher accuracy since the context is better preserved."
Original sentence:
"John Smith lives in San Francisco, and he works for Microsoft Corporation as a software engineer."
Tokenized sentence (with tokens separated by | ): "|John|Smith|lives|in|San|Francisco|,|and|he|works|for|Microsoft|Corporation|as|a|software|engineer|.|"
Let's break down the calculation for a stride value of 5.
Total tokens: 18
Max tokens per window: 10
Stride value: 5
Windows:
Window 1 - First 10 tokens:
"|John|Smith|lives|in|San|Francisco|,|and|he|works|"
Window 2 - Start from token 6 (previous start + window size - stride = 1 + 10 - 5), take the next 10 tokens:
"|Francisco|,|and|he|works|for|Microsoft|Corporation|as|a|"
Window 3 - Start from token 11 (6 + 10 - 5), take the remaining 8 tokens:
"|for|Microsoft|Corporation|as|a|software|engineer|.|"
So, with a stride value of 5, we have the following windows:
Window 1: "|John|Smith|lives|in|San|Francisco|,|and|he|works|"
Window 2: "|Francisco|,|and|he|works|for|Microsoft|Corporation|as|a|"
Window 3: "|for|Microsoft|Corporation|as|a|software|engineer|.|"
In this example, a stride value of 5 results in three windows, with overlapping tokens that help preserve the context. The starting point of each window is the previous starting position plus (window size - stride).
Now let's see the case for a stride value of 0 (no overlap):
Window 1: "|John|Smith|lives|in|San|Francisco|,|and|he|works|"
Window 2: "|for|Microsoft|Corporation|as|a|software|engineer|.|"
With a stride of 0, there is no overlap between windows, and context at the window boundary is lost. For instance, the phrase "he works for Microsoft Corporation" is split across the two windows, so neither window sees the employment relation in full, whereas with a stride of 5 the second window contains the whole phrase.
In summary, using a larger stride value when chunking texts for NLP/NER tasks can lead to higher accuracy, as it ensures better context preservation by processing more overlapping tokens. However, it's important to note that larger stride values also increase the number of windows and, consequently, the computational cost. As an NLP engineer, you should balance this trade-off between accuracy and computational efficiency when choosing the stride value for your specific task.
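In practice, this chunking is usually combined with return_offsets_mapping, so that entity predictions from each window can be mapped back to character positions in the original text and merged across the overlaps. A minimal sketch; the checkpoint and the deliberately tiny max_length are illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text = ("John Smith lives in San Francisco, and he works for "
        "Microsoft Corporation as a software engineer.")

encoding = tokenizer(
    text,
    max_length=16,                   # tiny on purpose, to force chunking
    truncation=True,
    stride=8,                        # overlap so entities near a boundary
    return_overflowing_tokens=True,  #   appear whole in at least one window
    return_offsets_mapping=True,     # character spans to map predictions back
)

# Show the stretch of the original text that each window covers.
for offsets in encoding["offset_mapping"]:
    spans = [o for o in offsets if o != (0, 0)]  # drop special tokens
    print(text[spans[0][0]:spans[-1][1]])
```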