
NLTK vs Spacy: Performance Comparison of NLP Libraries in Text Tokenization

Natural Language Processing Tokenization with Spacy and NLTK


Computers do not actually understand human language; the only language they understand is binary (0s and 1s). Matters are complicated further by the fact that humans use so many languages and dialects. To make computers understand natural language (the language humans write and speak), we need to convert it into a format a computer can process. When working with neural networks, we convert the text into a form called “tokenized text”.

Tokenizing text means breaking it into sentences and words and representing them as vectors of numbers. The challenge is how to convert a language (e.g. English) into vectors of numbers, because we first have to break the text into sentences and then into words. Breaking a sentence into words is programmatically straightforward, since we can split the text with space as the delimiter, but how do we break the text into sentences? This is a bit tricky because sentences come in so many forms that we can’t just define one or two rules. For example, consider the following sentences:

  1. How are you? Fine.
  2. I am fine.
  3. I can’t go there!
  4. I wish there were more livable planets out there…

Each of these sentences ends with a different character, and slang makes things even more complex. Breaking the text into sentences is important because we want to retain the structure of the sentence, which carries its grammar rules. The sketch below shows why a naive rule is not enough.
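Here is a minimal sketch (my own illustration, not code from either library) comparing a naive split on end-of-sentence punctuation with NLTK’s sent_tokenize; the naive rule cuts abbreviations like “Mr.” apart:

import re
import nltk

text = "Mr. Smith went to Washington. How are you? Fine."

# naive rule: split on '.', '?' or '!' -- cuts abbreviations apart
naive = [s.strip() for s in re.split(r'[.?!]', text) if s.strip()]
print(naive)
# ['Mr', 'Smith went to Washington', 'How are you', 'Fine']

# NLTK's trained punkt tokenizer knows common abbreviations
print(nltk.sent_tokenize(text))
# ['Mr. Smith went to Washington.', 'How are you?', 'Fine.']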

NLP Libraries

Fortunately, we don’t have to write our own functions to split text into sentences, as there are already some great libraries such as NLTK, Spacy, and Stanford CoreNLP.

NLTK has been around since 2001 and is continually developed, while Spacy is a newer library that has been geared towards performance. I wanted to compare the two to see if Spacy really is faster than NLTK at tokenizing text. The result is absolutely astonishing!

I compared the performance of both libraries on a sample of Reddit comments using Python.

This code assumes you have installed both Spacy and NLTK along with their English language data.
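If you haven’t, something like the following should set them up (a sketch: 'punkt' is NLTK’s sentence tokenizer data, and the exact Spacy model name depends on your Spacy version):

# in a shell:
#   pip install nltk spacy
#   python -m spacy download en

# download NLTK's punkt sentence tokenizer data
import nltk
nltk.download('punkt')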

Code:

 
# lib imports
import csv
import nltk
from datetime import datetime

# Spacy import and sentence tokenization function
# (newer Spacy versions use spacy.load('en_core_web_sm') and sent.text)
import spacy
nlp = spacy.load('en')

def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]
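
# Quick sanity check (illustrative; output assumes the 'en' model above):
#   sentence_tokenize("I am fine. How are you?")
#   -> ['I am fine.', 'How are you?']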


# NLTK text tokenization and calculation of computation time:

print("Reading CSV file...")

# record time stamp before
tstart = datetime.now()

with open('../data/reddit-comments-2015-08.csv', 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    # skip the header row; reader yields the remaining rows one by one
    next(reader)

    # iterate through all the rows and tokenize the comment text
    sentences = []
    for row in reader:
        sample_str = row[0]
        tokzd = nltk.sent_tokenize(sample_str.lower())  # sentence tokenization with NLTK
        sentences.extend(tokzd)

# record time stamp afterwards
tend = datetime.now()

# print the time it took to tokenize the text
print(tend - tstart)

# Output : 0:00:01.961066

# Spacy text tokenization and calculation of computation time: 

# record time stamp before
tstart = datetime.now()

with open('../data/reddit-comments-2015-08.csv', 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    # skip the header row
    next(reader)

    # iterate through all the rows and tokenize the comment text
    sentences = []
    for row in reader:
        sample_str = row[0]
        tokzd = sentence_tokenize(sample_str.lower())  # sentence tokenization with Spacy
        sentences.extend(tokzd)

# record time stamp afterwards
tend = datetime.now()

# print the time it took to tokenize the text
print(tend - tstart)

# Output : 0:02:58.788951

Result

This is not something I had expected! Spacy is way, way slower than NLTK: NLTK took barely 2 seconds to tokenize the 7 MB text sample, while Spacy took a whopping 3 minutes!
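A plausible explanation (my reading of Spacy’s design, not something I have profiled here): spacy.load('en') runs the full processing pipeline, including the part-of-speech tagger and dependency parser that its sentence boundaries rely on, while NLTK’s sent_tokenize is a single lightweight punkt model. If you only need sentence splitting, a sketch like the following, using a blank pipeline with Spacy’s rule-based sentencizer component (names from current Spacy versions, unlike the older release benchmarked above), should be much faster:

import spacy

# blank English pipeline: no tagger, parser, or named entity recognizer
nlp = spacy.blank('en')
# add the rule-based sentence segmenter (punctuation-based, no model needed)
nlp.add_pipe('sentencizer')

def fast_sentence_tokenize(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

print(fast_sentence_tokenize("I am fine. How are you? Fine."))
# ['I am fine.', 'How are you?', 'Fine.']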
