Ngram Extraction

Introduction

The ngram_extraction module uses nltk to split a given block of text into ngrams.

To use the module:

>>> import coast_core
>>> coast_core.ngram_extraction.function(to_use)

or:

>>> from coast_core import ngram_extraction
>>> ngram_extraction.function(to_use)

The module provides a collection of functions for splitting an article into ngrams.

coast_core.ngram_extraction.calculate_ngram_frequency_count(article_text, ngram_size, stop_list=None)

Calculate the frequency of occurrence of ngrams of a given size in the article text.

Parameters:	article_text – the block of text to operate on.
	ngram_size – the degree of ngram to be returned (e.g. 3 returns trigrams).
	stop_list – list of words to be excluded from the frequency count.
Returns:	An object containing the frequency count of the ngrams, with ngrams in the stop list excluded.
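The counting step described above can be sketched in plain Python. This is an illustration only, not the coast_core implementation: the function name sketch_ngram_frequency_count is hypothetical, tokenisation is a lowercase whitespace split rather than nltk, and it assumes an ngram is dropped when any of its words is in the stop list. The real function may treat the stop list differently.

```python
from collections import Counter


def sketch_ngram_frequency_count(article_text, ngram_size, stop_list=None):
    """Hypothetical stand-in for illustration; not the coast_core code."""
    stop_words = {word.lower() for word in (stop_list or [])}
    # Assumption: lowercase whitespace tokenisation (coast_core uses nltk).
    tokens = article_text.lower().split()
    # Slide a window of ngram_size tokens across the token list.
    ngrams = zip(*(tokens[i:] for i in range(ngram_size)))
    # Assumption: drop an ngram if any of its words is in the stop list.
    return Counter(
        ngram for ngram in ngrams
        if not stop_words.intersection(ngram)
    )


counts = sketch_ngram_frequency_count(
    "the cat sat on the mat and the cat slept", 2, stop_list=["and"]
)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Returning a Counter keeps the result usable as an ordinary mapping from ngram to count while also offering most_common for ranking.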

coast_core.ngram_extraction.generate_ngrams(article_text)

Split the given text into ngrams, returning an object that contains the ngrams of every size from one to six.

Parameters:	article_text – the block of text to operate on.
Returns:	An object containing all ngrams up to 6 in the following structure:

    {
        "unigrams": [list of unigrams],
        "bigrams": [list of bigrams],
        "trigrams": [list of trigrams],
        "fourgrams": [list of fourgrams],
        "fivegrams": [list of fivegrams],
        "sixgrams": [list of sixgrams]
    }
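To make the returned structure concrete, here is a minimal sketch that builds a dictionary with the same six keys. It is not the coast_core implementation: the name sketch_generate_ngrams is hypothetical, tokenisation is a plain whitespace split rather than nltk, and representing each ngram as a tuple is an assumption, since the element type is not stated above.

```python
def sketch_generate_ngrams(article_text):
    """Hypothetical stand-in for illustration; not the coast_core code."""
    # Assumption: whitespace tokenisation (coast_core uses nltk).
    tokens = article_text.split()
    names = ["unigrams", "bigrams", "trigrams",
             "fourgrams", "fivegrams", "sixgrams"]
    # For each size n from 1 to 6, collect every run of n consecutive tokens.
    # Assumption: each ngram is stored as a tuple of words.
    return {
        name: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n, name in enumerate(names, start=1)
    }


result = sketch_generate_ngrams("a b c d")
print(result["bigrams"])    # [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(result["fivegrams"])  # [] because the text has only four tokens
```

In this sketch, a text shorter than n tokens simply yields an empty list for that key, so all six keys are always present.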