Ngram Extraction

Introduction

The ngram_extraction module uses nltk to split a given block of text into ngrams.

Usage

To use the module:

>>> import coast_core
>>> coast_core.ngram_extraction.function(to_use)

or:

>>> from coast_core import ngram_extraction
>>> ngram_extraction.function(to_use)

Functions

A collection of functions that can be used for splitting the article into ngrams.

coast_core.ngram_extraction.calculate_ngram_frequency_count(article_text, ngram_size, stop_list=None)

Calculate the frequency of occurrence of ngrams of a given size in an article's text.

Parameters:
    article_text – the block of text to operate on.
    ngram_size – the degree of ngram to be returned (e.g. 3 returns trigrams).
    stop_list – list of words to be excluded from the frequency count.
Returns: An object containing the frequency count of the ngrams, excluding any ngrams in the stop list.
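The counting step can be illustrated with a minimal standard-library sketch. This is not the module's implementation: the name sketch_ngram_frequency_count is hypothetical, whitespace tokenization stands in for nltk, and the sketch assumes each stop list entry is matched against a whole space-joined ngram, which the description above does not confirm.

```python
from collections import Counter


def sketch_ngram_frequency_count(article_text, ngram_size, stop_list=None):
    """Hypothetical stand-in for calculate_ngram_frequency_count."""
    # Assumption: plain whitespace tokenization (the real module uses nltk).
    tokens = article_text.split()
    # Slide a window of ngram_size tokens across the token list.
    windows = zip(*(tokens[i:] for i in range(ngram_size)))
    counts = Counter(" ".join(window) for window in windows)
    # Assumption: stop list entries are compared against whole ngrams.
    for excluded in stop_list or []:
        counts.pop(excluded, None)
    return counts


if __name__ == "__main__":
    text = "the cat sat on the cat"
    print(sketch_ngram_frequency_count(text, 2, stop_list=["sat on"]))
```

A Counter is used here only because it makes the frequency lookup convenient; the type of object the real function returns is not stated in this documentation.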

coast_core.ngram_extraction.generate_ngrams(article_text)

Split the given text into ngrams, returning an object that contains ngrams from one to six.

Parameters: article_text – the block of text to operate on.
Returns: An object containing all ngrams up to six, in the following structure:
{
    "unigrams": [list of unigrams],
    "bigrams": [list of bigrams],
    "trigrams": [list of trigrams],
    "fourgrams": [list of fourgrams],
    "fivegrams": [list of fivegrams],
    "sixgrams": [list of sixgrams]
}
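The shape of that return value can be sketched with the standard library alone. The function name sketch_generate_ngrams is hypothetical, whitespace tokenization stands in for nltk, and representing each ngram as a space-joined string is an assumption; the real function may return tuples or token lists instead.

```python
def sketch_generate_ngrams(article_text):
    """Hypothetical stand-in that mirrors the documented return structure."""
    names = ["unigrams", "bigrams", "trigrams",
             "fourgrams", "fivegrams", "sixgrams"]
    # Assumption: plain whitespace tokenization (the real module uses nltk).
    tokens = article_text.split()
    result = {}
    for size, name in enumerate(names, start=1):
        # One ngram per starting position; the list is empty when the
        # text has fewer than `size` tokens.
        result[name] = [
            " ".join(tokens[start:start + size])
            for start in range(len(tokens) - size + 1)
        ]
    return result


if __name__ == "__main__":
    for name, grams in sketch_generate_ngrams("a b c").items():
        print(name, grams)
```

In this sketch, a text shorter than a given ngram size yields an empty list for that key rather than omitting the key, so callers can always index all six entries.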