fastcountvectorizer

class fastcountvectorizer.FastCountVectorizer(input='content', ngram_range=(1, 1), analyzer='char', min_df=1, max_df=1.0, binary=False, dtype=<class 'numpy.int64'>)

Bases: sklearn.base.BaseEstimator

Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
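
A minimal usage sketch (the corpus and printed values are illustrative, not taken from the package's own examples):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["abc", "abab", "cab"]

    # Character unigram counts (the defaults: analyzer='char', ngram_range=(1, 1)).
    vectorizer = FastCountVectorizer()
    X = vectorizer.fit_transform(docs)

    # X is a scipy.sparse.csr_matrix with one row per document and one
    # column per character seen in the corpus.
    print(X.shape)                         # (3, 3)
    print(vectorizer.get_feature_names())  # e.g. ['a', 'b', 'c']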

Parameters
  • input (string {'content'}) – Indicates the input type. Currently, only ‘content’ (default value) is supported. The input is expected to be a sequence of items of type string.

  • ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for the character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams (see the sketch after this parameter list).

  • analyzer (string, {'char'}) –

    Analyzer mode. If set to ‘char’ (the default and only option), character n-grams will be used.

    Warning

    FastCountVectorizer does not apply any kind of preprocessing to its inputs. Note that this differs from scikit-learn’s CountVectorizer, which applies whitespace normalization.

  • max_df (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, absolute document counts.

  • min_df (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute document counts.

  • binary (bool, default=False) – If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

  • dtype (type, optional) – Type of the matrix returned by fit_transform() or transform(). Defaults to numpy.int64.
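
As a concrete illustration of ngram_range and of the warning above that no preprocessing is applied, a small sketch (the corpus is made up and the exact feature ordering is an assumption):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["ab cd", "AB cd"]

    # Character bigrams only. Because no preprocessing is applied, the
    # space produces features such as 'b ' and ' c', and 'ab' / 'AB'
    # remain distinct (no lowercasing).
    vectorizer = FastCountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names())
    # e.g. [' c', 'AB', 'B ', 'ab', 'b ', 'cd']
    print(X.toarray())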

vocabulary_

A mapping of terms to feature indices.

Type

dict

stop_words_

Terms that were ignored because they either:

  • occurred in too many documents (max_df)

  • occurred in too few documents (min_df)

Type

set
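
A sketch of inspecting these attributes after fitting; the values shown in comments are indicative only:

    from fastcountvectorizer import FastCountVectorizer

    # min_df=2: keep only characters that occur in at least two documents.
    vectorizer = FastCountVectorizer(min_df=2)
    vectorizer.fit(["aab", "abc", "xyz"])

    # vocabulary_ maps each kept term to its column index,
    # e.g. something like {'a': 0, 'b': 1}.
    print(vectorizer.vocabulary_)

    # stop_words_ holds the terms dropped by the min_df cut-off,
    # here the characters that occur in only one document.
    print(vectorizer.stop_words_)   # e.g. {'c', 'x', 'y', 'z'}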

fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters

raw_documents (list) – A list of strings.

Returns

self – The fitted vectorizer.

Return type

FastCountVectorizer
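
A sketch of fit used on its own; since it returns the vectorizer itself, calls can be chained (the corpus is made up):

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(1, 2))

    # fit only learns the vocabulary; transform then reuses it.
    X = vectorizer.fit(["spam", "ham"]).transform(["spam", "eggs"])
    print(X.shape)   # (2, number of n-grams learned from the fit corpus)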

fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters

raw_documents (list) – A list of strings.

Returns

X – Document-term matrix.

Return type

sparse matrix, [n_samples, n_features]
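
A sketch illustrating that fit_transform is equivalent to fit followed by transform on the same documents (the corpus is illustrative):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["abcabc", "bcd"]

    v1 = FastCountVectorizer(ngram_range=(1, 3))
    X1 = v1.fit_transform(docs)

    v2 = FastCountVectorizer(ngram_range=(1, 3))
    X2 = v2.fit(docs).transform(docs)

    # Both calls yield the same document-term matrix.
    print((X1.toarray() == X2.toarray()).all())   # True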

get_feature_names()

List mapping from feature integer indices to feature names.

Returns

feature_names – A list of feature names.

Return type

list
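
A sketch of mapping column indices back to their n-grams (the printed names are illustrative):

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(["abab", "abc"])

    # Column j of X holds the counts of feature_names[j].
    feature_names = vectorizer.get_feature_names()
    for j, name in enumerate(feature_names):
        print(j, repr(name))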

transform(raw_documents)

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary learned by fit or fit_transform.

Parameters

raw_documents (list) – A list of strings.

Returns

X – Document-term matrix.

Return type

sparse matrix, [n_samples, n_features]
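
A sketch of transform applied to documents not seen during fitting; n-grams outside the fitted vocabulary are assumed to be ignored, as in scikit-learn's CountVectorizer:

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(["abc", "abd"])

    # Only n-grams learned during fit get columns; unseen ones such as
    # 'x' or 'yz' contribute nothing to the counts.
    X_new = vectorizer.transform(["abx", "xyz"])
    print(X_new.shape)     # (2, n_features learned during fit)
    print(X_new.toarray())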