fastcountvectorizer

class fastcountvectorizer.FastCountVectorizer(input='content', ngram_range=(1, 1), analyzer='char', min_df=1, max_df=1.0, binary=False, dtype=<class 'numpy.int64'>)

Bases: sklearn.base.BaseEstimator

Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
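
A minimal usage sketch (the corpus and printed values are illustrative, not taken from the package's own examples):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["abc", "abab", "cab"]

    # Character unigram counts (the defaults: analyzer='char', ngram_range=(1, 1)).
    vectorizer = FastCountVectorizer()
    X = vectorizer.fit_transform(docs)

    # X is a scipy.sparse.csr_matrix with one row per document and one
    # column per character seen in the corpus.
    print(X.shape)                         # (3, 3)
    print(vectorizer.get_feature_names())  # e.g. ['a', 'b', 'c']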

Parameters
  • input (string {'content'}) – Indicates the input type. Currently, only ‘content’ (default value) is supported. The input is expected to be a sequence of items of type string.

  • ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for the character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams (see the sketch after this parameter list).

  • analyzer (string, {'char'}) –

    Analyzer mode. If set to ‘char’ (the default and only option), character n-grams will be used.

    Warning

    FastCountVectorizer does not apply any kind of preprocessing to its inputs. Note that this differs from scikit-learn’s CountVectorizer, which applies whitespace normalization.

  • max_df (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, absolute document counts.

  • min_df (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute document counts.

  • binary (bool, default=False) – If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

  • dtype (type, optional) – Type of the matrix returned by fit_transform() or transform(). Defaults to numpy.int64.
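
As a concrete illustration of ngram_range and of the warning above that no preprocessing is applied, a small sketch (the corpus is made up and the exact feature ordering is an assumption):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["ab cd", "AB cd"]

    # Character bigrams only. Because no preprocessing is applied, the
    # space produces features such as 'b ' and ' c', and 'ab' / 'AB'
    # remain distinct (no lowercasing).
    vectorizer = FastCountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names())
    # e.g. [' c', 'AB', 'B ', 'ab', 'b ', 'cd']
    print(X.toarray())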

vocabulary_

A mapping of terms to feature indices.

Type

dict

stop_words_

Terms that were ignored because they either:

  • occurred in too many documents (max_df)

  • occurred in too few documents (min_df)

Type

set
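
A sketch of inspecting these attributes after fitting; the values shown in comments are indicative only:

    from fastcountvectorizer import FastCountVectorizer

    # min_df=2: keep only characters that occur in at least two documents.
    vectorizer = FastCountVectorizer(min_df=2)
    vectorizer.fit(["aab", "abc", "xyz"])

    # vocabulary_ maps each kept term to its column index,
    # e.g. something like {'a': 0, 'b': 1}.
    print(vectorizer.vocabulary_)

    # stop_words_ holds the terms dropped by the min_df cut-off,
    # here the characters that occur in only one document.
    print(vectorizer.stop_words_)   # e.g. {'c', 'x', 'y', 'z'}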

fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters

raw_documents (list) – A list of strings.

Returns

self – The fitted vectorizer.

Return type

FastCountVectorizer
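
A sketch of fit used on its own; since it returns the vectorizer itself, calls can be chained (the corpus is made up):

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(1, 2))

    # fit only learns the vocabulary; transform then reuses it.
    X = vectorizer.fit(["spam", "ham"]).transform(["spam", "eggs"])
    print(X.shape)   # (2, number of n-grams learned from the fit corpus)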

fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters

raw_documents (list) – A list of strings.

Returns

X – Document-term matrix.

Return type

sparse matrix, [n_samples, n_features]
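
A sketch illustrating that fit_transform is equivalent to fit followed by transform on the same documents (the corpus is illustrative):

    from fastcountvectorizer import FastCountVectorizer

    docs = ["abcabc", "bcd"]

    v1 = FastCountVectorizer(ngram_range=(1, 3))
    X1 = v1.fit_transform(docs)

    v2 = FastCountVectorizer(ngram_range=(1, 3))
    X2 = v2.fit(docs).transform(docs)

    # Both calls yield the same document-term matrix.
    print((X1.toarray() == X2.toarray()).all())   # True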

get_feature_names()

List mapping from feature integer indices to feature names.

Returns

feature_names – A list of feature names.

Return type

list
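
A sketch of mapping column indices back to their n-grams (the printed names are illustrative):

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(["abab", "abc"])

    # Column j of X holds the counts of feature_names[j].
    feature_names = vectorizer.get_feature_names()
    for j, name in enumerate(feature_names):
        print(j, repr(name))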

transform(raw_documents)

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary learned by fit or fit_transform.

Parameters

raw_documents (list) – A list of strings.

Returns

X – Document-term matrix.

Return type

sparse matrix, [n_samples, n_features]
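
A sketch of transform applied to documents not seen during fitting; n-grams outside the fitted vocabulary are assumed to be ignored, as in scikit-learn's CountVectorizer:

    from fastcountvectorizer import FastCountVectorizer

    vectorizer = FastCountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(["abc", "abd"])

    # Only n-grams learned during fit get columns; unseen ones such as
    # 'x' or 'yz' contribute nothing to the counts.
    X_new = vectorizer.transform(["abx", "xyz"])
    print(X_new.shape)     # (2, n_features learned during fit)
    print(X_new.toarray())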