fastcountvectorizer¶
-
class fastcountvectorizer.FastCountVectorizer(input='content', ngram_range=(1, 1), analyzer='char', min_df=1, max_df=1.0, binary=False, dtype=<class 'numpy.int64'>)¶
Bases: sklearn.base.BaseEstimator
Convert a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
- Parameters
input (string {'content'}) – Indicates the input type. Currently, only ‘content’ (default value) is supported. The input is expected to be a sequence of items of type string.
ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for the character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
analyzer (string, {'char'}) – Analyzer mode. If set to 'char' (default value and only option), character n-grams will be used.
Warning
FastCountVectorizer does not apply any kind of preprocessing to its inputs. Note that this differs from scikit-learn's CountVectorizer, which applies whitespace normalization.
max_df (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts.
min_df (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts.
binary (bool, default=False) – If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
dtype (type, optional) – Type of the matrix returned by fit_transform() or transform(). Defaults to np.int64.
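To make the ngram_range semantics concrete, here is a minimal standalone sketch of character n-gram extraction. This is an illustration of the behavior described above, not the library's actual (optimized) implementation, and the helper name char_ngrams is hypothetical:

```python
def char_ngrams(doc, min_n=1, max_n=1):
    """Extract all character n-grams of doc for min_n <= n <= max_n."""
    return [doc[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(doc) - n + 1)]

# ngram_range=(1, 2): unigrams and bigrams
print(char_ngrams("abc", 1, 2))  # ['a', 'b', 'c', 'ab', 'bc']
```

Note that, per the warning above, the input string is consumed as-is: no lowercasing or whitespace normalization happens before n-grams are taken.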
-
vocabulary_¶
A mapping of terms to feature indices.
- Type
dict
-
stop_words_¶
Terms that were ignored because they either:
occurred in too many documents (max_df), or
occurred in too few documents (min_df)
- Type
set
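The interplay between min_df, max_df, vocabulary_, and stop_words_ can be sketched in plain Python. This is an illustrative model of the document-frequency filtering described above (using absolute counts), not the library's implementation, and build_vocabulary is a hypothetical helper:

```python
from collections import Counter

def build_vocabulary(docs_ngrams, min_df=1, max_df=None):
    """Build a term -> feature-index mapping, dropping terms whose
    document frequency falls outside [min_df, max_df] (absolute counts)."""
    if max_df is None:
        max_df = len(docs_ngrams)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs_ngrams for term in set(doc))
    kept = sorted(t for t, c in df.items() if min_df <= c <= max_df)
    stop_words = set(df) - set(kept)
    vocabulary = {term: i for i, term in enumerate(kept)}
    return vocabulary, stop_words

docs = [["a", "b"], ["a", "c"], ["a", "b"]]
vocab, stops = build_vocabulary(docs, min_df=2, max_df=2)
# "a" appears in 3 docs (> max_df) and "c" in 1 (< min_df),
# so both land in stop_words; only "b" survives in the vocabulary.
```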
-
fit(raw_documents, y=None)¶
Learn a vocabulary dictionary of all tokens in the raw documents.
- Parameters
raw_documents (list) – A list of strings.
- Returns
self – The fitted vectorizer.
- Return type
self
-
fit_transform(raw_documents, y=None)¶
Learn the vocabulary dictionary and return the term-document matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.
- Parameters
raw_documents (list) – A list of strings.
- Returns
X – Document-term matrix.
- Return type
sparse matrix, [n_samples, n_features]
-
get_feature_names()¶
Array mapping from feature integer indices to feature name.
- Returns
feature_names – A list of feature names.
- Return type
list
-
transform(raw_documents)¶
Transform documents to a document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
- Parameters
raw_documents (list) – A list of strings.
- Returns
X – Document-term matrix.
- Return type
sparse matrix, [n_samples, n_features]
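Since the returned matrix is a scipy.sparse.csr_matrix, it helps to see how per-document counts map onto the CSR layout (data, indices, indptr). The sketch below is a hedged illustration of that packing, including the binary option described above; transform_to_csr is a hypothetical helper, not part of the library's API:

```python
from collections import Counter

def transform_to_csr(docs_ngrams, vocabulary, binary=False):
    """Count vocabulary terms per document and pack the counts into
    CSR arrays: one row per document, one column per vocabulary term."""
    data, indices, indptr = [], [], [0]
    for doc in docs_ngrams:
        # Terms outside the fitted vocabulary are silently dropped.
        counts = Counter(t for t in doc if t in vocabulary)
        for term in sorted(counts, key=vocabulary.get):
            indices.append(vocabulary[term])
            data.append(1 if binary else counts[term])
        indptr.append(len(indices))
    return data, indices, indptr

vocab = {"a": 0, "b": 1}
data, indices, indptr = transform_to_csr([["a", "a", "b"], ["b"]], vocab)
# row 0 holds counts a->2, b->1; row 1 holds b->1
```

These three lists are exactly what scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_samples, n_features)) accepts; with binary=True every stored count becomes 1.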