gensim word2vec parameter reference

July 24, 2017, fendouai

def __init__(
        self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
        max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
        sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
        trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False):
    """
    Initialize the model from an iterable of `sentences`. Each sentence is a
    list of words (unicode strings) that will be used for training.

    The `sentences` iterable can be simply a list, but for larger corpora,
    consider an iterable that streams the sentences directly from disk/network.
    See :class:`BrownCorpus`, :class:`Text8Corpus` or :class:`LineSentence` in
    this module for such examples.

    If you don't supply `sentences`, the model is left uninitialized -- use if
    you plan to initialize it in some other way.

    `sg` defines the training algorithm. By default (`sg=0`), CBOW is used.
    Otherwise (`sg=1`), skip-gram is employed.

    `size` is the dimensionality of the feature vectors.

    `window` is the maximum distance between the current and predicted word within a sentence.

    `alpha` is the initial learning rate (will linearly drop to `min_alpha` as training progresses).

    `seed` = seed for the random number generator. Initial vectors for each
    word are seeded with a hash of the concatenation of word + str(seed).
    Note that for a fully deterministically-reproducible run, you must also limit the model to
    a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python
    3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED
    environment variable to control hash randomization.)

    `min_count` = ignore all words with total frequency lower than this.

    `max_vocab_size` = limit RAM during vocabulary building; if there are more unique
    words than this, then prune the infrequent ones. Every 10 million word types
    need about 1GB of RAM. Set to `None` for no limit (default).

    `sample` = threshold for configuring which higher-frequency words are randomly downsampled;
        default is 1e-3, useful range is (0, 1e-5).

    `workers` = use this many worker threads to train the model (=faster training with multicore machines).

    `hs` = if 1, hierarchical softmax will be used for model training.
    If set to 0 (default), and `negative` is non-zero, negative sampling will be used.

    `negative` = if > 0, negative sampling will be used, the int for negative
    specifies how many "noise words" should be drawn (usually between 5-20).
    Default is 5. If set to 0, no negative sampling is used.

    `cbow_mean` = if 0, use the sum of the context word vectors. If 1 (default), use the mean.
    Only applies when cbow is used.

    `hashfxn` = hash function to use to randomly initialize weights, for increased
    training reproducibility. Default is Python's rudimentary built in hash function.

    `iter` = number of iterations (epochs) over the corpus. Default is 5.

    `trim_rule` = vocabulary trimming rule, specifies whether certain words should remain
    in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count).
    Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and
    returns either `utils.RULE_DISCARD`, `utils.RULE_KEEP` or `utils.RULE_DEFAULT`.
    Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part
    of the model.

    `sorted_vocab` = if 1 (default), sort the vocabulary by descending frequency before
    assigning word indexes.

    `batch_words` = target size (in words) for batches of examples passed to worker threads (and
    thus cython routines). Default is 10000. (Larger batches will be passed if individual
    texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

    """
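
The docstring describes `trim_rule` as a callable that receives `(word, count, min_count)` and returns one of `utils.RULE_DISCARD`, `utils.RULE_KEEP` or `utils.RULE_DEFAULT`. A minimal sketch of such a callable might look like the following. The two word sets are hypothetical, and the numeric fallback values are placeholders that only let the sketch run when gensim is not installed.

```python
try:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD, RULE_KEEP
except ImportError:
    # Placeholder values (an assumption) so the sketch runs without gensim.
    # With gensim installed, the real constants imported above are used.
    RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

ALWAYS_KEEP = {"gensim", "word2vec"}  # hypothetical domain terms to protect
ALWAYS_DROP = {"the", "a", "of"}      # hypothetical stop words to remove


def trim_rule(word, count, min_count):
    """Decide the fate of one vocabulary entry.

    `count` and `min_count` are accepted because the docstring specifies
    this three-argument signature; this sketch decides on the word alone.
    """
    if word in ALWAYS_KEEP:
        return RULE_KEEP      # keep even if rarer than min_count
    if word in ALWAYS_DROP:
        return RULE_DISCARD   # drop even if very frequent
    return RULE_DEFAULT       # fall back to the min_count comparison
```

Per the signature above, the function would be passed as `trim_rule=trim_rule`. As the docstring notes, the rule is applied only while the vocabulary is built and is not stored with the model.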

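For larger corpora the docstring recommends an iterable that streams sentences from disk instead of holding a list in memory. The sketch below shows one way such an iterable could be written using only the standard library. `StreamedSentences` and `demo` are hypothetical names, and the one-sentence-per-line, whitespace-tokenized file format is an assumption. The `LineSentence` class mentioned in the docstring serves a similar purpose in gensim itself.

```python
import os
import tempfile


class StreamedSentences:
    """Yield one whitespace-tokenized sentence per non-empty line of a file."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass: training reads the corpus more
        # than once, so a one-shot generator would be exhausted too early.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


def demo():
    """Write a tiny corpus to a temp file and iterate over it twice."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("the quick brown fox\n\njumps over the lazy dog\n")
        path = tmp.name
    try:
        corpus = StreamedSentences(path)
        return list(corpus), list(corpus)
    finally:
        os.remove(path)
```

An instance of this class would be passed as the `sentences` argument. Because `__iter__` opens the file afresh each time, only one line needs to be in memory at any moment.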