Word Embeddings in Go

wego is a from-scratch implementation of word embedding (a.k.a. word representation) models in Go.

What are word embeddings?

Word embeddings map the meaning, structure, and concepts of words into a low-dimensional vector space. A representative example:

Vector("King") - Vector("Man") + Vector("Woman") = Vector("Queen")

As this example shows, the models generate word vectors whose arithmetic reflects word meaning: adding and subtracting vectors yields a vector close to that of a semantically related word.
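The arithmetic can be illustrated with a minimal sketch. The 2-dimensional vectors below are hand-picked toy values (not trained by wego), and the helpers add and sub are hypothetical names used only for this illustration. With real trained vectors the result is only approximately equal to the target vector, so a nearest neighbor search is normally used to find the closest word.

```go
package main

import "fmt"

// add returns the element-wise sum a + b.
// Both slices are assumed to have the same length.
func add(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// sub returns the element-wise difference a - b.
// Both slices are assumed to have the same length.
func sub(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] - b[i]
	}
	return out
}

func main() {
	// Toy vectors for illustration only:
	// axis 0 loosely stands for "royalty", axis 1 for "gender".
	vectors := map[string][]float64{
		"king":  {1, 1},
		"queen": {1, -1},
		"man":   {0, 1},
		"woman": {0, -1},
	}

	// Vector("King") - Vector("Man") + Vector("Woman")
	result := add(sub(vectors["king"], vectors["man"]), vectors["woman"])
	fmt.Println(result) // [1 -1], the toy vector for "queen"
}
```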

Features

wego supports the following models for learning word vectors:

Word2Vec: Distributed Representations of Words and Phrases and their Compositionality [pdf]
GloVe: Global Vectors for Word Representation [pdf]
LexVec: Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations [pdf]

wego also provides nearest neighbor search tools that calculate the distances between word vectors and find the words nearest to a target word. For word vectors, "near" means the words are "similar".

See the Usage section for details on how to use them.

Why Go?

Inspired by Data Science in Go @chewxy

Installation

Use the go command to install this package.

$ go get -u github.com/ynqa/wego
$ bin/wego -h

Usage

wego provides a CLI and a Go SDK for word embeddings.

CLI

Usage:
  wego [flags]
  wego [command]

Available Commands:
  console     Console to investigate word vectors
  glove       GloVe: Global Vectors for Word Representation
  help        Help about any command
  lexvec      Lexvec: Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
  query       Query similar words
  word2vec    Word2Vec: Continuous Bag-of-Words and Skip-gram model

The word2vec, glove, and lexvec commands each run the following workflow to generate word vectors:

Build a dictionary for vocabularies and count word frequencies by scanning a given corpus.
Start training. The execution time depends on the size of the corpus, the hyperparameters (flags), and so on.
Save the words and their vectors as a text file.

query and console are commands for nearest neighbor search over trained word vectors.

query outputs the words most similar to a given word, using word vectors generated by the above models.

e.g. wego query -i word_vector.txt microsoft:

  RANK |   WORD    | SIMILARITY
-------+-----------+-------------
     1 | hypercard |   0.791492
     2 | xp        |   0.768939
     3 | software  |   0.763369
     4 | freebsd   |   0.761084
     5 | unix      |   0.749563
     6 | linux     |   0.747327
     7 | ibm       |   0.742115
     8 | windows   |   0.731136
     9 | desktop   |   0.715790
    10 | linspire  |   0.711171

wego does not reproduce the same word vectors across runs because it adopts the HogWild! algorithm, which updates the parameters (in this case, the word vectors) asynchronously.

console starts a REPL for basic arithmetic operations (+ and -) on word vectors.

Go SDK

Hyperparameters for the models are set through functional options.

model, err := word2vec.New(
	word2vec.Window(5),
	word2vec.Model(word2vec.Cbow),
	word2vec.Optimizer(word2vec.NegativeSampling),
	word2vec.NegativeSampleSize(5),
	word2vec.Verbose(),
)
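For readers unfamiliar with the functional options pattern, here is a self-contained sketch of how it generally works in Go. The config type, its fields, the defaults, and newConfig are all hypothetical and chosen for illustration; they are not taken from wego's source.

```go
package main

import "fmt"

// config holds hypothetical hyperparameters. The fields and their
// defaults are illustrative only.
type config struct {
	window             int
	negativeSampleSize int
	verbose            bool
}

// Option is a function that modifies a config.
type Option func(*config)

// Window sets the context window size.
func Window(n int) Option {
	return func(c *config) { c.window = n }
}

// NegativeSampleSize sets the number of negative samples.
func NegativeSampleSize(n int) Option {
	return func(c *config) { c.negativeSampleSize = n }
}

// Verbose enables verbose output.
func Verbose() Option {
	return func(c *config) { c.verbose = true }
}

// newConfig starts from default values and applies each option in order,
// so callers only specify the settings they want to change.
func newConfig(opts ...Option) config {
	c := config{window: 5, negativeSampleSize: 5}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	c := newConfig(Window(10), Verbose())
	fmt.Println(c.window, c.negativeSampleSize, c.verbose) // 10 5 true
}
```

The pattern keeps the constructor signature stable as new hyperparameters are added, since each new setting is just another option function.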

Each model provides the following methods:

type Model interface {
	Train(io.ReadSeeker) error
	Save(io.Writer, vector.Type) error
	WordVector(vector.Type) *matrix.Matrix
}

Formats

For training word vectors, wego requires the following file formats for input and output.

Input

The input corpus must be a text file in which words are separated by spaces, like text8.

word1 word2 word3 ...

Output

After training, wego saves the word vectors to a text file in the following format (N is the word vector dimension you specified):

<word> <value_1> <value_2> ... <value_N>
