woo-ho!!!
ynqa committed Dec 1, 2020
1 parent b68a19e commit 5c82ebd
Showing 24 changed files with 438 additions and 481 deletions.
91 changes: 50 additions & 41 deletions README.md
@@ -1,26 +1,36 @@
# Word Embedding in Go
# Word Embeddings in Go

[![Build Status](https://travis-ci.com/ynqa/wego.svg?branch=master)](https://travis-ci.com/ynqa/wego)
[![GoDoc](https://godoc.org/github.com/ynqa/wego?status.svg)](https://godoc.org/github.com/ynqa/wego)
[![Go Report Card](https://goreportcard.com/badge/github.com/ynqa/wego)](https://goreportcard.com/report/github.com/ynqa/wego)

wego is the implementations for word embedding (a.k.a word representation) models in Go. [Word embedding](https://en.wikipedia.org/wiki/Word_embedding) makes word's meaning, structure, and concept mapping into vector space with low dimension. For representative instance:
*wego* is the implementations **from scratch** for word embeddings (a.k.a word representation) models in Go.

## What's word embeddings?

[Word embeddings](https://en.wikipedia.org/wiki/Word_embeddings) make words' meaning, structure, and concept mapping into vector space with a low dimension. For representative instance:
```
Vector("King") - Vector("Man") + Vector("Woman") = Vector("Queen")
```
Like this example, models generate word vectors that could calculate word meaning by arithmetic operations for other vectors. wego provides CLI that includes not only training model for embedding but also similarity search between words.
Like this example, the models generate word vectors that could calculate word meaning by arithmetic operations for other vectors.
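
The analogy above can be sketched in a few lines of Go. This is only an illustration of the arithmetic: the 2-dimensional vectors are invented toy values, and the helper names (`cosine`, `analogy`, `nearest`) are hypothetical and are not part of the wego API.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// analogy computes a - b + c element-wise.
func analogy(a, b, c []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] - b[i] + c[i]
	}
	return out
}

// nearest returns the word whose vector is most similar to target,
// ignoring the words listed in exclude.
func nearest(vecs map[string][]float64, target []float64, exclude ...string) string {
	skip := make(map[string]bool, len(exclude))
	for _, w := range exclude {
		skip[w] = true
	}
	best, bestSim := "", math.Inf(-1)
	for w, v := range vecs {
		if skip[w] {
			continue
		}
		if s := cosine(v, target); s > bestSim {
			best, bestSim = w, s
		}
	}
	return best
}

func main() {
	// Toy vectors, invented for illustration; real embeddings are learned.
	vecs := map[string][]float64{
		"king":  {0.9, 0.8},
		"man":   {0.9, 0.1},
		"woman": {0.1, 0.1},
		"queen": {0.1, 0.8},
		"apple": {0.5, -0.9},
	}
	target := analogy(vecs["king"], vecs["man"], vecs["woman"])
	fmt.Println(nearest(vecs, target, "king", "man", "woman"))
}
```

The query words are excluded from the search because, with real embeddings, the nearest vector to the result is often one of the inputs.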

## Features

The following models to capture the word vectors are supported in *wego*:

## Models
- Word2Vec: Distributed Representations of Words and Phrases and their Compositionality [[pdf]](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

🎃 Word2Vec: Distributed Representations of Words and Phrases and their Compositionality [[pdf]](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- GloVe: Global Vectors for Word Representation [[pdf]](http://nlp.stanford.edu/pubs/glove.pdf)

🎃 GloVe: Global Vectors for Word Representation [[pdf]](http://nlp.stanford.edu/pubs/glove.pdf)
- LexVec: Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations [[pdf]](http://anthology.aclweb.org/P16-2068)

🎃 LexVec: Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations [[pdf]](http://anthology.aclweb.org/P16-2068)
Also, wego provides nearest neighbor search tools that calculate the distances between word vectors and find the nearest words for the target word. "near" for word vectors means "similar" for words.

Please see the [Usage](#Usage) section if you want to know how to use these for more details.

## Why Go?

[Data Science in Go](https://speakerdeck.com/chewxy/data-science-in-go) @chewxy
Inspired by [Data Science in Go](https://speakerdeck.com/chewxy/data-science-in-go) @chewxy

## Installation

@@ -29,65 +39,44 @@ $ go get -u github.com/ynqa/wego
$ bin/wego -h
```

## Demo

Run the following command, and start to download [text8](http://mattmahoney.net/dc/textdata.html) corpus and train them by Word2Vec.
## Usage

```
$ sh scripts/demo.sh
```
*wego* provides CLI and Go SDK for word embeddings.

## Usage
### CLI

```
Usage:
wego [flags]
wego [command]
Available Commands:
console Console to investigate word vectors
glove GloVe: Global Vectors for Word Representation
help Help about any command
lexvec Lexvec: Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
repl Search similar words with REPL mode
search Search similar words
query Query similar words
word2vec Word2Vec: Continuous Bag-of-Words and Skip-gram model
Flags:
-h, --help help for wego
```

## File I/O

### Input
Input corpus requires the format that is divided by space between words like [text8](http://mattmahoney.net/dc/textdata.html) since wego parse with `scanner.Split(bufio.ScanWords)`.

### Output
Wego outputs a .txt file that is described word vector is subject to the following format:

```
<word> <value1> <value2> ...
```

## API

It's also able to train word vectors using wego APIs. Examples are as follows.
### Go SDK

```go
package main

import (
"os"

"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
"github.com/ynqa/wego/pkg/model/word2vec"
)

func main() {
model, err := word2vec.New(
word2vec.WithWindow(5),
word2vec.WithModel(word2vec.Cbow),
word2vec.WithOptimizer(word2vec.NegativeSampling),
word2vec.WithNegativeSampleSize(5),
word2vec.Window(5),
word2vec.Model(word2vec.Cbow),
word2vec.Optimizer(word2vec.NegativeSampling),
word2vec.NegativeSampleSize(5),
word2vec.Verbose(),
)
if err != nil {
@@ -100,6 +89,26 @@ func main() {
}

// write word vector.
model.Save(os.Stdin, save.Aggregated)
model.Save(os.Stdin, vector.Agg)
}
```

## Formats

As training word vectors *wego* requires file format for inputs/outputs.

### Input

Input corpus must be subject to the formats to be divided by space between words like [text8](http://mattmahoney.net/dc/textdata.html).

```
word1 word2 word3 ...
```

### Output

After training *wego* save the word vectors into a txt file with the following format (`N` is the dimension for word vectors you given):

```
<word> <value_1> <value_2> ... <value_N>
```
7 changes: 4 additions & 3 deletions cmd/model/cmdutil/cmdutil.go
@@ -19,13 +19,14 @@ import (

"github.com/spf13/cobra"

"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
)

const (
defaultInputFile = "example/input.txt"
defaultOutputFile = "example/word_vectors.txt"
defaultProf = false
defaultVectorType = vector.Single
)

func AddInputFlags(cmd *cobra.Command, input *string) {
@@ -40,6 +41,6 @@ func AddProfFlags(cmd *cobra.Command, prof *bool) {
cmd.Flags().BoolVar(prof, "prof", defaultProf, "profiling mode to check the performances")
}

func AddSaveVectorTypeFlags(cmd *cobra.Command, typ *save.VectorType) {
cmd.Flags().Var(typ, "save-type", fmt.Sprintf("save vector type. One of: %s|%s", save.Single, save.Aggregated))
func AddVectorTypeFlags(cmd *cobra.Command, typ *vector.Type) {
cmd.Flags().StringVar(typ, "vec-type", defaultVectorType, fmt.Sprintf("word vector type. One of: %s|%s", vector.Single, vector.Agg))
}
14 changes: 7 additions & 7 deletions cmd/model/glove/glove.go
@@ -24,14 +24,14 @@ import (

"github.com/ynqa/wego/cmd/model/cmdutil"
"github.com/ynqa/wego/pkg/model/glove"
"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
)

var (
prof bool
inputFile string
outputFile string
saveVectorType save.VectorType
prof bool
inputFile string
outputFile string
vectorType vector.Type
)

func New() *cobra.Command {
@@ -47,7 +47,7 @@ func New() *cobra.Command {
cmdutil.AddInputFlags(cmd, &inputFile)
cmdutil.AddOutputFlags(cmd, &outputFile)
cmdutil.AddProfFlags(cmd, &prof)
cmdutil.AddSaveVectorTypeFlags(cmd, &saveVectorType)
cmdutil.AddVectorTypeFlags(cmd, &vectorType)
glove.LoadForCmd(cmd, &opts)
return cmd
}
@@ -91,5 +91,5 @@ func execute(opts glove.Options) error {
if err := mod.Train(input); err != nil {
return err
}
return mod.Save(output, saveVectorType)
return mod.Save(output, vectorType)
}
14 changes: 7 additions & 7 deletions cmd/model/lexvec/lexvec.go
@@ -24,14 +24,14 @@ import (

"github.com/ynqa/wego/cmd/model/cmdutil"
"github.com/ynqa/wego/pkg/model/lexvec"
"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
)

var (
prof bool
inputFile string
outputFile string
saveVectorType save.VectorType
prof bool
inputFile string
outputFile string
vectorType vector.Type
)

func New() *cobra.Command {
@@ -47,7 +47,7 @@ func New() *cobra.Command {
cmdutil.AddInputFlags(cmd, &inputFile)
cmdutil.AddOutputFlags(cmd, &outputFile)
cmdutil.AddProfFlags(cmd, &prof)
cmdutil.AddSaveVectorTypeFlags(cmd, &saveVectorType)
cmdutil.AddVectorTypeFlags(cmd, &vectorType)
lexvec.LoadForCmd(cmd, &opts)
return cmd
}
@@ -91,5 +91,5 @@ func execute(opts lexvec.Options) error {
if err := mod.Train(input); err != nil {
return err
}
return mod.Save(output, saveVectorType)
return mod.Save(output, vectorType)
}
14 changes: 7 additions & 7 deletions cmd/model/word2vec/word2vec.go
@@ -23,15 +23,15 @@ import (
"github.com/spf13/cobra"

"github.com/ynqa/wego/cmd/model/cmdutil"
"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
"github.com/ynqa/wego/pkg/model/word2vec"
)

var (
prof bool
inputFile string
outputFile string
saveVectorType save.VectorType
prof bool
inputFile string
outputFile string
vectorType vector.Type
)

func New() *cobra.Command {
@@ -47,7 +47,7 @@ func New() *cobra.Command {
cmdutil.AddInputFlags(cmd, &inputFile)
cmdutil.AddOutputFlags(cmd, &outputFile)
cmdutil.AddProfFlags(cmd, &prof)
cmdutil.AddSaveVectorTypeFlags(cmd, &saveVectorType)
cmdutil.AddVectorTypeFlags(cmd, &vectorType)
word2vec.LoadForCmd(cmd, &opts)
return cmd
}
@@ -91,5 +91,5 @@ func execute(opts word2vec.Options) error {
if err := mod.Train(input); err != nil {
return err
}
return mod.Save(output, saveVectorType)
return mod.Save(output, vectorType)
}
4 changes: 2 additions & 2 deletions examples/word2vec/main.go
@@ -17,7 +17,7 @@ package main
import (
"os"

"github.com/ynqa/wego/pkg/model/modelutil/save"
"github.com/ynqa/wego/pkg/model/modelutil/vector"
"github.com/ynqa/wego/pkg/model/word2vec"
)

@@ -39,5 +39,5 @@ func main() {
}

// write word vector.
model.Save(os.Stdin, save.Aggregated)
model.Save(os.Stdin, vector.Agg)
}
2 changes: 1 addition & 1 deletion pkg/corpus/corpus.go
@@ -26,7 +26,7 @@ type Corpus interface {
Dictionary() *dictionary.Dictionary
Cooccurrence() *co.Cooccurrence
Len() int
Load(*verbose.Verbose, int, *WithCooccurrence) error
Load(*WithCooccurrence, *verbose.Verbose, int) error
}

type WithCooccurrence struct {
2 changes: 1 addition & 1 deletion pkg/corpus/fs/fs.go
@@ -96,7 +96,7 @@ func (c *Corpus) Len() int {
return c.maxLen
}

func (c *Corpus) Load(verbose *verbose.Verbose, logBatch int, with *corpus.WithCooccurrence) error {
func (c *Corpus) Load(with *corpus.WithCooccurrence, verbose *verbose.Verbose, logBatch int) error {
clk := clock.New()
if err := cpsutil.ReadWord(c.doc, func(word string) error {
if c.toLower {
2 changes: 1 addition & 1 deletion pkg/corpus/memory/memory.go
@@ -80,7 +80,7 @@ func (c *Corpus) Len() int {
return c.maxLen
}

func (c *Corpus) Load(verbose *verbose.Verbose, logBatch int, with *corpus.WithCooccurrence) error {
func (c *Corpus) Load(with *corpus.WithCooccurrence, verbose *verbose.Verbose, logBatch int) error {
clk := clock.New()
if err := cpsutil.ReadWord(c.doc, func(word string) error {
if c.toLower {