The idea with the greatest impact in Word2Vec [1][2] was that vector representations could capture linguistic regularities, which can be retrieved through vector arithmetic:
vec('Rome') ≈ vec('Paris') - vec('France') + vec('Italy')
vec('Queen') ≈ vec('King') - vec('Man') + vec('Woman')
Awesome! How could such results be achieved? They come from the following assumption:
The meaning of a word can be inferred by the company it keeps
Indeed, Word2Vec was built using unsupervised learning on a huge quantity of text, by predicting words given their context.
The above amazing example of word vectors exhibiting semantic relationships under arithmetic is actually "too good to be true" for Word2Vec: you might not obtain the expected relationships if you compute them with the pretrained models. The original paper by the Google team in 2014 [3], however, shows that such relationships do emerge when the word vectors are projected with Principal Component Analysis.
Continuous Bag-of-words
Word2vec model. Source: wildml.com
Words are one-hot vectors
Raw words are represented using one-hot encoding. That is, if your vocabulary has \(d\) words, the \(i\)-th word of the vocabulary is represented by a binary vector \(w \in \mathbb{R}^d\) with the \(i\)-th bit activated, all other bits being off. Say "cat", "dog" and "eat" are all the words in your vocabulary. In this order, we encode "cat" as \([1, 0, 0]\), "dog" as \([0, 1, 0]\) and "eat" as \([0, 0, 1]\).
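As a minimal illustration, here is one-hot encoding written with numpy, using the toy three-word vocabulary above (the helper name and the dictionary lookup are just for the example):

import numpy as np

vocab = ["cat", "dog", "eat"]                      # toy vocabulary from the example above
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, d=len(vocab)):
    """Return the one-hot vector of `word` in R^d."""
    vec = np.zeros(d)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0.]
print(one_hot("dog"))  # [0. 1. 0.]
print(one_hot("eat"))  # [0. 0. 1.]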
A word's context is a continuous bag of words
For representing the context of a word, the Continuous Bag-of-Words (CBOW) representation is used. In this representation, each word is encoded as a continuous vector, and we take the average of these vectors. \[ h(c_1,c_2,c_3) = \frac{1}{3} W_i (c_1 + c_2 + c_3) \] \(h\) computes the continuous representation of the context formed by the three words \(c_1, c_2, c_3\), and \(W_i\) is the embedding matrix of the context words. From this continuous representation, we derive the target word by decoding it into a word vector \(o = W_o h(c_1, c_2, c_3)\).
We see that such a model can be trained in an unsupervised way, taking four consecutive words at a time and, for example, using the second word as the target and the remaining three words as the context of the target word.
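Here is a minimal numpy sketch of the CBOW forward pass described by the formulas above; the embedding size, the random initialization and the final softmax are illustrative assumptions, not part of the original model description:

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 2                      # toy vocabulary size d and embedding size n
W_i = rng.normal(size=(n, d))    # input matrix: embeds the context words
W_o = rng.normal(size=(d, n))    # output matrix: decodes back to vocabulary scores

def cbow_forward(c1, c2, c3):
    """h = 1/3 * W_i (c1 + c2 + c3), then o = W_o h, as in the formulas above."""
    h = W_i @ (c1 + c2 + c3) / 3.0
    return W_o @ h               # one score per word of the vocabulary

cat, dog, eat = np.eye(3)                        # one-hot vectors of the toy vocabulary
scores = cbow_forward(cat, dog, eat)
probs = np.exp(scores) / np.exp(scores).sum()    # softmax: probability of each target word
print(probs)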
CBOW reversed is Skip-gram
Skip-gram reverses the CBOW model: we try to predict the context words given an input word. To train such a network using backpropagation, we sum the output layers so that only one vector remains.
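As a rough sketch of how Skip-gram training examples are built (the toy sentence, the whitespace tokenization and the window size of 2 are assumptions for illustration), each word is paired with every word within a small window around it:

def skipgram_pairs(tokens, window=2):
    """Yield (center word, context word) training pairs."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the cat eats the mouse".split()
print(list(skipgram_pairs(sentence)))
# [('the', 'cat'), ('the', 'eats'), ('cat', 'the'), ('cat', 'eats'), ('cat', 'the'), ...]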
CBOW and Skip-gram both end up computing a hidden representation of words from a large amount of text. The original paper by Mikolov et al. used Skip-gram, together with other advanced techniques such as negative sampling, which are out of the scope of this article.
Word2vec in action
Gensim
Let's use the gensim library, which implements the word2vec model. We download an English Wikipedia sample corpus using the convenient gensim-data API.
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
corpus = api.load('text8')  # text8: ~100 MB of cleaned English Wikipedia text
model = Word2Vec(corpus)
model.wv.most_similar("cat")
We obtain the following result:
[('dog', 0.8296105861663818), ('bee', 0.7931321859359741), ('panda', 0.7909268736839294), ('hamster', 0.7631657719612122), ('bird', 0.7588882446289062), ('blonde', 0.7569558620452881), ('pig', 0.7523074746131897), ('goat', 0.7510213851928711), ('ass', 0.7371785640716553), ('sheep', 0.7347163558006287)]
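With the same model we can also probe the vector arithmetic from the introduction; as discussed above, the exact neighbours and scores depend on the training run, so the famous analogy is not guaranteed to come out on top:

# vec('king') - vec('man') + vec('woman') ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))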
spaCy
spaCy is an advanced library for "industrial-strength NLP". After installing spaCy, download the pretrained model
python -m spacy download en_core_web_md
Then
from itertools import product
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")
print("# Info on each token dog, cat, banana, and afskfsd")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
print("# Similarity of dog, cat and banana")
for t1, t2 in product(tokens[:-1], tokens[:-1]):
    print(t1.text, t2.text, t1.similarity(t2))