Adam Oudad

(Machine) Learning log.

Character, word, sentence, and document embeddings are popular because they are efficient dense representations of text. In the case of words, such embeddings represent words by their meaning, their syntactic role, and their relationships within a text.
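As a quick illustration of what "representing meaning" looks like in practice, here is a minimal sketch comparing word vectors with cosine similarity, a common similarity measure for embeddings. The vectors here are made up for the example, not taken from any real model.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for three words.
cat = np.array([0.8, 0.1, 0.0, 0.3])
dog = np.array([0.7, 0.2, 0.1, 0.4])
car = np.array([0.0, 0.9, 0.8, 0.1])

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(cat, dog))  # ~0.97: related meanings, vectors point the same way
print(cosine_similarity(cat, car))  # ~0.12: unrelated meanings, nearly orthogonal vectors
```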

Word embeddings improved markedly with Word2Vec, introduced by Mikolov et al. at Google in 2013. Since then, many kinds of embeddings have been developed for different purposes, for example fastText by Facebook and BERT by Google.
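To get a feel for Word2Vec before the dedicated article, here is a minimal sketch of training it with the gensim library (my choice for the example; any Word2Vec implementation would do). The toy corpus and hyperparameters are only for illustration, and the parameter names follow the gensim 4.x API.

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# Train a small Word2Vec model on the toy corpus.
model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embeddings
    window=3,        # context window size
    min_count=1,     # keep every token in this tiny corpus
)

# Each word is now a dense vector...
vector = model.wv["cat"]

# ...and words can be compared in the embedding space.
print(model.wv.most_similar("cat", topn=3))
```

On a real corpus with millions of tokens, the nearest neighbors returned by `most_similar` start to reflect genuine semantic similarity.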

In this series of articles, I plan to tour the most popular, efficient, and super cool ways to vectorize text. I like to wrap the topics I write about with an overview of the mathematical background and some code to get started quickly, so that is what I will try to do in each article of this series.

There exists a large number of embedding methods for NLP. If you want to dive into the lot, I recommend for example the tutorial Embedding Methods for NLP, by Jason Weston and Antoine Bordes, presented at EMNLP 2014.

