- Introduction
- Skip-gram Model
- Negative Sampling
- Subsampling
- Evaluating Embedding Quality
- Implementation
- Applications
Introduction
Word2vec is a simple and elegant model for learning vector representations of words from text (sequences of words). At its core is the distributional hypothesis: words that frequently appear close to each other in text share similar meanings. For example, “apple” is more similar to “banana” than to “boy” because (“apple”, “banana”) occur in the same sentence more frequently than (“apple”, “boy”). Thus the word2vec vector representation of “apple” will be closer in distance to the vector representation of “banana” than to the vector representation of “boy”.
Given a text, we can represent it as a sequence of words $w_1, w_2, \ldots, w_T$. Here, $T$ represents the total number of words. If we select an arbitrary target word $w_t$, we can define a context of size $C$. The $C$ words before ($w_{t-C}, \ldots, w_{t-1}$) and the $C$ words after ($w_{t+1}, \ldots, w_{t+C}$) are considered the context of nearby words.
For the rest of this article, let $V$ be the size of the vocabulary (number of distinct words) in the text and 300 be the size of the vector representations (embeddings) we are trying to learn. Vector representations are considered “close” if their cosine distance is small.
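As a quick illustration (a minimal sketch with made-up vectors, not part of the model itself), cosine distance between two embeddings can be computed like this:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; small values mean the vectors point in similar directions."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 300-dimensional embeddings (random here, just to show the call).
rng = np.random.default_rng(0)
apple, banana = rng.normal(size=300), rng.normal(size=300)
print(cosine_distance(apple, banana))
```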
Skip-gram Model
For a single target word $w_t$ and a context of size $C$, we have $2C$ (input, output) pairs, where the input is $w_t$ and the output is a word in the context. For example, with the phrase “restocked the apples and pears”, target word “apples”, and $C = 2$, the (input, output) pairs are:
- (apples, restocked)
- (apples, the)
- (apples, and)
- (apples, pears)
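As a rough sketch (not a reference implementation), such pairs can be generated from a tokenized text with a simple sliding window:

```python
def skipgram_pairs(words, C=2):
    """Return (input, output) pairs: each target word paired with every word
    within C positions of it."""
    pairs = []
    for t, target in enumerate(words):
        lo, hi = max(0, t - C), min(len(words), t + C + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((target, words[j]))
    return pairs

print(skipgram_pairs("restocked the apples and pears".split(), C=2))
```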
During training, for each (input, output) pair $(w_I, w_O)$, we try to maximize the log likelihood as our objective function:

$$\log p(w_O \mid w_I)$$
In this model, for each (input, output) pair, the input and output words are each represented as 1-hot vectors of size $V$. For example, if “apple” is the third word in the vocabulary, its 1-hot vector is $[0, 0, 1, 0, \ldots, 0]$. The model has two sets of embeddings for each word, an input embedding and an output embedding (the input embeddings are usually taken to be the word2vec embeddings after training takes place). $W$ is a $V \times 300$ weight matrix of input embeddings where row $i$ contains the input embedding for the $i$th word in the vocabulary. Similarly, $W'$ is a $300 \times V$ weight matrix of output embeddings where column $j$ contains the output embedding for the $j$th word in the vocabulary.
The input 1-hot vector $x$ is first multiplied by $W$ to produce $h$, the 300-dimensional embedding for the input word (this operation essentially just selects the relevant embedding row from $W$):

$$h = W^\top x$$
Next, the embedding $h$ is multiplied by $W'$ to create $u = W'^\top h$, a $V$-dimensional vector where entry $u_k$ is the dot product between the input word's input embedding and the $k$th word in the vocabulary's output embedding. Since the dot product between closer vectors is higher, $u_k$ should be higher when the $k$th word is more similar to the input word. We can now calculate the probability of the output word given the input word by applying the softmax function to the output word's entry in $u$:

$$p(w_O \mid w_I) = \frac{\exp(u_O)}{\sum_{k=1}^{V} \exp(u_k)}$$
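A minimal numpy sketch of this forward pass (the matrix and variable names here are my own, chosen to match the description above, with `W` as the $V \times 300$ input matrix and `W_out` playing the role of $W'$):

```python
import numpy as np

V, N = 10_000, 300                              # vocabulary size, embedding size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, N))         # input embeddings (one row per word)
W_out = rng.normal(scale=0.01, size=(N, V))     # output embeddings (one column per word)

input_idx, output_idx = 3, 42    # hypothetical word indices for (input, output)
h = W[input_idx]                 # multiplying the 1-hot vector by W just selects this row
u = h @ W_out                    # V scores: dot products with every word's output embedding
p = np.exp(u - u.max())
p /= p.sum()                     # softmax over the entire vocabulary
log_likelihood = np.log(p[output_idx])
```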
Computing the gradient of the objective function for a single training example is computationally expensive because we need to do a number of calculations that is proportional to the vocabulary size (we need to update every word’s output embedding). This brings us to a more clever and efficient model formulation, negative sampling.
Negative Sampling
The idea of negative sampling is that for each (input, output) pair, we sample $k$ negative (input, random) pairs, with the random words drawn from the unigram distribution (the frequency distribution of words in the text). So now, given the same text, we suddenly have $k + 1$ times as many training pairs as before. Continuing our last example and taking $k = 2$, for the pair (apples, pears), we now have 3 training examples:
- (apples, pears) — real pair
- (apples, random word 1) — negative pair
- (apples, random word 2) — negative pair
During training, for each pair, we try to maximize an objective function that tries to differentiate real pairs from noise using logistic regression:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) \ \text{for a real pair}, \qquad \log \sigma\left(-{v'_{w_O}}^{\top} v_{w_I}\right) \ \text{for a negative pair}$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, $v_{w_I}$ is the input word’s input embedding, and $v'_{w_O}$ is the output/random word’s output embedding. Notice that in the objective function, the sign of the dot product between $v_{w_I}$ and $v'_{w_O}$ is negative for negative pairs. The objective function for a negative pair achieves a higher value when the embeddings are very different (have a very negative dot product). This is because when we randomly sample a word to generate a negative (input, random) pair, we expect the random word not to be similar to the input word.
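A sketch of this objective for one real pair and its $k$ negative samples (variable names are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(v_in, v_out_pos, v_out_negs):
    """v_in: input embedding of the input word.
    v_out_pos: output embedding of the real output word.
    v_out_negs: list of output embeddings of the k sampled noise words."""
    obj = np.log(sigmoid(np.dot(v_out_pos, v_in)))    # reward a high dot product for the real pair
    for v_neg in v_out_negs:
        obj += np.log(sigmoid(-np.dot(v_neg, v_in)))  # reward a low dot product for noise pairs
    return obj
```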
The word2vec paper found that, empirically, drawing negative samples from the unigram distribution $U(w)$ raised to the $3/4$ power, $U(w)^{3/4}$, outperformed the plain unigram distribution $U(w)$. Intuitively, this flattens the distribution so that more frequent words are slightly underweighted compared to before and less frequent words are slightly overweighted compared to before.
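One way to draw negative samples from this smoothed distribution (a sketch; `counts` is a hypothetical array of word counts indexed by word id):

```python
import numpy as np

def make_noise_distribution(counts, power=0.75):
    """Unigram distribution raised to the 3/4 power and renormalized."""
    probs = np.asarray(counts, dtype=float) ** power
    return probs / probs.sum()

def sample_negatives(noise_probs, k, rng=np.random.default_rng()):
    """Draw k noise word ids according to the smoothed distribution."""
    return rng.choice(len(noise_probs), size=k, p=noise_probs)
```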
Subsampling
Subsampling of frequent words speeds up training and improves the vector representations of less frequent words. This is because we observe common words such as “the” so many times that we learn their embeddings very well, and further training examples don’t change their embeddings significantly. So it is a better use of training time to prioritize the training examples of the less frequent words, so that we can learn good embeddings for them as well. The subsampling method used in the word2vec paper discards a word $w_i$ (removes it as a target word and as part of any other target word’s training context) from the text with probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a chosen threshold and $f(w_i)$ is the frequency of word $w_i$. So the more frequent a word is, the more likely each of its occurrences is to be discarded.
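A sketch of this discard rule (here `freqs[w]` is a hypothetical mapping from a word to the fraction of the corpus it occupies, and `t` is the threshold):

```python
import numpy as np

def subsample(words, freqs, t=1e-5, rng=np.random.default_rng()):
    """Drop each occurrence of word w with probability 1 - sqrt(t / f(w));
    very frequent words are discarded most of the time."""
    kept = []
    for w in words:
        p_discard = 1.0 - np.sqrt(t / freqs[w])
        if rng.random() >= p_discard:   # i.e. keep with probability min(1, sqrt(t / f(w)))
            kept.append(w)
    return kept
```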
Evaluating Embedding Quality
How can we tell how good our vector representations are? Well for one, we can manually inspect them — words that are related or have similar meanings (e.g. apple and pear) should have vectors that are close in distance. In the word2vec paper, the word embeddings are evaluated using an analogical reasoning task. Analogies such as “Germany” : “Berlin” :: “France”: ? are solved by finding the word having the vector closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”). This analogy would be successfully solved if the resulting word vector was vec(“Paris”).
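A small sketch of this analogy test (assuming a hypothetical dictionary `emb` mapping words to their learned vectors):

```python
import numpy as np

def solve_analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), e.g. solve_analogy('Germany', 'Berlin', 'France', emb)
    should return 'Paris' if the embeddings are good."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```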
Implementation
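Putting the pieces together, here is a minimal sketch of skip-gram training with negative sampling in plain numpy. This is not the original C implementation; the corpus, hyperparameters, and variable names are made up for illustration, and details like subsampling and learning-rate decay are omitted:

```python
import numpy as np

# Toy setup: a real run would map a tokenized text to integer word ids.
rng = np.random.default_rng(0)
corpus = rng.integers(0, 50, size=2000)      # hypothetical text of 2000 word ids
V, N, C, k, lr = 50, 25, 2, 5, 0.025         # vocab size, embedding dim, window, negatives, learning rate

W = rng.normal(scale=0.01, size=(V, N))      # input embeddings (one row per word)
W_out = np.zeros((V, N))                     # output embeddings (stored row-wise here for convenience)

counts = np.bincount(corpus, minlength=V).astype(float)
noise = counts ** 0.75
noise /= noise.sum()                         # smoothed unigram distribution for drawing negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(5):
    for t, wi in enumerate(corpus):
        lo, hi = max(0, t - C), min(len(corpus), t + C + 1)
        for j in range(lo, hi):
            if j == t:
                continue
            wo = corpus[j]                                   # real output word
            negs = rng.choice(V, size=k, p=noise)            # k sampled noise words
            targets = np.concatenate(([wo], negs))
            labels = np.concatenate(([1.0], np.zeros(k)))    # 1 for the real pair, 0 for noise
            h = W[wi]                                        # input embedding of the input word
            out_vecs = W_out[targets]                        # output embeddings of real + noise words
            grad = sigmoid(out_vecs @ h) - labels            # gradient of the logistic loss w.r.t. scores
            grad_h = grad @ out_vecs                         # gradient w.r.t. the input embedding
            W_out[targets] -= lr * np.outer(grad, h)         # update output embeddings
            W[wi] -= lr * grad_h                             # update input embedding
```

After training, row $i$ of `W` can be taken as the learned embedding for word $i$, matching the convention above of using the input embeddings as the final word vectors.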
Applications
Word2vec can be a way to featurize text that is more informative than simple word counts, because it also captures a degree of semantic meaning and the relationships between words. The vectors produced by word2vec can then be used as features for downstream natural language processing tasks.
Another interesting and less obvious application of word2vec is recommending items to users based on interaction history. For example, we can consider a user’s listening session to be a “text”, where each “word” is a song. By applying word2vec, we can learn “song” vectors and recommend new songs whose vectors are close to those of the songs a user already listens to.
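As a sketch of this idea (using gensim’s `Word2Vec` with made-up listening sessions; the song ids are hypothetical and parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's listening session; each "word" is a song id.
sessions = [
    ["song_a", "song_b", "song_c", "song_b"],
    ["song_b", "song_c", "song_d"],
    ["song_a", "song_c", "song_e", "song_d"],
]

model = Word2Vec(
    sentences=sessions,
    vector_size=32,   # embedding dimension
    window=3,         # context size within a session
    min_count=1,      # keep every song, even rare ones
    sg=1,             # skip-gram
    negative=5,       # negative sampling
)

# Songs that co-occur in sessions end up with nearby vectors.
print(model.wv.most_similar("song_c", topn=3))
```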