
Word2vec and Negative Sampling

Introduction

Word2vec is a simple and elegant model for learning vector representations of words from text (sequences of words). At its core is the distributional hypothesis: words that frequently appear close to each other in text share similar meanings. For example, “apple” is more similar to “banana” than to “boy” because (“apple”, “banana”) occur in the same sentence more frequently than (“apple”, “boy”). Thus the word2vec vector representation of “apple” will be closer in distance to the vector representation of “banana” than to the vector representation of “boy”.

Given a text, we can represent it as a sequence of words: $w_1, w_2, w_3, ..., w_T$, where $T$ is the total number of words. If we select an arbitrary target word $w_t$, we can define a context of size $c$: the $c$ words before $w_t$ ($w_{t-1}, w_{t-2}, ..., w_{t-c}$) and the $c$ words after it ($w_{t+1}, w_{t+2}, ..., w_{t+c}$) are considered the context of nearby words.

The words "restocked", "the", "and", and "pears" are in the context of the target word "apple"

For the rest of this article, let $V$ be the size of the vocabulary (the number of distinct words) in the text and 300 be the size of the vector representations (embeddings) we are trying to learn. Vector representations are considered “close” if their cosine distance is small.
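
As a concrete illustration, here is a minimal NumPy sketch of the cosine distance between two embeddings (the function name and the toy vectors are my own, not from the article):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity between vectors u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 300-dimensional embeddings; similar words should give a small distance.
apple = np.random.randn(300)
banana = apple + 0.1 * np.random.randn(300)   # nearly parallel to "apple"
boy = np.random.randn(300)

print(cosine_distance(apple, banana))  # small
print(cosine_distance(apple, boy))     # larger
```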

Skip-gram Model

For a single target word $w_t$ and a context of size $c$, we have $2c$ (input, output) pairs, where the input is $w_t$ and the output is a word in the context. In the example above, the (input, output) pairs are (see the sketch after this list):

  • (apples, restocked)
  • (apples, the)
  • (apples, and)
  • (apples, pears)
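
A minimal sketch of extracting these (input, output) pairs from a tokenized text with context size $c$ (the function and variable names are my own):

```python
def skipgram_pairs(words, c):
    """Return (target, context) pairs for every target word and a window of size c."""
    pairs = []
    for t, target in enumerate(words):
        window = words[max(0, t - c):t] + words[t + 1:t + 1 + c]
        pairs.extend((target, context) for context in window)
    return pairs

words = "the store restocked the apples and pears".split()
print(skipgram_pairs(words, 2))
# ... includes ('apples', 'restocked'), ('apples', 'the'), ('apples', 'and'), ('apples', 'pears')
```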

During training, for each (input, output) pair, we try to maximize the log likelihood as our objective function:

$$J_{ML} = \log p(w_o \mid w_i)$$

In the model above, for each (input, output) pair, the input and output words are each represented as 1-hot vectors of size $1 \times V$. For example, if “apple” is the third word in the vocabulary, its 1-hot vector is $[0, 0, 1, 0, 0, ..., 0, 0]$. The model has two sets of embeddings for each word, an input embedding and an output embedding (the input embeddings are usually taken to be the word2vec embeddings after training takes place). $E_i$ is a $V \times 300$ weight matrix of input embeddings where row $k$ contains the input embedding for the $k$th word in the vocabulary. Similarly, $E_o$ is a $300 \times V$ weight matrix of output embeddings where column $k$ contains the output embedding for the $k$th word in the vocabulary.

The input 1-hot vector is first multiplied by $E_i$ to produce $e_i$, the 300-dimensional embedding for the input word (this operation essentially just selects the relevant embedding row from $E_i$):

$$w_i E_i = e_i$$

Next, the embedding $e_i$ is multiplied by $E_o$ to create $s_i$, a $V$-dimensional vector where entry $k$ is the dot product between the input word’s input embedding $e_i$ and the output embedding of the $k$th word in the vocabulary. Since the dot product between closer vectors is higher, this value should be higher when the $k$th word is more similar to the input word. We can now calculate the probability of the output word given the input word by applying the softmax function to the output word’s entry in $s_i$:

$$p(w_o \mid w_i) = \frac{\exp(s_i[o])}{\sum_{k=1}^{V} \exp(s_i[k])}$$

$$J_{ML} = \log p(w_o \mid w_i) = \log \frac{\exp(s_i[o])}{\sum_{k=1}^{V} \exp(s_i[k])}$$
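
Putting the pieces together, here is a minimal NumPy sketch of this forward pass for one (input, output) pair. The matrix names follow the notation above; the vocabulary size, the random initialization, and the word indices are illustrative:

```python
import numpy as np

V, d = 10000, 300                   # vocabulary size and embedding dimension
E_i = 0.01 * np.random.randn(V, d)  # input embeddings, one row per word
E_o = 0.01 * np.random.randn(d, V)  # output embeddings, one column per word

def log_likelihood(i, o):
    """log p(w_o | w_i) under the full-softmax skip-gram model."""
    e_i = E_i[i]            # equivalent to multiplying the 1-hot input vector by E_i
    s_i = e_i @ E_o         # V scores: dot products with every output embedding
    log_probs = s_i - np.log(np.sum(np.exp(s_i)))  # log softmax over the vocabulary
    return log_probs[o]

print(log_likelihood(3, 42))
```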

Computing the gradient of the objective function for a single training example is computationally expensive because we need to do a number of calculations that is proportional to the vocabulary size $V$ (we need to update every word’s output embedding). This brings us to a more clever and efficient model formulation, negative sampling.

Negative Sampling

The idea of negative sampling is that for each (input, output) pair, we sample $k$ negative (input, random) pairs from the unigram distribution (the distribution of all words in the vocabulary). So now, given the same text, we suddenly have $k+1$ times as many training pairs as before. Continuing our last example and taking $k=2$, for the pair (apples, pears) we now have 3 training examples:

  • (apples, pears) — real pair
  • (apples, random word 1) — negative pair
  • (apples, random word 2) — negative pair

During training, for each pair, we maximize an objective function that differentiates real pairs from noise using logistic regression:

$$J_{NS} = \log \sigma(e_i^T e_o) \quad \text{if positive pair}$$

$$J_{NS} = \log \sigma(-e_i^T e_o) \quad \text{if negative pair}$$

where $e_i$ is the input word’s input embedding and $e_o$ is the output/random word’s output embedding. Notice that in the objective function, the dot product between $e_i$ and $e_o$ is negated for negative pairs. The objective function for a negative pair achieves a higher value when the embeddings are very different (have a very negative dot product). This is because when we randomly sample a word to generate a negative (input, random) pair, we expect the random word not to be similar to the input word.
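
As a rough sketch of this per-pair objective (function and variable names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_objective(e_i, e_o, is_positive):
    """log sigma(e_i . e_o) for a real pair, log sigma(-e_i . e_o) for a sampled one."""
    score = np.dot(e_i, e_o)
    return np.log(sigmoid(score if is_positive else -score))

d = 300
e_i, e_o = np.random.randn(d), np.random.randn(d)
print(ns_objective(e_i, e_o, True), ns_objective(e_i, e_o, False))
```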

The word2vec paper found that, empirically, drawing negative samples from the unigram distribution raised to the $\frac{3}{4}$ power, $U(w)^{3/4}/Z$, outperformed the unigram distribution $U(w)$. Intuitively, this flattens $U(w)$ so that more frequent words are slightly underweighted and less frequent words are slightly overweighted compared to before.
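
A small sketch of drawing negative samples from this smoothed distribution (the toy counts and names are illustrative):

```python
import numpy as np

counts = np.array([500, 120, 60, 20, 5], dtype=float)  # toy unigram counts per word id
unigram = counts / counts.sum()                          # U(w)
smoothed = unigram ** 0.75
smoothed /= smoothed.sum()                               # U(w)^(3/4) / Z

k = 2
negatives = np.random.choice(len(counts), size=k, p=smoothed)
print(negatives)  # ids of the k sampled negative words
```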

Subsampling

Subsampling of frequent words speeds up training and improves the vector representations of less frequent words. This is because we observe common words such as “the” so many times that we learn their embeddings very well, and further training examples don’t change their embeddings significantly. It is a better use of training time to prioritize the training examples of less frequent words, so that we can learn good embeddings for them as well. The subsampling method used in the word2vec paper discards a word $w_i$ (removing it as a target word and from any other target word’s training context) from the text with probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a chosen threshold and $f(w_i)$ is the frequency of word $i$: $f(w_i) = \frac{\text{count of word } i}{\text{total number of words in the text}}$.
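
A minimal sketch of this subsampling step (the function name is my own; $t = 10^{-5}$ is the threshold suggested in the word2vec paper):

```python
import random
from collections import Counter

def subsample(words, t=1e-5):
    """Randomly discard each word occurrence with probability 1 - sqrt(t / f(w))."""
    counts = Counter(words)
    total = len(words)

    def keep(w):
        f = counts[w] / total                 # frequency of the word in the text
        p_discard = 1.0 - (t / f) ** 0.5      # negative for rare words, i.e. always kept
        return random.random() >= p_discard

    return [w for w in words if keep(w)]
```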

Evaluating Embedding Quality

How can we tell how good our vector representations are? Well for one, we can manually inspect them — words that are related or have similar meanings (e.g. apple and pear) should have vectors that are close in distance. In the word2vec paper, the word embeddings are evaluated using an analogical reasoning task. Analogies such as “Germany” : “Berlin” :: “France”: ? are solved by finding the word having the vector closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”). This analogy would be successfully solved if the resulting word vector was vec(“Paris”).
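
A sketch of this analogy evaluation, assuming a hypothetical `embeddings` dictionary mapping each word to its NumPy vector (names are my own):

```python
import numpy as np

def solve_analogy(embeddings, a, b, c):
    """Return the word whose vector is closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# solve_analogy(embeddings, "Germany", "Berlin", "France")  # ideally returns "Paris"
```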

Implementation

TensorFlow implementation

Applications

Word2vec can be a way to featurize text that is more informative than simple word counts because it also captures a degree of semantic meaning and the relationships between words. The vectors produced by word2vec can then be utilized as features for downstream natural language processing tasks.
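
One common way to do this (a sketch, assuming a hypothetical `embeddings` lookup learned with word2vec) is to average the vectors of the words in a document:

```python
import numpy as np

def featurize(document, embeddings, dim=300):
    """Represent a document as the mean of its words' word2vec vectors."""
    vectors = [embeddings[w] for w in document.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)       # no known words: fall back to a zero vector
    return np.mean(vectors, axis=0)
```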

Another interesting and less obvious application of word2vec is recommending items to users based on interaction history. For example, we can consider a user’s listening session to be a “text” where each “word” is a song. By applying word2vec, we can learn “song” vectors and recommend users new songs that are similar to the ones they listen to (songs whose vectors are close).
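
For example, with a library such as gensim this looks just like training on text. The session data and parameter values below are illustrative, and the parameter names follow recent gensim releases:

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's listening session; each "word" is a song id.
sessions = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
    ["song_a", "song_c", "song_d"],
]

model = Word2Vec(
    sessions,
    vector_size=300,    # embedding dimension
    window=5,           # context size c
    sg=1,               # skip-gram
    negative=5,         # k negative samples per pair
    ns_exponent=0.75,   # the 3/4 smoothing of the unigram distribution
    sample=1e-5,        # subsampling threshold t
    min_count=1,
)

print(model.wv.most_similar("song_c", topn=2))  # songs "closest" to song_c
```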
