Hello everyone,

I decided to summarise some of the tips that were discussed in the previous two labs.

BSG

In the BSG model, we have a parameterised prior that conditions on the central word $w$:

\begin{equation} Z|w \sim \mathcal N(\mu_w, \sigma^2_w) \end{equation}

To get it right we need a matrix of location vectors ($L$) and a matrix of scale vectors ($S$):

  • $L$ has shape [emb_dim, vocab_size] and is initialised at random
  • $S$ has shape [emb_dim, vocab_size] and is initialised at random: note that I apply a softplus to make sure scales are strictly positive

Throughout, let the function onehot(x) return a one-hot encoding of a word x (thus the result of onehot(x) has shape [vocab_size, 1]).

Generative model

To get a prior location and scale for a central word $w$ we basically use matrix multiplication:

  • prior location: $\mu_w = \text{matmul}(L, \text{onehot}(w))$
  • prior scale: $\sigma_w = \text{softplus}(\text{matmul}(S, \text{onehot}(w)))$ (see the code sketch below)
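To make this concrete, here is a minimal sketch in PyTorch (the post does not prescribe a framework, so treat the module names and sizes as assumptions); an nn.Embedding lookup plays the role of the matmul with a one-hot vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 10000, 100   # hypothetical sizes

# L holds prior locations, S holds pre-softplus prior scales; both random at init.
L = nn.Embedding(vocab_size, emb_dim)
S = nn.Embedding(vocab_size, emb_dim)

def prior_params(w):
    """w: tensor of central-word ids with shape [batch]."""
    loc = L(w)                 # equivalent to matmul(L, onehot(w)), shape [batch, emb_dim]
    scale = F.softplus(S(w))   # strictly positive scales
    return loc, scale
```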

The decoder must obtain, for each central word $w$ and (sampled) embedding $z$, the parameters of a Categorical distribution over the vocabulary of the language (that is, a vocab_size-dimensional probability vector $\mathbf f_w$):

\begin{equation} C_j|w,z \sim \text{Cat}(\mathbf f_w) \end{equation}

This distribution is used to assign a likelihood to context words given a central word $w$ and its (sampled) embedding $z$.

The vector $\mathbf f_w$ is computed via a simple feedforward NN that takes $z$ (and possibly a deterministic embedding of $w$) as input, followed by a softmax so that the output is a valid probability vector.
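One possible way to parameterise this decoder, as a PyTorch sketch (the hidden size and the use of a deterministic embedding of $w$ are my own choices, not requirements):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BSGDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)  # deterministic embedding of w
        self.hidden = nn.Linear(2 * emb_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, z, w):
        """z: [batch, emb_dim] sampled embeddings; w: [batch] central-word ids."""
        h = torch.relu(self.hidden(torch.cat([z, self.word_emb(w)], dim=-1)))
        return F.log_softmax(self.output(h), dim=-1)  # log of f_w, shape [batch, vocab_size]
```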

Inference model

Let $E$ denote a matrix of deterministic embeddings, thus $E$ has shape [emb_dim, vocab_size]. In the paper, the authors

  1. Concatenate a deterministic representation of each context word $c_j$ with a deterministic representation of the central word $w$: $\text{concat}(\text{matmul}(E, \text{onehot}(c_j)), \text{matmul}(E, \text{onehot}(w)))$.
  2. To the result, they apply a projection followed by a ReLU.
  3. Then, they aggregate all vectors using elementwise sum.
  4. To the resulting vector, they apply an affine layer for the variational location and another affine layer with softplus activation for the variational scale (see the code sketch after this list).
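Here is a sketch of those four steps in PyTorch (window size, hidden size and layer names are assumptions on my part):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BSGInferenceNet(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)       # deterministic embeddings E
        self.proj = nn.Linear(2 * emb_dim, hidden_dim)     # step 2: projection + ReLU
        self.loc_layer = nn.Linear(hidden_dim, emb_dim)    # step 4: variational location
        self.scale_layer = nn.Linear(hidden_dim, emb_dim)  # step 4: variational scale

    def forward(self, w, context):
        """w: [batch] central-word ids; context: [batch, num_context] context-word ids."""
        w_emb = self.emb(w).unsqueeze(1).expand(-1, context.size(1), -1)
        pair = torch.cat([self.emb(context), w_emb], dim=-1)   # step 1: concat per context word
        h = torch.relu(self.proj(pair)).sum(dim=1)             # steps 2-3: ReLU, then elementwise sum
        loc = self.loc_layer(h)
        scale = F.softplus(self.scale_layer(h))                # softplus keeps the scale positive
        return loc, scale
```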

ELBO

Just get a reparameterised sample from the inference model, stick that into the "decoder", and build the loss (negative ELBO); all terms are now available.
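As a sketch, assuming the modules above and a mini-batch of (central word, context words) pairs, the loss could look like this; the analytic KL between diagonal Gaussians comes for free from torch.distributions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def bsg_loss(w, context, inference_net, decoder, prior_params):
    loc_q, scale_q = inference_net(w, context)     # q(z | w, context)
    loc_p, scale_p = prior_params(w)               # p(z | w)
    q, p = Normal(loc_q, scale_q), Normal(loc_p, scale_p)
    z = q.rsample()                                # reparameterised sample
    log_f = decoder(z, w)                          # [batch, vocab_size] log-probabilities
    log_lik = log_f.gather(-1, context).sum(-1)    # sum of log P(c_j | z, w) over context positions
    kl = kl_divergence(q, p).sum(-1)               # sum over embedding dimensions
    return (kl - log_lik).mean()                   # negative ELBO, averaged over the batch
```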

Softmax approximation

If you find it difficult to implement the softmax approximation, you can ignore it and instead use a smaller vocabulary, mapping stop words as well as infrequent words to UNK.

If you want to implement it though, this is how you do it.

First, recall what we are trying to do, namely, parameterise a conditional probability over context words given an embedding:

\begin{equation} C_j|z \sim \text{Cat}(\mathbf f_j) \end{equation}

When we use a softmax layer, $\mathbf f_j$ is a V-dimensional probability vector and getting the probability $P(c|z)$ corresponds to a lookup, i.e. the entry of $\mathbf f_j$ that corresponds to $c$.

The first thing to realise is that a softmax is not the only way to obtain valid parameters $\mathbf f_j$. Another way is inspired by logistic regression, where we write

\begin{equation} P(c|z) = \frac{\exp(s(z,c))}{\sum_{c'} \exp(s(z,c'))} \end{equation}

where $s(z,c)$ is a score function that measures a sort of strength of association between $z$ and $c$. The $\exp$ is there to make sure the numerator is positive, in which case the denominator normalises the distribution correctly.

Another way to obtain a valid numerator is to replace the $\exp$ by any other positive scalar function $u(z,c)$:

\begin{equation} P(c|z) = \frac{u(z,c)}{\sum_{c'} u(z,c')} \end{equation}

In this paper, the authors design $u(z,c)$ using two components very conveniently chosen:

  1. a fixed distribution $P(c)$ over context words
  2. the prior probability of the embedding $z$ had $c$ been the central word, i.e. $\mathcal N(z|\mu_c, \sigma_c^2)$

That is, $u(z,c) = P(c)\, \mathcal N(z|\mu_c, \sigma_c^2)$. Note that this is valid since $u(z,c)$ remains a positive scalar function and therefore the conditional probability of the context word

\begin{equation} P(c|z) = \frac{P(c)\, \mathcal N(z|\mu_c, \sigma_c^2)}{\sum_{c'} P(c')\, \mathcal N(z|\mu_{c'}, \sigma_{c'}^2)} \end{equation}

can be properly normalised.

Now we turn back to the ELBO, where we will find a term of the kind $\log P(c|z)$, which we can approximate using Jensen's inequality.

First we substitute in the new form of the conditional

\begin{equation} \log P(c|z) = \log \frac{P(c)\, \mathcal N(z|\mu_c, \sigma_c^2)}{\sum_{c'} P(c')\, \mathcal N(z|\mu_{c'}, \sigma_{c'}^2)} \end{equation}

then we separate the numerator and denominator using log identities

\begin{equation} \log P(c|z) = \log\left(P(c)\, \mathcal N(z|\mu_c, \sigma_c^2)\right) - \log \sum_{c'} P(c')\, \mathcal N(z|\mu_{c'}, \sigma_{c'}^2) \end{equation}

then we realise that the second term is in fact the log of an expectation

\begin{equation} \log \sum_{c'} P(c')\, \mathcal N(z|\mu_{c'}, \sigma_{c'}^2) = \log \mathbb{E}_{P(c')}\left[\mathcal N(z|\mu_{c'}, \sigma_{c'}^2)\right] \end{equation}

and then we use Jensen's inequality to derive a lower bound on that term

\begin{equation} \log \mathbb{E}_{P(c')}\left[\mathcal N(z|\mu_{c'}, \sigma_{c'}^2)\right] \ge \mathbb{E}_{P(c')}\left[\log \mathcal N(z|\mu_{c'}, \sigma_{c'}^2)\right] \end{equation}

Now note that we can easily obtain an MC estimate of the second term by using 1 or more samples from P(c).
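Putting the pieces together, here is a sketch of the resulting approximation to $\log P(c|z)$ (function names and the number of samples K are my assumptions; prior_params is the lookup of $\mu_c$ and $\sigma_c$ from the sketch above):

```python
import torch
from torch.distributions import Normal

def approx_log_prob(z, c, log_p_c, prior_params, K=10):
    """z: [batch, emb_dim]; c: [batch] context-word ids;
    log_p_c: [vocab_size] log of the fixed distribution P(c)."""
    loc_c, scale_c = prior_params(c)
    # numerator: log P(c) + log N(z | mu_c, sigma_c^2)
    numerator = log_p_c[c] + Normal(loc_c, scale_c).log_prob(z).sum(-1)
    # MC estimate of E_{P(c')}[log N(z | mu_c', sigma_c'^2)] with K samples c' ~ P(c)
    samples = torch.multinomial(log_p_c.exp(), K, replacement=True)          # [K]
    loc_s, scale_s = prior_params(samples)                                   # [K, emb_dim]
    log_dens = Normal(loc_s, scale_s).log_prob(z.unsqueeze(1)).sum(-1)       # [batch, K]
    return numerator - log_dens.mean(-1)
```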

EmbedAlign

Generative model

  • Prior location: 0 (fixed)
  • Prior scale: 1 (fixed)

Distribution over vocabulary of L1: $\mathbf f_i = \text{softmax}(\text{affine}(z_i))$ for every $i = 1, \ldots, m$. Note that this returns a vector of L1 vocabulary size per word in the L1 sentence $x_1^m$.

Distribution over vocabulary of L2: $\mathbf g_i = \text{softmax}(\text{affine}(z_i))$ (a different affine layer, this one projecting onto the L2 vocabulary) for every $i = 1, \ldots, m$. Note that this returns a vector of L2 vocabulary size per word in $x_1^m$. You did not read it wrong, it's per word in $x_1^m$, not in the L2 sentence $y_1^n$.

Alignment distribution parameter: $1/m$, i.e. uniform over the positions of $x_1^m$.
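A sketch of the two categorical distributions as PyTorch layers (the affine+softmax parameterisation follows the description above; the module and layer names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbedAlignDecoder(nn.Module):
    def __init__(self, emb_dim, vocab_size_l1, vocab_size_l2):
        super().__init__()
        self.l1_layer = nn.Linear(emb_dim, vocab_size_l1)  # produces f_i
        self.l2_layer = nn.Linear(emb_dim, vocab_size_l2)  # produces g_i

    def forward(self, z):
        """z: [batch, m, emb_dim], one sampled embedding per position of x_1^m."""
        log_f = F.log_softmax(self.l1_layer(z), dim=-1)    # [batch, m, |L1 vocab|]
        log_g = F.log_softmax(self.l2_layer(z), dim=-1)    # [batch, m, |L2 vocab|]
        return log_f, log_g
```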

Inference model

Embed words deterministically using a matrix $E$ with shape [emb_dim, vocab_size]: that is, $\mathbf e_i = \text{matmul}(E, \text{onehot}(x_i))$.

If applicable, re-encode the words using a BiLSTM: $\mathbf h_1^m = \text{BiLSTM}(\mathbf e_1^m)$ (otherwise simply take $\mathbf h_i = \mathbf e_i$).

Variational location: $\mu_i = \text{affine}(\mathbf h_i)$ for each $i = 1, \ldots, m$.

Variational scale: $\sigma_i = \text{softplus}(\text{affine}(\mathbf h_i))$ for each $i = 1, \ldots, m$.

This will give you $m$ Gaussians, one per word in $x_1^m$. Then you can obtain 1 sample per position and build the loss.
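A sketch of this inference model, with the BiLSTM as an optional component (all sizes and names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbedAlignInference(nn.Module):
    def __init__(self, vocab_size_l1, emb_dim, hidden_dim=100, use_bilstm=False):
        super().__init__()
        self.emb = nn.Embedding(vocab_size_l1, emb_dim)    # deterministic embeddings E
        self.use_bilstm = use_bilstm
        if use_bilstm:
            self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
            enc_dim = 2 * hidden_dim
        else:
            enc_dim = emb_dim
        self.loc_layer = nn.Linear(enc_dim, emb_dim)
        self.scale_layer = nn.Linear(enc_dim, emb_dim)

    def forward(self, x):
        """x: [batch, m] L1 word ids."""
        h = self.emb(x)
        if self.use_bilstm:
            h, _ = self.encoder(h)                         # re-encode the words in context
        loc = self.loc_layer(h)                            # [batch, m, emb_dim]
        scale = F.softplus(self.scale_layer(h))            # [batch, m, emb_dim]
        return loc, scale
```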

ELBO

The loss (negative ELBO) will have a term

\begin{equation} -\log P(x_i|z_i) \end{equation}

per word in $x_1^m$. This is (minus) the log of the entry in $\mathbf f_i$ that corresponds to $x_i$.

It will also have a term

\begin{equation} -\log P(y_j|z_1^m) \end{equation}

per word in $y_1^n$. This, however, requires a marginalisation over alignments.

\begin{equation} P(y_j|z_1^m) = \sum_{a_j=1}^m P(a_j|m)\, P(y_j|z_{a_j}) \end{equation}

This basically iterates over every $a_j$ from 1 to $m$, selecting the probability of $y_j$ had it been aligned to position $a_j$ in $x_1^m$. To compute it, you basically have to average the probabilities of $y_j$ in $\mathbf g_1$, $\mathbf g_2$, all the way to $\mathbf g_m$. Recall that the alignment probability is uniform ($1/m$), and that's why this is just an average.

Finally, in the loss there will be 1 KL term per word in $x_1^m$, all of which are analytically computable.
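A sketch of the full loss using the decoder and inference sketches above (padding is ignored for brevity; in a real implementation you would mask padded positions):

```python
import torch
from torch.distributions import Normal, kl_divergence

def embedalign_loss(x, y, inference_net, decoder):
    """x: [batch, m] L1 word ids; y: [batch, n] L2 word ids."""
    loc, scale = inference_net(x)
    q = Normal(loc, scale)
    z = q.rsample()                                        # [batch, m, emb_dim]
    log_f, log_g = decoder(z)

    # one term per word in x_1^m: log of the entry of f_i corresponding to x_i
    log_px = log_f.gather(-1, x.unsqueeze(-1)).squeeze(-1).sum(-1)

    # one term per word in y_1^n: average the probabilities over the m positions (uniform alignments)
    marg = log_g.exp().mean(dim=1)                         # [batch, |L2 vocab|], the 1/m average
    log_py = marg.gather(-1, y).clamp_min(1e-12).log().sum(-1)

    # one analytic KL term per word in x_1^m, against the standard Gaussian prior
    kl = kl_divergence(q, Normal(torch.zeros_like(loc), torch.ones_like(scale))).sum(-1).sum(-1)
    return (kl - log_px - log_py).mean()                   # negative ELBO
```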

Softmax approximation

Again, if you find it difficult to implement the softmax approximation, you can ignore it and instead use a smaller vocabulary, mapping stop words as well as infrequent words to UNK. The same strategy applies to both languages in EmbedAlign.

If you want to implement it though, this is how you do it.

I will present it for the distribution over L1 words, but the same holds for the distribution over L2 words. First, recall what we are trying to do, namely, parameterise a conditional probability over L1 words given an embedding:

\begin{equation} X_i|z_i \sim \text{Cat}(\mathbf f_i) \end{equation}

When we use a softmax layer, $\mathbf f_i$ is a $|\mathcal X|$-dimensional probability vector (where $\mathcal X$ is the L1 vocabulary) and getting the probability $P(x_i|z_i)$ corresponds to a lookup, i.e. the entry of $\mathbf f_i$ that corresponds to $x_i$.

The first thing to realise is that a softmax is not the only way to obtain valid parameters $\mathbf f_i$. Another way is inspired by logistic regression, where we write

\begin{equation} P(x_i|z) = \frac{\exp(s(z,x_i))}{\sum_{x' \in \mathcal X} \exp(s(z,x'))} \end{equation}

where $s(z, x_i)$ is a score function that measures a sort of strength of association between $z$ and $x_i$. The $\exp$ is there to make sure the numerator is positive, in which case the denominator normalises the distribution correctly.

We now address efficiency by approximating the denominator such that the sum does not have to go over the entire vocabulary.

We realise that the denominator can be decomposed into sums over two disjoint sets, namely, a set $\mathcal S \subseteq \mathcal X$ and its complement $\mathcal X \setminus \mathcal S$:

\begin{equation} \sum_{x' \in \mathcal X} \exp(s(z,x')) = \sum_{x' \in \mathcal S} \exp(s(z,x')) + \sum_{x' \in \mathcal X \setminus \mathcal S} \exp(s(z,x')) \end{equation}

and the set $\mathcal S$ is such that it includes at least the observation $x_i$.

We then approximate the sum over the complement set via importance sampling. In importance sampling we introduce a fixed proposal distribution $Q$ and express the sum over the complement set as an expectation:

\begin{equation} \sum_{x' \in \mathcal X \setminus \mathcal S} \exp(s(z,x')) = \sum_{x' \in \mathcal X \setminus \mathcal S} Q(x')\, \frac{\exp(s(z,x'))}{Q(x')} = \mathbb{E}_{Q}\left[\frac{\exp(s(z,x'))}{Q(x')}\right] \end{equation}

For as long as $Q$ is non-zero over the complete complement set, the equality is well defined. In this case, we use a uniform distribution over the elements of the complement set. Now we build a Monte Carlo estimate of the expectation by sampling from the proposal:

\begin{equation} \mathbb{E}_{Q}\left[\frac{\exp(s(z,x'))}{Q(x')}\right] \approx \frac{1}{|\mathcal N|} \sum_{x' \in \mathcal N} \frac{\exp(s(z,x'))}{Q(x')} \end{equation}

where $\mathcal N$ is a subset of the complement set and where every element of $\mathcal N$ is drawn from $Q$.

Formally, we define a uniform distribution over subsets of a certain size (where the size is a fixed hyperparameter, 10000 in the paper). In practice, we sample words one at a time without replacement until we reach the specified size. In that case, $Q(x') = \frac{1}{|\mathcal X \setminus \mathcal S|}$.

All we need to do is to specify the scoring function $s(z, x)$. For that we use a deterministic embedding matrix $C$ with shape [emb_dim, vocab_size] initialised at random. We then define the scoring function as

\begin{equation} s(z, x) = z^\top \text{matmul}(C, \text{onehot}(x)) \end{equation}
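Here is a sketch of the resulting approximate normaliser (my assumptions: the set $\mathcal S$ is taken to be the word ids observed in the batch, the negative samples are drawn with replacement for simplicity, and in practice you would work in log-space with logsumexp for numerical stability):

```python
import torch

def approx_normaliser(z, C, observed_ids, n_samples=1000):
    """z: [batch, emb_dim]; C: [emb_dim, vocab_size] deterministic embeddings;
    observed_ids: [k] word ids forming the set S (must include the observations)."""
    vocab_size = C.size(1)
    exact = (z @ C[:, observed_ids]).exp().sum(-1)          # exact sum over S, shape [batch]

    mask = torch.ones(vocab_size, dtype=torch.bool)
    mask[observed_ids] = False
    complement = mask.nonzero().squeeze(-1)                 # ids in the complement set
    neg_ids = complement[torch.randint(len(complement), (n_samples,))]  # x' ~ Q (uniform)

    # importance weight 1/Q(x') = |complement|, averaged over the drawn subset
    approx = (z @ C[:, neg_ids]).exp().mean(-1) * len(complement)
    return exact + approx
```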

Alternatives to BiLSTM for EmbedAlign

In case you cannot afford to train a BiLSTM for EmbedAlign, you need to find an alternative way to make use of context information; otherwise your approximate posteriors will not be sensitive to context (and context is important for tasks such as lexical substitution).

There are several options:

  • you can condition on $\text{emb}(x_i)$ as well as the average embedding of the other words in $x_1^m$:

\begin{equation} \mathbf h_i = \text{concat}\left(\text{emb}(x_i), \frac{1}{m-1} \sum_{k \neq i} \text{emb}(x_k)\right) \end{equation}

  • you can also use the exact same idea as in BSG, namely, condition on a neighbourhood around $x_i$ (the central word plus its context words); a sketch of the first option follows below.
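For the first option, the leave-one-out average can be computed without a loop; a small sketch:

```python
import torch

def avg_context_encoding(emb_x):
    """emb_x: [batch, m, emb_dim] deterministic embeddings of x_1^m."""
    m = emb_x.size(1)
    total = emb_x.sum(dim=1, keepdim=True)         # [batch, 1, emb_dim]
    others_avg = (total - emb_x) / (m - 1)         # average of emb(x_k) for k != i
    return torch.cat([emb_x, others_avg], dim=-1)  # h_i = concat(emb(x_i), average), [batch, m, 2*emb_dim]
```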