简介

本文是翻译Triplet Loss and Online Triplet Mining in TensorFlow而来。

第一次尝试独自翻译外文博客，不通顺的地方多见谅。

原作者写这个博客一开始是因为有人在stackflow提了一个问题问，“TensorFlow怎么实现三元组损失？”原作者便在下面给出了回答，“计算三元组并不困难，困难在于怎么挖掘得到三元组”，提问者问了原作者是否有博客能提供参考。时隔两年，在前两个月，原作者发布了这篇博客。

接下来，开始正文，正文中的“我”是指原作者。

引言

在人脸识别中，三元组损失被用来学习好的脸部嵌入矩阵（或“编码矩阵”）。如果你对三元组损失不太熟悉，你应该先从吴恩达的深度学习课程了解学习一下。

众所周知，三元组损失是难以实现的，特别是在添加了使用TensorFlow构建计算图这个约束。

在这篇博客中，我将定义三元组损失和取样得到三元组的不同策略。然后，我将解释如何正确地在TensorFlow中实现在线挖掘得到三元组损失。

大约在两年前，我在Reminiz实习，做人脸识别。我在stackoverflow上回答了一个关于在TensorFlow中实现三元组损失的问题。我得出的结论：

显然，在Tensorflow中实现三元组损失是很困难的，而且有很多方法可以使它比在python中进行采样更有效，但是解释它们需要一个完整的博客！

两年后了，我们开始吧。

所有的代码都可以在这个github库中找到。

三元组损失与三元组在线挖掘

为什么不使用softmax？

Google的一篇论文“FaceNet: A Unified Embedding for Face Recognition and Clustering”介绍了人脸识别的三元组损失。他们描述了一种使用在线挖掘三元组来训练脸部嵌入矩阵的新方法，这将在下一节中讨论。

通常在监督学习中，我们有固定数量的类，并使用软-最大交叉熵损失来训练网络。然而，在某些情况下，我们需要能够有一个可变数量的类。例如，在人脸识别中，我们需要能够比较两个未知的面孔，并判断它们是否来自同一个人。

在这种情况下，三元组损失是一种学习对每个脸部的良好嵌入的方法。在嵌入空间中，来自同一个人的面孔应该紧密地靠近在一起，形成良好的分离的集群。

损失定义

两张同类脸部（Obama）和一张异类脸部（Macron）

三元组损失的目标是确保：

两个具有相同标签的样本在嵌入空间中它们的嵌入紧密靠近；
两个具有不同标签的样本在嵌入空间中它们的嵌入距离很远。

但是，我们不想把每个样本的训练嵌入推移缩小到非常小的集群。唯一要求是给定两个正（同类）样本和一个负（异类）样本。负的距离应该要比正的距离更大一些。这与SVM中使用的间隔非常相似，这里我们希望每个类的集群都由该间隔分隔开。

为了正式定义该要求，损失函数由嵌入矩阵的三元组定义而成：

一个anchor
一个与anchor同类的正样本
一个与anchor异类的负样本

用于嵌入空间的距离度量$d$，三元$(a,p,n)$的损失为：

$\mathcal{L} = max(d(a, p) - d(a, n) + margin, 0)$

我们最小化损失，它使$d(a, p)$趋向于0，使$d(a, n)$大于$d(a, p) + margin$。一旦$n$为简单的负样本，损失即为0。

三元组挖掘

根据损失的定义，由三种类型的三元组：

简单三元组：损失为0的三元组，因为$d(a, p) + margin < d(a,n)$；
困难三元组：在三元组中，负样本比正样本更靠近anchor，即$d(a,n) < d(a,p)$；
半困难三元组：三元组中，负样本比正样本更不靠近anchor，但仍然有大于0的损失：$d(a, p) < d(a, n) < d(a, p) + margin$

每一个定义都依赖于负样本与anchor和正样本的相对位置。因此，我们可以将这三种类型扩展到负样本：困难负样本、半困难负样本或简单负样本。

下图显示了负样本的嵌入空间的三个对应区域。

给定一个anchor和一个正样本，负样本的三种情况

选择我们想要训练的三元组将会极大地影响我们的指标。在最初的Facenet论文，他们为每对anchor与正样本选取一个随机的半困难负样本，并且在这些三元组上训练。

离线与在线三元组挖掘

我们已经定义了一个三元组的损失，并且已经看到一些三元组比其他的更有用。现在的问题是如何对这些三元组进行取样或“挖掘”。

离线三元组挖掘

第一种产生三元组的方式是在每个epoch开始之前，离线找到它们。我们计算训练集的所有嵌入，然后只选择困难或半困难的三元组。然后我们就可以在这些三元组上训练一个epoch。

具体而言，我们会产生一个三元组$(i, j, k)$的列表。我们会构建$B$个三元组的批次，这意味着我们将计算$3B$个嵌入得到$B$个三元组，之后计算这些三元组的损失并且在网络中反向传播。

总的来说，这种方式不是很有效，因为我们需要对训练集进行全面遍历来生成三元组。它还需要定期离线地更新三元组。

在线三元组挖掘

在Facenet中引入在线三元组挖掘，这在Brandon Amos的博客“OpenFace 0.2.0: Higher accuracy and halved execution time”得到很好的描述。

这里的想法是，为每一批输入计算出有用的三元组。给定一个$B$个样本的批次（即有脸部的$B$张图像），我们计算$B$的嵌入，然后我们可以找到最大为$B^3$个三元组，其中大多数三元组都是无效的（它们没有两个正样本和一个负样本）。

这种方式可以给你的一个batch输入提供更多三元组，并且不需要任何离线挖掘。因此它的效率更高。我们将在最后一部分看到这方面的实现。

在线挖掘的三元组损失：从一组嵌入中计算出三元组。

在线挖掘的策略

在线挖掘中，我们已经从一个$B$输入的批次中计算该批次$B$的嵌入。现在我们想从这些$B$嵌入中生成三元组。

我们有三个索引$i,j,k \in [1,B]$，如果样本$i$和$j$有相同的标签但却相距较远，$k$样本有不同的标签，我们说$(i,j,k)$是一个有效的三元组。剩下的就是要有一个好的策略，在有效的三元组中挑选出一些来计算损失。

下面这两种策略的详细解释可以在这篇论文In Defense of the Triplet Loss for Person Re-Identification的第二部分找到。

他们假设你有一组面孔作为$B=PK$大小的输入，由P个不同的人，每个人有K张图像。一个经典的值是$K=4$。两种策略是：

batch all：选择所有有效的三胞胎，并在困难的和半困难的三胞胎中平均损失。
- 这里的关键点是不考虑简单的三元组（即那些损失为0的），因为把它们加入平均后会使整体的损失非常小；
- 这会产生总共有$PK(K-1)(PK-K)$个三元组（$PK$个anchor，每个anchor有$K-1$个可能的正样本，$PK-K$可能的负样本）。
batch hard：对于每个anchor，在批次中选取出最困难的正样本（$d(a,p)$最大）和最困难的负样本
- 这会产生$PK$个三元组；
- 被选取的三元组是批次中最困难的。

根据上面提到的论文，batch hard获得了最佳表现：

此外，所选的三元组可以被认为是适合的三元组，因为它们在数据的一小部分中是最困难的，这正是最适合学习三元组损失的方法。

然而，这实际上取决于你的数据集，应该通过比较开发的数据集的表现来决定。

一种简单的三元组损失的实现

在stackoverflow的回答中，我简单地实现了离线三元组挖掘的三元组损失：

anchor_output = ...    # shape [None, 128]
positive_output = ...  # shape [None, 128]
negative_output = ...  # shape [None, 128]

d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)

loss = tf.maximum(0.0, margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)

该网络将被复制三次（用共享权重）来产生$B$个anchor，$B$个正样本，$B$个负样本的嵌入。然后我们只需计算这些嵌入的三元组损失。

这是一个简单的实现，但也是一个非常低效的实现，因为它使用的是离线的三元组挖掘。

一个更好的在线三元组挖掘的实现

所有相关的代码都可以在github上的model/triplet_loss.py有提供。

TensorFlow本身有一种半困难的在线挖掘三元组损失的实现，tf.contrib.losses.metric_learning.triplet_semihard_loss。在这里，我们不会遵循这个实现，并从头开始。

计算距离矩阵

因为最后的三元组损失取决于$d(a,p)$和$d(a,n)$，我们首先需要有效地计算成对的距离矩阵。我们在_pairwise_distances实现了欧式范数与欧式范数的平方。

def _pairwise_distances(embeddings, squared=False):
    """Compute the 2D matrix of distances between all the embeddings.

    Args:
        embeddings: tensor of shape (batch_size, embed_dim)
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.

    Returns:
        pairwise_distances: tensor of shape (batch_size, batch_size)
    """
    # Get the dot product between all embeddings
    # shape (batch_size, batch_size)
    dot_product = tf.matmul(embeddings, tf.transpose(embeddings))

    # Get squared L2 norm for each embedding. We can just take the diagonal of `dot_product`.
    # This also provides more numerical stability (the diagonal of the result will be exactly 0).
    # shape (batch_size,)
    square_norm = tf.diag_part(dot_product)

    # Compute the pairwise distance matrix as we have:
    # ||a - b||^2 = ||a||^2  - 2 <a, b> + ||b||^2
    # shape (batch_size, batch_size)
    distances = tf.expand_dims(square_norm, 0) - 2.0 * dot_product + tf.expand_dims(square_norm, 1)

    # Because of computation errors, some distances might be negative so we put everything >= 0.0
    distances = tf.maximum(distances, 0.0)

    if not squared:
        # Because the gradient of sqrt is infinite when distances == 0.0 (ex: on the diagonal)
        # we need to add a small epsilon where distances == 0.0
        mask = tf.to_float(tf.equal(distances, 0.0))
        distances = distances + mask * 1e-16

        distances = tf.sqrt(distances)

        # Correct the epsilon added: set the distances on the mask to be exactly 0.0
        distances = distances * (1.0 - mask)

    return distances

为了更详细地解释代码，我们计算嵌入的点积，维数为$(B,B)$。每个嵌入的欧式范数平方实际上就是在这个点积的对角线上，所以我们用tf.diag_part把它提取出来。最后我们用公式来计算距离：

$\Vert a - b \Vert ^2 = \Vert a \Vert^2 - 2 \langle a, b \rangle + \Vert b \Vert ^2$

一个棘手的问题是如果squared=False，我们取距离矩阵的平方根。首先，我们要确保距离矩阵总是正的。有些值可能是负的，因为计算中的小误差。我们只需要确保每个负值都被设置为0.0。

第二件要注意的事情是如果任何元素都是0.0的话（例如对角线应该都是0.0）。因为$0$的平方根导数是无限的，我们将得到的梯度为nan。为了处理这种情况，我们把等于0.0的值替换为一个比较小的值epsilon=1e-16。然后我们计算平方根，替换$\sqrt{\epsilon}$为0.0的值。

batch all 策略

在这个策略中，我们想要计算几乎所有三元组的三元组损失。在TensorFlow的计算图中，我们构造了3维的Tensor，维数为$(B,B,B)$，其中索引$(i,j,k)$，包含着它们的三元组损失。

然后我们得到一个有效三元组的3维mask_get_triplet_mask。这里mask[i,j,k]能正确识别$(i,j,k)$是一个有效的三元组。

最后，我们把无效的三元组的损失设置为0，并且计算有效的三元组损失的平均值。

这一切都是在batch_all_triplet_loss函数中实现的。

def batch_all_triplet_loss(labels, embeddings, margin, squared=False):
    """Build the triplet loss over a batch of embeddings.

    We generate all the valid triplets and average the loss over the positive ones.

    Args:
        labels: labels of the batch, of size (batch_size,)
        embeddings: tensor of shape (batch_size, embed_dim)
        margin: margin for triplet loss
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.

    Returns:
        triplet_loss: scalar tensor containing the triplet loss
    """
    # Get the pairwise distance matrix
    pairwise_dist = _pairwise_distances(embeddings, squared=squared)

    anchor_positive_dist = tf.expand_dims(pairwise_dist, 2)
    anchor_negative_dist = tf.expand_dims(pairwise_dist, 1)

    # Compute a 3D tensor of size (batch_size, batch_size, batch_size)
    # triplet_loss[i, j, k] will contain the triplet loss of anchor=i, positive=j, negative=k
    # Uses broadcasting where the 1st argument has shape (batch_size, batch_size, 1)
    # and the 2nd (batch_size, 1, batch_size)
    triplet_loss = anchor_positive_dist - anchor_negative_dist + margin

    # Put to zero the invalid triplets
    # (where label(a) != label(p) or label(n)  == label(a) or a == p)
    mask = _get_triplet_mask(labels)
    mask = tf.to_float(mask)
    triplet_loss = tf.multiply(mask, triplet_loss)

    # Remove negative losses (i.e. the easy triplets)
    triplet_loss = tf.maximum(triplet_loss, 0.0)

    # Count number of positive triplets (where triplet_loss > 0)
    valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16))
    num_positive_triplets = tf.reduce_sum(valid_triplets)
    num_valid_triplets = tf.reduce_sum(mask)
    fraction_positive_triplets = num_positive_triplets / (num_valid_triplets + 1e-16)

    # Get final mean triplet loss over the positive valid triplets
    triplet_loss = tf.reduce_sum(triplet_loss) / (num_positive_triplets + 1e-16)

    return triplet_loss, fraction_positive_triplets

_get_triplet_mask的实现非常简单，所以我不会详细说明。

batch hard 策略

在这个策略中，我们想要找到对每个anchor最困难的正样本和负样本。

最困难的正样本

为了计算最困难的正样本，我们从成对的距离矩阵开始。然后我们得到一个有效配对的二维mask$(a,p)$（即$a \neq p$并且$a$和$p$有相同的标签），之后把mask外的其他元素都置为0。

最后一步是取这个经过修改的距离矩阵的每一行的最大距离。结果应该是有效的对$(a,p)$，因为不有效的元素都置为0。

最困难的负样本

最困难的负样本也是类似的，但要计算起来有点难。这里我们需要得到每一行的最小距离，所以我们不能把不有效的对$(a,n)$置为0（不有效是指$a$和$n$有相同的标签）。

这里的技巧是每一行都要为无效的对$(a,n)$添加最大的值。之后然后我们用最小值除以每一行。结果应该是有效的对$(a,n)$，因为不有效的元素都置为最大值。

最后一步是将这些合并到三元组损失中：

1	triplet_loss = tf.maximum(hardest_positive_dist - hardest_negative_dist + margin, 0.0)

这一切都是在batch_hard_triplet_loss函数中实现的。

def batch_hard_triplet_loss(labels, embeddings, margin, squared=False):
    """Build the triplet loss over a batch of embeddings.

    For each anchor, we get the hardest positive and hardest negative to form a triplet.

    Args:
        labels: labels of the batch, of size (batch_size,)
        embeddings: tensor of shape (batch_size, embed_dim)
        margin: margin for triplet loss
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.

    Returns:
        triplet_loss: scalar tensor containing the triplet loss
    """
    # Get the pairwise distance matrix
    pairwise_dist = _pairwise_distances(embeddings, squared=squared)

    # For each anchor, get the hardest positive
    # First, we need to get a mask for every valid positive (they should have same label)
    mask_anchor_positive = _get_anchor_positive_triplet_mask(labels)
    mask_anchor_positive = tf.to_float(mask_anchor_positive)

    # We put to 0 any element where (a, p) is not valid (valid if a != p and label(a) == label(p))
    anchor_positive_dist = tf.multiply(mask_anchor_positive, pairwise_dist)

    # shape (batch_size, 1)
    hardest_positive_dist = tf.reduce_max(anchor_positive_dist, axis=1, keepdims=True)

    # For each anchor, get the hardest negative
    # First, we need to get a mask for every valid negative (they should have different labels)
    mask_anchor_negative = _get_anchor_negative_triplet_mask(labels)
    mask_anchor_negative = tf.to_float(mask_anchor_negative)

    # We add the maximum value in each row to the invalid negatives (label(a) == label(n))
    max_anchor_negative_dist = tf.reduce_max(pairwise_dist, axis=1, keepdims=True)
    anchor_negative_dist = pairwise_dist + max_anchor_negative_dist * (1.0 - mask_anchor_negative)

    # shape (batch_size,)
    hardest_negative_dist = tf.reduce_min(anchor_negative_dist, axis=1, keepdims=True)

    # Combine biggest d(a, p) and smallest d(a, n) into final triplet loss
    triplet_loss = tf.maximum(hardest_positive_dist - hardest_negative_dist + margin, 0.0)

    # Get final mean triplet loss
    triplet_loss = tf.reduce_mean(triplet_loss)

    return triplet_loss

测试我们的实现

如果你不相信上面的实现是按预期工作的，那么你是对的！要确保实现中没有bug，唯一的方法就是为model/triplet_loss.py的每个函数编写测试。

这对于像这样的棘手函数来说尤其重要，因为在TensorFlow中很难实现，但是在python中使用三个嵌套for循环来编写更容易。

测试函数写在model/tests/test_triplet_loss.py，并且将我们的TensorFlow实现的结果与简单的numpy实现的结果进行比较。

为了检查测试是否通过，运行：

1	pytest model/tests/test_triplet_loss.py

（或只是pytest）

下面是执行的测试列表：

test_pairwise_distances()：比较了TensorFlow与numpy的成对距离的结果。
test_pairwise_distances_are_positive()：确保得到的距离是正的。
test_gradients_pairwise_distances()：确保梯度不是nan。
test_triplet_mask()：比较numpy和tensorflow实现。
test_anchor_positive_triplet_mask()：比较numpy和tensorflow实现。
test_anchor_negative_triplet_mask()：比较numpy和tensorflow实现。
test_simple_batch_all_triplet_loss()：只有一种标签的测试。
test_batch_all_triplet_loss()：对batch all策略的全部测试（与numpy相比）
test_batch_hard_triplet_loss()：对batch hard策略的全部测试（与numpy相比）

在MNIST上实验

即使在上面的测试中，也很容易忽视一些错误。例如，一开始我实现了成对的距离而不检查根号下的值是否大于$0$。我通过了所有测试，但是训练过程中的梯度是nan。因此我加入了test_gradients_pairwise_distances，修正了_pairwise_distances函数。

为了使事情简单，我们将测试MNIST的三元组损失。代码可以在这里找到。

为了训练和评价模型，运行：

1	python train.py --model_dir experiments/base_model

这将启动一个新的实验（即一个训练过程）称为base_model。模型目录（包含权重、总结……）在experiments/base_model。这里我们使用了一个json文件experiments/base_model/params.json，指定了模型中的所有超参数。这个在所有新的实验中都应该被创建。

一旦训练过程完成（或者在模型目录中保存了一些权重），我们可以使用TensorBoard来可视化这些嵌入。要做到这一点，运行:

1	python visualize_embeddings.py --model_dir experiments/base_model

在实验目录中运行TensorBoard：

1	tensorboard --logdir experiments/base_model

用T-SNE可视化的MNIST测试图像的嵌入。

这些嵌入是由配置文件experiments/base_model/params.json中指定的超参数运行的。看到哪些验证集图片被错误分类是很有趣的：它们中的大多数肯定也会被人类误解。

结论

TensorFlow并不能很容易地实现三元组损失，但是只要稍加努力，我们就可以通过在线挖掘来构建一个漂亮的三元组损失。

最棘手的部分是如何有效地计算嵌入的距离，以及如何过滤掉无效的/简单的三胞胎。

最后，如果你需要记住一件事：总是测试你的代码，特别是当它像三元组损失一样复杂时。

资源

这篇博客的github目录
介绍三元组在线挖掘的Facenet论文
对在线三元组挖掘的详细解释In Defense of the Triplet Loss for Person Re-Identification
Brandon Amos关于在线三元组挖掘的博客OpenFace 0.2.0: Higher accuracy and halved execution time
用于半困难在线挖掘三元组损失的内置TensorFlow函数的源代码tf.contrib.losses.metric_learning.triplet_semihard_loss
coursera关于三元组损失的课程

RexKing6's Note

三元组损失与TensorFlow在线三元组挖掘

简介

引言