Partial_FC

Paper: https://arxiv.org/abs/2010.05222
Code: https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc

Motivation

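Partial FC addresses the problem that, as face recognition training sets grow and the number of classes grows with them, the classification layer exceeds what a GPU can hold and compute. The goal is to reach the accuracy of full-class classification with limited compute resources.
With batch size 64 and embedding dimension 512, the method classifies 10 million classes on 8 RTX 2080 Ti GPUs, and 100 million classes on 64 GPUs.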
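To make the scale concrete, the following is a rough back-of-the-envelope sketch of the memory a dense classification layer would need. It assumes float32 values and counts only the weight matrix and one batch of logits (gradients and optimizer state would add more); the helper name `fc_memory_gib` is hypothetical, and treating batch size 64 as a per-GPU value is an assumption.

```python
def fc_memory_gib(num_classes: int, embedding_dim: int, batch_size: int,
                  bytes_per_value: int = 4) -> dict:
    """Rough memory (GiB) of a dense classification layer, assuming float32."""
    gib = 1024 ** 3
    weight = num_classes * embedding_dim * bytes_per_value  # W has d x C entries
    logits = batch_size * num_classes * bytes_per_value     # C logits per sample
    return {"weight_gib": weight / gib, "logits_gib": logits / gib}


# All 10 million class centers on a single GPU.
single_gpu = fc_memory_gib(num_classes=10_000_000, embedding_dim=512, batch_size=64)
print(single_gpu)

# Classes split evenly over 8 GPUs; each GPU sees the gathered batch
# (assumption: 64 samples per GPU, so 64 * 8 after gathering).
per_gpu = fc_memory_gib(num_classes=10_000_000 // 8, embedding_dim=512, batch_size=64 * 8)
print(per_gpu)
```

Under these assumptions the weights alone take roughly 19 GiB on one device, more than the 11 GB of a 2080 Ti. Splitting the classes over 8 GPUs brings the weight shard down to a few GiB, but the logits for the gathered batch stay large, which is the part that sampling a subset of class centers reduces.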

Method

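The authors observe that the negative classes in softmax matter less to the final result than one might expect. They therefore split the classes into subsets and classify each sample against a subset of class centers rather than against all of them.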

Approximation strategy

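The softmax loss is

$$
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}}, \qquad f_j = W_j^{\top} x_i
$$

W is a d*C matrix, d is the embedding dimension, C is the number of classes, N is the batch size, and f_j is the activation of the fully connected layer for class j.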
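As a small worked example of the full-class softmax loss over a d*C matrix W, the sketch below computes the loss in plain Python for a toy problem. The function name `softmax_loss` and the toy values are illustrative assumptions, and the bias term is omitted.

```python
import math


def softmax_loss(W, X, y):
    """Mean softmax cross-entropy.

    W: d x C matrix given as a list of d rows, X: N samples of dimension d,
    y: N integer labels in [0, C).
    """
    d, C = len(W), len(W[0])
    total = 0.0
    for x_i, y_i in zip(X, y):
        # f_j = W_j^T x_i, where W_j is the j-th column of W
        f = [sum(W[k][j] * x_i[k] for k in range(d)) for j in range(C)]
        m = max(f)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in f))
        total += log_sum - f[y_i]  # -log softmax probability of the true class
    return total / len(X)


W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]          # d = 2, C = 3
X = [[1.0, 0.0], [0.0, 1.0]]   # N = 2
y = [0, 1]
loss = softmax_loss(W, X, y)
print(loss)
```

The inner sum over j runs over all C classes for every sample, which is the cost that grows with the number of classes.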

[Image: loss function]

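Each column of the linear transformation matrix is treated as a class center; for a sample x_i, w_yi (column y_i) is its class center. If a subset of centers is to approximate the full-class softmax, the positive class center w_yi of x_i must be included in that subset; only then is model performance preserved.
The authors run two ablation experiments to test this hypothesis: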

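In the first experiment, the positive class centers of all samples in the batch are selected, and the negative class centers are sampled at random (sampling rate 0.1). This is called Positive Plus Randomly Negative (PPRN).
In the second experiment, centers are sampled at random from all class centers, with sampling rate 0.5.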
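During training, the average cosine between x_i and w_yi is defined as:

$$
\frac{1}{N}\sum_{i=1}^{N}\frac{w_{y_i}^{\top} x_i}{\lVert w_{y_i}\rVert\,\lVert x_i\rVert}
$$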

The experiments show that PPRN yields the shortest average cosine distance between samples and their class centers.
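The difference between the two sampling schemes can be illustrated with a toy sketch: PPRN always keeps the positive center of every sample in the batch, while uniform sampling over all centers can drop some of them. The helper names, the class count of 1000, and the seed are illustrative assumptions; this is not the authors' experiment.

```python
import random


def pprn_select(batch_labels, num_classes, rate, rng):
    """Positives plus randomly sampled negatives (PPRN)."""
    positives = set(batch_labels)
    negatives = [c for c in range(num_classes) if c not in positives]
    num_neg = int(len(negatives) * rate)
    return positives | set(rng.sample(negatives, num_neg))


def random_select(num_classes, rate, rng):
    """Uniform sampling over all class centers, ignoring the batch labels."""
    return set(rng.sample(range(num_classes), int(num_classes * rate)))


def positive_coverage(batch_labels, selected):
    """Fraction of batch samples whose own class center was selected."""
    return sum(label in selected for label in batch_labels) / len(batch_labels)


rng = random.Random(0)
num_classes = 1000
batch_labels = [rng.randrange(num_classes) for _ in range(64)]

pprn = pprn_select(batch_labels, num_classes, rate=0.1, rng=rng)
rand = random_select(num_classes, rate=0.5, rng=rng)
print("PPRN coverage:  ", positive_coverage(batch_labels, pprn))
print("random coverage:", positive_coverage(batch_labels, rand))
```

With PPRN the coverage is 1.0 by construction. With uniform sampling at rate 0.5, each sample's positive center is kept only about half of the time, so those samples are trained without the term that pulls them toward their own center.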

Distributed approximation

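The weight matrix W (d*C) is split into k parts [w1; w2; … ; wk], where k is the number of GPUs. For a sample x_i with label y_i, the class center is column y_i of W. Each GPU can therefore identify which of its local centers are positive; these are denoted w^p_i for GPU i (the batch size is larger than 1, so w^p_i generally holds more than one center).
On each GPU, the total number of class centers is |wi| and the number of positive centers is |w^p_i|. The number of negative centers s_i, drawn by random sampling, is s_i=(|wi|-|w^p_i|)*r, where r is the PPRN sampling rate.
Finally, softmax is computed over all selected class centers, Ws=[Wp, Wn]=[w^p_1, w^p_2, …, w^p_k, w^n_1, …, w^n_k]. In practice Ws is obtained by sampling locally on each GPU.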
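The per-GPU procedure described above can be sketched in plain Python as follows. The class range split mirrors the `num_local` / `class_start` arithmetic of the `PartialFC_V2` constructor in the Code section, and the number of negatives follows s_i=(|wi|-|w^p_i|)*r from the text. The function names and toy numbers are illustrative assumptions, and ranks are simulated with a loop rather than real processes.

```python
import random


def local_range(num_classes, world_size, rank):
    """Contiguous class range [start, start + num_local) owned by one rank."""
    base, rem = divmod(num_classes, world_size)
    num_local = base + int(rank < rem)
    start = base * rank + min(rank, rem)
    return start, num_local


def sample_local_centers(global_labels, num_classes, world_size, rank, r, rng):
    """Sorted local indices of the selected centers: all positives plus s_i negatives."""
    start, num_local = local_range(num_classes, world_size, rank)
    pos_set = {y - start for y in global_labels if start <= y < start + num_local}
    negatives = [c for c in range(num_local) if c not in pos_set]
    s_i = int((num_local - len(pos_set)) * r)  # s_i = (|w_i| - |w_i^p|) * r
    return sorted(list(pos_set) + rng.sample(negatives, s_i))


rng = random.Random(0)
labels = [3, 3, 17, 42, 99]  # labels of the whole gathered batch
for rank in range(4):
    start, num_local = local_range(100, 4, rank)
    chosen = sample_local_centers(labels, 100, 4, rank, r=0.1, rng=rng)
    print(rank, start, num_local, chosen)
```

Note that the `sample` method in the Code section sizes the selection slightly differently: it picks `int(sample_rate * num_local)` centers in total, positives included, instead of applying the rate to the negatives only.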

[Image: loss function]

Code

https://github.com/deepinsight/insightface/blob/master/recognition/arcface_torch/partial_fc_v2.py

import math
from typing import Callable

import torch
from torch import distributed
from torch.nn.functional import linear, normalize


class PartialFC_V2(torch.nn.Module):
    """
    https://arxiv.org/abs/2203.15565
    A distributed sparsely updating variant of the FC layer, named Partial FC (PFC).
    When sample rate less than 1, in each iteration, positive class centers and a random subset of
    negative class centers are selected to compute the margin-based softmax loss, all class
    centers are still maintained throughout the whole training process, but only a subset is
    selected and updated in each iteration.
    .. note::
        When sample rate equal to 1, Partial FC is equal to model parallelism(default sample rate is 1).
    Example:
    --------
    >>> module_pfc = PartialFC_V2(margin_loss, embedding_size=512, num_classes=8000000, sample_rate=0.2)
    >>> for img, labels in data_loader:
    >>>     embeddings = net(img)
    >>>     loss = module_pfc(embeddings, labels)
    >>>     loss.backward()
    >>>     optimizer.step()
    """
    _version = 2

    def __init__(
        self,
        margin_loss: Callable,
        embedding_size: int,
        num_classes: int,
        sample_rate: float = 1.0,
        fp16: bool = False,
    ):
        """
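        Parameters:
        -----------
        margin_loss: Callable
            Margin-based softmax function applied to the logits (e.g. ArcFace), required
        embedding_size: int
            The dimension of embedding, required
        num_classes: int
            Total number of classes, required
        sample_rate: float
            The rate of negative centers participating in the calculation, default is 1.0.
        fp16: bool
            Whether to compute the logits under autocast, default is False.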
        """
        super(PartialFC_V2, self).__init__()
        assert (
            distributed.is_initialized()
        ), "must initialize distributed before create this"
        self.rank = distributed.get_rank()
        self.world_size = distributed.get_world_size()

        self.dist_cross_entropy = DistCrossEntropy()
        self.embedding_size = embedding_size
        self.sample_rate: float = sample_rate
        self.fp16 = fp16
        
        # Each GPU stores only its own sub-matrix W_k
        self.num_local: int = num_classes // self.world_size + int(
            self.rank < num_classes % self.world_size
        )
        self.class_start: int = num_classes // self.world_size * self.rank + min(
            self.rank, num_classes % self.world_size
        )
        self.num_sample: int = int(self.sample_rate * self.num_local)
        self.last_batch_size: int = 0

        self.is_updated: bool = True
        self.init_weight_update: bool = True
        # Weights of this GPU's sub-matrix
        self.weight = torch.nn.Parameter(torch.normal(0, 0.01, (self.num_local, embedding_size)))

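        # margin_loss
        if callable(margin_loss):
            # Margin-based softmax, e.g. ArcFace loss
            self.margin_softmax = margin_loss
        else:
            # A bare `raise` here would fail with "No active exception to re-raise"
            raise TypeError("margin_loss must be callable")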

    def sample(self, labels, index_positive):
        """
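            This function modifies the values of labels in place.
            Parameters:
            -----------
            labels: torch.Tensor
                Labels of the gathered batch, already mapped to local class indices
                (-1 for classes owned by other ranks).
            index_positive: torch.Tensor
                Boolean mask of the labels that belong to this rank.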
        """
        with torch.no_grad():
            # Collect the set of positive class labels
            positive = torch.unique(labels[index_positive], sorted=True).cuda()
            if self.num_sample - positive.size(0) >= 0:
                # Draw a random score in [0, 1) for every local class (num_local)
                perm = torch.rand(size=[self.num_local]).cuda()
                # Set the score of positive classes to 2.0 (above any random score) so they are always selected
                perm[positive] = 2.0
                # Use topk to pick the num_sample classes with the highest scores
                index = torch.topk(perm, k=self.num_sample)[1].cuda()
                # Sort to obtain the final indices of the sampled classes
                index = index.sort()[0].cuda()
            else:
                index = positive
            self.weight_index = index

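            # searchsorted returns the insertion position of labels[index_positive] in index,
            # i.e. each label's position among the sampled classes
            labels[index_positive] = torch.searchsorted(index, labels[index_positive])

        # Return the sampled weights, i.e. the selected class centers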
        return self.weight[self.weight_index]

    def forward(
        self,
        local_embeddings: torch.Tensor,
        local_labels: torch.Tensor,
    ):
        """
        Parameters:
        ----------
        local_embeddings: torch.Tensor
            feature embeddings on each GPU(Rank).
        local_labels: torch.Tensor
            labels on each GPU(Rank).
        Returns:
        -------
        loss: torch.Tensor
            pass
        """
        # Normalize the label shape and dtype
        local_labels.squeeze_()
        local_labels = local_labels.long()

        # Distributed training requires the batch size to stay the same
        batch_size = local_embeddings.size(0)
        if self.last_batch_size == 0:
            self.last_batch_size = batch_size
        assert self.last_batch_size == batch_size, (
            f"last batch size do not equal current batch size: {self.last_batch_size} vs {batch_size}")

        # Gather the embeddings and labels from all GPUs
        _gather_embeddings = [
            torch.zeros((batch_size, self.embedding_size)).cuda()
            for _ in range(self.world_size)
        ]
        _gather_labels = [
            torch.zeros(batch_size).long().cuda() for _ in range(self.world_size)
        ]
        _list_embeddings = AllGather(local_embeddings, *_gather_embeddings)
        distributed.all_gather(_gather_labels, local_labels)

        embeddings = torch.cat(_list_embeddings)
        labels = torch.cat(_gather_labels)

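        # Find the samples whose class falls in this rank's range. Labels outside it are set to -1:
        # those samples have no positive center on this rank in this step
        labels = labels.view(-1, 1)
        index_positive = (self.class_start <= labels) & (
            labels < self.class_start + self.num_local
        )
        labels[~index_positive] = -1
        # Subtract class_start from the local labels to map them into [0, num_local)
        labels[index_positive] -= self.class_start

        # Sample the weights, i.e. the class centers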
        if self.sample_rate < 1:
            weight = self.sample(labels, index_positive)
        else:
            weight = self.weight

        # Normalize the embeddings and the weights W, then compute their dot products
        with torch.cuda.amp.autocast(self.fp16):
            norm_embeddings = normalize(embeddings)
            norm_weight_activated = normalize(weight)
            # A matrix multiplication that yields the cosine-similarity (logit) matrix
            logits = linear(norm_embeddings, norm_weight_activated)
        if self.fp16:
            logits = logits.float()
        # Clamp to [-1, 1] to keep the later computations numerically stable
        logits = logits.clamp(-1, 1)

        # Margin-based softmax, e.g. ArcFace
        logits = self.margin_softmax(logits, labels)
        # Cross-entropy loss computed across ranks; defined below
        loss = self.dist_cross_entropy(logits, labels)
        return loss

# Loss function definition
class DistCrossEntropyFunc(torch.autograd.Function):
    """
    CrossEntropy loss is calculated in parallel: the softmax denominator is all-reduced across GPUs before the softmax is computed.
    Follows the ArcFace implementation (https://arxiv.org/pdf/1801.07698v1.pdf):
    """

    @staticmethod
    def forward(ctx, logits: torch.Tensor, label: torch.Tensor):
        """ """
        batch_size = logits.size(0)
        # for numerical stability
        max_logits, _ = torch.max(logits, dim=1, keepdim=True)
        # local to global
        distributed.all_reduce(max_logits, distributed.ReduceOp.MAX)
        logits.sub_(max_logits)
        logits.exp_()
        sum_logits_exp = torch.sum(logits, dim=1, keepdim=True)
        # local to global
        distributed.all_reduce(sum_logits_exp, distributed.ReduceOp.SUM)
        logits.div_(sum_logits_exp)
        index = torch.where(label != -1)[0]
        # loss
        loss = torch.zeros(batch_size, 1, device=logits.device)
        loss[index] = logits[index].gather(1, label[index])
        distributed.all_reduce(loss, distributed.ReduceOp.SUM)
        ctx.save_for_backward(index, logits, label)
        return loss.clamp_min_(1e-30).log_().mean() * (-1)

    @staticmethod
    def backward(ctx, loss_gradient):
        """
        Args:
            loss_gradient (torch.Tensor): gradient backward by last layer
        Returns:
            gradients for each input in forward function
            `None` gradients for one-hot label
        """
        (
            index,
            logits,
            label,
        ) = ctx.saved_tensors
        batch_size = logits.size(0)
        one_hot = torch.zeros(
            size=[index.size(0), logits.size(1)], device=logits.device
        )
        one_hot.scatter_(1, label[index], 1)
        logits[index] -= one_hot
        logits.div_(batch_size)
        return logits * loss_gradient.item(), None


class DistCrossEntropy(torch.nn.Module):
    def __init__(self):
        super(DistCrossEntropy, self).__init__()

    def forward(self, logit_part, label_part):
        return DistCrossEntropyFunc.apply(logit_part, label_part)


class AllGatherFunc(torch.autograd.Function):
    """AllGather op with gradient backward"""

    @staticmethod
    def forward(ctx, tensor, *gather_list):
        gather_list = list(gather_list)
        distributed.all_gather(gather_list, tensor)
        return tuple(gather_list)

    @staticmethod
    def backward(ctx, *grads):
        grad_list = list(grads)
        rank = distributed.get_rank()
        grad_out = grad_list[rank]

        dist_ops = [
            distributed.reduce(grad_out, rank, distributed.ReduceOp.SUM, async_op=True)
            if i == rank
            else distributed.reduce(
                grad_list[i], i, distributed.ReduceOp.SUM, async_op=True
            )
            for i in range(distributed.get_world_size())
        ]
        for _op in dist_ops:
            _op.wait()

        grad_out *= len(grad_list)  # cooperate with distributed loss function
        return (grad_out, *[None for _ in range(len(grad_list))])
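The forward pass of `DistCrossEntropyFunc` relies on the fact that softmax over class-sharded logits can be recovered from two reductions: a global max and a global sum of exponentials. The sketch below checks that idea in plain Python for a single sample. Each list in `logit_shards` stands in for one GPU, and the built-in `max` and `sum` play the role of `all_reduce` with `MAX` and `SUM`; the function names and toy logits are illustrative assumptions, not part of the insightface code.

```python
import math


def sharded_softmax_prob(logit_shards, label):
    """Softmax probability of `label` when one sample's logits are split across shards."""
    # Stand-in for all_reduce(MAX): every shard learns the global maximum.
    global_max = max(max(shard) for shard in logit_shards)
    # Each shard sums its own exponentials; stand-in for all_reduce(SUM).
    local_sums = [sum(math.exp(v - global_max) for v in shard) for shard in logit_shards]
    denom = sum(local_sums)
    # Only the shard that owns the label contributes a non-zero numerator,
    # like the rows where label != -1 in the listing.
    numer, offset = 0.0, 0
    for shard in logit_shards:
        if offset <= label < offset + len(shard):
            numer = math.exp(shard[label - offset] - global_max)
        offset += len(shard)
    return numer / denom


def full_softmax_prob(logits, label):
    """Reference softmax probability computed on a single device."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[label] / sum(exps)


logits = [0.3, -1.2, 2.0, 0.7, -0.5, 1.1]
shards = [logits[:2], logits[2:4], logits[4:]]
for label in range(len(logits)):
    print(label, sharded_softmax_prob(shards, label), full_softmax_prob(logits, label))
```

Only per-sample scalars (the max and the sum) cross shard boundaries in this scheme, so the full row of C logits never has to be assembled in one place.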

https://guokent.github.io/papernotes/partial_fc/

Author

Kent

Published on

August 26, 2025
