阿里STAR网络的从0到1重点解析

整体介绍

提出了一种单一模型能够适用于多种不同业务场景的方法。 multi-domain CTR prediction，「即我们的模型需要同时预测在D1,D2,Dm业务场景下的点击率。模型以(x,y,p)作为输入，其中x为输入特征，y为点击标签，p为不同业务场景的标识」
这里指的多种ctr业务场景是针对首页推荐和猜你喜欢，特点都是同一个app内。
提出了不同业务场景下，数据互相共享互补提升的思路，提出了一种新的任务：multi-domain CTR prediction。并针对这类任务设计了PN，Star Topology FCN，辅助网络等结构

Your question seems to be about multi-domain click-through rate (CTR) prediction, where a single model needs to predict CTRs in different business scenarios such as D1, D2, and Dm. The model takes (x, y, p) as input, where x are input features, y are click labels, and p are identifiers for different business scenarios.

You mention that these multiple CTR business scenarios involve both the home page recommendation and the “I wonder what you like” recommendations, both of which are within the same app.

You propose a data sharing and complementarity improvement approach under different business scenarios and introduce a new task: multi-domain CTR prediction. You also design structures such as PN (probably a placeholder), Star Topology FCN (probably referring to a star topology fully connected network), and auxiliary networks for this type of task.

背景

通过使用一种通用模型，实现了单一模型可以适用于多种CTR业务场景，从而减少了维护成本、节省了计算资源，并促进了不同业务场景之间的数据共享。

By using a universal model, it is possible to use a single model for multiple CTR business scenarios, thereby reducing maintenance costs, saving computational resources, and promoting data sharing among different business scenarios.

原理

原理1

单场景CTR预估方法将输入数据经过嵌入层（embedding）进行特征转换，然后通过池化（pooling）或拼接（concatenation）操作得到一维向量表示。接下来，这个向量经过批量归一化（Batch Normalization，BN）层，然后通过一系列全连接层（Fully Connected，FC）进行处理，最终得到CTR预估的结果。这个方法通常用于单一的业务场景中，以对输入数据进行特征提取和CTR预测。

Single-scene CTR estimation methods first convert input data through embedding layers (embedding) for feature transformation, and then obtain one-dimensional vector representations through pooling (pooling) or concatenation operations. Next, this vector goes through a batch normalization (Batch Normalization, BN) layer, and then through a series of fully connected (Fully Connected, FC) layers for processing, finally obtaining the result of CTR estimation. This method is usually used in a single business scenario to extract features from the input data and predict CTR.

Star网络：

在这个方法中，我们进行了一系列关键改进。首先，我们将传统的BN（Batch Normalization）层替换为PN（Partitioned Normalization）层，这使得我们能够根据不同业务场景下的数据分布进行定制化的归一化处理。

In this method, we have made a series of key improvements. First, we replace the traditional BN (Batch Normalization) layer with a PN (Partitioned Normalization) layer, allowing us to perform customized normalization processing according to the data distribution under different business scenarios.

其次，我们将FCN（Fully Connected Network）替换为Star Topology FCN，也就是Star Topology Adaptive Recommender（STAR）。这个新型网络架构可以更充分地利用多个业务场景中的数据，从而提升各自业务的性能指标。

Secondly, we replace the FCN (Fully Connected Network) with a Star Topology FCN, also known as the Star Topology Adaptive Recommender (STAR). This new network architecture can make fuller use of the data in multiple business scenarios, thereby improving the performance indicators of each business.

最后，我们引入了一个辅助网络，直接将业务场景的标识（domain indicator）作为输入。这个辅助网络有助于网络更好地感知不同场景下的数据分布，进一步提高了模型的性能和适用性。这些改进共同使得我们的方法能够更好地应对多种业务场景的CTR预估问题。

Finally, we introduce an auxiliary network that directly takes the identifier of the business scenario (domain indicator) as input. This auxiliary network helps the network better perceive the data distribution under different scenarios, further improving the performance and applicability of the model. These improvements together enable our method to better cope with the CTR estimation problem in multiple business scenarios.

Star网络：

Star网络：1

在经过PN层后，输出z’会作为Star Topology FCN的输入. Star Topology FCN包含一个所有领域共享的FCN和每个领域各自独立的FCN. 所有的FCN数量为M+1, M为domain的数量。FCN:全连接网络dnn

After going through the PN layer, the output z’ will be taken as the input of the Star Topology FCN. The Star Topology FCN contains a FCN shared by all domains and a FCN independent of each domain. There are M+1 FCNs in total, where M is the number of domains. FCN: Fully Connected Network dnn

Star网络：2

模型架构

Star网络：3

Star Topology FCN中每个业务场景网络的权重由共享FCN和其domain-specific FCN的权重共同决定。共享FCN来决定每个领域中数据的共性，而domain-specific FCN习得不同领域数据之间分布的差异性。

Star Topology

损失函数

实验结果

共19个domain的数据

实验结果

文中对比的baseline模型大都是多任务学习模型，

multi-domain和multi-task之间的区别主要是：

multi-domain的模型大都解决的是不同domain的相同问题，如CTR预估，其label space是相同的；而multi-task一般解决的是相同domain内的不同任务，如CTR预估和CVR预估，其label space是不同的。

The baseline models compared in the article are mostly multi-task learning models. The differences between multi-domain and multi-task models mainly lie in:

Multi-domain models generally solve the same problem in different domains, such as CTR prediction, and their label spaces are the same; while multi-task usually solves different tasks within the same domain, such as CTR prediction and CVR prediction, and their label spaces are different.

代码实现

Star层代码：

import tensorflow as tf
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.initializers import Zeros, glorot_normal
from tensorflow.python.keras.layers import Layer
from tensorflow.python.keras.regularizers import l2

def activation_layer(activation):
    if isinstance(activation, str):
        act_layer = tf.keras.layers.Activation(activation)
    elif issubclass(activation, Layer):
        act_layer = activation()
    else:
        raise ValueError(
            "Invalid activation,found %s.You should use a str or a Activation Layer Class." % (activation))
    return act_layer

class STAR(Layer):

    def __init__(self, hidden_units, num_domains, activation='relu', l2_reg=0, dropout_rate=0, use_bn=False, output_activation=None,
                 seed=1024, **kwargs):
        self.hidden_units = hidden_units
        self.num_domains = num_domains
        self.activation = activation
        self.l2_reg = l2_reg
        self.dropout_rate = dropout_rate
        self.use_bn = use_bn
        self.output_activation = output_activation
        self.seed = seed

        super(STAR, self).__init__(**kwargs)

    def build(self, input_shape):
        input_size = input_shape[-1]
        hidden_units = [int(input_size)] + list(self.hidden_units)
        ## 共享FCN权重
        self.shared_kernels = [self.add_weight(name='shared_kernel_' + str(i),
                                        shape=(
                                            hidden_units[i], hidden_units[i + 1]),
                                        initializer=glorot_normal(
                                            seed=self.seed),
                                        regularizer=l2(self.l2_reg),
                                        trainable=True) for i in range(len(self.hidden_units))]

        self.shared_bias = [self.add_weight(name='shared_bias_' + str(i),
                                     shape=(self.hidden_units[i],),
                                     initializer=Zeros(),
                                     trainable=True) for i in range(len(self.hidden_units))]
        ## domain-specific 权重
        self.domain_kernels = [[self.add_weight(name='domain_kernel_' + str(index) + str(i),
                                        shape=(
                                            hidden_units[i], hidden_units[i + 1]),
                                        initializer=glorot_normal(
                                            seed=self.seed),
                                        regularizer=l2(self.l2_reg),
                                        trainable=True) for i in range(len(self.hidden_units))] for index in range(self.num_domains)]

        self.domain_bias = [[self.add_weight(name='domain_bias_' + str(index) + str(i),
                                     shape=(self.hidden_units[i],),
                                     initializer=Zeros(),
                                     trainable=True) for i in range(len(self.hidden_units))] for index in range(self.num_domains)]

        self.activation_layers = [activation_layer(self.activation) for _ in range(len(self.hidden_units))]

        if self.output_activation:
            self.activation_layers[-1] = activation_layer(self.output_activation)

        super(STAR, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, inputs, domain_indicator, training=None, **kwargs):
        deep_input = inputs
        output_list = [inputs] * self.num_domains
        for i in range(len(self.hidden_units)):
            for j in range(self.num_domains):
                # 网络的权重由共享FCN和其domain-specific FCN的权重共同决定
                output_list[j] = tf.nn.bias_add(tf.tensordot(
                    output_list[j], self.shared_kernels[i] * self.domain_kernels[j][i], axes=(-1, 0)), self.shared_bias[i] + self.domain_bias[j][i])

                try:
                    output_list[j] = self.activation_layers[i](output_list[j], training=training)
                except TypeError as e:  # TypeError: call() got an unexpected keyword argument 'training'
                    print("make sure the activation function use training flag properly", e)
                    output_list[j] = self.activation_layers[i](output_list[j])
        output = tf.reduce_sum(tf.stack(output_list, axis=1) * tf.expand_dims(domain_indicator,axis=-1), axis=1)

        return output

    def compute_output_shape(self, input_shape):
        if len(self.hidden_units) > 0:
            shape = input_shape[:-1] + (self.hidden_units[-1],)
        else:
            shape = input_shape

        return tuple(shape)

    def get_config(self, ):
        config = {'activation': self.activation, 'hidden_units': self.hidden_units,
                  'l2_reg': self.l2_reg, 'use_bn': self.use_bn, 'dropout_rate': self.dropout_rate,
                  'output_activation': self.output_activation, 'seed': self.seed}
        base_config = super(STAR, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))