Abstract

In this paper, the authors address the challenges in Click-Through Rate (CTR) prediction for online advertising and marketing. Existing methods, both shallow and deep architectures, suffer from three main shortcomings: lack of interpretability, inefficiency in analyzing high-order feature interactions, and neglect of polysemy in feature interactions across different semantic subspaces. They propose the InterHAt algorithm, which leverages a transformer model with multi-head self-attention for feature learning. This model incorporates hierarchical attention layers to predict CTR while providing interpretable insights into the predictions. InterHAt efficiently captures high-order feature interactions through a low-complexity attention aggregation strategy. Extensive experiments on multiple real and synthetic datasets validate the effectiveness and efficiency of InterHAt in CTR prediction.

Keywords

Click-Through Rate (CTR) prediction; recommendation systems; deep learning; feature interactions.

Introduction

Click-through rate (CTR) is defined as the probability that a user clicks on a specific recommended item or advertisement on a web page. It plays a crucial role in recommender systems such as online advertising, where it directly affects the revenue of advertising agencies. CTR prediction, which aims to accurately estimate the CTR for a given user-item scenario, is therefore key to precise recommendation and higher business revenue.

The development of deep learning has introduced a new machine learning paradigm that uses deeper neural network architectures to capture more complex information from training data. Consequently, the architectures and computational complexity of existing CTR prediction models have kept growing in order to learn the joint effects of multiple features, i.e., high-order features (also called cross features), and to achieve better prediction accuracy. Here, a k-th order feature is a latent variable that is a k-th order polynomial of the raw features. Deep neural networks, with their many layers and units, can capture rich high-order information. For instance, DeepFM [9] and xDeepFM [19] learn high-order features with multi-layer feedforward neural networks (FNN) and multi-block compressed interaction networks (CIN).

However, growing model complexity has two drawbacks: impaired interpretability and poor efficiency. Regarding interpretability, the weights and activations of neural network layers are generally considered unexplainable, so the prediction process is hard to justify. For example, the wide component of Wide & Deep [4] applies cross-products to feature embeddings but fails to quantify and justify their contribution to actual CTR prediction performance. The lack of a convincing rationale for model predictions casts a shadow on the reliability and safety of the model. In many applications, such as medication recommendation [20] and financial services [39], untrustworthy and unreliable advertisements can mislead users into clicking links that are statistically popular but useless or even harmful, with severe consequences such as financial or health losses.

The second drawback of existing methods is low efficiency, because generating high-order interaction features with deep neural networks (DNN) involves extremely heavy matrix computation. For example, the Compressed Interaction Network (CIN) in xDeepFM [19] computes the (k+1)-th order feature matrix with an outer product followed by a fully connected layer, which has cubic complexity in the embedding dimension. The deep component of Wide & Deep has many fully connected layers, each of which involves a multiplication of quadratic cost.

In real-world applications, efficiency problems are both common and critical. Advertising companies prefer instant click recommendations over slow or expensive ones, especially under the pressure of massive real-time recommendation queries. For instance, the online advertising company Criteo processed over 4 billion clicks in 24 days. On top of this data volume, new features (such as new users and new items) emerge rapidly, and recommender systems must adapt to them quickly to provide a good user experience. Learning representations for a vast number of existing and emerging features with existing methods is therefore computationally difficult.

Beyond the interpretability and efficiency issues, we point out another obstacle that can degrade the detection of important cross-feature interactions: different cross features may have conflicting effects on CTR and must be analyzed comprehensively. For example, the movie recommendation record "movie.genre = horror", "user.age = young", "time = 8am" contains conflicting factors. The combination of the first two encourages a click, while the combination of the last two discourages it, since watching movies usually happens at night. This conflict is caused by polysemous feature interactions in different semantic subspaces. In this example, the polysemy of user age produces opposite effects on CTR when "user.age = young" is combined with two different attributes, "movie.genre" and "time".

To address the above issues, this paper proposes InterHAt, an interpretable CTR prediction model with a hierarchical attention mechanism. The model efficiently learns salient features of different orders as interpretative insights while accurately predicting CTR in an end-to-end manner.

Specifically, InterHAt uses a novel hierarchical attention mechanism to explicitly quantify the effect of feature interactions of arbitrary order, aggregates the important feature interactions for efficiency, and explains recommendation decisions according to the learned feature salience. Unlike the hierarchical attention network studied by Yang et al., InterHAt applies hierarchical attention over feature orders, generating higher-order features from lower-order ones.

To accommodate the polysemy of feature interactions across different semantic subspaces, InterHAt uses a Transformer with multi-head self-attention to comprehensively study the different possible feature interactions. Transformers have been widely applied in natural language processing tasks such as sentiment analysis, natural language inference [6], and machine translation. Multi-head attention captures, in different latent subspaces, the multiple interactions among words that jointly compose the semantics of a text.

We exploit this property of the Transformer to detect complex polysemous feature interactions and to learn a polysemy-augmented feature list, which serves as the input to the hierarchical attention layers. Note that despite the Transformer's strong feature-learning capability, the efficiency of the model is still preserved, according to Vaswani et al.

The contributions of this paper are summarized as follows:

  1. We propose InterHAt for CTR prediction. InterHAt uses hierarchical attention to pinpoint the significant single features or interactive features of different orders that contribute most to a click. It can then give attention-based explanations of the CTR prediction according to the different orders of feature interactions.

  2. InterHAt uses a Transformer with multi-head self-attention to thoroughly analyze the possible interactions between features in different latent semantic subspaces. To the best of our knowledge, InterHAt is the first method to use a Transformer with multi-head self-attention to learn a large set of latent features for CTR prediction.

  3. InterHAt predicts CTR without computationally expensive multi-layer perceptron networks. It aggregates features, which avoids the exponential cost of enumerating feature interactions, and is therefore more efficient than existing algorithms at handling high-order features.

  4. Extensive experiments on three major CTR benchmark datasets (Criteo, Avazu, and Frappe), a popular recommender system dataset (MovieLens-1M), and a synthetic dataset evaluate the interpretability, efficiency, and effectiveness of InterHAt. The results show that InterHAt explains its decision process, substantially reduces training time, and performs comparably to state-of-the-art models.

The rest of the paper is organized as follows. Section 2 briefly reviews related work on CTR prediction and attention mechanisms. Section 3 gives the technical details of each component of InterHAt. Section 4 reports the empirical evaluation. Finally, Section 5 concludes and discusses future research directions.

Related Work

In this section, we discuss existing CTR prediction models and attention mechanisms.

CTR Prediction Models

CTR prediction has attracted extensive attention from both academia and industry because of its significant impact on online advertising. CTR prediction algorithms have become increasingly powerful at learning feature interactions, which in essence reflects a trend toward ever deeper model architectures.

  • Factorization Machine (FM) [24] assigns a d-dimensional trainable continuous-valued representation to each distinct feature, learns these representations, and makes predictions through a linear aggregation of first-order and second-order features. Although FM can be generalized to higher orders, it suffers from exponential computational cost [3] and the low modeling capacity of a shallow structure.

  • Field-aware Factorization Machine (FFM) [16] assumes that a feature may have different semantics under different fields and extends FM by making feature representations field-specific. It achieves better CTR results than FM, but its parameter size and complexity also increase, making it more prone to overfitting.

  • Attentional Factorization Machine (AFM) [32] extends FM with an "attention net" that improves both performance and interpretability. The authors argue that the feature salience provided by the attention network greatly improves the transparency of FM. That said, because of the limitations of FM, AFM can only learn second-order attention-based salience.

  • Wide & Deep [4] consists of a wide component and a deep component, which are essentially a generalized linear model and a multi-layer perceptron (MLP), respectively. The CTR prediction is a weighted combination of the outputs of the two components. The deep component (the MLP) rules out explaining the prediction, because its layer-wise transformations operate at the unit level rather than the feature level, and an individual unit-level value cannot carry the concrete and complete semantic information of a feature.

  • Deep & Cross Network (DCN) [31] differs slightly from Wide & Deep: it replaces the linear model with a cross-product transformation in order to integrate high-order information with the non-linear deep features.

  • DeepFM [9] improves on these two models by replacing the polynomial product with an FM component. The deep MLP component captures high-order feature interactions, while the FM component analyzes second-order feature interactions.

  • xDeepFM [19] argues that MLP parameters model only "implicit" feature interactions in an arbitrary way. Its authors therefore introduce the Compressed Interaction Network (CIN) to model "explicit" feature interactions alongside the implicit ones.

  • Recent work from industry practice includes DIN [38] and DIEN [37], which model users' static and dynamic shopping interests, respectively. Both rely heavily on deep feedforward networks, which are generally not interpretable.
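As a concrete reference point for the FM family discussed above, the following is a minimal sketch of a second-order FM score. It uses the standard FM form (a bias, a linear term, and pairwise terms weighted by inner products of the feature representations); the function names, the random parameters, and the sizes n and d are illustrative assumptions, not values from the paper. For CTR prediction the score would typically be passed through a sigmoid.

```python
import numpy as np


def fm_predict(x, w0, w, V):
    """Second-order FM score for one feature vector x.

    x: (n,) feature values, w0: scalar bias, w: (n,) linear weights,
    V: (n, d) matrix whose i-th row is the representation of feature i.
    """
    linear = w0 + w @ x
    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n d) as
    # 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2].
    xv = x @ V                    # shape (d,)
    x2v2 = (x ** 2) @ (V ** 2)    # shape (d,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score with the explicit double loop, used only as a check."""
    n = len(x)
    score = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score


# Illustrative random inputs (assumed sizes).
rng = np.random.default_rng(0)
n, d = 6, 3
x = rng.integers(0, 2, size=n).astype(float)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, d))

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

The two functions return the same score; the rewritten pairwise term is what keeps second-order FM cheap, whereas enumerating interactions of higher order grows much faster.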

All of the CTR prediction models above rely heavily on deep neural networks and have steadily improved performance. However, deep learning is a double-edged sword, with potential risks for reliability and security. The weights and activations of hidden layers are hard to interpret, and the causal relationship between inputs and outputs is hidden and uncertain. None of these models provides feature-level clues that explain why a deep feature-learning strategy raises or lowers CTR performance, so predictions without clear explanations are considered untrustworthy. In contrast, InterHAt handles CTR prediction with attention-based, feature-level explanations. InterHAt contains no uninterpretable deep MLP modules and works only at the feature level, which also improves its efficiency.

Attention Mechanism

The attention mechanism learns a function that weights intermediate features and controls which information is visible to the other modules of a machine learning algorithm. It was originally proposed for neural machine translation, where it assigns larger weights to closely related words in the source and target languages so that translation focuses on the important words.

Because it can pinpoint and amplify the salient features that strongly affect predictions [8], the attention mechanism is regarded as a reasonable and reliable way to explain the decision process in many tasks, such as recommender systems [32, 35], health care systems [5], computer vision [33], and visual question answering.

For instance, RETAIN [5] studies patients' electronic health records (EHR) with a two-layer attention network that identifies and explains influential hospital visits and the significant clinical diagnoses associated with those visits. The co-attention mechanism [14] proposes question-guided visual attention and visual-guided question attention at the word level, phrase level, and question level. Combining the three levels of information predicts answers with better performance while keeping the results interpretable.

In the natural language domain, language-specific and cross-lingual attention networks built on linguistic hierarchies such as words and sentences have been proposed for document classification. Another form of attention in natural language processing is self-attention. Researchers from Google designed the Transformer [29] based on multi-head self-attention, in which the tokens of a sentence attend to the other tokens of the same sentence and thereby learn the semantics of complex sentences. Building on the Transformer's strong learning capability, BERT [6] stacks multiple bidirectional Transformer layers and achieved state-of-the-art performance on 11 major NLP tasks. The success of BERT demonstrates the Transformer's outstanding ability to model interactions.

In summary, many existing studies support the view that attention mechanisms improve both the accuracy and the interpretability of models. Although attention modules are not trained to produce human-readable rationales for predictions, they still reveal how the salience of information is distributed as feature representations pass through the model architecture, which can serve as a form of explanation. We therefore adopt attention mechanisms to explain CTR predictions in InterHAt.

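The InterHAt Model

In this section, we describe the InterHAt pipeline shown in Figure 1 and the method for interpreting CTR predictions from the attention weights.

Embedding Layer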

Feature embedding is a prerequisite for CTR prediction because click records consist of discrete categorical terms that cannot be used directly in numerical computation.

A click record contains a set of fields F and a binary label indicating whether a click occurred. Each field f ∈ F takes either a categorical value or a numerical value, and distinct values are defined as distinct features. For categorical fields, we embed the one-hot encoded field to obtain a low-dimensional real-valued feature representation. Specifically, each distinct feature value of a field is assigned a trainable d-dimensional continuous vector as its representation. For numerical fields, we assign one vector per field as its embedding.
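The embedding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names, vocabularies, the dimension d = 4, and the random initialization are all hypothetical, and scaling a numerical field's vector by the field value is an assumption that the text does not state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (assumed value)

# Hypothetical schema: two categorical fields and one numerical field.
categorical_vocab = {
    "movie.genre": ["horror", "comedy", "drama"],
    "user.age": ["young", "adult"],
}
numerical_fields = ["price"]

# One d-dimensional vector per distinct categorical feature value.
cat_tables = {
    field: {value: rng.normal(size=d) for value in values}
    for field, values in categorical_vocab.items()
}
# One d-dimensional vector per numerical field.
num_vectors = {field: rng.normal(size=d) for field in numerical_fields}


def embed_record(record):
    """Return a matrix with one d-dimensional row per field of the record."""
    rows = []
    for field, value in record.items():
        if field in cat_tables:
            # Equivalent to multiplying a one-hot vector by an embedding table.
            rows.append(cat_tables[field][value])
        else:
            # Assumption: scale the field's vector by the numerical value.
            rows.append(float(value) * num_vectors[field])
    return np.stack(rows)


X0 = embed_record({"movie.genre": "horror", "user.age": "young", "price": 0.5})
print(X0.shape)  # (3, 4)
```

In a trainable model these vectors would be parameters updated by gradient descent rather than fixed random draws.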

Multi-Head Attention

The Transformer is popular in natural language processing because it learns well how words within or across sentences jointly affect the semantics of a text, regardless of word order and distance. In the context of CTR prediction, we define the co-effects of features, i.e., the interactions between features, as polysemy. InterHAt is therefore equipped with a Transformer based on multi-head self-attention to capture rich pairwise feature interactions and to learn polysemous feature interactions in different semantic subspaces, i.e., the different meanings a feature carries for CTR in different click contexts.

Given an input matrix X_0 that contains the learnable embeddings of the features of a training CTR record, the latent representations of the Transformer heads are obtained by scaled dot-product attention.

Formula 1
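The formula itself is not reproduced in this post. Assuming it follows the standard scaled dot-product attention of Vaswani et al., the representation of head i would take the form below. The projection matrices W_i^(Q), W_i^(K), W_i^(V) and the key dimension d_K are notation introduced here for illustration and may differ from the paper's.

```latex
\mathbf{H}_i = \operatorname{softmax}\!\left(
    \frac{\mathbf{Q}_i \mathbf{K}_i^{\top}}{\sqrt{d_K}}
\right) \mathbf{V}_i,
\qquad
\mathbf{Q}_i = \mathbf{X}_0 \mathbf{W}_i^{(Q)},\quad
\mathbf{K}_i = \mathbf{X}_0 \mathbf{W}_i^{(K)},\quad
\mathbf{V}_i = \mathbf{X}_0 \mathbf{W}_i^{(V)}
```

Each row of the softmax output is a distribution over the features of the record, so each row of H_i is a weighted mixture of the projected features.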

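The hidden features H_i are combined into an augmented matrix that preserves both the intrinsic information and the polysemous information of each feature. Computationally, we combine them by concatenation followed by a feedforward layer and a ReLU, which learns the non-linearity of the combined information.

Formula 2

W_m contains the weights, h is the number of attention heads, and ";" denotes matrix concatenation. X_1 is the matrix of polysemy-augmented features, which is passed to the hierarchical attention layers for interpretable prediction.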
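The computation described in this section (per-head scaled dot-product attention, concatenation of the heads, a weight matrix W_m, a feedforward layer, and a ReLU) can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the shapes, the random weights, the single linear feedforward layer (Wf, bf), and the row-vector convention for applying W_m are all illustrative choices.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(X0, Wq, Wk, Wv, Wm, Wf, bf):
    """X0: (m, d) feature embeddings of one record.

    Wq, Wk, Wv: (h, d, d_k) per-head projections (assumed shapes).
    Wm: (h * d_k, d) mixes the concatenated heads.
    Wf, bf: one feedforward layer of shapes (d, d) and (d,).
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X0 @ Wq_i, X0 @ Wk_i, X0 @ Wv_i     # each (m, d_k)
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # (m, m)
        heads.append(softmax(scores, axis=-1) @ V)    # H_i: (m, d_k)
    concat = np.concatenate(heads, axis=-1)           # [H_1; ...; H_h]
    return np.maximum(0.0, (concat @ Wm) @ Wf + bf)   # X1: (m, d)


# Illustrative sizes: m features, embedding size d, h heads of size d_k.
rng = np.random.default_rng(0)
m, d, h, d_k = 3, 8, 2, 4
X0 = rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(size=(h, d, d_k)) for _ in range(3))
Wm = rng.normal(size=(h * d_k, d))
Wf, bf = rng.normal(size=(d, d)), np.zeros(d)

X1 = multi_head_self_attention(X0, Wq, Wk, Wv, Wm, Wf, bf)
print(X1.shape)  # (3, 8)
```

The output keeps one row per input feature, so X1 can be consumed feature by feature by later attention layers, which is what makes feature-level explanations possible.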

My Comments

This paper was published in 2020 and is not an A-tier work. Its main idea is to use a Transformer together with feature interactions to bring interpretability to deep models. It is worth a read to understand the idea. What industry actually deploys is far more complex and far more substantial, so treat this paper as a way to broaden your background knowledge.