Overview

In the field of clustering algorithms, there are several fundamental approaches, each of which may lead to one or more specific algorithms. These fundamental approaches are:

  • Partition-Based Algorithms;

  • Hierarchical-Based Algorithms;

  • Density-Based Algorithms;

  • Grid-Based Algorithms;

  • Constraint-Based Algorithms;

Each of these fundamental approaches has given rise to numerous derivative algorithms, which have found applications in both academia and industry. These derivative algorithms often offer improvements or specialized capabilities for specific types of data or clustering challenges.

Partition-Based Algorithms

Among partition-based algorithms, the most typical is the K-means algorithm, which is relatively simple and easy to understand.

  • Principle:
    Input: the dataset and a value of K.
    Output: the K cluster centroids after clustering.

  1. Randomly select K data samples from the dataset as the initial cluster centers.
  2. For each remaining sample, compute its distance to the K cluster centers and assign it to the nearest one.
  3. Recompute the mean (centroid) of each cluster.
  4. Check whether the cluster centers have stopped changing, or whether the squared-error function has essentially converged (a stopping condition can be set manually, for example stop when the squared error changes by less than a threshold compared with the previous iteration). If not, repeat steps 2 and 3.
Advantages of the Algorithm

  • Simple and easy to understand.
  • Suitable when the number of clusters (K) is known or can be reasonably estimated.
  • Relatively fast; the basic single-machine implementation, however, is best suited to small datasets.

K-means is a popular clustering algorithm with well-defined steps and clear principles, making it a good choice for many clustering tasks.

Disadvantages of the Algorithm

The drawbacks follow directly from the algorithm's principle. The choice of K is subjective and somewhat arbitrary; a method that could precisely determine the optimal K would be a breakthrough for this algorithm. In addition, the initial cluster centers are chosen at random, outliers in the dataset distort the results, and very large datasets require dedicated solutions.

Improvement Ideas

To eliminate the blind selection of K, several methods can be employed:

  1. Hierarchical Algorithms: Hierarchical clustering organizes the data into a tree-like hierarchy of clusters. It does not require specifying the number of clusters beforehand and provides a visual representation of the data's structure.

  2. Canopy Algorithm: Canopy clustering is an unsupervised method that combines a cheap distance measure with thresholds to pre-group data points. It can be used as an initial clustering step before applying K-means or other partition-based algorithms.

To optimize the initial selection of cluster centers:

  1. Minimum Variance Method: Select the sample with the minimum variance as the center of the first cluster. This removes the randomness of K-means++'s initial choice of the first cluster center.

To ensure the initial cluster centers are well distributed and far apart from one another:

  1. Maximum-Minimum Distance Method: Choose as the next center the point whose minimum distance to the already-chosen centers is the largest. This prevents the initial centers from being too concentrated or dense, keeps them well separated, and better reflects the overall distribution of the dataset (a small sketch follows below).

These techniques improve the selection of initial cluster centers and thus the performance and robustness of clustering algorithms such as K-means.
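
To make the maximum-minimum distance idea concrete, here is a minimal NumPy sketch (the function name, the choice of the first center, and the synthetic data are assumptions for the example, not part of any library API):

import numpy as np

def max_min_distance_init(X, k, first_center_idx=0):
    """Pick k initial centers so that each new center maximizes its
    minimum distance to the centers chosen so far."""
    centers = [X[first_center_idx]]
    for _ in range(1, k):
        # distance of every point to its nearest already-chosen center
        dists = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
        # the point farthest from all existing centers becomes the next center
        centers.append(X[np.argmax(dists)])
    return np.array(centers)

# Example usage on random two-dimensional data
X = np.random.RandomState(0).randn(100, 2)
print(max_min_distance_init(X, k=3))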

Evaluation Metrics

The Silhouette Coefficient, Calinski-Harabasz index (CH index), and Davies-Bouldin index (DB index) are commonly used metrics for evaluating the quality of clustering. They each assess clustering effectiveness from different perspectives:

  1. Silhouette Coefficient: The Silhouette Coefficient measures how similar each data point in a cluster is to the other points in the same cluster compared to the nearest neighboring cluster. It ranges from -1 to 1, where a higher value indicates that the object is better matched to its own cluster and worse matched to neighboring clusters. A higher average Silhouette Coefficient across all data points indicates better-defined clusters.

  2. Calinski-Harabasz Index (CH Index): The CH index, also known as the Variance Ratio Criterion, evaluates the quality of clustering by considering the ratio of the between-cluster variance to the within-cluster variance. Higher CH values indicate more compact and well-separated clusters, suggesting better cluster quality.

  3. Davies-Bouldin Index (DB Index): The DB index assesses cluster quality based on the average similarity between each cluster and its most similar cluster. A lower DB index indicates better clustering, with minimal overlap and more distinct clusters.

These metrics help provide a holistic view of cluster quality, addressing different aspects such as compactness, separation, and cluster assignments. The choice of which metric to use depends on the specific characteristics and goals of the clustering task.

Silhouette Coefficient

The silhouette coefficient focuses on how similar each sample is to the other samples in its own cluster compared with the nearest neighboring cluster. It tends to work well when the differences within clusters or between clusters are pronounced.

It is calculated from the similarity between a sample and its own cluster and the similarity between the sample and the nearest other cluster. Its value ranges from -1 to 1, with values closer to 1 indicating a better clustering. A silhouette coefficient of 0 means the sample sits roughly halfway between two clusters, so it is unclear which cluster it should belong to.

The formula for the silhouette coefficient is:

Silhouette Coefficient = (b - a) / max(a, b)

  • If a data point lies well inside its own cluster and far from the other clusters, its silhouette coefficient is close to 1.
  • If a data point lies on or near the boundary between two or more clusters, its silhouette coefficient approaches 0.
  • If a data point lies inside another cluster, its silhouette coefficient is close to -1.

The calculation involves the following steps:

a. For each point i, compute the average distance to the other points in the same cluster, giving a(i).

b. For each point i, compute the average distance to the points of the nearest other cluster, giving b(i).

c. Compute the silhouette coefficient of the point: s(i) = (b(i) - a(i)) / max(a(i), b(i)).

d. Average s(i) over all points to obtain the overall silhouette coefficient.
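
To make the steps concrete, here is a minimal sklearn-based sketch (the synthetic two-blob data is an assumption for the example); silhouette_samples gives the per-point s(i) from steps a-c and silhouette_score gives the average from step d:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Two synthetic, well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Per-point silhouette values s(i) (steps a-c) and their average (step d)
print("first five s(i):", silhouette_samples(X, labels)[:5])
print("mean silhouette:", silhouette_score(X, labels))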

In PySpark, you can use ClusteringEvaluator to compute the silhouette coefficient. By comparing the silhouette for different values of k, you can select the k that maximizes it.

Calinski-Harabasz Index (CH Index):

The CH (Calinski-Harabasz) index focuses on the cohesion within clusters and the separation between clusters. It tends to work well when the differences within clusters are relatively small.

The CH index is mainly used to compare clusterings with different numbers of clusters. It is defined as the ratio of between-cluster dispersion to within-cluster dispersion, scaled by their degrees of freedom:

CH(K) = [tr(B_K) / (K - 1)] / [tr(W_K) / (n - K)]

Where:

  • B_K is the between-cluster scatter matrix and W_K is the within-cluster scatter matrix.
  • tr(·) denotes the trace of a matrix.
  • K is the number of clusters.
  • n is the total number of data points.

The CH index is not bounded above (it takes values in [0, +∞)); higher values indicate more compact, better-separated clusters, while overlapping or mixed clusters push the index toward low values.

A higher CH index suggests better clustering quality. The steps to calculate it are as follows:

a. Compute the within-cluster scatter matrix W_K and the between-cluster scatter matrix B_K.

b. Take the traces of the two matrices, tr(W_K) and tr(B_K).

c. Compute the Calinski-Harabasz index: CH(K) = [tr(B_K) / (K - 1)] / [tr(W_K) / (n - K)], where n is the number of data points and K is the number of clusters.

d. Choose the value of K that maximizes the CH index as the optimal number of clusters.
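
As a quick illustration, here is a minimal sklearn-based sketch (the synthetic blobs and candidate cluster counts are assumptions for the example) that computes the CH index for several values of K:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Three synthetic, well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal(-6, 1, (50, 2))])

# The CH index should peak near the true number of clusters (3 here)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print("K =", k, "CH =", calinski_harabasz_score(X, labels))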

Davies-Bouldin Index (DB Index):

The DB (Davies-Bouldin) index emphasizes cluster size and the tightness within clusters. It tends to work well when the clusters have similar sizes and are internally compact.

Like the CH index, the DB index is mainly used to compare clusterings with different numbers of clusters, and a smaller DB index indicates better clustering quality. It is defined as:

DB(K) = (1 / K) * Σ_i max_{j ≠ i} [(s_i + s_j) / d(c_i, c_j)]

Where:

  • s_i is the average distance between the samples of the ith cluster and its centroid c_i (a measure of within-cluster scatter).
  • d(c_i, c_j) is the distance between the centroids of clusters i and j.
  • K is the number of clusters.

The calculation steps are as follows:

a. For each cluster Ck, compute the average distance s(Ck) between its samples and its centroid ck.

b. For each pair of clusters Ck and Cj, compute the similarity R(Ck, Cj) = (s(Ck) + s(Cj)) / d(ck, cj), where d(ck, cj) is the distance between the two centroids.

c. For each cluster Ck take the maximum of R(Ck, Cj) over all other clusters, then average these maxima over the K clusters to obtain the Davies-Bouldin index.

d. Choose the value of K that minimizes the DB index as the optimal number of clusters.
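
Analogously, a minimal sklearn-based sketch (same assumed synthetic data as in the CH example) that computes the DB index for several candidate values of K:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Three synthetic, well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal(-6, 1, (50, 2))])

# The DB index should reach its minimum near the true number of clusters (3 here)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print("K =", k, "DB =", davies_bouldin_score(X, labels))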

These metrics can be used to evaluate the quality of a K-means clustering and to choose a suitable value of K. In practice, one usually tries several values of K, evaluates and compares them with these metrics, and keeps the best clustering result.

Application Scenarios

  • User Profiling - Customer Value Segmentation

Customer value segmentation is a key part of customer analysis. Using the e-commerce RFM model, cluster centers are computed on a sample of customers, and each customer's value tier is then determined by the distance between the customer and these cluster centers.
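
A minimal sketch of this idea with sklearn (the RFM values and column names below are made up for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical RFM table: one row per customer with recency (days since last
# purchase), frequency (number of orders) and monetary (total spend)
rfm = pd.DataFrame({
    "recency":   [5, 40, 3, 200, 15, 120],
    "frequency": [12, 2, 20, 1, 8, 3],
    "monetary":  [800.0, 90.0, 1500.0, 30.0, 400.0, 150.0],
})

# Standardize the three RFM features and cluster customers into value tiers
X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
print(rfm)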

  • App Clustering

  • User Geolocation Clustering

  • User Segmentation

  • Recommendations

One practical performance tip when combining algorithms: if there is enough data, run clustering before collaborative filtering. This shortens the time the collaborative filtering algorithm spends selecting relevant neighbors, and each cluster then acts as a shared preference profile for the users belonging to it.
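
A rough sklearn-based sketch of this idea (the synthetic rating matrix, cluster count, and neighbor count are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (rows = users, columns = items)
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(200, 50)).astype(float)

# Step 1: cluster users so neighbor search is restricted to one cluster
user_cluster = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(ratings)

# Step 2: for a given user, look for collaborative-filtering neighbors
# only among users in the same cluster
target = 0
same_cluster = np.where(user_cluster == user_cluster[target])[0]
sims = cosine_similarity(ratings[target:target + 1], ratings[same_cluster])[0]
neighbors = same_cluster[np.argsort(-sims)[1:6]]  # top-5 neighbors, skipping the user itself
print("neighbors of user 0:", neighbors)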

  • IP Clustering

Clustering users by their source IP can reveal geographic concentration and thus surface location-related issues. Similar analyses can expose other non-obvious problems hidden behind monitoring metrics.

Application Example

Clustering of User Paths

Clustering of user paths can be done with a number of algorithms, including K-Means and hierarchical clustering. Here is one possible set of steps for analyzing user paths with K-Means clustering:

Data Preparation: Collect and clean the advertising user-path data, including user interactions with ads such as clicks, views, and purchases.

Data Preprocessing: Standardize or normalize the data to remove the influence of differing scales.

Sequence Construction: Transform the user-path data into sequences, where each sequence contains one user's interactions with ads.

Sequence Labeling: Annotate each sequence (for example, by user attributes or behaviors) to support the subsequent clustering analysis.

Sequence Clustering: Apply a clustering algorithm suitable for sequence data, such as K-Means, hierarchical clustering, or DBSCAN, as sketched below.

Result Analysis: Interpret the clustering results and analyze the characteristics of each group of user-path sequences so that ads can be targeted more precisely.
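
As a minimal sketch of the sequence-clustering step (the example paths and the simple count-based encoding are assumptions; order-aware encodings or sequence-specific distances may work better in practice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical user paths: each string is one user's sequence of ad interactions
paths = [
    "view view click purchase",
    "view click purchase",
    "view view view",
    "view",
    "click view click purchase purchase",
]

# Represent each path by its event counts (a simple, order-free encoding)
X = CountVectorizer().fit_transform(paths).toarray()
X = StandardScaler().fit_transform(X)

# Cluster the encoded paths
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)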

Code Implementation

Pure Python implementation

  • Explanation:
    First, we use NumPy to generate 100 random two-dimensional data points. Then we run the K-Means algorithm on them: in each iteration we compute the distance from every point to each cluster center, record the cluster assignment of every point, and update the centers. Finally, we visualize the clustering result with Matplotlib.

It is worth noting that in practice the K-Means algorithm usually needs to be adjusted and tuned for the task at hand, for example by choosing a different distance metric, initialization method, or number of iterations to improve the clustering result.

import numpy as np
import matplotlib.pyplot as plt

# Generate random data
np.random.seed(0)
X = np.random.randn(100, 2)

# Initialize the cluster centers
n_clusters = 3
centers = np.zeros((n_clusters, 2))
centers[0] = np.array([-4.0, -2.0])
centers[1] = np.array([0.0, 2.0])
centers[2] = np.array([4.0, -2.0])

# Run the K-Means algorithm
iterations = 100
labels = np.zeros(X.shape[0], dtype=int)
for i in range(iterations):
    # Compute the distance from every point to every cluster center
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Assign every point to its nearest cluster center
    labels = np.argmin(distances, axis=1)
    # Update each cluster center as the mean of the points assigned to it
    for j in range(n_clusters):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)

# Visualize the clustering result
colors = ['r', 'g', 'b']
for j in range(n_clusters):
    plt.scatter(X[labels == j, 0], X[labels == j, 1], c=colors[j], label=j)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.legend()
plt.show()

Implementation with the sklearn module

  • Explanation:

First, we take the numeric feature columns of the data and standardize them. The DataFrameMapper wraps the StandardScaler and maps the DataFrame into the array format expected by the KMeans model. Finally, we obtain the predicted cluster labels and add them back to the original data.



import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assume your data is a pandas DataFrame whose columns are numeric features
data = pd.DataFrame(<your data>)

# Feature scaling: standardize every feature column and map the DataFrame
# into the array format expected by KMeans
mapper = DataFrameMapper([(list(data.columns), StandardScaler())])
normalized = mapper.fit_transform(data)

# Run the K-Means algorithm
kmeans = KMeans(n_clusters=2, random_state=0)
predictions = kmeans.fit_predict(normalized)

# Add the predictions back to the original data
data['prediction'] = predictions

# Show the first few rows
print(data.head())

PySpark implementation, version 1

  • Explanation

This assumes your data is stored in LIBSVM format; if it is in a different format, adjust the data-loading part accordingly. Also, replace "path_to_your_data" with your actual data path.

In this example, we use the K-Means algorithm to cluster the data, evaluate the quality of the clustering result, and finally print the center of each cluster.


from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Create a SparkSession
spark = SparkSession.builder.appName('KMeans Example').getOrCreate()

# Load the dataset
dataset = spark.read.format("libsvm").load("path_to_your_data")

# Configure the KMeans algorithm
kmeans = KMeans(k=3, seed=1, featuresCol="features")

# Train the model
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate the clustering model (silhouette score)
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Show the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

PySpark implementation, version 2

  • Explanation

This example uses the iris dataset, which has 3 classes with 50 samples each, and groups it into 3 clusters with K-Means. When building the model we pass k=3 and seed=42, where k is the number of clusters and seed is the seed of the random number generator. Finally, we call transform() on the DataFrame to assign each row to a cluster and display the result.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Create a SparkSession
spark = SparkSession.builder.appName("KMeans").enableHiveSupport().getOrCreate()

# Load the iris data; we assume a headerless comma-separated file whose first
# four columns are the numeric features and whose last column is the species
df = spark.read.csv("iris.txt", inferSchema=True) \
    .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

# Assemble the numeric columns into a single "features" vector column
assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")
df = assembler.transform(df)

# Create the KMeans model
kmeans = KMeans(k=3, seed=42, featuresCol="features")

# Train the model
model = kmeans.fit(df)

# Make predictions
prediction = model.transform(df)

# Show the results
prediction.show()

PySpark implementation, version 3

  • Explanation

First, we create a DataFrame with four samples, each having a feature vector 'features' and a target variable 'label'. We then pass the feature vectors through a VectorAssembler, create a KMeans model instance, fit it, and use the trained model to assign each of the original rows to a cluster. Finally, we print the predictions and the cluster each sample belongs to.

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector

# Create a SparkSession
spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# A DataFrame with a vector column 'features' and a string column 'label'
df = spark.createDataFrame([(DenseVector([1.0, 2.0]), "A"),
                            (DenseVector([3.0, 4.0]), "B"),
                            (DenseVector([5.0, 6.0]), "A"),
                            (DenseVector([7.0, 8.0]), "B")], ["features", "label"])

# Pass the vector column through a VectorAssembler into the column KMeans will read
vectorAssembler = VectorAssembler(inputCols=["features"], outputCol="featuresNum")
df = vectorAssembler.transform(df)

# Create the KMeans model instance (K-Means is unsupervised, so 'label' is not used)
kmeans = KMeans(k=2, seed=1, featuresCol="featuresNum")

# Train the model
kmeansModel = kmeans.fit(df)

# Make predictions
predictions = kmeansModel.transform(df)

# Print the predictions
print("Cluster assignments:")
predictions.select("featuresNum", "prediction").show()

# Print the cluster each sample belongs to
print("Sample points and their cluster assignments:")
for i, row in enumerate(predictions.select("featuresNum", "prediction").collect()):
    print(f"Sample {i + 1}: {row['featuresNum']}, Cluster={row['prediction']}")

PySpark implementation, version 4 (including code for choosing the best number of clusters)

  • Explanation

We first build a small DataFrame of sparse feature vectors with string labels. We then fit a K-Means model for each candidate number of clusters, score every model with the silhouette metric from ClusteringEvaluator, and keep the value of k with the best score.

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

# Create a SparkSession
spark = SparkSession.builder.appName("KMeansBestK").getOrCreate()

# Prepare a small DataFrame of sparse feature vectors with string labels
data = [(Vectors.sparse(5, {0: 4.0, 1: 7.0}), "a"),
        (Vectors.sparse(5, {0: 5.0, 1: 6.0}), "a"),
        (Vectors.sparse(5, {0: 4.5, 1: 6.5}), "a"),
        (Vectors.sparse(5, {2: 6.0, 3: 8.0, 4: 9.0}), "b"),
        (Vectors.sparse(5, {2: 7.0, 3: 7.0, 4: 8.0}), "b"),
        (Vectors.sparse(5, {2: 6.5, 4: 9.5}), "b")]
dataset = spark.createDataFrame(data, ["features", "label"])

# Evaluate the silhouette score for each candidate k and keep the best one
evaluator = ClusteringEvaluator(
    metricName="silhouette", featuresCol="features", predictionCol="prediction")

best_k, best_score = None, float("-inf")
for k in [2, 3]:
    model = KMeans(k=k, seed=1, featuresCol="features").fit(dataset)
    score = evaluator.evaluate(model.transform(dataset))
    print("k = %d, silhouette = %f" % (k, score))
    if score > best_score:
        best_k, best_score = k, score

print("Best number of clusters: %d (silhouette = %f)" % (best_k, best_score))

PySpark evaluation-metric code

Silhouette Score


from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Assume `dataset` is your data and `k_values` is the set of candidate k values
for k in k_values:
    model = KMeans(k=k, seed=1, featuresCol="features").fit(dataset)
    predictions = model.transform(dataset)
    silhouette = ClusteringEvaluator().evaluate(predictions)
    print("Silhouette with k = {}: {}".format(k, silhouette))

Elbow Method:

Compute the Within-Cluster Sum of Squares (WCSS) for different values of k and plot how WCSS changes with k. Choose the k at the "elbow" of the curve, i.e. the turning point after which WCSS stops dropping sharply.


from pyspark.ml.clustering import KMeans

# Assume `dataset` is a DataFrame with a "features" column.
# Compute the WCSS (training cost) for each candidate k; in Spark 2.4+ the
# fitted KMeansModel exposes it as model.summary.trainingCost.
k_values = [2, 3, 4, 5, 6]
wcss = []
for k in k_values:
    model = KMeans(k=k, seed=1, featuresCol="features").fit(dataset)
    wcss.append(model.summary.trainingCost)
    print("k = %d, WCSS = %f" % (k, model.summary.trainingCost))

# Plot wcss against k_values and pick the k at the "elbow" of the curve,
# i.e. where the decrease in WCSS starts to level off

Cross-Validation:

For larger datasets, or when a more careful choice is needed, a cross-validation-style procedure can be used. PySpark's KMeansModel provides a computeCost method that returns the sum of squared distances from points to their cluster centers, which can serve as the validation metric.
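
A rough sketch of this idea, assuming `dataset` is a DataFrame with a "features" column (note that computeCost is deprecated in recent Spark releases in favor of ClusteringEvaluator):

from pyspark.ml.clustering import KMeans

# Hold out part of the data and compare k values by the cost of the held-out points
train, valid = dataset.randomSplit([0.8, 0.2], seed=1)
for k in [2, 3, 4, 5]:
    model = KMeans(k=k, seed=1, featuresCol="features").fit(train)
    cost = model.computeCost(valid)  # sum of squared distances to the nearest center
    print("k = %d, held-out cost = %f" % (k, cost))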

Sankey Diagrams

Clustering results are often visualized with heatmaps, scatter plots, dendrograms, and other charts. For large datasets, however, especially when there are many clusters, these traditional visualizations can struggle to convey all the information; in such cases a Sankey diagram is a good choice.

Sankey diagrams present data flows and transitions in a layered way, which fits the idea of clustering well. They are also insensitive to data scale, so even very large datasets can be displayed effectively.

One common way to draw such a flow diagram in Python is with the NetworkX library.

  • Explanation:

First we define a small directed graph with edge capacities, then use nx.maximum_flow to compute the maximum flow from the source to the sink. Next we draw the graph with nx.draw, label each edge with its computed flow, and finally display the figure with plt.show().


import matplotlib.pyplot as plt
import networkx as nx

# Define the nodes and edges; maximum_flow needs a 'capacity' on every edge
nodes = ['A', 'B', 'C', 'D', 'E']
edges = [('A', 'B', 3.0), ('A', 'C', 2.0), ('B', 'D', 3.0), ('C', 'E', 2.0)]

# Build the directed graph
G = nx.DiGraph()
G.add_nodes_from(nodes)
for u, v, cap in edges:
    G.add_edge(u, v, capacity=cap)

# Compute the maximum flow from A to E
flow_value, flow_dict = nx.maximum_flow(G, 'A', 'E')

# Draw the flow diagram, labelling each edge with its flow
plt.figure(figsize=(10, 6))
pos = nx.spring_layout(G, seed=0)
nx.draw(G, pos, with_labels=True, node_color='skyblue', edge_color='black', width=1.5)
edge_labels = {(u, v): flow_dict[u][v] for u, v in G.edges()}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)

# Add a title
plt.title('Sankey Diagram')

# Show the figure
plt.show()

Summary of related clustering code (continuously updated)