Particle swarm optimization: application to clustering of high-dimensional data
Publication date: 2009
Clustering high-dimensional data is an important but difficult task in many data mining applications. A fundamental starting point for data mining is the assumption that a data object, such as a text document, can be represented as a high-dimensional feature vector. Traditional clustering algorithms struggle with high-dimensional data because the quality of their results deteriorates under the curse of dimensionality. As the number of features increases, the data become very sparse and distance measures over the whole feature space become meaningless. In a high-dimensional data set, some features are usually irrelevant or redundant for the clusters, and different sets of features may be relevant for different clusters. Clusters can therefore often be found in different feature subsets rather than in the whole feature space. Clustering such data sets is called subspace clustering or projected clustering, and it aims at finding clusters in different feature subspaces. However, the performance of many subspace/projected clustering algorithms drops quickly with the size of the subspaces in which the clusters are found. Many of them also require user-supplied domain knowledge to select and tune their settings, such as the maximum distance between dimensional values, the thresholds of input parameters and the minimum density, which are difficult to set. Developing effective particle swarm optimization (PSO) for clustering high-dimensional data is the main focus of this thesis. First, to improve the performance of the conventional PSO algorithm, we analyze the main causes of premature convergence and propose a novel PSO algorithm, called InformPSO, based on the principles of adaptive diffusion and hybrid mutation. Inspired by the physics of information diffusion, we design a function that achieves better particle diversity by taking into account the distribution of the particles and the number of evolutionary generations, and by adjusting their "social cognitive" abilities.
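For orientation, the conventional PSO that the thesis takes as its starting point can be sketched as below. This is the textbook global-best variant, not InformPSO: the function name `pso_minimize`, the inertia weight `w`, the coefficients `c1` and `c2`, the swarm size and the sphere test function are all assumptions chosen for illustration. The `c2` term is the "social" component that pulls every particle toward gBest, which is the part of the update that an adaptive scheme would adjust.

```python
import random

def pso_minimize(f, dim, bounds, n_particles=20, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO (not InformPSO); all defaults are assumptions."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # cognitive term
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # social term
                # Keep the particle inside the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Unimodal benchmark: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)

best, best_val = pso_minimize(sphere, dim=5, bounds=(-5.0, 5.0))
```

Because every particle is attracted to the same gBest, the swarm can collapse onto a local optimum early on multimodal functions, which is the premature convergence the abstract refers to.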
Based on genetic self-organization and chaos evolution, we build clonal selection into InformPSO to implement local evolution of the best particle candidate, gBest, and use a Logistic sequence to control the random drift of gBest. These techniques contribute greatly to escaping from local optima. The global convergence of the algorithm is proved using Markov chain theory. Experiments on the optimization of unimodal and multimodal benchmark functions show that, compared with some other PSO variants, InformPSO converges faster, finds better optima, is more robust, and prevents premature convergence more effectively. Then, special treatments of objective functions and encoding schemes are proposed to tailor PSO to two problems commonly encountered in studies of high-dimensional data clustering. The first problem is the variable weighting problem in soft projected clustering with a known number of clusters k. With k preset, the problem is to find a set of variable weights for each cluster, and it is formulated as a nonlinear continuous optimization problem subject to bound constraints. A new algorithm, called PSOVW, is proposed to find optimal variable weights for clusters. In PSOVW, we design a suitable k-means objective weighting function, in which a change of variable weights is exponentially reflected. We also transform the original constrained variable weighting problem into a problem with bound constraints, using a non-normalized representation of variable weights, and we use a particle swarm optimizer to minimize the objective function in order to obtain global optima of the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that the proposed algorithm greatly improves cluster quality. In addition, the results of the new algorithm are much less dependent on the initial cluster centroids.
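The abstract does not state the PSOVW objective, so the following is only a sketch of one way a k-means objective can reflect variable weights exponentially while leaving just bound constraints for the optimizer. It assumes that each cluster carries raw weights in [0, 1], that these are normalized inside the objective (so no sum-to-one constraint is imposed on the search), and that the normalized weight is raised to an exponent `beta`. The name `weighted_kmeans_objective`, the exponent value and the toy data are hypothetical.

```python
def weighted_kmeans_objective(data, centroids, raw_weights, beta=2.0):
    """Sum, over all points, of the weighted squared distance to the closest centroid.

    raw_weights[l][j] is the non-normalized weight of variable j in cluster l.
    Only bound constraints (e.g. 0 <= weight <= 1) are expected from the caller;
    normalization happens here. `beta` is an assumed exponent.
    """
    normalized = []
    for w in raw_weights:
        s = sum(w)
        # Fall back to uniform weights if a cluster's raw weights are all zero.
        normalized.append([wj / s for wj in w] if s > 0 else [1.0 / len(w)] * len(w))

    total = 0.0
    for x in data:
        total += min(
            sum((nw[j] ** beta) * (x[j] - c[j]) ** 2 for j in range(len(x)))
            for c, nw in zip(centroids, normalized)
        )
    return total

# Toy data: variable 0 separates the clusters, variable 1 is noisy within clusters.
data = [(0.0, 3.0), (0.0, -3.0), (10.0, 3.0), (10.0, -3.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]

relevant_first = weighted_kmeans_objective(data, centroids, [[0.9, 0.1], [0.9, 0.1]])
noisy_first = weighted_kmeans_objective(data, centroids, [[0.1, 0.9], [0.1, 0.9]])
rescaled = weighted_kmeans_objective(data, centroids, [[0.45, 0.05], [0.45, 0.05]])
```

In this toy case, weighting the compact variable more heavily gives a lower objective than weighting the noisy one, and rescaling the raw weights leaves the value unchanged. A particle in a PSO would encode the full matrix of raw weights and use such a function as its fitness.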
The second problem is to determine the number of clusters k automatically while also identifying the clusters. It too is formulated as a nonlinear optimization problem with bound constraints. For the automatic determination of k, which is troublesome for most clustering algorithms, a PSO algorithm called autoPSO is proposed. A special coding of particles is introduced into autoPSO to represent partitions with different numbers of clusters in the same population. The Davies-Bouldin (DB) index is employed as the objective function to measure the quality of partitions with similar or different numbers of clusters. autoPSO is run on both synthetic high-dimensional datasets and handcrafted low-dimensional datasets, and its performance is compared with that of other selected clustering techniques. Experimental results indicate the promising potential of autoPSO for clustering high-dimensional data without a preset number of clusters k.
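To illustrate why the DB index can compare partitions with different numbers of clusters, here is a minimal sketch of the standard Davies-Bouldin definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation, then average over clusters (lower is better). The function name `davies_bouldin` and the toy data are assumptions, and the sketch says nothing about how autoPSO encodes particles.

```python
import math

def davies_bouldin(data, labels):
    """Standard Davies-Bouldin index of a hard partition; lower is better."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("the DB index needs at least two non-empty clusters")

    centroids, scatter = {}, {}
    for lab, pts in clusters.items():
        c = [sum(col) / len(pts) for col in zip(*pts)]
        centroids[lab] = c
        # Scatter: mean Euclidean distance of the cluster's points to its centroid.
        scatter[lab] = sum(math.dist(p, c) for p in pts) / len(pts)

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster
        # (assumes no two centroids coincide).
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)

# Two well-separated square blobs.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
db_two = davies_bouldin(data, [0, 0, 0, 0, 1, 1, 1, 1])    # natural partition, k = 2
db_three = davies_bouldin(data, [0, 0, 1, 1, 2, 2, 2, 2])  # first blob split, k = 3
```

On this toy data the natural two-cluster partition scores lower than the partition that splits one blob in two, so an optimizer minimizing the index over candidates with different k would prefer k = 2.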
- Sciences – Theses