Introduction
The journal entitled “Efficient Clustering Algorithms” aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and social sciences. In the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains. This paper surveys clustering methods and approaches available in literature in a comparative way.
Clustering has been one of the most widely studied topics in data mining and it is often the first step of data mining process. Today, it has been witnessing enormous growth in data volume. Often, data is distributed or it can be in the form of streaming data. Efficient clustering in all these scenario becomes a very challenging problem.k-means clustering has been one of the popular clustering algorithms. It requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets and also for streaming data. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. The research group has developed a new algorithm for very large data clustering which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. One contribution of this journal is the implementation and evaluation of that algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means. The concept of this were extended to develop another clustering algorithm for distributed data set. The performance of this algorithm is compared with parallel k-means algorithm. The distributed algorithm gives better performance than parallel k-means for loosely connected data sources and in the presence of imbalanced data in remote sources. The framework of clustering evolving streaming data was developed. In this framework, two exact and two approximate algorithms were provided . Two of the algorithms use outliers to detect changes in cluster centers in streaming data. The approximate outlier based algorithm takes less memory and can compute high quality approximate cluster centers much faster than k-means algorithm. The exact algorithm is faster than k-means when the cluster center movement is little. If cluster center movement is much enough, the exact algorithm however takes more time than k-means algorithm. The experiments show that this framework can be very effective in clustering evolving streaming data.
Clustering
Clustering has been one of the most widely studied topics in data mining. Clustering algorithms group similar objects into clusters. A cluster is therefore a collection of objects which are similar between themselves and are dissimilar with objects belonging to other clusters. Formally, given a set of ‘d’ dimensional points and a function ‘f ‘ : Rd X Rd ->R that computes the distance between two points in Rd _, it is required to compute ‘k’ cluster centers, such that the points belonging to the same cluster are similar and points that are in different cluster are dissimilar. Often, clustering is very fast task in any data mining application. Most of the initial clustering techniques were developed by statistics or pattern recognition communities, where the goal was to cluster a modest number of data instances which often located on a single machine. However today, it was witnessed an enormous growth in the amount of collected data. For example, Walmart processes two million transactions per day, AT&T collects millions of call logs every hour, Amazon or other e-commerce companies collect millions of web logs every hour, high energy physics experiments generate data in tera bytes, etc. Additionally, it can also have data distributed in remote machines or it can have streaming type of data which is potentially infinite.
In this scenario, traditional clustering algorithms are inefficient because they were not designed to process large amounts of data or to operate in distributed or streaming environments. Hence, developing fast and efficient clustering algorithms for massive or distributed or streaming data has been identified as challenging problem in the data mining community. Massive data sets are known as out of core data as they do not fit entirely in main memory. In this case, the run time of the algorithm is dominated by the cost of I/O operation or number of disk scan. Hence, ideally the clustering algorithm for massive data should be optimized for the number of scans over the disk. Joydeep Ghosh states “The holy grail of scalable clustering can be to find near linear algorithm that involve only a small number of passes through the database”.
Distributed data may reside in remote loosely-coupled machines. In this case, communication between the nodes may be a bottleneck. Hence, a distributed algorithm is required. The runtime will be dominated by the slowest communicating nodes. Hence, the goal of distributed algorithm should be to minimize the effort and time expended on communication. Streaming data is potentially infinite and it is not possible to store all the data. Hence, the algorithms for streaming data should preferably be one pass and on-line and they should also require small amount of memory for their execution. The requirement of small memory is targeted towards mobile devices or sensor network nodes.
K Means Clustering
The work employs the k-means clustering. The K-means clustering algorithm was developed by MacQueen. Bottou and Bengio proved the convergence properties of the K-means algorithm. It has been shown to be very useful for a body of practical applications. The k-means algorithm is not suitable for massive or distributed or streaming data.
A problem with the k-means algorithm is that it makes a complete scan over the entire data for every iteration, and it requires many such iterations before converging to a quality solution. This makes it potentially very expensive algorithm to use, particularly for large disk-resident datasets.
Furthermore, in the context of streaming data, it may not always possible to store the complete data. This makes k-means algorithm impractical to use for streaming data. For distributed data, one can use parallel k-means algorithm but this algorithm requires that nodes communicate for every iteration. This makes the parallel k-means very expensive to use in presence of communication delays among the nodes. The communication problem is especially prominent in loosely connected remote machines. Also, the parallel k-means works on uniformly partitioned data. The nodes are said to posses data imbalance if the amount of distributed data is not uniform at each node. In presence of such data imbalance, the runtime of parallel k-means algorithm will be dominated by the processing time of the node which contains maximum amount of data. Moreover, other standard practice for clustering distributed data is to download and merge all data in a single machine and then running a sequential clustering algorithm on it. But, downloading huge amount of data and running sequential algorithm is clearly not an efficient solution for clustering distributed data. Hence, a distributed clustering algorithm is required which optimizes the communication and can also operate in presence of data imbalance. A number of algorithms or approaches focus on reducing the number of passes required over the data sets. These researches have been done in the context of clustering massive data or streaming data. However, these approaches only provide approximate solutions, possibly with deterministic or probabilistic bounds on the quality of the solutions.
Also, there is no distributed algorithm available at present for clustering which optimizes communication and can work in presence of data imbalance and can provide exact results. Therefore, the journal proposes several efficient algorithms which can produce exact results as k-means algorithm for large out of core data, distributed data and streaming data.
Contribution
The main contribution of this journal is as following:
Ø Implementation and evaluation of a Fast and Exact out of core k-means algorithm (FEKM).
Ø Development and evaluation of a Distributed Fast and Exact k-means algorithm (DFEKM)extending the concepts of FEKM.
Ø Development and evaluation of a set of algorithms for clustering evolving streaming data.
The main idea in the FEKM is as follows.
The initial process uses sample the data and run the original k-means algorithm. The computed centers were stored after each iteration of the run of the k-means on the sampled data. Then this information is used and complete a pass over the entire dataset. After the identification the points which are more likely to shift from one cluster to another are stored , as the cluster centers could move. These points are now used to correct the cluster centers.
The DFEKM is an extension of FEKM for distributed data set. In this algorithm, samples are collected from individual data sources (nodes). Then, the central node runs kmeans and it stores the centers at each iteration as it does in FEKM, then it sends that information to all other nodes.
Individual nodes then process their data in one pass and send all information to the central node. The central node then merge all information and
find out the correct centers.
Streaming data clustering algorithms have been developed considering a sliding window model for stream. In this model, each arriving data point has a time stamp and the interest is finding k clusters within a specific time window. Only last points are relevant inside a sliding window of size . This is a popular approach in mining streaming
data. In this journal, four streaming data clustering algorithms were developed.
The algorithms were referred as “Fast and Exact Stream k-means” (FESK) and “Fast and Approximate Stream k-means” (FASK), “Exact Adaptive Stream k-means” (EADSK) and “Approximate Adaptive Stream k-means” (AADSK). FESK and EADSK can find out the exact cluster centers in a sliding window as what would have been found by a multi-pass kmeans algorithms. FASK and AADSK are approximate algorithms.
These algorithms can deal with evolving streaming data. However, if the cluster centers evolve gradually within a pre-estimated bound then FASK and AADSK both will find the exact centers. EADSK and AADSK are designed to determine outliers from the data set. The extensive experiments were conducted with synthetic and real data sets for all these algorithms to evaluate their performance for different scenarios.
LITERATURE REVIEW
Main Idea of Fast and Exact K-Means Algorithm (FESK)
Fast and Exact K-Means Algorithm (FEKM) has been designed by the research group. One contribution of this journal is implementation and evaluation of the algorithm. This chapter first gives the main idea of the algorithm and implementation details
The basic idea behind the Fast and Exact K-means algorithm is as follows.
It has been believed that approximate cluster centers computed using sampling can be corrected and moved to exact cluster centers using only one or a small number of passes on the entire data. By exact cluster centers, it refers to the cluster centers that are computed by the original k-means algorithm. Thus, the sampling can be used to speedup the computation of exact clusters.
There are three key questions to be addressed.
First, when approximate cluster centers are computed using sampling, what information need to be stored. Second, how can this information be used to avoid a large number of passes on the entire dataset. Third, how to know that have it has been able to achieve the same cluster centers as in the original k-means algorithm.
Initially the k-means algorithm is made to run on a sample, using the same convergence criteria and same initial points as it has been used for the k-means. The following information is stored for future use. After every iteration of k-means on the sampled data, the centers that have been computed were stored. In addition, another value, referred to as the Confidence Radius of each cluster was computed and stored, whose computation will be described later. This information can be stored in a table with columns, and the number of rows equaling the number of iterations for which k-means was run. Each entry of the table contains a tuple (center, radius) for each cluster.
Next, one pass through the entire dataset has been completed. For every point and each row of the table, the cluster to which this point will be assigned at this iteration is determined, assuming that executing the algorithm on the entire dataset produces the same cluster centers as the initial run on sampled data. Next, it has been tried to estimate how likely it is that this point will be assigned to a different cluster when the algorithm is executed on the entire dataset.
The goal is to identify and store the points which could be assigned to a different cluster during any of the iterations. These points are refereed to as boundary points, because intuitively, they fall at the boundary of the clusters. If these points could be identified and stored in memory, any need for any further passes on the entire dataset can be eliminated. However, estimation of these points were made randomly , which implies that it requires additional passes if the estimate is not correct.
Thus, for a given point and row of the table, if this point is a boundary point is determined. If it is, it is stored in a buffer. Otherwise, updation of the sufficient statistics tuples were made , which has the number and sum of the data points for the cluster.After the pass through the dataset and storing the boundary point, the following processing has been done. Starting from the first row of the table, the recomputation of centers were made using the boundary points and sufficient statistics tuple. If any of the new computed centers fall outside the pre-estimated confidence radii which means that the computation of boundary points is not valid, there is a need to take another pass through the data. The new centers were used as new initialization points and again repeat all the steps. However, if the new computed centers are within the confidence radius, these centers were used for the next iteration and continue. The key observation is that using cluster centers from sampling, boundary points, and sufficient statistics, whether it is possible to compute the same cluster centers that it has been gotten through one pass on the entire dataset. Finally, the algorithm terminates by checking for the same termination condition that one would use in the original algorithm.
Main Idea of Distributed Fast and Exact k-means Algorithm(DESK)
The idea of FEKM has been extended for a distributed data set. The distributed algorithm has been designed and implemented using MPI. It has been referred as Distributed Fast and Exact K-Means algorithm or DFEKM. This chapter first presents the premise behind the distributed implementation and later provides the details of experiments and evaluation of this new algorithm. the same definitions and notations were used as FEKM.
The basic idea behind the Distributed Fast and Exact K-means algorithm is as follows.
DFEKM uses ideas from FEKM where pre-computed approximate cluster centers can be corrected and moved to exact cluster centers using only one or a small number of passes on the entire data, and only one or a small number of rounds of communication between data sources and the central site. There are three key questions to be addressed. First, when approximate cluster centers are computed using sampling, what information needs to be stored. Second, how can this information be used to process data independently on each data source, and to avoid frequent or high volume communication. Third, how it can be determined, in a distributed environment, that it has been able to achieve the same cluster centers as in the centralized k-means algorithm.
The sample data has been taken from each data source, and communicate it to the central node. Then, on the central node, the k-means algorithm has made to run on this sampled data. The following information is stored for processing on each data source. After every iteration of k-means on the sampled data, the k centers that have been computed has been stored . In addition, it compute and store another value, referred to as the Confidence Radius of each cluster . This information is then used to build the CAtable structure while designing FEKM.
Then, it send the table to all the data sources. Next, at each data source, one pass were taken through the portion of the dataset available at that data source. For every point and each row of the table, it compute the cluster to which this point will be assigned at this iteration. It has been estimated if the point is a stable point which was done in FEKM.The sufficient statistics of the (number, linear sum and square sum) stable points were stored and the points which are not stable or the boundary point were also stored exactly as discussed in the context of FEKM.
After the pass through the dataset and storing the boundary point, all the nodes send their boundary points and sufficient statistics to central node. The central node then does the following processing. Starting from the first row of the table, it recomputes centers using the boundary points and sufficient statistics tuple. If any of the new computed centers fall outside the pre-estimated confidence radius, which means that the computation of boundary points is not valid, it is needed to send the last corrected centers to all other nodes. Using these centers as the new initialization points, it has to go through another iteration and repeat all the steps. However, if the new computed centers are within the confidence radius, these centers for the next iteration were used and continued.
Main Ideas of Clustering evolving Streaming Data Algorithms
Clustering Evolving Streaming Data
A requirement of mining streaming data is that it can not store all data points and must produce the result rapidly, ideally in one scan on the arriving data. A popular paradigm in streaming data algorithms collects a synopsis of the incoming data in a single pass and then works on the synopsis to perform the mining task. In this journal, the clustering data points in a sliding window over the stream has been studied . The sliding window model is a commonly used mechanism for discounting old data. Here, only the last elements to arrive in the stream are considered relevant for analysis, where is the window size. Researchers have developed efficient algorithms to maintain approximate k-medians or k-centers in the sliding window. Jiawei Han, Charu Aggarwal and their research group notice that one aspect in streaming data clustering is to deal with the evolving nature of streaming data. It has been further noticed that it is important to consider detecting the change in cluster location in the sliding window in case of stream clustering. This is important because the clusters may move or change in different manner with time. It may also happen that for a particular time frame data points do not form distinct clusters. It has been noticed that the streaming data clustering algorithm should exploit the evolving nature of streaming data. Suppose, if it has been already found the clusters in a particular time frame, then it is possible to utilize the previous results to obtain the clusters in the subsequent time frames. It is observed that if the clusters do not move much enough it is possible to find out the new clusters quite fast using the previous results. In this journal the two exact and two approximate algorithms were developed which exploit the evolving nature of streaming data to find out clusters in different time frames.
The exact algorithms find the exact same final results as what would have been obtained by k-means algorithm. As it has been discussed in previous chapters, the k-means algorithm makes one scan over the entire dataset for every iteration, and it needs many such iterations before converging to a quality solution. This makes it potentially very expensive to use for streaming data. Also, one has to store all the data points in an window for the k-means to process the data. This can also be a problem if the computing device has small memory.
In the framework, one can choose clustering algorithms based on the requirements and system constraints to obtain exact or approximate cluster centers. For example, suppose, one is interested in obtaining clusters in a sliding window where the computation has to be performed in a mobile device which has a small main memory.
The basic idea behind the Fast and Exact Stream K Means (FESK), Fast and Approximation Stream K-means (FASK),Exact Adaptive Stream K-Means(EASK), Approximate Adaptive Stream K Means(AADSK) were as follows.
In this case, the objective may to obtain approximate cluster centers using the small memory. On the other hand, if the objective is to obtain exact cluster centers in a sliding window over the stream then one has to use an algorithm guaranteed to obtain the exact centers. The algorithms were referred to as “Fast and Exact Stream K-Means” (FESK) and “Fast and Approximate Stream K-Means” (FASK), “Exact Adaptive Stream K-Means” (EADSK) and “Approximate Adaptive Stream K-Means” (AADSK).
In a typical sliding window model it contains a sliding window for each time frame. At each sliding window, last data points are relevant for the analysis. In the framework, it initially cluster the data points in the first window using k-means algorithm. Then the cluster centers were stored and the cluster radius were estimated based on the number of data points within the cluster. In the next window, these cluster centers and radius values were used to classify the data points as stable points, boundary points and outlier points. The stable points are points which are most likely to retain their cluster membership. Boundary points are most likely to change their cluster membership. The outlier points are the points which are far away from its closest cluster and also quite far away from the second closest cluster.
The first two algorithms Fast and Exact Stream k-means (FESK) and Fast and Approximate Stream k-means (FASK) use the stable and boundary points and the other two algorithms, Exact Adaptive Stream k-means (EADSK) and Approximate Adaptive Stream k-means (AADSK) use stable, boundary and outlier points. The confidence radius were used in a slightly different manner than that which has defined in previous chapters. Here, the confidence radius is an upper bound of movement of a cluster center in the sliding window over a time frame. The confidence radius can be computed as a fraction of cluster radius where cluster radius is the average distance of any point in one cluster. In FESK and FASK, the sufficient statistics of the stable points and the boundary points were also stored . After that, the final centers were computed using those sufficient statistics and the boundary points.
In FESK, it is verified if the final centers are falling outside the pre-defined confidence radius. In that case, run k-means algorithm has to be made run with all the data points in that window again. Otherwise, it has been done with the clustering. In FASK, it does not verify the final centers and it continues processing to the next window.
The first two algorithms Fast and Exact Stream k-means (FESK) and Fast and Approximate Stream k-means (FASK) use the stable and boundary points and the other two algorithms, Exact Adaptive Stream k-means (EADSK) and Approximate Adaptive Stream k-means (AADSK) use stable, boundary and outlier points. The confidence radius were used in a slightly different manner than that which has defined in previous chapters. Here, the confidence radius is an upper bound of movement of a cluster center in the sliding window over a time frame. The confidence radius can be computed as a fraction of cluster radius where cluster radius is the average distance of any point in one cluster. In FESK and FASK, the sufficient statistics of the stable points and the boundary points were also stored . After that, the final centers were computed using those sufficient statistics and the boundary points.
In FESK, it is verified if the final centers are falling outside the pre-defined confidence radius. In that case, run k-means algorithm has to be made run with all the data points in that window again. Otherwise, it has been done with the clustering. In FASK, it does not verify the final centers and it continues processing to the next window.
CONCLUSION
In this article, a fast out of core clustering technique (FEKM) were implemented and evaluated. Then the ideas of that algorithm were extended and did an implementation (DFEKM.) for distributed data. Later a set of algorithms (FESK, FASK, EADSK, AADSK) were developed for clustering evolving streaming data.
The FEKM typically requires one or a small number of passes on the entire dataset. This can significantly reduce the execution times for clustering on large or disk-resident datasets, with no compromise on the quality of the results. While a number of approaches existed for approximating k-means or similar algorithms with sampling or using a small number of passes, none of these approaches could provably produce the same cluster centers as the original k-means algorithm. The basic idea in the algorithm is to use sampling to create approximate cluster centers, and use these approximate cluster centers for speeding up the computation of correct or exact cluster centers. The experimental evaluation on a number of synthetic and real datasets have shown a speedup between 2 and 4.5.
The results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down-loading all data and running sequential kmeans, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance. In all the experiments, DFEKM needed only 2 passes on the entire dataset and 3 rounds of interprocessor communication.
FESK and EADSK give exact cluster centers as what multi pass KMeans algorithm would have produce. FASK and AADSK are approximate algorithms and they perform same as the two exact algorithms if the cluster centers are moving within the pre-estimated confidence radius. The approximate algorithms perform faster than the exact algorithms if the centers are moving outside the confidence radius at each sliding window. The experiment shows that the AADSK produces better results than FASK for this case. One can develop a framework of clustering streaming data with these algorithms. Depending on particular requirements and system constraints one can use exact or approximate algorithms. This framework can also be used for applications which require outlier detection.
There can be a few possible future work from this journal. One of them can be to develop the distributed version of the stream clustering algorithms. The another work can be in the direction of improving the algorithm to handle a variable number of clusters instead of a fixed . This will be particularly useful in case the data stream evolves such that number of clusters change. It is also possible to develop a framework of incremental clustering using the concepts from the stream algorithms. Lastly, there is a need of understanding the effects of various stream parameters such as rate of incoming data on the performance of the algorithms, which can also be a potential future work.
Text, Research and Survey