best android apps

indian herald - your news right here

Most Articles

There are few insurance plans protecting organizations from facing the issues which can harm on financial backgrounds.

Singapore is one of the most advance and developed country. People, from all over the world, visit this country for studies, business, jobs and tourism. This place provides so many career opportunities for new comers, as well as, experienced people. These people are helped by recruitment agencies in their job search.

Rarotonga has some of the most world renowned villas in its locality. Titikavera district has great quality Rarotonga villas. Such accommodations are completely secluded from the hustle and bustle and gives complete relaxation opportunity to the families coming with children.

Quantum DAT-72 tape is one of the most reliable, efficient and cost effective medium of backup technology. DDS 5 tape has a raw data storage capacity of 36 GB and like most products in the market the tape also allows for compression.

Consider for Windows 7 password often forgotten and difficultly reset, this tutorial teach you how to reset your forgotten Windows 7 password by yourself easily.

Is your iPhone contact list too important to backup it to Mac.This guide is a advice on how to transfer your iPhone contact list to Mac quickly and safely.

Underweight, caused due to many factors can be well controlled and cured by proper care and herbal medication. Main physical causes contributing for underweight problem include hypothyroidism and diabetes.

At present, you can find many weight gain supplements for women in online market stores. Finding the best one among them is a great task.

Low sperm production, one among the main disorders contributing for infertility can be well controlled and cured by proper care and herbal treatment. Multiple causes contribute for this health problem.

Finding the best herbal supplement from thousands of products is not an easy task today. If you wish to buy the best one, you have to see certain criteria like product ingredients, product manufacturer and supplier.

 Picture

Efficient Clustering Algorithms

By: Kavitha Sivasamy
For : Computer Science
Date Added : June 11, 2010 Views : 643
Rate Author : Current : 3.05 /5
Rate this PR : Current : 2.24 /5



 

Introduction

The journal entitled “Efficient Clustering Algorithmsaims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers  have  extensively  studied  clustering  since  it arises  in  many  application  domains  in  engineering  and social  sciences.  In  the  last  years  the availability of huge transactional and  experimental data  sets and  the arising requirements for data mining created needs for clustering algorithms  that  scale  and  can  be  applied  in  diverse domains.  This  paper  surveys  clustering  methods  and approaches available in literature in a comparative way.

            Clustering has been one of the most widely studied topics in data mining and it is often the first step of data mining process. Today, it has been witnessing enormous growth in data volume. Often, data is distributed or it can be in the form of streaming data. Efficient clustering in all these scenario becomes a very challenging problem.k-means clustering has been one of the popular clustering algorithms. It requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets and also for streaming data. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. The research group has developed a new algorithm for very large data clustering which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. One contribution of this journal is the implementation and evaluation of that algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means. The  concept of this  were extended to develop another clustering algorithm for distributed data set. The performance of this algorithm is compared with parallel k-means algorithm. The distributed algorithm gives better performance than parallel k-means for loosely connected data sources and in the presence of imbalanced data in remote sources. The framework of clustering evolving streaming data was developed. In this framework,  two exact and two approximate algorithms were provided . Two of the algorithms use outliers to detect changes in cluster centers in streaming data. The approximate outlier based algorithm takes less memory and can compute high quality approximate cluster centers much faster than k-means algorithm. The exact algorithm is faster than k-means when the cluster center movement is little. If cluster center movement is much enough, the exact algorithm however takes more time than k-means algorithm. The experiments show that this framework can be very effective in clustering evolving streaming data.

 

 Clustering

            Clustering has been one of the most widely studied topics in data mining. Clustering algorithms group similar objects into clusters. A cluster is therefore a collection of objects which are similar between themselves and are dissimilar with objects belonging to other clusters. Formally, given a set of ‘d’ dimensional points and a function ‘f ‘ : Rd X Rd ->R that computes the distance between two points in Rd _, it is  required to compute ‘k’ cluster centers, such that the points belonging to the same cluster are similar and points that are in different cluster are dissimilar. Often, clustering is very fast task in any data mining application. Most of the initial clustering techniques were developed by statistics or pattern recognition communities, where the goal was to cluster a modest number of data instances which often located on a single machine. However today, it was witnessed an enormous growth in the amount of collected data. For example, Walmart processes two million transactions per day, AT&T collects millions of call logs every hour, Amazon or other e-commerce companies collect millions of web logs every hour, high energy physics experiments generate data in tera bytes, etc. Additionally, it can also have data distributed in remote machines or it can have streaming type of data which is potentially infinite.

           

            In this scenario, traditional clustering algorithms are inefficient because they were not designed to process large amounts of data or to operate in distributed or streaming environments. Hence, developing fast and efficient clustering algorithms for massive or distributed or streaming data has been identified as challenging problem in the data mining community. Massive data sets are known as out of core data as they do not fit entirely in main memory. In this case, the run time of the algorithm is dominated by the cost of I/O operation or number of disk scan. Hence, ideally the clustering algorithm for massive data should be optimized for the number of scans over the disk. Joydeep Ghosh states “The holy grail of scalable clustering can be to find near linear algorithm that involve only a small number of passes through the database”.

           

            Distributed data may reside in remote loosely-coupled machines. In this case, communication between the nodes may be a bottleneck. Hence, a distributed algorithm is required. The runtime will be dominated by the slowest communicating nodes. Hence, the goal of distributed algorithm should be to minimize the effort and time expended on communication. Streaming data is potentially infinite and it is not possible to store all the data. Hence, the algorithms for streaming data should preferably be one pass and on-line and they should also require small amount of memory for their execution. The requirement of small memory is targeted towards mobile devices or sensor network nodes.

K Means Clustering

            The work employs the k-means clustering. The K-means clustering algorithm was developed by MacQueen. Bottou and Bengio proved the convergence properties of the  K-means algorithm. It has been shown to be very useful for a body of practical applications. The k-means algorithm is not suitable for massive or distributed or streaming data.

A problem with the k-means algorithm is that it makes a complete scan over the entire data for every iteration, and it requires many such iterations before converging to a quality solution. This makes it potentially very expensive algorithm to use, particularly for large disk-resident datasets.

            Furthermore, in the context of streaming data, it may not always possible to store the complete data. This makes k-means algorithm impractical to use for streaming data. For distributed data, one can use parallel k-means algorithm but this algorithm requires that nodes communicate for every iteration. This makes the parallel k-means very expensive to use in presence of communication delays among the nodes. The communication problem is especially prominent in loosely connected remote machines. Also, the parallel k-means works on uniformly partitioned data. The nodes are said to posses data imbalance if the amount of distributed data is not uniform at each node. In presence of such data imbalance, the runtime of parallel k-means algorithm will be dominated by the processing time of the node which contains maximum amount of data. Moreover, other standard practice for clustering distributed data is to download and merge all data in a single machine and then running a sequential clustering algorithm on it. But, downloading huge amount of data and running sequential algorithm is clearly not an efficient solution for clustering distributed data. Hence, a distributed clustering algorithm is required which optimizes the communication and can also operate in presence of data imbalance. A number of algorithms or approaches focus on reducing the number of passes required over the data sets. These researches have been done in the context of clustering massive data or streaming data. However, these approaches only provide approximate solutions, possibly with deterministic or probabilistic bounds on the quality of the solutions.

            Also, there is no distributed algorithm available at present for clustering which optimizes communication and can work in presence of data imbalance and can provide exact results. Therefore, the journal proposes several efficient algorithms which can produce exact results as k-means algorithm for large out of core data, distributed data and streaming data.

Contribution

The main contribution of this journal is as following:

Ø  Implementation and evaluation of a Fast and Exact out of core k-means algorithm (FEKM).

Ø  Development and evaluation of a Distributed Fast and Exact k-means algorithm (DFEKM)extending the concepts of FEKM.

Ø  Development and evaluation of a set of algorithms for clustering evolving streaming data.

 

The main idea in the FEKM is as follows.     

            The  initial process uses  sample the data and run the original k-means algorithm. The computed centers were stored  after each iteration of the run of the k-means on the sampled data. Then this information is used  and complete a pass over the entire dataset. After the identification  the  points which are more likely to shift from one cluster to another are stored , as the cluster centers could move. These points are now used to correct the cluster centers.

 

            The DFEKM is an extension of FEKM for distributed data set. In this algorithm, samples are collected from individual data sources (nodes). Then, the central node runs kmeans and it stores the centers at each iteration as it does in FEKM, then it sends that information to all other nodes.

Individual nodes then process their data in one pass and send all information to the central node. The central node then merge all information and

find out the correct centers.

 

            Streaming data clustering algorithms have been developed considering a sliding window model for stream. In this model, each arriving data point has a time stamp and the interest is finding k clusters within a specific time window. Only last 􀀀 points are relevant inside a sliding window of size 􀀀 . This is a popular approach in mining streaming

data. In this journal, four streaming data clustering algorithms were developed.

            The algorithms were referred as “Fast and Exact Stream k-means” (FESK) and “Fast and Approximate Stream k-means” (FASK), “Exact Adaptive Stream k-means” (EADSK) and “Approximate Adaptive Stream k-means” (AADSK). FESK and EADSK can find out the exact cluster centers in a sliding window as what would have been found by a multi-pass kmeans algorithms. FASK and AADSK are approximate algorithms.

            These algorithms can deal with evolving streaming data. However, if the cluster centers evolve gradually within a pre-estimated bound then FASK and AADSK both will find the exact centers. EADSK and AADSK are designed to determine outliers from the data set. The extensive experiments were conducted with synthetic and real data sets for all these algorithms to evaluate their performance for different scenarios.

 

LITERATURE REVIEW

 

Main Idea of Fast and Exact K-Means Algorithm (FESK)

            Fast and Exact K-Means Algorithm (FEKM) has been designed by the research group. One contribution of this journal is implementation and evaluation of the algorithm. This chapter first gives the main idea of the algorithm and implementation details

The basic idea behind the Fast and Exact K-means algorithm is as follows.

            It has been  believed that approximate cluster centers computed using sampling can be corrected and moved to exact cluster centers using only one or a small number of passes on the entire data. By exact cluster centers, it  refers  to the cluster centers that are computed by the original k-means algorithm. Thus, the sampling can be used to speedup the computation of exact clusters.

There are three key questions to be addressed.

            First, when approximate cluster centers are computed using sampling, what information need to be stored. Second, how can this information be used to avoid a large number of passes on the entire dataset. Third, how to  know that  have it has been  able to achieve the same cluster centers as in the original k-means algorithm.

             Initially the k-means algorithm is made to run  on a sample, using the same convergence criteria and same initial points as it has been used for the k-means. The following information is stored for future use. After every iteration of k-means on the sampled data,  the centers that have been computed were stored. In addition, another value, referred to as the Confidence Radius of each cluster was computed and stored, whose computation will be described later. This information can be stored in a table with columns, and the number of rows equaling the number of iterations for which k-means was run. Each entry of the table contains a tuple (center, radius) for each cluster.

            Next,  one pass through the entire dataset has been completed. For every point and each row of the table,  the cluster to which this point will be assigned at this iteration is determined, assuming that executing the algorithm on the entire dataset produces the same cluster centers as the initial run on sampled data. Next, it has been  tried  to estimate how likely it is that this point will be assigned to a different cluster when the algorithm is executed on the entire dataset.

 

            The goal is to identify and store the points which could be assigned to a different cluster during any of the iterations. These points are refereed to as boundary points, because intuitively, they fall at the boundary of the clusters. If these points could be identified and stored in memory, any need for any further passes on the entire dataset can be eliminated. However, estimation of  these points were made randomly , which implies that it  requires additional passes if the  estimate is not correct.

 

            Thus, for a given point and row of the table, if this point is a boundary point is determined. If it is, it is stored in a buffer. Otherwise, updation of   the sufficient statistics tuples were made , which has the number and sum of the data points for the cluster.After the pass through the dataset and storing the boundary point, the following processing has been done. Starting from the first row of the table, the recomputation of centers were made  using the boundary points and sufficient statistics tuple. If any of the new computed centers fall outside the pre-estimated confidence radii which means that the computation of boundary points is not valid, there is a  need to take another pass through the data.  The  new centers were used as new initialization points and again repeat all the steps. However, if the new computed centers are within the confidence radius, these centers were used for the next iteration and continue.            The key observation is that using cluster centers from sampling, boundary points, and sufficient statistics, whether it is possible  to compute the same cluster centers that it has been  gotten through one pass on the entire dataset. Finally, the algorithm terminates by checking for the same termination condition that one would use in the original algorithm.

 

 

 Main Idea of Distributed Fast and Exact k-means Algorithm(DESK)

 

            The idea of FEKM has been extended for a distributed data set. The distributed algorithm has been designed and implemented using MPI. It  has been referred  as Distributed Fast and Exact K-Means algorithm or DFEKM. This chapter first presents the premise behind the distributed implementation and later provides the details of experiments and evaluation of this new algorithm. the same definitions and notations were used as FEKM.

The basic idea behind the Distributed Fast and Exact K-means algorithm is as follows.

            DFEKM uses ideas from FEKM where pre-computed approximate cluster centers can be corrected and moved to exact cluster centers using only one or a small number of passes on the entire data, and only one or a small number of rounds of communication between data sources and the central site. There are three key questions to be addressed. First, when approximate cluster centers are computed using sampling, what information needs to be stored. Second, how can this information be used to process data independently on each data source, and to avoid frequent or high volume communication. Third, how it can  be determined, in a distributed environment, that it has been  able to achieve the same cluster centers as in the centralized k-means algorithm.

 

            The  sample data has been taken from each data source, and communicate it to the central node. Then, on the central node,  the k-means algorithm has made to run on this sampled data. The following information is stored for processing on each data source. After every iteration of k-means on the sampled data, the k centers that have been computed has been stored . In addition, it  compute and store another value, referred to as the Confidence Radius of each cluster . This information is then used to build the CAtable structure while designing FEKM.

            Then, it send the table to all the data sources. Next, at each data source, one pass were taken  through the portion of the dataset available at that data source. For every point and each row of the table, it  compute the cluster to which this point will be assigned at this iteration. It has been  estimated  if the point is a stable point which was done  in FEKM.The sufficient statistics of the (number, linear sum and square sum) stable points were stored and the points which are not stable or the boundary point were also stored  exactly as discussed in the context of FEKM.

            After the pass through the dataset and storing the boundary point, all the nodes send their boundary points and sufficient statistics to central node. The central node then does the following processing. Starting from the first row of the table, it recomputes centers using the boundary points and sufficient statistics tuple. If any of the new computed centers fall outside the pre-estimated confidence radius, which means that the  computation of boundary points is not valid, it is  needed to send the last corrected centers to all other nodes. Using these centers as the new initialization points, it has to  go through another iteration and repeat all the steps. However, if the new computed centers are within the confidence radius, these centers for the next iteration were used and continued.

Main Ideas of Clustering evolving Streaming Data Algorithms  

Clustering Evolving Streaming Data

            A requirement of mining streaming data is that it  can not store all data points and must produce the result rapidly, ideally in one scan on the arriving data. A popular paradigm in streaming data algorithms collects a synopsis of the incoming data in a single pass and then works on the synopsis to perform the mining task. In this journal, the  clustering data points in a sliding window over the stream has been studied . The sliding window model is a commonly used mechanism for discounting old data. Here, only the last 􀀀 elements to arrive in the stream are considered relevant for analysis, where 􀀀 is the window size. Researchers have developed efficient algorithms to maintain approximate k-medians or k-centers in the sliding window. Jiawei Han, Charu Aggarwal and their research group  notice that one aspect in streaming data clustering is to deal with the evolving nature of streaming data. It has been further noticed  that it is important to consider detecting the change in cluster location in the sliding window in case of stream clustering. This is important because the clusters may move or change in different manner with time. It may also happen that for a particular time frame data points do not form distinct clusters. It has been noticed  that the streaming data clustering algorithm should exploit the evolving nature of streaming data. Suppose, if it has been already  found the clusters in a particular time frame, then it is possible to utilize the previous results to obtain the clusters in the subsequent time frames. It is observed that if the clusters do not  move much enough it is possible to find out the new clusters quite fast using the previous results. In this journal the  two exact and two approximate algorithms were developed which exploit the evolving nature of streaming data to find out clusters in different time frames.

            The exact algorithms find the exact same final results as what would have been obtained by k-means algorithm. As it has been  discussed in previous chapters, the k-means algorithm makes one scan over the entire dataset for every iteration, and it needs many such iterations before converging to a quality solution. This makes it potentially very expensive to use for streaming data. Also, one has to store all the data points in an window for the k-means to process the data. This can also be a problem if the computing device has small memory.

            In the  framework, one can choose clustering algorithms based on the requirements and system constraints to obtain exact or approximate cluster centers. For example, suppose, one is interested in obtaining clusters in a sliding window where the computation has to be performed in a mobile device which has a small main memory.

The basic idea behind the Fast and Exact Stream K Means (FESK), Fast and Approximation Stream K-means (FASK),Exact Adaptive Stream K-Means(EASK), Approximate Adaptive Stream K Means(AADSK)  were  as follows.

            In this case, the objective may to obtain approximate cluster centers using the small memory. On the other hand, if the objective is to obtain exact cluster centers in a sliding window over the stream then one has to use an algorithm guaranteed to obtain the exact centers. The algorithms were referred to as  “Fast and Exact Stream K-Means” (FESK) and “Fast and Approximate Stream K-Means” (FASK), “Exact Adaptive Stream K-Means” (EADSK) and “Approximate Adaptive Stream K-Means” (AADSK).

            In a typical sliding window model it contains  a sliding window for each time frame. At each sliding window, last 􀀀 data points are relevant for the analysis. In the  framework, it initially cluster the data points in the first window using k-means algorithm. Then  the cluster centers were stored and  the cluster radius were estimated based on the number of data points within the cluster. In the next window,  these cluster centers and radius values were used  to classify the data points as stable points, boundary points and outlier points. The stable points are points which are most likely to retain their cluster membership. Boundary points are most likely to change their cluster membership. The outlier points are the points which are far away from its closest cluster and also quite far away from the second closest cluster.

            The first two algorithms Fast and Exact Stream k-means (FESK) and Fast and Approximate Stream k-means (FASK) use the stable and boundary points and the other two algorithms, Exact Adaptive Stream k-means (EADSK) and Approximate Adaptive Stream k-means (AADSK) use stable, boundary and outlier points. The  confidence radius were used  in a slightly different manner than that which has  defined in previous chapters. Here, the confidence radius is an upper bound of movement of a cluster center in the sliding window over a time frame. The confidence radius can be computed as a fraction of cluster radius where cluster radius is the average distance of any point in one cluster. In FESK and FASK, the sufficient statistics of the stable points and  the boundary points were also stored . After that,  the final centers were computed using those sufficient statistics and the boundary points.

            In FESK, it is verified  if the final centers are falling outside the pre-defined confidence radius. In that case,  run k-means algorithm has to be made run with all the data points in that window again. Otherwise, it has been  done with the clustering. In FASK, it  does not verify the final centers and it  continues  processing to the next window.

The first two algorithms Fast and Exact Stream k-means (FESK) and Fast and Approximate Stream k-means (FASK) use the stable and boundary points and the other two algorithms, Exact Adaptive Stream k-means (EADSK) and Approximate Adaptive Stream k-means (AADSK) use stable, boundary and outlier points. The  confidence radius were used  in a slightly different manner than that which has  defined in previous chapters. Here, the confidence radius is an upper bound of movement of a cluster center in the sliding window over a time frame. The confidence radius can be computed as a fraction of cluster radius where cluster radius is the average distance of any point in one cluster. In FESK and FASK, the sufficient statistics of the stable points and  the boundary points were also stored . After that,  the final centers were computed using those sufficient statistics and the boundary points.

            In FESK, it is verified  if the final centers are falling outside the pre-defined confidence radius. In that case,  run k-means algorithm has to be made run with all the data points in that window again. Otherwise, it has been  done with the clustering. In FASK, it  does not verify the final centers and it  continues  processing to the next window.

CONCLUSION

In this article, a fast out of core clustering technique (FEKM) were implemented and evaluated. Then the ideas of that algorithm were extended and did an implementation (DFEKM.) for distributed data. Later a set of algorithms (FESK, FASK, EADSK, AADSK) were developed for clustering evolving streaming data.

            The FEKM typically requires one or a small number of passes on the entire dataset. This can significantly reduce the execution times for clustering on large or disk-resident datasets, with no compromise on the quality of the results. While a number of approaches existed for approximating k-means or similar algorithms with sampling or using a small number of passes, none of these approaches could provably produce the same cluster centers as the original k-means algorithm. The basic idea in the  algorithm is to use sampling to create approximate cluster centers, and use these approximate cluster centers for speeding up the computation of correct or exact cluster centers. The  experimental evaluation on a number of synthetic and real datasets have shown a speedup between 2 and 4.5.

            The  results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down-loading all data and running sequential kmeans, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a  significant load imbalance. In all the  experiments, DFEKM needed only 2 passes on the entire dataset and 3 rounds of interprocessor communication.

            FESK and EADSK give exact cluster centers as what multi pass KMeans algorithm would have produce. FASK and AADSK are approximate algorithms and they perform  same as the two exact algorithms if the cluster centers are moving within the pre-estimated  confidence radius. The approximate algorithms perform faster than the exact algorithms if the centers are moving outside the confidence radius at each sliding window. The  experiment shows that the AADSK produces better results than FASK for this case. One can develop a framework of clustering streaming data with these algorithms. Depending on particular requirements and system constraints one can use exact or approximate algorithms. This framework can also be used for applications which require outlier detection.

            There can be a few possible future work from this journal. One of them can be to develop the distributed version of the stream clustering algorithms. The another work can be in the direction of improving the algorithm to handle a variable number of clusters instead of a fixed . This will be particularly useful in case the data stream evolves such that number of clusters change. It is also possible to develop a framework of incremental clustering using the concepts from the  stream algorithms. Lastly, there is a need of understanding the effects of various stream parameters such as rate of incoming data on the performance of the algorithms, which can also be a potential future work.

 



Text, Research and Survey

Additional Links :
Clustering
Algorithm
Data mining


Contact Info

Lecturer,Kavitha Sivasamy
Phone : +919043471048

Email : askavitha79@gmail.com

Post Comments

Name : 
EmailAddress : 
URL : 
Comments :