Sciencepaper Online (中国科技论文在线)

Upload date

January 14, 2011

[Journal article] Distributed Data Stream Clustering: A Fast EM-based Approach

Aoying Zhou (周傲英), Feng Cao, Ying Yan, Chaofeng Sha, Xiaofeng He

Clustering data streams has attracted considerable research effort recently. However, the problem has not received enough consideration when the data streams are generated in a distributed fashion, even though such a scenario is very common in real-life applications. Several factors constrain the clustering of data streams in a distributed environment: the data records generated are noisy or incomplete because the distributed system is unreliable; the system needs to process a huge volume of data online; and communication is potentially a bottleneck of the system. All these factors pose great challenges for clustering distributed data streams. In this paper, we propose an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams, with the above fundamental challenges in mind. In the presence of noisy or incomplete data records, our algorithms learn the distribution of the underlying data streams by maximizing the likelihood of the data clusters. A test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams. Our extensive experimental studies show that the proposed algorithms achieve high accuracy with less communication cost, memory consumption, and CPU time.
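To illustrate the general idea of learning a distribution by maximizing likelihood, the following is a minimal sketch of textbook EM for a two-component, one-dimensional Gaussian mixture. It is not the distributed framework or the test-and-cluster strategy from the paper; the Gaussian-mixture model, the function names (`normal_pdf`, `em_two_gaussians`), the initialization, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    # Initial guesses: an arbitrary choice for this sketch.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate the parameters that maximize the expected likelihood.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances


# Hypothetical data: two well-separated clusters around 0 and 10.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

Because each point contributes to every component in proportion to its responsibility, a few noisy records shift the estimates only slightly, which is one reason likelihood-based clustering is attractive for unreliable inputs.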


Upload date

January 14, 2011

[Journal article] Sonnet: An Efficient Distributed Content-based Dissemination Broker

Aoying Zhou (周傲英), Weining Qian, Xueqing Gong, and Minqi Zhou

SIGMOD '07, June 12-14, 2007

Abstract

In this demonstration, we present a prototype content-based dissemination broker, called Sonnet, which is built upon a structured overlay network. It combines approximate filtering of XML packets with routing in the overlay network. Deliberate optimization techniques are implemented. The running and tracing of the system in a real-life application will be demonstrated.

Distributed publish/subscribe, XML data dissemination, approximate filtering, path digest


Upload date

January 14, 2011

[Journal article] Adaptive Probabilistic Search Over Unstructured Peer-to-Peer Computing Systems

Aoying Zhou (周傲英), Linhao Xu, Chenyun Dai

World Wide Web (2006) 9: 537-556

Abstract

A challenging problem that confronts unstructured peer-to-peer (P2P) computing systems is how to efficiently locate desired files. This paper addresses the problem by using quantitative information in the form of probabilistic knowledge. Two types of probabilistic knowledge are considered: the overlap between topics shared in the network and the coverage of topics at each individual peer. Based on this probabilistic knowledge, the paper proposes an adaptive probabilistic search algorithm that efficiently supports file location in an unstructured P2P network. An update algorithm is then devised to keep the probabilistic knowledge of individual peers fresh by taking advantage of feedback from previous user queries. Finally, extensive experiments are conducted to evaluate the efficiency and effectiveness of the proposed method.

P2P computing, probabilistic search, query routing


Upload date

January 14, 2011

[Journal article] False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams

Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou (周傲英)

Abstract

The problem of finding frequent items has recently been studied over high-speed data streams. However, mining frequent itemsets from transactional data streams has not yet been well addressed in terms of its bounds on memory consumption. The main difficulty is the exponential explosion of itemsets. Given a domain of I unique items, the number of possible itemsets can be up to 2^I - 1. When the length of a data stream approaches a very large number N, the possibility of an itemset being frequent becomes larger and difficult to track with limited memory. However, the real killer of effective frequent itemset mining is that most existing algorithms are false-positive oriented. That is, they control memory consumption in the counting process by an error parameter ε, and allow items with support below the specified minimum support s but above s - ε to be counted as frequent. Such false-positive items increase the number of false-positive frequent itemsets exponentially, which may make the problem computationally intractable with bounded memory consumption. In this paper, we develop algorithms that can effectively mine frequent item(set)s from high-speed transactional data streams with bounded memory consumption. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. Our algorithms are based on the Chernoff bound. Our extensive experimental studies


Upload date

January 14, 2011

[Journal article] A false negative approach to mining frequent itemsets from high speed transactional data streams

Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Zhenjie Zhang, Aoying Zhou (周傲英)

Information Sciences, 176 (2006): 1986-2015

Abstract

Mining frequent itemsets from transactional data streams is challenging because of the exponential explosion of itemsets and the limited memory space available for mining them. Given a domain of I unique items, the number of possible itemsets can be up to 2^I - 1. When the length of a data stream approaches a very large number N, the possibility of an itemset being frequent becomes larger and difficult to track with limited memory. Existing studies on finding frequent items from high-speed data streams are false-positive oriented. That is, they control memory consumption in the counting process by an error parameter ε, and allow items with support below the specified minimum support s but above s - ε to be counted as frequent. However, such false-positive oriented approaches cannot be effectively applied to frequent itemset mining for two reasons. First, the false-positive items found increase the number of false-positive frequent itemsets exponentially. Second, minimizing the number of false-positive items found, by using a small ε, makes memory consumption large. Therefore, such approaches may make the problem computationally intractable with bounded memory consumption. In this paper, we develop algorithms that can effectively mine frequent item(set)s from high-speed transactional data streams with bounded memory consumption. Our algorithms are based on the Chernoff bound: we use a running error parameter to prune item(set)s and a reliability parameter to control memory. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.

Data stream, Frequent pattern mining, Memory minimization
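The false-negative idea described in the abstract can be sketched for single items (not itemsets) as follows. This is not the paper's algorithm: the running error here is derived from Hoeffding's inequality, a Chernoff-type bound chosen because its form is simple, and the function names, the `prune_every` parameter, and the toy stream are assumptions made for illustration. Since counts are only ever dropped, never inflated, every reported item truly meets the minimum support, while a frequent item can occasionally be missed.

```python
import math
from collections import Counter


def running_error(n, delta):
    """Hoeffding-style error after n observations.

    With probability at least 1 - delta, an observed relative frequency
    lies within this distance of the true frequency.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def false_negative_frequent_items(stream, min_support, delta, prune_every=100):
    """Report items whose observed support reaches min_support (illustrative only)."""
    counts = Counter()
    n = 0
    for item in stream:
        n += 1
        counts[item] += 1
        if n % prune_every == 0:
            eps = running_error(n, delta)
            # Prune items whose observed support falls below s - eps.
            # A pruned item restarts from zero if it reappears, so its count
            # can only be an underestimate: no false positives are possible.
            for key in [k for k, c in counts.items() if c / n < min_support - eps]:
                del counts[key]
    return {k: c for k, c in counts.items() if n > 0 and c / n >= min_support}


# Hypothetical stream: "a" 60%, "b" 30%, "c" 10% of 1000 records.
stream = ["a" if i % 10 < 6 else "b" if i % 10 < 9 else "c" for i in range(1000)]
result = false_negative_frequent_items(stream, min_support=0.2, delta=0.1)
print(sorted(result))
```

The running error shrinks as more records arrive, so pruning becomes stricter over time; `delta` plays the role of the reliability parameter in this simplified setting.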
