Francisco Matias Cuenca-Acuna
mcuenca@cs.rutgers.edu
PhD Thesis
Department of Computer Science, Rutgers University
110 Frelinghuysen Rd, Piscataway, NJ 08854
April 2, 2004
Traditional techniques for building distributed systems, have generally provided resource management and communication by relying on structured solutions like: (a) imposing an overlay structure over the system (i.e. multicast tree or distributed hash table), (b) depending on centralized services or (c) relying on distributed consistency protocols. Previous work[46,42,102,64], has shown that these techniques become expensive and sometime unfeasible in environments where membership changes rapidly and nodes are heterogeneous and unpredictable.
In this dissertation, we explore a different approach for building large scale distributed systems. Our thesis is to create distributed algorithms that allow members to operate autonomously so that their progress is not conditioned by other nodes. Despite their independence, as a whole members should be able to make constant progress toward achieving stable global goals. In order to ensure autonomy, global progress and stability, we build randomized algorithms that depend only on loosely synchronized global data.
In this dissertation we explore an infrastructure called PlanetP that we have simulated and partially prototyped to validate our thesis. PlanetP embodies several of our ideas by using probabilistic algorithms to provide, group communication, membership management, content based search and ranking, autonomous service deployment and management, and autonomous replication to provide predictable data availability. Our work is novel in that we target highly dynamic environments where nodes join and leave constantly in an uncontrolled manner.
While federated services can revolutionize collaboration and commerce across the Internet, the realization of this vision faces a number of challenges arising from its fundamental cross-organizational nature. Federated systems pose new challenges that limit the efficacy of traditional techniques for building distributed systems. For example, classic distributed operating systems like Sprite[71], Amoeba[68], and Locus[78] depend on centralized services like transaction coordinators and storage managers. This dependence on centralized resources is not well suited for environments like the European Data Grid where countries may have different policies that cannot be embodied within a single shared centralized resource. For example, countries may disagree on the maintenance of a server depending on how it affects their own use of the infrastructure.
Not only political reasons, but also technical limitations prevent the use of existing systems. A possible solution to decentralize and distribute the control of federated environments is to replicate critical services across organizations. Unfortunately traditional consistency protocols like two phase commit, leader election and virtual synchrony have been found to scale poorly on wide area networks[46,42,102]. Gupta et al.[46] have shown that it only takes a single faulty or slow member to severely affect the performance of the whole community. They have also observed that this kind of problems becomes the norm when dealing with even small communities comprised of tens of peers distributed across a wide area network. Furthermore, Gray et al.[42] studied the problem of scaling replicated databases. Their results conclude that transactional replication is not feasible for large systems, as a ten fold increase on the number of replicas and clients causes a thousand fold increase in the number of deadlocks or reconciliations.
In this dissertation, we explore a different approach to building large scale distributed systems. Our thesis is to use distributed algorithms that allow members to operate autonomously so that their progress cannot be impeded by other nodes. Despite their independence, as a whole, members should be able to make constant progress toward achieving stable global goals. In order to ensure autonomy, global progress and stability, we build randomized algorithms that depend only on loosely synchronized (i.e. weakly consistent) global data .
In the past, probabilistic and randomized algorithms have been very successful at providing reliable and scalable solutions for problems like membership management[47], multicast communication [9], database replication [42,22], information aggregation [100], failure detection [101] and security [109]. Motivated by this growing trend in software design, we set ourselves to build a new distributed infrastructure to support federated services based on probabilistic distributed algorithms.
Our work is novel in that we target highly dynamic environments where nodes join and leave constantly in an uncontrolled manner. Further, problems like faulty hardware and software, operator error, dynamic reallocation of resources, load spikes, and malicious users cause these systems to be highly volatile. For example, Saroiu et al.[88] report an average node availability of only 24% on communities like Gnutella[37].
Every component of PlanetP has been simulated to estimate its performance and scalability on communities of thousands of nodes. Throughout this dissertation we use simulated communities to study different aspects of real systems like the European Data Grid and PlanetLab. These communities are based on our own observations of real systems [24,77] and previous work on the field [88,44,106,48,29,11]. Furthermore, PlanetP has been successfully used to deploy federated services in a community composed of 100 PlanetLab[77] nodes across the US.
Group communication and state propagation. A significant amount of work has been done to support group communication [8,5,55,9,27,13]. Still, as previous studies show, even models with weak delivery guarantees suffer significantly when used on volatile and unpredictable environments [46]. Thus, in Chapter 3, we propose a novel approach to the construction of a content addressable publish/subscribe service that uses Demers et al's[22] anti-entropy [22] algorithm to replicate global state across unstructured communities of several thousand nodes.
Briefly, Demers et al.'s algorithm for synchronizing a global data structure and propagating updates (i.e. broadcasting) works as follows: suppose x has a piece of information of interest to the entire community. Periodically, x would push this change (called a rumor) to a randomly chosen peer y. If y has not seen this rumor, it records the change and starts to push the rumor just like x. In this manner, information eventually diffuses throughout the whole community.
We argue that gossiping is simple, yet it provides a powerful tool for sharing information. Gossiping is simple because each peer must only agree to perform a periodic, randomized, point-to-point message exchange with other peers, rather than collaborate to correctly and consistently maintain a complex distributed data structure. Gossiping is powerful for two reasons: (a) it can propagate information in probabilistically bounded time in spite of the uncoordinated communal behavior, and (b) it can maintain loosely consistent replicated state without depending on centralized resources or the on-line presence of specific peers.
The latter allows PlanetP to maintain loosely synchronized communal
databases. For example, PlanetP includes a default membership
database, called the global directory, which allows federated services
to get information about the community and the status of its
members. As shall be seen, the global directory is also used to
provide a multidimensional keyword index. This global index stores
key-to-node mappings ``
'', which are used to
advertise that node n has a some particular content or resource
(i.e. an object) associated with keyword k. Objects are then retrieved
by specifying queries comprised of sets of keys combined using three
operators, and (
), or (
), and without
(-). For example, a query (``cat''
``dog'' - ``bird'')
would contact all the nodes advertising the keywords ``cat'' and
``dog'' but not ``bird'' and then retrieve all the related objects.
Finally, using the global index PlanetP also provides a publish/subscribe abstraction where applications can register call-backs to be informed when new mappings that match a particular keyword k are published.
Data availability. Given the above infrastructure, we proceed to consider the problem of reliably sharing data in federated environments. One of earliest widespread use of federated computing was file sharing applications like Gnutella[37] and Freenet[19]. In these environments, files were manually replicated by users based on their popularity. Moreover, searching consisted of sampling a subset of the online community. As a result, uncontrolled file availability became a limitation for both search and retrieval. If we were to move file sharing beyond just music content into sharing scientific results or legal case studies, today's pattern of availability determined by file popularity may not hold. Moreover, locating unpopular references may be critical to answering specific queries. Similarly, data availability can greatly influence the performance of more structured approaches like federated file systems[69,61] as they critically depend on key shared data like for example the root directory.
Recent measurements suggest that highly dynamic communities, like P2P groups, are fundamentally different from current servers [7,88]; for example, Saroiu et al. report an average node availability of only 24% [88]. Such low availability implies that providing practical availability for shared data, say 99-99.9%, which is comparable to today's web services [65], would be prohibitively expensive storage-wise using traditional replication methods. Yet, requiring that data be moved or re-replicated as nodes leave and rejoin the online community would be prohibitively expensive bandwidth-wise. For example, replicating a file 6 times (i.e. 1 original copy plus 6 replicas) when the average node availability is only 24% would only raise its availability to 85%. Moreover, if a thousand nodes were to share 100GB (700GB after replication), the bandwidth required to keep all replicas online as members join and leave the online community would be around 4GB per node per day (see Chapter 4).
We are thus motivated to study how to improve the availability of shared data for highly dynamic communities where individuals may be disconnected as often as being online. In particular, assuming that nodes eventually rejoin the online community when they go offline, we address the question: is it possible to place replicas of shared files in such a way that, despite constant changes to the online membership, files are highly available without requiring the continual movement of replicas? To answer this question, we propose and evaluate a distributed replication algorithm where all replication decisions are made autonomously by individual members using only a small amount of loosely synchronized global state shared through PlanetP's global directory. To maximize the utility of excess storage, we assume that files are replicated in their entirety only when a member hoards that file for disconnected operation. Otherwise, files are replicated using an erasure code [104]. While the use of an erasure code is not novel, we show how to avoid the need for tracking the placement of specific fragments. In particular, we show how to increase the availability of a file by adding new random fragments rather than regenerating individual fragments that have been lost.
Content search and ranking. As communities get larger, locating relevant data becomes an important issue[63,95,18,49], consequently we proceed to study how to provide content search and ranking on federated environments. The success of Internet search engines in indexing newsgroups and mailing lists (e.g., Google Groups) as well as the web in general argues that content-based search and ranking is a powerful model for locating information across data collections exhibiting a wide range of sizes and content.
Our studies of a local P2P community comprised of 3000 University students, show that they currently share over 20TB of information. Similarly, we find that scientific grids aggregate petabytes of data with relatively few nodes (e.g. 160 nodes in the case of the European Data Grid). Locating data in these large federated environments is challenging because shared information is not centralized and as the community membership changes, so does the content that is available for searching. Traditional information retrieval techniques that rely on crawling to build a centralized index, would need to spend lots of resources polling for updates to be able to keep up with this dynamicity. On Chapter 5, we propose a scheme for searching and ranking documents distributively without building such index. This scheme is based on the popular vector space ranking model [107] used on Internet search engines like Altavista.
Further, using the autonomous replication algorithms presented on Chapter 4, we study how to provide high availability for indexing metadata. Effectively, we address the issue of how to being able to search, rank and download a file with predictable availability. Figure 1.1 summarizes the relation between the different layers that we have described so far.
![]() |
Self managed federated services. Now we turn to addressing the sharing of computing resources. In particular, we focus on building self-managing federated services. Recent studies [73] show that operator errors are typically the largest cause of Internet service failures. Oppenheimer et al.[72] show that 19% to 33% of the total errors are caused by the operators and more than 50% of these are due to faulty configurations.
In federated environments, deploying and managing services becomes an even more difficult task. In order to realize the federated computing promise, we must first raise the abstraction level at which systems operators currently manage federated services. In particular, applications must be able to autonomously execute in heterogeneous and potentially highly dynamic environments that are not under the control or management of any single entity.
Following our goal of building infrastructural support for federated environments, we study the case of critical services that can cause the failure of entire systems. For example, the European Data Grid relies on a continental scale directory service comprised of more than 100 data providers. This directory service is vital to the operation of the grid, yet its topology and redundancy must be manually configured. Any adjustments or changes must also be performed manually.
Similarly, consider services such as the infrastructural Universal Description, Discovery and Integration (UDDI)[97] service that is used in federated environments to locate and access Web Services across organizations. Such multi-site services must currently be configured and maintained by hand, placing a considerable burden on system operators [108]. Worse, it is quite difficult to understand and properly manage the overall performance and availability of such a service because the components are spread across multiple administrative domains.
In Chapter 6, we design a distributed resource management framework that can be used to build self-managing multi-site services that dynamically adapt to changing system configuration and client load. Our goal is to reduce the burden of deploying and managing a multi-site service to three tasks: (1) defining an application model for the framework to choose appropriate runtime configurations and specifying the desired availability and performance goals, (2) deciding the set of machines that can host instances of the service, and (3) providing maintenance support to repair these machines when they fail. Given a set of hosting machines, our framework will start an appropriate number of service replicas to meet the desired quality of service. Further, it will monitor the application as well as the hosting infrastructure; as the membership of the hosting set, processing capabilities and availability profiles of hosting machines, and client load change, our framework will adjust the number and placement of replicas to maintain the target quality of service.
Our management framework is novel in that it is completely distributed; each replica of a service is wrapped with a management agent that makes autonomous decisions about whether new replicas should be spawned and whether the current replica should be stopped or migrated to a better node. Agents rely on a set of loosely synchronized data to make their decisions, including the client load each replica is currently serving and load information about potential host machines. This autonomy and dependence only on loosely synchronized global state make our framework scalable and highly robust to the volatility inherent within federated systems. Despite autonomous actions based on weakly consistent data, we show that our management framework reaches a stable and appropriate configuration rapidly in response to system dynamics.
Although we share similar goals in terms of functionality with these systems, our target environment is very different. Distributed operating systems were not designed to deal with issues like dynamic membership, large number of nodes, highly heterogeneous hardware, constant node and network failures, etc. These systems use solutions like centralized servers and distributed consistency algorithms that are not effective in federated environments (see Chapter 1).
Closer to our target environment, systems like Globus[30], Legion[43], Globe[92] and WebOS[98] address the problem of global scale computing. While Globe and Legion concentrate on the distributed programming paradigm, Globus and WebOS deal with the infrastructural problems.
All these frameworks use a structured approach toward managing information and services. For example, Globus uses an LDAP[110] directory hierarchy to store information about the resources shared by a grid community. Similarly, Maarten van Steen et al.[93] builds his own hierarchical directory service to support mobile objects in Globe while optimizing for lookups and updates. WebOS adds more flexibility to the idea of a directory services by introducing Active Names[99]. An Active Name is a code snippet that is downloaded by a client the first time it tries to use a service (using DNS and a WWW server). To access the service, the client first runs the snippet which returns a service provider. Although Active Names have primarily been used for load balancing and transcoding on mobile environments, they could also be used to improve the client perceived availability of directory services.
In contrast, PlanetP uses the idea of unstructured communities to eliminate dependencies on individual nodes. PlanetP's design reduces the need for hand customizing and configuring nodes according to their communal roles. On the other hand, PlanetP's lack of structure limits its scalability as it increases the amount of information exchanged and maintained by each node. This decision is justified by the desire to support highly dynamic environments in which nodes may be available for small periods of time. In these environments a structured approach not only spends a significant amount time adding and removing nodes from the hierarchy, but it also requires complex algorithms to cope with instability[64].
Previous work has addressed some of these services although sometimes the goals and environments differ. For example, numerous research efforts have focused on building highly scalable distributed hash tables (DHT) for P2P communities [113,85,94,79]. In general, DHTs use consistent hashing[58] to spread (key, value) pairs across the community and to provide retrieval mechanisms based on the key. The idea of consistent hashing is that nodes are responsible for contiguous portions of the key space. This scheme is well suited for implementing distributed hash structures because node departures and arrivals are solved by just contacting a single existing node to release or acquire a portion of the key space (and its associated data). Although DHTs have been successfully used to build file system services [69,61], we believe they are less suitable for the type of communities studied in this thesis. For example, the high cost of publishing thousands of keys and the lack of update propagation make it difficult to implement content-addressable publish/subscribe systems using only DHTs[51,80]. PlanetP overcomes these difficulties using gossiping to propagate information and replicating a compact inverted index on every peer. Although this trade off limits scalability, it increases content publishing performance and fault-tolerance. Similarly, as membership becomes more dynamic the cost of adding and removing nodes (i.e. shifting keys) from a DHT starts to reduce the benefits of using a structured approach. Recent work [64] has shown that DHTs need special stabilization algorithms in order to preserve their invariants on highly dynamic communities.
In this thesis, we show that our combination of state replication and random gossiping provides bounded convergence times with no need for maintenance algorithms. Moreover gossiping has the advantage of distributing the cost of propagating new information across the community, thus helping nodes with limited resources.
The anti-entropy mechanisms used in PlanetP were originally introduced by Demers et al. [22]. The ideas behind these algorithms have been successfully applied to solve a variety of problems including membership management [47], information aggregation [100] , multicast communication [9], database replication [42,22,76,111], failure detection [101] and P2P DHTs [45]. As far as we know, nobody has looked at gossiping in scenarios where nodes join and leave constantly and in an uncontrolled manner. In this thesis we have quantified the use of gossiping techniques on P2P environments and adapted them for better bandwidth usage and propagation time stability (similar to the work done by Liben-Nowell et al.[64] for DHTs). Furthermore, Yu et al.[111] has shown that the use of anti-entropy does not necessary impose a weak consistency model. On their work, they study a consistency model and mechanisms that can drive anti-entropy to achieve a continuous consistency spectrum ranging from weak to strong consistency.
PlanetP's information propagation and information sharing model was inspired by previous work on tuple spaces [34,62] and publish/subscribe systems [27,16,13,84]. Tuple spaces introduced the idea of space and time decoupling by allowing publishers to post tuples (i.e. data) without knowing the set of receiving nodes and, similarly, by letting receivers search for tuples at a later time. Publish/subscribe systems also added the concept of flow decoupling, meaning that nodes do not need to poll for updates. Instead, they are notified asynchronously when an event occurs. PlanetP is the first large scale decentralized infrastructure that has tried to leverage functionality from both of these bodies of work. Previous work in those areas had relied on assumptions like broadcast/multicast [86] or server based schemes [62] to communicate among members.
In contrast to our approach of randomly generating erasure coded
fragments,
OceanStore [61] proposes to replicate
files using a fixed number of erasure coded fragments, which are
repaired periodically but at a very low rate (on the order of once
every few months [105]). In their scenario,
nodes have relatively high availability and so replication, and
fragment repair, are motivated mostly because of disk failures.
Conversely, in the environments we study, the wide variance of node
availabilities makes it difficult to adopt a similar approach and also
requires a more pro-active replication algorithm.
Similarly, Ivy [69] uses a distributed hash table called DHash [21] to store file blocks as well as a fixed number of replicas across a P2P community. To ensure availability, they assume that replicas are refreshed over time. Further, because the location of data in DHash depends on the current set of online members, it is difficult for them to adopt a less dynamic placement strategy such as ours. We believe that the constant refresh of data would be too expensive in highly dynamic communities.
We have found that the use of adaptive erasure codes and estimated node availability plays a key role in equalizing the overall file availability. Along these lines, Farsite [11,25,26,1] uses information about node availability to increase minimum file availability. Although we have incorporated pieces of their algorithms into our replication scheme, there are several key differences. First, Farsite eliminates user-created file replicas before performing its internal replication, whereas we have to leave hoarded copies alone. Second, Farsite does not use erasure codes because it targets corporate LAN environments where average node availability is high compared to federated systems. Third, Farsite replicates all files equally while carefully placing them in order to maximize the minimum file availability, a metric that is quite meaningful for file systems. We, on the other hand, may replicate files at widely different levels to account for the different availability of peers hosting hoarded copies. In addition, we are more interested in the median/average file availability and the standard deviation than the minimum. This is because in our loosely coupled environments, there are always peers that are rarely online and so cannot replicate their content for high availability. The only option to increase these file's availability is for more available peers to become interested in this content and start replicating it. Since we cannot control user interest, there may always be files with very low availability.
While current P2P systems such as Gnutella [37] and KaZaA [59] have been tremendously successful for music and video sharing communities, their search and information diffusion capabilities have been frustratingly limited. Our goal for PlanetP is to increase the power with which users can locate information in federated communities by providing content based search and ranking capabilities.
Several efforts parallel to PlanetP have also looked at better querying mechanisms [36,80,49]. Their focus, however, is on serving very large-scale communities. In order to be scalable these systems trade off performance and functionality by using iterative queries and distributed inverted indexes. None of this previous work supports content ranking.
More related to PlanetP's information retrieval goals, Cori [14] and Gloss [40] address the problems of database selection and ranking fusion on distributed collections. Recent studies done by French et al.[32] show that both systems can scale to 900 nodes. Although Cori and Gloss use different indexing techniques, both maintain an index of all the nodes containing a term t and its number of occurrences at each node2.1. Using this data, they construct a centralized index that can direct queries toward the nodes storing the most relevant documents.
Because PlanetP focus larger and more dynamic communities, which do not have any centralized resources, we have chosen to keep even less information in the global index. Reducing the index size, allows us to minimize communication as well as storage. In spite of this minimal index, we find that our distributed search and rank algorithm is nearly as effective as the one used in a centralized Internet search engine like Altavista.
Previous research[74,100,77,31,4,91,98] have looked at building wide area infrastructures to simplify application development, software deployment and services management. Issues like group communication [74,82,91], service monitoring [100,31], distributed data structures [112,94], autonomic deployment [91,4] and remote execution [31,98] have been extensively studied. The focus of our work is to raise the abstraction level at which systems operators currently manage federated services. As we have noted on the introduction (Chapter 1), operators are the most frequent individual cause of service failures.
In our approach we use an application model, which could be derived from a Service Level Agreement (SLA), to constantly monitor and improve service configuration. We use autonomous agents to constantly optimize the application model and deploy new configurations.
Traditionally, the use of SLAs and quality of service metrics (QoS) has been applied on cluster environments [90,6,17] where new configurations were evaluated and deployed by a single node. In our framework each agent operates autonomously, therefore it is possible for several agents to instantiate a new configuration simultaneously. Concurrent deployments may interfere with each other, leading to too many or too few replicas. To address this problem, we introduce the idea of probabilistic serialization, which resembles Ethernet's back off algorithm for avoiding transmission collisions.
Previous work on global service management and deployment has used centralized solutions [98,31] or relied on group communication to coordinate deployment agents [91]. Closer toward our decentralization goals, work done on distributed agent-based systems [53,2], like Archon [52], has explored the use of agents that rely only partial communal information and cooperate to reach global goals. Although these frameworks have been evaluated on WAN environments, their underlying assumption is that agents can build a distributed database schema using information from an agent registry. Thanks to this schema agents can direct queries to other nodes and inquire about the state of the community. As discussed before the volatility of the environments studied on this work, makes this approach less attractive.
More recently, scientists have combined work on peer-to-peer and agent-based systems to build more robust and unstructured frameworks[67,89]. Still, nobody has yet address problems that require the kind of coordination needed to maintain federated services. For example, previous work has looked at load balancing of embarrassingly parallel jobs on P2P communities[67] and location of data on distributed tuple spaces[89].
In particular, we seek to answer the following questions:
We use simulation and measurements from our prototype implementation to answer these questions. In particular, we show that the globally replicated database that implements the multidimensional keyword index, only requires a modest amount of storage. Changes to replicated data consistently reach all on-line members within several minutes. Further, synchronizing this database using our gossiping algorithm requires only a modest amount of bandwidth, even for extremely dynamic communities with very high rates of change.
In order to eliminate this residue, every so often, each peer performs an anti-entropy operation instead of rumoring. For example, in our implementation of this algorithm, every tenth round of rumoring (or if there is currently no new information to be rumored), a peer x would send an anti-entropy message instead of a rumor. The anti-entropy message asks the target y to send a summary of its entire directory to x. When x gets y's summary, x parses it to see whether y has more updated information. If so, then x asks y for the needed information. This combination of push rumoring and pull anti-entropy helps to reliably spread new information everywhere.
Unfortunately, in a dynamic environment, the time required to spread new information can become highly variable. This is because rapid changes in the membership leads individual peers to have a less accurate view of the directory, elevating the problem of residual peers that do not receive rumors before they die out. Increasing the anti-entropy frequency is a possible solution to ensure eventual propagation of all the updates. Unfortunately, anti-entropy is much more expensive in terms of network bandwidth than rumoring. While rumors are only as large as the update they carry, an anti-entropy communication session is proportional to the community size.
Thus, we instead extend each push operation with a partial pull that works as follows. When x sends a rumor to y, y piggybacks the summaries of a small number m of the most recent rumors that y learned about but is no longer actively spreading onto its reply to x; this allows x to pull any recent rumor that did not reach it. This partial pull requires only one extra message in the case that y knows something that x does not since the normal rumoring process is really implemented as a query/request/reply sequence using unique rumor identifiers to save bandwidth when the target has already received the rumor. Furthermore, the amount of data piggybacked on y's message is of constant size, on order of tens of bytes.
Observe that while the pushing of rumors has a termination condition, pulling does not. To address this, PlanetP dynamically adjusts its gossiping interval Tg; if a peer is not actively pushing any rumors, it slowly raises its Tg (to some maximum value). When it receives a new rumor, it immediately resets its gossiping interval to the default. This dynamic adaptation leads to negligible bandwidth usage shortly after global consistency has been achieved.
Finally, note that although in this thesis, we assume that shared data structures are universally replicated and are gossiped with a single Tg for simplicity, this is not the general case. In fact, our implementation allows each data structure to be associated with only a subset of peers and gossiped at a distinct rate. This allows partial replication as well as rapid dissemination of time-sensitive information such as messages for group communication without increasing the overhead of maintaining more slowly changing data structures.
Each record in the global directory represents a single peer and stores three types of information: first, a mandatory set of properties like IP address, user's nickname, etc. Second, an optional set of numerical properties and third a set of terms associated with the objects shared by the peer (i.e. files, services, etc.).
In order to join a PlanetP community, new nodes need to contact an existing online member to get a copy of the current global directory. Using this copy, the new node adds itself to the directory and starts to propagate the updated version. Similarly, to advertise local objects a node just updates its corresponding entry on the local copy of the global directory and let the gossiping layer propagate it.
Thanks to the global directory, nodes can execute boolean queries to locate shared objects. A searching node first looks at all the terms associated with each entry to derive a group of nodes that contains these query terms. Then, it forwards the query to these nodes and asks them to return contact information for any object that is relevant to the query (i.e. URLs or RMI stubs). Effectively, the sets of terms included on the global directory provide a communal term-to-node index (referred as the global index). PlanetP uses this two-stage search process to perform exhaustive searches while limiting the size of the globally replicated index.
In order to minimize the amount of memory and bandwidth taken up by the global index, each node's term set is implemented using a Bloom filter [10]. Briefly, a Bloom filter is an array of bits used to represent a set of strings; in our case, the set of terms associated with a node. The filter is computed by using k different hashing functions to compute k indices for each term and setting the bit at each index to 1. Given a Bloom filter, we can ask, if some term t is a member of the set by computing the k indices for t and checking whether those bits are 1. Bloom filters can give false positive answers but never false negatives.
We use the combination of Bloom filters and a two stage search technique to implement the global multidimensional index, because it provides several advantages:
In the following chapters, we will further extend the use of the global directory thanks to the optional per node properties. As shall be seen, we can use them to propagate node statistics like CPU load, node availability, available memory, etc
Having described PlanetP's data gossiping layer, we now turn to evaluating its performance. We study the cost (space and time) and the reliability of the gossiping layer when supporting the global directory. Our performance study is simulation-based but most of the parameters were derived from a prototype implementation. Table 3.1 lists these parameters. We validated our simulator by comparing its results against numbers measured on a cluster of eight 800 MHz Pentium III PCs with 512MB of memory, running a Linux 2.2 kernel and the BlackDown JVM, version 1.3.0. Because of the JVM's resource requirements, we were limited to 25 peers per machine, allowing us to validate our simulation for community sizes of up to 200 peers.
To drive the experiments, we study the replication of the global directory when used in the context of content search. In this environment, nodes parse and index every term on a document. Therefore, this is one of the most challenging scenarios as the number of terms per text file is very large when compared to other types of files. Events that change the directory and so require gossiping include the joining of a new member, the rejoin of a previously off-line member, and a change in a Bloom filter. We do not gossip the leaving (temporary or permanent) of a peer. Each peer discovers that another peer is off-line when an attempt to communicate with it fails. It marks the peer as off-line in its directory but does not gossip this information. When the peer x comes back on-line, its presence will eventually be gossiped to the entire community; each peer that has marked x as off-line in its directory changes x's status back to on-line. If a peer has been marked as off-line continuously for TDead time, then all information about it is dropped from the directory under the assumption that the peer has left the community permanently.
Global Index Size. First, we study the space required by the global index to summarize the well known TREC [48] document collection (944,651 documents, 256,686,468 terms, 592,052 unique terms, 3,428.41 MB). This collection contains only text documents, so the ratio of unique terms to collection size is very high. For collections including multi-media documents, this ratio is likely to be much smaller. For example, a collection of 326,913 MP3 files requiring 1.4TB of storage collected from an existing P2P community only yielded 55,553 unique terms.
The size of the global index is determined by the summation of the number of unique terms per node. Therefore we need to address the effect of the community size, the document distribution and the number of documents per node. In the following experiments we assume a uniform document to node distribution. This distribution is a worst case scenario since any other document distribution (e.g. Weibull) would likely reduce the chances of having the same unique terms repeated across several nodes. For example, the best case scenario is when all the documents are stored on a single node.
![]() |
Figure 3.1 shows that if a node already contains 0.4% of the TREC collection, it would have had to add approximately 3000 more documents, totaling 800,000 more terms, to have found an additional 1000 unique terms. The trend we found in Figure 3.1 is consistent with that found by a much larger study of word distribution [106]. These trends argue that sharing only unique terms across nodes is a scalable approach. Figure 3.1 shows that the rate at which new terms are introduced decreases with the number of documents per node.
Finally, in Figure 3.2 we count the number of unique words at each peer and compute the size of the global index if each Bloom filter was sized to summarize the per-node unique terms with less than 5% probability of error. We also show what happens if each document is replicated 3 times in the community as well as the case when we save space by indexing only the 30% most frequent unique terms in each document (on Chapter 5 we show that this reduction still provides reasonable search results).
![]() |
Observe that at 1000 peers, the global index is quite small: 16.1MB, which is just 0.5% of the collection. If each document were replicated 3 times, the storage requirement would increase to 28.7MB, which is actually only 0.3% of the enlarged collection. At 5000 peers, the storage cost is somewhat higher, rising to 62.3MB if each document is replicated 3 times. Observe, however, that if we sacrifice a little accuracy by indexing only the 30% most frequent unique terms in each document, the storage requirement is reduced again to 26.9MB, which is just 0.3% of the replicated collection.
Based on these results, we conclude that PlanetP should easily scale to several thousand peers in terms of the required per peer storage for the replicated global index.
In this experiment, we use a Bloom filter with 1000 words. Because PlanetP sends diffs of the Bloom filters to save bandwidth, this scenario simulates the addition of 1000 new terms to some peer's local index. Note that, while 1000 new terms may seem small, it actually is quite large when comes from a node that already shares some documents as seen on Figure 3.1.
![]() |
Figure 3.3(a) plots the time3.1 required for updates to reach every single peer on six different scenarios. These scenarios present different combinations of bandwidth capacity and gossiping schemes as described next:
Unless noted all the environments use a 30 second gossiping interval. Figure 3.3(b) shows the aggregate network volume used to propagate the new piece of information throughout the community. Figure 3.3(c) shows the average gossiping bandwidth used per peer during the experiment for DSL-10, DSL-30, and DSL-60.
Based on these graphs, we make several observations:
In this experiment, we start a community of n peers and wait until their views of membership is consistent. Then, m new peers will attempt to join the community simultaneously. We measure the time required until all members have a consistent view of the community again as well as the required bandwidth during this time. For this experiment, each peer was set to share 20,000 terms with the rest of the community through their Bloom filters. Using the scenario presented on Section 3.4.1 we find that in order to index 20,000 unique terms, peers need to share 500MB of pure text documents.
![]() |
Figure 3.4 plots the time to reach consistency vs. the number of joining peers for an initial community of 1000 nodes. These results show that if there is sufficient bandwidth (i.e. LAN), consistency is reached within approximately 600 seconds (10 minutes), even when the community grows by 25%. In contrast to propagating a change, however, the joining process is a much more bandwidth intensive one; a joining member must retrieve 1000 Bloom filters representing a total of 20 million terms from the existing community. Also, having 250 members join at once means that 250 Bloom filters representing 5 million terms must be gossiped throughout the community. As a result, convergence times for communities interconnected only with DSL-speed links are approximately twice that of LAN-connected communities. Finally, convergence times for the MIX-connected communities become unacceptable, possibly requiring from 50 minutes to over two hours.
We draw two conclusions from these results. First, even in this worst-case scenario for PlanetP, which we do not expect to occur often, if peers have DSL or higher connectivity, then PlanetP does quite well. Second, we need to modify PlanetP if we are to accommodate users with modem-speed connections. While the artificial lengthening of gossiping convergence time can be easily fixed if peers are assumed to be multi-threaded, when a new peer first join, the time to download the entire directory would still likely take too long. Thus, we should either exclude peers with less than DSL connectivity or allow a new modem-connected peer to acquire the directory in pieces over a much longer period of time. We would also need to support some form of proxy search, where modem-connected peers can ask peers with better connectivity to help with searches.
We also decided to modify our gossiping algorithm to be bandwidth-aware, assuming that peers can learn of each other's connectivity speed. The motivation for this is that a flat gossiping algorithm penalizes the community to spread information only as fast as the slow members can go. Thus, we modify the basic PlanetP gossiping algorithm for peers with faster connectivity to preferentially gossip with each other and peers with slower connectivity to preferentially gossip with each other. This idea is implemented as follows. Peers are divided into two classes, fast and slow. Fast includes peers with 512 Kb/s connectivity or better. Slow includes peers connected by modems. When rumoring, a fast peer makes a binary decision to talk to a fast or slow peer. Probability of choosing a slow peer is 1%. Once the binary decision has been made, the peer chooses a particular peer randomly from the appropriate pool. When performing anti-entropy, a fast peer always chooses another fast peer. When rumoring, a slow peer always chooses another slow peer, so that it cannot slow down the target peer, unless it is the source of the rumor; in this case, it chooses a fast peer as the initial target. Finally, when performing anti-entropy, a slow peer chooses any node with equal probability.
In the next section, we study the performance of the bandwidth aware algorithm when used in a dynamic environment where membership changes constantly.
![]() |
To complete our exposition, we study a dynamic community with the following behavior. The community is comprised of 1000 members. 40% of the members are online all the time. 60% of the members are online for an average of 60 minutes and then offline again for an average of 140 minutes. Both online and offline times are generated using a Poisson process. 20% of the time, when a peer rejoins the on-line community, it sends a Bloom filter diff containing 1000 new terms. These parameters were again based roughly on measurements reported by Saroiu et al. [88] (except for the number of new terms being shared occasionally) and are meant to be representative of real communities. We note again that 1000 new unique terms typically represents the sharing of a significant set of new documents. (We have also studied a more dynamic community, where 50% of the time, a peer coming back on-line shares 100 new words. The results are similar to those present below.)
![]() |
Figure 3.6(a) plots the cumulative percentage of events against the convergence time. We observe that with sufficient bandwidth, convergence time is very tight around 400 seconds. For the MIX community we separate the CDF in two classes: the time it takes for fast nodes to propagate events to other fast nodes (MIX-F) and the time it takes for slow nodes to reach the whole community (MIX-S). Intuitively, what we are trying to asses is whether the presence of slow nodes affects the performance of the well connected part of the community when using the bandwidth aware algorithm (BA). The graph shows that the BA algorithm allows fast nodes (MIF-F) to propagate events as in the LAN case.
Furthermore, in Figure 3.6(b) we compare the convergence time for the non bandwidth aware algorithm (NBA) against the BA version. The figure shows that BA does not harms the propagation speed for slow nodes (MIX-S vs. MIX-S NBA), but it significantly help the fast nodes (MIX-F vs. MIX-F NBA) to propagate at their maximum speed.
![]() |
Finally, Figure 3.7 plots the aggregate bandwidth against time. This graph shows that the normal operation of a community requires very little bandwidth, ranging from between 10 KB/s to 100 KB/s across the entire community.
PlanetP uses gossiping to robustly disseminate new information and to maintain a loosely replicated database across all the nodes.
We have shown that changes to replicated data consistently reach all on-line members within several minutes even when the communal membership changes at a fast speed. Further, synchronizing a global database, like the global directory, using our gossiping algorithm requires only a modest amount of bandwidth. We have used a challenging environment, where a heterogeneous P2P community shares only text documents and is constantly adding new files to predict performance in a worst case scenario.
We deliberately study a very weakly structured system because tight coordination is likely difficult and costly in large distributed and dynamic communities [64]. Fundamentally, our approach only depends on nodes managing their individual excess storage in a fair manner and having approximate data about replica-to-node mapping and average node availability. The need for information on replica-to-node mapping is obvious. With respect to node availability, without this information, a replication algorithm cannot differentiate between nodes with very different availabilities and thus it may under- or over-replicate files. Beyond this loosely synchronized global state, all decisions are local and autonomous. Of course, one can increase the level of coordination between nodes to increase the efficiency of the system in utilizing storage and bandwidth. In essence, we seek to conservatively estimate the amount of storage required to provide a practical availability of 99.9%, which, as already mentioned, is comparable to today's web services.
Our studies show that it is possible to increase availability of shared data to practical levels, e.g., 99.9%, using a decentralized algorithm where members operate autonomously with little dependence on the behaviors of their peers. We study the performance of our algorithm for three distinct environments, modeled using data published from earlier studies of a corporate environment [11], the Napster and Gnutella file sharing communities [88], and a file sharing community local to students of our University.
In our replication approach, each member of a community hoards some subset of the shared files entirely on their local storage, called the member's hoard set, and pushes replicas of its hoard set to nodes with excess storage using an erasure code. We propose such a structure to support disconnected access to shared data. In loosely organized applications such as current file sharing, hoarding is uncoordinated and entirely driven by members' need for disconnected access. In more tightly organized applications, such as a file system, the application can coordinate the division of the shared data set among individuals' hoard sets, in essence dividing the responsibility for ensuring the availability of the shared data.
To simplify our description, we introduce a small amount of terminology. We call a member that is trying to replicate an erasure-coded fragment of a file the replicator and the node that the replicator is asking to store the fragment the target. We call the excess storage space contributed by each member for replication its replication store. (Note that we do not include replication via hoarding as part of the replication store.) We assume that each file is identified by a unique ID. Finally, when we say ``the availability of a fragment,'' we are referring to the availability of the file that the fragment is a piece of.
Given this terminology, the overall algorithm is as follows:
Our goal in designing this algorithm is to increase the availability of all shared files toward a common target availability while allowing nodes to act completely autonomously using only a small amount of loosely synchronized global data. Given a replication store that is very large compared to the set of documents being shared, we know that this approach will work [83]. The interesting questions become what is the necessary ratio of the replication store to the size of the document set and what happens when the replication store is not sufficiently large for the community to achieve the target availability for all files. We explore these questions in Section 4.3. In the remainder of this section, we will discuss our use of erasure coding, how to estimate file availability, our replacement policy, and the resiliency of our approach to misbehaving nodes.
To date, given (n,m), most uses of erasure codes generate all n fragments and, over time, detect and regenerate specific lost fragments. This approach has three significant disadvantages for highly dynamic environments: (i) as the average per-member availability changes over time, files' availability will change given a fixed n; to maintain a target availability, it may thus be necessary to change n, necessitating the re-fragmenting and replication of some (perhaps all) files; (ii) an accurate fragment-to-node mapping is required for the regeneration of fragments lost due either to nodes leaving the community permanently or ejection from the replication store; and (iii) it must be possible to differentiate accurately between nodes temporarily going offline and leaving the community permanently to avoid introducing duplicate fragments, which reduces the effectiveness of erasure coding.
To overcome these limitations, we choose n » m but do not generate all n fragments. When a member decides to increase the availability of a file, it simply generates an additional random fragment from the set of n possible fragments. If n is sufficiently large, the chances of having duplicate fragments should be small, thus maximizing the usefulness of every fragment generated in spite of not having any node coordination. In this manner, it is easy to dynamically adjust the number of fragments generated for each file to reflect changes in the community.
RS is particularly suited to our proposed use because the cost of generating each fragment is independent of n. The fundamental idea behind RS is that a polynomial of degree m-1 in a Galois field GF(2w) is uniquely defined by any m points in the field. In order to create an erasure code for the blocks D1,...,Dm of a file, we need a polynomial p such that p(t1)=D1,...,p(tm)=Dm. Once we have this polynomial, it is easy to create up to 2w - m additional points p(ti)=Di, i>m such that the polynomial can be reconstructed from any combination of m items from the set {(t1,D1),...,(tm,Dm),...,(ti,Di),...}. Observe that, given the polynomial, the generation of any one fragment is independent of n as desired. According to Rizzo [81], files can be encoded/decoded on a Pentium 133Mhz at 11MB/s. Moreover using w's up to 16 is quite feasible, which translates into a 0.006 probability of having collisions for the environments studied in Section 4.3.
Given the above information in the global index, we can identify the
set of nodes hoarding any file f, H(f), and those that contain a
fragment of f, F(f). Then, assuming that nodes' behaviors are not
correlated [7,11], the
availability of f, A(f), can be estimated as 1 minus the
probability that all nodes in H(f) are simultaneously offline and at least n-m+1 of the nodes in F(f) are also offline; m is
the number of fragments required to reconstruct f and n is
redefined as
. In general, since every node in
the system may have a different probability for being offline, say
Pi for node i, it would be too expensive to compute the exact
file availability. Thus, we instead use the following approximation
that uses the average probability of being offline (Pavg) of
nodes in F(f):
where
Note that equation 4.1 assumes that H(f) and F(f) do not intersect and that all n fragments reside on different nodes. We ensure this by allowing a node to either hoard an entire file or store only a single fragment of that file.
In addition, we are assuming a closed community, where members may go offline but do not permanently leave the community. Currently, we assume that members will be dropped from the directory if they have been offline for some threshold amount of time. Thus, when members permanently leave the community, the predicted availabilities of files may become stale. Since we periodically refresh the predicted availability of files, this should not become a problem.
Finally, note that equation 4.1 does not account for the possibility of duplicate fragments; as already argued, however, we can make the chance of having duplicate fragments quite small and so the impact should be negligible.
Our replacement policy is as follows. We first compute the average number of nines in the availability of all fragments currently stored at the target. Then, if the incoming fragment's number of nines is above 10% of this average, we simply reject it outright. Otherwise, we use lottery scheduling [103] to effect the weighted random selection of victim fragments. In particular, we create a set of tickets and divide them into two subsets with the ratio 80:20. Each fragment is assigned an equal share of the smaller subset. In addition, fragments with availability above 10% of the average are given a portion of the larger subset. The portion given to each fragment is proportional to the ratio between its number of nines and the sum of the number of nines of all such fragments. The notion of different currencies makes this division of tickets straightforward. For example: if a target node has three fragments with availabilities 0.99, 0.9, 0.5 or 2, 1, .3 ``nines'' respectively4.1then the average availability in nines plus 10% is 0.76. Now if we have 100 lottery tickets the first fragment will get 67+6.6 tickets from the first and second pool respectively, the second fragment will get 13+6.6 tickets and the third fragment will get 0+6.6 tickets. Overall, the probability of each fragment being evicted will be 0.73, 0.19 and 0.06 respectively.
The intuitions behind our replacement policy are as follows. First, we reject the incoming fragment if it will simply become a target for eviction the next time a replication request is received by the target node. Without this condition, we will simply be shifting fragments around without much effect. Our threshold for this outright rejection may seem rather low; at some cost of bandwidth, if we were less aggressive at rejecting fragments, perhaps over time, the system can reach a better configuration. However, our experimentation shows that while this threshold affects bandwidth usage, it does not significantly affect the overall replication. Since we are targeting environments where bandwidth may be a precious commodity (see Section 4.3), we decided that an aggressive threshold was appropriate.
Next, we penalize over-replicated files heavily for the number of nines in their availability, making it highly probable that a fragment of an over-replicated file will be evicted. We use the number of nines rather than the availability itself because it linearizes the differences between values, i.e. the difference between 0.9 and 0.99 is the same as that between 0.99 and 0.999.
While the critical decisions rest at the targets in our algorithm, replicators can implement several optimizations to increase the convergence rate. First, a replicator might increase availability convergence speed by favoring files that have low estimated availability over files with high estimated availability. Because of hoarding and potentially stale data, it is again necessary to use a weighted random selection process instead of a deterministic process. We use lottery scheduling in a manner similar to replacement, except in the reverse sense, of course, favoring files with low availability.
Second, a replicator can try to find nodes with free space in their replication store rather than causing an eviction at a node whose replication store is full. To implement this optimization, a replicator first selects a target randomly. Then, it inquires whether the target has sufficient free space in its replication store. If not, then the replicator selects another target, again randomly. The replicator repeats this process for five times (it could arbitrarily be any number of times), and then gives up and simply chooses randomly from these five previously chosen targets.
Finally, a member can choose to replicate only a subset of its hoard set to increase the availability of this subset at the cost of reduced availability for the remaining files. The power of our approach is that, if some member is interested in a file being highly available, it can act as a champion for that file by hoarding it. Ultimately, it is up to the application that places the hoarded files to decide when extra copies are needed.
Finally, we always fragment files using a Reed Solomon code with a fixed m set to 10. This means that all fragmented files can be reassembled with 10 fragments, regardless of its size. Although in practice the algorithm could vary m to achieve the best trade-off between fragment size, download latency, and space utilization, we fixed it to simplify the analysis.
| ||||||||||||||||||||||||||||||||||||||||||
For all three communities, we vary the amount of excess space provided by each member across several different simulations. In all cases we refer to excess space as a proportion of the number of bytes in members' hoard sets. In all scenarios, the number of files and nodes has been scaled to allow us to run experiments within reasonable times. As shall be seen, since nodes provide excess space based on what they share, the total number of files and nodes will not affect the algorithm. In all experiments, we assume a target file availability of three nines (99.9%).
File-sharing (FS). The first community that we study is modeled using two sources of data:
![]() |
Finally, we use Saroiu et al.'s reported Gnutella uptimes and availabilities to drive members' arrival to and departure from the online community. We made two assumptions because of insufficient data:
Corporate (CO). The second community represents what we might expect to see in a corporate or university environment. In addition to being a significantly different environment than FS, studying this environment also allow us to compare our results with Farsite [11]. Farsite has somewhat similar goals to ours but takes a significant, almost inverse approach (see section 2.3 for more details). Thus, this community is modeled using data provided by Bolosky et al. [11]. All the parameters shown in Table 4.1 were taken directly from Bolosky et al.'s work with the exception of the absolute number of nodes and total number of files, which were scaled down to limit simulation time.
![]() |
Workgroup (WG). Finally, the third configuration tries to capture a geographically distributed work group environment, such as a group of developers collaborating on an open-source project. The idea is to evaluate the impact of not having any member with server-like behaviors, i.e., members with large hoard sets that are (relatively) highly available. In particular, we simulate 100 nodes sharing 10,000 files of 290KB each--this is the average size of non-media files found in [29]. The number of files per user is distributed either uniformly or according to a distribution approximating our measurements of the local P2P network, shown in Figure 4.2. Nodes will follow a work schedule with a mean of 1 hour online followed by 2 hours offline. Again, arrival is exponentially distributed.
To evaluate the impact of the amount of information assumed to be available to drive the replication algorithm, we will compare our algorithm, called REP, against two alternatives: the first one, called BASE, simulates nodes pushing and accepting replicas in the complete absence of information on file availability; the second, called OMNI, assumes centralized knowledge, where replication is driven by a central agent that tries to maximize the minimum file availability. Thus, a comparison with BASE quantifies the effect of having approximate knowledge of current file availability. Comparing against OMNI quantifies the effect of autonomous actions using loosely synchronized data as opposed to having perfect, centralized information.
In BASE, nodes will use FIFO to manage their excess storage and replicators will only select files when its estimated availability is below the target availability. Note that this is really just an optimization to limit simulation time as the results would be essentially the same if BASE didn't implement this optimization.
OMNI is implemented as a