Estimating File Availability

In order to compare the different replication algorithms discussed in [1], we must find a tractable and accurate method to compute file availability. A straightforward approach would be to divide the observed file uptime over the total simulation time. However, estimating file availability in this way has two disadvantages: First, the observed uptime value includes transient effects; specially when the simulation starts without any replication at all. Second, perhaps more significantly, we cannot simulate a long enough interval to observe a significant amount of downtime for highly available nodes. This effect causes the observed (past) file availability to be overly optimistic and noisy to predict future performance.

In order to understand the different alternatives for estimating file availability, Figure 1 (below) presents the CDFs of several estimators obtained from the same arrangements on a commercial (CO) and a file sharing (FS) environments. First, the E curve is the full combinatorial estimator mentioned in Section 3. The full estimator computes all possible failure scenarios leading to data loss and therefore it becomes less tractable as the number of fragments increases. The BAM curve uses equation 1 from Section 3 combined with the measured availability of peers from the simulation; this is the estimator used in real-life by peers to drive the REP algorithm. The BAE estimator is like BAM but using the expected availability of peers as drawn from the node availability CDF. Finally, curve M is the measured availability after letting the algorithm run for sufficient time, which has the drawback mentioned above.

Using these results we can compare the different estimators. Notice that BAM and BAE are pessimistic with respect to E, while M is slightly optimistic. Since our simulations run for a fixed period of time (6 simulated months on average) we need to limit the estimated availability for files and nodes that were never unavailable. As we use 4 nines to limit the observed availability the results on Figure 1 are truncated at 4 and therefore we can not see a bigger difference between E and M.

The important conclusion that we draw from these results is that BAM, which is the only metric that can be computed in real time to drive the algorithm, is very accurate for low availability levels and always pessimistic with respect to other estimators.

 

File availability predicted by various methods compared against observed values for (a) CO at 1X, (b) CO at 1.5X, and (c) FS at 1X excess capacity. E is the use of the true combinatorial estimator, using the known expected availability of each individual peer. BAM is the use of equation 1 from Section 3, using the measured availability of peers from the simulation. BAE is the use of equation 1 but using the known expected availability of peers instead. Finally, M is the average measured file availability after letting the algorithm run for sufficient time.