Main Menu
About us
Project Description
Quantitative Results
Research Lines
Research Results
Impact on Society
Press room
Contact us
Secure Login
Events Calendar
« < April 2019 > »
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 1 2 3 4 5

subTask 1.4(UPV): Development of efficient fault tolerant routing techniques

Leader: Antonio Robles; Researchers: Jose Miguel Montañana, Crispín Gómez, José Flich, María Engracia Gómez, Pedro López

1. Brief Description of the Goals

 In many of the cluster-based systems, providing fault tolerance is a requirement and is becoming a necessity. These systems use very large number of components (processors, switches, and links). Each individual component can fail, and thus the probability of failure of the entire system increases with the number of components. Although switches and links are robust, they work close to their technological limit, and therefore they are prone to faults. Increasing clock frequency leads to higher power dissipation, and a higher heating could lead to premature faults. So, fault-tolerant mechanisms in cluster-based systems are becoming a key issue. 

Most of the fault-tolerant routing strategies proposed in the literature for massively parallel computers are not suitable for clusters. This is because they often require certain hardware support and/or network features (specific topologies, adaptive routing, etc.) that, unlike the proprietary network technologies commonly used in parallel computers, is not provided by the current commercial interconnect technologies, such as InfiniBand.  

The main objective of this task is to provide efficient and scalable fault-tolerant routing strategies suitable for being applied to high-performance commercial interconnects for PC clusters. Some of the fault-tolerant routing schemes developed are intended to particular topologies, such as meshes or fat-trees. The latter has become very popular among high–speed interconnects manufacturers. However, others can be applied regardless of the network topology.

2. Scientific and Technical Developed Activities

Firstly, we have proposed two new fault tolerant routing strategies with selection of path in the source end node, which are referred to as Transition-Based Fault Tolerant Routing (TFTR) and Scalable Pattern-Based Fault-Tolerant Routing (SPFTR). Both mechanisms are based on providing a set of alternative disjoint paths for each pair of end nodes, in such a way that it is guaranteed that the system is deadlock free. The sets of disjoint paths provide a fast solution when a fault is detected, because they are ready to be used without needing to update any routing table (they are already computed and the routing information is already distributed). Despite the fact that both mechanisms minimize the number of required resources, only SPFTR is indeed scalable because it uses the same number of resources regardless of the network size. The fault tolerance degree of both mechanisms corresponds to the number of provided alternative paths minus one.

Secondly, it has been proposed a new fault tolerance mechanism based on routing in stages that is referred to as Reachability-based Fault Tolerant Routing (RFTR). The main objective consists of obtaining alternative paths for all the faulty ones in a time-efficient manner and using a bounded number of network resources. The idea consists of computing, in presence of faults, alternative paths joining fragments of existing paths, where the first fragment starts at the source end node, and the last fragment ends at the destination end node. This approach is able to provide, in probabilistic terms, a fault tolerance degree higher than that provided by TFTR/SPFTR. Indeed, the proposed approach can be considered as a network reconfiguration process limited to those switches/routing paths affected by the fault. When compared to a full network reconfiguration process, RFTR reduces significantly the number of dropped packets and the time required to manage the fault occurrence situation.

Thirdly, we proposed a new reconfiguration mechanism that is referred to as Epoch-Based Reconfiguration (EBR). It can be applied for providing fault tolerance Reconfiguration. With reconfiguration it is possible to tolerate any number of faults as the network remains connected. It is achieved starting a reconfiguration process once a fault is detected. Evaluation results show that EBR works efficiently in different topologies and with different routing algorithms. When compared with current reconfiguration proposals, EBR always gets the best numbers in all the analyzed parameters (e.g., dropped packets, latency, throughput, reconfiguration time, resources required), thus achieving the good properties of all the reconfiguration mechanisms. The results of this work were published by Montañana et al. in 2008 IPDPS Symposium and later, an extended version, and published by Lysne et al. in IEEE Transactions on Computers, vol. 57, no. 6, 2008.

These three results have constituted the main contributions of a PhD thesis of José Miguel Montañana, defended by July 2008.

Lastly, we have proposed a simple and efficient fault-tolerant distributed routing strategy for fat-trees that, unlike other previous proposals, it does not require additional network hardware, and its memory, switch hardware and routing delay scales up with the network size. Indeed, it nullifies only the strictly necessary paths, allowing adaptive routing through the healthy paths. The methodology is based on enhancing the Interval Routing scheme with exclusion intervals. Exclusion intervals are associated to each switch output port, and represent the nodes that are unreachable from this port after a fault. We propose a methodology to identify the links where the exclusion intervals must be updated after a fault, the values to write on them and a very efficient mechanism to distribute the required information through the network without stopping the system activity. Our methodology can tolerate a high number of network failures with a low degradation in performance. Moreover, it can achieve zero-packet losing during the updating period. Results of this work were published by Gómez et al. in ISPA 2007 Conference (Best paper award) and by Gómez et al. in IEEE Transactions on Parallel and Distributed Systems, vol. 20, issue 6, 2009.    

Finally in order to tolerate switch faults, we have designed a built-self-test switch architecture that is able to self-analyze and correct faults. It has been published by Strano et al. in DATE 2011.

Publications: [Gomez06a], [Gomez06b], [Gomez07], [Montañana07], [Montañana08a], [Montañana08b], [Gomez09

Projects funded by Public Calls:  TIN2006-15516-C04-01TIN2009-14475-C04-01

External collaborations Academia: Olav Lysne and Tor Skeie, SIMULA Laboratory (Norway), Timothy Mark Pinkston, University of Southern California (US)

External collaborations Industry: --

Company Agreements: --

PhD dissertations: José Miguel Montañana Aliaga

Patents:  --