
subTask 1.4 (UM): Fault-tolerance in CMP architectures

Leader: José Manuel García; Researchers: Ricardo Fernández, Daniel Sánchez, Juan Luis Aragón

1. Brief Description of the Goals

Until now, improvements in CMOS fabrication technology scaling have allowed the number of transistors per chip to increase exponentially. For some time, this growing transistor count was used to design ever more aggressive out-of-order single-core processors. However, once the proposed techniques could no longer extract additional Instruction Level Parallelism (ILP), architectural designs were forced to move towards multi-core architectures, which exploit Thread Level Parallelism (TLP) to deliver higher performance while keeping power and design complexity at manageable levels. On the other hand, miniaturization trends are increasing the susceptibility of integrated circuits to a variety of hardware errors, such as soft errors, wear-out-related permanent faults and process variations. To overcome these problems, new architectural techniques are needed that ensure the reliability of the chip. Traditionally, this has been achieved by devoting significant redundant hardware resources. Our main goal has been to provide fault tolerance with minimal performance degradation. We have developed our fault-tolerance techniques at both the microarchitectural level and the interconnection network level.

2. Scientific and Technical Developed Activities

In the context of fault tolerance at the microarchitecture level, we have proposed REPAS, a new fault-tolerant architecture based on redundant execution within SMT cores. This proposal reduces both performance degradation and area overhead compared to previous work. The research results were published by Sánchez et al. at the WDDD 2008 Workshop, the DPDNS 2009 Workshop and the Euro-Par 2009 Conference.
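The general idea of redundant execution can be illustrated in software with a minimal sketch. This is not the REPAS microarchitecture, whose details are not given here: it only shows the generic pattern of running the same work twice, comparing the results, and re-executing on a mismatch. The function name `run_redundantly` and the `max_retries` parameter are hypothetical and introduced purely for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_redundantly(op: Callable[[], T], max_retries: int = 3) -> T:
    """Run `op` twice and accept the result only if both copies agree.

    Illustrative assumption: a transient fault corrupts at most one of
    the two executions, so a mismatch signals an error and triggers
    re-execution of both copies.
    """
    for _ in range(max_retries + 1):
        leading = op()   # first (leading) execution
        trailing = op()  # redundant (trailing) execution
        if leading == trailing:
            return leading  # results agree: commit the value
        # Results differ: discard both and re-execute.
    raise RuntimeError("results still disagree after re-execution")


if __name__ == "__main__":
    # Hypothetical fault injection: the first call returns a corrupted value.
    state = {"calls": 0}

    def flaky_add() -> int:
        state["calls"] += 1
        return 41 if state["calls"] == 1 else 42

    print(run_redundantly(flaky_add))
```

In hardware the comparison is typically performed before results become architecturally visible; the retry loop above plays that role in this simplified sketch.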

We also proposed a solution based on coarse-grained redundancy to create a reliable multiprocessor: a log-based redundant architecture for building fault-tolerant CMPs that uses the capabilities provided by the hardware transactional memory (HTM) subsystem. The research results were published by Sánchez et al. at the HiPC 2010 Conference and in The Journal of Supercomputing, vol. 61, no. 3, 2012.

Within this sub-task we also focused on the design of reliable architectures to mitigate process variation in the cores of the CMP. In particular, we focused on caches, since they dominate the area of modern processors and are built with minimum-sized, failure-prone SRAM cells (failures are caused by decreasing supply voltages, higher frequencies and temperatures, and other effects such as power supply noise, signal crosstalk and process variation). We proposed an analytical model that determines the expected miss ratio of a given application when it is executed on a cache in which each cell fails with a given random probability. This analytical model allows designers to better understand the real impact of faults in caches. The research results were published by Sánchez et al. at the IOLTS 2011 Conference. This research also resulted in the PhD thesis of Daniel Sánchez.
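To give a feel for this kind of reasoning, the following is a minimal sketch under strong simplifying assumptions; it is not the model published at IOLTS 2011. It assumes that cells fail independently with probability `p_cell`, that a block with any faulty cell is disabled, and that the application's miss ratio depends only on the number of usable ways per set. The per-way miss ratios (`miss_ratio_by_ways`) are a hypothetical input that would come from profiling, and all function names are illustrative.

```python
from math import comb
from typing import Sequence


def block_fault_probability(p_cell: float, bits_per_block: int) -> float:
    """Probability that a block contains at least one faulty cell,
    assuming independent cell failures with probability p_cell."""
    return 1.0 - (1.0 - p_cell) ** bits_per_block


def expected_miss_ratio(
    p_cell: float,
    bits_per_block: int,
    associativity: int,
    miss_ratio_by_ways: Sequence[float],
) -> float:
    """Expected miss ratio when faulty blocks are disabled.

    miss_ratio_by_ways[k] is the (assumed, profiled) miss ratio of the
    application when every set has k usable ways, for k = 0..associativity.
    The number of usable ways in a set is treated as binomially distributed.
    """
    if len(miss_ratio_by_ways) != associativity + 1:
        raise ValueError("need one miss ratio per way count 0..associativity")
    q = block_fault_probability(p_cell, bits_per_block)
    total = 0.0
    for k in range(associativity + 1):
        # Probability that exactly k of the ways in a set are fault-free.
        prob_k = comb(associativity, k) * (1.0 - q) ** k * q ** (associativity - k)
        total += prob_k * miss_ratio_by_ways[k]
    return total


if __name__ == "__main__":
    # Hypothetical 4-way cache with 512-bit blocks and made-up miss ratios.
    ratios = [1.0, 0.20, 0.12, 0.09, 0.08]
    for p in (0.0, 1e-4, 1e-3):
        print(p, expected_miss_ratio(p, 512, 4, ratios))
```

Even this simplified version shows why a small per-cell failure probability matters: with hundreds of cells per block, the probability of losing a whole block grows much faster than `p_cell` itself.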

The other major component affected by faults in current and future CMPs is the on-chip interconnection network (NoC). We proposed to deal with these faults at the level of the cache coherence protocol instead of at the level of the interconnection network itself. We showed the viability of our approach and developed several fault-tolerant cache coherence protocols. These results were published in well-known international venues: Fernández et al. at HPCA 2007, Fernández et al. at DSN 2008, and Fernández et al. in IEEE Transactions on Parallel and Distributed Systems in 2008.


Publications: [Fern07], [Fern08a], [Fern08b], [Fern08c], [Sanchez08], [Sanchez09a], [Sanchez09b]

Projects funded by Public Calls: 

External collaborations Academia: --

External collaborations Industry: Exagroup (Spain); Ben Arabi (Murcia Supercomputing Centre) (Spain)

Company Agreements: --

PhD dissertations: Ricardo Fernández Pascual; Daniel Sánchez Pedreño

Patents: --