
subTask 1.2 (UM): Memory hierarchy and cache coherence in CMP architectures

Leader: Manuel Acacio; Researchers: Juan Fernández, Ricardo Fernández, José M. García, Epifanio Gaona, Alberto Ros, Rubén Titos

At the very beginning of the project we started a new research line within this subTask on coherence protocols for virtualization, since server consolidation is becoming increasingly important as an effective way to take advantage of multicore systems.

1. Brief Description of the Goals

One of the most critical design points for future many-core CMPs is the organization of the on-chip cache hierarchy. On-chip caches are crucial to bridge the increasing gap between processor and memory speeds (the well-known memory wall problem). Small, and therefore fast, per-core caches can capture the great majority of data accesses thanks to the temporal and spatial locality that applications exhibit in their memory accesses. Unfortunately, the presence of private caches also demands a hardware cache coherence protocol that guarantees correct execution.

On the other hand, the transactional memory (TM) programming paradigm has emerged as a promising alternative to current lock-based synchronization. Using the TM model, the programmer declares which regions of the code must appear to execute atomically and in isolation, leaving the burden of providing these properties to the underlying levels. The TM system then executes transactions optimistically, stalling or aborting them whenever real run-time data conflicts appear, potentially achieving the performance of fine-grain locks with a programming effort similar to that of coarse-grain synchronization. In the context of many-core CMPs, the aim of this research line is to provide new solutions to the cache coherence problem, to study novel organizations and management policies for the on-chip caches, and to investigate techniques that ensure more efficient execution of transactions even in the presence of conflicts.

In this subTask we were also interested in the design of the best memory hierarchy and coherence protocol for CMPs that execute consolidated workloads by means of virtualization.

2. Scientific and Technical Activities Developed

We have developed a new family of cache coherence protocols called Direct Coherence Protocols, aimed at avoiding the indirection of traditional directory-based protocols without relying on broadcasting requests. The key property of Direct Coherence Protocols is that both the role of ordering requests from different processors to the same memory block and the role of storing the coherence information are moved from the home tile to the tile that provides the data block on a cache miss, i.e., the owner tile. Therefore, indirection is avoided by sending requests directly to the owner tile instead of to the home tile. The concept of Direct Coherence was published by Ros et al. at the 2007 HiPC Int'l conference. Subsequently, we proposed an implementation of Direct Coherence for CMPs (DiCo-CMP), which was published by Ros et al. at the 2008 IPDPS Int'l conference. Finally, several extensions of DiCo-CMP were proposed and evaluated in a journal paper published by Ros et al. in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 21, issue 12, 2010.
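The effect of removing the home-tile indirection can be illustrated with a small hop-count model. The sketch below is not the DiCo-CMP implementation: it assumes a 2D mesh of tiles with Manhattan routing, and it assumes the requester already knows which tile is the owner (how that knowledge is obtained is outside this sketch). The function names are hypothetical.

```python
def tile_xy(tile, width):
    """Coordinates of a tile in a mesh that is `width` tiles wide."""
    return tile % width, tile // width


def hops(a, b, width):
    """Manhattan distance between two tiles (assumed routing model)."""
    ax, ay = tile_xy(a, width)
    bx, by = tile_xy(b, width)
    return abs(ax - bx) + abs(ay - by)


def directory_miss_hops(requester, home, owner, width):
    # Directory-style miss: requester -> home (ordering point) -> owner -> requester.
    return (hops(requester, home, width)
            + hops(home, owner, width)
            + hops(owner, requester, width))


def direct_miss_hops(requester, owner, width):
    # Direct-coherence-style miss: the owner orders the request and supplies the data,
    # so the path is requester -> owner -> requester.
    return 2 * hops(requester, owner, width)


if __name__ == "__main__":
    # 4x4 mesh: requester in one corner, home in the opposite corner, owner next door.
    print(directory_miss_hops(0, 15, 1, 4))  # 12
    print(direct_miss_hops(0, 1, 4))         # 2
```

Under this model the direct path is never longer than the directory path, because the distance from requester to owner cannot exceed the detour through the home tile.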

Regarding the management policies for the on-chip caches, we investigated a new policy for last-level caches that tries to map the pages accessed by a core to its closest (local) bank, as in a first-touch policy. However, the proposed policy also introduces an upper bound on the deviation of the distribution of memory pages among cache banks, which reduces the number of off-chip accesses. This tradeoff is addressed without requiring any extra hardware structure. The results of this investigation were published by Ros et al. at the 2010 HiPC Int’l conference.
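A first-touch mapping with a bound on the imbalance among banks could look roughly like the following. This is a minimal sketch under stated assumptions, not the published policy: the class name, the `max_deviation` parameter and the rule "fall back to the least-loaded bank once the local bank is too far ahead" are all hypothetical choices made for illustration.

```python
class BoundedFirstTouch:
    """First-touch page-to-bank mapping with a cap on the load imbalance (illustrative)."""

    def __init__(self, num_banks, max_deviation):
        self.pages_per_bank = [0] * num_banks
        self.max_deviation = max_deviation  # assumed bound: max load minus min load
        self.page_to_bank = {}

    def bank_for(self, page, local_bank):
        # A page keeps the bank it was assigned on its first touch.
        if page in self.page_to_bank:
            return self.page_to_bank[page]
        least = min(self.pages_per_bank)
        if self.pages_per_bank[local_bank] - least < self.max_deviation:
            bank = local_bank  # plain first-touch: use the toucher's local bank
        else:
            # Local bank is too far ahead: use the lowest-numbered least-loaded bank.
            bank = self.pages_per_bank.index(least)
        self.pages_per_bank[bank] += 1
        self.page_to_bank[page] = bank
        return bank


if __name__ == "__main__":
    policy = BoundedFirstTouch(num_banks=4, max_deviation=2)
    # A single core whose local bank is 0 touches six new pages.
    print([policy.bank_for(page, local_bank=0) for page in range(6)])
    # [0, 0, 1, 2, 3, 0]
```

The fallback here ignores distance for simplicity; a policy closer to the intent of the text would presumably prefer the nearest bank that is still under the bound.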

In the context of Hardware Transactional Memory systems (HTMs), we have proposed a novel approach to conflict detection that relies on the directory controller instead of the private L1 caches. Our proposal achieves significant reductions in both traffic and execution time. This proposal was published by Titos et al. at the 2008 HiPC Int'l conference. Additionally, we proposed a hybrid, pseudo-optimistic conflict resolution scheme that recaptures the concept of speculation to allow transactions to continue their execution past conflicting accesses. This was published by Titos et al. at the 2009 IPDPS Int'l conference. We have also studied hybrid-policy HTM designs that increase concurrency between transactions, and consequently performance, while taking the resulting complexity into account. The results of this research were published by Titos et al. at the 2011 ICS Int’l conference and by Negi et al. at the 2011 ICPP Int’l conference and the 2012 HPCA Int’l conference. We also analyzed the energy consumed by each of the different HTM approaches proposed in the literature. The results of this study were published by Gaona et al. at the 2010 SBAC-PAD Int’l conference.

Finally, we investigated the benefits of turning the concept of transactional conflict from its traditionally fixed definition into a variable one that can be dynamically controlled in software. We proposed extending the atomic language construct with an attribute that specifies the definition of conflict, so that programmers can write code that adjusts which kinds of conflicts are to be detected, relaxing or tightening the conditions according to the forms of interference that a particular algorithm can tolerate. The results of this investigation were published by Titos et al. in ACM Transactions on Architecture and Code Optimization, vol. 8, issue 4, 2012.
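The idea of a software-selectable conflict definition can be sketched as a conflict check that takes the set of conflict kinds to detect as a parameter. This is only an illustration: the `Tx` type, the `detect` parameter and the kind names "read-write" and "write-write" are hypothetical and do not reproduce the syntax or semantics of the proposed language attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Tx:
    """Read and write sets of one transaction (illustrative model)."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def has_conflict(a, b, detect=("read-write", "write-write")):
    """Return True if a and b conflict under the chosen conflict definition."""
    if "write-write" in detect and a.writes & b.writes:
        return True
    if "read-write" in detect and ((a.reads & b.writes) or (a.writes & b.reads)):
        return True
    return False


if __name__ == "__main__":
    t1 = Tx(reads={"x"}, writes={"y"})
    t2 = Tx(writes={"x"})
    print(has_conflict(t1, t2))                           # True
    print(has_conflict(t1, t2, detect=("write-write",)))  # False
```

Relaxing the definition to write-write conflicts only lets `t1` and `t2` proceed concurrently, which is acceptable only for algorithms that tolerate reading a value another transaction is overwriting.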

Another important issue we analyzed in this research line was how to provide efficient synchronization primitives in future many-core CMPs. In particular, we found that a large fraction of the coherence traffic expected in many-core CMPs is due to synchronization. Based on this observation, we proposed efficient hardware implementations of barriers and locks that drastically cut down synchronization traffic, and consequently energy consumption in the interconnection network, while also improving performance. The results of this research were published by Abellán et al. at the 2010 ICPP Int’l conference, at the 2011 IPDPS Int’l conference (where the paper won a best paper award), and more recently in IEEE Transactions on Parallel and Distributed Systems, vol. 23, issue 8, 2012.

Regarding the research on coherence for virtualization, we developed Multiple-Area Cache Coherence Protocols as a new mechanism to improve the locality of data retrieval inside a chip, reducing NoC traffic. We proposed a family of cache coherence protocols that statically divide the chip into areas. Coherence is maintained per area, and pointers link the areas, which reduces the directory overhead and saves energy. This reduction is especially significant with large core counts (e.g., 93% overhead reduction compared to a full-map directory for 1024 cores and just 4 areas).

We also developed DAPSCO (Distance-Aware Partially Shared Cache Organization), an efficient directory cache-coherence scheme that does not use dedicated structures, saving area and energy, and that enables flexible private/shared cache behaviour to fit application needs. We proposed DAPSCO as an optimization to traditional partially shared caches, based on the observation that clustering the LLC banks is not efficient.

DAPSCO uses a more efficient core-bank mapping in which no clusters exist and each core accesses its surrounding LLC banks, giving every core the impression that it is located in the center of a cluster, minimizing the average distance to the LLC. DAPSCO's mapping holds the same desirable properties as traditional clustered organizations.
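The distance benefit of letting each core use its surrounding banks can be checked with a toy model. The sketch below is not DAPSCO's actual mapping: it assumes an 8x8 torus (to avoid edge effects, which a mesh would have to handle separately), a sharing degree of 16 banks per core, and a hypothetical labelling of banks by `(x % 4, y % 4)` so that every core still finds each address slice exactly once among its banks.

```python
from itertools import product

N = 8  # assumed torus side (8x8 tiles, one core and one LLC bank per tile)
S = 4  # assumed side of a cluster/window, i.e. each core uses S*S banks


def torus_dist(a, b):
    d = abs(a - b)
    return min(d, N - d)


def clustered_banks(cx, cy):
    # Traditional organization: the S x S cluster that contains the core.
    bx, by = (cx // S) * S, (cy // S) * S
    return [(bx + i, by + j) for i, j in product(range(S), repeat=2)]


def surrounding_banks(cx, cy):
    # Distance-aware organization: an S x S window roughly centred on the core.
    half = S // 2
    return [((cx + i) % N, (cy + j) % N)
            for i, j in product(range(-half, half), repeat=2)]


def average_distance(bank_set):
    total = count = 0
    for cx, cy in product(range(N), repeat=2):
        for bx, by in bank_set(cx, cy):
            total += torus_dist(cx, bx) + torus_dist(cy, by)
            count += 1
    return total / count


def covers_all_slices(bank_set):
    # Every core must see each of the S*S address slices exactly once.
    return all(len({(x % S, y % S) for x, y in bank_set(cx, cy)}) == S * S
               for cx, cy in product(range(N), repeat=2))


if __name__ == "__main__":
    print(average_distance(clustered_banks))    # 2.5
    print(average_distance(surrounding_banks))  # 2.0
    print(covers_all_slices(surrounding_banks)) # True
```

In this model the centred window lowers the average core-to-bank distance from 2.5 to 2.0 hops while each core still reaches one bank per address slice, which is the property a clustered organization provides.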

All these developments were presented at two top conferences by Antonio García et al. (the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2010, and the International Conference on Parallel Processing (ICPP), 2011), and also in a journal paper in ACM Trans. Archit. Code Optim. 8, 4, Article 25 (January 2012).


Publications: [Ros07a], [Ros07b], [Titos07], [Ros08a], [Ros08b], [Ros08c], [Ros08d], [Titos08a], [Titos08b], [Ros09a], [Ros09b], [Titos09]

Projects funded by European grants: [HiPEAC]

External collaborations Academia:  Marcelo Cintra, Per Stenström, David Bertozzi

External collaborations Industry: --

Company Agreements: --

PhD dissertations: Ricardo Fernández Pascual, Alberto Ros Bardisa

Patents:  --