
subTask 1.7 (UPV): Efficient cache coherence protocols

Leader: Antonio Robles; Researchers: Blas Antonio Cuesta

1. Brief Description of the Goals

Multiprocessor systems require cache coherence protocols to coordinate the multiple caches distributed throughout the system. Since these protocols greatly influence the performance and scalability of the system, and since systems are increasingly large, efficient and scalable cache coherence protocols are required. Traditionally, the design of coherence protocols has been based on snooping or on directories. On the one hand, snooping protocols are efficient, but they require the system to have a totally-ordered interconnect, which is not scalable. On the other hand, directory protocols can use scalable networks, but they are not as efficient. With the aim of simultaneously providing scalability and efficiency, token-based cache coherence protocols were proposed.

Token Coherence is a framework for developing cache coherence protocols based on tokens. Token Coherence is efficient and scales quite well. However, in case of contention, it requires a starvation prevention mechanism (called Persistent Requests) which is extremely inefficient and does not scale. Firstly, Persistent Requests override the performance policy, which provides low latency and efficiency. Secondly, Persistent Requests use broadcast messages, which are only suitable for small systems due to their lack of scalability. Thirdly, each node requires structures whose size depends on the system size and which, therefore, do not scale. Thus, the main problems of token-based protocols are caused by the Persistent Request mechanism.
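
The token-counting rule underlying such protocols can be illustrated with a minimal sketch (this is the well-known invariant of the Token Coherence framework, not the authors' implementation; all names here are hypothetical): every block has a fixed number of tokens, a node may read a block only while holding at least one token, and may write it only while holding all of them.

```python
TOTAL_TOKENS = 4  # one token per node in this toy 4-node system

class Node:
    def __init__(self, name):
        self.name = name
        self.tokens = 0  # tokens currently held for the (single) block

    def can_read(self):
        return self.tokens >= 1  # at least one token allows reading

    def can_write(self):
        return self.tokens == TOTAL_TOKENS  # all tokens allow writing

def transfer(src, dst, n):
    """Move n tokens from src to dst (models a coherence response)."""
    assert src.tokens >= n
    src.tokens -= n
    dst.tokens += n

a, b = Node("A"), Node("B")
a.tokens = TOTAL_TOKENS          # A starts as the owner with all tokens
assert a.can_write() and not b.can_read()

transfer(a, b, 1)                # B obtains one token: both may now read
assert a.can_read() and b.can_read()
assert not a.can_write()         # A lost exclusivity, so it can no longer write

transfer(a, b, TOTAL_TOKENS - 1) # B collects the remaining tokens to write
assert b.can_write()
```

The invariant makes safety independent of message ordering, which is precisely why an unordered, scalable interconnect can be used; starvation prevention (Persistent Requests) is needed only when contending nodes keep stealing tokens from each other.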

This research line aims at developing efficient cache coherence protocols by taking advantage of some well-known interconnection network techniques. In particular, we have focused on improving token-based cache coherence protocols by tackling their main weaknesses.

In addition, the initial goals have been extended to address the scalability problem inherent to directory-based protocols. Since keeping track of every memory block in the system entails huge storage requirements, some recent proposals and commodity systems only keep track of cached memory blocks. In this case, the directory information is kept only in small directory caches. Due to the lack of a full directory, the eviction of a directory entry entails the invalidation of the cached copies of the corresponding block. Since the size of directory caches is limited and systems incorporate an increasing number of processors and cores, directory caches may suffer frequent evictions and, consequently, exhibit high miss rates (up to 70 percent, as reported in some recent studies). As a result, the miss rate of processor caches may become excessively high, which can lead to serious performance degradation. Although the number of directory evictions can be reduced by using larger directory caches, this is not a scalable solution, since it entails both larger directory access latencies and higher directory memory overhead.
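
The cost of a directory-cache eviction can be sketched as follows (a toy model, not any concrete design: a fully-associative directory cache with LRU replacement is assumed). With no backing full directory, evicting an entry forces the invalidation of every cached copy of its block, even if those copies are still in use.

```python
from collections import OrderedDict

class DirectoryCache:
    """Toy directory cache: block -> set of sharer node ids, LRU-replaced."""
    def __init__(self, num_entries):
        self.entries = OrderedDict()
        self.num_entries = num_entries
        self.invalidations = 0  # cached copies destroyed by evictions

    def track(self, block, sharer):
        if block in self.entries:
            self.entries.move_to_end(block)      # refresh LRU position
        else:
            if len(self.entries) == self.num_entries:
                # Evict the LRU entry: all its sharers' copies must be
                # invalidated, because no full directory backs this cache.
                _, sharers = self.entries.popitem(last=False)
                self.invalidations += len(sharers)
            self.entries[block] = set()
        self.entries[block].add(sharer)

cache = DirectoryCache(num_entries=2)
for block in ["A", "B", "C", "A"]:   # more live blocks than entries
    cache.track(block, sharer=0)
# Two evictions occurred, each destroying one still-useful cached copy.
```

Each such invalidation later surfaces as an avoidable processor-cache miss, which is why high directory miss rates translate into serious performance degradation.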

In our research line, we opt to increase the effectiveness of the available directory cache space, assuming that it will commonly be a scarce resource, especially in large systems. We take advantage of the fact that a significant fraction of the memory blocks accessed by applications do not need coherence maintenance, i.e., they are either accessed by only one processor (private blocks) or not modified by any processor (read-only blocks). Therefore, this research line focuses on designing effective techniques to detect such memory blocks.

On the other hand, we also extend the initial goals to address the scalability problem inherent to the cache coherence support provided by some current commercial processors, such as the AMD Opterons, which limit the size of the supported coherence domain to eight dies.

2. Scientific and Technical Activities Developed

The primary goal of this research line was the development of a new starvation prevention mechanism for Token Coherence that was more elegant, efficient, and flexible than the existing one (Persistent Requests). To this end, we proposed the Priority Request mechanism, which orders requests by using the routing algorithm. Once this order is ensured, contention can be easily avoided without having to override the performance policy. Results showed that the Priority Request mechanism reduces the number of contention situations and, in addition, the number of messages required to resolve them. The overall traffic decreases too, which, in turn, entails a reduction in latency thanks to the lower traffic overhead. As a result, the runtime of applications decreases, and this reduction becomes more significant as the system size grows. This initial research was published by Cuesta et al. in the 2007 Euromicro PDP Conference.

The second aim was to tackle the scalability problems related to the storage requirements of the starvation prevention mechanism. This mechanism requires a table at each node whose size depends on the product of the number of simultaneous outstanding requests and the system size. By limiting the maximum number of simultaneous outstanding requests, the table size decreases, but it still depends on the system size. To decouple the table size from the system size, we proposed a method that limits the number of entries in each table at the expense of a slight performance degradation. Results showed that the reduction in the number of entries hardly affects overall performance; in fact, even the use of single-entry tables does not entail a significant performance loss. The results of this research line, which led to a scalable Priority Request mechanism, were published by Cuesta et al. in IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 10, 2011.
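
The decoupling idea can be sketched as follows (the fallback semantics and all names are assumptions for illustration, not the published design): each node keeps a fixed number of table entries, and when the table is full, a new request simply proceeds without a priority entry, accepting a slight performance loss instead of growing the table with the system.

```python
class PriorityTable:
    """Fixed-size per-node table, decoupled from the system size."""
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = set()

    def try_register(self, request_id):
        """Return True if the request obtained a priority entry."""
        if len(self.entries) < self.max_entries:
            self.entries.add(request_id)
            return True
        return False  # table full: fall back to the non-priority path

    def complete(self, request_id):
        self.entries.discard(request_id)  # free the entry on completion

table = PriorityTable(max_entries=1)   # single-entry table, the extreme case
assert table.try_register("req-A")     # first request gets the entry
assert not table.try_register("req-B") # second proceeds without priority
table.complete("req-A")
assert table.try_register("req-B")     # freed entry is reusable
```

The storage per node is now a constant chosen at design time rather than a function of the node count, which is the property that makes the mechanism scalable.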

Apart from addressing the scalability problem in terms of storage requirements, we also dealt with the scalability problems of broadcast messages (Priority Requests). To this end, we proposed a switch-based packing technique that allows several broadcast messages to be packed into a single one. By doing so, the bandwidth requirements of broadcast messages decrease significantly, becoming closer to those of point-to-point messages. The evaluation of this technique showed that, when applying it, the interconnection network is less congested, which improves message latency and, therefore, the runtime of applications. A secondary contribution of this technique was its application to relieve nodes that want to modify a widely shared block. Before modifying this class of blocks, writers must wait for many acknowledgments reporting the invalidation of the block in the rest of the nodes. By applying the packing technique to those messages, the number of acknowledgments that a writer must wait for before performing the write decreases significantly, which reduces the latency of solving write misses and, therefore, improves system performance. The results of this research line were published by Cuesta et al. in the 2008 PDCAT Conference and later, in an extended version, in the Elsevier Journal of Parallel and Distributed Computing, vol. 72, no. 3, 2012.
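
The acknowledgment-packing idea can be sketched like this (the message format and field names are assumptions for illustration): when several acknowledgments destined to the same writer for the same block meet in a switch's output queue, they can be merged into one message carrying a count, so the writer receives and waits for fewer messages.

```python
def pack_acks(queue):
    """Merge acknowledgments sharing the same destination and block."""
    packed = {}
    for ack in queue:
        key = (ack["dest"], ack["block"])
        if key in packed:
            packed[key]["count"] += ack["count"]  # fold into existing message
        else:
            packed[key] = dict(ack)               # first ack for this key
    return list(packed.values())

# Three invalidation acks headed to writer 7 coexist in an output queue.
queue = [
    {"dest": 7, "block": 0x40, "count": 1},
    {"dest": 7, "block": 0x40, "count": 1},
    {"dest": 7, "block": 0x40, "count": 1},
    {"dest": 3, "block": 0x80, "count": 1},
]
out = pack_acks(queue)
# The three acks to writer 7 collapse into a single ack with count == 3;
# the unrelated ack to node 3 is left untouched.
```

The same folding applied to broadcast requests is what brings their bandwidth demand closer to that of point-to-point traffic.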

The last goal of this research line was to prevent the owner node from becoming a bottleneck in the case of highly contended blocks. In this scenario, the owner node has to serve many nearly simultaneous requests, which can congest its output buffers. To solve this situation, we introduced a new class of multicast responses that can be used to serve several requests simultaneously. Moreover, thanks to the implementation we proposed, this class of message can be built dynamically. As shown by the results, the latency of multicast responses decreases. Besides, since the output buffers are less congested, message latency in general also decreases, which contributes to improving overall system performance. These results were published by Cuesta et al. in the 2008 Euromicro PDP Conference.
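
The gist of a multicast response can be sketched as follows (a simplified illustration under assumed request/response formats, not the proposed hardware implementation): instead of answering each queued request with its own message, the owner builds one response whose destination set covers every requester queued for the same block.

```python
def build_multicast_response(pending, block):
    """Serve all queued requests for `block` with a single multicast message."""
    requesters = [req["src"] for req in pending if req["block"] == block]
    if not requesters:
        return None
    # One message with a destination set replaces len(requesters) unicast
    # responses, relieving pressure on the owner's output buffers.
    return {"block": block, "dests": set(requesters)}

# Three nearly simultaneous requests for the same contended block.
pending = [
    {"src": 1, "block": 0x40},
    {"src": 2, "block": 0x40},
    {"src": 5, "block": 0x40},
]
resp = build_multicast_response(pending, 0x40)
```

Replicating the response inside the network rather than at the owner is what keeps the owner's output buffers from congesting under contention.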

The results of the previous research lines constituted the PhD thesis of Blas Cuesta, defended in June 2009.

In addition to the proposals made to improve token-based coherence protocols, we have also addressed the lack of scalability of the current AMD Magny-Cours processors (2 dies), whose coherence domain is limited to only eight dies due to the current design of their Probe Filters (a kind of directory cache aimed at filtering coherence traffic). To overcome this limitation and widen their scalability, we have designed external logic support that provides a way to efficiently extend the coherence domain of the new generation of AMD Opteron processors beyond the 8-die limit, maintains the advantages provided by the Probe Filters, and filters additional coherence traffic to enhance Probe Filter effectiveness and scalability. In particular, we have extended the addressing capabilities of the Probe Filter by means of external logic (a bridge chip implemented on an FPGA or ASIC) associated with each of the boards that constitute the system, each of which contains up to two processors (4 dies). Thanks to our proposal, an arbitrary number of boards can be connected through their bridge chips using any commercial interconnect (e.g., InfiniBand or Gigabit Ethernet), guaranteeing coherence maintenance across the entire system. The proposed chip not only maintains the Probe Filter's capability to filter coherence traffic over the entire system, but also filters additional traffic, providing the scalability required to build large-scale servers. This proposal was published by Ros et al. in the 2010 HiPC Int'l Conference. Subsequently, we have improved the initial proposal by reducing the storage resources required by the bridge chip by a factor of 8 (which reduces, in turn, the silicon area required by the chip), while still maintaining system performance. In addition, we have proposed and evaluated different versions of the bridge chip, trying to strike a trade-off between traffic filtering capabilities and required silicon area.
The results of this research were published by Ros et al. in IEEE Transactions on Computers, vol. 61, no. 5, 2012.

Finally, in order to improve the scalability of directory-based coherence protocols, we have proposed techniques aimed at making efficient use of the storage resources provided by directory caches. We observed that a significant fraction of memory references do not require coherence maintenance, as is the case of private data (74 percent on average). Taking this observation into account, we propose to deactivate coherence just for private blocks, thus avoiding the tracking of such blocks by the directory cache. In this sense, we have developed an OS-aided mechanism to classify memory blocks, at page granularity, as private or shared. Only the blocks belonging to shared pages are tracked by the directory cache. Avoiding the tracking of private blocks decreases the contention on directory caches. As a result, the number of blocks invalidated due to the lack of entries in directory caches can be drastically reduced (by about 57% on average). This advantage can be used not only to increase system performance (a 15% reduction in application runtime), but also to keep obtaining good performance with lower storage requirements (systems with directory caches 8 times smaller). A first version of the technique was published by Cuesta et al. in the 2011 ISCA Int'l Symposium. Subsequently, we improved the page classification and detection mechanism in order to detect those blocks that, despite being shared among two or more nodes, do not require coherence maintenance, as is the case of shared read-only blocks. We have observed that such blocks also represent a significant percentage of the referenced blocks (nearly 50% in some applications). Our enhanced mechanism allows us to reduce even further the number of blocks tracked by the directory caches (about 9 percent more non-coherent blocks are detected). Hence, the new mechanism reduces the directory cache size by up to 16 times while keeping application runtime below that of a system that does not apply any coherence deactivation mechanism. The results of this research were published by Cuesta et al. in IEEE Transactions on Computers, vol. 63, no. 3, 2013.
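
The page-granularity classification can be sketched as follows (a simplified model with hypothetical names, not the published OS mechanism): a page starts as private to the first node that touches it, and an access by any other node promotes it to shared; only blocks of shared pages would then consume directory cache entries.

```python
PRIVATE, SHARED = "private", "shared"

class PageTable:
    """Toy OS-side page table tracking a private/shared bit per page."""
    def __init__(self):
        self.pages = {}  # page -> (state, first accessor or None)

    def access(self, page, node):
        if page not in self.pages:
            self.pages[page] = (PRIVATE, node)   # first touch: private
        else:
            state, owner = self.pages[page]
            if state == PRIVATE and node != owner:
                # A second node touches the page: the OS would flush the
                # owner's cached copies and mark the page shared from now on.
                self.pages[page] = (SHARED, None)

    def needs_coherence(self, page):
        """Only shared pages need directory tracking."""
        return self.pages.get(page, (PRIVATE, None))[0] == SHARED

pt = PageTable()
pt.access(0x1000, node=0)
pt.access(0x1000, node=0)        # same node again: page stays private
pt.access(0x2000, node=0)
pt.access(0x2000, node=1)        # a different node: page becomes shared
assert not pt.needs_coherence(0x1000)   # private page: no directory entry
assert pt.needs_coherence(0x2000)       # shared page: tracked as usual
```

Since private pages dominate in many workloads, filtering them out at this granularity is what frees the bulk of the directory cache for the blocks that actually need coherence.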


Publications: [Cuesta07a], [Cuesta07b], [Cuesta08a], [Cuesta08b], [Cuesta08c], [Cuesta09b]

Projects funded by Public Calls: TIN2006-15516-C04-01, TIN2009-14475-C04-01

External collaborations Academia: --

External collaborations Industry: --

Company Agreements: --

PhD dissertations: Blas Antonio Cuesta Sáez

Patents: --