Activity 2: Automatically extracting task-level parallelism

Leader: Julio Sahuquillo; Researchers: David Yuste, Rafael Ubal, Salvador Petit, Pedro López

1. Brief Description of the Goals

Because of technology advances, processor architectures have moved to multicore systems instead of increasing the complexity of single-core processors. The main reason is that the cores in these systems are simpler and more power-efficient, although they also tend to be slower. An important feature of multicore systems is that they can execute several threads in parallel, which makes them a suitable platform for automatic parallelization techniques. At the same time, single-threaded applications are not expected to significantly improve their performance on current and upcoming multicore processors, which heightens the interest in compiler-based parallelization techniques. An alternative approach is to design specific processor microarchitectures that detect independent threads within a single program at run time, thus enabling the exploitation of thread-level parallelism (TLP) in hardware.

2. Scientific and Technical Developed Activities

The research in this activity has followed two main directions, each with its own methodology and tools.

On the one hand, we devised a compiler-based technique, referred to as Function Level Parallelism (FLP), that looks for parallelism at the function level. This technique detects larger threads than other existing techniques, such as loop-level ones; thread size is a major concern when exploiting TLP. In addition, it reduces the associated runtime overhead. To evaluate this technique, we developed a simulator that analyzes the functions of a benchmark using a three-step algorithm. The analysis attempts to reduce the amount of work to be done and to reuse the already analyzed “chunks” of the program in later stages of the algorithm. To this end, we first analyze each function individually, without considering the other functions of the program. Then, we perform an interprocedural analysis across the call graph, resulting in a per-function characterization. Finally, this information is used to characterize each function call, refining the data collected in the previous stage for each call instance. The main results of this work were published by Yuste et al. at the INTERACT12 conference. After that, we concentrated on modifying the gcc source code to obtain more accurate results.
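
The three-step analysis can be illustrated with a small Python sketch. It is not the authors' simulator or their gcc implementation: it assumes that each function's effects can be summarized as sets of hypothetical abstract memory locations (strings such as "coef"), that formal parameters are the only names renamed at a call site, and that the propagation along call-graph edges ignores argument mapping, which a real analysis would need. Bernstein's conditions are used as the independence test. The names Summary, intraprocedural, interprocedural, per_call and independent are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Abstract memory locations that a function (or one call) may read/write."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def intraprocedural(bodies):
    # Step 1: summarize every function on its own, ignoring its callees.
    return {name: Summary(set(b["reads"]), set(b["writes"]))
            for name, b in bodies.items()}


def interprocedural(summaries, call_graph):
    # Step 2: fold callee effects into their callers until a fixed point is
    # reached, so recursive call graphs still terminate. Assumption: callees
    # only add global locations; argument mapping on call edges is omitted.
    changed = True
    while changed:
        changed = False
        for caller, callees in call_graph.items():
            s = summaries[caller]
            for callee in callees:
                c = summaries[callee]
                before = len(s.reads) + len(s.writes)
                s.reads |= c.reads
                s.writes |= c.writes
                if len(s.reads) + len(s.writes) != before:
                    changed = True
    return summaries


def per_call(summary, bindings):
    # Step 3: specialize a per-function summary to one call instance by
    # renaming formal parameters to the actual arguments of that call.
    def rename(locs):
        return {bindings.get(loc, loc) for loc in locs}
    return Summary(rename(summary.reads), rename(summary.writes))


def independent(a, b):
    # Bernstein's conditions: no RAW, WAR or WAW conflict between two calls.
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


# Hypothetical program: scale(p0) scales the array p0 using a global
# coefficient read by helper(); update() rewrites that coefficient.
bodies = {
    "scale": {"reads": ["p0"], "writes": ["p0"]},
    "helper": {"reads": ["coef"], "writes": []},
    "update": {"reads": [], "writes": ["coef"]},
}
call_graph = {"scale": ["helper"], "helper": [], "update": []}

summaries = interprocedural(intraprocedural(bodies), call_graph)
call_a = per_call(summaries["scale"], {"p0": "a"})  # scale(a)
call_b = per_call(summaries["scale"], {"p0": "b"})  # scale(b)
call_u = per_call(summaries["update"], {})          # update()

print(independent(call_a, call_b))  # True: disjoint arrays, shared read only
print(independent(call_a, call_u))  # False: update() writes coef
```

In this toy program the two calls to scale touch disjoint arrays and only share a read of coef, so they could be spawned as parallel threads, whereas scale and update conflict on coef. Reusing the per-function summaries from step 2 for every call instance is what keeps step 3 cheap.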

On the other hand, we devised microarchitectural mechanisms to exploit thread-level parallelism. We substantially extended the Multi2Sim simulator to implement the proposed microarchitecture and evaluate this technique. Among other features, we implemented a clustered microarchitecture, a trace cache, and specific circuitry for detecting and extracting independent subtraces, or tasks. The clustered microarchitecture was a key element, since each independent task can be dispatched to a different cluster. The main results of this work have been published by Ubal et al. in IEEE Transactions on Computers, vol. 62, no. 5, 2013, and by Ubal et al. at the PACT 2010 conference.
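
The notion of independent subtraces can be stated in software terms with a sketch like the following. It does not describe the actual detection circuitry implemented in Multi2Sim. It assumes that register renaming removes WAR and WAW hazards, so only read-after-write register dependences inside a trace bind instructions together, and it conservatively keeps all memory instructions in one subtrace. Instructions are modeled as hypothetical (destination, sources, touches_memory) tuples, and the round-robin cluster assignment is an illustrative policy, not the one used in the published work.

```python
class UnionFind:
    """Disjoint sets over instruction indices."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)  # keep the oldest as root


def split_subtraces(trace):
    """Group trace instructions into mutually independent subtraces.

    trace: list of (dst, srcs, touches_memory); dst is None if no register
    is written. Returns lists of instruction indices, ordered by their
    first instruction.
    """
    uf = UnionFind(len(trace))
    last_writer = {}  # register -> index of its latest producer in the trace
    first_mem = None  # all memory instructions are conservatively kept together
    for i, (dst, srcs, mem) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:  # RAW dependence on an earlier instruction
                uf.union(i, last_writer[reg])
        if mem:
            if first_mem is None:
                first_mem = i
            else:
                uf.union(i, first_mem)
        if dst is not None:
            last_writer[dst] = i
    groups = {}
    for i in range(len(trace)):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values(), key=lambda g: g[0])


def assign_clusters(subtraces, n_clusters):
    # Hypothetical policy: hand subtraces to clusters round-robin.
    return [(k % n_clusters, st) for k, st in enumerate(subtraces)]


# Hypothetical trace; registers not written inside it are live-in values.
trace = [
    ("r1", ("r2",), False),       # 0: r1 = r2 + 1
    ("r3", ("r1",), False),       # 1: r3 = r1 * 2
    ("r4", ("r5", "r6"), False),  # 2: r4 = r5 + r6
    ("r7", ("r4",), False),       # 3: r7 = r4 - 1
    (None, ("r3",), True),        # 4: store r3 to memory
    ("r8", ("r9",), False),       # 5: r8 = r9 + 1
    ("r10", (), True),            # 6: load r10 from memory
]
subtraces = split_subtraces(trace)
print(subtraces)                      # [[0, 1, 4, 6], [2, 3], [5]]
print(assign_clusters(subtraces, 2))  # [(0, [0, 1, 4, 6]), (1, [2, 3]), (0, [5])]
```

Here instructions 0, 1 and 4 form a dependence chain and absorb the later load through the conservative memory rule, while instructions 2 and 3, and instruction 5, share no produced register with that chain and could execute on other clusters.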

Publications: [Yuste07], [Yuste08].

Projects funded by Public Calls: TIN2006-15516-C04-01, TIN2009-14475-C04-01, TIN2008-05338-E/TIN, PAID-06-07-3281, GV/2009/043

External collaborations Academia: --

External collaborations Industry: IBM Haifa (Israel)

Company Agreements: --

PhD dissertations: --

Patents: --