Ph.D. Defense by Nagesh Bangalore Lakshmin

*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************

Event Details
  • Date/Time:
    • Thursday October 30, 2014 - Friday October 31, 2014
      11:00 am - 12:59 pm
  • Location: KACB 2100
  • Phone:
  • URL:
  • Email:
  • Fee(s):
    N/A
  • Extras:
Contact
No contact information submitted.
Summaries

Summary Sentence: Efficient Graph Algorithm Execution on Data-Parallel Architectures

Full Summary: No summary paragraph submitted.

Title: Efficient Graph Algorithm Execution on Data-Parallel Architectures

Nagesh Bangalore Lakshminarayana
Ph.D. Candidate
School of Computer Science
College of Computing

Date: October 30, 2014 (Thursday)
Time: 11 AM ET
Location: KACB 2100

Committee:
Dr. Hyesoon Kim, School of Computer Science (Advisor)
Dr. Santosh Pande, School of Computer Science
Dr. Milos Prvulovic, School of Computer Science
Dr. Moinuddin K. Qureshi, School of Electrical and Computer Engineering
Dr. Richard Vuduc, School of Computational Science and Engineering


Abstract:

Graph algorithms are an important component of applications from several domains such as social computing, bio-informatics, communication networks and so on. To improve the throughput of processing large graphs parallel versions of several graph algorithms have been designed. Data-parallel architectures such as GPUs with their high execution throughput and high memory bandwidth capabilities are a good candidate platform for executing such parallel graph algorithms.

However, GPU architectures are not a natural fit for graph algorithms. Due to their irregular control flow, non-uniform work distribution and data-dependent memory accesses, graph algorithms have a low execution efficiency on GPUs. The large number of data-dependent memory accesses often result in a significant number of idle cycles. Considerable work has been done on reducing the negative impact of control flow divergence in GPGPU applications and the same approaches can be applied to the execution of graph algorithms as well. The other problems of non-uniform work distribution and inability to tolerate memory access latency completely are much less explored and thus, provide opportunities for innovation.

In this dissertation I will present a graph algorithm specific prefetching mechanism and the evaluation of different design options for GPU cache hierarchies to improve the tolerance of graph algorithms to memory access latency. Lastly, I will examine the impact of graph algorithm implementation decisions.

The prefetching mechanism stores prefetched data in spare registers in addition to storing them in the cache as well. Due to resource allocation constraints, GPUs often have several unallocated registers during kernel execution. By using such unallocated registers the mechanism eliminates the loss of prefetched data due to evictions from the cache and makes prefetching more effective.

Unlike the traditional streaming GPGPU applications, graph algorithms exhibit some data locality and the small GPGPU caches can play an effective role in improving memory access latency tolerance. First, we evaluate how the cache inclusion property affects the performance of graph algorithms. Besides a non-inclusive cache hierarchy, we consider an exclusive cache hierarchy and other cache designs that selectively do exclusion including a new memory region based cache exclusion scheme. Second, we examine the impact of different cache bypass mechanisms on the performance of graph algorithms. Third, we examine the impact of fine-grained cache hierarchies that dynamically decide whether to fetch a full cache block or one or more cache sub-blocks. The evaluated schemes include a new PC-based mechanism for predicting the sub-blocks that are likely to be accessed in the future.

Finally, we examine the impact of different implementation strategies on graph algorithm performance, power and energy consumption and also the impact of software optimizations such as prefetching and loop unrolling.

Additional Information

In Campus Calendar
No
Groups

Graduate Studies

Invited Audience
Public
Categories
Other/Miscellaneous
Keywords
coc, graduate students, Phd Defense, thesis
Status
  • Created By: Danielle Ramirez
  • Workflow Status: Published
  • Created On: Oct 17, 2014 - 10:29am
  • Last Updated: Oct 7, 2016 - 10:09pm