Accelerating Parallel Tasks by Optimizing GPU Hardware Utilization

2020-04-29T14:47:32Z (GMT) by Tsung-Tai Yeh

Efficient GPU applications rely on programmers carefully structure their codes to fully utilize the GPU resources. In general, programmers spend a significant amount of time optimizing their applications to run efficiently on domain-specific architectures. To reduce the burden on programmers to utilize GPUs fully, I create several hardware and software solutions that improve the resource utilization on parallel processors without significant programmer intervention.


Recently, GPUs are increasingly being deployed in data centers to accelerate latency-driven applications, which exhibit a modest amount of data parallelism. The synchronous kernel execution on these applications cannot fully utilize the entire GPU. Thus, a GPU contains multiple hardware queues to improve its throughput by executing multiple kernels on a single device simultaneously when there are sufficient hardware resources. However, a GPU faces severe underutilization when the space in these queues has been exhausted, and the performance benefit vanishes with the decreased parallelism. As a result, I proposed a GPU runtime system – Pagoda, which virtualizes the GPU hardware resources by using an OS-like daemon kernel called MasterKernel. Tasks (kernels) are spawned from the CPU onto Pagoda as they be- come available, and are scheduled by the MasterKernel at the warp granularity to increase the GPU throughput for latency-driven applications. This work invents several programming APIs to handle task spawning and synchronization and includes parallel tasks and warp scheduling policies to reduce runtime overhead.


Latency-driven applications have both high throughput demands and response time constraints. These applications may launch many kernels that do not fully utilize the GPU unless grouped with large batch sizes. However, batching forces jobs to wait, which increases their latency. This wait time can be unacceptable when considering real-world arrival times of jobs. However, the round-robin GPU kernel scheduler is oblivious to application deadlines. This deadline-blind scheduling policy makes it harder to ensure that kernels meet their QoS deadlines. To enhance the responsiveness of the GPU, I also proposed LAX, including an execution time estimate for jobs with one or many kernels. Moreover, LAX adjusts priorities of kernels dynamically based on their slack time to increase the number of jobs that complete by their real-time deadlines. LAX improves the responsiveness and throughput of GPUs.


It is well-known that grouping threads into warps can create redundancy across scalar values in GPU vector registers. However, I also found that the layout of thread indices in multi-dimensional threadblocks (TBs) creates redundancy in the registers storing thread IDs. This redundancy propagates into dependent instructions that can be traced and identified statically. To remove GPU redundant instructions, I proposed DARSIE that uses a per-kernel compiler finalization check that uses TB dimensions to determine which instructions are redundant. Once identified, DARSIE hardware skips TB-redundant instructions before they are fetched.

DARSIE uses a new multithreaded register renaming and instruction synchronization technique to share the values from redundant instructions among warps in each TB. Altogether, DARSIE decreases the number of executed instructions to improve GPU performance and energy.

Keyword(s)

License

CC BY 4.0