State Key Laboratory for Novel Software Technology
Abstract:
General Purpose Graphics Processing Units
(GPGPUs) have rapidly evolved to enable energy-efficient data-parallel
computing for a broad range of scientific areas. While GPUs achieve exascale
performance at a stringent power budget, they are also susceptible to soft
errors (faults), often caused by high-energy particle strikes, that can
significantly affect application output quality. As those applications are
normally long-running, investigating the characteristics of GPU errors becomes
imperative to better understand the reliability of such systems. In this talk,
I will present a study of the system conditions that trigger GPU soft errors
using six months of trace data collected from a large-scale, operational HPC
system at Oak Ridge National Laboratory. Workload characteristics, certain GPU
cards, temperature and power consumption could be indicative of GPU faults, but
it is non-trivial to exploit them for error prediction. Motivated by these
observations and challenges, I will show how machine-learning-based error
prediction models can capture the hidden interactions among system and workload
properties. These findings raise the question: how can one better understand
the resilience of applications if faults are bound to happen? To this end, I
will illustrate the challenges of comprehensive fault injection in GPGPU
applications and outline a novel fault injection solution that captures the
error resilience profile of GPGPU applications.
Speaker Biography:
Evgenia Smirni received the Diploma degree
in Computer Science and Informatics from the University of Patras, Greece, in
1987 and the Ph.D. degree in Computer Science from Vanderbilt University in
1995. She is the Sidney P. Chockley Professor of Computer Science at the
College of William and Mary, Williamsburg, VA, USA. Her research interests
include queuing networks, stochastic modeling, Markov chains, resource
allocation policies, storage systems, data centers and cloud computing,
workload characterization, models for performance prediction, and reliability
of distributed systems and applications. She has served as the Program co-Chair
of QEST’05, ACM Sigmetrics/Performance’06, HotMetrics’10, ICPE’17, DSN’17,
SRDS’19, and HPDC’19. She also served as the General co-Chair of QEST’10 and
NSMC’10. She is an ACM Distinguished Scientist.
Venue: Room 229, Computer Science and Technology Building
Time: June 11, 10:00-10:40