    Academic Talk: Practical Reliability Analysis of GPGPUs in the Wild:
    Published: 2019-05-31 14:55:32




    General-Purpose Graphics Processing Units (GPGPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance under a stringent power budget, they are also susceptible to soft errors (faults), often caused by high-energy particle strikes, that can significantly affect application output quality. Because those applications are normally long-running, investigating the characteristics of GPU errors is imperative for understanding the reliability of such systems. In this talk, I will present a study of the system conditions that trigger GPU soft errors, using six months of trace data collected from a large-scale, operational HPC system at Oak Ridge National Lab. Workload characteristics, certain GPU cards, temperature, and power consumption could be indicative of GPU faults, but it is non-trivial to exploit them for error prediction. Motivated by these observations and challenges, I will show how machine-learning-based error prediction models can capture the hidden interactions among system and workload properties. These findings raise the question: how can one better understand the resilience of applications if faults are bound to happen? To this end, I will illustrate the challenges of comprehensive fault injection in GPGPU applications and outline a novel fault injection solution that captures the error resilience profile of GPGPU applications.


    Evgenia Smirni received the Diploma degree in Computer Science and Informatics from the University of Patras, Greece, in 1987, and the Ph.D. degree in Computer Science from Vanderbilt University in 1995. She is the Sidney P. Chockley Professor of Computer Science at the College of William and Mary, Williamsburg, VA, USA. Her research interests include queuing networks, stochastic modeling, Markov chains, resource allocation policies, storage systems, data centers and cloud computing, workload characterization, models for performance prediction, and reliability of distributed systems and applications. She has served as the Program co-Chair of QEST’05, ACM SIGMETRICS/Performance’06, HotMetrics’10, ICPE’17, DSN’17, SRDS’19, and HPDC’19. She also served as the General co-Chair of QEST’10 and NSMC’10. She is an ACM Distinguished Scientist.


    Time: June 11, 10:00-10:40




