In this second AccML workshop, we aim to bring together researchers working in Machine Learning and System Architecture to discuss requirements, opportunities, challenges and next steps in developing novel approaches for machine learning systems.
In the last five years, the remarkable performance achieved in a variety of application areas (natural language processing, computer vision, games, etc.) has led to the emergence of heterogeneous architectures to accelerate machine learning workloads. In parallel, production deployment, model complexity and diversity have pushed for higher-productivity systems, more powerful programming abstractions, software and system architectures, dedicated runtime systems and numerical libraries, and deployment and analysis tools. Deep learning models are generally memory- and compute-intensive, for both training and inference. Accelerating these operations has obvious advantages, first by reducing energy consumption (e.g. in data centers), and second by making these models usable on smaller devices at the edge of the Internet. In addition, while convolutional neural networks have motivated much of this effort, numerous applications and models involve a wider variety of operations, network architectures, and data processing. These applications and models continually challenge computer architecture, the system stack, and programming abstractions. The high level of interest in these areas calls for a dedicated forum to discuss emerging acceleration techniques and computation paradigms for machine learning algorithms, as well as applications of machine learning to the construction of such systems.
The workshop brings together researchers and practitioners working on computing systems for machine learning, and on using machine learning to build better computing systems. It also reaches out to a wider community interested in this rapidly growing area, to raise awareness of existing efforts and to foster collaboration and the free exchange of ideas.
This workshop builds on the success of the first AccML workshop, held at HiPEAC 2020.
Topics of interest include (but are not limited to):
Novel ML systems: heterogeneous multi/many-core systems, GPUs and FPGAs;
Novel ML hardware accelerators and associated software;
Emerging semiconductor technologies with applications to ML hardware acceleration;
ML for the construction and tuning of systems;
Cloud and edge ML computing: hardware and software to accelerate training and inference;
Computing systems research addressing the privacy and security of ML-dominated systems.
Submission deadline: May 8, 2020 (extended from May 1)
Notification to authors: May 20, 2020 (extended from May 15)
Papers should be in double-column IEEE format, between 4 and 8 pages including references. Papers should be uploaded as PDF and should not be anonymized.
Submissions can be made at easychair.org/conferences/?conf=2ndaccml.
Papers will be reviewed by the workshop's technical program committee according to criteria regarding a submission's quality, relevance to the workshop's topics, and, foremost, its potential to spark discussions about directions, insights, and solutions on the topics mentioned above. Research papers, case studies, and position papers are all welcome.
The workshop does not have formal proceedings, so accepted papers do not preclude publishing at future conferences and/or journals.
Antonio González, Universitat Politècnica de Catalunya
Title: Removing Ineffectual Computations in Neural Networks
Abstract
There is a growing interest in extending computing devices with the ability to analyze and understand signals and data coming from a large variety of activities in our daily lives, and to provide real-time responses in complex situations, with the goal of emulating human perception and problem solving. Examples include personal assistants, self-driving cars, domestic robots and health-care devices, to name a few. Neural networks have proven to be an effective approach to support many of these functionalities. Most of these systems have very limited energy budgets, so the effectiveness of this approach is strongly dependent on the energy efficiency of the adopted solution. In this talk we present several alternative directions for improving the energy efficiency of neural networks based on identifying and removing ineffectual computations.
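To make the idea concrete, here is a minimal, hypothetical sketch (our illustration, not material from the talk) of one well-known class of ineffectual computation: multiply-accumulates whose activation operand is zero. After a ReLU layer a large fraction of activations are exactly zero, so skipping those products changes nothing in the result while saving time and energy.

```python
import numpy as np

def dense_dot(weights, activations):
    """Baseline: issues a multiply-accumulate for every operand pair."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a
    return acc

def zero_skipping_dot(weights, activations):
    """Skips multiply-accumulates whose activation is zero; these are
    ineffectual, since they cannot change the accumulated result."""
    acc = 0.0
    for w, a in zip(weights, activations):
        if a != 0.0:  # ineffectual work is simply never issued
            acc += w * a
    return acc

# Post-ReLU activations are sparse, so many products can be skipped.
rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal(16), 0.0)  # ReLU zeroes roughly half
wts = rng.standard_normal(16)
assert np.isclose(dense_dot(wts, acts), zero_skipping_dot(wts, acts))
```

Hardware proposals in this space detect such operands on the fly and skip the corresponding compute cycles, which is where the energy savings come from.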
Bio
Antonio González (Ph.D. 1989) is a Full Professor at the Computer
Architecture Department of the Universitat Politècnica de
Catalunya, Barcelona (Spain), and the director of the Architecture
and Compiler research group. He was the founding director of the
Intel Barcelona Research Center from 2002 to 2014. His research
has focused on computer architecture. In this area, Antonio holds
52 patents, has published over 370 research papers and has given
over 120 invited talks. He has also made multiple contributions to
the design of the architecture of several commercial
microprocessors.
Antonio has been program chair for ICS, ISPASS, MICRO, HPCA and
ISCA, and general chair for MICRO and HPCA among other symposia.
He has served on the program committee for over 130 international
symposia in the field of computer architecture, and has been
Associate Editor of the IEEE Transactions on Computers, IEEE
Transactions on Parallel and Distributed Systems, IEEE Computer
Architecture Letters, ACM Transactions on Architecture and Code
Optimization, ACM Transactions on Parallel Computing, and Journal
of Embedded Computing.
Antonio’s awards include the award to the best student in computer
engineering in Spain, the Rosina Ribalta award as the advisor of
the best PhD project in Information Technology and Communications,
the Duran Farrell award for research in technology, the Aritmel
National Award of Informatics to the Computer Engineer of the
Year, the King James I award for his contributions in research on
new technologies, and the ICREA Academia Award. He is an IEEE
Fellow.
David Kaeli, Northeastern University
Title: Scaling Machine Learning Workloads on Today’s GPUs
Abstract
Machine learning applications place large computational demands on hardware resources when performing classification, regression, clustering and training. What is common to many of these applications is that the quality of the outcome or model improves as we process more data. GPUs have been shown to be an effective platform for accelerating machine learning workloads, though a single GPU is limited in the amount of data it can process. This talk will look at ongoing work in hardware compaction and multi-GPU acceleration, enabling further scaling of machine learning workloads.
Bio
David Kaeli received his BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is presently a COE Distinguished Professor on the ECE faculty at Northeastern University, Boston, MA. Dr. Kaeli has published over 350 critically reviewed publications, 7 books, and 13 patents. He serves as the Editor-in-Chief of ACM Transactions on Architecture and Code Optimization, and an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems and the Journal of Parallel and Distributed Computing. Dr. Kaeli is an IEEE Fellow and an ACM Distinguished Scientist.
Tushar Krishna, Georgia Tech
Title: A Communication-Centric Approach for Designing Flexible DNN Accelerators
Abstract
Deep Neural Networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly.
DNNs are evolving at a rapid rate, leading to myriad layer types (convolution, attention, LSTM, MLP) of varying shapes (regular and irregular). Given a DNN, there can be myriad computationally efficient implementations (e.g., via pruning), leading to structured and unstructured sparsity. Finally, a given DNN can be tiled and partitioned in myriad ways to exploit data reuse. All of the above can lead to irregular dataflow patterns within the accelerator substrate. Achieving high mapping efficiency for all these cases is highly challenging in today's accelerators, which are often tightly coupled 2D grids with rigid near-neighbor connectivity.
First, given a target DNN, we will demonstrate a systematic methodology for understanding data reuse opportunities within the algorithm and determining the cost versus benefit of exploiting them efficiently in hardware, using our dataflow and microarchitectural model MAESTRO (MICRO 2019 + IEEE Micro Top Picks). Next, we present a systematic communication-centric methodology for accelerator design that can provide ~100% efficiency for arbitrary DNN shapes, sparsity ratios and mappings. We demonstrate instances of this approach with two accelerators, MAERI (ASPLOS 2018 + IEEE Micro Top Picks Honorable Mention) and SIGMA (HPCA 2020 + Best Paper Award), which show orders-of-magnitude better utilization than state-of-the-art baselines such as NVIDIA's NVDLA and Google's TPU.
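As a toy illustration of the tiling point above (our sketch under simplifying assumptions, not the speakers' code or the MAESTRO cost model): the same matrix multiplication can be tiled in many ways, and the tile size and loop order determine how much of each operand stays resident near the processing elements and how often it must be re-fetched.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Computes C = A @ B with one of many possible tilings. Each
    (i0, j0, k0) block is a unit of work that could be mapped onto a
    spatial accelerator's PE array; reordering the three loops yields a
    different dataflow (e.g. output- vs. weight-stationary) with the
    same result but different data-reuse behaviour."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(tiled_matmul(A, B, tile=4), A @ B)
```

Enumerating and costing this space of mappings is, per the abstract, the kind of analysis a dataflow model such as MAESTRO is built to perform.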
Bio
Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He also holds the ON Semiconductor Junior Professorship. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a post-doctoral researcher at Intel in Massachusetts.

Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC) and deep learning accelerators, with a focus on optimizing data movement in modern computing systems. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and three have won best paper awards. He received the National Science Foundation (NSF) CRII award in 2018, and both a Google Faculty Award and a Facebook Faculty Award in 2019.
Cliff Young, Google
Title: Reflections on TPUs, Current Problems in Acceleration, and What's Next
Abstract
Google's first TPU has been a remarkably successful accelerator,
spawning a sequence of successors and inspiring a wave of new
chips from established companies and startups. I'll start with
some retrospection about what we got right and the ways in which
we were lucky in building that first TPU. Then I'll pivot to the
problems I think are currently hard and possibly underserved by
our NN accelerator systems (to spoil: programmability, memory, and
multi-tenancy). Lastly, I'll speculate about where ML might take us: how much the algorithms and computations might change, the
implications of the Accelerator Wall, and the virtuous feedback
between algorithms and architecture that might be the basis of a
true Golden Age for our field.
Bio
Cliff Young is a software engineer in the Google Brain team, where
he works on codesign for deep learning accelerators. He is one of
the designers of Google’s Tensor Processing Unit (TPU), which is
used in production applications including Search, Maps, Photos,
and Translate. TPUs also powered AlphaGo’s historic 4-1 victory
over Go champion Lee Sedol. Previously, Cliff built
special-purpose supercomputers for molecular dynamics at D. E.
Shaw Research and worked at Bell Labs. Cliff holds AB, MS, and PhD
degrees in computer science from Harvard University.
| Time (EDT/New York) | Virtual Event, 31st May 2020 |
|---|---|
| 9:00 AM – 9:10 AM | Welcome |
| 9:10 AM – 10:10 AM | Invited talk: Removing Ineffectual Computations in Neural Networks (9:10 AM – 9:50 AM), Antonio González, Universitat Politècnica de Catalunya. Paper talk: You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference to ANN-Level Accuracy (9:50 AM – 10:10 AM), Srivatsa P, Kyle Timothy Ng Chu, Yaswanth Tavva, Jibin Wu, Malu Zhang, Haizhou Li and Trevor E. Carlson |
| 10:10 AM – 11:10 AM | Invited talk: Scaling Machine Learning Workloads on Today’s GPUs (10:10 AM – 10:50 AM), David Kaeli, Northeastern University. Paper talk: HCM: Hardware-Aware Complexity Metric for Neural Network Architectures (10:50 AM – 11:10 AM), Alex Karbachevsky, Chaim Baskin, Evgenii Zheltonozhskii, Yevgeny Yermolin, Freddy Gabbay, Alexander Bronstein and Avi Mendelson |
| 11:10 AM – 11:40 AM | Break |
| 11:40 AM – 12:40 PM | Invited talk: A Communication-Centric Approach for Designing Flexible DNN Accelerators (11:40 AM – 12:20 PM), Tushar Krishna, Georgia Tech. Paper talk: STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators (12:20 PM – 12:40 PM), Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio and Tushar Krishna |
| 12:40 PM – 2:00 PM | Invited talk: Reflections on TPUs, Current Problems in Acceleration, and What's Next (12:40 PM – 1:20 PM), Cliff Young, Google. Paper talk: Statistical Robustness of MCMC Accelerators (1:20 PM – 1:40 PM), Xiangyu Zhang, Ramin Bashizade, Yicheng Wang, Cheng Lyu, Sayan Mukherjee and Alvin R. Lebeck. Paper talk: Acceleration Techniques for Sampling-based Machine Learning (1:40 PM – 2:00 PM), Yanqi Liu, Ruth Iris Bahar and Giuseppe Calderoni |
| 2:00 PM – 2:05 PM | Closing remarks |