中国科技论文在线

上传时间

2020年11月04日

【期刊论文】A Small-Footprint Accelerator for Large-Scale Neural Networks

ACM Transactions on Computer Systems，2015，33（2）：6

2015年05月01日

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have been focusing on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02mm<sup>2</sup> and 485mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87 × faster, and it can reduce the total energy by 21.08 ×. The accelerator characteristics are obtained after layout at 65nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

无

0

35浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】Robust Design Space Modeling

ACM Transactions on Design Automation of Electronic Systems，2015，20（2）：18

2015年03月01日

摘要

Architectural design spaces of microprocessors are often exponentially large with respect to the pending processor parameters. To avoid simulating all configurations in the design space, machine learning and statistical techniques have been utilized to build regression models for characterizing the relationship between architectural configurations and responses (e.g., performance or power consumption). However, this article shows that the accuracy variability of many learning techniques over different design spaces and benchmarks can be significant enough to mislead the decision-making. This clearly indicates a high risk of applying techniques that work well on previous modeling tasks (each involving a design space, benchmark, and design objective) to a new task, due to which the powerful tools might be impractical. Inspired by ensemble learning in the machine learning domain, we propose a robust framework called ELSE to reduce the accuracy variability of design space modeling. Rather than employing a single learning technique as in previous investigations, ELSE employs distinct learning techniques to build multiple base regression models for each modeling task. This is not a trivial combination of different techniques (e.g., always trusting the regression model with the smallest error). Instead, ELSE carefully maintains the diversity of base regression models and constructs a metamodel from the base models that can provide accurate predictions even when the base models are far from accurate. Consequently, we are able to reduce the number of cases in which the final prediction errors are unacceptably large. Experimental results validate the robustness of ELSE: compared with the widely used artificial neural network over 52 distinct modeling tasks, ELSE reduces the accuracy variability by about 62%. Moreover, ELSE reduces the average prediction error by 27% and 85% for the investigated MIPS and POWER design spaces, respectively.

无

0

28浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】FreeRider: Non-Local Adaptive Network-on-Chip Routing with Packet-Carried Propagation of Congestion Information

IEEE Transactions on Parallel and Distributed Systems，2014，26（8）：2272 - 228

2014年08月05日

摘要

Non-local adaptive routing techniques, which utilize statuses of both local and distant links to make routing decisions, have recently been shown to be effective solutions for promoting the performance of Network-on-Chip (NoC). The essence of non-local adaptive routing was an additional network dedicated to propagate congestion information of distant links on the NoC. While the dedicated Congestion Propagation Network (CPN) helps routers to make promising routing decisions, it incurs additional wiring and power costs and becomes an unnecessary decoration when the load of NoC is light. Moreover, the CPN has to be extended if one would utilize more sophisticated congestion information to enhance the performance of NoC, bringing in even larger wiring and power costs. This paper proposes an innovative non-local adaptive routing technique called FreeRider, which does not use a dedicated CPN but instead leverages free bits in head flits of existing packets to carry and propagate rich congestion information without introducing additional wires or flits. In order to balance the network load, FreeRider adopts a novel three-stage strategy of output link selection, which adequately utilizes the propagated information to make routing decisions. Experimental results on both synthetic traffic patterns and application traces show that FreeRider achieves better throughput, shorter latency, and smaller power consumption than a state-of-the-art adaptive routing technique with dedicated CPN.

无

0

40浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】Statistical Performance Comparisons of Computers

IEEE Transactions on Computers，2014，65（5）： 1442 - 14

2014年04月04日

摘要

As a fundamental task in computer architecture research, performance comparison has been continuously hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance observations are compared regardless of the variability), or in the few cases directly addressed with i-statistics without checking the number and normality of performance observations. In this paper, we formulate a performance comparison as a statistical task, and empirically illustrate why and how common practices can lead to incorrect comparisons. We propose a non-parametric hierarchical performance testing (HPT) framework for performance comparison, which is significantly more practical than standard i-statistics because it does not require to collect a large number of performance observations in order to achieve a normal distribution of sample mean. In particular, the proposed HPT can facilitate quantitative performance comparison, in which the performance speedup of one computer over another is statistically evaluated. Compared with the HPT, a common practice which uses geometric mean performance scores to estimate the performance speedup has errors of 8.0 to 56.3 percent on SPEC CPU2006 or SPEC MPI2007, which demonstrates the necessity of using appropriate statistical techniques. This HPT framework has been implemented as an open-source software, and integrated in the PARSEC 3.0 benchmark suite.

无

0

27浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach

ACM Transactions on Architecture and Code Optimization，2014，11（2）：21

2014年06月01日

摘要

Because of tight power and energy constraints, industry is progressively shifting toward heterogeneous system-on-chip (SoC) architectures composed of a mix of general-purpose cores along with a number of accelerators. However, such SoC architectures can be very challenging to efficiently program for the vast majority of programmers, due to numerous programming approaches and languages. Libraries, on the other hand, provide a simple way to let programmers take advantage of complex architectures, which does not require programmers to acquire new accelerator-specific or domain-specific languages. Increasingly, library-based, also called algorithm-centric, programming approaches propose to generalize the usage of libraries and to compose programs around these libraries, instead of using libraries as mere complements. In this article, we present a software framework for achieving performance portability by leveraging a generalized library-based approach. Inspired by the notion of a component, as employed in software engineering and HW/SW codesign, we advocate nonexpert programmers to write simple wrapper code around existing libraries to provide simple but necessary semantic information to the runtime. To achieve performance portability, the runtime employs machine learning (simulated annealing) to select the most appropriate accelerator and its parameters for a given algorithm. This selection factors in the possibly complex composition of algorithms used in the application, the communication among the various accelerators, and the tradeoff between different objectives (i.e., accuracy, performance, and energy). Using a set of benchmarks run on a real heterogeneous SoC composed of a multicore processor and a GPU, we show that the runtime overhead is fairly small at 5.1% for the GPU and 6.4% for the multi-core. We then apply our accelerator selection approach to a simulated SoC platform containing multiple inexact accelerators. We show that accelerator selection together with hardware parameter tuning achieves an average 46.2% energy reduction and a speedup of 2.1× while meeting the desired application error target.

无

0

33浏览
0点赞
0收藏
0分享
0下载
0

引用