陈云霁，学者主页-中国科技论文在线

陈云霁

博士研究员博士生导师

中国科学院计算技术研究所

微体系结构；机器学习；并行计算；视频处理

个性化签名

TA的关注(0) 关注TA的(0)

留言板

暂无留言

主页成果学术会议学者精选辑更多功能敬请期待

姓名：陈云霁
目前身份：在职研究人员
担任导师情况：博士生导师
学位：博士
学术头衔：

博士生导师
职称：高级-研究员
学科领域：

计算机系统结构
研究兴趣：微体系结构；机器学习；并行计算；视频处理

个人简介

陈云霁，男，1983年生，江西南昌人，中国科学院计算技术研究所研究员，博士生导师。同时，他担任了中国科学院脑科学卓越中心特聘研究员，以及中国科学院大学岗位教授。他带领其研究组长期开展深度学习处理器研究，被Science杂志刊文评价为“该方向的先驱”和“处于引领者行列”。他在包括ISCA、HPCA、MICRO、ASPLOS、ICSE、ISSCC、Hot Chips、IJCAI、FPGA、SPAA、IEEE Micro以及8种IEEE/ACM Trans.在内的学术会议及期刊上发表论文60余篇。陈云霁曾获国家杰出青年科学基金、中国青年科技奖、全国创新争先奖、国家自然科学基金委“优秀青年基金”，被评为最美科技工作者，荣获科学探索奖，并被MIT技术评论评为全球35位杰出青年创新者（2015年度）。

教育经历
2002.9-2007.7            硕博连读       中国科学院计算技术研究所
1997.9-2002.7            本科生           中国科学技术大学少年班
1992.7-1997.9            中学生           南昌市第十中学少年班

研究经历
2012.9-            研究员           中国科学院计算技术研究所
2009.9-2012.9            副研究员       中国科学院计算技术研究所
2007.7-2009.9            助理研究员       中国科学院计算技术研究所

研究兴趣
微体系结构；机器学习；并行计算；视频处理

主页访问

225
关注数

0
成果阅读

663
成果数

22

TA的成果

上传时间

2020-11-04

【期刊论文】Godson-3: A Scalable Multicore RISC Processor with x86 Emulation

IEEE Micro，2009，29（2）：17 - 29

2009年04月07日

摘要

The Godson-3 microprocessor aims at high-throughput server applications, high-performance scientific computing, and high-end embedded applications. It offers a scalable network on chip, hardware support for x86 emulation, and a reconfigurable architecture. The four-core Godson-3 chip is fabricated with 65-nm CMOS technology. Eight- and 16-core Godson-3 chips are in development.

无

39浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Global Clock, Physical Time Order and Pending Period Analysis in Multiprocessor Systems

arXiv，2009，（）：

2009年07月12日

摘要

In multiprocessor systems, various problems are treated with Lamport's logical clock and the resultant logical time orders between operations. However, one often needs to face the high complexities caused by the lack of logical time order information in practice. In this paper, we utilize the \emph{global clock} to infuse the so-called \emph{pending period} to each operation in a multiprocessor system, where the pending period is a time interval that contains the performed time of the operation. Further, we define the \emph{physical time order} for any two operations with disjoint pending periods. The physical time order is obeyed by any real execution in multiprocessor systems due to that it is part of the truly happened operation orders restricted by global clock, and it is then proven to be independent and consistent with traditional logical time orders. The above novel yet fundamental concepts enables new effective approaches for analyzing multiprocessor systems, which are named \emph{pending period analysis} as a whole. As a consequence of pending period analysis, many important problems of multiprocessor systems can be tackled effectively. As a significant application example, complete memory consistency verification, which was known as an NP-hard problem, can be solved with the complexity of O(n2) (where n is the number of operations). Moreover, the two event ordering problems, which were proven to be Co-NP-Hard and NP-hard respectively, can both be solved with the time complexity of O(n) if restricted by pending period information.

无

21浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】System Architecture of Godson-3 Multi-Core Processors

Journal of Computer Science and Technology，2010，25（）：181–191

2010年03月16日

摘要

Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.

无

23浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Linear Time Memory Consistency Verification

IEEE Transactions on Computers，2011，61（4）：502 - 516

2011年02月10日

摘要

Verifying the execution of a parallel program against a given memory consistency model (memory consistency verification) is a crucial problem in the functional validation of Chip Multiprocessor (CMP). In the absence of additional information, the above problem is known to be NP-hard. By adopting the pending period information, this paper proposes the first linear-time software-based approach to memory consistency verification. Our approach relies on a novel technique called reusable cycle checking, which reuses the previous order information when repeatedly checking cycle at different frontiers. In the context of pending period information, this technique significantly reduces the overall computational costs required by cycle checking, enabling linear-time (in the number of memory operations) memory consistency verification for any given multicore system with a constant number of processors. From a practical perspective, an industrial memory consistency verification tool, named XCHECK, has been developed based on our approach. XCHECK is capable of working with neither test program constraint nor dedicated hardware support in postsilicon verifications of many multiprocessor systems. Experimental results show that XCHECK is 3-10 times faster than a state-of-art software-based approach. XCHECK has been integrated into the verification platforms for an industrial multicore processor Godson-3B, and found several bugs of the design.

无

35浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Program Regularization in Memory Consistency Verification

IEEE Transactions on Parallel and Distributed Systems，2012，23（11）：2163 - 217

2012年01月31日

摘要

A widely adopted methodology for verifying the memory subsystem of a Chip Multiprocessor (CMP) is to verify executions of parallel test programs on the CMP against the given memory consistency model, which has been long known to be time consuming in both theory and practice. To accelerate memory consistency verification, previous approaches have to bear the cost of availability (e.g., relying on dedicated hardware supports that have not been offered by many commodity CMPs) or completeness (e.g., missing some bugs). In the meantime, the impact of parallel programs on memory consistency verification has more or less been overlooked. One piece of evidence is that few investigations have been dedicated to finding appropriate test programs enabling more efficient verification From a novel perspective of test program, we devise a practical technique called “program regularization,” which can effectively reduce the computation time of memory consistency verification. The key intuition behind program regularization is that any parallel program, if being reformed appropriately, can enable efficient memory consistency verification. More specifically, for an original program, program regularization introduces some auxiliary memory addresses, and periodically inserts load/store operations accessing these addresses to the original program. With the regularized program, memory consistency verification can be accomplished in linear time (with respect to the number of memory operations) when the number of processors is fixed. Experimental results show that program regularization can significantly accelerate memory consistency verification. Last but not least, our technique, which does not rely on concrete verification algorithm or dedicated hardware support, can be smoothly integrated into existing presilicon/postsilicon verification platforms of industrial CMPs to speed up memory consistency verification.

无

40浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Motion Estimation Without Integer-Pel Search

IEEE Transactions on Image Processing，2012，22（4）：1340 - 135

2012年11月20日

摘要

The typical motion estimation (ME) consists of three main steps, including spatial-temporal prediction, integer-pel search, and fractional-pel search. The integer-pel search, which seeks the best matched integer-pel position within a search window, is considered to be crucial for video encoding. It occupies over 50% of the overall encoding time (when adopting the full search scheme) for software encoders, and introduces remarkable area cost, memory traffic, and power consumption to hardware encoders. In this paper, we find that video sequences (especially high-resolution videos) can often be encoded effectively and efficiently even without integer-pel search. Such counter-intuitive phenomenon is not only because that spatial-temporal prediction and fractional-pel search are accurate enough for the ME of many blocks. In fact, we observe that when the predicted motion vector is biased from the optimal motion vector (mainly for boundary blocks of irregularly moving objects), it is also hard for integer-pel search to reduce the final rate-distortion cost: the deviation of reference position could be alleviated with the fractional-pel interpolation and rate-distortion optimization techniques (e.g., adaptive macroblock mode). Considering the decreasing proportion of boundary blocks caused by the increasing resolution of videos, integer-pel search may be rather cost-ineffective in the era of high-resolution. Experimental results on 36 typical sequences of different resolutions encoded with x264, which is a widely-used video encoder, comply with our analysis well. For 1080p sequences, removing the integer-pel search saves 57.9% of the overall H.264 encoding time on average (compared to the original x264 with full integer-pel search using default parameters), while the resultant performance loss is negligible: the bit-rate is increased by only 0.18%, while the peak signal-to-noise ratio is decreased by only 0.01 dB per frame averagely.

无

38浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】LDet: Determinizing Asynchronous Transfer for Postsilicon Debugging

IEEE Transactions on Computers，2012，62（9）：1732 - 174

2012年06月05日

摘要

To efficiently and effectively debug silicon bugs, a promising solution is to determinize the chip, so that the buggy silicon behaviors can be faithfully reproduced on a RTL simulator. In this paper, we propose a novel scheme, named LDet, to determinize a chip through removing the nondeterminism in transfers crossing different clock domains, even when these clock domains are heterochronous. The key insight of LDet is that we can slightly adjust the frequencies of clocks at runtime so that the actual frequency ratio between two clocks always approaches a rational constant with bounded accumulated error. With the technique called dynamic frequency adjusting, the processing time of each asynchronous transfer can be determinized with deterministic asynchronous fifo (DAF). As a consequence, the behavior of the whole chip is deterministic, thus the chip behavior can be reproduced on the RTL simulator (given the same initial state and input sequence). We implement LDet on the RTL design of a processor chip with many clock domains. Experiments show that on average, LDet only causes about one cycle of additional latency to each asynchronous transfer. As a result, LDet only incurs a negligible performance overhead of about 0.7 percent slowdown. Moreover, LDet only brings less than 0.2 percent additional area to the chip. The low performance and area overheads of LDet well demonstrate its applicability in industry.

无

22浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Deterministic Replay Using Global Clock

ACM Transactions on Architecture and Code Optimization，2013，10（1）：1

2013年04月01日

摘要

Debugging parallel programs is a well-known difficult problem. A promising method to facilitate debugging parallel programs is using hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that instead of recording the logical time orders between instructions or instruction blocks as previous investigations, LReplay is built upon recording the pending period information infused by the global clock. By the recorded pending period information, about 99% execution orders are inferrable, implying that LReplay only needs to record directly the residual 1% noninferrable execution orders in production run. The 1% noninferrable orders can be addressed by a simple yet cost-effective direction prediction technique, which further reduces the log size of LReplay. Benefiting from the preceding innovations, the overall log size of LReplay over SPLASH-2 benchmarks is about 0.17B/K-Inst (byte per k-instruction) for the sequential consistency, and 0.57B/K-Inst for the Godson-3 consistency. Such log sizes are smaller in an order of magnitude than previous deterministic replay schemes incurring no performance loss. Furthermore, LReplay only consumes about 0.5% area of the Godson-3 CMP, since it requires only trivial modifications to existing components of Godson-3. The features of LReplay demonstrate the potential of integrating hardware support for deterministic replay into future industrial processors.

无

21浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Effective and efficient microprocessor design space exploration using unlabeled design configurations

ACM Transactions on Intelligent Systems and Technology，2014，5（1）：20

2014年01月01日

摘要

Ever-increasing design complexity and advances of technology impose great challenges on the design of modern microprocessors. One such challenge is to determine promising microprocessor configurations to meet specific design constraints, which is called Design Space Exploration (DSE). In the computer architecture community, supervised learning techniques have been applied to DSE to build regression models for predicting the qualities of design configurations. For supervised learning, however, considerable simulation costs are required for attaining the labeled design configurations. Given limited resources, it is difficult to achieve high accuracy. In this article, inspired by recent advances in semisupervised learning and active learning, we propose the COAL approach which can exploit unlabeled design configurations to significantly improve the models. Empirical study demonstrates that COAL significantly outperforms a state-of-the-art DSE technique by reducing mean squared error by 35% to 95%, and thus, promising architectures can be attained more efficiently.

无

27浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】An 8-Core MIPS-Compatible Processor in 32/28 nm Bulk CMOS

IEEE Journal of Solid-State Circuits，2013，49（1）：41 - 49

2013年10月22日

摘要

This paper is an extension of Hu et al., ISSCC, 2013, and it introduces the 32/28 nm implementations of Godson-3B1500, which are 8-core MIPS-compatible microprocessors with vector extensions. Godson-3B1500 is fabricated in STMicroelectronics 32/28 nm high-κ metal-gate low-power bulk CMOS with 10 metal layers. It contains 1.14 billion transistors and operates at the frequency of 1.0 GHz to 1.5 GHz with the voltage supply ranging from 1.0 V to 1.3 V. Compared to its predecessor (Hu et al., ISSCC, 2011), Godson-3B1500 brings significant power efficiency improvements with enhanced performance (150GFLOPS@1.2 GHz) and reduced power dissipation (<; 40 W), due to not only technology scaling but also a great deal of design efforts.

无

25浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Deterministic Replay: A Survey

ACM Computing Surveys，2015，48（2）：17

2015年09月01日

摘要

Deterministic replay is a type of emerging technique dedicated to providing deterministic executions of computer programs in the presence of nondeterministic factors. The application scopes of deterministic replay are very broad, making it an important research topic in domains such as computer architecture, operating systems, parallel computing, distributed computing, programming languages, verification, and hardware testing. In this survey, we comprehensively review existing studies on deterministic replay by introducing a taxonomy. Basically, existing deterministic replay schemes can be classified into two categories, single-processor (SP) schemes and multiprocessor (MP) schemes. By reviewing the details of these two categories of schemes respectively, we summarize and compare how existing schemes address technical issues such as log size, record slowdown, replay slowdown, implementation cost, and probe effect, which may shed some light on future studies on deterministic replay.

无

36浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】IMR: High-Performance Low-Cost Multi-Ring NoCs

IEEE Transactions on Parallel and Distributed Systems，2015，27（6）： 1700 - 17

2015年08月07日

摘要

A ring topology is a common solution of network-on-chip (NoC) in industry, but is frequently criticized to have poor scalability. In this paper, we present a novel type of multi-ring NoC called isolated multi-ring (IMR), which can even support chip multiprocessors (CMPs) with 1,024 cores. In IMR, any pair of cores are connected via at least one isolated ring, so that each packet can reach the destination without transferring from one ring to another. Therefore, IMR no longer needs expensive routers as mesh, which not only enhances the network performance but also reduces hardware overheads. We utilize simulated evolution to design optimized IMR topologies. We compare these IMR topologies against nine representative NoCs (e.g., traditional mesh, multi mesh, low-cost mesh, Express-virtual-channels mesh (EVC), torus ring, and hierarchical ring). We observe from experiments that IMR significantly outperforms its competitors in both saturation throughput and latency across all scenarios considered. For example, in a 16 × 16 CMP, IMR improves the saturation throughput of a state-of-the-art mesh (EVC) by 265.29 percent on average, and reduces the average packet latency on SPLASH-2 application traces by 71.58 percent, while consuming 5.08 percent less area and 9.76 percent less power. In a 32 × 32 CMP, IMR averagely improves the saturation throughput of EVC by 191.58 percent, and averagely reduces the packet latency on SPLASH-2 application traces by 23.09 percent, while consuming 2.86 percent less area and 10.81 percent less power.

无

34浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Accelerating Architectural Simulation Via Statistical Techniques: A Survey

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems，2015，35（3）：433 - 446

2015年09月24日

摘要

In computer architecture research and development, simulation is a powerful way of acquiring and predicting processor behaviors. While architectural simulation has been extensively utilized for computer performance evaluation, design space exploration, and computer architecture assessment, it still suffers from the high computational costs in practice. Specifically, the total simulation time is determined by the simulator's raw speed and the total number of simulated instructions. The simulator's speed can be improved by enhanced simulation infrastructures (e.g., simulators with high-level abstraction, parallel simulators, and hardware-assisted simulators). Orthogonal to these work, recent studies also managed to significantly reduce the total number of simulated instructions with a slight loss of accuracy. Interestingly, we observe that most of these work are built upon statistical techniques. This survey presents a comprehensive review to such studies and proposes a taxonomy based on the sources of reduction. In addition to identifying the similarities and differences of state-of-the-art approaches, we further discuss insights gained from these studies as well as implications for future research.

无

38浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Leveraging the Error Resilience of Neural Networks for Designing Highly Energy Efficient Accelerators

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems，2015，34（8）：1223 - 123

2015年04月03日

摘要

In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for slashing energy consumption in many applications that can tolerate a certain degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings-the energy consumed, the (critical path) delay, and the (silicon) area-this approach has been limited to application-specified integrated circuits (ASICs) so far. These ASIC realizations have a narrow application scope and are often rigid in their tolerance to inaccuracy, as currently designed; the latter often determining the extent of resource savings we would achieve. In this paper, we propose to improve the application scope, error resilience and the energy savings of inexact computing by combining it with hardware neural networks. These neural networks are fast emerging as popular candidate accelerators for future heterogeneous multicore platforms and have flexible error resilience limits owing to their ability to be trained. Our results in 65-nm technology demonstrate that the proposed inexact neural network accelerator could achieve 1.78-2.67× savings in energy consumption (with corresponding delay and area savings being 1.23 and 1.46×, respectively) when compared to the existing baseline neural network implementation, at the cost of a small accuracy loss (mean squared error increases from 0.14 to 0.20 on average).

无

42浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Iterative optimization for the data center

ACM SIGPLAN Notices，2012，47（4）：

2012年03月01日

摘要

Iterative optimization is a simple but powerful approach that searches for the best possible combination of compiler optimizations for a given workload. However, each program, if not each data set, potentially favors a different combination. As a result, iterative optimization is plagued by several practical issues that prevent it from being widely used in practice: a large number of runs are required for finding the best combination; the process can be data set dependent; and the exploration process incurs significant overhead that needs to be compensated for by performance benefits.Therefore, while iterative optimization has been shown to have significant performance potential, it is seldomly used in production compilers. In this paper, we propose Iterative Optimization for the Data Center (IODC): we show that servers and data centers offer a context in which all of the above hurdles can be overcome. The basic idea is to spawn different combinations across workers and recollect performance statistics at the master, which then evolves to the optimum combination of compiler optimizations. IODC carefully manages costs and benefits, and is transparent to the end user. We evaluate IODC using both MapReduce and throughput compute-intensive server applications. In order to reflect the large number of users interacting with the system, we gather a very large collection of data sets (at least 1000 and up to several million unique data sets per program), for a total storage of 10.7TB, and 568 days of CPU time. We report an average performance improvement of 1.48×, and up to 2.08×, for the MapReduce applications, and 1.14×, and up to 1.39×, for the throughput compute-intensive server applications.

无

23浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】A Small-Footprint Accelerator for Large-Scale Neural Networks

ACM Transactions on Computer Systems，2015，33（2）：6

2015年05月01日

摘要

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have been focusing on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02mm<sup>2</sup> and 485mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87 × faster, and it can reduce the total energy by 21.08 ×. The accelerator characteristics are obtained after layout at 65nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

无

35浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Robust Design Space Modeling

ACM Transactions on Design Automation of Electronic Systems，2015，20（2）：18

2015年03月01日

摘要

Architectural design spaces of microprocessors are often exponentially large with respect to the pending processor parameters. To avoid simulating all configurations in the design space, machine learning and statistical techniques have been utilized to build regression models for characterizing the relationship between architectural configurations and responses (e.g., performance or power consumption). However, this article shows that the accuracy variability of many learning techniques over different design spaces and benchmarks can be significant enough to mislead the decision-making. This clearly indicates a high risk of applying techniques that work well on previous modeling tasks (each involving a design space, benchmark, and design objective) to a new task, due to which the powerful tools might be impractical. Inspired by ensemble learning in the machine learning domain, we propose a robust framework called ELSE to reduce the accuracy variability of design space modeling. Rather than employing a single learning technique as in previous investigations, ELSE employs distinct learning techniques to build multiple base regression models for each modeling task. This is not a trivial combination of different techniques (e.g., always trusting the regression model with the smallest error). Instead, ELSE carefully maintains the diversity of base regression models and constructs a metamodel from the base models that can provide accurate predictions even when the base models are far from accurate. Consequently, we are able to reduce the number of cases in which the final prediction errors are unacceptably large. Experimental results validate the robustness of ELSE: compared with the widely used artificial neural network over 52 distinct modeling tasks, ELSE reduces the accuracy variability by about 62%. Moreover, ELSE reduces the average prediction error by 27% and 85% for the investigated MIPS and POWER design spaces, respectively.

无

28浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】FreeRider: Non-Local Adaptive Network-on-Chip Routing with Packet-Carried Propagation of Congestion Information

IEEE Transactions on Parallel and Distributed Systems，2014，26（8）：2272 - 228

2014年08月05日

摘要

Non-local adaptive routing techniques, which utilize statuses of both local and distant links to make routing decisions, have recently been shown to be effective solutions for promoting the performance of Network-on-Chip (NoC). The essence of non-local adaptive routing was an additional network dedicated to propagate congestion information of distant links on the NoC. While the dedicated Congestion Propagation Network (CPN) helps routers to make promising routing decisions, it incurs additional wiring and power costs and becomes an unnecessary decoration when the load of NoC is light. Moreover, the CPN has to be extended if one would utilize more sophisticated congestion information to enhance the performance of NoC, bringing in even larger wiring and power costs. This paper proposes an innovative non-local adaptive routing technique called FreeRider, which does not use a dedicated CPN but instead leverages free bits in head flits of existing packets to carry and propagate rich congestion information without introducing additional wires or flits. In order to balance the network load, FreeRider adopts a novel three-stage strategy of output link selection, which adequately utilizes the propagated information to make routing decisions. Experimental results on both synthetic traffic patterns and application traces show that FreeRider achieves better throughput, shorter latency, and smaller power consumption than a state-of-the-art adaptive routing technique with dedicated CPN.

无

40浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Statistical Performance Comparisons of Computers

IEEE Transactions on Computers，2014，65（5）： 1442 - 14

2014年04月04日

摘要

As a fundamental task in computer architecture research, performance comparison has been continuously hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance observations are compared regardless of the variability), or in the few cases directly addressed with i-statistics without checking the number and normality of performance observations. In this paper, we formulate a performance comparison as a statistical task, and empirically illustrate why and how common practices can lead to incorrect comparisons. We propose a non-parametric hierarchical performance testing (HPT) framework for performance comparison, which is significantly more practical than standard i-statistics because it does not require to collect a large number of performance observations in order to achieve a normal distribution of sample mean. In particular, the proposed HPT can facilitate quantitative performance comparison, in which the performance speedup of one computer over another is statistically evaluated. Compared with the HPT, a common practice which uses geometric mean performance scores to estimate the performance speedup has errors of 8.0 to 56.3 percent on SPEC CPU2006 or SPEC MPI2007, which demonstrates the necessity of using appropriate statistical techniques. This HPT framework has been implemented as an open-source software, and integrated in the PARSEC 3.0 benchmark suite.

无

27浏览
0点赞
0收藏
0分享
0下载
0评论
引用

上传时间

2020-11-04

【期刊论文】Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach

ACM Transactions on Architecture and Code Optimization，2014，11（2）：21

2014年06月01日

摘要

Because of tight power and energy constraints, industry is progressively shifting toward heterogeneous system-on-chip (SoC) architectures composed of a mix of general-purpose cores along with a number of accelerators. However, such SoC architectures can be very challenging to efficiently program for the vast majority of programmers, due to numerous programming approaches and languages. Libraries, on the other hand, provide a simple way to let programmers take advantage of complex architectures, which does not require programmers to acquire new accelerator-specific or domain-specific languages. Increasingly, library-based, also called algorithm-centric, programming approaches propose to generalize the usage of libraries and to compose programs around these libraries, instead of using libraries as mere complements. In this article, we present a software framework for achieving performance portability by leveraging a generalized library-based approach. Inspired by the notion of a component, as employed in software engineering and HW/SW codesign, we advocate nonexpert programmers to write simple wrapper code around existing libraries to provide simple but necessary semantic information to the runtime. To achieve performance portability, the runtime employs machine learning (simulated annealing) to select the most appropriate accelerator and its parameters for a given algorithm. This selection factors in the possibly complex composition of algorithms used in the application, the communication among the various accelerators, and the tradeoff between different objectives (i.e., accuracy, performance, and energy). Using a set of benchmarks run on a real heterogeneous SoC composed of a multicore processor and a GPU, we show that the runtime overhead is fairly small at 5.1% for the GPU and 6.4% for the multi-core. We then apply our accelerator selection approach to a simulated SoC platform containing multiple inexact accelerators. We show that accelerator selection together with hardware parameter tuning achieves an average 46.2% energy reduction and a speedup of 2.1× while meeting the desired application error target.

无

33浏览
0点赞
0收藏
0分享
0下载
0评论
引用