22 results found for this scholar

Upload date: November 4, 2020

[Journal Article] Deterministic Replay: A Survey

ACM Computing Surveys, 2015, 48(2): 17

Published: September 1, 2015

Abstract

Deterministic replay is an emerging technique dedicated to providing deterministic executions of computer programs in the presence of nondeterministic factors. The application scope of deterministic replay is very broad, making it an important research topic in domains such as computer architecture, operating systems, parallel computing, distributed computing, programming languages, verification, and hardware testing. In this survey, we comprehensively review existing studies on deterministic replay by introducing a taxonomy. Existing deterministic replay schemes can be classified into two categories: single-processor (SP) schemes and multiprocessor (MP) schemes. By reviewing the details of these two categories respectively, we summarize and compare how existing schemes address technical issues such as log size, record slowdown, replay slowdown, implementation cost, and probe effect, which may shed some light on future studies on deterministic replay.
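The record/replay mechanism the surveyed schemes share can be made concrete with a minimal sketch: during recording, the outcome of every nondeterministic event is appended to a log; during replay, the same call sites consume the log in order, so the execution becomes deterministic. The Python below is a hypothetical single-processor illustration, not any scheme from the survey; the Recorder/Replayer names and the JSON log format are invented for this example.

    import json
    import random
    import time

    class Recorder:
        """Record mode: execute nondeterministic sources and log their outcomes."""
        def __init__(self):
            self.log = []

        def nondet(self, fn, *args):
            value = fn(*args)        # run the real nondeterministic source
            self.log.append(value)   # remember the outcome for later replay
            return value

        def save(self, path):
            with open(path, "w") as f:
                json.dump(self.log, f)

    class Replayer:
        """Replay mode: feed logged outcomes back in order instead of re-executing."""
        def __init__(self, path):
            with open(path) as f:
                self.log = iter(json.load(f))

        def nondet(self, fn, *args):
            return next(self.log)    # ignore the real source; reuse the recorded value

    def workload(env):
        # Both the random draw and the timestamp are nondeterministic inputs.
        return env.nondet(random.randint, 0, 99), env.nondet(time.time)

    rec = Recorder()
    first = workload(rec)
    rec.save("trace.json")
    assert workload(Replayer("trace.json")) == first   # identical re-execution

The log size and record slowdown trade-offs discussed in the survey correspond to how much such a logger must capture and how much the capturing itself perturbs the original run (the probe effect).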

Upload date: November 4, 2020

[Journal Article] IMR: High-Performance Low-Cost Multi-Ring NoCs

IEEE Transactions on Parallel and Distributed Systems, 2015, 27(6): 1700 - 17

Published: August 7, 2015

Abstract

A ring topology is a common network-on-chip (NoC) solution in industry, but it is frequently criticized for poor scalability. In this paper, we present a novel type of multi-ring NoC called isolated multi-ring (IMR), which can support chip multiprocessors (CMPs) with as many as 1,024 cores. In IMR, any pair of cores is connected via at least one isolated ring, so that each packet can reach its destination without transferring from one ring to another. Therefore, IMR no longer needs expensive routers as mesh does, which not only enhances network performance but also reduces hardware overheads. We use simulated evolution to design optimized IMR topologies and compare them against nine representative NoCs (e.g., traditional mesh, multi-mesh, low-cost mesh, express-virtual-channels mesh (EVC), torus, ring, and hierarchical ring). Experiments show that IMR significantly outperforms its competitors in both saturation throughput and latency across all scenarios considered. For example, in a 16 × 16 CMP, IMR improves the saturation throughput of a state-of-the-art mesh (EVC) by 265.29 percent on average, and reduces the average packet latency on SPLASH-2 application traces by 71.58 percent, while consuming 5.08 percent less area and 9.76 percent less power. In a 32 × 32 CMP, IMR improves the saturation throughput of EVC by 191.58 percent on average, and reduces the average packet latency on SPLASH-2 application traces by 23.09 percent, while consuming 2.86 percent less area and 10.81 percent less power.
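The structural claim is that every source/destination pair shares at least one ring, so a packet never transfers between rings and no mesh-style router is needed. Below is a toy routing function under that invariant, with a hypothetical 4-core ring set invented for illustration (the paper derives its ring sets for much larger CMPs via simulated evolution):

    # Each ring is a cyclic, unidirectional sequence of core IDs.
    # This 4-core ring set is invented for illustration only.
    RINGS = [
        [0, 1, 3, 2],   # ring A covers all cores
        [0, 2],         # ring B
        [1, 3],         # ring C
    ]

    def hops_on_ring(ring, src, dst):
        """Hop count from src to dst travelling around one unidirectional ring."""
        return (ring.index(dst) - ring.index(src)) % len(ring)

    def route(src, dst):
        """Pick the shared ring with the fewest hops; no ring-to-ring transfer."""
        shared = [r for r in RINGS if src in r and dst in r]
        assert shared, "IMR requires every core pair to share at least one ring"
        best = min(shared, key=lambda r: hops_on_ring(r, src, dst))
        return best, hops_on_ring(best, src, dst)

    print(route(2, 1))   # ring A, 2 hops: 2 -> 0 -> 1

Because a packet stays on one ring, each node needs only a simple ring interface rather than a crossbar router, which is consistent with the area and power savings reported above.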

Upload date: November 4, 2020

[Journal Article] Accelerating Architectural Simulation Via Statistical Techniques: A Survey

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015, 35(3): 433-446

Published: September 24, 2015

Abstract

In computer architecture research and development, simulation is a powerful way of acquiring and predicting processor behaviors. While architectural simulation has been extensively utilized for computer performance evaluation, design space exploration, and computer architecture assessment, it still suffers from high computational costs in practice. Specifically, the total simulation time is determined by the simulator's raw speed and the total number of simulated instructions. The simulator's speed can be improved by enhanced simulation infrastructures (e.g., simulators with high-level abstraction, parallel simulators, and hardware-assisted simulators). Orthogonal to such infrastructures, recent studies have also managed to significantly reduce the total number of simulated instructions at a slight loss of accuracy. Interestingly, we observe that most of these studies are built upon statistical techniques. This survey presents a comprehensive review of such studies and proposes a taxonomy based on the sources of reduction. In addition to identifying the similarities and differences of state-of-the-art approaches, we further discuss insights gained from these studies as well as implications for future research.
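A concrete instance of the statistical reduction the survey catalogues is interval sampling: simulate only a random sample of fixed-size instruction intervals in detail and extrapolate the metric of interest with a confidence interval. The sketch below is a generic illustration with synthetic per-interval CPI values, not a reproduction of any particular surveyed scheme:

    import random
    import statistics

    random.seed(0)

    # Synthetic per-interval CPI for a 100,000-interval trace; in a real
    # simulator each value would require detailed simulation of one interval.
    population = [1.0 + 0.3 * random.random() + (0.5 if i % 1000 < 50 else 0.0)
                  for i in range(100_000)]

    def sampled_cpi(pop, n):
        """Estimate mean CPI from a simple random sample of n intervals."""
        sample = random.sample(pop, n)
        mean = statistics.fmean(sample)
        half = 1.96 * statistics.stdev(sample) / n ** 0.5   # ~95% half-interval
        return mean, half

    true_cpi = statistics.fmean(population)
    est, half = sampled_cpi(population, 500)   # detail-simulate 0.5% of intervals
    print(f"true={true_cpi:.4f} est={est:.4f} +/- {half:.4f}")

Here the number of detail-simulated intervals shrinks by about 200× while the estimation error stays statistically bounded, which is exactly the accuracy-for-speed trade the surveyed techniques formalize.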

Upload date: November 4, 2020

[Journal Article] Leveraging the Error Resilience of Neural Networks for Designing Highly Energy Efficient Accelerators

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015, 34(8): 1223 - 123

Published: April 3, 2015

Abstract

In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for slashing energy consumption in applications that can tolerate a certain degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy for significant resource savings (in energy consumption, critical-path delay, and silicon area), this approach has so far been limited to application-specific integrated circuits (ASICs). These ASIC realizations have a narrow application scope and, as currently designed, are often rigid in their tolerance to inaccuracy, which in turn limits the extent of achievable resource savings. In this paper, we propose to improve the application scope, error resilience, and energy savings of inexact computing by combining it with hardware neural networks. These neural networks are fast emerging as popular candidate accelerators for future heterogeneous multicore platforms and have flexible error-resilience limits owing to their ability to be trained. Our results in 65-nm technology demonstrate that the proposed inexact neural network accelerator achieves 1.78-2.67× savings in energy consumption (with corresponding delay and area savings of 1.23× and 1.46×, respectively) compared to an existing baseline neural network implementation, at the cost of a small accuracy loss (mean squared error increases from 0.14 to 0.20 on average).
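In software, the accuracy-for-energy trade can be mimicked by truncating neuron arithmetic to a reduced fixed-point precision and watching the mean squared error grow as bits are removed. Everything below (the single-neuron model, the data, and the precision scheme) is invented for illustration; the paper's savings come from inexact 65-nm hardware, not from a simulation like this:

    import random

    random.seed(1)

    def quantize(x, frac_bits):
        """Round x onto a fixed-point grid with frac_bits fractional bits."""
        scale = 1 << frac_bits
        return round(x * scale) / scale

    def neuron(weights, inputs, frac_bits=None):
        """Weighted sum + ReLU, optionally with inexact (truncated) arithmetic."""
        if frac_bits is None:
            acc = sum(w * x for w, x in zip(weights, inputs))
        else:
            acc = 0.0
            for w, x in zip(weights, inputs):
                acc = quantize(acc + quantize(w * x, frac_bits), frac_bits)
        return max(0.0, acc)

    weights = [random.uniform(-1, 1) for _ in range(16)]
    samples = [[random.uniform(0, 1) for _ in range(16)] for _ in range(1000)]
    exact = [neuron(weights, s) for s in samples]

    for bits in (8, 6, 4):
        approx = [neuron(weights, s, frac_bits=bits) for s in samples]
        mse = sum((a - b) ** 2 for a, b in zip(exact, approx)) / len(exact)
        print(f"{bits} fractional bits -> MSE {mse:.6f}")   # fewer bits, more error

Retraining the network at the reduced precision, which the paper's trainable accelerators allow, would typically recover part of this error; that flexibility is what distinguishes the approach from fixed-function inexact ASICs.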

Upload date: November 4, 2020

[Journal Article] Iterative optimization for the data center

ACM SIGPLAN Notices, 2012, 47(4)

Published: March 1, 2012

Abstract

Iterative optimization is a simple but powerful approach that searches for the best possible combination of compiler optimizations for a given workload. However, each program, if not each data set, potentially favors a different combination. As a result, iterative optimization is plagued by several practical issues that prevent it from being widely used: a large number of runs are required to find the best combination; the process can be data-set dependent; and the exploration incurs significant overhead that must be compensated for by performance benefits. Therefore, while iterative optimization has been shown to have significant performance potential, it is seldom used in production compilers. In this paper, we propose Iterative Optimization for the Data Center (IODC): we show that servers and data centers offer a context in which all of the above hurdles can be overcome. The basic idea is to spawn different combinations across workers and collect performance statistics at the master, which then evolves toward the optimal combination of compiler optimizations. IODC carefully manages costs and benefits, and is transparent to the end user. We evaluate IODC using both MapReduce and throughput compute-intensive server applications. To reflect the large number of users interacting with the system, we gather a very large collection of data sets (at least 1,000 and up to several million unique data sets per program), for a total of 10.7 TB of storage and 568 days of CPU time. We report an average performance improvement of 1.48×, and up to 2.08×, for the MapReduce applications, and 1.14×, and up to 1.39×, for the throughput compute-intensive server applications.
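IODC's master/worker loop is essentially an evolutionary search over compiler-flag combinations, with each evaluation being a timed run on a worker. Below is a minimal sketch under invented assumptions: the flag list is arbitrary, and measure_runtime is a noisy stand-in for actually compiling and running a program on user data sets, which is what the paper does.

    import random

    random.seed(2)

    FLAGS = ["-funroll-loops", "-finline-functions", "-ftree-vectorize",
             "-fomit-frame-pointer", "-fprefetch-loop-arrays", "-fno-strict-aliasing"]

    SECRET = [1, 0, 1, 1, 0, 0]   # hidden "best" combination, for this toy model only
    def measure_runtime(combo):
        """Stand-in for a worker compiling with `combo` and timing the run."""
        mismatch = sum(c != s for c, s in zip(combo, SECRET))
        return 10.0 + mismatch + random.uniform(0.0, 0.3)   # noisy, like real runs

    def iodc_search(generations=20, population=8):
        """Master evolves on/off flag vectors from the timings workers report."""
        pop = [[random.randint(0, 1) for _ in FLAGS] for _ in range(population)]
        for _ in range(generations):
            ranked = sorted(pop, key=measure_runtime)     # collect worker timings
            parents = ranked[: population // 2]           # keep the fastest half
            children = [[bit ^ (random.random() < 0.2)    # mutate to refill
                         for bit in random.choice(parents)]
                        for _ in range(population - len(parents))]
            pop = parents + children
        return min(pop, key=measure_runtime)

    best = iodc_search()
    print([flag for flag, on in zip(FLAGS, best) if on])

The data-center setting supplies what plain iterative optimization lacks: many concurrent workers to amortize the search, and a stream of real user data sets to keep the evaluation representative.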

Collaborating Scholars

  • No collaborators yet