中国科技论文在线

上传时间

2020年11月04日

【期刊论文】Accelerating Architectural Simulation Via Statistical Techniques: A Survey

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems，2015，35（3）：433 - 446

2015年09月24日

In computer architecture research and development, simulation is a powerful way of acquiring and predicting processor behaviors. While architectural simulation has been extensively utilized for computer performance evaluation, design space exploration, and computer architecture assessment, it still suffers from the high computational costs in practice. Specifically, the total simulation time is determined by the simulator's raw speed and the total number of simulated instructions. The simulator's speed can be improved by enhanced simulation infrastructures (e.g., simulators with high-level abstraction, parallel simulators, and hardware-assisted simulators). Orthogonal to these work, recent studies also managed to significantly reduce the total number of simulated instructions with a slight loss of accuracy. Interestingly, we observe that most of these work are built upon statistical techniques. This survey presents a comprehensive review to such studies and proposes a taxonomy based on the sources of reduction. In addition to identifying the similarities and differences of state-of-the-art approaches, we further discuss insights gained from these studies as well as implications for future research.

无

0

38浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】Deterministic Replay: A Survey

ACM Computing Surveys，2015，48（2）：17

2015年09月01日

摘要

Deterministic replay is a type of emerging technique dedicated to providing deterministic executions of computer programs in the presence of nondeterministic factors. The application scopes of deterministic replay are very broad, making it an important research topic in domains such as computer architecture, operating systems, parallel computing, distributed computing, programming languages, verification, and hardware testing. In this survey, we comprehensively review existing studies on deterministic replay by introducing a taxonomy. Basically, existing deterministic replay schemes can be classified into two categories, single-processor (SP) schemes and multiprocessor (MP) schemes. By reviewing the details of these two categories of schemes respectively, we summarize and compare how existing schemes address technical issues such as log size, record slowdown, replay slowdown, implementation cost, and probe effect, which may shed some light on future studies on deterministic replay.

无

0

37浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】A Small-Footprint Accelerator for Large-Scale Neural Networks

ACM Transactions on Computer Systems，2015，33（2）：6

2015年05月01日

摘要

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have been focusing on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02mm<sup>2</sup> and 485mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87 × faster, and it can reduce the total energy by 21.08 ×. The accelerator characteristics are obtained after layout at 65nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

无

0

35浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】Linear Time Memory Consistency Verification

IEEE Transactions on Computers，2011，61（4）：502 - 516

2011年02月10日

摘要

Verifying the execution of a parallel program against a given memory consistency model (memory consistency verification) is a crucial problem in the functional validation of Chip Multiprocessor (CMP). In the absence of additional information, the above problem is known to be NP-hard. By adopting the pending period information, this paper proposes the first linear-time software-based approach to memory consistency verification. Our approach relies on a novel technique called reusable cycle checking, which reuses the previous order information when repeatedly checking cycle at different frontiers. In the context of pending period information, this technique significantly reduces the overall computational costs required by cycle checking, enabling linear-time (in the number of memory operations) memory consistency verification for any given multicore system with a constant number of processors. From a practical perspective, an industrial memory consistency verification tool, named XCHECK, has been developed based on our approach. XCHECK is capable of working with neither test program constraint nor dedicated hardware support in postsilicon verifications of many multiprocessor systems. Experimental results show that XCHECK is 3-10 times faster than a state-of-art software-based approach. XCHECK has been integrated into the verification platforms for an industrial multicore processor Godson-3B, and found several bugs of the design.

无

0

35浏览
0点赞
0收藏
0分享
0下载
0

引用

上传时间

2020年11月04日

【期刊论文】IMR: High-Performance Low-Cost Multi-Ring NoCs

IEEE Transactions on Parallel and Distributed Systems，2015，27（6）： 1700 - 17

2015年08月07日

摘要

A ring topology is a common solution of network-on-chip (NoC) in industry, but is frequently criticized to have poor scalability. In this paper, we present a novel type of multi-ring NoC called isolated multi-ring (IMR), which can even support chip multiprocessors (CMPs) with 1,024 cores. In IMR, any pair of cores are connected via at least one isolated ring, so that each packet can reach the destination without transferring from one ring to another. Therefore, IMR no longer needs expensive routers as mesh, which not only enhances the network performance but also reduces hardware overheads. We utilize simulated evolution to design optimized IMR topologies. We compare these IMR topologies against nine representative NoCs (e.g., traditional mesh, multi mesh, low-cost mesh, Express-virtual-channels mesh (EVC), torus ring, and hierarchical ring). We observe from experiments that IMR significantly outperforms its competitors in both saturation throughput and latency across all scenarios considered. For example, in a 16 × 16 CMP, IMR improves the saturation throughput of a state-of-the-art mesh (EVC) by 265.29 percent on average, and reduces the average packet latency on SPLASH-2 application traces by 71.58 percent, while consuming 5.08 percent less area and 9.76 percent less power. In a 32 × 32 CMP, IMR averagely improves the saturation throughput of EVC by 191.58 percent, and averagely reduces the packet latency on SPLASH-2 application traces by 23.09 percent, while consuming 2.86 percent less area and 10.81 percent less power.

无

0

34浏览
0点赞
0收藏
0分享
0下载
0

引用