
22 results found for this scholar

Upload date

November 4, 2020

[Journal Paper] LDet: Determinizing Asynchronous Transfer for Postsilicon Debugging

IEEE Transactions on Computers, 2012, 62(9): 1732-174

June 5, 2012

Abstract

To debug silicon bugs efficiently and effectively, a promising solution is to determinize the chip, so that buggy silicon behaviors can be faithfully reproduced on an RTL simulator. In this paper, we propose a novel scheme, named LDet, to determinize a chip by removing the nondeterminism in transfers crossing different clock domains, even when those clock domains are heterochronous. The key insight of LDet is that we can slightly adjust the frequencies of clocks at runtime so that the actual frequency ratio between two clocks always approaches a rational constant with bounded accumulated error. With this technique, called dynamic frequency adjusting, the processing time of each asynchronous transfer can be determinized with a deterministic asynchronous FIFO (DAF). As a consequence, the behavior of the whole chip is deterministic and can be reproduced on the RTL simulator, given the same initial state and input sequence. We implement LDet on the RTL design of a processor chip with many clock domains. Experiments show that, on average, LDet adds only about one cycle of latency to each asynchronous transfer. As a result, LDet incurs a negligible performance overhead of about 0.7 percent slowdown, and it adds less than 0.2 percent area to the chip. The low performance and area overheads of LDet demonstrate its applicability in industry.
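
The adjustment idea is easiest to see in simulation. Below is a minimal Python sketch, not the authors' RTL implementation, of dynamic frequency adjusting: one clock's period is nudged at runtime so that the realized edge-count ratio between two clocks tracks a fixed rational constant with bounded accumulated error. The target ratio, periods, and nudge policy are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout, not the LDet design):
# keep the edge-count ratio of clocks A and B near a rational constant P/Q.
from fractions import Fraction

def simulate_dfa(ratio=Fraction(3, 2), period_a=2.0, nominal_period_b=3.01,
                 steps=100_000, nudge=0.005):
    """Two free-running clocks A and B; B's period is adjusted after each
    edge so that edges_a / edges_b stays near `ratio` (= P/Q)."""
    t_a = t_b = 0.0              # time of the next edge of each clock
    edges_a = edges_b = 0        # edge counters
    period_b = nominal_period_b
    max_err = 0
    for _ in range(steps):
        if t_a <= t_b:           # advance whichever clock fires first
            t_a += period_a
            edges_a += 1
        else:
            t_b += period_b
            edges_b += 1
        # accumulated error against the target ratio:
        # err = Q*edges_a - P*edges_b, which is 0 on the ideal schedule
        err = ratio.denominator * edges_a - ratio.numerator * edges_b
        # if A is ahead, speed B up; if B is ahead, slow B down
        period_b = nominal_period_b * (1 - nudge if err > 0 else 1 + nudge)
        max_err = max(max_err, abs(err))
    return max_err

print("max accumulated edge-count error:", simulate_dfa())  # stays small
```

With the ratio pinned to a rational constant and the accumulated error bounded, an asynchronous FIFO can be sized for that worst-case error, which is how, per the abstract, the DAF determinizes each transfer's processing time.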


Upload date

November 4, 2020

[Journal Paper] Iterative optimization for the data center

ACM SIGPLAN Notices, 2012, 47(4):

March 1, 2012

Abstract

Iterative optimization is a simple but powerful approach that searches for the best possible combination of compiler optimizations for a given workload. However, each program, if not each data set, potentially favors a different combination. As a result, iterative optimization is plagued by several practical issues that prevent it from being widely used: a large number of runs are required to find the best combination; the process can be data-set dependent; and the exploration incurs significant overhead that must be compensated for by performance benefits. Therefore, while iterative optimization has been shown to have significant performance potential, it is seldom used in production compilers. In this paper, we propose Iterative Optimization for the Data Center (IODC): we show that servers and data centers offer a context in which all of the above hurdles can be overcome. The basic idea is to spawn different combinations across workers and recollect performance statistics at the master, which then evolves toward the optimum combination of compiler optimizations. IODC carefully manages costs and benefits, and is transparent to the end user. We evaluate IODC using both MapReduce and throughput compute-intensive server applications. To reflect the large number of users interacting with the system, we gather a very large collection of data sets (at least 1,000 and up to several million unique data sets per program), for a total of 10.7 TB of storage and 568 days of CPU time. We report an average performance improvement of 1.48×, and up to 2.08×, for the MapReduce applications, and 1.14×, and up to 1.39×, for the throughput compute-intensive server applications.
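
To make the master/worker scheme concrete, here is a minimal, self-contained Python sketch of the loop the abstract describes: the master spawns different compiler-flag combinations across workers, recollects the measured statistics, and evolves toward the best combination. The flag list, the `measure` stand-in, and the mutation policy are illustrative assumptions, not the IODC implementation.

```python
# Minimal sketch of a master evolving compiler-flag combinations across
# workers. `measure` is a hypothetical stand-in for compile-and-run.
import random

FLAGS = ["-funroll-loops", "-ftree-vectorize", "-finline-functions",
         "-fomit-frame-pointer", "-fpeel-loops"]

def measure(combo):
    # Hypothetical stand-in: a real worker would compile the workload with
    # `combo` and report the measured run time on production requests.
    rng = random.Random(hash(frozenset(combo)))
    return 10.0 - 0.8 * len(combo) + rng.random()

def iodc_master(generations=20, workers=8, mutation_rate=0.3):
    best_time, best_combo = measure(frozenset()), frozenset()  # baseline
    for _ in range(generations):
        # spawn mutated combinations (in reality, across worker machines)
        candidates = []
        for _ in range(workers):
            combo = set(best_combo)
            for flag in FLAGS:
                if random.random() < mutation_rate:
                    combo ^= {flag}          # flip this flag on or off
            candidates.append(frozenset(combo))
        # recollect statistics at the master; keep the fastest combination
        for c in candidates:
            t = measure(c)
            if t < best_time:
                best_time, best_combo = t, c
    return best_time, best_combo

t, combo = iodc_master()
print(f"best time {t:.2f}s with flags {sorted(combo)}")
```

The data-center setting is what pays for the search: the many runs needed for exploration are amortized over the large population of users and data sets already executing the workload.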


Upload date

November 4, 2020

[Journal Paper] Program Regularization in Memory Consistency Verification

IEEE Transactions on Parallel and Distributed Systems, 2012, 23(11): 2163-217

January 31, 2012

Abstract

A widely adopted methodology for verifying the memory subsystem of a Chip Multiprocessor (CMP) is to verify executions of parallel test programs on the CMP against the given memory consistency model, which has long been known to be time-consuming in both theory and practice. To accelerate memory consistency verification, previous approaches have had to sacrifice either availability (e.g., by relying on dedicated hardware support that many commodity CMPs do not offer) or completeness (e.g., by missing some bugs). In the meantime, the impact of parallel test programs on memory consistency verification has been more or less overlooked; one piece of evidence is that few investigations have been dedicated to finding test programs that enable more efficient verification. From the novel perspective of the test program, we devise a practical technique called “program regularization,” which can effectively reduce the computation time of memory consistency verification. The key intuition behind program regularization is that any parallel program, if reformed appropriately, can enable efficient memory consistency verification. More specifically, for an original program, program regularization introduces some auxiliary memory addresses and periodically inserts load/store operations accessing these addresses into the original program. With the regularized program, memory consistency verification can be accomplished in linear time (with respect to the number of memory operations) when the number of processors is fixed. Experimental results show that program regularization can significantly accelerate memory consistency verification. Last but not least, our technique, which relies on neither a concrete verification algorithm nor dedicated hardware support, can be smoothly integrated into existing presilicon/postsilicon verification platforms of industrial CMPs to speed up memory consistency verification.
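
Here is a minimal Python sketch of the transformation the abstract describes, under an assumed representation: a test program is a list of (op, address) memory operations, and every few operations a store/load pair on a fresh auxiliary address is inserted. The period, address layout, and insertion pattern are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of program regularization over an assumed op-list format.

def regularize(program, period=4, aux_base=0x8000_0000):
    """Insert loads/stores on auxiliary addresses every `period` operations."""
    out, aux = [], aux_base
    for i, op in enumerate(program, start=1):
        out.append(op)
        if i % period == 0:
            # the auxiliary accesses act as periodic ordering landmarks in
            # the execution, which is what lets the verifier check the
            # regularized program in linear time in the number of operations
            out.append(("store", aux))
            out.append(("load", aux))
            aux += 8              # move to the next auxiliary address
    return out

original = [("load", 0x10), ("store", 0x10), ("load", 0x20),
            ("store", 0x20), ("load", 0x30), ("store", 0x30)]
for op in regularize(original, period=2):
    print(op)
```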


Upload date

November 4, 2020

[Journal Paper] Linear Time Memory Consistency Verification

IEEE Transactions on Computers, 2011, 61(4): 502-516

February 10, 2011

Abstract

Verifying the execution of a parallel program against a given memory consistency model (memory consistency verification) is a crucial problem in the functional validation of a Chip Multiprocessor (CMP). In the absence of additional information, the problem is known to be NP-hard. By adopting pending period information, this paper proposes the first linear-time software-based approach to memory consistency verification. Our approach relies on a novel technique called reusable cycle checking, which reuses previously derived order information when repeatedly checking for cycles at different frontiers. In the context of pending period information, this technique significantly reduces the overall computational cost of cycle checking, enabling linear-time (in the number of memory operations) memory consistency verification for any given multicore system with a constant number of processors. From a practical perspective, an industrial memory consistency verification tool, named XCHECK, has been developed based on our approach. XCHECK requires neither test-program constraints nor dedicated hardware support in postsilicon verification of many multiprocessor systems. Experimental results show that XCHECK is 3-10 times faster than a state-of-the-art software-based approach. XCHECK has been integrated into the verification platforms for the industrial multicore processor Godson-3B, where it found several bugs in the design.
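
The reuse idea can be sketched generically: keep previously derived ordering facts and propagate only the new ones when an edge is added at a new frontier, so each check is incremental rather than a from-scratch cycle search. The Python sketch below is an illustrative incremental-reachability check with made-up operation names, not the XCHECK algorithm itself.

```python
# Minimal sketch of reusing prior order information across cycle checks.
from collections import defaultdict

class IncrementalOrder:
    def __init__(self):
        self.reach = defaultdict(set)    # reach[u] = ops known to follow u

    def add_edge(self, u, v):
        """Record order u -> v; return False if it would close a cycle."""
        if u == v or u in self.reach[v]:
            return False                 # v already precedes u: cycle
        new_facts = {v} | self.reach[v]
        # update u and every op already known to precede u, reusing all
        # previously computed facts instead of recomputing them
        preceders = [w for w in list(self.reach) if u in self.reach[w]]
        for w in preceders + [u]:
            self.reach[w] |= new_facts
        return True

order = IncrementalOrder()
assert order.add_edge("store_A1", "load_B1")
assert order.add_edge("load_B1", "store_B2")
assert not order.add_edge("store_B2", "store_A1")   # closes a cycle
```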


Upload date

November 4, 2020

[Journal Paper] System Architecture of Godson-3 Multi-Core Processors

Journal of Computer Science and Technology, 2010, 25(): 181-191

March 16, 2010

Abstract

Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.


Collaborators

  • No collaborating authors yet