Sciencepaper Online (中国科技论文在线)

Uploaded: November 4, 2020

[Journal article] Motion Estimation Without Integer-Pel Search

IEEE Transactions on Image Processing, 2012, 22(4): 1340 - 135

November 20, 2012

Typical motion estimation (ME) consists of three main steps: spatial-temporal prediction, integer-pel search, and fractional-pel search. The integer-pel search, which seeks the best-matched integer-pel position within a search window, is considered crucial for video encoding. It occupies over 50% of the overall encoding time (when adopting the full search scheme) for software encoders, and introduces considerable area cost, memory traffic, and power consumption in hardware encoders. In this paper, we find that video sequences (especially high-resolution videos) can often be encoded effectively and efficiently even without integer-pel search. This counter-intuitive phenomenon arises not only because spatial-temporal prediction and fractional-pel search are accurate enough for the ME of many blocks. In fact, we observe that when the predicted motion vector deviates from the optimal motion vector (mainly for boundary blocks of irregularly moving objects), it is also hard for integer-pel search to reduce the final rate-distortion cost: the deviation of the reference position can be alleviated by fractional-pel interpolation and rate-distortion optimization techniques (e.g., adaptive macroblock mode). Considering the decreasing proportion of boundary blocks caused by the increasing resolution of videos, integer-pel search may be rather cost-ineffective in the high-resolution era. Experimental results on 36 typical sequences of different resolutions encoded with x264, a widely used video encoder, agree well with our analysis. For 1080p sequences, removing the integer-pel search saves 57.9% of the overall H.264 encoding time on average (compared to the original x264 with full integer-pel search using default parameters), while the resulting performance loss is negligible: the bit-rate increases by only 0.18%, and the peak signal-to-noise ratio decreases by only 0.01 dB per frame on average.

Uploaded: November 4, 2020

[Journal article] LDet: Determinizing Asynchronous Transfer for Postsilicon Debugging

IEEE Transactions on Computers, 2012, 62(9): 1732 - 174

June 5, 2012

Abstract

To debug silicon bugs efficiently and effectively, a promising solution is to determinize the chip, so that the buggy silicon behaviors can be faithfully reproduced on an RTL simulator. In this paper, we propose a novel scheme, named LDet, to determinize a chip by removing the nondeterminism in transfers crossing different clock domains, even when these clock domains are heterochronous. The key insight of LDet is that we can slightly adjust the frequencies of clocks at runtime so that the actual frequency ratio between two clocks always approaches a rational constant with bounded accumulated error. With this technique, called dynamic frequency adjusting, the processing time of each asynchronous transfer can be determinized with a deterministic asynchronous FIFO (DAF). As a consequence, the behavior of the whole chip is deterministic, and thus the chip behavior can be reproduced on the RTL simulator (given the same initial state and input sequence). We implement LDet on the RTL design of a processor chip with many clock domains. Experiments show that, on average, LDet adds only about one cycle of latency to each asynchronous transfer. As a result, LDet incurs a negligible performance overhead of about 0.7 percent slowdown. Moreover, LDet adds less than 0.2 percent area to the chip. The low performance and area overheads of LDet demonstrate its applicability in industry.
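
The idea of keeping a frequency ratio near a rational constant with bounded accumulated error can be illustrated with a toy simulation. The sketch below is not LDet's actual mechanism, which the abstract does not detail. It assumes a simple model in which a second clock drifts randomly around a target ratio and a hypothetical controller applies a clamped correction against the accumulated tick error. The function name `max_accumulated_error` and all of its parameters are made up for illustration.

```python
import random
from fractions import Fraction


def max_accumulated_error(ratio, steps, drift=0.01, adjust=True, seed=0):
    """Toy model (an assumption, not the paper's design).

    Per tick of clock A, clock B should advance by `ratio` ticks. Its
    free-running rate deviates by up to +/- drift * ratio per step.
    Returns the largest absolute accumulated error (actual minus ideal
    B ticks) seen over `steps` steps.
    """
    rng = random.Random(seed)
    target = float(ratio)
    limit = 2.0 * drift * target  # largest correction the controller may apply
    err = 0.0                     # accumulated error in B ticks
    worst = 0.0
    for _ in range(steps):
        # Random deviation of the free-running oscillator in this step.
        deviation = target * rng.uniform(-drift, drift)
        # Hypothetical controller: push against the accumulated error,
        # clamped to a small frequency adjustment.
        correction = max(-limit, min(limit, -err)) if adjust else 0.0
        err += deviation + correction
        worst = max(worst, abs(err))
    return worst


if __name__ == "__main__":
    r = Fraction(3, 2)
    print("with adjustment:   ", max_accumulated_error(r, 10_000, adjust=True))
    print("without adjustment:", max_accumulated_error(r, 10_000, adjust=False))
```

In this model the accumulated error with adjustment never exceeds one step's worth of drift (`drift * ratio`), because each correction cancels the error left by the previous step. Without adjustment the error performs a random walk and grows over time.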

Uploaded: November 4, 2020

[Journal article] Deterministic Replay Using Global Clock

ACM Transactions on Architecture and Code Optimization, 2013, 10(1): 1

April 1, 2013

Abstract

Debugging parallel programs is a well-known difficult problem. A promising method to facilitate debugging parallel programs is using hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as in previous investigations, LReplay is built upon recording the pending period information infused by the global clock. From the recorded pending period information, about 99% of execution orders are inferable, implying that LReplay only needs to directly record the remaining 1% of noninferable execution orders in a production run. The 1% of noninferable orders can be addressed by a simple yet cost-effective direction prediction technique, which further reduces the log size of LReplay. Benefiting from the preceding innovations, the overall log size of LReplay over the SPLASH-2 benchmarks is about 0.17 B/K-Inst (bytes per thousand instructions) for sequential consistency, and 0.57 B/K-Inst for Godson-3 consistency. Such log sizes are an order of magnitude smaller than those of previous deterministic replay schemes that incur no performance loss. Furthermore, LReplay consumes only about 0.5% of the area of the Godson-3 CMP, since it requires only trivial modifications to existing components of Godson-3. The features of LReplay demonstrate the potential of integrating hardware support for deterministic replay into future industrial processors.

Uploaded: November 4, 2020

[Journal article] Effective and efficient microprocessor design space exploration using unlabeled design configurations

ACM Transactions on Intelligent Systems and Technology, 2014, 5(1): 20

January 1, 2014

Abstract

Ever-increasing design complexity and advances in technology impose great challenges on the design of modern microprocessors. One such challenge is to determine promising microprocessor configurations that meet specific design constraints, which is called Design Space Exploration (DSE). In the computer architecture community, supervised learning techniques have been applied to DSE to build regression models for predicting the qualities of design configurations. For supervised learning, however, considerable simulation costs are required to obtain labeled design configurations. Given limited resources, it is difficult to achieve high accuracy. In this article, inspired by recent advances in semisupervised learning and active learning, we propose the COAL approach, which can exploit unlabeled design configurations to significantly improve the models. An empirical study demonstrates that COAL significantly outperforms a state-of-the-art DSE technique, reducing mean squared error by 35% to 95%, and thus promising architectures can be attained more efficiently.

Uploaded: November 4, 2020

[Journal article] An 8-Core MIPS-Compatible Processor in 32/28 nm Bulk CMOS

IEEE Journal of Solid-State Circuits, 2013, 49(1): 41 - 49

October 22, 2013

Abstract

This paper is an extension of Hu et al., ISSCC, 2013, and it introduces the 32/28 nm implementations of Godson-3B1500, which are 8-core MIPS-compatible microprocessors with vector extensions. Godson-3B1500 is fabricated in STMicroelectronics 32/28 nm high-κ metal-gate low-power bulk CMOS with 10 metal layers. It contains 1.14 billion transistors and operates at frequencies of 1.0 GHz to 1.5 GHz with a supply voltage ranging from 1.0 V to 1.3 V. Compared to its predecessor (Hu et al., ISSCC, 2011), Godson-3B1500 brings significant power efficiency improvements with enhanced performance (150 GFLOPS at 1.2 GHz) and reduced power dissipation (< 40 W), due not only to technology scaling but also to a great deal of design effort.
