中国科技论文在线

Upload date

November 4, 2020

[Journal article] Architecture Support for Task Out-of-Order Execution in MPSoCs

IEEE Transactions on Computers, 2014, 64(5): 1296 - 131

April 9, 2014

Multi-processor systems on chip (MPSoCs) have been widely used in embedded systems over the past decades. However, heterogeneous instruction set architectures (ISAs), programming interfaces, and software tool chains make it challenging to design and implement rapid prototypes efficiently for diverse applications. To address this problem, this paper proposes a novel high-level architecture support for automatic out-of-order (OoO) task execution on FPGA-based heterogeneous MPSoCs. The architecture support consists of a hierarchical middleware with an automatic task-level OoO parallel execution engine. Built on a hierarchical OoO layer model, the middleware identifies parallel regions and generates source code automatically. In addition, a runtime middleware, Task-Scoreboarding, analyzes inter-task data dependencies and automatically schedules and dispatches tasks using parameter renaming techniques. The middleware has been verified on a prototype built on an FPGA platform. Examples and a JPEG case study demonstrate that the model can substantially ease the burden on programmers and uncover task-level parallelism.
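The dependency analysis and parameter renaming mentioned in the abstract can be illustrated with a minimal software sketch. This is not the paper's middleware: the `Task` class, the `schedule` function, and the idea of grouping tasks into "waves" are assumptions made for illustration only. It shows how giving each written parameter a fresh version removes write-after-write and write-after-read conflicts, so that only true read-after-write dependencies constrain the order.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task descriptor: a name plus the parameters it reads and writes."""
    name: str
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)


def schedule(tasks):
    """Group tasks (given in program order) into waves that could run in parallel.

    Each write creates a new version of its parameter (renaming), so a task
    only waits for the producers of the versions it actually reads.
    """
    version = {}        # parameter name -> latest version number
    producer_wave = {}  # (parameter, version) -> wave in which it is produced
    waves = []
    for task in tasks:
        # Resolve inputs to the versions visible at this point in program order.
        sources = [(p, version.get(p, 0)) for p in task.reads]
        # A task runs one wave after the latest producer it depends on.
        wave = 0
        for src in sources:
            if src in producer_wave:
                wave = max(wave, producer_wave[src] + 1)
        # Rename outputs: every write gets a fresh version.
        for p in task.writes:
            version[p] = version.get(p, 0) + 1
            producer_wave[(p, version[p])] = wave
        while len(waves) <= wave:
            waves.append([])
        waves[wave].append(task.name)
    return waves


demo = [
    Task("T1", reads=["x"], writes=["a"]),
    Task("T2", reads=["a"], writes=["b"]),
    Task("T3", reads=["y"], writes=["a"]),  # rewrites "a": WAW with T1, WAR with T2
    Task("T4", reads=["a"], writes=["c"]),
]
print(schedule(demo))
```

In this sketch T3 can run alongside T1 even though both write `a`, because renaming gives T3 its own version of `a`; T4 then depends only on T3.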


Upload date

November 4, 2020

[Journal article] Pre-Silicon Bug Forecast

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2014, 33(3): 451 - 463

February 13, 2014

Abstract

The ever-intensifying time-to-market pressure imposes great challenges on the pre-silicon design phase of hardware. Before tape-out, a pre-silicon design has to be thoroughly inspected through time-consuming functional verification and code review to exclude bugs. A critical factor in the efficiency of both activities is the allocation of resources (e.g., computational resources and manpower) to the different modules of a design, which is conventionally guided by designers' experience. Such practices, though simple and straightforward, run a high risk of wasting resources on bug-free modules or missing bugs in buggy modules, and thus could affect the success and timeline of the tape-out. In this paper, we propose a novel framework called pre-silicon bug forecast to predict the bug information of hardware designs. In this framework, bug models are built via machine learning techniques to characterize the relationship between design characteristics and bug information, which can be leveraged to predict how bugs are distributed across the modules of the current design. The predicted bug information is sufficient to guide the allocation of resources among modules for efficient functional verification and code review. To evaluate the effectiveness of the proposed framework, we conducted detailed experiments on several open-source hardware projects. We also investigated the impact of different learning techniques and different sets of characteristics on the performance of bug models. Experimental results show that, with appropriate learning techniques and characteristics, about 90% of modules can be correctly predicted as buggy or clean, and the number of bugs in each module can also be accurately predicted.


Upload date

November 4, 2020

[Journal article] An 8-Core MIPS-Compatible Processor in 32/28 nm Bulk CMOS

IEEE Journal of Solid-State Circuits, 2013, 49(1): 41 - 49

October 22, 2013

Abstract

This paper is an extension of Hu et al., ISSCC, 2013, and it introduces the 32/28 nm implementations of Godson-3B1500, which are 8-core MIPS-compatible microprocessors with vector extensions. Godson-3B1500 is fabricated in STMicroelectronics 32/28 nm high-κ metal-gate low-power bulk CMOS with 10 metal layers. It contains 1.14 billion transistors and operates at frequencies from 1.0 GHz to 1.5 GHz with a supply voltage ranging from 1.0 V to 1.3 V. Compared to its predecessor (Hu et al., ISSCC, 2011), Godson-3B1500 brings significant power efficiency improvements with enhanced performance (150 GFLOPS @ 1.2 GHz) and reduced power dissipation (< 40 W), due not only to technology scaling but also to a great deal of design effort.


Upload date

November 4, 2020

[Journal article] Effective and efficient microprocessor design space exploration using unlabeled design configurations

ACM Transactions on Intelligent Systems and Technology, 2014, 5(1): 20

January 1, 2014

Abstract

Ever-increasing design complexity and advances in technology impose great challenges on the design of modern microprocessors. One such challenge is to determine promising microprocessor configurations that meet specific design constraints, which is called Design Space Exploration (DSE). In the computer architecture community, supervised learning techniques have been applied to DSE to build regression models for predicting the qualities of design configurations. For supervised learning, however, considerable simulation costs are required to obtain labeled design configurations. Given limited resources, it is difficult to achieve high accuracy. In this article, inspired by recent advances in semisupervised learning and active learning, we propose the COAL approach, which can exploit unlabeled design configurations to significantly improve the models. An empirical study demonstrates that COAL significantly outperforms a state-of-the-art DSE technique, reducing mean squared error by 35% to 95%, so that promising architectures can be attained more efficiently.


Upload date

November 4, 2020

[Journal article] Deterministic Replay Using Global Clock

ACM Transactions on Architecture and Code Optimization, 2013, 10(1): 1

April 1, 2013

Abstract

Debugging parallel programs is a well-known difficult problem. A promising method to facilitate debugging parallel programs is to use hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as in previous investigations, it records the pending period information infused by the global clock. From the recorded pending period information, about 99% of execution orders are inferrable, implying that LReplay only needs to directly record the residual 1% of noninferrable execution orders in a production run. The 1% of noninferrable orders can be addressed by a simple yet cost-effective direction prediction technique, which further reduces the log size of LReplay. Benefiting from the preceding innovations, the overall log size of LReplay over the SPLASH-2 benchmarks is about 0.17B/K-Inst (bytes per k-instruction) for sequential consistency, and 0.57B/K-Inst for the Godson-3 consistency. Such log sizes are an order of magnitude smaller than those of previous deterministic replay schemes that incur no performance loss. Furthermore, LReplay consumes only about 0.5% of the area of the Godson-3 CMP, since it requires only trivial modifications to existing components of Godson-3. The features of LReplay demonstrate the potential of integrating hardware support for deterministic replay into future industrial processors.
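The idea of inferring execution order from pending periods can be illustrated with a simplified sketch. It assumes, purely for illustration, that each operation is summarized by a `(start, end)` interval in global clock cycles and that two operations are ordered whenever their intervals do not overlap. The function names and the interval model are hypothetical and do not describe LReplay's actual hardware mechanism.

```python
from itertools import combinations


def infer_order(a, b):
    """Compare two pending periods given as (start, end) global-clock cycles.

    Returns -1 if a finished before b started, 1 if b finished before a
    started, and 0 if the periods overlap (order cannot be inferred).
    """
    if a[1] < b[0]:
        return -1
    if b[1] < a[0]:
        return 1
    return 0


def split_pairs(periods):
    """Split all operation pairs into inferrable orders and residual pairs.

    periods: dict mapping an operation name to its (start, end) interval.
    Inferrable pairs are returned as (earlier, later); residual pairs are
    the ones a replay log would have to record explicitly.
    """
    inferrable, residual = [], []
    for (i, a), (j, b) in combinations(periods.items(), 2):
        order = infer_order(a, b)
        if order < 0:
            inferrable.append((i, j))
        elif order > 0:
            inferrable.append((j, i))
        else:
            residual.append((i, j))
    return inferrable, residual


periods = {"A": (0, 3), "B": (5, 8), "C": (7, 9)}
inferrable, residual = split_pairs(periods)
print(inferrable, residual)
```

Here only the overlapping pair B and C would need an explicit log entry, which mirrors the abstract's point that most orders need not be recorded directly.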
