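Chen Yunji (陈云霁)
Ph.D. Research Professor Doctoral Supervisor
Institute of Computing Technology, Chinese Academy of Sciences
Microarchitecture; machine learning; parallel computing; video processing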
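- Name: Chen Yunji
- Current status: Active researcher
- Supervisor role: Doctoral supervisor
- Degree: Ph.D.
- Academic title: Doctoral supervisor
- Professional title: Senior, Research Professor
- Discipline: Computer architecture
- Research interests: Microarchitecture; machine learning; parallel computing; video processing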
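Chen Yunji, male, born in 1983 in Nanchang, Jiangxi, is a research professor and doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences. He is also a distinguished research fellow at the CAS Center for Excellence in Brain Science and a professor at the University of Chinese Academy of Sciences. He leads a research group with a long-term focus on deep learning processors, and an article in Science described him as a "pioneer" of the field and as being among its "leaders". He has published more than 60 papers in conferences and journals including ISCA, HPCA, MICRO, ASPLOS, ICSE, ISSCC, Hot Chips, IJCAI, FPGA, SPAA, IEEE Micro, and eight IEEE/ACM Transactions. Chen Yunji has received the National Science Fund for Distinguished Young Scholars, the China Youth Science and Technology Award, the National Innovation Pioneer Award, and the Excellent Young Scientists Fund of the National Natural Science Foundation of China. He has been named a "Most Beautiful Science and Technology Worker", has received the Xplorer Prize (科学探索奖), and was named one of the 35 outstanding young innovators worldwide by MIT Technology Review (2015).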
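Education
2002.9-2007.7 Combined master's and doctoral program, Institute of Computing Technology, Chinese Academy of Sciences
1997.9-2002.7 Undergraduate, Special Class for the Gifted Young, University of Science and Technology of China
1992.7-1997.9 Secondary school student, gifted youth class, Nanchang No. 10 Middle School
Research Experience
2012.9- Research Professor, Institute of Computing Technology, Chinese Academy of Sciences
2009.9-2012.9 Associate Research Professor, Institute of Computing Technology, Chinese Academy of Sciences
2007.7-2009.9 Assistant Research Professor, Institute of Computing Technology, Chinese Academy of Sciences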
[Journal article] Godson-3: A Scalable Multicore RISC Processor with x86 Emulation
IEEE Micro, 2009, 29(2): 17-29
April 7, 2009
The Godson-3 microprocessor aims at high-throughput server applications, high-performance scientific computing, and high-end embedded applications. It offers a scalable network on chip, hardware support for x86 emulation, and a reconfigurable architecture. The four-core Godson-3 chip is fabricated with 65-nm CMOS technology. Eight- and 16-core Godson-3 chips are in development.
[Journal article] Global Clock, Physical Time Order and Pending Period Analysis in Multiprocessor Systems
arXiv, 2009
July 12, 2009
In multiprocessor systems, various problems are treated with Lamport's logical clock and the resultant logical time orders between operations. In practice, however, one often faces high complexity caused by the lack of logical time order information. In this paper, we utilize the global clock to infuse a so-called pending period into each operation in a multiprocessor system, where the pending period is a time interval that contains the performed time of the operation. Further, we define the physical time order for any two operations with disjoint pending periods. The physical time order is obeyed by any real execution in multiprocessor systems, because it is part of the operation orders that actually occurred as restricted by the global clock, and it is proven to be independent of and consistent with traditional logical time orders. These novel yet fundamental concepts enable new effective approaches for analyzing multiprocessor systems, which are named pending period analysis as a whole. As a consequence of pending period analysis, many important problems of multiprocessor systems can be tackled effectively. As a significant application example, complete memory consistency verification, which was known to be an NP-hard problem, can be solved with a complexity of O(n^2) (where n is the number of operations). Moreover, the two event ordering problems, which were proven to be Co-NP-hard and NP-hard respectively, can both be solved with a time complexity of O(n) if restricted by pending period information.
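The physical time order described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Op` class, its field names, and the choice of a strict comparison (`u.end < v.start`) as the test for disjoint pending periods are assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Op:
    """Hypothetical memory operation with a pending period [start, end]."""
    name: str
    start: int  # global-clock time at which the pending period begins
    end: int    # global-clock time at which the pending period ends


def physically_before(u: Op, v: Op) -> bool:
    # Assumption: periods are disjoint when one ends strictly before the other begins.
    return u.end < v.start


def physical_time_order(ops):
    """Return (earlier, later) name pairs for operations with disjoint pending periods."""
    pairs = []
    for u, v in combinations(ops, 2):
        if physically_before(u, v):
            pairs.append((u.name, v.name))
        elif physically_before(v, u):
            pairs.append((v.name, u.name))
        # Overlapping periods: the physical time order says nothing about this pair.
    return pairs


ops = [Op("st_x", 0, 4), Op("ld_x", 3, 7), Op("ld_y", 8, 10)]
order = physical_time_order(ops)
print(order)
```

In this toy input, `st_x` and `ld_x` overlap and stay unordered, while both are ordered before `ld_y`. The pairwise loop is quadratic and only meant to show the definition, not the analysis algorithms of the paper.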
[Journal article] System Architecture of Godson-3 Multi-Core Processors
Journal of Computer Science and Technology, 2010, 25: 181-191
March 16, 2010
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications including x86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.
[Journal article] Linear Time Memory Consistency Verification
IEEE Transactions on Computers, 2011, 61(4): 502-516
February 10, 2011
Verifying the execution of a parallel program against a given memory consistency model (memory consistency verification) is a crucial problem in the functional validation of a Chip Multiprocessor (CMP). In the absence of additional information, this problem is known to be NP-hard. By adopting pending period information, this paper proposes the first linear-time software-based approach to memory consistency verification. Our approach relies on a novel technique called reusable cycle checking, which reuses the previous order information when repeatedly checking for cycles at different frontiers. In the context of pending period information, this technique significantly reduces the overall computational cost of cycle checking, enabling linear-time (in the number of memory operations) memory consistency verification for any given multicore system with a constant number of processors. From a practical perspective, an industrial memory consistency verification tool, named XCHECK, has been developed based on our approach. XCHECK requires neither test program constraints nor dedicated hardware support in the postsilicon verification of many multiprocessor systems. Experimental results show that XCHECK is 3-10 times faster than a state-of-the-art software-based approach. XCHECK has been integrated into the verification platforms for an industrial multicore processor, Godson-3B, and has found several bugs in the design.
[Journal article] Program Regularization in Memory Consistency Verification
IEEE Transactions on Parallel and Distributed Systems, 2012, 23(11): 2163-217
January 31, 2012
A widely adopted methodology for verifying the memory subsystem of a Chip Multiprocessor (CMP) is to verify executions of parallel test programs on the CMP against the given memory consistency model, which has long been known to be time consuming in both theory and practice. To accelerate memory consistency verification, previous approaches have to bear the cost of availability (e.g., relying on dedicated hardware support that many commodity CMPs do not offer) or completeness (e.g., missing some bugs). In the meantime, the impact of parallel programs on memory consistency verification has more or less been overlooked. One piece of evidence is that few investigations have been dedicated to finding appropriate test programs that enable more efficient verification. From the novel perspective of the test program, we devise a practical technique called "program regularization," which can effectively reduce the computation time of memory consistency verification. The key intuition behind program regularization is that any parallel program, if reformed appropriately, can enable efficient memory consistency verification. More specifically, for an original program, program regularization introduces some auxiliary memory addresses and periodically inserts load/store operations accessing these addresses into the original program. With the regularized program, memory consistency verification can be accomplished in linear time (with respect to the number of memory operations) when the number of processors is fixed. Experimental results show that program regularization can significantly accelerate memory consistency verification. Last but not least, our technique, which does not rely on a concrete verification algorithm or dedicated hardware support, can be smoothly integrated into existing presilicon/postsilicon verification platforms of industrial CMPs to speed up memory consistency verification.
[Journal article] Motion Estimation Without Integer-Pel Search
IEEE Transactions on Image Processing, 2012, 22(4): 1340-135
November 20, 2012
Typical motion estimation (ME) consists of three main steps: spatial-temporal prediction, integer-pel search, and fractional-pel search. The integer-pel search, which seeks the best matched integer-pel position within a search window, is considered to be crucial for video encoding. It occupies over 50% of the overall encoding time (when adopting the full search scheme) for software encoders, and introduces remarkable area cost, memory traffic, and power consumption to hardware encoders. In this paper, we find that video sequences (especially high-resolution videos) can often be encoded effectively and efficiently even without integer-pel search. This counter-intuitive phenomenon arises not only because spatial-temporal prediction and fractional-pel search are accurate enough for the ME of many blocks. In fact, we observe that when the predicted motion vector deviates from the optimal motion vector (mainly for boundary blocks of irregularly moving objects), it is also hard for integer-pel search to reduce the final rate-distortion cost: the deviation of the reference position can be alleviated with fractional-pel interpolation and rate-distortion optimization techniques (e.g., adaptive macroblock mode). Considering the decreasing proportion of boundary blocks caused by the increasing resolution of videos, integer-pel search may be rather cost-ineffective in the era of high resolution. Experimental results on 36 typical sequences of different resolutions encoded with x264, a widely used video encoder, agree well with our analysis. For 1080p sequences, removing the integer-pel search saves 57.9% of the overall H.264 encoding time on average (compared to the original x264 with full integer-pel search using default parameters), while the resultant performance loss is negligible: the bit-rate is increased by only 0.18%, and the peak signal-to-noise ratio is decreased by only 0.01 dB per frame on average.
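For orientation, the integer-pel full search that this paper proposes to skip can be sketched as exhaustive block matching. This is a simplified illustration, not the x264 implementation: the function names are hypothetical, frames are plain lists of rows, and the cost is assumed to be a sum of absolute differences (SAD) rather than the rate-distortion cost a real encoder would use.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the integer motion vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, radius):
    """Try every integer displacement within +/- radius; return (cost, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = (sad(cur, ref, bx, by, 0, 0, n), 0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose reference block would leave the frame.
            if by + dy < 0 or bx + dx < 0 or by + dy + n > h or bx + dx + n > w:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[0]:
                best = (cost, dx, dy)
    return best


# Toy frames: `cur` is `ref` shifted by 2 pixels in x and 1 pixel in y (wrapping).
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[(y + 1) % 8][(x + 2) % 8] for x in range(8)] for y in range(8)]
best = full_search(cur, ref, bx=2, by=2, n=2, radius=2)
print(best)
```

The search evaluates (2 * radius + 1)^2 candidates per block, which is why the abstract reports it dominating encoding time under the full search scheme.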
[Journal article] LDet: Determinizing Asynchronous Transfer for Postsilicon Debugging
IEEE Transactions on Computers, 2012, 62(9): 1732-174
June 5, 2012
To efficiently and effectively debug silicon bugs, a promising solution is to determinize the chip, so that the buggy silicon behaviors can be faithfully reproduced on an RTL simulator. In this paper, we propose a novel scheme, named LDet, to determinize a chip by removing the nondeterminism in transfers crossing different clock domains, even when these clock domains are heterochronous. The key insight of LDet is that we can slightly adjust the frequencies of clocks at runtime so that the actual frequency ratio between two clocks always approaches a rational constant with bounded accumulated error. With this technique, called dynamic frequency adjusting, the processing time of each asynchronous transfer can be determinized with a deterministic asynchronous FIFO (DAF). As a consequence, the behavior of the whole chip is deterministic, and thus the chip behavior can be reproduced on the RTL simulator (given the same initial state and input sequence). We implement LDet on the RTL design of a processor chip with many clock domains. Experiments show that, on average, LDet adds only about one cycle of latency to each asynchronous transfer. As a result, LDet incurs a negligible performance overhead of about 0.7 percent slowdown. Moreover, LDet adds less than 0.2 percent area to the chip. The low performance and area overheads of LDet demonstrate its applicability in industry.
[Journal article] Deterministic Replay Using Global Clock
ACM Transactions on Architecture and Code Optimization, 2013, 10(1): 1
April 1, 2013
Debugging parallel programs is a well-known difficult problem. A promising method to facilitate debugging parallel programs is using hardware support to achieve deterministic replay on a Chip Multi-Processor (CMP). As a Design-For-Debug (DFD) feature, a practical hardware-assisted deterministic replay scheme should have low design and verification costs, as well as a small log size. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as in previous investigations, LReplay is built upon recording the pending period information infused by the global clock. From the recorded pending period information, about 99% of execution orders are inferrable, implying that LReplay only needs to directly record the remaining 1% of noninferrable execution orders in a production run. The 1% of noninferrable orders can be addressed by a simple yet cost-effective direction prediction technique, which further reduces the log size of LReplay. Benefiting from the preceding innovations, the overall log size of LReplay over the SPLASH-2 benchmarks is about 0.17 B/K-Inst (bytes per k-instruction) for sequential consistency, and 0.57 B/K-Inst for Godson-3 consistency. Such log sizes are an order of magnitude smaller than those of previous deterministic replay schemes that incur no performance loss. Furthermore, LReplay consumes only about 0.5% of the area of the Godson-3 CMP, since it requires only trivial modifications to existing components of Godson-3. The features of LReplay demonstrate the potential of integrating hardware support for deterministic replay into future industrial processors.
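To give a sense of scale for the log sizes quoted in this abstract, the following back-of-the-envelope sketch converts them to a total log size for a run of a given length. The helper name `log_bytes` and the one-billion-instruction run are hypothetical, and "K-Inst" is assumed here to mean 1,000 instructions.

```python
def log_bytes(instructions: int, bytes_per_kinst: float) -> float:
    """Estimated log size in bytes, assuming 1 K-Inst = 1,000 instructions."""
    return instructions / 1000 * bytes_per_kinst


RUN_LENGTH = 1_000_000_000  # hypothetical run of one billion instructions

sc_log = log_bytes(RUN_LENGTH, 0.17)      # sequential consistency figure
godson_log = log_bytes(RUN_LENGTH, 0.57)  # Godson-3 consistency figure

print(f"sequential consistency: about {sc_log / 1000:.0f} KB")
print(f"Godson-3 consistency:   about {godson_log / 1000:.0f} KB")
```

Under these assumptions, a billion-instruction run would need a log on the order of a few hundred kilobytes.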
ACM Transactions on Intelligent Systems and Technology, 2014, 5(1): 20
January 1, 2014
Ever-increasing design complexity and advances of technology impose great challenges on the design of modern microprocessors. One such challenge is to determine promising microprocessor configurations to meet specific design constraints, which is called Design Space Exploration (DSE). In the computer architecture community, supervised learning techniques have been applied to DSE to build regression models for predicting the qualities of design configurations. For supervised learning, however, considerable simulation costs are required for attaining the labeled design configurations. Given limited resources, it is difficult to achieve high accuracy. In this article, inspired by recent advances in semisupervised learning and active learning, we propose the COAL approach which can exploit unlabeled design configurations to significantly improve the models. Empirical study demonstrates that COAL significantly outperforms a state-of-the-art DSE technique by reducing mean squared error by 35% to 95%, and thus, promising architectures can be attained more efficiently.
[Journal article] An 8-Core MIPS-Compatible Processor in 32/28 nm Bulk CMOS
IEEE Journal of Solid-State Circuits, 2013, 49(1): 41-49
October 22, 2013
This paper is an extension of Hu et al., ISSCC, 2013, and it introduces the 32/28 nm implementations of Godson-3B1500, which are 8-core MIPS-compatible microprocessors with vector extensions. Godson-3B1500 is fabricated in STMicroelectronics 32/28 nm high-κ metal-gate low-power bulk CMOS with 10 metal layers. It contains 1.14 billion transistors and operates at frequencies from 1.0 GHz to 1.5 GHz with a supply voltage ranging from 1.0 V to 1.3 V. Compared to its predecessor (Hu et al., ISSCC, 2011), Godson-3B1500 brings significant power efficiency improvements with enhanced performance (150 GFLOPS at 1.2 GHz) and reduced power dissipation (< 40 W), due not only to technology scaling but also to a great deal of design effort.