Hoseung Kim (Integrated Ph.D. Student)
Introduction

Full Bio Sketch

Mr. Kim is an integrated M.S.-Ph.D. student in Electronics Engineering at Kyungpook National University, Daegu, Republic of Korea. His research focuses on the design of low-power, high-speed microprocessor architectures based on RISC-V, such as accelerators and high-performance multi-core processors, to improve the performance of computational devices under a wide range of usage conditions. He is also interested in designing parallel-processing data structures and memory-reordering schemes across diverse computer architectures for efficient (low-power, high-performance) system operation. He aims to apply special-purpose embedded system-on-chip (SoC) devices to applications ranging from customized small-scale systems (IoT) to large-scale systems (big-data processing). Furthermore, he is pursuing flexible, hardware-software co-designed system structures for building highly optimized SoCs. To this end, he is studying computer architectures (RISC, CISC), dataflow- and RTL-level chip design and synthesis (FPGA), low-level software (kernel programming, operating systems), and bottom-up full-stack SoC design. His long-term objective is to integrate low-performance edge devices with high-performance cloud data servers, connecting AI systems across specialized embedded environments, and to become a professional computer architect and system designer.

Research Topic

Low-Power, High-Speed CNN Accelerator with Matrix Reordering Techniques for Small-Footprint Memory Access
Matrix and tensor workloads can be processed in parallel by exploiting the parallelism and independence inherent in matrix arithmetic. A processing unit and memory structure built around this property drastically reduces the number of repeated memory accesses and ALU operations on the same data elements compared with sequential processing, which naturally lowers power consumption and raises operating performance. In addition, by combining the hardware with embedded system software and new compilation methods, such as pruning, key-data extraction, and matrix data compression, the volume of data to be processed can be minimized in advance. This lightens the hardware's load at the compile, preprocessing, and operation stages, ultimately enabling a smaller hardware module and higher performance. The result is a highly efficient matrix-operation chip: high speed, small memory-access footprint, and low power consumption. Because the chip is based on RISC-V, it can easily fit into existing hardware boards and be deployed in a wide range of situations and devices where large volumes of matrix/tensor data must be processed.

Low-Power CNN Accelerator Memory Interface with Small-Footprint Memory Access
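The independence property described above can be sketched as a tiled matrix multiplication: each output tile depends only on blocks of the inputs, never on another tile's result, so tiles can be computed in parallel and each input block is reused across a whole tile rather than re-fetched per scalar operation. This is an illustrative sketch, not the author's implementation.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    """Block (tiled) matrix multiply; the two outer loops are independent."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    # Each (i, j) output tile reads only a row-block of A and a
    # column-block of B, so the (i, j) iterations could run in
    # parallel, and each block is loaded once per tile.
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.arange(16, dtype=float).reshape(4, 4)
B = np.arange(16, dtype=float).reshape(4, 4) + 1.0
assert np.allclose(tiled_matmul(A, B), A @ B)
```

In hardware, the same decomposition maps each tile to a separate processing unit, which is what allows the iterative memory-access and ALU sequences to shrink.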
To address the memory bottleneck, an application-specific CNN accelerator memory interface is required. The proposed Parallel Memory Bank Layer (PMBL) reshapes the data layout to reduce the memory-access footprint, reconstructing compressed CNN matrix data into a linear access order. Operation parameters, such as kernel size, input dimensions, and stride, determine the reordering sequence. Leveraging a multi-bank memory structure, PMBL can supply multiple operands simultaneously. Consequently, the accelerator's memory interface significantly reduces the memory-access footprint and alleviates bottlenecks at the accelerator-memory boundary, leading to lower power consumption and improved CNN performance.

Data Allocation Rearrangement on a CNN Accelerator Based on a Reshaped Systolic Tile Array Using Planarized Matrix Reordering Techniques
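The reordering idea can be sketched as follows: from the kernel size, input dimensions, and stride, derive the linear order in which the convolution consumes input addresses, then spread consecutive stream elements across banks so several operands can be fetched per cycle. All names here are ours, and the round-robin bank policy is an assumption for illustration, not the paper's actual PMBL scheme.

```python
def linear_access_order(h: int, w: int, k: int, stride: int) -> list:
    """Row-major input addresses, emitted in convolution-consumption order."""
    order = []
    for oy in range(0, h - k + 1, stride):      # output row
        for ox in range(0, w - k + 1, stride):  # output column
            for ky in range(k):                 # kernel taps of this window
                for kx in range(k):
                    order.append((oy + ky) * w + (ox + kx))
    return order

def bank_assignment(order: list, n_banks: int = 4) -> list:
    """Assumed policy: stream element i is placed in bank i % n_banks,
    so any n_banks consecutive operands live in distinct banks."""
    banks = [[] for _ in range(n_banks)]
    for i, addr in enumerate(order):
        banks[i % n_banks].append(addr)
    return banks

# 4x4 input, 2x2 kernel, stride 2: 4 windows x 4 taps = 16 accesses.
order = linear_access_order(h=4, w=4, k=2, stride=2)
assert len(order) == 16
assert order[:4] == [0, 1, 4, 5]  # first window: rows 0-1, cols 0-1
```

Because consecutive accesses land in different banks, a 4-bank layout under this policy can deliver four operands of the linearized stream in the same cycle.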
To improve efficiency, the proposed accelerator employs a systolic tile array of multiply-accumulate (MAC) units. The array includes two specialized MAC tile types: performance tiles, which support dual-thread or high-resolution modes and are optimized for throughput, and efficiency tiles, which are optimized for low power. The inherently parallel dataflow of this tile array accelerates matrix multiplication. Even if the computation runs quickly, however, it is meaningless if the data are not ready just as fast. To balance compute and I/O between the accelerator and memory, the accelerator integrates a high-throughput memory interface. A planarized matrix-reordering stage reshapes tensors and redistributes data across banks, converting compressed CNN matrices into contiguous, linear access streams. Operation parameters (e.g., kernel size, input dimensions, stride) determine the reordering schedule. This reduces the memory-access footprint, improves burst efficiency, and supplies multiple operands per cycle. In conclusion, we aim to reformulate CNN operations as lower-dimensional matrix multiplications to increase processing performance and reduce energy consumption. With this accelerator architecture, our goal is to enhance the efficiency of AI workloads so they run seamlessly even on mobile devices with limited computational resources.

Publications

Conference Publications (SCI 1, KCI 1)
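The reformulation of a convolution as a matrix multiplication can be illustrated with the standard im2col lowering, which is the usual way such workloads are mapped onto a systolic MAC array; this is shown for intuition only, and the function names are ours, not the paper's.

```python
import numpy as np

def im2col(x: np.ndarray, k: int, stride: int = 1) -> np.ndarray:
    """Flatten each kxk sliding window of x into one column.
    Result shape: (k*k, number_of_windows)."""
    h, w = x.shape
    cols = []
    for oy in range(0, h - k + 1, stride):
        for ox in range(0, w - k + 1, stride):
            cols.append(x[oy:oy + k, ox:ox + k].ravel())
    return np.stack(cols, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
kern = np.ones((3, 3))

# Convolution as GEMM: flattened kernel (1 x 9) times im2col matrix (9 x 4).
y = kern.ravel() @ im2col(x, 3)

# Reference: direct sliding-window computation.
ref = np.array([(x[i:i + 3, j:j + 3] * kern).sum()
                for i in range(2) for j in range(2)])
assert np.allclose(y, ref)
```

Once lowered this way, every output position becomes one dot product of the GEMM, which is exactly the operation a systolic array of MAC tiles streams through most efficiently.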
Conference Publications (Intl. 4)
Participation in International Conference
Last Updated: 2025.09.09