
Yi-Ping You

Associate Professor
CS @ NYCU

   +886-3-5712121 ext. 56688
   ypyou (at) cs.nycu.edu.tw

Publications



 
 

Journal Papers

  1. Cyber-Physical Systems (CPS) are increasingly used in many complex applications, such as autonomous delivery drones, automotive CPS designs, power grid control systems, and medical robotics. However, existing programming languages lack certain design patterns for CPS designs, including temporal semantics and concurrency models, and future research may involve programming language extensions to support CPS designs. Meanwhile, JSF++, MISRA, and MISRA C++ provide specifications intended to increase the reliability of safety-critical systems. This article describes the development of rule checkers based on the MISRA C++ specification using the Clang open-source tool, which allows for the annotation of code and the easy extension of the MISRA C++ specification to other programming languages and systems; this is potentially useful for future CPS language research extensions that work with reliability software specifications using the Clang tool. Experiments were performed on key C++ benchmarks to validate our method against the well-known commercial tool Coverity. We illustrate key rules related to classes, inheritance, templates, overloading, and exception handling, along with open-source benchmarks that violate the rules detected by our checkers. A random graph generator is further used to generate diamond-shaped multiple-inheritance test cases for our software validation. The experimental results demonstrate that our method provides more detailed information than Coverity for nine open-source C++ benchmarks. Since the Clang tool is widely used, it will further allow developers to annotate their own extensions.

    Link
  2. With the rapid growth of deep learning models and deep learning-based applications, how to accelerate the inference of deep neural networks, especially neural network operators, has become an increasingly important research area. As a bridge between a front-end deep learning framework and a back-end hardware platform, deep learning compilers aim to optimize various deep learning models for a range of hardware platforms with model- and hardware-specific optimizations. Apache TVM (or TVM for short), a well-known open-source deep learning compiler, uses a customized domain-specific language, called the Tensor Expression Language, to define hardware-specific optimizations for neural network operators. TVM also allows users to write tensor expressions to design customized optimizations for specific operators. However, TVM does not assist users with supporting information, such as what computations are performed within an operator, or with tools for optimizing the operators in a deep learning model. In addition, tensor expressions have an entirely different syntax from imperative languages and are not easy to get started with. Furthermore, although TVM comes with an auto-tuning module, called AutoTVM, which facilitates the tuning of optimization configurations (e.g., tiling size and loop order), AutoTVM takes quite a long time to search for the optimal configurations for a set of optimizations. In this paper, we present DLOOPT, an optimization assistant that assists optimization developers in designing effective optimizations for neural network operators and/or obtaining optimal optimization configurations in a timely manner. DLOOPT specifically addresses three key aspects: (1) developers can focus only on designing optimizations by using DLOOPT, which offers sufficient information about the operators of a given model and provides an easier way to write optimizations; (2) the number of optimizations that developers need to design can be minimized by using DLOOPT, which allows optimizations to be reused; and (3) the tuning process can be greatly simplified by using DLOOPT, which implements a set of tuning strategies in AutoTVM. The evaluation results showed that DLOOPT reduced the time required to develop adequate optimizations for the operators in a model by more than 99%. We believe that DLOOPT is friendly to optimization developers and allows them to quickly develop effective optimizations for neural network operators.

    Link
  3. Finding a good compiler autotuning methodology, particularly for selecting the right set of optimisations and finding the best ordering of these optimisations for a given code fragment, has been a long-standing problem. With the rapid development of machine learning techniques, tackling the problem of compiler autotuning using machine learning or deep learning has become increasingly common in recent years. Many deep learning models have been proposed to solve problems such as predicting the optimal heterogeneous mapping or thread-coarsening factor; however, very few have revisited the problem of optimisation phase tuning. In this paper, we intend to revisit and tackle the problem using deep learning techniques. Unfortunately, the problem is too complex to be addressed in its full scope. We therefore present a new problem, called reduced O3 subsequence labelling, where a reduced O3 subsequence is defined as a subsequence of the O3 optimisation passes that contains no useless passes; this simplified problem is expected to be a stepping stone towards solving the optimisation phase tuning problem. We formulated the problem and attempted to solve it using genetic algorithms. We believe that, with mature deep learning techniques, a machine learning model that predicts the reduced O3 subsequences or even the best O3 subsequence for a given code fragment could be developed and used to prune the search space of the optimisation phase tuning problem, thereby shortening the tuning process and providing more effective tuning results.

    Link
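The genetic-algorithm search described above can be sketched in a few lines. This is a toy model, not the paper's implementation: the pass names, the `USEFUL` set, and the `fitness` function are hypothetical stand-ins for actually recompiling a code fragment with each candidate subsequence and measuring the result.

```python
import random

# Hypothetical stand-ins: a real implementation would invoke the compiler
# with each candidate subsequence and measure the resulting code quality.
O3_PASSES = ["sroa", "gvn", "licm", "inline", "instcombine", "loop-unroll"]
USEFUL = {"sroa", "gvn", "inline"}  # assumed ground truth for this toy fitness

def fitness(mask):
    # Reward keeping useful passes and dropping useless ones.
    return sum(1 for keep, p in zip(mask, O3_PASSES) if keep == (p in USEFUL))

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Each individual is a bit mask over the O3 pass list.
    pop = [[rng.randint(0, 1) for _ in O3_PASSES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(O3_PASSES))
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.1:              # occasional bit-flip mutation
                i = rng.randrange(len(child))
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [p for keep, p in zip(best, O3_PASSES) if keep]
```

With elitism the best candidate is never lost, so the search converges quickly on this tiny 6-bit space; the real search space over full O3 pipelines is vastly larger.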
  4. Binary translators, which translate binary executables from one instruction set to another, are useful tools. Indirect branches are one of the key factors that affect the efficiency of binary translators. In previous research, our lab developed an LLVM-based binary translation framework, called Rabbit, which introduces two novel optimisations, platform-dependent hyperchaining and platform-independent hyperchaining, for improving the emulation of indirect branch instructions. Indirect branch instructions may have several destinations, and these destinations are not known until runtime. Both platform-independent and platform-dependent hyperchaining establish a search table for each indirect branch instruction to record the visited branch destinations at runtime. In this work, we focus on the translation from AArch64 binaries to RISC-V binaries and further develop a profile-guided optimisation for indirect branches, which collects runtime information, including the branch destinations and the execution frequency of each destination for each indirect branch instruction, and then uses this information to improve hyperchaining (i.e. to accelerate the process of finding the branch destination). The profile-guided optimisation can be divided into profile-guided platform-independent hyperchaining and profile-guided platform-dependent hyperchaining. We use the SPEC CPU 2006 CINT benchmarks to evaluate the optimisations. The experimental results indicate that, compared with (1) no chaining, (2) platform-independent hyperchaining, and (3) platform-dependent hyperchaining, profile-guided platform-independent hyperchaining provides 1.123×, 1.066×, and 1.098× speedups, respectively. Similarly, profile-guided platform-dependent hyperchaining achieves 1.106×, 1.047×, and 1.083× speedups with respect to the same three configurations.

    Link
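The per-branch search table behind hyperchaining can be modelled compactly. This is a toy sketch, not Rabbit's implementation: translated code blocks are stood in for by Python callables, and the profile information is simply a hit counter that moves hot destinations to the front of the probe order, which is the intuition behind the profile-guided variant.

```python
# Toy model of an indirect-branch search table: each indirect branch site
# keeps a cache mapping guest target addresses to translated code blocks
# (modelled here as Python callables). Hit counts let hot targets be
# probed first, mimicking profile-guided hyperchaining.
class IndirectBranchSite:
    def __init__(self, translate):
        self.translate = translate        # slow path: translate a new target
        self.table = {}                   # guest address -> (count, code)
        self.probes = 0                   # linear probes, for the experiment

    def dispatch(self, target):
        # Probe cached targets in descending hotness order.
        for addr, (count, code) in sorted(self.table.items(),
                                          key=lambda kv: -kv[1][0]):
            self.probes += 1
            if addr == target:
                self.table[addr] = (count + 1, code)
                return code()
        code = self.translate(target)     # miss: translate and cache
        self.table[target] = (1, code)
        return code()
```

After a few dispatches to the same hot target, the table probes it first, so the common case costs a single comparison.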
  5. Heterogeneous computing has become popular in the past decade. Many frameworks have been proposed to provide a uniform way to program accelerators such as GPUs, DSPs, and FPGAs. Among them, an open and royalty-free standard, OpenCL, is widely adopted by the industry. However, many OpenCL-enabled accelerators and the standard itself do not support preemptive multitasking. To the best of our knowledge, previously proposed techniques are either not portable or cannot handle ill-designed kernels (the code that is executed on the accelerators) that never finish. This paper presents a framework (called CLPKM) that provides an abstraction layer between OpenCL applications and the underlying OpenCL runtime to enable preemption of a kernel execution instance based on a software checkpointing mechanism. CLPKM includes (1) an OpenCL runtime library that intercepts OpenCL API calls, (2) a source-to-source compiler that performs the preemption-enabling transformation, and (3) a daemon that schedules OpenCL tasks using priority-based preemptive scheduling techniques. Experiments demonstrated that CLPKM reduced the slowdown of high-priority processes from 4.66x to 1.52–2.23x under up to 16 low-priority, heavy-workload processes running in the background, while causing an average of 3.02–6.08x slowdown for low-priority processes.

    Link
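The effect of a preemption-enabling transformation can be illustrated with a toy kernel. This is an illustrative sketch, not CLPKM's actual transformation: the transformed loop checks a preemption predicate and, when asked to stop, returns a checkpoint (loop index plus partial results) from which execution can later resume.

```python
# Toy sketch of a checkpoint-based preemptible kernel: a hypothetical
# source-to-source transformation inserts a preemption check into the
# kernel's loop and saves enough state (here, the loop index and the
# partial accumulator) to resume later.
def preemptible_kernel(data, state=None, should_preempt=lambda i: False):
    start, acc = state if state else (0, 0)
    for i in range(start, len(data)):
        if should_preempt(i):
            # Checkpoint: loop index plus partial results.
            return ("preempted", (i, acc))
        acc += data[i] * data[i]          # the kernel's actual work
    return ("done", acc)
```

A scheduler daemon would set the preemption flag for low-priority kernels and later re-invoke them with the saved checkpoint.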
  6. Graphics processing units (GPUs) are now widely used in embedded systems for manipulating computer graphics and even for general-purpose computation. However, many embedded systems have to manage highly restricted hardware resources in order to achieve high performance or energy efficiency. The number of registers is one of the common limiting factors in an embedded GPU design. Programs that run with a small number of registers may suffer from high register pressure if register allocation is not properly designed, especially on a GPU in which each register is divided into four elements that can be accessed separately, because allocating a full register to a vector-type variable that does not contain values in all elements wastes register space. In this article, we present a vector-aware register allocation framework to improve register utilization on shader architectures. The framework involves two major components: (1) element-based register allocation, which allocates registers based on the element requirements of variables, and (2) register packing, which rearranges the elements of registers in order to increase the number of contiguous free elements, thereby keeping more live variables in registers. Experimental results on a cycle-approximate simulator showed that the proposed framework eliminated 92% of register spills in total and made 91.7% of the 14 common shader programs spill-free. These results indicate an opportunity for energy management of the space used for storing spilled variables, with the framework improving performance by a geometric mean of 8.3%, 16.3%, and 29.2% for general shader processors in which variables are spilled to memory with 5-, 10-, and 20-cycle access latencies, respectively. Furthermore, the reduction in the register requirements of programs enabled another 11 programs with high register pressure to run on a lightweight GPU.

    Link
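The two components can be modelled on a toy 4-element register file: element-based allocation claims only as many element slots as a variable needs, and packing compacts live elements so that freed slots become contiguous again. The class below is an illustrative sketch, not the paper's framework (a real packer would also have to rewrite operands and emit move instructions).

```python
# Toy model of a shader register file whose registers each hold 4 elements.
class RegisterFile:
    def __init__(self, num_regs):
        self.regs = [[None] * 4 for _ in range(num_regs)]

    def alloc(self, name, width):
        # Element-based allocation: claim `width` contiguous free elements.
        for reg in self.regs:
            for start in range(4 - width + 1):
                if all(reg[i] is None for i in range(start, start + width)):
                    for i in range(start, start + width):
                        reg[i] = name
                    return True
        return False                      # would spill without packing

    def free(self, name):
        for reg in self.regs:
            for i, v in enumerate(reg):
                if v == name:
                    reg[i] = None

    def pack(self):
        # Register packing: shift live elements left so free elements
        # become contiguous (real hardware would need move instructions).
        for reg in self.regs:
            live = [v for v in reg if v is not None]
            reg[:] = live + [None] * (4 - len(live))
```

In the sketch, freeing two scattered scalars leaves no room for a 2-element vector until `pack()` runs, which is exactly the spill the framework avoids.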
  7. The Dalvik virtual machine (VM) is an integral component used to execute applications in Android, which is one of the leading operating systems for mobile devices. The Dalvik VM is an interpreter and is equipped with a trace-based just-in-time compiler for enhancing the execution performance of frequently executed paths, or traces. However, traces generated by the Dalvik VM may terminate at a conditional branch or a method call/return, which means that these traces usually have short lifetimes, decreasing the effectiveness of the compiler optimizations applied to them. Furthermore, the just-in-time compiler applies only a few simple optimizations because of performance considerations. In this article we present a traces-to-region (T2R) framework that extends traces into regions and statically compiles these regions into native binaries so as to improve the execution of Android applications. The T2R framework involves three main stages: (i) the profiling stage, in which the run-time trace information of an application is extracted; (ii) the compilation stage, in which regions are constructed from the extracted traces and statically compiled into a native binary; and (iii) the execution stage, in which the compiled binary is loaded into the code cache when the application starts to execute. Experiments performed on an Android tablet demonstrated that the T2R framework was effective in improving the execution performance of applications by 10.5–16.2% and decreasing the size of the code cache by 4.6–28.5%. Copyright © 2015 John Wiley & Sons, Ltd.

    Link
  8. The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.

    Link
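The core of a history-based, contention-aware scheduler can be sketched as a greedy placement loop: each task goes to the device whose queue will finish earliest, with the task's runtime on each device estimated from its execution history. This is a toy sketch with hypothetical numbers, not CLDaemon's actual policy (which is also communication-aware).

```python
# Toy sketch of a history-based, contention-aware scheduler: each task is
# dispatched to the device whose queue will finish earliest, estimating the
# task's runtime on each device from recorded history (numbers hypothetical).
def schedule(tasks, history, num_devices):
    finish = [0.0] * num_devices          # estimated busy-until per device
    placement = {}
    for task in tasks:
        costs = history[task]             # per-device estimated runtimes
        dev = min(range(num_devices),
                  key=lambda d: finish[d] + costs[d])
        finish[dev] += costs[dev]
        placement[task] = dev
    return placement, finish
```

Even when one device is faster for every task, queue contention eventually pushes work onto the slower device, balancing the overall makespan.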
  9. Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment, the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into the SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on the Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

    Link
  10. The importance of heterogeneous multicore programming is increasing, and the Open Computing Language (OpenCL) is an open industrial standard for parallel programming that provides a uniform programming model for programmers to write efficient, portable code for heterogeneous computing devices. However, OpenCL is not supported in the system virtualization environments that are often used to improve resource utilization. In this paper, we propose an OpenCL virtualization framework based on the Kernel-based Virtual Machine with API remoting to enable multiplexing of multiple guest virtual machines (guest VMs) over the underlying OpenCL resources. The framework comprises three major components: (i) an OpenCL library implementation in guest VMs for packing/unpacking OpenCL requests/responses; (ii) a virtual device, called virtio-CL, that is responsible for the communication between guest VMs and the hypervisor (also called the VM monitor); and (iii) a thread, called the CL thread, that is used for OpenCL API invocation. Although the overhead of the proposed virtualization framework is directly affected by the amount of data to be transferred between the OpenCL host and devices because of the primitive nature of API remoting, experiments demonstrated that our virtualization framework has a small virtualization overhead (mean of 6.8%) for six common device-intensive OpenCL programs and performs well when the number of guest VMs involved in the system increases. These results indirectly infer that the framework allows for effective resource utilization of OpenCL devices. Copyright © 2012 John Wiley & Sons, Ltd.

    Link
  11. Graphics processing units (GPUs) are now being widely adopted in system-on-a-chip designs, and they are often used in embedded systems for manipulating computer graphics or even for general-purpose computation. Energy management is of concern to both hardware and software designers. In this article, we present an energy-aware code-motion framework for a compiler to generate concentrated accesses to input and output (I/O) buffers inside a GPU. Our solution attempts to gather the I/O buffer accesses into clusters, thereby extending the time period during which the I/O buffers are clock or power gated. We performed experiments in which the energy consumption was simulated by incorporating our compiler-analysis and code-motion framework into an in-house compiler tool. The experimental results demonstrated that our mechanisms were effective in reducing the energy consumption of the shader processor by an average of 13.1% and decreasing the energy-delay product by 2.2%.

    Link
  12. Embedded processors developed within the past few years have employed novel hardware designs to reduce the ever-growing complexity, power dissipation, and die area. Although a distributed register file architecture requires fewer read/write ports than traditional unified register file structures, it presents challenges for compilation techniques to generate efficient code for such architectures. This paper presents a novel scheme for register allocation that includes global and local components on a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, an optimization phase performed prior to conventional global/local register allocation, named global/local register file assignment (RFA), is used to minimize various register file communication costs. A heuristic algorithm is proposed for global RFA to make suitable decisions based on local RFA. Experiments were performed by applying our schemes to a novel VLIW DSP processor with non-uniform register files. The results indicate that compilation based on our proposed approach delivers significant performance improvements compared with the solution that does not use our proposed global register allocation scheme. Copyright © 2008 John Wiley & Sons, Ltd.

    Link
  13. The compiler is generally regarded as the most important software component that supports a processor design to achieve success. This paper describes our application of the open research compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high quality code. Our preliminary experimental results indicate that our developed compiler can efficiently utilize the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those involved in developing compilers for similar architectures.

    Link
  14. A wide variety of register file architectures—developed for embedded processors—have recently been used with the aim of reducing power dissipation and die size, in contrast with the traditional unified register file structures. This article presents a novel register allocation scheme for a clustered VLIW DSP, which is designed with distinctively banked register files in which port access is highly restricted. Whilst the organization of the register files is designed to decrease power consumption by using fewer port connections, the cluster-based design makes register access across clusters an additional issue, and the switched-access nature of the register file demands further investigation into the use of optimizing register assignment as a means of increasing instruction-level parallelism. We propose a heuristic algorithm, named ping-pong aware local favorable (PALF) register allocation, to obtain a register allocation that is expected to better utilize irregular register file architectures. The results of experiments performed using a compiler based on the Open Research Compiler (ORC) showed significant performance improvement over the original ORC's approach, which is considered to be an optimized approach for common register file architectures. Copyright © 2007 John Wiley & Sons, Ltd.

    Link
  15. Dynamic voltage scaling (DVS) and power gating (PG) have become mainstream technologies for low-power optimization in recent years. One issue that remains to be solved is integrating these techniques in correlated domains operating with multiple voltages. This article addresses the problem of power-aware task scheduling on a scalable cryptographic processor that is designed as a heterogeneous and distributed system-on-a-chip, with the aim of effectively integrating DVS, PG, and the scheduling of resources in multiple voltage domains (MVD) to achieve low energy consumption. Our approach uses an analytic model as the basis for estimating the performance and energy requirements between different domains and addressing the scheduling issues for correlated resources in systems. We also present the results of performance and energy simulations from transaction-level models of our security processors in a variety of system configurations. The prototype experiments show that our proposed methods yield significant energy reductions. The proposed techniques will be useful for implementing DVS and PG in domains with multiple correlated resources.

    Link
  16. Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies due to the continuing size reductions and increasing speeds of transistors. Recent studies have attempted to reduce leakage power using integrated architecture and compiler power-gating mechanisms. This approach involves compilers inserting instructions into programs to shut down and wake up components, as appropriate. While early studies showed this approach to be effective, there are concerns about the large number of power-control instructions added to programs as more and more components are equipped with power-gating controls in SoC design platforms. In this article we present a sink-n-hoist framework for a compiler to generate balanced scheduling of power-gating instructions. Our solution attempts to merge several power-gating instructions into a single compound instruction, thereby reducing the number of power-gating instructions issued. We performed experiments by incorporating our compiler analysis and scheduling policies into the SUIF compiler tools and by simulating the energy consumption using the Wattch toolkits. The experimental results demonstrate that our mechanisms are effective in reducing the number of power-gating instructions while further reducing leakage power compared to previous methods.

    Link
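The fusion step of the sink-n-hoist idea can be sketched as a peephole pass over an instruction stream: once sinking and hoisting have made power-gating operations adjacent, runs of "off" (or "on") instructions for different components are merged into one compound instruction. The instruction encoding below is hypothetical, and the sketch covers only the fusion step, not the motion analysis.

```python
# Toy peephole pass: fuse adjacent power-gating instructions for different
# components into one compound instruction. Instructions are modelled as
# (opcode, operand) pairs; "off"/"on" are the (hypothetical) gating opcodes.
def merge_gating(instrs):
    merged = []
    for op, arg in instrs:
        if op in ("off", "on") and merged and merged[-1][0] == op:
            # Extend the previous compound gating instruction.
            merged[-1] = (op, merged[-1][1] | {arg})
        elif op in ("off", "on"):
            merged.append((op, {arg}))    # start a new compound instruction
        else:
            merged.append((op, arg))      # ordinary instruction, untouched
    return merged
```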
  17. Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts indicate that architectures, compilers, and software can be optimized so as to reduce the switching power (also known as dynamic power) in microprocessors. This has led to interest in using architecture and compiler optimization to reduce leakage power (also known as static power) in microprocessors. In this article, we investigate compiler-analysis techniques related to reducing leakage power. The architecture model in our design is a system with an instruction set that supports the control of power gating at the component level. Our compiler provides an analysis framework for utilizing these instructions to reduce the leakage power. We present a framework for analyzing data flow to estimate the component activities at fixed points of programs whilst considering pipeline architectures. We also provide equations that can be used by the compiler to determine whether employing power-gating instructions in given program blocks will reduce the total energy requirements. As the duration of power gating on components when executing given program routines is related to the number and complexity of program branches, we propose a set of scheduling policies and evaluate their effectiveness. We performed experiments by incorporating our compiler analysis and scheduling policies into the SUIF compiler tools and by simulating the energy consumption with the Wattch toolkits. The experimental results demonstrate that our mechanisms are effective in reducing leakage power in microprocessors.

    Link
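The compiler's insertion decision boils down to a break-even test: gating a component pays off only when the leakage saved over its idle interval exceeds the energy cost of the shut-down and wake-up instructions. The numbers below are hypothetical; the paper derives the component idle intervals from its dataflow equations.

```python
# Back-of-envelope version of the power-gating decision. All parameters are
# hypothetical: leak_per_cycle is the component's leakage energy per idle
# cycle, gate_overhead is the combined energy cost of the off/on instructions.
def worth_gating(idle_cycles, leak_per_cycle, gate_overhead):
    # Gate only if the leakage saved while idle exceeds the gating overhead.
    return idle_cycles * leak_per_cycle > gate_overhead

def break_even_cycles(leak_per_cycle, gate_overhead):
    # Minimum idle length at which gating starts to pay off.
    return gate_overhead / leak_per_cycle
```

For a block whose idle interval is shorter than the break-even length, the compiler simply leaves the component on.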


 
 

Conference Papers

  1. Reducing the size of code is a significant concern in modern industrial settings. This has led to the exploration of various strategies, including the use of function call inlining via compiler optimizations. However, modern compilers like GCC and LLVM often rely on heuristics, which occasionally yield suboptimal outcomes. As a response to this challenge, autotuning mechanisms have been introduced, one of which is the local inlining autotuner that has received attention in previous research. This autotuner has been found to reduce code size by 4.9% compared to LLVM’s -Oz optimization level on SPEC2017 by fine-tuning function inlining decisions. However, the local inlining autotuner has limitations since it refines each function inlining decision individually before combining them, which can lead to complications arising from potential interference between function calls, increasing tuning durations and resulting in larger code sizes. Empirical investigations have revealed that, in most cases, inlining a function call affects only nearby function calls, which are referred to as “neighbors.” This observation allows us to substantially reduce the recompilation overheads entailed by the autotuner. To tackle the interference problem and expedite the tuning process, we propose an enhanced autotuner for function inlining, called the interference-aware inlining autotuner. This autotuner considers the repercussions of inlining a function call when formulating subsequent decisions and exploits the neighbor relationships between function calls to improve tuning efficiency. Experimental evaluations have validated the effectiveness of the interference-aware inlining autotuner, delivering an average code size reduction of 0.4% (up to 1.5%) across the SPEC2017 benchmark suite compared to the local inlining autotuner. Furthermore, the interference-aware autotuner achieved an average code size reduction of 5.3% compared to LLVM’s -Oz optimization level. In terms of tuning time, the serial interference-aware inlining autotuner exhibited a 2.9x acceleration (3.5x for resource-intensive tasks) compared to the parallel local inlining autotuner.

    Link
  2. With the increasing demand for heterogeneous computing, OpenMP has, since version 4.0, provided an offloading feature that allows programmers to offload a task to a device (e.g., a GPU or an FPGA) by adding appropriate directives to the task. Compared to other low-level programming models, such as CUDA and OpenCL, OpenMP significantly reduces the burden on programmers to ensure that tasks are performed correctly on the device. However, OpenMP still has a data-mapping problem, which arises from the separate memory spaces between the host and the device. It is still necessary for programmers to specify data-mapping directives to indicate how data are transferred between the host and the device. When using complex data structures such as linked lists and graphs, it becomes more difficult to compose reliable and efficient data-mapping directives. Moreover, the OpenMP runtime library may incur substantial overhead due to data-mapping management. In this paper, we propose a compiler and runtime collaborative framework, called OpenMP-UM, to address the data-mapping problem. Using the CUDA unified memory mechanism, OpenMP-UM eliminates the need for data-mapping directives and reduces the overhead associated with data-mapping management. The key concept behind OpenMP-UM is to use unified memory as the default memory storage for all host data, including automatic, static, and dynamic data. Experiments have demonstrated that OpenMP-UM not only removed the programmer’s burden of writing data-mapping directives for offloading in OpenMP applications but also achieved an average 7.3x speedup for applications that involve deep copies and an average 1.02x speedup for regular applications.

    Link
  3. More and more applications are shifting from traditional desktop applications to web applications due to the prevalence of mobile devices and recent advances in wireless communication technologies. The Web Workers API has been proposed to allow for offloading computation-intensive tasks from an application’s main browser thread, which is responsible for managing user interfaces and interacting with users, to other worker threads (or web workers), thereby improving user experience. Prior studies have further offloaded computation-intensive tasks to remote servers by dispatching web workers to the servers and demonstrated their effectiveness in improving the performance of web applications. However, the approaches proposed by these prior studies expose potential vulnerabilities of servers due to their design and implementation and do not consider multiple web workers executing in a concurrent or parallel manner. In this paper, we propose an offloading framework (called Offworker) that transparently enables concurrent web workers to be offloaded to edge or cloud servers and provides a more secure execution environment for web workers. We also design a benchmark suite (called Rodinia-JS), a JavaScript version of the Rodinia parallel benchmark suite, to evaluate the proposed framework. Experiments demonstrated that Offworker effectively improved the performance of parallel applications (with speedups of up to 4.8x) when web workers were offloaded from a mobile device to a server. Offworker introduced only a geometric-mean overhead of 12.1% relative to native execution for computation-intensive applications. We believe Offworker offers a promising and secure solution for computation offloading of parallel web applications.

    Link
  4. Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs to ensure a better user experience. With this demand, the ability to manage these concurrent tasks efficiently and easily, especially their lifetimes, is becoming a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support such a paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, no documents or studies in the literature indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of “structure” extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why the concept of structured concurrency emerged more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming dates back to the 1960s. In this paper, we introduce a new theory to complement the origin of structured concurrency from historical and technical perspectives: it is the foundation established by coroutines that gave birth to structured concurrency.

    Link
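The claim that coroutines are the foundation of structured concurrency can be sketched with plain generator coroutines (an illustrative toy, not code from the paper): a "nursery" scope resumes its child coroutines and does not return until every child has finished, so no task outlives the scope that spawned it.

```python
def worker(name, steps):
    """A child task written as a plain generator coroutine; each
    `yield` is a suspension point."""
    for i in range(steps):
        yield f"{name}:{i}"

def nursery(children):
    """A structured-concurrency scope built directly on coroutines:
    it resumes the children round-robin and returns only when all of
    them have finished."""
    log, pending = [], list(children)
    while pending:
        for child in list(pending):
            try:
                log.append(next(child))   # resume the coroutine
            except StopIteration:
                pending.remove(child)     # child finished inside the scope
    return log

trace = nursery([worker("a", 2), worker("b", 3)])
# Control reaches this line only after every child step has run.
```

The suspension points that coroutines provide are precisely what lets the scope interleave its children while still bounding their lifetimes, which is the paper's technical link between the two concepts.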
  5. Binary translation translates binary programs from one instruction set to another. It is widely used in virtual machines and emulators. We extend mc2llvm, an LLVM-based retargetable 32-bit binary translator developed in our lab over the past several years, to support the 64-bit ARM instruction set. In this paper, we report the translation of AArch64 floating-point instructions in mc2llvm. For floating-point instructions, due to the lack of floating-point support in LLVM [13, 14], we add support for the flush-to-zero mode, not-a-number processing, floating-point exceptions, and various rounding modes. On average, mc2llvm-translated binaries achieve 47% and 24.5% of the performance of natively compiled x86-64 binaries on the statically translated EEMBC benchmarks and the dynamically translated SPEC CINT2006 benchmarks, respectively. Compared to QEMU-translated binaries, mc2llvm-translated binaries run 2.92x, 1.21x, and 1.41x faster on the statically translated EEMBC benchmarks, the dynamically translated SPEC CINT2006 benchmarks, and the CFP2006 benchmarks, respectively. (Note that the benchmarks contain both floating-point instructions and other instructions, such as load and store instructions.)

    Link
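As an illustration of one floating-point behavior mentioned above, flush-to-zero mode replaces subnormal results with signed zero. A minimal sketch (assuming IEEE-754 single-precision bounds, emulated here with Python doubles; this is not mc2llvm code):

```python
FLT_MIN = 2.0 ** -126   # smallest positive normal IEEE-754 single

def flush_to_zero(x):
    """Return x with single-precision subnormal magnitudes flushed
    to signed zero, as AArch64's FZ mode does."""
    if x != 0.0 and abs(x) < FLT_MIN:
        return -0.0 if x < 0.0 else 0.0
    return x
```

A translator that targets an IR without such a mode has to insert checks like this around floating-point results whenever the guest enables flush-to-zero.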
  6. Heterogeneous computing has become popular in the past decade. Many frameworks have been proposed to provide a uniform way to program accelerators, such as GPUs, DSPs, and FPGAs. Among them, an open and royalty-free standard, OpenCL, is widely adopted by the industry. However, many OpenCL-enabled accelerators and the standard itself do not support preemptive multitasking. To the best of our knowledge, previously proposed techniques are either not portable or cannot handle ill-designed kernels (the code executed on the accelerators) that never finish. This paper presents a framework (called CLPKM) that provides an abstraction layer between OpenCL applications and the underlying OpenCL runtime to enable preemption of a kernel execution instance based on a software checkpointing mechanism. CLPKM includes (1) an OpenCL runtime library that intercepts OpenCL API calls, (2) a source-to-source compiler that performs the preemption-enabling transformation, and (3) a daemon that schedules OpenCL tasks using priority-based preemptive scheduling techniques. Experiments demonstrated that CLPKM reduced the slowdown of high-priority processes from 4.66x to 1.52--2.23x under up to 16 low-priority, heavy-workload processes running in the background and caused an average of 3.02--6.08x slowdown for low-priority processes.

    Link
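The checkpoint-based preemption idea can be sketched with Python generators (a toy analogy for the source-to-source transformation, not CLPKM's implementation): the kernel yields at compiler-inserted checkpoints, so a scheduler can suspend it when a higher-priority task arrives and resume it later, even if the kernel body itself never cooperates.

```python
def kernel(n, checkpoint_every=100):
    """A long-running 'kernel' rewritten, conceptually, to yield at
    checkpoints inserted by the transformation."""
    total = 0
    for i in range(n):
        total += i
        if i % checkpoint_every == 0:
            yield                      # checkpoint: may be preempted here
    return total

def run_preemptible(gen, budget):
    """Resume `gen` for at most `budget` checkpoints.
    Returns (finished, result)."""
    for _ in range(budget):
        try:
            next(gen)
        except StopIteration as stop:  # kernel ran to completion
            return True, stop.value
    return False, None                 # preempted; state lives on in `gen`

k = kernel(1000)
first = run_preemptible(k, budget=3)     # a high-priority task arrives
second = run_preemptible(k, budget=100)  # later: resume to completion
```

The key property is that all kernel state survives across the suspension, which is what a software checkpoint must guarantee on accelerators without hardware preemption.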
  7. The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article, we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system that abstracts multiple devices into a single virtual device and schedules computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes the primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that schedules kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, mainly because the abstraction layer eliminates the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.

    Link
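A history-based scheduler of the kind described above can be sketched as follows (a hypothetical Python model; the class and parameter names are invented, not VirtCL's API): each kernel's past runtimes predict its cost, and a kernel is dispatched to the device with the lowest predicted finish time.

```python
from collections import defaultdict

class HistoryScheduler:
    """Toy contention-aware scheduler: per-kernel runtime history
    drives the choice among devices."""
    def __init__(self, num_devices):
        self.load = [0.0] * num_devices       # work already queued per device
        self.history = defaultdict(list)      # kernel name -> past runtimes

    def record(self, kernel, runtime):
        self.history[kernel].append(runtime)

    def predict(self, kernel):
        runs = self.history[kernel]
        return sum(runs) / len(runs) if runs else 1.0   # default estimate

    def dispatch(self, kernel):
        """Pick the device with the lowest predicted finish time,
        then account for the kernel's estimated cost."""
        cost = self.predict(kernel)
        dev = min(range(len(self.load)), key=lambda d: self.load[d] + cost)
        self.load[dev] += cost
        return dev

sched = HistoryScheduler(num_devices=2)
sched.record("matmul", 8.0)   # runtimes observed in earlier runs
sched.record("blur", 2.0)
placements = [sched.dispatch(k) for k in ("matmul", "blur", "blur", "blur")]
# The long matmul claims one device; the short blurs fill the other.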
  8. Graphics processing units (GPUs) are now widely used in embedded systems for manipulating computer graphics and even for general-purpose computation. However, many embedded systems have to manage highly restricted hardware resources in order to achieve high performance or energy efficiency. The number of registers is one of the common limiting factors in an embedded GPU design. Programs that run with a low number of registers may suffer from high register pressure if register allocation is not properly designed, especially on a GPU in which a register is divided into four elements and each element can be accessed separately, because allocating a register to a vector-type variable that does not contain values in all elements wastes register space. In this paper, we present a vector-aware register allocation framework to improve register utilization on shader architectures. The framework involves two major components: (1) element-based register allocation, which allocates registers based on the element requirements of variables, and (2) register packing, which rearranges elements of registers in order to increase the number of contiguous free elements, thereby keeping more live variables in registers. Experimental results on a cycle-approximate simulator showed that the proposed framework eliminated 92% of register spills in total and made 91.7% of 14 common shader programs spill-free. These results indicate an opportunity for energy management of the space used for storing spilled variables, with the framework improving performance by a geometric mean of 8.3%, 16.3%, and 29.2% for general shader processors in which variables are spilled to memory with 5-, 10-, and 20-cycle access latencies, respectively. Furthermore, the reduction in the register requirements of programs enabled another 11 programs with high register pressure to be runnable on a lightweight GPU.

    Link
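The element-based allocation idea can be modeled with a toy allocator (illustrative Python, not the paper's framework): each register holds four separately addressable elements (x, y, z, w), and the free-element count assumes packing has compacted live elements so that free ones remain usable, letting narrow variables share a register instead of spilling.

```python
class RegisterFile:
    """Four elements per register; `free` counts free elements,
    assuming register packing keeps them contiguous and usable."""
    def __init__(self, num_regs):
        self.free = [4] * num_regs

    def alloc(self, width):
        """First-fit allocation of a variable needing `width` elements;
        returns a register index, or None to signal a spill."""
        for r, avail in enumerate(self.free):
            if avail >= width:
                self.free[r] -= width
                return r
        return None

rf = RegisterFile(num_regs=2)
# vec3, vec2, scalar, vec2, scalar: the scalar and second vec2 pack
# into leftover elements; only the final scalar finds no room.
placements = [rf.alloc(w) for w in (3, 2, 1, 2, 1)]
```

A whole-register allocator would have spilled as soon as the two registers were claimed by the vec3 and the first vec2; tracking elements is what defers the spill.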
  9. CUDA is a C-extended programming model that allows programmers to write code for both central processing units and graphics processing units (GPUs). In general, GPUs require high thread-level parallelism (TLP) to reach their maximal performance, but the TLP of a CUDA program is deeply affected by the resource allocation of GPUs, including the allocation of shared memory and registers, since these allocation results directly determine the number of active threads on GPUs. Some research has focused on the management of memory allocation for performance enhancement, but none has proposed an effective approach to speed up programs whose TLP is limited by insufficient registers. In this paper, we propose a TLP-aware register-pressure reduction framework that reduces the register requirement of a CUDA kernel to a desired degree so as to allow more threads to be active and thereby hide the long-latency global memory accesses among these threads. The framework includes two schemes: register rematerialization and register spilling to shared memory. The experimental results demonstrate that the framework is effective in improving the performance of CUDA kernels by a geometric average of 14.8%, while the geometric average performance improvement for CUDA programs is 5.5%.

    Link
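Why register pressure caps TLP can be seen from a back-of-the-envelope occupancy model (the register-file size and thread limit below are assumed round numbers, not figures from the paper): the threads resident on a multiprocessor are bounded by the register file divided by each thread's register use.

```python
def active_threads(regs_per_thread, regfile_size=65536, max_threads=2048):
    """Threads resident on one multiprocessor, capped by whichever is
    tighter: the register file or the hardware thread limit."""
    return min(max_threads, regfile_size // regs_per_thread)

# Reducing per-thread registers (e.g., by rematerialization or by
# spilling a few values to shared memory) raises the thread count:
at_64 = active_threads(64)   # register-file bound
at_40 = active_threads(40)   # still register-file bound, but higher
at_32 = active_threads(32)   # now capped by max_threads instead
```

More resident threads give the warp scheduler more candidates to run while others wait on global memory, which is exactly the latency-hiding effect the framework targets.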
  10. Embedded processors developed in recent years have attempted to employ novel hardware designs to reduce ever-growing complexity, power dissipation, and die area. While a distributed register file architecture with irregular accessing constraints is considered a more effective approach than traditional unified register file structures, conventional compilation techniques are not adequate to exploit such new register file organizations for optimal performance. This paper presents a novel register allocation scheme, comprising global and local register allocation, for a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, a sub-phase prior to the original global/local register allocation, named global/local RFA (register file assignment), is introduced to minimize various register file communication costs. For register file structures in which each cluster contains heterogeneous register files, a conventional register allocation scheme with cluster assignment alone has to be enhanced to cope with both inter-cluster and intra-cluster communications. Because global RFA potentially has a heavy influence on local RFA, a heuristic algorithm is proposed in which global RFA makes suitable communication decisions for local RFA. Experiments were done with a compiler under development based on the Open Research Compiler (ORC), and the results indicate that compilation with the proposed approach delivers significant performance improvement, comparable to the solution using only the PALF scheme developed in our previous work.

    Link
  11. High-performance and low-power VLIW DSP processors are increasingly deployed in embedded devices to process video and multimedia applications. To reduce power and cost in the design of VLIW DSP processors, distributed register files and multi-bank register architectures are being adopted to reduce the number of read/write ports in register files. This presents new challenges for devising compiler optimization schemes for such architectures. In this paper, we address the compiler optimization issues for the PAC architecture, a 5-way issue DSP processor with distributed register files. We present an integrated flow that addresses several phases of compiler optimization interacting with distributed register files and multi-bank register files at the layers of instruction scheduling, software pipelining, and data-flow optimization. Our experiments on a novel 32-bit embedded VLIW DSP (known as the PAC DSP core) exhibit state-of-the-art performance for embedded VLIW DSP processors with distributed register files by incorporating our proposed schemes in compilers.

    Link
  12. Embedded VLIW DSP processors have become a research focus for supporting high-performance, low-power multimedia applications on hand-held devices. Under tight resource constraints, distributed register files, variable-length instruction encodings, and special data paths are frequently adopted, which creates challenges in deploying software toolkits for new embedded DSP processors. This article presents our methods and experiences in developing software and toolkit flows for PAC (parallel architecture core) VLIW DSP processors. Our toolkits include compilers, assemblers, a debugger, and DSP micro-kernels. We first retarget the Open Research Compiler (ORC) and its toolkit chain for the PAC VLIW DSP processor and address the issues of supporting distributed register files and ping-pong data paths for embedded VLIW DSP processors. Second, the linker and assembler support variable-length encoding schemes for DSP instructions. In addition, the debugger and DSP micro-kernel were designed to handle dual-core environments. The footprint of the micro-kernel is around 10K, addressing the code-size issues of embedded devices. We also present experimental results for the compiler framework by incorporating software pipelining (SWP) policies for distributed register files in the PAC architecture. The results indicate that our compiler framework gains a performance improvement of around 2.5 times over code generated without our proposed optimizations.

    Link
  13. A variety of new register file architectures have been developed for embedded processors in recent years, promoting hardware designs that achieve low power dissipation and reduced die size over traditional unified register file structures. This paper presents a novel register allocation scheme for a clustered VLIW DSP processor designed with distinctively banked register files in which port access is highly restricted. With register file organizations designed to decrease power consumption through fewer port connections, not only does the clustered design make register access across clusters an additional issue, but the switched-access nature of the register files also demands further investigation into optimizing register assignment to increase instruction-level parallelism. We propose a heuristic algorithm to obtain a preferable register allocation that is expected to utilize the irregular register file architectures well. Experiments were done with a compiler under development based on the Open Research Compiler (ORC), and the results showed that compilation with the proposed approach delivers significant performance improvement, comparable to a simulated-annealing approach that approximates an exhaustive, near-optimal solution.

    Link
  14. PAC DSP is a novel VLIW DSP processor that employs port-restricted, distinctly partitioned register file structures in addition to a heterogeneous clustered datapath architecture to attain low power consumption and a reduced die size; however, these architectural features pose new challenges for compiler construction. This paper describes our deployment of the Open Research Compiler (ORC) infrastructure on PAC DSP architectures and the specific compilation design. Preliminary results indicate that our compiler development for PAC DSP is effective for the architecture and that the evaluation is useful for refining the architecture. Our experiences in designing compiler support for heterogeneous VLIW DSP processors with irregular resource constraints may benefit similar architectures.

    Link
  15. Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts have tried to integrate architecture and compiler solutions that employ power-gating mechanisms to reduce leakage power. The approach is to have compilers perform data-flow analysis and insert instructions into programs to shut down and wake up components whenever appropriate for power reduction. While this approach has been shown to be effective in early studies, there are concerns about the number of power-control instructions added to programs as more and more components are equipped with power-gating control in an SoC design platform. In this paper, we present a Sink-N-Hoist framework in the compiler to generate balanced scheduling of power-gating instructions. Our solution attempts to merge several power-gating instructions into one compound instruction, thereby reducing the number of power-gating instructions issued. We perform experiments by incorporating our compiler analyses and scheduling policies into the SUIF compiler tools and by simulating energy consumption with the Wattch toolkits. The experimental results demonstrate that our mechanisms are effective in reducing the number of power-gating instructions while further reducing leakage power compared with previous methods.

    Link
  16. Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts have tried to integrate architecture and compiler solutions that employ power-gating mechanisms to reduce leakage power. The approach is to have compilers perform data-flow analysis and insert instructions into programs to shut down and wake up components whenever appropriate for power reduction. While this approach has been shown to be effective in our early studies [1, 2], there are concerns about the number of power-control instructions added to programs as more and more components are equipped with power-gating control in an SoC design platform. In this poster, we briefly review our previous work and present a Sink-N-Hoist framework in the compiler to generate balanced scheduling of power-gating instructions. The main idea of the Sink-N-Hoist framework is to mitigate the problem of too many instructions being added by applying code-motion techniques. The approach attempts to merge several power-gating instructions into one compound instruction by 'sinking' power-off instructions and 'hoisting' power-on instructions, i.e., postponing the issue of power-off instructions and advancing the issue of power-on instructions. This mainly profits code size, but also improves performance and energy via grouping effects. For instance, a power-off instruction can be postponed for some cycles so that it merges with other adjacent power-off instructions. Nevertheless, there should be a limit on the number of cycles an instruction is sunk or hoisted, since sinking or hoisting a power-gating instruction causes more leakage dissipation. A cost model combining architecture and power information is proposed to determine the feasibility.
The proposed Sink-N-Hoist analysis mainly consists of three phases: 1) the Sinkable Analysis and Hoistable Analysis, which compute the possible positions for each power-gating instruction; 2) the Grouping-Off Analysis and Grouping-On Analysis, which partition the power-off and power-on instructions into groups and select the instructions to emit for each group; and 3) power-gating instruction placement, which uses the outcome of the former two analyses to determine how to place power-gating instructions. We evaluate the Sink-N-Hoist framework by incorporating our compiler analyses and scheduling policies into the SUIF compiler tools and by simulating energy consumption with the Wattch toolkits. Experimental results with the DSPstone benchmarks demonstrate that our mechanisms are effective in reducing the number of power-gating instructions as well as producing power reductions over previous methods: they yield an average reduction of 31.2% in the number of power-gating instructions over the scheme without the Sink-N-Hoist framework, and they further reduce energy consumption. This is because a block version of power-gating instructions gives better power and performance effects than the pointwise version.

    Link
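The sinking/hoisting idea can be sketched with a toy greedy grouper (illustrative Python; the real framework uses data-flow analyses and a leakage cost model): power-off instructions sink to the latest cycle of their group and power-on instructions hoist to the earliest, so each group issues as one compound instruction.

```python
def group_cycles(cycles, window):
    """Greedy grouping: cycles within `window` of the group's first
    member issue together as one compound instruction."""
    groups, current = [], [cycles[0]]
    for c in cycles[1:]:
        if c - current[0] <= window:
            current.append(c)
        else:
            groups.append(current)
            current = [c]
    groups.append(current)
    return groups

def sink_n_hoist(offs, ons, window):
    """Power-offs sink to the latest cycle of their group (issued
    late); power-ons hoist to the earliest (issued early). A real
    cost model must bound `window`, since every cycle an instruction
    is sunk or hoisted leaks extra power."""
    off_compound = [max(g) for g in group_cycles(sorted(offs), window)]
    on_compound = [min(g) for g in group_cycles(sorted(ons), window)]
    return off_compound, on_compound

# Five power-gating instructions collapse into three compound issues:
offs, ons = sink_n_hoist([10, 12, 30], [40, 43], window=4)
```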
  17. Power dissipation is a major concern for SoC designs and embedded systems that aim to extend battery life. Techniques such as dynamic voltage scaling (DVS), power gating (PG), and multiple-domain partitioning provide mechanisms to reduce dynamic and static power. The control systems built on these techniques can be system software or hardware monitors; in either case, architecture-level energy estimation is needed to facilitate design space exploration of architecture and software scheduling designs for energy reduction. In this paper, we propose an architecture-level simulation environment for security processors with power estimation. The simulation environment includes a transaction-level modeling (TLM) simulator implemented in SystemC with multiple power domains, an analytical model, a workload generator, power parameter banks, versatile outputs, and succinct GUIs. Architecture developers can use it to evaluate different architecture configurations and retrieve performance results and power estimations. System software developers can use these tools for experiments in devising power-aware scheduling methods on security processors to reduce power dissipation.

    Link
  18. In this paper, we present several techniques for low-power design of network security processors, including a descriptor-based low-power scheduling algorithm, the design of a dynamic voltage generator, and dual-threshold-voltage assignment. The experiments show that the proposed methods and designs enable network security processors to achieve the goals of both high performance and low power.

    Link
  19. Techniques to reduce power dissipation in embedded systems have recently come into sharp focus in technology development. Among these techniques, dynamic voltage scaling (DVS), power gating (PG), and multiple-domain partitioning are regarded as effective schemes for reducing dynamic and static power. In this paper, we investigate the problem of power-aware scheduling of tasks running on a scalable encryption processor, which adopts a heterogeneous distributed SoC design and requires the effective integration of DVS, PG, and scheduling for correlated multiple-domain resources. We propose a novel heuristic that integrates the utilization of DVS and PG and increases the total energy savings. Furthermore, we propose an analytic model to estimate the performance and energy requirements of the different components in the system. These proposed techniques are essential for performing DVS and PG on multiple-domain resources that are correlated. Experiments were performed in prototype environments for our security processors, and the results show that significant energy reductions can be achieved by our algorithms.

    Link
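The intuition for combining DVS with power gating can be shown with the standard dynamic-energy relation E = C·V²·f·t (the capacitance, voltages, and frequencies below are illustrative numbers, not figures from the paper): lowering voltage and frequency saves energy on the same workload, while power gating instead cuts leakage during whatever idle time remains.

```python
def dynamic_energy(capacitance, voltage, frequency, seconds):
    """Switching energy: E = C * V^2 * f * t."""
    return capacitance * voltage ** 2 * frequency * seconds

# A task of 1e9 cycles with enough slack to run at half speed:
full_speed = dynamic_energy(1e-9, 1.2, 1e9, 1.0)  # 1 GHz at 1.2 V, 1 s
scaled = dynamic_energy(1e-9, 0.9, 5e8, 2.0)      # 0.5 GHz at 0.9 V, 2 s
# Same work, noticeably less dynamic energy, but no idle time left
# for power gating; a scheduler must weigh the two against each other.
```

This trade-off, DVS consuming slack that PG could otherwise spend powered down, is why the two mechanisms must be scheduled jointly rather than independently.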
  20. Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts also indicate that architecture, compiler, and software participation can help reduce switching activities (also known as dynamic power) on microprocessors. This raises interest in employing architecture and compiler efforts to reduce leakage power (also known as static power) on microprocessors. In this paper, we investigate compiler analysis techniques for reducing leakage power. The architecture model in our design is a system with an instruction set that supports the control of power gating at the component level. Our compiler provides an analysis framework that utilizes these instructions to reduce leakage power. We present a data-flow analysis framework to estimate component activities at fixed points of programs, taking the pipelines of the architecture into consideration. We also give the equation with which the compiler decides whether employing power-gating instructions in given program blocks benefits total energy reduction. As the duration of power gating for components in given program routines is related to program branches, we propose a set of scheduling policies, including the Basic_Blk_Sched, MIN_Path_Sched, and AVG_Path_Sched mechanisms, and evaluate their effectiveness. Our experiments are done by incorporating our compiler analysis and scheduling policies into the SUIF compiler tools [32] and by simulating energy consumption with the Wattch toolkits [6]. Experimental results show that our mechanisms are effective in reducing leakage power on microprocessors.

    Link
  21. With the demands of power-constrained mobile applications and devices, reducing the power consumption of embedded systems has become a crucial and challenging issue. In this paper, we focus on scheduling problems with a variable-voltage processor core to optimize power consumption in real-time systems. We model the problem as real-time and online problems, and our solution incorporates a reservation-list scheme for variable-voltage scheduling. Our decision algorithm supports a variety of selection criteria, including best-effort, average computation time, average power consumption, average energy consumption, pre-defined threshold value, and weighted hybrid schemes for scheduling tasks. We believe our scheme provides a comprehensive study of the problem of scheduling real-time tasks to reduce energy consumption.

    Link
  22. Though Huffman codes [2,3,4,5,9] have shown their power in data compression, there are still some issues that have not been noticed. In this paper, we address the randomness of data compressed via Huffman coding. Randomized computation is the only known method for many notoriously difficult #P-complete problems, such as computing the permanent and some network reliability problems [1,7,8,10]. According to Kolmogorov complexity [6,10], a truly random binary string is very difficult to compress, and for any fixed length there exist incompressible strings. In other words, incompressible strings tend to carry a higher degree of randomness. We study this phenomenon via Huffman coding: we take compressed data as a random source that provides coin flips for randomized algorithms. A simple randomized algorithm is proposed to calculate the value of π with the compressed data as the random-number source. Experimental results show that data compressed via Huffman coding does provide a better approximation for calculating π, especially in the first few rounds of compression. We tried several different types of files and obtained similar results, indicating that the compressed data is random for the test examples.

    Link
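The experiment can be sketched as follows (a simplified toy, not the authors' code): build a Huffman code, compress some text, and feed the compressed bitstream as coordinates for a Monte Carlo estimate of π. How close the estimate gets depends on how random the bitstream actually is, which is the paper's point.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: symbol -> bitstring (prefix-free)."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, tag, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tag, merged))
    return heap[0][2]

def estimate_pi(bits, precision=8):
    """Consume 2*precision bits per point in the unit square and
    count hits inside the quarter circle."""
    step, inside, total = 2 * precision, 0, 0
    for i in range(0, len(bits) - step + 1, step):
        x = int(bits[i:i + precision], 2) / 2 ** precision
        y = int(bits[i + precision:i + step], 2) / 2 ** precision
        inside += x * x + y * y <= 1.0
        total += 1
    return 4.0 * inside / total

text = "".join(str(n) for n in range(3000))  # stand-in for an input file
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)     # compressed bitstream
pi_est = estimate_pi(bits)                   # coin flips drive the estimate
```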


 
 

Patents

  1. Jia-Rung CHANG, Yi-Chiao SU, Tien-Yuan Hsieh, and Yi-Ping You, "Optimization method, optimization system for computer programming code and electronic device using the same", United States Patent, No. US20220129254A1, issued on 2022/04/28. Link
  2. 張家榮, 蘇意喬, 謝天元, and 游逸平, "Optimization method, optimization system for computer programming code and electronic device using the same" (in Chinese), Taiwan Patent, No. TW202217552A, issued on 2022/05/01. Link
  3. 游逸平, 吴晞浩, 郑育镕, and 陈静芳, "Variable inference system and method for software program" (in Chinese), China Patent, No. CN105700932B, issued on 2019/02/05. Link
  4. Yi-Ping You, Si-Hao WU, Yu-Jung Cheng, and Jing-Fung CHEN, "Variable inference system and method for software program", United States Patent, No. US9747087B2, issued on 2017/08/29. Link
  5. 游逸平, 吳晞浩, 鄭育鎔, and 陳靜芳, "Variable inference system and method for software program" (in Chinese), Taiwan Patent, No. TWI571802B, issued on 2017/02/21. Link
  6. 游逸平, 陈柏裕, and 陈静芳, "Hybrid dynamic code compiling device, method and service system thereof", China Patent, No. CN104657189B, issued on 2017/12/22. Link
  7. 游逸平, 陳柏裕, and 陳靜芳, "Hybrid dynamic code compiling device, method, and service system thereof", Taiwan Patent, No. TWI525543B, issued on 2016/03/11. Link
  8. Yi-Ping You, Po-Yu Chen, and Jing-Fung CHEN, "Hybrid dynamic code compiling device, method, and service system thereof", United States Patent, No. US9182953B2, issued on 2015/11/10. Link
  9. Szu-Chieh Chen, Yi-Ping You, and Ming-Yung KO, "System and method for configuring graphics register data and recording medium", United States Patent, No. US8988444B2, issued on 2015/03/24. Link
  10. Shenhung Wang, Yiping You, Yiting Lin, Mingyung Ko, Chiaming Chang, and Yujung Cheng, "Low power program compiling method, device, and computer-readable recording medium storing the same" (in Chinese), Taiwan Patent, No. TWI425419B, issued on 2014/02/01. Link
  11. Shen-Hung WANG, Yi-Ping You, Yi-Ting Lin, Ming-Yung KO, Chia-Ming Chang, and Yu-Jung Cheng, "Low power program compiling device, method and computer readable storage medium for storing thereof", United States Patent, No. US20120151456A1, issued on 2012/06/14. Link
  12. Yi-Ping You, Jeng Kuen Lee, Kuo Yu Chuang, and Chung Hsien Wu, "Multi-thread power-gating design", Taiwan Patent, No. TWI361345B, issued on 2012/04/01. Link
  13. Yi-Ping You, Jeng Kuen Lee, Kuo Yu Chuang, and Chung-Hsien Wu, "Multi-thread power-gating control design", United States Patent, No. US7904736B2, issued on 2011/03/08. Link
  14. Yung-Chia Lin, Yi-Ping You, Chung-Wen Huang, and Jenq-Kuen Lee, "Task scheduling method for low power dissipation in a system chip", United States Patent, No. US7779412B2, issued on 2010/08/17. Link
  15. Jenq Kuen Lee, Yung Chia Lin, and Yi Ping Yu, "Method for allocating registers for a processor", United States Patent, No. US7650598B2, issued on 2010/01/19. Link
  16. Jenq Kuen Lee, Yung Chia Lin, and Yi Ping Yu, "Method for allocating registers for a processor", Taiwan Patent, No. TWI320906B, issued on 2010/02/21. Link
  17. Yi-Ping You, Chung Wen Huang, Jeng Kuen Lee, Chi-Lung Wang, and Kuo Yu Chuang, "Power-gating instruction scheduling for power leakage reduction", United States Patent, No. US7539884B2, issued on 2009/05/26. Link
  18. Yi Ping Yu, Chung Wen Huang, Jeng Kuen Lee, Chi Lung Wang, and Kuo Yu Chuang, "Power-gating control placement for leakage power reduction", Taiwan Patent, No. TWI304547B, issued on 2008/12/21. Link
  19. Jenq-Kuen Lee, Yung-Chia Lin, Yi-Ping Yu, and Chung-Wen Huang, "Processor employing a power managing mechanism and method of saving power for the same", United States Patent, No. US7398410B2, issued on 2008/07/08. Link
  20. Yi-Ping You, Chung Wen Huang, Jeng Kuen Lee, Chi-Lung Wang, and Kuo Yu Chuang, "Power-gating instruction scheduling for power leakage reduction", United States Patent, No. US20070157044A1, issued on 2007/07/05. Link
  21. Jenq-Kuen Lee, Yung-Chia Lin, Yi-Ping Yu, and Chung-Wen Huang, "Processor employing a power managing mechanism and method of saving power for the same", Taiwan Patent, No. TWI291650B, issued on 2007/12/21. Link
  22. Jenq-Kuen Lee, Yung-Chia Lin, Yi-Ping You, and Chung-Wen Huang, "Task scheduling method for low power dissipation in a system chip" (in German), Germany Patent, No. DE102005044533A1, issued on 2006/03/30. Link
  23. Yung-Chia Lin, Yi-Ping Yu, Chung-Wen Huang, and Jenq-Kuen Lee, "Task scheduling method with low power consumption and a SOC using the method", Taiwan Patent, No. TWI251171B, issued on 2006/03/11. Link
  24. Jenq-Kuen Lee, Yi-Ping You, and Chi Wu, "Variable voltage scheduling method for system components", Taiwan Patent, No. TWI221227B, issued on 2004/09/21. Link


 
 

Others

  1. Chung-Yi Chen and Yi-Ping You, "Compact LLVM IR-Based Program Representation", in Proceedings of the 29th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '24), Taipei, Taiwan, 2024.
  2. Mu-En Huang and Yi-Ping You, "Enhanced Compression-Aware Register Allocation", in Proceedings of the 29th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '24), Taipei, Taiwan, 2024.
  3. Ya-Ran Guo and Yi-Ping You, "MLIR-Based Program Embeddings", in Proceedings of the 28th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '23), Tainan, Taiwan, 2023.
  4. An-Chi Liu and Yi-Ping You, "Offloading Support for Parallel JavaScript Programs", in Proceedings of the 27th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '22), Taoyuan, Taiwan, 2022.
  5. Yung-Chia Lin, Yi-Ping You, Chung-Ju Wu, Bo-Syun Hsu, Chung-Lin Tang, Sheng-Yuan Chen, Ya-Chiao Moo, and Jenq-Kuen Lee, "PAC DSP Software Development Suite", in SoC Technical Journal, Vol. 2, pp. 22-35, 2022.
  6. Yi-Ping You and Jenq-Kuen Lee, "Compilers for Low Power", in Communications of IICM, Vol. 5, Issue 4, pp. 39-46, 2022.
  7. Yi-Chiao Su and Yi-Ping You, "On Subsequences of O3 Compiler Optimization Sequences", in Proceedings of the 26th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '21), Taipei, Taiwan, 2021.
  8. Yun-Wei Lee and Yi-Ping You, "Automated Syntactic Refactoring", in Proceedings of the 25th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '19), Taipei, Taiwan, 2019.
  9. Jia-Rung Chang and Yi-Ping You, "Selecting Function-Level Optimization Options with Reinforcement Learning", in Proceedings of the 24th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '18), Chiayi, Taiwan, 2018.
  10. Ming-Tsung Chiu and Yi-Ping You, "Enabling Preemptive Execution of OpenCL Kernels", in Proceedings of the 24th Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '18), Chiayi, Taiwan, 2018.
  11. Nai-Jia Dong and Yi-Ping You, "Constructing Generic and Efficient Containers with C Preprocessor Macros", in Proceedings of the 23rd Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '17), Taichung, Taiwan, 2017.
  12. Yen-Ting Chao and Yi-Ping You, "Capability-Aware Workload Partition on Multi-GPU Systems", in Proceedings of the 22nd Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '16), Hsinchu, Taiwan, 2016.
  13. Po-Hsiang Chiu and Yi-Ping You, "LLVM-based AOT Compilation for Dynamic Languages: JavaScript as a Case Study", in Proceedings of the 21st Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '15), Tainan, Taiwan, 2015.
  14. Yi-Ping You, Hen-Jung Wu, Yeh-Ning Tsai, and Yen-Ting Chao, "VirtCL: A Framework for OpenCL Device Abstraction and Management", in Proceedings of the 21st Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing (CTHPC '15), Tainan, Taiwan, 2015.
  15. Poyu Chen and Yi-Ping You, "JSComp: A Static Compiler for Hybrid Execution of JavaScript Programs", in Proceedings of the 20th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '14), Hsinchu, Taiwan, 2014.
  16. Yu-Shiuan Tsai, Pen-Yung Yu, and Yi-Ping You, "Compiler-Assisted Resource Management for CUDA Programs", in Proceedings of the 19th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '13), Taipei, Taiwan, 2013.
  17. Yi-Ping You and Szu-Chieh Chen, "Register Allocation Techniques for GPU Shader Processors", in Proceedings of the 18th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '12), Chiayi, Taiwan, 2012.
  18. Yi-Ping You, Shen-Hong Wang, and I-Ting Lin, "Energy-aware Code Motion for GPU Shader Processors", in Proceedings of the 17th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '11), Taichung, Taiwan, 2011.
  19. Yi-Ping You, "Compiler Optimizations on Embedded Processors for Low Power", Ph.D. Dissertation, 2007.
  20. Yi-Ping You, Chung-Wen Huang, and Jenq-Kuen Lee, "Compact Power-Gating Control Placement", in Proceedings of the 12th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '06), Tainan, Taiwan, 2006.
  21. Yi-Ping You, Chingren Lee, and Jenq-Kuen Lee, "Compiler Optimization for Low Power on Power Gating", in Proceedings of the 8th Workshop on Compiler Techniques for High-Performance Computing (CTHPC '02), Hualien, Taiwan, 2002.
  22. Yi-Ping You, "Compiler and OS Optimizations for Low Power", Master's Thesis, 2002.
  23. Yi-Ping You and Shi-Chun Tsai, "On the Time vs. Space Complexity of Adaptive Huffman Coding", Project Report of NSC-882815C260002E, 1999.