Publications

2026

DAC 26

Cachence: Fine-Grained Cache Partitioning in Both Time and Space

63rd ACM/IEEE Design Automation Conference (DAC 26)

Liujia Li, Yuanlong Li, Yiming Yao, Jianyu Wu, Yi Fan, Jinhao Guo, Liren Zhu, Jie Zhang, Xiaolin Wang, Yingwei Luo, Zhenlin Wang, Diyu Zhou

DOI

Cache partitioning mechanisms improve system performance by judiciously allocating cache space among diverse applications. We found that the effectiveness of cache partitioning can be significantly improved for modern applications through textitfine-grained partitioning in both time and space. However, existing software and hardware solutions are limited to only one dimension. This paper presents CACHENCE, a novel cache partitioning mechanism that achieves textitboth textbftimely and textbfhigh-granularity partitioning with two novel techniques. First, CACHENCE profiles cache access patterns by implementing a refined cache analytical model in hardware, enabling fine-grained miss-ratio curve profiling. The hardware-friendly optimization of the model provides high accuracy with textittheoretical guarantees in less than textittens of million cycles, supporting timely partition while incurring only a small, about two kilobytes space overhead for each core. Second, CACHENCE leverages the characteristics of modern out-of-order processors and runtime textitmicroarchitectural statistics to propose a lightweight yet precise performance model, which supports a textitthousand of kilobytes-level cache partitioning. CACHENCE also proposes a special hardware cache partitioning tool, boosting the textithit rate while minimizing textitstorage and latency overhead. Our evaluation shows that, on a 16-core system, CACHENCE outperforms existing approaches by an average of 9.5% to 28.2%, and up to 48.8%.

2025

SOSP 25	CortenMM: Efficient Memory Management with Strong Correctness Guarantees CM SIGOPS 31st Symposium on Operating Systems Principles (SOSP 25) Junyang Zhang, Xiangcan Xu, Yonghao Zou, Zhe Tang, Xinyi Wan, Kang Hu, Siyuan Wang, Wenbo Xu, Di Wang, Hao Chen, Lin Huang, Shoumeng Yan, Yuval Tamir, Yingwei Luo, Xiaolin Wang, Huashan Yu, Zhenlin Wang, Hongliang Tian, Diyu Zhou DOI Code Modern memory management systems suffer from poor performance and subtle concurrency bugs, slowing down applications while introducing security vulnerabilities. We observe that both issues stem from the conventional design omemory management systems with two levels of abstraction: a software-level abstraction (e.g., VMA trees in Linux) and a hardware-level abstraction (typically, page tables). This design increases portability but requires correctly and efficiently synchronizing two drastically different and complex data structures, which is generally challenging. We present CortenMM, a memory management system with a clean-slate design to achieve both high performance and synchronization correctness. Our key insight is that most OSes no longer need the software-level abstraction, since mainstream ISAs use nearly identical hardware MMU formats. Therefore, departing from prior designs, CortenMM eliminates the software-level abstraction to achieve sweeping simplicity. Exploiting this simplicity, CortenMM proposes a transactional interface with scalable locking protocols to program the MMU, achieving high performance by avoiding the extra contention in the software-level abstraction. The one-level design further enables us to formally verify the correctness of concurrent code operating on the MMU (correctness of basic operations and locking protocols), thereby offering strong correctness guarantees. Our evaluation shows that the formally verified CortenMM outperforms Linux by 1.2x to 26x on real-world applications.
SOSP 25	Aeolia: A Fast and Secure Userspace Interrupt-Based Storage Stack CM SIGOPS 31st Symposium on Operating Systems Principles (SOSP 25) Chuandong Li, Ran Yi, Zonghao Zhang, Jing Liu, Changwoo Min, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Diyu Zhou PDF DOI Code Polling-based userspace storage stacks achieve great I/O performance. However, they cannot efficiently and securely share disks and CPUs among multiple tasks. In contrast, interrupt-based kernel stacks inherently suffer from subpar I/O performance but achieve advantages in resource sharing.We present Aeolia, a novel storage stack that achieves great I/O performance while offering efficient and secure resource sharing. Aeolia is an interrupt-based userspace storage stack, representing a new point in the design space previously considered unfeasible. Our main observation is that, contrary to conventional wisdom, polling offers only marginal disk performance improvements over interrupts. Aeolia exploits user interrupt, an emerging hardware feature commonly used for userspace IPIs, in a novel way to deliver storage interrupts directly to userspace, thereby achieving high I/O performance with direct access. Aeolia leverages the hardware intra-process isolation features and sched_ext, an eBPF-based userspace scheduling framework, to efficiently and securely share CPUs and disks among multiple tasks, challenging the common belief that these are inherent disadvantages of userspace storage stacks. The above design enables Aeolia to realize AeoFS, a high-performance library file system that securely and directly accesses disks. Our evaluation shows that Aeolia outperforms Linux by 2脳 and AeoFS outperforms ext4 by up to 19.1脳, respectively.
NDSS 25	Blackbox Fuzzing of Distributed Systems with Multi-Dimensional Inputs and Symmetry-Based Feedback Pruning Network and Distributed System Security (NDSS) Symposium Yonghao Zou, Jia-Ju Bai, Zu-Ming Jiang, Ming Zhao, Diyu Zhou PDF Code Video This paper presents DistFuzz, which, to our knowledge, is the first feedback-guided blackbox fuzzing framework for distributed systems. The novelty of DistFuzz comes from two conceptual contributions on key aspects of distributed system fuzzing: the input space and feedback metrics. Specifically, unlike prior work that focuses on systematically mutating faults, exploiting the request-driven and timing-dependence nature of distributed systems, DistFuzz proposes a multi-dimensional input space by incorporating regular events and relative timing among events as the other two dimensions. Furthermore, observing that important state changes in distributed systems can be indicated by network messages among nodes, DistFuzz utilizes the sequences of network messages with symmetry-based pruning as program feedback, which departs from the conventional wisdom that effective feedback requires code instrumentation/analysis and/or user inputs. DistFuzz finds 52 real bugs in ten popular distributed systems in C/C++, Go, and Java. Among these bugs, 28 have been confirmed by the developers, 20 were unknown before, and 4 have been assigned with CVEs.
IEEE Micro	Ginkgo: A Learned Index Enhanced Tiered Memory System Xiran Yang, Yifei Yu, Chuandong li, Jianqiang Zeng, Ke Zhou, Diyu Zhou, Xiaolin Wang, Zhenlin Wang, Yingwei Luo PDF Compute Express Link (CXL), as an emerging high-speed interconnect protocol, offers a promising approach to memory expansion. Organizing fast DDR DRAM and large CXL memory as a tiered memory system is preferred to obtain the large memory capacity while maintaining high memory performance. Additionally, such a tiered memory system also adopts huge pages to alleviate address translation overhead. However, due to memory bloating issues in huge pages, existing hardware-based tiered memory management systems suffer from severe cache conflicts and DDR DRAM underutilization, resulting in significant performance degradation. We introduce Ginkgo, a learned-index enhanced tiered memory system, which builds a distribution-aware cache index mechanism while leveraging the learned index to minimize metadata overhead. Evaluation reveals that Ginkgo reduces average memory access latency with an average of 3.1% and 13.5% compared to state-of-the-art Alloy Cache and Unison Cache, respectively.
HPCA 25	Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA) Liren Zhu, Liujia Li, Jianyu Wu, Yiming Yao, Zhan Shi, Jie Zhang, Zhenlin Wang, Xiaolin Wang, Yingwei Luo, Diyu Zhou PDF Code Slides To reduce operational costs, modern data centers co-locate high-priority latency-critical (LC) tasks and low-priority best-effort (BE) tasks on the same physical node to increase resource utilization. However, such co-location leads to contention for memory bandwidth, resulting in priority inversion, where BE tasks severely slow down LC tasks. This priority inversion often leads to violations of the quality of service (QoS) requirements for LC tasks, defeating the purpose of co-location. Prior approaches to this issue either fail to enforce the QoS requirements for LC tasks or underutilize memory bandwidth.We present Pivot, a novel bandwidth partitioning system that overcomes the limitations of prior approaches based on two key insights. First, memory accesses from LC tasks must be prioritized across all the components on the memory path rather than a single component, as done in prior work. Second, only the scheduling of a selective portion of performance-critical loads (i.e., those causing a long stall on the re-order buffer), instead of all memory accesses from LC tasks, should be prioritized. To leverage these insights, Pivot overcomes the key challenge of accurately identifying performance-critical loads while incurring minimal runtime overhead by proposing a two-phase profiling technique. Our extensive evaluation shows that Pivot improves effective machine utilization by up to 34.5% while increasing the throughput of the BE applications by up to 2.76脳 compared to state-of-the-art approaches.
USENIX ATC 25	Asterinas: A Linux ABI-Compatible, Rust-Based Framekernel OS with a Small and Sound TCB 2025 USENIX Annual Techinical Conference Yuke Peng, Hongliang Tian, Junyang Zhang, Ruihan Li, Chengjun Chen, Jianfeng Jiang, Jinyi Xian, Yingwei Luo, Xiaolin Wang, Chenren Xu, Diyu Zhou, Shoumeng Yan, Yinqian Zhang Code How can one build a feature-rich, Rust-based operating system (OS) with a minimal and sound Trusted Computing Base (TCB) for memory safety? Existing Rust-based OSes fall short due to their improper usage of unsafe Rust in kernel development. To address these challenges, we propose a novel framekernel architecture that leverages Rust鈥檚 memory safety features to enable intra-kernel privilege separation, ensuring TCB minimality and soundness. We present OSTD, a streamlined framework for safe Rust OS development, and Asterinas, a Linux ABI-compatible framekernel OS implemented entirely in safe Rust using OSTD. Supporting over 180 Linux system calls, Asterinas delivers performance on par with Linux, while maintaining a memory safe TCB of just 10K lines of code鈥攁bout 17% of its total codebase. These results underscore the practicality and benefits of the framekernel architecture in building safe and efficient OSes

2024

TOCS	Hardware-Software Collaborative Tiered-Memory Management Framework for Virtualization ACM Transactions on Computer Systems Sai Sha, Chuandong Li, Xiaolin Wang, Zhenlin Wang, Yingwei Luo PDF The tiered-memory system can effectively expand the memory capacity for virtual machines (VMs). However, virtualization introduces new challenges specifically in enforcing performance isolation, minimizing context switching, and providing resource overcommit. None of the state-of-the-art designs consider virtualization and address these challenges; we observe that a VM with tiered memory incurs up to a 2脳 slowdown compared to a DRAM-only VM. We propose vTMM, a hardware-software collaborative tiered-memory management framework for virtualization. A key insight in vTMM is to leverage the unique system features in virtualization to meet the above challenges. vTMM automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. Specially, vTMM optimizes page tracking and migration based on page-modification logging (PML), a hardware-assisted virtualization mechanism, and adaptively distinguishes hot/cold pages through the page "temperature" sorting. vTMM also dynamically adjusts fast memory among multi-VMs on demand by using a memory pool. Further, vTMM tracks huge pages at regular-page granularity in hardware and splits/merges pages in software, realizing hybrid-grained page management and optimization. We implement and evaluate vTMM with single-grained page management on an Intel processor, and the hybrid-grained page management on a Sunway processor with hardware mode supporting hardware/software co-designs. Experiments show that vTMM outperforms existing tiered-memory management designs in virtualization.
Euro-Par 24	EKRM: Efficient Key-Value Retrieval Method to Reduce Data Lookup Overhead for Redis European Conference on Parallel and Distributed Processing Yiming Yao, Xiaolin Wang, Diyu Zhou, Liujia Li, Jianyu Wu, Liren Zhu, Zhenlin Wang, Yingwei Luo PDF As an open-source key-value system, Redis has been widely used in internet service stations. A key-value lookup in Redis usually involves several chained memory accesses, and the address translation overhead can significantly increase the lookup latency. This paper introduces a new software-based approach that can reduce chained memory accesses and total address translation overhead of lookup requests by placing key-value entries in a specially managed memory space organized as huge pages with a fast hash table and enabling a fast lookup approach with simple hash functions, while keeping the integrity of Redis data structure. The new approach brings up to 1.38脳 average speedup for the key-value retrieval process, and significantly reduces misses in TLB and last-level cache. It outperforms SLB, an address caching software approach and has match the performance to STLT, a software-hardware co-designed address-centric design.
BDMA 24	Building a High-Performance Graph Storage on Top of Tree-Structured Key-Value Stores Big Data Mining and Analytics Heng Lin, Zhiyong Wang, Shipeng Qi, Xiaowei Zhu, Chuntao Hong, Wenguang Chen, Yingwei Luo PDF Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications, including financial risk assessment, commodity recommendation, and data lineage tracking. While the principles and design of these databases have been the subject of some investigation, there remains a lack of comprehensive examination of aspects such as storage layout, query language, and deployment. The present study focuses on the design and implementation of graph storage layout, with a particular emphasis on tree-structured key-value stores. We also examine different design choices in the graph storage layer and present our findings through the development of TuGraph, a highly efficient single-machine graph database that significantly outperforms well-known Graph DataBase Management System (GDBMS). Additionally, TuGraph demonstrates superior performance in the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB) interactive benchmark.
ATC 24	Taming Hot Bloat Under Virtualization with HUGESCOPE 2024 USENIX Annual Technical Conference (ATC) Chuandong Li, Sai Sha, Yangqing Zeng, Xiran Yang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Diyu Zhou PDF Code Video Huge pages are effective in reducing the address translation overhead under virtualization. However, huge pages suffer from the hot bloat problem, where accesses to a huge page are skewed towards a few base pages (i.e., 4KB page), making the hypervisor (mistakenly) classify the whole huge page as hot. Hot bloat renders several critical techniques used in virtualization ineffective, including tiered memory and page sharing. Prior work addressing hot bloat either requires hardware modification or targets a specific scenario and is not applicable to a hypervisor. This paper presents HugeScope, a lightweight, effective and generic system that addresses the hot bloat problem under virtualization based on commodity hardware. HugeScope includes an efficient and precise page tracking mechanism, leveraging the other level of indirect memory translation in the hypervisor. HugeScope provides a generic framework to support page splitting and coalescing policies, considering the memory pressure, as well as the recency, frequency, and skewness of page access. Moreover, HugeScope is general and modular, thereby can be easily applied to various scenarios concerning hot bloat, including tiered memory management (HS-TMM) and page sharing (HS-Share). Evaluation shows that HugeScope incurs less than 4% overhead, and by addressing hot bloat, HS-TMM improves performance by up to 61% over vTMM while HS-Share saves 41% more memory than Ingens while offering comparable performance.

2023

IEEE TCC	Multi-Tenant In-memory Key-Value Cache Partitioning Using Efficient Random Sampling-Based LRU Model IEEE Transactions on Cloud Computing Yuchen Wang, Junyao Yang, Zhenlin Wang PDF In-memory key-value caches are widely used as a performance-critical layer in web applications, disk-based storage, and distributed systems. The Least Recently Used (LRU) replacement policy has become the de facto standard in those systems since it exploits workload locality well. However, the LRU implementation can be costly due to the rigid data structure in maintaining object priority, as well as the locks for object order updating. Redis as one of the most effective and prevalent deployed commercial systems adopts an approximated LRU policy, where the least recently used item from a small, randomly sampled set of items is chosen to evict. This random sampling-based policy is lightweight and shows its flexibility. We observe that there can exist a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling size Ks. Therefore existing LRU miss ratio curve (MRC) construction techniques cannot be directly applied without loss of accuracy. In this article, we introduce a new probabilistic stack algorithm named KRR to accurately model random sampling based-LRU, and extend it to handle both fixed and variable objects in key-value caches. We present an efficient stack update algorithm that reduces the expected running time of KRR significantly. To improve the performance of the in-memory multi-tenant key-value cache that utilizes random sampling-based replacement, we propose kRedis, a reference locality- and latency-aware memory partitioning scheme. kRedis guides the memory allocation among the tenants and dynamically customizes K to better exploit the locality of each individual tenant. Evaluation results over diverse workloads show that our model generates accurate miss ratio curves for both fixed and variable object size workloads, and enables practical, low-overhead online MRC prediction. Equipped with KRR, kRedis delivers up to a 50.2% average access latency reduction, and up to a 262.8% throughput improvement compared to Redis.
EuroSys 23	FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance European Conference on Computer Systems (EuroSys) Jun Xiao, Yaocheng Xiang, Xiaolin Wang, Yingwei Luo, Andy Pimentel, Zhenlin Wang PDF The cache Miss Ratio Curve (MRC) serves a variety of purposes such as cache partitioning, application profiling and code tuning. In this work, we propose a new metric, called cache miss distribution, that describes cache miss behavior over cache sets, for predicting cache MRCs. Based on this metric, we present FLORIA, a software-based, online approach that approximates cache MRCs on commodity systems. By polluting a tunable number of cache lines in some selected cache sets using our designed microbenchmark, the cache miss distribution for the target workload is obtained via hardware performance counters with the support of precise event based sampling (PEBS). A model is developed to predict the MRC of the target workload based on its cache miss distribution. We evaluate FLORIA for systems consisting of a single application as well as a wide range of different workload mixes. Compared with the state-of-the-art approaches in predicting online MRCs, FLORIA achieves the highest average accuracy of 97.29% with negligible overhead. It also allows fast and accurate estimation of online MRC within 5ms, 20X faster than the state-of-the-art approaches. We also demonstrate that FLORIA can be applied to guiding cache partitioning for multiprogrammed workloads, helping to improve overall system performance.
EuroSys 23	vTMM: Tiered Memory Management for Virtual Machines European Conference on Computer Systems Sai Sha, Chuandong Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF The memory demand of virtual machines (VMs) is increasing, while the traditional DRAM-only memory system has limited capacity and high power consumption. The tiered memory system can effectively expand the memory capacity and increase the cost efficiency. Virtualization introduces new challenges for memory tiering, specifically enforcing performance isolation, minimizing context switching, and providing resource overcommit. However, none of the state-of-the-art designs consider virtualization and thus address these challenges; we observe that a VM with tiered memory incurs up to a 2脳 slowdown compared to a DRAM-only VM. This paper proposes vTMM, a tiered memory management system specifically designed for virtualization. vTMM automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. A key insight in vTMM is to leverage the unique system characteristics in virtualization to meet the above challenges. Specifically, vTMM tracks memory accesses with page-modification logging (PML) and a multi-level queue design. Next, vTMM quantifies the page "temperature" and makes a fine-grained page classification with bucket-sorting. vTMM performs page migration with PML while providing resource overcommit by transparently resizing VM memory through the two-dimensional page tables. In combination, the above techniques minimize overhead, ensure performance isolation and provide dynamic memory partitioning to improve the overall system performance. We evaluate vTMM on a real DRAM+NVM system and a simulated CXL-Memory system. The results show that vTMM outperforms NUMA balancing, Intel Optane memory mode and Nimble (an OS-level tiered memory management system) for VM tiered memory management. Multi-VM co-running results show that vTMM improves the performance of a DRAM+NVM system by 50%--140% and a CXL-Memory system by 16% -- 40%, respectively.
CoRR	FHPM: Fine-grained Huge Page Management For Virtualization Computing Research Repository Chuandong Li, Sai Sha, Yangqing Zeng, Xiran Yang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF As more data-intensive tasks with large footprints are deployed in virtual machines (VMs), huge pages are widely used to eliminate the increasing address translation overhead. However, once the huge page mapping is established, all the base page regions in the huge page share a single extended page table (EPT) entry, so that the hypervisor loses awareness of accesses to base page regions. None of the state-of-the-art solutions can obtain access information at base page granularity for huge pages. We observe that this can lead to incorrect decisions by the hypervisor, such as incorrect data placement in a tiered memory system and unshared base page regions when sharing pages. This paper proposes FHPM, a fine-grained huge page management for virtualization without hardware and guest OS modification. FHPM can identify access information at base page granularity, and dynamically promote and demote pages. A key insight of FHPM is to redirect the EPT huge page directory entries (PDEs) to new companion pages so that the MMU can track access information within huge pages. Then, FHPM can promote and demote pages according to the current hot page pressure to balance address translation overhead and memory usage. At the same time, FHPM proposes a VM-friendly page splitting and collapsing mechanism to avoid extra VM-exits. In combination, FHPM minimizes the monitoring and management overhead and ensures that the hypervisor gets fine-grained VM memory accesses to make the proper decision. We apply FHPM to improve tiered memory management (FHPM-TMM) and to promote page sharing (FHPM-Share). FHPM-TMM achieves a performance improvement of up to 33% and 61% over the pure huge page and base page management. FHPM-Share can save 41% more memory than Ingens, a state-of-the-art page sharing solution, with comparable performance.

2022

TPDS	Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters IEEE Transactions on Parallel and Distributed Systems Zhisheng Ye, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo PDF Modern GPU clusters are designed to support distributed Deep Learning jobs from multiple tenants concurrently. Each tenant may have varied and dynamic resource demands. Unfortunately, existing GPU schedulers fail to thoroughly consider the fairness among the tenants and jobs, which can result in unbalanced resource allocation and unfair user experience. In this article, we present an efficient solution to provide strong fairness while maintaining high scheduling effectiveness in multi-tenant GPU clusters. First, we introduce a novel Long-Term GPU-time Fairness metric, which can comprehensively evaluate the fairness at both the tenant and job levels, based on both the temporal and spatial impacts of resource allocation. Second, we design a new and practical GPU scheduler, Astraea, to enforce the desired fairness among tenants and jobs. Large-scale evaluations show that Astraea can improve tenant fairness by up to 9.42× compared to state-of-the-art schedulers, without sacrificing the average job completion time.
IPDPS 22	GSpecPal: Speculation-Centric Finite State Machine Parallelization on GPUs 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Yuguang Wang, Robbie Watling, Junqiao Qiu, Zhenlin Wang PDF Finite State Machine (FSM) plays a critical role in many real-world applications, ranging from pattern matching to network security. In recent years, significant research efforts have been made to accelerate FSM computations on different parallel platforms, including multicores, GPUs, and DRAM-based accelerators. A popular direction is the speculation-centric parallelization. Despite their abundance and promising results, the benefits of speculation-centric FSM parallelization on GPUs heavily depend on high speculation accuracy and are greatly limited by the inefficient sequential recovery. Inspired by speculative data forwarding used in Thread Level Speculation (TLS), this work addresses the existing bottlenecks by introducing speculative recovery with two heuristics for thread scheduling, which can effectively remove redundant computations and increase the GPU thread utilization. To maximize the performance of running FSMs on GPUs, this work integrates different speculative parallelization schemes into a latency-sensitive framework, GSpecPal, along with a scheme selector which aims to automatically configure the optimal GPU-based parallelization for a given FSM. Evaluation on a set of real-world FSMs with diverse characteristics confirms the effectiveness of GSpecPal. Experimental results show that GSpecPal can obtain 7.2脳 speedup on average (up to 20脳) over the state-of-the-art on an Nvidia GeForce RTX 3090 GPU.
TC	Accelerating Address Translation for Virtualization by Leveraging Hardware Mode IEEE Transactions on Computers Sai Sha, Chuandong Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF The overhead of memory virtualization remains nontrivial. The traditional shadow paging (TSP) resorts to a shadow page table (SPT) to achieve the native page walk speed, but page table updates require hypervisor interventions. Alternatively, nested paging enables low-overhead page table updates, but utilizes the hardware MMU to perform a long-latency two-dimensional page walk. This paper proposes new memory virtualization solutions based on hardware (machine) mode—the highest CPU privilege level in some architectures like Sunway and RISC-V. A programming interface, running in hardware mode, enables software-implementation of hardware support functions. We first propose Software-based Nested Paging (SNP), which extends the software MMU to perform a two-dimensional page walk in hardware mode. Second, we present Swift Shadow Paging (SSP), which accomplishes page table synchronization by intercepting TLB flushing in hardware mode. Finally we propose Accelerated Shadow Paging (ASP) combining SSP and SNP. ASP handles the last-level SPT page faults by walking two-dimensional page tables in hardware mode, which eliminates most hypervisor interventions. This paper systematically compares multiple memory virtualization models by analyzing their designs and evaluating their performance both on a real system and a simulator. The experiments show that the virtualization overhead of ASP is less than 4.5% for all workloads.
ICPC 22	Exploring GNN based program embedding technologies for binary related tasks International Conference on Program Comprehension Yixin Guo, Pengcheng Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF With the rapid growth of program scale, program analysis, maintenance and optimization become increasingly diverse and complex. Applying learning-assisted methodologies onto program analysis has attracted ever-increasing attention. However, a large number of program factors including syntax structures, semantics, running platforms and compilation configurations block the effective realization of these methods. To overcome these obstacles, existing works prefer to be on a basis of source code or abstract syntax tree, but unfortunately are sub-optimal for binary-oriented analysis tasks closely related to the compilation process. To this end, we propose a new program analysis approach that aims at solving program-level and procedure-level tasks with one model, by taking advantage of the great power of graph neural networks from the level of binary code. By fusing the semantics of control flow graphs, data flow graphs and call graphs into one model, and embedding instructions and values simultaneously, our method can effectively work around emerging compilation-related problems. By testing the proposed method on two tasks, binary similarity detection and dead store prediction, the results show that our method is able to achieve as high accuracy as 83.25%, and 82.77%.
ICCD 22	Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster International Conference on Computer Design (ICCD) Zehua Yang, Zhisheng Ye, Tianhao Fu, Jing Luo, Xiong Wei, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Tianwei Zhang PDF With the proliferation of deep learning, there exists a strong need to efficiently operate GPU clusters for deep learning production in giant AI companies, as well as for research and development (R&D) in small-sized research institutes and universities. Existing works have performed thorough trace analysis on large-scale production-level clusters in giant companies, which discloses the characteristics of deep learning production jobs and motivates the design of scheduling frameworks. However, R&D clusters significantly differ from production-level clusters in both job properties and user behaviors, calling for a different scheduling mechanism. In this paper, we present a detailed workload characterization of an R&D cluster, CloudBrain-I, in a research institute, Peng Cheng Laboratory. After analyzing the fine-grained resource utilization, we discover a severe problem for R&D clusters, resource underutilization, which is especially important in R&D clusters while not characterised by existing works. We further investigate two specific underutilization phenomena and conclude several implications and lessons on R&D cluster scheduling. The traces will be open-sourced to motivate further studies in the community.
SC 22	Graph Neural Networks Based Memory Inefficiency Detection Using Selective Sampling 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) Pengcheng Li, Yixin Guo, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Xu Liu PDF Production software of data centers oftentimes suffers from unnecessary memory inefficiencies caused by inappropriate use of data structures, conservative compiler optimizations, and so forth. Nevertheless, whole-program monitoring tools often incur incredibly high overhead due to fine-grained memory access instrumentation. Consequently, the fine-grained monitoring tools are not viable for long-running, large-scale data center applications due to strict latency criteria (e.g., service-level agreement or SLA). To this end, this work presents a novel learning-aided system, namely Puffin, to identify three kinds of unnecessary memory operations including dead stores, silent loads and silent stores, by applying gated graph neural networks onto fused static and dynamic program semantics with respect to relative positional embedding. To deploy the system in large-scale data centers, this work explores a sampling-based detection infrastructure with high efficacy and negligible overhead. We evaluate Puffin upon the well-known SPEC CPU 2017 benchmark suite for four compilation options. Experimental results show that the proposed method is able to capture the three kinds of memory inefficiencies with as high accuracy as 96% and a reduced checking overhead by 5.66脳 over the state-of-the-art tool.
ICASSP 22	A Question-Oriented Propagation Network for News Reading Comprehension. International Conference on Acoustics, Speech and Signal Processing Liang Wen, Houfeng Wang, Dehong Ma, Jun Fan, Yingwei Luo, Xiaolin Wang, Daiting Shi, Zhicong Cheng, Dawei Yin PDF Machine reading comprehension of news articles remains to be a challenging task since the lengths of its context documents are long. Such reading comprehension task usually requires document-level language understanding while state-of-the-art, pretrained question answering models can only encode sequences with a predefined length limit. In this paper, we propose a novel Question-Oriented Propagation Network (QOPN) model for such task. Specifically, our proposed QOPN first uses a context encoding module to find local question-related clues. Then, it employs a multi-step reasoning module to aggregate question-focused information for iterative reasoning. The novel design put emphasis on capturing question-related information and allow long-range information integration, which is especially beneficial for long-context reading comprehension task. Experiments on two challenging machine comprehension datasets show that the proposed QOPN significantly outperforms previous state-of-the-art models.
EMNLP 22	M3: A Multi-View Fusion and Multi-Decoding Network for Multi-Document Reading Comprehension Empirical Methods in Natural Language Processing Liang Wen, Houfeng Wang, Yingwei Luo, Xiaolin Wang PDF Multi-document reading comprehension task requires collecting evidences from different documents for answering questions. Previous research works either use the extractive modeling method to naively integrate the scores from different documents on the encoder side or use the generative modeling method to collect the clues from different documents on the decoder side individually. However, any single modeling method cannot make full of the advantages of both. In this work, we propose a novel method that tries to employ a multi-view fusion and multi-decoding mechanism to achieve it. For one thing, our approach leverages question-centered fusion mechanism and cross-attention mechanism to gather fine-grained fusion of evidence clues from different documents in the encoder and decoder concurrently. For another, our method simultaneously employs both the extractive decoding approach and the generative decoding method to effectively guide the training process. Compared with existing methods, our method can perform both extractive decoding and generative decoding independently and optionally. Our experiments on two mainstream multi-document reading comprehension datasets (Natural Questions and TriviaQA) demonstrate that our method can provide consistent improvements over previous state-of-the-art methods.
CoRR	HMM-V: Heterogeneous Memory Management for Virtualization Computing Research Repository Sai Sha, Chuandong Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF The memory demand of virtual machines (VMs) is increasing, while DRAM has limited capacity and high power consumption. Non-volatile memory (NVM) is an alternative to DRAM, but it has high latency and low bandwidth. We observe that the VM with heterogeneous memory may incur up to a 1.5× slowdown compared to a DRAM VM, if not managed well. However, none of the state-of-the-art heterogeneous memory management designs are customized for virtualization on a real system. In this paper, we propose HMM-V, a Heterogeneous Memory Management system for Virtualization. HMM-V automatically determines page hotness and migrates pages between DRAM and NVM to achieve performance close to the DRAM system. First, HMM-V tracks memory accesses through page table manipulation, but reduces the cost by leveraging Intel page-modification logging (PML) and a multi-level queue. Second, HMM-V quantifies the ``temperature' of page and determines the hot set with bucket-sorting. HMM-V then efficiently migrates pages with minimal access pause and handles dirty pages with the assistance of PML. Finally, HMM-V provides pooling management to balance precious DRAM across multiple VMs to maximize utilization and overall performance. HMM-V is implemented on a real system with Intel Optane DC persistent memory. The four-VM co-running results show that HMM-V outperforms NUMA balancing and hardware management (Intel Optane memory mode) by 51% and 31%, respectively.
COLING 22	Original Content Is All You Need! an Empirical Study on Leveraging Answer Summary for WikiHowQA Answer Selection Task International Conference on Computational Linguistics Liang Wen, Juan Li, Houfeng Wang, Yingwei Luo, Xiaolin Wang, Xiaodong Zhang, Zhicong Cheng, Dawei Yin PDF Answer selection task requires finding appropriate answers to questions from informative but crowdsourced candidates. A key factor impeding its solution by current answer selection approaches is the redundancy and lengthiness issues of crowdsourced answers. Recently, Deng et al. (2020) constructed a new dataset, WikiHowQA, which contains a corresponding reference summary for each original lengthy answer. And their experiments show that leveraging the answer summaries helps to attend the essential information in original lengthy answers and improve the answer selection performance under certain circumstances. However, when given a question and a set of long candidate answers, human beings could effortlessly identify the correct answer without the aid of additional answer summaries since the original answers contain all the information volume that answer summaries contain. In addition, pretrained language models have been shown superior or comparable to human beings on many natural language processing tasks. Motivated by those, we design a series of neural models, either pretraining-based or non-pretraining-based, to check wether the additional answer summaries are helpful for ranking the relevancy degrees of question-answer pairs on WikiHowQA dataset. Extensive automated experiments and hand analysis show that the additional answer summaries are not useful for achieving the best performance.

2021

VEE 21	Swift shadow paging (SSP): no write-protection but following TLB flushing International Conference on Virtual Execution Environments Sai Sha, Yi Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF Virtualization is a key technique for supporting cloud services and memory virtualization is a major component of virtualization technology. Common memory virtualization mechanisms include shadow paging and hardware-assisted paging. The shadow paging model needs to synchronize shadow/guest page tables whenever there is a guest page table update. In the design of traditional shadow paging (TSP), the guest page table pages are write-protected so the updates can be intercepted by the hypervisor to ensure synchronization. Frequent page table updates cause lots of VM_Exits. Researchers have developed hardware-assisted paging to eliminate this overhead. However, address translation needs to walk a two-dimensional page table. This design significantly increases the overhead of page walk. This paper proposes SSP, a Swift Shadow Paging model which leverages the privileged hardware mode. In this design, the write protection mechanism is no longer needed. Rather, SSP accomplishes lazy page table synchronization by intercepting TLB flushing, which must be initiated by the guest OS when there is a page table update. The hardware mode, such as RISC-V鈥檚 machine mode and Sunway鈥檚 hardware mode, with the highest privilege, opens a new door for communication between the host OS and a guest OS. In addition, by using a shadow page table base address buffer, SSP eliminates the VM_Exits generated by guest process context switching. SSP inherits the advantage of TSP as it remains as a software-only solution and does not incur the excessive overhead of page walk when compared to hardware-assisted paging. We implement SSP in a Sunway machine. Our evaluation demonstrates SSP鈥檚 advantage for multiple workloads. Compared with TSP, SSP reduces VM_Exits caused by memory virtualization by 23%-56%. And the virtualization overhead of SSP is less than 5.5% for all workloads.
TOS	Penalty- and Locality-aware Memory Allocation in Redis Using Enhanced AET ACM Transactions on Storage Cheng Pan, Xiaolin Wang, Yingwei Luo, Zhenlin Wang PDF Due to large data volume and low latency requirements of modern web services, the use of an in-memory key-value (KV) cache often becomes an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from the traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used or its approximations. However, the diversity of miss penalty distinguishes a KV cache from a hardware cache. Inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also demonstrate locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance the existing cache model, the Average Eviction Time model, so that it can adapt to modeling a KV cache. After that, we apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which synthesizes data locality and miss penalty, in a quantitative manner, to guide memory allocation and replacement in Redis. At the same time, we also explore the diurnal behavior of a KV store and exploit long-term reuse. We replace the original passive eviction mechanism with an automatic dump/load mechanism, to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces the average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitative predictability of performance. Moreover, we can obtain even lower average latency (1.1%∼5.5%) when dynamically switching policies between pRedis and HC.
ICPP 21	Efficient Modeling of Random Sampling-Based LRU 50th International Conference on Parallel Processing (ICPP) Junyao Yang, Yuchen Wang, Zhenlin Wang PDF The Miss Ratio Curve (MRC) is an important metric and effective tool for caching system performance prediction and optimization. Since the Least Recently Used (LRU) replacement policy is the de facto policy for many existing caching systems, most previous studies on efficient MRC construction are predominantly focused on the LRU replacement policy. Recently, the random sampling-based replacement mechanism, as opposed to replacement relying on the rigid LRU data structure, gains more popularity due to its lightweight and flexibility. To approximate LRU, at replacement times, the system randomly selects K objects and replaces the least recently used object among the sample. Redis implements this approximated LRU policy. We observe that there can exist a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling size K; therefore existing LRU MRC construction techniques cannot be directly applied to random sampling based LRU cache without loss of accuracy. In this work, we present a new probabilistic stack algorithm named KRR which can be used to accurately model random sampling based-LRU under arbitrary sampling size K. We propose two efficient stack update algorithms which reduce the expected running time of KRR from O(NM) to O(Nlog2M) and O(N*logM), respectively, where N is the workload length and M is the number of distinct objects. Furthermore, we adopt spatial sampling which further reduces the running time of KRR by several orders of magnitude, and thus enables practical, low overhead online application of KRR.
DAC 21	GRAPHSPY: Fused Program Semantic Embedding through Graph Neural Networks for Memory Efficiency ACM/IEEE Design Automation Conference Yixin Guo, Pengcheng Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF Production software oftentimes suffers from unnecessary memory inefficiencies caused by inappropriate use of data structures, programming abstractions, or conservative compiler optimizations. Unfortunately, existing works often adopt a whole-program fine-grained monitoring method incurring incredibly high overhead. This work proposes a learning-aided approach to identify unnecessary memory operations, by applying several prevalent graph neural network models to extract program semantics with respect to program structure, execution semantics and dynamic states. Results show that the proposed approach captures memory inefficiencies with high accuracy of 95.27% and only around 17% overhead of the state-of-the-art.

2020

MemSys 20	Dynamically Configuring LRU Replacement Policy in Redis The International Symposium on Memory Systems (MemSys) Yuchen Wang, Junyao Yang, Zhenlin Wang PDF To reduce the latency of accessing backend servers, today鈥檚 web services usually adopt in-memory key-value stores in the front end which cache the frequently accessed objects. Memcached and Redis are two most popular key-value cache systems. Due to the limited size of memory, an in-memory key-value store needs to be configured with a fixed amount of memory, i.e., cache size, and cache replacement is unavoidable when the footprint of accessed objects is larger than the cache size. Memcached implements the least recently used (LRU) policy. Redis adopts an approximated LRU policy to avoid maintaining LRU list structures. On a replacement, Redis samples pre-configured K keys, adds them to the eviction pool, and then chooses the LRU key from the eviction pool for eviction. We name this policy approx-K-LRU. We find that approx-K-LRU behaves close to LRU when K is large. However, different Ks can yield different miss ratios. On the other hand, the sampling and replacement decision itself results in an overhead that is related to K. This paper proposes DLRU (Dynamic LRU), which explores this configurable parameter and dynamically sets K. DLRU utilizes a low-overhead miniature cache simulator to predict miss ratios of different Ks and adopts a cost model to estimate the performance trade-offs. Our experimental results show that DLRU is able to improve Redis throughput over the recommended, default approx-5-LRU by up to 32.5% for a set of storage traces.
JCT	Huge Page Friendly Virtualized Memory Management Journal of Computer Science and Technology Sai Sha, Jingyuan Hu, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF With the rapid increase of memory consumption by applications running on cloud data centers, we need more efficient memory management in a virtualized environment. Exploiting huge pages becomes more critical for a virtual machine’s performance when it runs large working set size programs. Programs with large working set sizes are more sensitive to memory allocation, which requires us to quickly adjust the virtual machine’s memory to accommodate memory phase changes. It would be much more efficient if we could adjust virtual machines’ memory at the granularity of huge pages. However, existing virtual machine memory reallocation techniques, such as ballooning, do not support huge pages. In addition, in order to drive effective memory reallocation, we need to predict the actual memory demand of a virtual machine. We find that traditional memory demand estimation methods designed for regular pages cannot be simply ported to a system adopting huge pages. How to adjust the memory of virtual machines timely and effectively according to the periodic change of memory demand is another challenge we face. This paper proposes a dynamic huge page based memory balancing system (HPMBS) for efficient memory management in a virtualized environment. We first rebuild the ballooning mechanism in order to dispatch memory in the granularity of huge pages. We then design and implement a huge page working set size estimation mechanism which can accurately estimate a virtual machine’s memory demand in huge pages environments. Combining these two mechanisms, we finally use an algorithm based on dynamic programming to achieve dynamic memory balancing. Experiments show that our system saves memory and improves overall system performance with low overhead.
CoRR	GRAPHSPY: Fused Program Semantic-Level Embedding via Graph Neural Networks for Dead Store Detection Computing Research Repository Yixin Guo, Pengcheng Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang PDF Production software oftentimes suffers from the issue of performance inefficiencies caused by inappropriate use of data structures, programming abstractions, and conservative compiler optimizations. It is desirable to avoid unnecessary memory operations. However, existing works often use a whole-program fine-grained monitoring method with incredibly high overhead. To this end, we propose a learning-aided approach to identify unnecessary memory operations intelligently with low overhead. By applying several prevalent graph neural network models to extract program semantics with respect to program structure, execution order and dynamic states, we present a novel, hybrid program embedding approach so that to derive unnecessary memory operations through the embedding. We train our model with tens of thousands of samples acquired from a set of real-world benchmarks. Results show that our model achieves 90% of accuracy and incurs only around a half of time overhead of the state-of-art tool.