Xiaoyou Wu

Computer Engineering · Georgia Institute of Technology

I'm a Computer Engineering student at Georgia Tech working on efficient inference for large generative models and video generation for world models. My research spans diffusion language model acceleration, GPU–PIM hardware co-design, and autoregressive world modeling for closed-loop robot control.

Interests
LLM Inference · Diffusion LMs · Quantization · GPU–PIM Systems · World Modeling

Research Papers

  1. Under Submission · May 2026

    BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

    Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan (Celine) Lin · Georgia Institute of Technology

    Abstract

    Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6% on average and achieves a 1.33× average end-to-end speedup over Fast-dLLM while preserving accuracy.

  2. Under Submission · April 2026

    Hy²: Accelerating Hybrid Mamba-Transformer Serving Through Adaptive GPU–PIM Co-Design

    Cheng-Jhih Shih, Jishnu Das, Sung Bin Kim, Chung-En Ho, Wei-Hsing Huang, Xiaoyou Wu, Binfei Ji, How-Shing Wang, Riley Corzine, Shimeng Yu, Yingyan (Celine) Lin · Georgia Institute of Technology

    Abstract

    Hybrid Mamba-Transformer models have emerged as a promising alternative to pure Transformers because they combine the constant-time decoding of state space models (SSMs) with the in-context recall of attention, promising higher decode throughput, smaller KV-cache footprint, and lower serving cost. However, they create a new serving challenge: different operators within the same model have different hardware affinities, the model maintains two structurally different state types (growing KV caches and fixed-size SSM states), and those affinities shift at runtime under continuous batching. GPU-only execution underutilizes GPU compute on bandwidth-bound SSM scans and attention general matrix-vector multiplications (GEMVs), while compute-bound operators in the same model still demand GPU-class throughput—making GPU-PIM architectures a natural fit. However, existing GPU serving engines assume single-device execution, and prior GPU-PIM systems largely target more uniform Transformer workloads with fixed or limited operator partitioning. We present Hy² (Hybrid model on Hybrid hardware), a co-design framework for serving hybrid models on GPU-PIM hardware. Hy² jointly addresses three coupled challenges: tailoring PIM microarchitecture to the hybrid operator mix, managing heterogeneous KV-cache and SSM states across GPU and PIM memory, and adapting operator placement per layer and per iteration as continuous batching shifts batch composition. Hy² integrates workload-guided PIM design-space exploration, a unified paged-slab memory manager, and an adaptive scheduler with per-request SLO enforcement. Across five hybrid models, Hy² achieves 1.51× average throughput over prior GPU-PIM baselines and 2.47× over GPU-only baselines, while reducing P99 tail latency to 0.61× that of prior GPU-PIM baselines.

Currently Thinking

  • Does block-size diversity reveal something fundamental about how diffusion models represent uncertainty?

    Different block sizes don't just trade speed for quality — they carve different trajectories through the same probability landscape. The bifurcation points where branches diverge are exactly where the model is most uncertain: semantically loaded tokens where the distribution is wide. The re-agreement at the end is the model's confidence collapsing back to a mode. BlockBatch is, in some sense, a structured way to sample that uncertainty and extract signal from it without the cost of full independent samples.

  • Why is memory bandwidth — not compute — the binding constraint in modern LLM serving?

    During decode, each forward pass streams all of the model's weights out of HBM to produce a single token per sequence, so at small batch sizes arithmetic intensity collapses to a few FLOPs per byte. The roofline isn't a theoretical toy; it's an indictment. A GPU rated at 312 TFLOP/s sits at around 3% utilization because its HBM (about 2 TB/s) can't feed it fast enough. PIM's value isn't raw bandwidth; it's that it puts compute where the data lives, turning the memory wall into a foundation rather than a ceiling.

  • Why does negative feedback deserve more credit than any algorithm in modern engineering?

    Every elegant system I've studied traces back to the same mechanism: measure the error, oppose it. The Laplace transform turns this loop into algebra, but the intuition is older — it's how thermostats work, how pupils dilate, how markets (imperfectly) clear. What makes it profound is that a system doesn't need to understand its goal; it only needs to know how far it is from it. Intelligence, reduced to a signed difference.

  • Is there a deeper analogy between the SSM state and working memory in biological cognition?

    SSMs maintain a fixed-size recurrent state that's updated in place at every token, so the state has to compress the entire causal history into a bounded representation. Biological working memory does the same: selective, lossy, capacity-constrained. The difference is that attention can selectively retrieve from an unbounded KV cache, while an SSM must decide what to keep as it goes. Hybrid models aren't just an engineering compromise; they might be probing something real about the architecture of memory itself.
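
The memory-bandwidth note above can be checked with back-of-envelope roofline arithmetic. The sketch below is a rough model, not a measurement: it assumes 16-bit weights (2 bytes per parameter), counts only weight traffic (ignoring KV-cache and activation reads), and treats each weight as used for one multiply-add per sequence in the batch. The function name `decode_roofline` and its parameters are made up for illustration; the 312 TFLOP/s and 2 TB/s figures are the ones quoted in the note.

```python
def decode_roofline(peak_flops, mem_bw, bytes_per_param=2.0, batch_size=1):
    """Return (arithmetic intensity in FLOP/byte, attainable fraction of peak compute).

    Assumption: every weight is read from memory once per decode step and
    contributes one multiply-add (2 FLOPs) per sequence in the batch.
    """
    intensity = 2.0 * batch_size / bytes_per_param
    # Roofline: attainable throughput is capped by compute or by bandwidth.
    attainable = min(peak_flops, intensity * mem_bw)
    return intensity, attainable / peak_flops


PEAK_FLOPS = 312e12  # peak compute quoted in the note, FLOP/s
HBM_BW = 2e12        # HBM bandwidth quoted in the note, bytes/s

for batch in (1, 4, 16, 64, 256):
    ai, util = decode_roofline(PEAK_FLOPS, HBM_BW, batch_size=batch)
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  utilization <= {util:.1%}")
```

Under these assumptions a single-sequence decode step tops out below 1% of peak compute, and utilization only reaches a few percent once several sequences share each weight read. The ridge point sits at 312e12 / 2e12 = 156 FLOP/byte, so in this simplified model weight-bound decode stays bandwidth-limited until the batch is well over a hundred sequences.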