Xiaoyou Wu

Research Projects

BlockBatch: Multi-Scale Consensus Decoding for Efficient dLLM Inference

A training-free inference framework for diffusion language models (dLLMs) that executes multiple block-size branches in a single batched forward pass, reducing the number of denoising function evaluations (NFEs) by 26.6% and achieving a 1.33× speedup over Fast-dLLM.

May 2026
Diffusion LLM, NLP, Inference Optimization, Machine Learning, Research, Python
View Project
Hy²: Accelerating Hybrid Mamba-Transformer Serving Through Adaptive GPU–PIM Co-Design

A co-design framework for serving hybrid Mamba-Transformer models on GPU–PIM heterogeneous hardware, achieving 1.51× throughput over prior art and 2.47× over GPU-only baselines while reducing tail latency to 0.61×.

May 2026
GPU-PIM, LLM Serving, Mamba, Hardware Architecture, Systems, Research
View Project
HiDeS: Hierarchical Delta Sparsity for Efficient Diffusion LLM Inference

An algorithm–system co-design that exploits cross-step temporal redundancy at three independent granularities (layers, tokens, and columns), achieving 2.66× and 2.30× throughput over dense baselines on A100 and H100, respectively, with negligible accuracy loss.

May 2026
Diffusion LLM, Sparse Execution, GPU Kernels, Inference Optimization, Systems, Research
View Project

Legacy Projects

0 projects

Robotics, embedded systems, computer vision, AI, and more
