Research

Research at The Semantic AI spans semantic routing, multimodal inference, evaluation, and agent systems.

This page highlights publications from the community and its incubation ecosystem.

Publications

Selected papers, vision papers, and technical reports from across the community and its incubation ecosystem.

2026

Research Publications

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Authors: vLLM Semantic Router Team

Venue: arXiv Technical Report

Signal-driven decision routing for Mixture-of-Modality deployments across cost, privacy, latency, and safety constraints.

Paper

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

Venue: arXiv Technical Report

A synthesis of routing, fleet, multimodal, and governance results into the Workload-Router-Pool architecture.

Paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

A security treatment of perception failures in computer-using agents with a dual-channel guardrail for click targets and action reasoning.

Paper

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

OATS (Outcome-Aware Tool Selection) improves semantic-router tool ranking under single-digit millisecond CPU budgets without serving-time model inference.

Paper

Adaptive Vision-Language Model Routing for Computer Use Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Adaptive VLM Routing estimates action difficulty and routes each computer-use step to the cheapest model that meets a reliability target.

Paper

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Flash Attention, prompt compression, and near-streaming body processing reduce routing latency from seconds to tens of milliseconds.

Paper

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

A queueing-theory-grounded fleet planner and simulator for sizing multi-pool GPU fleets against P99 TTFT targets.

Paper

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

An analytical method for deriving minimum-cost two-pool fleets directly from workload CDFs and P99 TTFT targets.

Paper

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Venue: arXiv Technical Report

An analytical result showing context-length routing topology can matter more than pure GPU generation upgrades for tokens-per-watt.

Paper

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Authors: Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

Venue: arXiv Technical Report

A framework for conflict detection and prevention when probabilistic ML predicates can silently co-fire in routing policy languages.

Paper

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu

Venue: arXiv Technical Report

A cross-layer extension of the Semantic Router DSL from stateless request routing into multi-step agent workflows.

Paper

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Venue: arXiv Technical Report

Conversational memory and retrieval-grounded routing recover most of a 235B model’s performance while cutting effective inference cost by 96%.

Paper

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

Venue: SIGIR 2026 Industry Track

A real-time verification component for long-document RAG that preserves grounding checks without falling back to truncated validation.

Paper

2025

Research Publications

When to Reason: Semantic Router for vLLM

Authors: Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

Venue: NeurIPS 2025 MLForSys Workshop

A semantic router that classifies queries by reasoning need and selectively applies reasoning only when beneficial.

Paper

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

A category-aware semantic caching architecture where similarity thresholds, TTLs, and quotas vary by workload class.

Paper