vllm
Python · Easy · vllm-project/vllm
75,721 stars
15,332 forks
182 open issues
Active Apr 2026
Beginner-Friendly Issues 182
Issues tagged for new contributors
[RFC]: Selective KV Cache offload
#39305 · Apr 8, 2026
RFC
[Feature]: Can you achieve this
#39304 · Apr 8, 2026
feature request
[RFC]: Handle GDN prefill kernel JIT compilation failures - seeking community input
#39287 · Apr 8, 2026
RFC
[Feature]: Is Multi-Token Prediction (MTP) support planned for Qwen3.5 training?
#39282 · Apr 8, 2026
feature request
[Bug]: Qwen3.5 crashes when using suffix-decoding
#39271 · Apr 8, 2026
bug
[Bug]: Gemma 4 offline inference outputs gibberish
#39265 · Apr 8, 2026
bug
[Bug]: vLLM 0.19.0 fails to load Cohere ASR model
#39252 · Apr 8, 2026
bug
[Feature]: Add LoRA support for Gemma4ForConditionalGeneration / Gemma 4 models
#39246 · Apr 8, 2026
good first issue feature request
[Bug]: VLLM_USE_FLASHINFER_MOE_INT4 broken
#39245 · Apr 7, 2026
bug
KV cache compression via E8 lattice VQ — 10-33x with PagedAttention integration
#39241 · Apr 7, 2026
[Bug]: Qwen3.5 Text Only Model (Qwen3_5ForCausalLM)
#39231 · Apr 7, 2026
bug
[Bug]: Nemotron 3 super has corrupted output on 0.19.0, no issues on 0.18.1
#39223 · Apr 7, 2026
bug
[Feature]: Use torch.compile Dynamo to see full trace for model forward pass
#39215 · Apr 7, 2026
performance feature request torch.compile
FP8 MoE ep_scatter Triton illegal-address on H200 in GLM-5-FP8 prefill path
#39211 · Apr 7, 2026
bug
[Bug]: Tokenization/Generation fails for Ministral3 with documented configuration
#39207 · Apr 7, 2026
bug
[Installation]: New 0.19.0 docker build to run gemma4: transformers outdated.
#39204 · Apr 7, 2026
installation
[Security] Unpinned Third-Party GitHub Action in macOS Workflow
#39199 · Apr 7, 2026
bug
[Bug]: HFValidationError when trying to run a GGUF model with quants
#39198 · Apr 7, 2026
bug
[Bug]: NCCL Error: unhandled cuda error
#39196 · Apr 7, 2026
bug
[Feature]: Tri-attention merge?
#39193 · Apr 7, 2026
feature request
[Bug]: GLM5 on B300 generates garbage output
#39179 · Apr 7, 2026
bug
[Bug]: First request after startup is unexpectedly slow with Qwen3.5-27B-FP8
#39163 · Apr 7, 2026
bug
[Bug]: There is "rocprofiler_configure" in libtorch_cpu.so
#39162 · Apr 7, 2026
bug
[RFC][Test]: Unified Platform-Aware Test Skip Mechanism
#39158 · Apr 7, 2026
RFC
[Usage]: Qwen3-VL inference on video complains of lack of metadata
#38811 · Apr 2, 2026
usage
[Feature]: How to disable chat template when using vllm serve
#38809 · Apr 2, 2026
feature request
[vLLM IR] Op test & benchmark infra
#38782 · Apr 2, 2026
vllm-ir
[Feature]: General LL GEMMs with PDL Support
#38772 · Apr 2, 2026
feature request
[vLLM IR] OOT migration guide
#38765 · Apr 2, 2026
[RFC]: Per-iteration forward pass metrics with accurate engine-level timing
#38760 · Apr 1, 2026
RFC
[vLLM IR] Port RoPE ops to IR
#38756 · Apr 1, 2026
vllm-ir
[Bug]: GPT OSS Router GEMM Causing NaNs
#38754 · Apr 1, 2026
bug
[vLLM IR] Port QuantFP8 to IR op
#38745 · Apr 1, 2026
vllm-ir
[RFC][vLLM IR]: Automatically compile native impl for IR ops
#38744 · Apr 1, 2026
RFC vllm-ir
[Transformers v5] NemotronParseForConditionalGeneration
#38740 · Apr 1, 2026
help wanted good first issue
[Transformers v5] ColBERTJinaRobertaModel
#38737 · Apr 1, 2026
help wanted good first issue
[Transformers v5] Tarsier2ForConditionalGeneration
#38736 · Apr 1, 2026
help wanted good first issue
[Transformers v5] Ernie4_5_VLMoeForConditionalGeneration
#38735 · Apr 1, 2026
help wanted good first issue
[Transformers v5] SarvamMLAForCausalLM
#38734 · Apr 1, 2026
help wanted good first issue
[Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7
#38729 · Apr 1, 2026
[Proposal] Topology-Aware KV Cache Compression for Memory-Efficient Inference
#38725 · Apr 1, 2026
performance
[Bug]: Bench Serve encounters UTF-8 UnicodeDecodeError
#38717 · Apr 1, 2026
bug
[Bug]: RuntimeError: failed to map GGUF parameters (18288)
#38716 · Apr 1, 2026
bug
[Bug]: Error when trying to serve MiniMax 2.5 on 4 H100 nodes with 4 GPUs
#38713 · Apr 1, 2026
bug
[Usage]: How to launch the Qwen3.5 service using vLLM on a V100 GPU
#38706 · Apr 1, 2026
usage
[Bug]: vLLM fails to start with LMCache + Qwen3-Coder-Next-FP8 (nightly image)
#38700 · Apr 1, 2026
bug
[Performance]: llmcompressor W8A8 Inference: decoding stage speed is lower than FP16
#38697 · Apr 1, 2026
performance
[Bug]: qwen3.5 outputs garbled spaces when response_format json_schema is enabled
#38696 · Apr 1, 2026
bug
[Feature]: Parity with CUDA: vLLM router should have ROCm CI
#38693 · Apr 1, 2026
feature request rocm
[Bug]: qwen 3.5 model launch gets stuck for quite a long time
#38656 · Mar 31, 2026
bug
[Bug]: kimi-k2 tool parser regex is slightly off
#38441 · Mar 28, 2026
bug
[Bug]: NVFP4 + MLA error during processing
#38439 · Mar 28, 2026
bug
[Installation]: torch 2.11 is not supported
#38431 · Mar 28, 2026
installation
[Transformers v5] InternVL2
#38425 · Mar 28, 2026
help wanted good first issue
[Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node
#38406 · Mar 27, 2026
bug rocm
[Transformers v5] IsaacForConditionalGeneration
#38389 · Mar 27, 2026
help wanted good first issue
[Transformers v5] HCXVisionForCausalLM
#38387 · Mar 27, 2026
help wanted good first issue
[Transformers v5] Base model and LoRA used in test has incorrect `tokenizer_config.json`
#38386 · Mar 27, 2026
help wanted good first issue
[Transformers v5] MiniCPMV cannot apply processor
#38385 · Mar 27, 2026
help wanted good first issue
[Transformers v5] Distributed shutdown test timeout
#38384 · Mar 27, 2026
help wanted good first issue
[Transformers v5] Mistral multimodal models
#38382 · Mar 27, 2026
help wanted good first issue
[Bug]: IndexError when `--renderer-num-workers` + `--mm-processor-cache-type shm`
#38375 · Mar 27, 2026
bug
[Bug]: Inference results are garbled when launching vllm via swift rollout
#38349 · Mar 27, 2026
bug
[Bug]: microsoft/Phi-4-reasoning-vision-15B Fails to startup
#38309 · Mar 27, 2026
bug
[Feature]: support affinity settings in helm chart
#38308 · Mar 27, 2026
feature request
[Bug]: minimax nvfp4 model crash
#38303 · Mar 27, 2026
bug
[Feature]: Add Rotorquant support
#38291 · Mar 26, 2026
feature request
[RFC]: Multi-tier KV offloading via the vLLM offloading connector
#38260 · Mar 26, 2026
RFC
[Usage]: How to do offline inference on one rank in a distributed environment?
#38258 · Mar 26, 2026
usage
[Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs
#38257 · Mar 26, 2026
bug
[RFC]: Incremental MoE Expert Offloading — GPU Cache + Async Pipeline
#38256 · Mar 26, 2026
[Bug]: VLLM_CPU_OMP_THREADS_BIND=nobind cannot be used with tp>1 on CPU backends
#38250 · Mar 26, 2026
bug
[Bug]: After upgrading to v0.18.0, the logs no longer display token output speed
#37876 · Mar 23, 2026
bug
[Bug]: bge-m3 /pooling endpoint breaks in the latest main branch
#37868 · Mar 23, 2026
bug
[Bug]: sleep mode not releasing GPU memory
#37860 · Mar 23, 2026
bug
[Bug]: does not have the attribute 'FakeTensorMode'
#37858 · Mar 23, 2026
bug
[Bug]: Qwen3-VL-Embedding-8B Image embedding failed
#37855 · Mar 23, 2026
bug
[Bug]: Phi qk_layernorm appears to be unsupported in vLLM
#37852 · Mar 23, 2026
bug
[RFC]: Unify the function of getting device count
#37849 · Mar 23, 2026
RFC
[Installation]: Documented v0.18.0 cu128 release wheel URL returns 404
#37847 · Mar 23, 2026
installation
[Bug]: 0.17.0rc1 deploying GLM-4.7 on A2: tool calling misbehaves after enabling MTP
#37846 · Mar 23, 2026
bug
_update_request_as_session does not update max_tokens from StreamingUpdate
#37842 · Mar 23, 2026
[Feature Request] Support chat_template in tokenizer_config.json for DeepSeekV32
#37839 · Mar 23, 2026
Why is an assertion used here?
#37837 · Mar 23, 2026
[Bug] Potential incorrect tokenizer source path in RunAI object storage pull
#37836 · Mar 23, 2026
bug
[Performance]: Deepseek performance regressing with norm fusion enabled
#37832 · Mar 23, 2026
performance
[Bug]: Intel ARC 140v not supported as XE2 cutlass kernel
#37828 · Mar 22, 2026
bug intel-gpu
[Bug]: CUDA 13 LMCache KV connector install path still resolves CUDA 12 artifacts
#37801 · Mar 22, 2026
[Bug]: [OOM] DeepSeek-R1 Out of Memory
#37777 · Mar 21, 2026
bug
[Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 Bf16 DP/EP
#37758 · Mar 21, 2026
bug
[Bug]: Qwen 3.5 stops working after upgrade to v0.18.0
#37749 · Mar 21, 2026
bug
[Feature]: Is there docker image support vllm and rocm 7.1+
#37748 · Mar 21, 2026
feature request rocm
[Bug] prompt_logprobs causes livelock with IsHybrid models (Qwen3.5) in DP mode
#37746 · Mar 21, 2026
bug
[Bug]: Missing logprobs for `<tool_call>` in streaming chat completions
#37737 · Mar 21, 2026
bug
[CI Failure]: Gemma3 OOMs with transformers backend
#37736 · Mar 21, 2026
rocm ci-failure
[Bug] Garbage output for long prompts after #35216
#37732 · Mar 21, 2026
bug
[Bug]: V1 engine core deadlocks under concurrent load (fp8 + prefix caching + Qwen3.5)
#37729 · Mar 21, 2026
bug
[CI Failure]: mi355_1: Quantization
#37724 · Mar 20, 2026
rocm ci-failure
[CI Failure]: mi355_1: Entrypoints Integration (API Server 1)
#37710 · Mar 20, 2026
rocm ci-failure
[CI Failure]: mi325_2: Distributed Tests (2 GPUs) (H100-MI325)
#37709 · Mar 20, 2026
rocm ci-failure
[CI Failure]: mi325_1: Quantized MoE Test (B200-MI325)
#37708 · Mar 20, 2026
rocm ci-failure
[CI Failure]: mi325_1: PyTorch Compilation Passes Unit Tests
#37707 · Mar 20, 2026
ci-failure
[Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask()
#37705 · Mar 20, 2026
[CI Failure]: mi250_1: Kernels Core Operation Test
#37704 · Mar 20, 2026
rocm ci-failure
[Feature]: add ParoQuant quantization
#37687 · Mar 20, 2026
feature request
[Feature]: IndexCache support for DSA models
#37684 · Mar 20, 2026
feature request
[Bug]: glm4.7 deployed with vllm+lmcache: service crashes under high concurrency with long contexts
#37680 · Mar 20, 2026
bug
[Bug]: deepgemm compile error
#37675 · Mar 20, 2026
bug
[Usage]: After deploying a model with serve, the image_pad token in the text passed to chat.completions is moved to the beginning of the prompt
#37674 · Mar 20, 2026
usage
[Usage]: How to use Env in yaml
#36465 · Mar 9, 2026
usage
[RFC]: vLLM IR Out-of-Tree (OOT) Kernel Registration
#36459 · Mar 9, 2026
RFC
[Bug]: Unable to run Qwen3.5 on RTX5090
#36455 · Mar 9, 2026
bug
[Performance]: vLLM v0.15.0 throughput regression compared to ROCm vLLM v0.14.0
#36454 · Mar 9, 2026
performance rocm
[Bug]: Qwen3.5-35B-A3B-FP8 AssertionError
#36452 · Mar 9, 2026
bug
[Bug]: prefix cache bug when using w4a16 for GLM5
#36441 · Mar 9, 2026
bug
[Feature] Add energy consumption metrics to benchmark suite
#36440 · Mar 9, 2026
bug
[Bug]: ValueError: No user query found in messages (Qwen 3.5 27B, vLLM 0.16.0 nightly)
#36432 · Mar 9, 2026
bug