inclusionAI/Ling-2.6-flash
Ling-2.6-flash (BailingMoeV2_5) instruct model with 104B total / 7.4B active params, hybrid linear + MLA attention, 128K context, optimized for agent workloads
Overview
Ling-2.6-flash is a BailingMoeV2_5 MoE instruct model with 104B total / 7.4B active parameters, hybrid linear + MLA attention, and a 128K (131,072-token) context window.
Deployment Configurations
Docker (AMD MI300X / MI325X / MI355X, TP=2)
MI300X / MI325X / MI355X GPUs have large per-GPU HBM (192 GB or more), so TP=2 fits the full 128K (131,072-token) context.
docker run --rm -it \
  --cap-add=SYS_PTRACE \
  --ipc=host \
  --privileged=true \
  --shm-size=128GB \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.20.2 \
  inclusionAI/Ling-2.6-flash \
  --tensor-parallel-size 2 \
  --trust-remote-code
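As a rough sanity check on the TP=2 configuration above, the per-GPU weight footprint can be estimated with a few lines of arithmetic. This is a back-of-envelope sketch, not a measurement: it assumes BF16 weights (2 bytes per parameter), 192 GB of HBM per GPU (the MI300X figure), and a perfectly even split across the two tensor-parallel ranks. None of these assumptions comes from the model card.

```python
# Back-of-envelope estimate of weight memory per GPU at TP=2.
# Assumptions (not from the model card): BF16 weights, 192 GB HBM per GPU,
# and weights split evenly across tensor-parallel ranks.
TOTAL_PARAMS = 104e9      # total parameters, from the overview
BYTES_PER_PARAM = 2       # assumed BF16
HBM_PER_GPU_GB = 192      # assumed MI300X capacity
TP = 2                    # --tensor-parallel-size from the Docker command


def weight_gb_per_gpu(total_params, bytes_per_param, tp):
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), assuming an even split."""
    return total_params * bytes_per_param / tp / 1e9


per_gpu = weight_gb_per_gpu(TOTAL_PARAMS, BYTES_PER_PARAM, TP)
headroom = HBM_PER_GPU_GB - per_gpu
print(f"weights per GPU: {per_gpu:.0f} GB, headroom: {headroom:.0f} GB")
```

Under these assumptions the weights take roughly 104 GB per GPU. The remaining headroom is shared by the KV cache, activations, and runtime overhead, so the real budget for context is smaller than the printed figure.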
Client Usage
Text Generation
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
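Where installing the openai package is not an option, the same request can be sent with the Python standard library alone. The sketch below assumes the server from the Docker command is reachable at http://localhost:8000 and serves the OpenAI-compatible /v1/chat/completions route. The helper names build_chat_request and chat are hypothetical, introduced here only for illustration.

```python
import json
import urllib.request

# Assumed endpoint: the vLLM server started above, on its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "inclusionAI/Ling-2.6-flash"


def build_chat_request(prompt, max_tokens=512, temperature=0.7, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Write a poem about the ocean.")) should behave like the client example above. Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.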