vLLM
What is vLLM?
vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM Features
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM (example after this list)
- Optimized CUDA kernels
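
For example, a quantized checkpoint can be loaded by passing the quantization method to the engine. A minimal sketch; the checkpoint name is just an illustrative AWQ model from the Hugging Face Hub, not one this README pins down:

```python
from vllm import LLM

# Illustrative AWQ checkpoint; GPTQ/SqueezeLLM models work the same way
# via quantization="gptq" / quantization="squeezellm".
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

# Generate with default sampling parameters.
output = llm.generate("Quantization reduces GPU memory use by")[0]
print(output.outputs[0].text)
```
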
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server (see the sketch after this list)
- Support for NVIDIA GPUs and AMD GPUs
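
A minimal sketch of querying the OpenAI-compatible server with the official `openai` Python client, assuming the server was started with `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1`; the model name and port below are illustrative:

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API, so the stock client works;
# the server does not check API keys by default, hence "EMPTY".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(completion.choices[0].text)
```
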
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
- Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
- BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
- ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
- DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
- Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
- LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
- Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
- Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.)
- MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
- Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
- Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
- Qwen2 (Qwen/Qwen2-7B-beta, Qwen/Qwen-7B-Chat-beta, etc.)
- StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
- Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)
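
Each model above is loaded directly by its Hugging Face identifier. A minimal sketch, with Mistral picked arbitrarily from the list:

```python
from vllm import LLM

# Load a supported architecture by its Hugging Face name.
llm = LLM(model="mistralai/Mistral-7B-v0.1")

# Models that ship custom modeling code on the Hub (e.g. Qwen, Baichuan,
# ChatGLM) additionally need trust_remote_code=True:
# llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
```
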
Install vLLM
Install vLLM with pip or from source:
```bash
pip install vllm
```

Getting Started
Visit the documentation to get started.
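
A minimal offline-inference sketch to get started; facebook/opt-125m is just a small illustrative model, and outputs will vary:

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
# n > 1 requests parallel sampling: several candidate completions per prompt.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    for candidate in output.outputs:
        print(f"{output.prompt!r} -> {candidate.text!r}")
```
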