I have 3 x 3090 GPUs (72 GB of VRAM in total) running on Linux, so I should have enough GPU memory. I'm getting this error after install.
python3 server_vllm.py --model "meetkai/functionary-small-v2.2" --host 0.0.0.0
/mnt/data/Applications/functionary/server_vllm.py:94: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/ @validator("tool_choice", always=True)
INFO 03-06 20:54:30 server_vllm.py:542] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], served_model_name=None, grammar_sampling=True, model='meetkai/functionary-small-v2.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 03-06 20:54:33 llm_engine.py:72] Initializing an LLM engine with config: model='meetkai/functionary-small-v2.2', tokenizer='meetkai/functionary-small-v2.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-06 20:54:37 weight_utils.py:164] Using model weights format ['.safetensors']
INFO 03-06 20:55:24 llm_engine.py:322] # GPU blocks: 1699, # CPU blocks: 2048
Traceback (most recent call last):
File "/mnt/data/Applications/functionary/server_vllm.py", line 550, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 633, in from_engine_args
engine = cls(
^^^^
File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 350, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 393, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/graham/miniconda3/envs/Fnary/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
self._init_cache()
File "/home/graham/miniconda3/envs/Fnary/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 331, in _init_cache
raise ValueError(
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (27184). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
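The 27184-token limit in the error comes straight from the figures in the log above: vLLM's KV-cache capacity in tokens is the number of GPU blocks times the tokens per block. A quick check, using the values reported in the log:

```python
# Values taken from the log output above.
gpu_blocks = 1699   # from "# GPU blocks: 1699"
block_size = 16     # from block_size=16 in the args Namespace

# KV-cache capacity in tokens = blocks * tokens-per-block.
kv_cache_tokens = gpu_blocks * block_size
print(kv_cache_tokens)  # 27184, the number quoted in the ValueError
```

Since 27184 is less than the model's 32768-token max sequence length, the engine refuses to start.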
Hi, the default context window of the base model is 32k, so you can set max_model_len to 8k if you are using a GPU with 24GB of VRAM. As the error suggests, 24GB of VRAM is not enough to load the model with a 32k-token KV cache in vLLM. Setting max_model_len to 8k should make it work. Note that your log shows tensor_parallel_size=1, so despite having three 3090s, vLLM is only using one of them.
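Concretely, that suggestion could look like the following. This is a sketch: it assumes the server forwards the usual vLLM engine arguments as CLI flags, which the args Namespace in the log (max_model_len=None, gpu_memory_utilization=0.9, tensor_parallel_size=1) suggests it does.

```shell
python3 server_vllm.py --model "meetkai/functionary-small-v2.2" \
    --host 0.0.0.0 \
    --max-model-len 8192
```

The other knobs the error message and the args listing point at are raising --gpu-memory-utilization above its 0.9 default, or increasing --tensor-parallel-size to shard the model across more than one GPU.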