ValueError: The model's max seq len (32768) #128

Open

RACMUP opened this issue Mar 6, 2024 · 1 comment

RACMUP commented Mar 6, 2024

I have 3 x 3090 GPUs with 72 GB of VRAM running under Linux, so I should have enough GPU memory. I'm getting this error after install.

python3 server_vllm.py --model "meetkai/functionary-small-v2.2" --host 0.0.0.0
/mnt/data/Applications/functionary/server_vllm.py:94: PydanticDeprecatedSince20: Pydantic V1 style @validator validators are deprecated. You should migrate to Pydantic V2 style @field_validator validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("tool_choice", always=True)
INFO 03-06 20:54:30 server_vllm.py:542] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, grammar_sampling=True, model='meetkai/functionary-small-v2.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 03-06 20:54:33 llm_engine.py:72] Initializing an LLM engine with config: model='meetkai/functionary-small-v2.2', tokenizer='meetkai/functionary-small-v2.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-06 20:54:37 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 03-06 20:55:24 llm_engine.py:322] # GPU blocks: 1699, # CPU blocks: 2048
Traceback (most recent call last):
  File "/mnt/data/Applications/functionary/server_vllm.py", line 550, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 633, in from_engine_args
    engine = cls(
             ^^^^
  File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 350, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/Applications/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 393, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/graham/miniconda3/envs/Fnary/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/graham/miniconda3/envs/Fnary/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 331, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (27184). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

jeffrey-fong (Contributor) commented

Hi, the default context window of the base model is 32k, so if you are running on a single GPU with 24 GB of VRAM you can set max_model_len to 8k. As the error suggests, 24 GB of VRAM is not enough to load the model together with a 32k-token KV cache in vLLM. Setting max_model_len to 8k, as in the command below, should make it work.
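For reference, the numbers in the startup log line up with the error: vLLM allocated 1699 GPU KV-cache blocks at the default block size of 16 tokens, which caps the cache at 1699 * 16 = 27184 tokens, below the model's 32768-token max seq len. A back-of-the-envelope check in Python (a sketch of the comparison vLLM makes, not its actual code):

num_gpu_blocks = 1699      # from "# GPU blocks: 1699" in the log
block_size = 16            # block_size=16 in the args dump
max_model_len = 32768      # the base model's context window

kv_cache_capacity = num_gpu_blocks * block_size
print(kv_cache_capacity)                  # 27184
print(max_model_len > kv_cache_capacity)  # True, so vLLM raises the ValueError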

python3 server_vllm.py --model meetkai/functionary-small-v2.2 --host 0.0.0.0 --max-model-len 8192
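If you want to keep the full 32k context instead, two other options (untested here, and assuming the standard vLLM engine flags that server_vllm.py exposes) are raising gpu_memory_utilization, as the error message itself suggests, or sharding the model across two of your three 3090s with tensor parallelism. Note that vLLM requires the model's attention head count (32 for the Mistral-7B base) to be divisible by the tensor-parallel size, so 2 works but 3 would not.

# may or may not free enough headroom for a full 32k KV cache on one 24 GB card
python3 server_vllm.py --model meetkai/functionary-small-v2.2 --host 0.0.0.0 --gpu-memory-utilization 0.95

# shards the weights across two GPUs, leaving much more memory for the KV cache
python3 server_vllm.py --model meetkai/functionary-small-v2.2 --host 0.0.0.0 --tensor-parallel-size 2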
