FP8 version #224

Open · themrzmaster opened this issue Jul 12, 2024 · 7 comments

Comments

@themrzmaster

Thanks for your work!
It would be nice to have FP8 versions available on HF, since vLLM has special
kernels for it and FlashAttention-3 is moving in that direction too.

Thanks

@khai-meetkai
Collaborator

Hi @themrzmaster, you mean 8-bit AWQ, right? Which version are you interested in, v2.5 or v3?

@themrzmaster
Author

v3! thanks

@localmind-ai

@themrzmaster @khai-meetkai you can live-quantize with --quantization fp8 when launching the included vLLM script, so no dedicated models are needed. The only caveat is that you still need to download the regular weights, but after that, quantization works fine. Already tested on the latest medium functionary.
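
For reference, a minimal sketch of the same idea using the vLLM Python API rather than the repo's launch script (the model name and prompt are illustrative, not confirmed by this thread; requires an FP8-capable GPU such as Hopper/Ada):

```python
# Minimal sketch: load the regular (non-quantized) weights and let vLLM
# quantize them to FP8 on the fly via quantization="fp8".
# Assumptions: vLLM is installed and the GPU supports FP8; the model
# name below is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meetkai/functionary-medium-v3.1",  # illustrative model name
    quantization="fp8",                       # live FP8 quantization
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What's the weather like in Hanoi?"], params)
print(outputs[0].outputs[0].text)
```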

@localmind-ai

@khai-meetkai one more note on AWQ quants (you probably know this already, but I wanted to mention it just in case): it's quite important that the calibration dataset aligns with the function-calling use case, so it's probably a good idea to calibrate not just on a default dataset but on one mixed with your own data (including some function-calling samples); a rough sketch follows below.

This makes AWQ quants (especially 4-bit) a bit more optimized and reliable. We tested this on some of the older medium functionary models and got better results by expanding the AWQ calibration dataset with function-calling data synthetically generated from your original model.
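
A hedged sketch of what mixed-dataset calibration could look like with the AutoAWQ library, whose calib_data parameter accepts a list of raw-text samples (the model path, prompt format, and tiny sample lists are illustrative only):

```python
# Sketch of AWQ quantization with a mixed calibration set.
# Assumptions: AutoAWQ is installed; the model path and the two tiny
# sample lists are illustrative -- a real run needs hundreds of samples.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meetkai/functionary-medium-v3.1"  # illustrative
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Mix generic text with function-calling transcripts so the activation
# statistics AWQ collects reflect the target use case.
generic_samples = ["The quick brown fox jumps over the lazy dog."]
fc_samples = [
    '<|from|>assistant\n<|recipient|>get_weather\n<|content|>{"location": "Hanoi"}'
]
calib_data = generic_samples + fc_samples

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized("functionary-medium-awq")
tokenizer.save_pretrained("functionary-medium-awq")
```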

@khai-meetkai
Collaborator

Hi @localmind-ai, thank you for the reminder! Yes, the calibration dataset should also be function-calling data. Currently, we don't have any plans for creating AWQ quants as we have more urgent tasks, but we will definitely use function-calling data for calibration if we do.

@localmind-ai

Thanks for the information @khai-meetkai! Fully understandable.

@khai-meetkai
Collaborator

@localmind-ai We have just released meetkai/functionary-medium-v3.1-fp8, using a small part of the training data as calibration data. In our evaluation, this quantized model gave almost the same results as the original model.
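
For anyone landing here, a quick sketch of loading the released checkpoint with vLLM (assuming vLLM picks up the FP8 quantization settings from the checkpoint's config, so no extra flag should be needed; FP8-capable hardware is still required):

```python
# Sketch: serve the pre-quantized FP8 checkpoint directly.
# Assumption: quantization settings are read from the checkpoint config.
from vllm import LLM

llm = LLM(model="meetkai/functionary-medium-v3.1-fp8")
print(llm.generate(["Hello"])[0].outputs[0].text)
```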
