Weight "executed code" more prominently #233
@ahumenberger @bauersimon Since this happens for multiple models in the overall score, we should fix this problem before we do the v0.6.0 run. Please discuss some solutions. Maybe we should even weight some metrics differently, e.g. response-with-code is more important than response-no-error.
The core problem we currently have is that we do not have normalized scores. This makes it inherently difficult to define fair and understandable weights for the different scoring categories. E.g. suppose we define the weight for executable code to be 100, and we assess two models A and B on two examples, one with 10 coverage objects and one with 1000 coverage objects. If example 2 instead had just 2 coverage objects, then model A would still get 300 points, but model B would get only 120 points. This shows, IMO, very well that just playing around with weights does not help us. We need normalized scores, e.g. between 0 and 100; then it is much easier and more understandable to define and adjust weights.
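A minimal sketch of what per-task normalization could look like (the function name and the 0-100 scale are illustrative, not the benchmark's actual API): dividing the raw count by the task's total number of coverage objects makes tasks of very different sizes comparable.

```python
def normalized_coverage(covered: int, total: int) -> float:
    """Scale a raw coverage count to 0..100 so tasks of any size
    contribute equally, regardless of how many coverage objects they have."""
    if total == 0:
        return 0.0
    return 100.0 * covered / total
```

With such a normalization, covering 8 of 10 objects scores the same as covering 800 of 1000, so a weight like "100 for executable code" means the same thing for every example.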
We could avoid the weighting problem completely by not having one score but reporting everything separately. It feels almost impossible to break it down to one single number without running into problems.
Having an overall score is still valuable, I think. Imagine you are trying to find the best LLM for your needs, and you have 5-10 factors on which you are evaluating the LLMs, like "does code compile", "how much coverage", "time". And now there are 100 different LLMs. In the end you need a ranking of the LLMs, where each factor has some weight. So it would be great if we could provide some kind of dashboard where users can rank factors by themselves and provide their own weights. However, that still requires normalized scores.
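Such a dashboard could boil down to a weighted sum over normalized factor scores. A hypothetical sketch (model names, factors, and weights are made up for illustration):

```python
def rank_models(scores: dict, weights: dict) -> list:
    """Rank models by the weighted sum of their normalized (0..100) factor scores.

    scores:  {model: {factor: score}}
    weights: {factor: weight}
    Returns model names, best first.
    """
    def total(model: str) -> float:
        return sum(weights[f] * scores[model][f] for f in weights)

    return sorted(scores, key=total, reverse=True)

# A user who cares mostly about compiling code would weight that factor higher:
scores = {
    "model-a": {"compiles": 70.0, "coverage": 90.0},
    "model-b": {"compiles": 95.0, "coverage": 60.0},
}
print(rank_models(scores, {"compiles": 3.0, "coverage": 1.0}))  # → ['model-b', 'model-a']
```

Because the per-factor scores are normalized, changing the weights changes the ranking in a predictable way: with `{"compiles": 1.0, "coverage": 3.0}` the same data ranks model-a first.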
In the v0.5.0 eval run we have the problem that GPT-4 ranks better than Gemini 1.5 Flash. Gemini has more code that is executable, but GPT has a higher coverage score, which is why it ranks better. However, it makes sense to order first by executable code and then by coverage. We need to balance both.
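One way to express "executable code first, then coverage" without picking magic weights is a lexicographic sort on normalized scores, where coverage only breaks ties in executability. A sketch with invented numbers (not the actual v0.5.0 results):

```python
# Hypothetical normalized scores per model (0..100); values are made up.
results = {
    "gemini-1.5-flash": {"executable": 85.0, "coverage": 60.0},
    "gpt-4": {"executable": 70.0, "coverage": 90.0},
}

# Sorting by a tuple compares "executable" first; "coverage" only
# matters when two models produce equally executable code.
ranking = sorted(
    results,
    key=lambda m: (results[m]["executable"], results[m]["coverage"]),
    reverse=True,
)
print(ranking)  # the model with more executable code ranks first
```

The downside is that lexicographic ordering is all-or-nothing: a tiny edge in executability outweighs any coverage difference, so in practice a weighted sum with a large weight on executability may be the more tunable middle ground.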