Weight "executed code" more prominently #233

Open
zimmski opened this issue Jul 2, 2024 · 4 comments

@zimmski
Member

zimmski commented Jul 2, 2024

In the v0.5.0 eval run we have the problem that GPT-4 ranks better than Gemini 1.5 Flash. Gemini produces more code that is executable, but GPT has a higher coverage score, which is why it comes out ahead. However, it makes sense to order by executable code first and only then by coverage (see the sketch below the list). We need to balance:

  • Executable code should be weighted much higher
  • Coverage is still very important
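
A minimal sketch of the "order by executable code first, then coverage" reading. The field names and numbers are made up for illustration (they are not the v0.5.0 results or the repository's actual assessment keys):

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a hypothetical per-model summary.
type Result struct {
	Model           string
	FilesExecuted   uint // how many generated files actually ran
	CoverageObjects uint // how many coverage objects were reached
}

func main() {
	// Made-up numbers: Gemini executes more code, GPT-4 reaches more coverage objects.
	results := []Result{
		{Model: "openai/gpt-4", FilesExecuted: 20, CoverageObjects: 180},
		{Model: "google/gemini-1.5-flash", FilesExecuted: 22, CoverageObjects: 150},
	}

	// Order by executable code first; coverage only breaks ties.
	sort.Slice(results, func(i, j int) bool {
		if results[i].FilesExecuted != results[j].FilesExecuted {
			return results[i].FilesExecuted > results[j].FilesExecuted
		}
		return results[i].CoverageObjects > results[j].CoverageObjects
	})

	fmt.Println(results) // gemini-1.5-flash now ranks first
}
```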
@zimmski zimmski added the enhancement New feature or request label Jul 2, 2024
@zimmski zimmski added this to the v0.6.0 milestone Jul 2, 2024
@zimmski
Member Author

zimmski commented Jul 2, 2024

@ahumenberger @bauersimon since this happens for multiple models in the overall score, we should fix this problem before we do the v0.6.0 run. Please discuss some solutions. Maybe we should even weight more metrics differently, e.g. response-with-code is more important than response-no-error.

@ahumenberger
Contributor

ahumenberger commented Jul 3, 2024

The core problem we currently have is that we do not have normalized scores. This makes it inherently difficult to define fair and understandable weights for the different scoring categories. E.g. assume we define the weight for executable code to be 100, there are two models A and B, and we assess two examples, one with 10 coverage objects and one with 1000 coverage objects.
Model A produces perfect coverage for example 1 and only executable code for example 2, so its score is 100 (executable example 1) + 10*10 (coverage example 1) + 100 (executable example 2) = 300.
Model B does not produce executable code for example 1, but gets full coverage for example 2, so its score is 100 (executable example 2) + 1000*10 (coverage example 2) = 10100.

If example 2 had just 2 coverage objects, model A would still get 300 points, but model B would only get 120 points.

This shows, IMO, very well that just playing around with weights does not help us. We need normalized scores, e.g. between 0 and 100; then it is much easier and more understandable to define and adjust weights.
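
A minimal sketch of such a per-example normalization, keeping the illustrative weight of 100 for executable code (the function and parameter names are hypothetical, not the evaluation's API):

```go
// exampleScore is a hypothetical per-example score in [0, 200]: up to 100
// points for producing executable code plus up to 100 points for coverage,
// scaled by the example's total number of coverage objects.
func exampleScore(executable bool, coveredObjects, totalObjects uint) float64 {
	score := 0.0
	if executable {
		score += 100 // illustrative weight for "code is executable"
	}
	if totalObjects > 0 {
		// Normalize coverage to [0, 100] instead of 10 points per object,
		// so an example with 1000 objects cannot dominate one with 10.
		score += 100 * float64(coveredObjects) / float64(totalObjects)
	}
	return score
}
```

With that normalization, model A from the example above would score 200 + 100 = 300 and model B 0 + 200 = 200, regardless of whether example 2 has 2 or 1000 coverage objects.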

@bauersimon
Member

Could avoid the weighting problem completely by not having one score but reporting everything separately. It feels almost impossible to break it down to just one single number and not run into problems.

@ahumenberger
Contributor

> Could avoid the weighting problem completely by not having one score but reporting everything separately. It feels almost impossible to break it down to just one single number and not run into problems.

Having an overall score is still valuable, I think. Imagine you are trying to find the best LLM for your needs, and you have 5-10 factors on which to evaluate the LLMs, like "does the code compile", "how much coverage", "time". And now there are 100 different LLMs. In the end you need a ranking of the LLMs, where each factor has some weight.
However, the weight always depends on the application, e.g. accuracy might be more relevant than speed or vice versa.

So in the end it would be great if we could provide some kind of dashboard where users can rank the factors themselves and provide weights for them. However, that still requires normalized scores.
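
A rough sketch of what such a dashboard could compute once every category is normalized to [0, 100]; the category names and weights below are made up:

```go
// CategoryScores holds one model's normalized scores (each in [0, 100]).
type CategoryScores map[string]float64

// overall is a weighted average of normalized category scores, so the result
// stays in [0, 100] no matter which weights the user picks.
func overall(scores CategoryScores, weights map[string]float64) float64 {
	var weightedSum, weightTotal float64
	for category, weight := range weights {
		weightedSum += weight * scores[category]
		weightTotal += weight
	}
	if weightTotal == 0 {
		return 0
	}
	return weightedSum / weightTotal
}

// Example: a user who cares most about code that actually runs.
var userWeights = map[string]float64{
	"executable-code": 5,
	"coverage":        2,
	"response-time":   1,
}
```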

@bauersimon bauersimon modified the milestones: v0.6.0, v0.7.0 Jul 31, 2024