Weight "executed code" more prominently #233
@ahumenberger @bauersimon Since this happens for multiple models in the overall score, we should fix this problem before we do the v0.6.0 run. Please discuss some solutions. Maybe we should even weight some metrics differently, e.g. response-with-code is more important than response-no-error.
The core problem we currently have is that we do not have normalized scores. This makes it inherently difficult to define fair and understandable weights for the different scoring categories. E.g. suppose we define the weight for executable code to be 100, and we assess two models A and B on two examples, one with 10 coverage objects and one with 1000 coverage objects. If example 2 instead had just 2 coverage objects, then model A would still get 300 points, but model B would get only 120 points. This shows, IMO, very well that just playing around with weights does not help us. We need normalized scores, e.g. between 0 and 100; then it is much easier and more understandable to define and adjust weights.
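A minimal sketch of what per-task normalization could look like (the function name and the 0-100 scale are illustrative, not the benchmark's actual API): dividing the raw count by the task's total number of coverage objects makes tasks of very different sizes comparable.

```python
def normalized_coverage(covered: int, total: int) -> float:
    """Scale a raw coverage count to 0..100 so tasks of any size
    contribute equally, regardless of how many coverage objects they have."""
    if total == 0:
        return 0.0
    return 100.0 * covered / total
```

With such a normalization, covering 8 of 10 objects scores the same as covering 800 of 1000, so a weight like "100 for executable code" means the same thing for every example.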
We could avoid the weighting problem completely by not having one score but reporting everything separately. It feels almost impossible to break it down to one single number without running into problems.
Having an overall score is still valuable, I think. Imagine you are trying to find the best LLM for your needs, and you have 5-10 factors on which you are evaluating the LLMs, like "does code compile", "how much coverage", "time". And now there are 100 different LLMs. In the end you need a ranking of the LLMs, where each factor has some weight. So it would be great if we could provide some kind of dashboard where users can rank factors by themselves and provide their own weights. However, that still requires normalized scores.
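Such a dashboard could boil down to a weighted sum over normalized factor scores. A hypothetical sketch (model names, factors, and weights are made up for illustration):

```python
def rank_models(scores: dict, weights: dict) -> list:
    """Rank models by the weighted sum of their normalized (0..100) factor scores.

    scores:  {model: {factor: score}}
    weights: {factor: weight}
    Returns model names, best first.
    """
    def total(model: str) -> float:
        return sum(weights[f] * scores[model][f] for f in weights)

    return sorted(scores, key=total, reverse=True)

# A user who cares mostly about compiling code would weight that factor higher:
scores = {
    "model-a": {"compiles": 70.0, "coverage": 90.0},
    "model-b": {"compiles": 95.0, "coverage": 60.0},
}
print(rank_models(scores, {"compiles": 3.0, "coverage": 1.0}))  # → ['model-b', 'model-a']
```

Because the per-factor scores are normalized, changing the weights changes the ranking in a predictable way: with `{"compiles": 1.0, "coverage": 3.0}` the same data ranks model-a first.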
In the v0.5.0 eval run we have the problem that GPT-4 ranks better than Gemini 1.5 Flash. Gemini has more code that is executable, but GPT has a higher coverage score, which is why it ranks better. However, it makes sense to order first by executable code and then by coverage. We need to balance both.
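One way to express "executable code first, then coverage" without picking magic weights is a lexicographic sort on normalized scores, where coverage only breaks ties in executability. A sketch with invented numbers (not the actual v0.5.0 results):

```python
# Hypothetical normalized scores per model (0..100); values are made up.
results = {
    "gemini-1.5-flash": {"executable": 85.0, "coverage": 60.0},
    "gpt-4": {"executable": 70.0, "coverage": 90.0},
}

# Sorting by a tuple compares "executable" first; "coverage" only
# matters when two models produce equally executable code.
ranking = sorted(
    results,
    key=lambda m: (results[m]["executable"], results[m]["coverage"]),
    reverse=True,
)
print(ranking)  # the model with more executable code ranks first
```

The downside is that lexicographic ordering is all-or-nothing: a tiny edge in executability outweighs any coverage difference, so in practice a weighted sum with a large weight on executability may be the more tunable middle ground.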