nitro, because they are just faster
extended, because longer context windows don't matter for our tasks
free and auto, because these are just "aliases" for existing models
Exclude special-purpose models
Vision models
Roleplay and creative writing models
Classification models
Models with internet access (usually denoted by -online suffix)
Models with extended context windows (usually denoted by -1234K suffix)
Always prefer fine-tuned (-instruct, -chat) models over a plain base model
Tag version (tag can be moved in case important merges happen afterwards)
For all issues of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) together with the users who worked on them
Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
For all PRs of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple PRs) together with the users who worked on them
Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
Search all issues for ...
Unassigned issues that are closed, and assign them to someone
Issues without a milestone, and assign them a milestone
Issues without a label, and assign them at least one label
Write the release notes:
Use the tasks that are already there for the release note outline
Add highlighted features based on the done tasks, sort by how many users would use the feature
Do the release
With the release notes
Set as latest release
Prepare the next roadmap
Create a milestone for the next release
Create a new roadmap issue for the next release
Move all open tasks/TODOs from this roadmap issue to the next roadmap issue.
Move every comment of this roadmap issue as a TODO to the next roadmap issue. Mark when done with a 🚀 emoji.
Blog post containing evaluation results, new features and learnings
Update README with blog post link and new header image
Add a "Deep dive: $blog-post-title" announcement for the blog post
Add a "v$version: $summary-of-highlights" announcement for the release
Announce release
Eat cake 🎂
TODO sort and sort out:
Models
Exclude openrouter models auto and flavor-of-the-week automatically in the provider
Include more models (the main problem is that multiple models come out every day; we should not wait for a "new version" of the eval but test these models right away and compare them; big problem: how do we promote findings?)
Nous-Hermes-2-SOLAR-10.7B; also the "Tree of Thoughts" approach might be interesting as a task
extend the report command such that it takes result CSVs and automatically
does the summing and aggregation (if we still want that to be a separate step)
finds the maximum scores for that evaluation run
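A minimal sketch of what that aggregation step could look like, assuming a hypothetical CSV layout with `model`, `task` and `score` columns (the real result files may use different columns):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
)

// aggregate sums scores per model and tracks the maximum score seen per task.
// It assumes a hypothetical column layout of "model,task,score".
func aggregate(path string) (totals, maxPerTask map[string]float64, err error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	defer file.Close()

	records, err := csv.NewReader(file).ReadAll()
	if err != nil {
		return nil, nil, err
	}

	totals = map[string]float64{}
	maxPerTask = map[string]float64{}
	for i, record := range records {
		if i == 0 {
			continue // Skip the header row.
		}
		model, task := record[0], record[1]
		score, err := strconv.ParseFloat(record[2], 64)
		if err != nil {
			return nil, nil, err
		}
		totals[model] += score
		if score > maxPerTask[task] {
			maxPerTask[task] = score
		}
	}

	return totals, maxPerTask, nil
}

func main() {
	totals, maxPerTask, err := aggregate("evaluation.csv")
	if err != nil {
		panic(err)
	}
	fmt.Println("total score per model:", totals)
	fmt.Println("maximum score per task:", maxPerTask)
}
```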
once we have the leaderboard, we basically want to configure the repository such that we just add a model to some config somewhere and the GitHub Actions run automatically and benchmark this model
or in a similar fashion, we just do a new release and the GitHub Actions run automatically and benchmark everything for the new version
Take a look at current leaderboards and evals to know what could be interesting. Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
Figure out the "perfect" coverage score so we can display percentage of coverage reached
Make coverage metric fair
"Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite." -> only Symflower coverage will make this fair
distinguish between latency (time-to-first-token) and throughput (tokens generated per second)
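A sketch of how the two could be measured from a streamed response; the token channel is a stand-in for whatever streaming interface the provider package actually exposes:

```go
package main

import (
	"fmt"
	"time"
)

// measureStream distinguishes latency (time to first token) from throughput
// (tokens generated per second) for a streamed model response.
func measureStream(tokens <-chan string) (latency time.Duration, tokensPerSecond float64) {
	start := time.Now()
	count := 0
	var firstToken time.Time
	for range tokens {
		if count == 0 {
			firstToken = time.Now()
			latency = firstToken.Sub(start)
		}
		count++
	}
	generation := time.Since(firstToken)
	if count > 1 && generation > 0 {
		// Exclude the first token so queueing and prompt processing do not skew throughput.
		tokensPerSecond = float64(count-1) / generation.Seconds()
	}

	return latency, tokensPerSecond
}

func main() {
	tokens := make(chan string)
	go func() {
		defer close(tokens)
		for _, token := range []string{"func", " Add", "(a", ", b", " int)", " int", " {", " return", " a + b", " }"} {
			time.Sleep(50 * time.Millisecond) // Simulate a model streaming tokens.
			tokens <- token
		}
	}()
	latency, throughput := measureStream(tokens)
	fmt.Printf("time to first token: %v, throughput: %.1f tokens/s\n", latency, throughput)
}
```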
Save the descriptions of the models as well (https://openrouter.ai/api/v1/models). The reason is that these can change over time, and we need to know after a while what they were, e.g. right now I would like to know whether mistral-7b-instruct for the last evaluation was v0.1 or not
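A sketch of archiving a dated snapshot of that endpoint next to each evaluation run; the `data`/`id`/`description` field names are assumptions and should be checked against the actual response schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// modelList mirrors the assumed shape of https://openrouter.ai/api/v1/models:
// a "data" array of models with at least an ID and a description.
type modelList struct {
	Data []struct {
		ID          string `json:"id"`
		Description string `json:"description"`
	} `json:"data"`
}

func main() {
	response, err := http.Get("https://openrouter.ai/api/v1/models")
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	var models modelList
	if err := json.NewDecoder(response.Body).Decode(&models); err != nil {
		panic(err)
	}

	// Store a dated snapshot so we can later check which revision (e.g.
	// "mistral-7b-instruct" v0.1 vs. v0.2) was actually evaluated.
	file, err := os.Create(fmt.Sprintf("models-%s.json", time.Now().Format("2006-01-02")))
	if err != nil {
		panic(err)
	}
	defer file.Close()
	if err := json.NewEncoder(file).Encode(models); err != nil {
		panic(err)
	}
}
```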
Query the REAL costs of all the testing of a model: this is interesting because some models have HUGE outputs, and since more output means more costs, this should be reflected in the score.
Reporting
Bar charts should have their value on the bar. The axis values do not work that well
Pick an example or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through results manually.
Total-scores vs. costs scatterplot. The result is an upper-left-corner sweet spot: cheap and good results.
Scoring, Categorization, Bar Charts split by language.
Pie chart of the whole evaluation's costs: for each LLM show how much it costs. The result is to see which LLMs cost the most to run the eval.
deep-dive content
What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
Are there big LLMs that totally fail?
Are there small LLMs that are surprisingly good?
What about LLMs where the community doesn't know that much yet, e.g. Snowflake, DBRX, ...?
Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache-2.0 licensed, so commercial use is allowed. It should be rated better than GPT-4.
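A sketch of such a multi-key ordering; the attributes and the example values are placeholders:

```go
package main

import (
	"fmt"
	"sort"
)

// Model carries the licensing and cost attributes used for ordering.
type Model struct {
	Name          string
	OpenWeight    bool
	CommercialUse bool
	PricePerMTok  float64 // USD per million tokens, placeholder values below.
	Parameters    float64 // In billions, 0 if unknown.
}

// less prefers open-weight, commercially usable, cheaper and smaller models,
// in that order of priority.
func less(a, b Model) bool {
	if a.OpenWeight != b.OpenWeight {
		return a.OpenWeight
	}
	if a.CommercialUse != b.CommercialUse {
		return a.CommercialUse
	}
	if a.PricePerMTok != b.PricePerMTok {
		return a.PricePerMTok < b.PricePerMTok
	}

	return a.Parameters < b.Parameters
}

func main() {
	models := []Model{
		{Name: "openai/gpt-4", OpenWeight: false, CommercialUse: true, PricePerMTok: 30, Parameters: 0},
		{Name: "mistralai/Mistral-7B-Instruct-v0.1", OpenWeight: true, CommercialUse: true, PricePerMTok: 0.1, Parameters: 7},
	}
	sort.Slice(models, func(i, j int) bool { return less(models[i], models[j]) })
	for _, model := range models {
		fmt.Println(model.Name)
	}
}
```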
Rescore existing models / evals with fixes, e.g. when we build a better code repair tool the LLM answer did not change, so we should rescore right away with the new version of the tool over a whole result of an eval.
Automatic tool installation with fixed version
Go
Java
Ensure that non-critical CLI input validation (such as unavailable models) does not panic
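A sketch of the intended behavior, with a hypothetical `validateModel` helper returning a normal error instead of panicking:

```go
package main

import (
	"fmt"
	"os"
)

// validateModel reports an unavailable model as a normal error so the CLI can
// print a helpful message and exit cleanly instead of panicking.
func validateModel(model string, available []string) error {
	for _, id := range available {
		if id == model {
			return nil
		}
	}

	return fmt.Errorf("model %q is not available, choose one of %v", model, available)
}

func main() {
	if err := validateModel("openrouter/unknown-model", []string{"openrouter/auto"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```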
Add an app-name to the requests so people know the requests come from the eval. https://openrouter.ai/docs#quick-start shows that other API client packages implement custom headers, but the Go package we are using does not, so do a PR to contribute that.
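Until that PR lands, a sketch of injecting such headers via a custom `http.RoundTripper`; the `HTTP-Referer` and `X-Title` header names are taken from the linked quick start and should be verified against the current docs, and the repository URL is a placeholder:

```go
package main

import "net/http"

// headerTransport adds app-identification headers to every outgoing request so
// the provider can attribute the traffic to the eval.
type headerTransport struct {
	base http.RoundTripper
}

func (t headerTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	// Clone the request before modifying it, as required by the RoundTripper contract.
	request = request.Clone(request.Context())
	request.Header.Set("HTTP-Referer", "https://github.com/example/eval") // Placeholder repository URL.
	request.Header.Set("X-Title", "eval-benchmark")                       // Placeholder app name.

	return t.base.RoundTrip(request)
}

func main() {
	client := &http.Client{Transport: headerTransport{base: http.DefaultTransport}}
	_ = client // Pass this client to the API package that should send the headers.
}
```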
Prepare language and evaluation logic for multiple files:
Use symflower symbols to receive files
Evaluation tasks
Add evaluation task for "querying the relative test file path of a relative implementation file path" e.g. "What is the test relative file path for some/implementation/file.go" ... it is "some/implementation/file_test.go" for most cases.
Add evaluation task for code refactoring: two functions with the same code -> extract into a helper function
Add evaluation task for implementing and fixing bugs using TDD
Check the determinism of models, e.g. execute each plain repository X times, and then check if the results are stable.
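A sketch of such a stability check, with `runEvaluation` standing in for one full pass over the plain repository:

```go
package main

import "fmt"

// isDeterministic executes an evaluation run several times and checks whether
// the per-task scores stay identical across runs.
func isDeterministic(runs int, runEvaluation func() map[string]float64) bool {
	reference := runEvaluation()
	for i := 1; i < runs; i++ {
		scores := runEvaluation()
		for task, score := range scores {
			if reference[task] != score {
				fmt.Printf("run %d differs on %q: %v != %v\n", i, task, score, reference[task])

				return false
			}
		}
	}

	return true
}

func main() {
	// Dummy evaluation that always returns the same scores, so it is reported as deterministic.
	run := func() map[string]float64 { return map[string]float64{"plain": 14} }
	fmt.Println(isDeterministic(5, run))
}
```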
Code repair
0-shot, 1-shot, ...
With LLM repair
With tool repair
Do test file paths through
symflower symbols
Task for models
Move towards generated cases so models cannot integrate fixed cases to always have 100% score
Think about adding more training data generation features: this will also help with dynamic cases
Heard that Snowflake Arctic is very open with how they gathered training data... so we see what LLM creators think of and want from training data
Documentation
Clean up and extend README
Better examples for contributions
Overhaul explanation of "why" we need evaluation, i.e. why is it good to evaluate for an empty function that does nothing.
Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up the points, but the runs should have at least a one-hour break in between to not run into cached responses.
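A sketch of that playbook step under the stated assumptions (5 runs, summed points, at least one hour between runs); `runEvaluation` is a placeholder for one full benchmark pass:

```go
package main

import (
	"fmt"
	"time"
)

// runEvaluation is a stand-in for one full benchmark pass returning its total score.
func runEvaluation() float64 { return 0 }

func main() {
	const runs = 5
	total := 0.0
	for i := 0; i < runs; i++ {
		if i > 0 {
			// Wait at least an hour between runs so cached provider responses
			// do not make repeated runs look artificially stable or fast.
			time.Sleep(time.Hour)
		}
		total += runEvaluation()
	}
	fmt.Printf("summed score over %d runs: %.1f\n", runs, total)
}
```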
Content
Benchmark that showcases base models vs. their fine-tuned coding models, e.g. in v0.5.0 we see that Codestral, codellama, ... are worse
Snowflake against Databricks would be a nice comparison since they align company-wise and are new
Tasks/Goals:
Release version of this roadmap issue:
`Model` and `Provider` to be in the same package
Preload Ollama models before inference and unload afterwards #121 (comment)
"No test files": actually identify and error that there are no test files (needs to be implemented in `symflower test`)
https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l2d4im0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button