MATH

Paper

Measuring Mathematical Problem Solving With the MATH Dataset https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

NOTE: This task corresponds to the MATH (hendrycks_math) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master. For the variant that uses the custom 4-shot prompt from the Minerva paper (https://arxiv.org/abs/2206.14858) and Minerva-style SymPy answer checking, see lm_eval/tasks/minerva_math.

Homepage: https://github.com/hendrycks/math

Citation

@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}

Groups and Tasks

Groups

  • hendrycks_math: the MATH benchmark from Hendrycks et al., run 0-shot or few-shot (see the usage sketch after the task list below).

Tasks

  • hendrycks_math_algebra
  • hendrycks_math_counting_and_prob
  • hendrycks_math_geometry
  • hendrycks_math_intermediate_algebra
  • hendrycks_math_num_theory
  • hendrycks_math_prealgebra
  • hendrycks_math_precalc
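
Both the group name and the individual task names can be passed to the evaluation harness. Below is a minimal sketch using the harness's Python API (lm_eval.simple_evaluate), assuming a recent lm-evaluation-harness install; the checkpoint and few-shot count are illustrative, not prescribed by this task:

```python
import lm_eval

# Evaluate the whole MATH group (all seven subject subtasks).
# "hf" loads a Hugging Face model; swap in whatever backend/checkpoint you use.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",  # illustrative checkpoint
    tasks=["hendrycks_math"],  # or a subtask, e.g. ["hendrycks_math_algebra"]
    num_fewshot=4,             # the task also supports 0-shot
)
print(results["results"])
```

The same task names can be passed to the lm_eval command-line interface via its --tasks flag.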

Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
      • Answer extraction code is taken from the original MATH benchmark paper's repository (a minimal sketch of the extraction idea is shown below).
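
The extraction keys on the final \boxed{...} expression in a solution string. The following is a minimal, hypothetical sketch of that idea, not the code actually vendored from the MATH repository:

```python
from typing import Optional

def extract_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution string."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    # Walk forward tracking brace depth so nested braces inside the boxed
    # expression (e.g. \frac{1}{2}) are handled correctly.
    depth = 0
    for i in range(start + len(marker) - 1, len(solution)):
        if solution[i] == "{":
            depth += 1
        elif solution[i] == "}":
            depth -= 1
            if depth == 0:
                return solution[start + len(marker):i]
    return None  # unbalanced braces

# Example: prints "\frac{1}{2}"
print(extract_boxed_answer(r"Thus the answer is $\boxed{\frac{1}{2}}$."))
```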

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?