The Benchmark Bug That Inflates (or Deflates) Healthcare AI Results
We found a bug in the design of state-of-the-art healthcare benchmarks that produces wildly different scores for the very same output. Depending on how the evaluation criteria are phrased, the score for an identical output can differ by up to 16%!