OSS-Bench: Benchmark Generator for Coding LLMs

(Beta – development ongoing)

💡 Live and evolving benchmarks 💡 No human/LLM curation 💡 Complex and realistic tasks 💡 Robust ground-truth 💡 Measurements on memory safety

Metric I: Compilability (Easy)

Compilability is a natural and common requirement in compiled languages. A failure to compile indicates syntax errors (e.g., malformed grammar) or semantic errors (e.g., use of undefined variables).
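
As an illustrative sketch (not drawn from OSS-Bench itself), the following C snippet exhibits both kinds of errors:

int broken_sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
                total += step;  /* semantic error: 'step' is never declared */
        }
        return total    /* syntax error: missing semicolon */
}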

Metric II: Functional Test (Medium)

Software testing (e.g., unit tests, integration tests, end-to-end tests, regression tests) is widely adopted in open-source software to verify that code updates preserve functionality.
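
As a minimal sketch of the idea, the assert-based unit test below checks a hypothetical clamp function; this is illustrative only, not OSS-Bench's actual test harness:

#include <assert.h>

/* Hypothetical function under test. */
static int clamp(int v, int lo, int hi) {
        if (v < lo) return lo;
        if (v > hi) return hi;
        return v;
}

int main(void) {
        /* Each assert is one functional check; a failing assert aborts the run. */
        assert(clamp(5, 0, 10) == 5);    /* in range: unchanged */
        assert(clamp(-3, 0, 10) == 0);   /* below range: clamped to lo */
        assert(clamp(42, 0, 10) == 10);  /* above range: clamped to hi */
        return 0;
}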

Metric III: Memory Safety (Difficult)

Memory safety is a fundamental security concern in compiled languages. Bugs such as buffer overflows and double-free errors are harmful and can be maliciously exploited.
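
For illustration (again, not taken from the benchmark), the snippet below contains both bug classes; compiling with a sanitizer such as AddressSanitizer (-fsanitize=address) makes each one trigger a runtime alert:

#include <stdlib.h>
#include <string.h>

int main(void) {
        char *buf = malloc(8);
        if (!buf) return 1;
        strcpy(buf, "far too long for eight bytes");  /* heap buffer overflow */
        free(buf);
        free(buf);  /* double free (reached only if the overflow is not fatal) */
        return 0;
}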

Configuration

Metric weights (adjustable sliders; defaults shown): Compile 0.33, Test 0.33, Sanitizer 0.33.

Leaderboard columns: Models, Param. Size, Compilability, Func. Test, Memory Safety, Delta (10%), Final Score.

Note: * indicates instruct models; all open-source models are evaluated under fp16 quantization.

Compile Score Visualization: Each diamond represents one compilation failure caused by a single function modification. Use the slider to navigate between functions.


Test Score Visualization: Greener is better, indicating higher test pass rates. Hover over dots to see details of each test iteration.

Test Module Analysis: Module-specific test pass rates for the top 5 models across all test modules, shown in a grid layout.

Sanitizer Score Visualization: Each dot represents one sanitizer alert. Hover over dots to see alert details.

OSS-Bench Code Editing Prompt

User Query
System Prompt: You are an expert on C programming.

Please optimize the following C function snippet (don't add header files) and (1) ensure memory safety (2) improve code efficiency. Give the optimized code directly without explanation. The code is as follows:

static zval *zho_it_get_current_data(zend_object_iterator *iter){
        zend_hooked_object_iterator *hooked_iter = (zend_hooked_object_iterator*)iter;
        zho_it_fetch_current(iter);
        return &hooked_iter->current_data;
}
Model Response
static inline zval *zho_it_get_current_data(zend_object_iterator *iter) {
        if (!iter) return NULL;
        zho_it_fetch_current(iter);
        return &((zend_hooked_object_iterator *)iter)->current_data;
}

Scoring Methodology

Each model is evaluated on three tasks, producing raw scores s1 (compile), s2 (functional test), and s3 (sanitizer), each in 0–100. We compute chained values so that later metrics are gated by earlier ones:

c1 = s1; c2 = s2 * (c1 / 100); c3 = s3 * (c2 / 100)

Metric score using weight sliders (w1, w2, w3):

metric_score = (w1*c1 + w2*c2 + w3*c3) / (w1 + w2 + w3)

Let d denote the dissimilarity of the LLM's rewrite, measured as the average number of changed lines per function. We award 10% of this value as a bonus to reward LLMs that make more substantial changes:

delta_score = d * 0.1 (the rate can be adjusted)

The final score is scaled to lie in 0–100:

score = metric_score + delta_score = (w1*c1 + w2*c2 + w3*c3) / (w1 + w2 + w3) + d*0.1
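
A minimal sketch of this computation in C, assuming scores and weights are plain doubles and that "scaled to lie in 0–100" means a simple clamp (the function name here is hypothetical):

#include <stdio.h>

/* s1..s3: raw scores in [0, 100] for compile, test, sanitizer.
 * w1..w3: weights from the sliders. d: average changed lines per function. */
static double final_score(double s1, double s2, double s3,
                          double w1, double w2, double w3, double d) {
        double c1 = s1;                 /* compile */
        double c2 = s2 * (c1 / 100.0);  /* test, gated by compile */
        double c3 = s3 * (c2 / 100.0);  /* sanitizer, gated by test */
        double metric = (w1 * c1 + w2 * c2 + w3 * c3) / (w1 + w2 + w3);
        double delta = d * 0.1;         /* 10% dissimilarity bonus */
        double score = metric + delta;
        if (score > 100.0) score = 100.0;  /* assumed clamp into 0-100 */
        return score;
}

int main(void) {
        /* Default equal weights (0.33 each); example raw scores and d = 4. */
        printf("%.2f\n", final_score(90.0, 80.0, 70.0, 0.33, 0.33, 0.33, 4.0));
        return 0;
}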

We are looking for sponsors or collaborators to support more models. Contact: yuancheng@comp.nus.edu.sg