OSS-Bench: Benchmark Generator for Coding LLMs

(Beta – development ongoing)

💡 Live and evolving benchmarks 💡 No human/LLM curation 💡 Complex and realistic tasks 💡 Robust ground-truth 💡 Measurements on memory safety

Metric I: Compilability (Easy)

Compilability is a natural and common requirement in compiled languages. A failure to compile indicates syntax errors (e.g., malformed grammar) or semantic errors (e.g., use of undefined variables).
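
As an illustrative sketch (not drawn from OSS-Bench itself), the following C snippet exhibits both kinds of errors:

int broken_sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
                total += step;  /* semantic error: 'step' is never declared */
        }
        return total    /* syntax error: missing semicolon */
}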

Metric II: Functional Test (Medium)

Software testing (e.g., unit tests, integration tests, end-to-end tests, regression tests) is widely adopted in open-source software to verify that code updates preserve functionality.
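
As a minimal sketch of the idea, the assert-based unit test below checks a hypothetical clamp function; this is illustrative only, not OSS-Bench's actual test harness:

#include <assert.h>

/* Hypothetical function under test. */
static int clamp(int v, int lo, int hi) {
        if (v < lo) return lo;
        if (v > hi) return hi;
        return v;
}

int main(void) {
        /* Each assert is one functional check; a failing assert aborts the run. */
        assert(clamp(5, 0, 10) == 5);    /* in range: unchanged */
        assert(clamp(-3, 0, 10) == 0);   /* below range: clamped to lo */
        assert(clamp(42, 0, 10) == 10);  /* above range: clamped to hi */
        return 0;
}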

Metric III: Memory Safety (Difficult)

Memory safety is a fundamental security concern in compiled languages. Bugs such as buffer overflows and double-free errors are harmful and can be maliciously exploited.
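
For illustration (again, not taken from the benchmark), the snippet below contains both bug classes; compiling with a sanitizer such as AddressSanitizer (-fsanitize=address) makes each one trigger a runtime alert:

#include <stdlib.h>
#include <string.h>

int main(void) {
        char *buf = malloc(8);
        if (!buf) return 1;
        strcpy(buf, "far too long for eight bytes");  /* heap buffer overflow */
        free(buf);
        free(buf);  /* double free (reached only if the overflow is not fatal) */
        return 0;
}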

Configuration

Metric weights (adjustable sliders; defaults shown): Compile 0.33, Test 0.33, Sanitizer 0.33.

Leaderboard columns: Models, Param. Size, Compilability, Func. Test, Memory Safety, Delta (10%), Final Score.

Note: * indicates instruct models; all open-source models are evaluated under fp16 quantization.

Compile Score Visualization: Each diamond represents one compilation failure caused by a single function modification. Use the slider to navigate between functions.


Test Score Visualization: Greener is better, indicating higher test pass rates. Hover over dots to see details of each test iteration.

Test Module Analysis: Module-specific test pass rates for the top 5 models across all test modules, shown in a grid layout.

Sanitizer Score Visualization: Each dot represents one sanitizer alert. Hover over dots to see alert details.

OSS-Bench Code Editing Prompt

User Query
System Prompt: You are an expert on C programming.

Please optimize the following C function snippet (don't add header files) and (1) ensure memory safety (2) improve code efficiency. Give the optimized code directly without explanation. The code is as follows:

static zval *zho_it_get_current_data(zend_object_iterator *iter){
        zend_hooked_object_iterator *hooked_iter = (zend_hooked_object_iterator*)iter;
        zho_it_fetch_current(iter);
        return &hooked_iter->current_data;
}
Model Response
static inline zval *zho_it_get_current_data(zend_object_iterator *iter) {
        if (!iter) return NULL;
        zho_it_fetch_current(iter);
        return &((zend_hooked_object_iterator *)iter)->current_data;
}

Scoring Methodology

Each model is evaluated on three tasks, producing raw scores s1 (compile), s2 (functional test), and s3 (sanitizer), each in 0–100. We compute chained values so that later metrics are gated by earlier ones:

c1 = s1; c2 = s2 * (c1 / 100); c3 = s3 * (c2 / 100)

Metric score using weight sliders (w1, w2, w3):

metric_score = (w1*c1 + w2*c2 + w3*c3) / (w1 + w2 + w3)

Let d denote the dissimilarity of the LLM's rewrite, measured as the average number of changed lines per function. We award 10% of this value as a bonus to reward LLMs that make more substantial changes:

delta_score = d * 0.1 (the rate can be adjusted)

The final score is scaled to lie in 0–100:

score = metric_score + delta_score = (w1*c1 + w2*c2 + w3*c3) / (w1 + w2 + w3) + d*0.1
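
A minimal sketch of this computation in C, assuming scores and weights are plain doubles and that "scaled to lie in 0–100" means a simple clamp (the function name here is hypothetical):

#include <stdio.h>

/* s1..s3: raw scores in [0, 100] for compile, test, sanitizer.
 * w1..w3: weights from the sliders. d: average changed lines per function. */
static double final_score(double s1, double s2, double s3,
                          double w1, double w2, double w3, double d) {
        double c1 = s1;                 /* compile */
        double c2 = s2 * (c1 / 100.0);  /* test, gated by compile */
        double c3 = s3 * (c2 / 100.0);  /* sanitizer, gated by test */
        double metric = (w1 * c1 + w2 * c2 + w3 * c3) / (w1 + w2 + w3);
        double delta = d * 0.1;         /* 10% dissimilarity bonus */
        double score = metric + delta;
        if (score > 100.0) score = 100.0;  /* assumed clamp into 0-100 */
        return score;
}

int main(void) {
        /* Default equal weights (0.33 each); example raw scores and d = 4. */
        printf("%.2f\n", final_score(90.0, 80.0, 70.0, 0.33, 0.33, 0.33, 4.0));
        return 0;
}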

We are looking for sponsors or collaborators to support more models. Contact: yuancheng@comp.nus.edu.sg