Eval Function Python Program Code

SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation

Abstract: In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it ...

GitHub

Provider-agnostic, open-source evaluation infrastructure for language models

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long ...

IEEE

An Evaluation of Large Language Models for Code Optimization

Abstract: Code optimization has traditionally been a manual and time-consuming process in which developers identify and correct coding inefficiencies and bad programming practices. Large Language ...

GitHub

WebArena: A Realistic Web Environment for Building Autonomous Agents

Check out this script for a quick walkthrough on how to set up the browser environment and interact with it using the demo sites we hosted. This script is for educational purposes only; to perform ...
