Eval Production Lead

Berkeley, California

About METR

METR is a non-profit which does empirical research to determine whether frontier AI models pose a significant threat to humanity. It’s robustly good for civilization to have a clear understanding of what types of danger AI systems pose, and know how high the risk is. You can learn more about our goals from our videos (overall goals, recent update).

Some highlights of our work so far:

  • Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability for a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
  • Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
  • Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
  • Early commitments from labs: Anthropic credited us for their recent Responsible Scaling Policy (RSP), and OpenAI recently committed to releasing a Risk-Informed Development Policy (RDP). These fit under the category of “evals-based governance”, wherein AI labs can commit to things like, “If we hit capability threshold X, we won’t train a larger model until we’ve hit safety threshold Y”.

We’ve been mentioned by the UK government, Obama, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.

About the role

Lead and manage our efforts to produce tasks/benchmarks/protocols that can determine if a model is dangerous.

Concretely, this might look like recruiting and leading a team of software engineers and research engineers to create suites of tasks which follow the METR Task Standard.

This role could also involve leading our effort at performing evaluations in-house, and helping evaluation teams at governments and AI companies run our evaluation protocols.

What we’re looking for

This role is best-suited for a generalist who enjoys wearing many hats. Former founders could be a good fit, as could research leads with software engineering experience, research engineers with management experience, former CTOs, or engineering managers/tech leads.

Great candidates for this role need three main things: leadership & execution abilities, engineering talent, and good research direction (each described below).

  • Research direction and communication:
    • Understands considerations and makes wise tradeoffs among the many different desiderata for evaluations (e.g. weighing implementation difficulty vs. informativeness vs. effort to run the evaluation).
    • Has well-calibrated estimates about how hard things are and how long they might take.
    • Communicates with other teams at METR as well as external partners about what evaluations they want the team to create, and helps them understand constraints and bottlenecks in eval creation and execution.

  • Engineering talent:
    • Strong technical design: avoids inessential complexity and accurately anticipates where corners can be cut and where they can’t
    • Capable of quickly writing & recognizing code which is lucid and correct

  • Leadership, management, and execution abilities:
    • Talented at scoping down first versions of things, being scrappy, rapidly derisking
    • Quickly identifies the most important bottlenecks and how to solve them
    • Has a founder mentality — will take ownership and make things go fast, doing whatever is necessary along the way.
    • Strong project management — able to smoothly oversee many moving pieces and build good project infrastructure
    • Can attract and grow talent — can evaluate, mentor and amplify world-class IC researchers and software engineers

Especially strong candidates might have experience with areas like:

  • Developing robust evaluation metrics for language models (Or for humans! Designing work tests or coursework may be relevant.)
  • Handling textual or code dataset sourcing, curation, and processing tasks involving many human workers
  • Managing codebases with a large number of low-context contributors
  • Quickly spinning up on, and overseeing experts’ work in, intensely technical domains (like the threat-model-relevant domains below)

Bonus Skills

  • Threat-model-relevant subject-matter expertise (in order to help assess the models’ abilities in these areas):
    • Large model inference and training compute optimization, especially for non-standard hardware setups
    • Complicated distributed system setups for many LM agents, secure command-and-control
    • Cybersecurity
    • Cybercrime, fraud, KYC, authentication
  • Familiarity with AI lab infrastructure and relevant technical or security constraints on serving and working with large models


Deadline to apply: None. Applications will be reviewed on a rolling basis.

Salary range: $209,000–$555,000 USD

Apply for this job

We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in-person (strongly preferred), we can likely sponsor a cap-exempt H-1B visa for this role.

We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.

Registering interest is quick: there’s just a single required question, which can be answered with a few bullet points.