Engineering Lead

Apply here. Applications are rolling, and there’s no set deadline.

About ARC Evals (now METR)

METR does empirical research to determine whether frontier AI models pose a significant threat to humanity. It’s robustly good for civilization to have a clear understanding of what types of danger AI systems pose, and to know how high the risk is. You can learn more about our goals from Beth’s talk.

Some highlights of our work so far:

  • Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability of a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
  • Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
  • Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
  • Early commitments from labs: Anthropic credited us for their recent Responsible Scaling Policy (RSP), and OpenAI recently committed to releasing a Risk-Informed Development Policy (RDP). These fit under the category of “evals-based governance”, wherein AI labs can commit to things like, “If we hit capability threshold X, we won’t train a larger model until we’ve hit safety threshold Y”.

We’ve been mentioned by the UK government, Obama, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.

About the role

The engineering lead at METR is in charge of our internal platform for evaluating model capabilities (concretely: infrastructure to run a hundred agents in parallel against different tasks inside isolated virtual machines), and also manages the engineers who expand this tooling.

This platform is critical to our success — as increasingly powerful models are created, we’ll need to keep pace by constructing tooling that allows us to evaluate these new models. As models gain new modalities and capabilities, the tooling necessary to test out their capabilities will shift as well.

The work is technically fascinating, and you get to be on the cutting edge of what models can do. If you’re up for it, you may also liaise with our partners (labs, the US and UK governments, etc.) as they embark on their own evaluation efforts. There’s room here to help set the standards for the tooling that enables evaluations overall.

Compensation is about $250k–$400k, depending on the candidate.

What we’re looking for

This role is best suited to a generalist who enjoys wearing many hats. Former founders could be a good fit, as could engineering managers who enjoy talking to users, or strong ICs or tech leads with at least a bit of management experience.

Engineering

Requirements:

  • Strong technical design — avoids inessential complexity, accurately anticipates where corners can be cut and where they can’t
  • Capable of writing code which is lucid and correct
  • Strategic about managing technical debt

Nice to haves:

  • Machine learning experience
  • A penchant for creating good tooling
  • Web application experience — has built them personally, is familiar with approaches for doing so, etc
  • Application security awareness — important due to the security-related content of some of our work, safeguarding lab access, and keeping AIs in the box
  • Some low-level OS knowledge (e.g. networking, packet filtering, etc)

Product & design

  • Empathizes with users (our researchers), understands their workflows, and invents intuitive tools to empower them
  • Communicates costs clearly with researchers to make better tradeoffs
  • Can integrate input from multiple stakeholders and set clear expectations with them
  • Is skilled in both “0 to 1” (designing something totally new) and “1 to n” (improving something which exists)
  • Can triage requests for features, prioritize, say "no" where necessary
  • Can be scrappy, quickly spinning up and prototyping projects when needed

General

  • Communicates clearly (written & verbal)
  • Mission-aligned — is concerned about AI disempowering humanity
  • Capable of being scrappy: has a personal track record of quickly spinning up projects, prototyping rapidly, etc. It seems really hard to lead a team to behave that way unless you know how to do it yourself.
  • Founder mentality — will take ownership and make things go fast, doing whatever necessary along the way

Leadership

  • Eventually earns respect, is looked to for sound judgment
  • Can achieve product velocity
    • Talented at scoping down first versions of things
    • Excites team towards goals by making clear why they matter
    • Pays attention to how fast we're moving, actively thinks of ways we can stay fast or move faster (without creating undue strain on people!)

People

  • Can hold effective 1:1s, elicit and solve problems
  • Can woo & evaluate engineers — can recognize signs of talent in potential hires
  • Helps people see how they could grow
  • Able to break down a project into tasks that are consumable by engineers
  • Able to spec tasks at differing levels, e.g.
    • A single high-level instruction (for an engineer who has lots of context and can do their own UI design)
    • More granular tasks with clear descriptions, hints for good ways to implement (for new engineers with less context)

Our team

Our engineering team (current size: three people) is motivated and talented, and you’ll also work closely with our research team.

You’ll work alongside our current engineering lead, Ted Suzman (who’ll switch to a different role once you join and transition in). He previously built an app that reached over 25 million users, went through Y Combinator, and developed Amplify and Empower, tools that have supported democratic political organizing since the 2020 election.

After a transitional period, you’ll report to our project lead, Beth Barnes, who redid OpenAI’s Debate framework during her time there before founding Evals. You’ll also work closely with Daniel Ziegler, our research lead and a strong ML engineer, whose previous work includes pioneering implementations of RLHF and core pieces of GPT-3 infrastructure while at OpenAI.

Our research is advised on a weekly basis by Paul Christiano, who was one of the inventors of RLHF.

Nature of the work

In practice, the work looks like a mix of technical strategy, technical design, code review, hiring, management, etc. The engineering team integrates tightly with each research initiative by attending research standups (and there aren’t barriers to contributing ideas directly on the research front, either!). Research isn’t totally beholden to engineering, nor the other way around; there’s a collaboration between the two which feels good and productive.

Some example projects:

  • Figuring out how our agent safety policy should work (e.g., how to oversee or isolate agents from their external environment)
  • Sorting out a way to test agents
  • Designing a task format that can import external task suites and also works well for defining our own tasks internally
  • Working side-by-side with researchers to discover the bottlenecks they’re facing, and designing tooling that might help
  • Designing workflows for contractors to annotate data for fine-tuning

Our days start at 10am with a very short (5 minute) organization-wide standup, followed by particular team standups.

Impact of the work

We’re confronted each day with so many valuable opportunities for impact, most of which we have to say “no” to because we don’t have the capacity to take them on. There’s a huge opportunity to scale our internal capacity and take on more. That’s why we’re so excited about this role — we’re in an extraordinary moment with very high stakes, and the person in this role could help us all meet the moment.

How to register interest

We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in-person (preferred), we can likely sponsor a cap-exempt H-1B visa for this role.

We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.

Registering interest is quick: there’s just a single required question, which can be answered with a few bullet points. Register interest!