Apply here. Applications are rolling, and there’s no set deadline.
About ARC Evals (now METR)
METR does empirical research to determine whether frontier AI models pose a significant threat to humanity. It’s robustly good for civilization to have a clear understanding of what types of danger AI systems pose, and to know how high the risk is. You can learn more about our goals from Beth’s talk.
Some highlights of our work so far:
- Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability for a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
- Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
- Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
- Early commitments from labs: Anthropic credited us for their recent Responsible Scaling Policy (RSP), and OpenAI recently committed to releasing a Risk-Informed Development Policy (RDP). These fit under the category of “evals-based governance”, wherein AI labs can commit to things like, “If we hit capability threshold X, we won’t train a larger model until we’ve hit safety threshold Y”.
We’ve been mentioned by the UK government, Obama, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.
About the role
Software engineers at METR work on our internal platform for evaluating model capabilities (concretely: infrastructure to run a hundred agents in parallel against different tasks inside isolated virtual machines).
This platform is critical to our success — as increasingly powerful models are created, we’ll need to keep pace by constructing tooling that allows us to evaluate these new models. As models gain new modalities and capabilities, the tooling necessary to test out their capabilities will shift as well.
The work is technically fascinating, and you get to be on the cutting edge of what models can do. Your work may also contribute to that of our partners (labs, the US and UK governments, and others) as they embark on their own evaluation efforts. There’s room here to help set the standards for tooling that enables evaluations overall.
Compensation is about $150k–$350k, depending on the candidate.
What we’re looking for
This role is best suited for experienced software engineers with strong technical design skills and the ability to get stuff done. People who have started their own projects, been founders or early employees, or otherwise worked on small teams could be a great fit. It’s likely that people who fit these requirements will have at least a few years of software engineering experience.
- Strong technical design — avoids inessential complexity, accurately anticipates where corners can be cut and where they can’t
- Can write code which is lucid and correct
- Strategic about managing technical debt
- Capable of being scrappy when it’s called for, or making investments in really great code when it’s called for
Nice to haves:
- Machine learning experience
- A penchant for creating good tooling
- Web application experience
- Application security awareness — important due to the security-related content of some of our work, safeguarding lab access, and keeping AIs in the box
- Communicates clearly (written & verbal)
- Mission-aligned — is concerned about AI disempowering humanity
- Empathizes with users (our researchers), understands their workflows, writes tooling to empower them
Our engineering team (current size: three people) is motivated and talented, and you’ll also work closely with our research team.
You’ll work with our current engineering lead, Ted Suzman. He previously built an app that reached over 25 million users, went through Y Combinator, and developed Amplify and Empower, tools that have supported democratic political organizing since the 2020 election.
You’ll also work with Daniel Ziegler, our research lead and a strong ML engineer, whose previous work includes pioneering implementations of RLHF and core pieces of GPT-3 infrastructure while at OpenAI.
Our research is advised on a weekly basis by Paul Christiano, who was one of the inventors of RLHF.
Nature of the work
In practice, the work is a mix of writing code, technical design, code review, etc. The engineering team integrates tightly with each research initiative by attending research stand-ups, and there are no barriers to contributing ideas directly on the research front. Research is not beholden to engineering, nor the other way around; the two collaborate in a way that feels good and productive.
Some example projects:
- Figuring out how our agent safety policy should work (e.g., how to oversee or isolate agents from their external environment)
- Making it possible to save/restore agent state in order to branch attempts made by agents on tasks
- Designing a way to efficiently test changes to agents
- Designing a task format that can import external task suites and also works well for defining our own tasks internally
- Designing workflows for contractors to annotate data for fine-tuning
Our days start at 10am PT with a very short (5-minute) organization-wide standup, followed by individual team standups. This role is open to people who want to work in person in Berkeley, remotely (with occasional in-person visits), or in a hybrid arrangement.
Impact of the work
We’re confronted each day with so many valuable opportunities for impact, most of which we have to say “no” to because we don’t have the capacity to take them on. There’s a huge opportunity to scale our internal capacity and take on more. That’s why we’re excited about this role — we’re in an extraordinary moment with very high stakes, and people like you could help us all meet the moment.
How to register interest
We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in person (preferred), we can likely sponsor a cap-exempt H-1B visa for this role.
We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.