Human Data Lead

Apply here. Applications are rolling, and there’s no set deadline.

About ARC Evals (now METR)

METR does empirical research to determine whether frontier AI models pose a significant threat to humanity. It’s robustly good for civilization to have a clear understanding of what types of danger AI systems pose and how high the risk is. You can learn more about our goals from Beth’s talk.

Some highlights of our work so far:

  • Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability of a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
  • Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.
  • Inspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.
  • Early commitments from labs: Anthropic credited us for their recent Responsible Scaling Policy (RSP), and OpenAI recently committed to releasing a Risk-Informed Development Policy (RDP). These fit under the category of “evals-based governance”, wherein AI labs can commit to things like, “If we hit capability threshold X, we won’t train a larger model until we’ve hit safety threshold Y”.

We’ve been mentioned by the UK government, Obama, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.

Our team

You’ll work with the incredibly talented and passionate members of our team and manage our highly skilled contractors, who produce data that helps us improve our LLM-powered agents and better assess agents’ true capabilities. You’ll report to our industry-leading team of Model Evaluation Researchers, who design and run experiments for assessing the capabilities of state-of-the-art language models. You’ll act as the interface between the researchers and contractors.

Your role

Data from our contractors is crucial to our model evaluation efforts, since high-quality feedback on agent behavior is a key bottleneck to improving agent performance. You’ll have the critical role of managing this data generation process: recruiting and managing skilled contractors and incentivizing them to generate high-quality data on challenging tasks. You’ll also evaluate different data generation contracting services and strategies, and make high-level decisions about when to start outsourcing to scale up data generation.

In addition, the research team often needs contractors to conduct one-off analyses and experiments, so you’ll scope down such tasks, create instructions, and delegate these to appropriate contractors.

To succeed in this role, you’ll:

  • Leverage generalist software engineering skills to quickly understand software-based tasks in our growing evaluation suite, including what optimal solutions to them can look like
  • Craft instructions that help our contractors, who come from around the world and have varying backgrounds, understand tasks clearly
  • Design incentives that encourage high-quality data generation
  • Encourage interaction among contractors and with yourself, to motivate contractors and help them feel part of a team
  • Train and mentor contractors to help increase their performance
  • Discover and test potential improvements to the data generation process
  • Enjoy producing tangible outputs with tight feedback loops
  • Tackle our biggest bottlenecks to high quality data generation as they develop
  • Deeply understand our research, so you can suggest ways our contractors can help beyond data generation

Examples of things you might do include:

1. Manage the team of skilled contractors to extend a small set of expert-produced gold-standard ratings of complex ML trajectories (e.g. 100 examples) into a large dataset, making sure to maintain rating quality close to the gold standard.

2. Ask researchers how labels should be generated, and then produce clear instructional materials that help contractors understand the task.

a. An example task is "identify mistakes made by an LLM agent attempting to set up a state-of-the-art open-source language model on a new server".

3. Compare different contracting services for a data generation task and decide which, if any, could produce data that meets our bar.
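As a concrete illustration of maintaining quality in (1), one common approach is to seed contractors’ work queues with gold-standard items and track each contractor’s agreement with the expert ratings. The sketch below is a minimal, hypothetical example of that kind of check (the rating format, data structures, and function names are assumptions for illustration, not METR’s actual tooling):

```python
from collections import Counter

def agreement_with_gold(gold: dict, contractor: dict) -> float:
    """Fraction of overlapping items where the contractor's rating
    exactly matches the gold-standard rating."""
    overlap = gold.keys() & contractor.keys()
    if not overlap:
        raise ValueError("no overlapping items to compare")
    return sum(gold[k] == contractor[k] for k in overlap) / len(overlap)

def cohens_kappa(gold: dict, contractor: dict) -> float:
    """Raw agreement corrected for chance (Cohen's kappa), computed
    on the items rated by both the experts and the contractor."""
    overlap = list(gold.keys() & contractor.keys())
    n = len(overlap)
    p_observed = sum(gold[k] == contractor[k] for k in overlap) / n
    gold_counts = Counter(gold[k] for k in overlap)
    con_counts = Counter(contractor[k] for k in overlap)
    # Chance agreement: probability both raters assign the same label
    # if each picked labels at random from their own label distribution.
    p_chance = sum(gold_counts[c] * con_counts.get(c, 0)
                   for c in gold_counts) / (n * n)
    return 1.0 if p_chance == 1 else (p_observed - p_chance) / (1 - p_chance)
```

A contractor whose kappa on seeded gold items falls below some threshold could then be flagged for additional training or review before their ratings enter the main dataset.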

Compensation is about $150k–$250k, depending on the candidate.

About you

While we don’t expect any particular set of qualifications, you might be a good fit if you:

1. Have successfully organized fast-moving projects with many people involved

2. Understand LLMs, including finetuning with human data

3. Are skilled at data analytics and munging

4. Are familiar with Bash and Python, and have general software engineering skills at least at the level of an entry-level software engineer

5. Are eager to engage with the tasks and produce high-quality data yourself

We strongly prefer applicants who can work in person but will consider exceptional remote candidates.

Application Process

We primarily use work tests to evaluate candidates. If we proceed with your application after review, we'll typically ask you to do 1–2 of these. The first trial task is often (but not always) less than an hour, while later trial tasks can be longer. If we ask you to do a work test that is expected to take 2 or more hours, we will compensate you at a competitive hourly rate for your time.

After the work tests, we'll typically schedule a short phone or video call to discuss what you're looking for in a role, ask you any questions we have, and provide time for any other questions you have.

The final stage in our process is usually a few-week trial period; we're open to figuring out how to manage this with each candidate.

We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in-person (preferred), we can likely sponsor a cap-exempt H-1B visa for this role.

We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.