Measuring AI R&D Automation
Automation of AI R&D could be a big deal. We should track its broader consequences.
Frontier AI companies are trying to automate AI R&D. If they succeed, the consequences could be far-reaching, for several reasons.

One reason is that AI progress could go a lot faster. We might get the benefits of AI a lot sooner, but dangerous capabilities might come sooner too. Would defensive capabilities (e.g., forecasting) and safety research accelerate enough to keep pace with dangerous capabilities? And would we still be able to make sense of what is going on quickly enough to respond? We don’t yet know.
If we automate AI R&D, we might also lose important oversight over the research process. AI systems could sabotage alignment experiments or poison the training of future AI systems without our knowledge. Concentration of power could become much more of a problem if AI companies need fewer human researchers: small groups of people within companies could make large leaps in AI progress with fewer internal or external checks. Automating oversight could mitigate some of these concerns, but we don’t know if it will suffice.

Empirical data on these dynamics could help inform important decisions in the near future. Companies will decide whether and how to implement safeguards upon crossing AI R&D automation thresholds. Governments will decide whether to impose additional regulation (e.g. oversight requirements for AI R&D). And public understanding of these trends will shape broader expectations of governments and companies.
Unfortunately, the data currently available is limited. We have AI R&D benchmarks, but they can saturate quickly (even the benchmarks used in the METR time horizon metric, which were specifically designed not to saturate), and they don’t measure how much AI R&D is actually automated in practice. We also have little data on how AI R&D automation affects oversight.
In a new paper, we propose 14 metrics that AI companies (primarily, but not exclusively) could track to help fill these gaps. The metrics fall into 4 categories:
Experimental metrics involving model behaviour in controlled environments, such as evaluating the effectiveness of oversight systems.
Survey-based metrics that ask staff for their views, such as on how much AI boosts their productivity.
Operational metrics that capture the AI R&D process as it’s operating, such as tracking how researchers allocate their time.
Organizational metrics that provide structural information about AI companies and the AI R&D process, such as the share of R&D spending going towards capital compared to labour.
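As a rough illustration of how two of the examples above might be turned into numbers, here is a minimal sketch. The function names, input format, and activity labels are assumptions made for illustration; they are not definitions taken from the paper.

```python
def capital_share(capital_spend: float, labour_spend: float) -> float:
    """Fraction of R&D spending that goes to capital rather than labour.

    Both inputs are hypothetical totals in the same currency and period.
    """
    if capital_spend < 0 or labour_spend < 0:
        raise ValueError("spending values must be non-negative")
    total = capital_spend + labour_spend
    if total == 0:
        raise ValueError("total R&D spending must be positive")
    return capital_spend / total


def time_allocation(hours_by_activity: dict[str, float]) -> dict[str, float]:
    """Share of researcher time spent on each activity.

    `hours_by_activity` maps a hypothetical activity label to hours logged.
    """
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    return {activity: hours / total for activity, hours in hours_by_activity.items()}
```

Tracked over time, a rising capital share or a shift in how researchers spend their hours could each hint at increasing automation, though neither would be conclusive on its own.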
No single metric perfectly correlates with the extent of AI R&D automation or its effects, but different metrics can compensate for each other’s weaknesses and provide a broader picture. Also, some metrics could be sensitive enough to be shareable only with select parties (e.g. governments, third-party auditors) rather than the general public.
We intend these metrics to be useful for frontier AI companies designing and implementing their safety frameworks, policymakers developing reporting requirements, and researchers studying AI development trajectories. We recommend that these actors focus on metrics (or components thereof) that are relatively neglected, which we summarize in the table below.
Having all this information will put us in a much better position to understand and respond to AI R&D automation. But we still need to figure out what to do as we approach full automation of AI R&D. Future work should address this problem.
See the full paper for more details.







