How Could Formal Verification Help with Cyber Resilience?
Formal verification could help to secure critical software, but it faces a number of practical barriers. More advanced AI could help lift many, but not all, of them.
Summary
Cyber agents are getting good. A Chinese state-sponsored actor likely used Claude Code to orchestrate a cyber espionage campaign. Coding agents can help find exploits on live smart contracts worth millions of dollars. AI-enabled malware can dynamically generate malicious scripts and obfuscate code to evade detection.
To mitigate increased cyber risks, we will need a multi-pronged approach. Approaches like misuse safeguards can help to deal with the leading edge of AI cyber capabilities. But we will also need societal-level interventions that can protect us as these capabilities diffuse widely. The OpenAI Foundation alone will commit up to billions of dollars for such resilience interventions.
One intuitively appealing approach to cyber resilience is formal verification: mathematically guaranteeing that software behaves according to a precise specification. Often, formal verification will mean proving that software is free of certain bugs, such as memory safety bugs (which account for roughly 70% of security vulnerabilities). Formal verification can also help prevent cryptographic flaws, authentication bypasses, and race conditions. The hope would be not just to address these flaws in existing software, but to integrate formal verification into the development process for new software.
Although appealing, formal verification efforts currently face at least three barriers:
We sometimes lack appropriate formal verification tools.
Even when the tools exist, using them might be costly.
And even when we manage to produce formally verified code, the code might not be adopted.
More advanced AI will likely help with many, but not all, of these barriers. For example, AI could reduce the cost of writing specifications and proofs, thereby making verification of existing codebases more feasible. However, blockers like feasibility for large projects and organizational inertia will potentially remain for some time.
Given such constraints, cyber resilience efforts (e.g. funders) have two kinds of bets available:
High-risk, high-reward bets to drastically advance AI capabilities for formal verification, potentially making it feasible to verify projects exceeding 1 million lines of code.
Lower-risk (but still resource-intensive) bets to verify small (< 200 000 lines), widely-used codebases using available or emerging AI tools, with concrete plans for adoption of the verified code.
In either case, and to improve software security in the long run, we will need investments in:
Developer-friendly tooling to integrate formal verification into standard workflows, such as through AI coding tools that provide verification assistance.
More benchmarks for writing formal specifications and proofs, to accelerate the process of automating these tasks.
Besides formal verification, resilience efforts should also explore complementary interventions that can scale to larger codebases and reliably deliver on shorter timelines, especially for organizations with minimal investments in cybersecurity.
What is formal verification and how could it help at all?
Many exploits come from a mismatch between how developers intend software1 to behave (its “specification”) and how it actually behaves. For example, most security vulnerabilities come from unintended memory safety issues, such as buffer overflows. Incorrect implementation of cryptography algorithms can leak secret keys. Flawed authentication logic can allow attackers to bypass access controls entirely.
Formal verification uses mathematical proofs to guarantee that software behaves according to its specification. Some tools verify whether existing code matches a specification, while others generate code directly from specifications. Formal verification can thus rule out (or find) exploits. For example:
For encrypted messaging, the Signal protocol provably guarantees that compromising a session key does not expose past or future messages.
The Rust compiler enforces ownership rules that have been formally proven, for a core subset of the language, to guarantee memory safety (although the implementation of the compiler itself is not formally verified).
CompCert provably compiles (a subset of) C code correctly.2 Compilation errors could lead to exploits: a compilation error in a previous version of the Linux kernel could potentially have allowed attackers to escalate user privileges.
In a draft of the TLS 1.3 protocol, formal verification helped find an attack that allowed an adversary to impersonate a client.
Although we always have to trust something (e.g. the specification, the proof checker, the hardware, or the physical environment), formal verification radically shrinks how many things we need to trust.
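To make the idea concrete, here is a toy specification-and-proof pair in Lean 4. This is an illustrative sketch of my own (the names absVal and absVal_nonneg are hypothetical, not drawn from the projects above): a one-line specification ("the result is never negative") and a machine-checked proof that the implementation meets it.

```lean
-- A toy implementation and its specification. The theorem states that
-- absVal never returns a negative number, for every possible input;
-- Lean's proof checker verifies this mechanically.
def absVal (n : Int) : Int := if n < 0 then -n else n

theorem absVal_nonneg (n : Int) : 0 ≤ absVal n := by
  unfold absVal
  split <;> omega   -- case-split on the `if`, close both cases by arithmetic
```

If any input could produce a negative result, the proof would fail to check. That is the sense in which verification rules out whole classes of bugs, rather than sampling for them the way testing does.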
One hope for cyber resilience is not just to formally verify existing software, but to integrate formal verification into the process of developing new software. If we could do so easily, there is a lot we might want to formally verify:
Communication protocols. Flaws in protocols can enable attackers to intercept traffic, hijack sessions, or impersonate legitimate parties.
Compilers for widely used languages.
Legacy systems in government.
Software used in critical infrastructure, such as water treatment plants (which tend to have poor cybersecurity), power and sewer systems, telecommunications, and hospitals.
Hypervisors. Vulnerabilities in hypervisors can allow attackers to escape virtual machine isolation and access data from other tenants in cloud environments.
Web browsers. Browser vulnerabilities can enable attackers to escape sandboxes and access users’ machines.
Operating systems. Vulnerabilities can grant attackers full control over a system.
Why isn’t formal verification more widespread?
Given how much we might want to formally verify, why isn’t formal verification more widespread? There are three main reasons:
We don’t have the tools or infrastructure
Even when we have the tools, using them is costly
Even when we produce formally verified code, the code might not be adopted
Lack of tools or infrastructure
Verifying existing code requires:
Formal semantics: Mathematically precise statements of what expressions in the programming language mean.
Proof infrastructure, which includes proof systems (the rules for how proofs work) and theorem provers (software tools that automate applying the rules).
Legacy languages could lack both of these things. Consider COBOL, which remains prevalent in finance and government. The language has a notoriously fragmented ecosystem. Authors of the GCC COBOL compiler noted:
The authors are well aware that a complete, pure COBOL-85 compiler won’t compile most existing COBOL code. Every vendor offered (and offers) extensions, and most environments rely on a variety of preprocessors and ancillary systems defined outside the standard.
Different formal semantics are needed for each vendor version, which decreases the incentive to create them. Researchers in the 1990s made an attempt, but admitted it only applied to “40% of COBOL”. COBOL also lacks proof infrastructure. Static analyzers exist, but they cannot provide mathematical guarantees. The closest example of proof infrastructure seems to be this paper, but the framework only covers a subset of the language, has no follow-up papers, and does not seem to be widely used.
Another issue is that formal semantics and proof infrastructure are missing for certain kinds of computational systems. Concurrent systems (i.e., those where multiple processes run at the same time) are one example, where we lack frameworks even to properly reason about the properties we would want to verify. The multi-core version of the seL4 microkernel remains unverified for this reason.
Cost of using the tools
Once we have the tools and infrastructure, we still need to:
Write a specification
Prove that the code3 implements the specification
Knowing and precisely specifying what we want software to do is a challenge. For complex software (e.g. web browsers) that interacts with humans and other systems, it is inherently difficult to account for all possible desired behaviours and edge cases.4 For existing code, writing the specification requires understanding what the code is supposed to do. This task is particularly difficult for legacy code: COBOL is notoriously difficult to understand, and the original developers of the code may have long since moved on. Specification mistakes crop up in practice too: for example, this paper from 2024 found 10 specification bugs in real-world libraries, including an AWS encryption library.
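The difficulty of capturing everything you care about can be made concrete with the classic sorting example (the same one discussed in footnote 4, here as a runnable Python sketch; the helper names are mine): a "sortedness only" specification is satisfied by a broken sort that throws its input away, so the specification must also demand that the output be a permutation of the input.

```python
from collections import Counter

def is_sorted(xs):
    """The 'ascending order' clause of a sorting specification."""
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def meets_full_spec(original, result):
    """Sortedness plus the permutation clause: same elements, same counts."""
    return is_sorted(result) and Counter(result) == Counter(original)

def bogus_sort(xs):
    """A broken 'sort' that discards its input entirely."""
    return [0] * len(xs)

data = [3, 1, 2]
print(is_sorted(bogus_sort(data)))              # True: sortedness alone is fooled
print(meets_full_spec(data, bogus_sort(data)))  # False: the permutation clause catches it
print(meets_full_spec(data, sorted(data)))      # True
```

The incomplete specification is not wrong, exactly; it is just silent about a property we care about, and the proof will happily verify a broken implementation against it.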
We also need to prove that the code implements the specification. Although some proofs can be automated, humans often need to intervene. Quantitative evidence is hard to come by, but qualitative evidence suggests that the process is time-consuming. CompCert’s proof took 6 person-years of effort. seL4’s proof took 20 person-years, which the authors estimate could be reduced to 8 person-years. Debugging failed proofs can be particularly time-intensive. When an automated proof fails, error messages often don’t reveal whether the problem lies in the code, an overly strict specification, or the proof approach itself.
These costs seem to balloon as projects grow. The limited empirical data suggests that proof effort scales linearly with proof size, and proof size scales with the square of specification size. Another paper finds that specification size scales roughly linearly with code size for components of seL4. Putting it all together, proof effort (excluding effort to write the specification) seems to scale quadratically with code size.
What does this scaling mean concretely? Linux has about 40 million lines of code compared to seL4's roughly 10 000. If seL4 optimistically would have taken 8 person-years to verify, then verifying Linux might take 8 × (40 000 000 / 10 000)² = 8 × 16 000 000 = 128 million person-years.
This effort seems completely infeasible without significant parallelization and/or AI assistance.
Despite these difficulties, formal verification can still be worth it. AWS engineers found that it forced them to think more clearly about design, and gave them confidence to pursue aggressive optimizations. Even so, the perception of high costs could discourage formal verification.
An additional problem is that the costs of formal verification, or security failures in general, may not be distributed optimally, even when the benefits of formal verification would outweigh the costs. For instance, formal verification of widely used, open-source software (e.g. OpenSSL, MySQL) would benefit all users, but impose costs mainly on the project’s maintainers.
Difficulties adopting the code
Even once we have formally verified code, there is no guarantee it will be used. miTLS, a formally verified implementation of TLS developed over the span of ten years, has seen limited adoption in contrast to the more popular (but not formally verified) OpenSSL.5 The CompCert compiler is used in a few safety-critical applications, but is not even listed as an option in one of the most popular developer surveys. Similarly, the seL4 microkernel seems limited to a few safety-critical applications.
A few factors could hinder adoption of formally verified code:
Performance losses. Code extracted from a formally verified algorithm could fail to meet performance expectations. For example, CompCert-compiled code (the compilation is verified, not the code itself) can be slower because aggressive optimizations used by other compilers have not been formally verified. That said, verification can sometimes enable better performance. AWS engineers found that bug fixes discovered during verification often improve runtime, and verification gives engineers confidence to attempt optimizations they would otherwise consider too risky.
Maintenance burden. For a codebase to remain formally verified, every new feature has to be formally verified. New features could break existing proofs and require changes to the specification.
Matching code style. Some formal verification tools help to generate code from a description of an algorithm. While this generated code is correct, it may not match the style conventions used elsewhere in an organization, such as variable naming, formatting, commenting practices, and code organization. This barrier affected Firefox’s adoption of HACL*, the formally verified cryptography library.
Additional blockers emerge when formally verified code is written for another organization to adopt.
API compatibility. Established libraries expose a set of functions (an “API”) that any codebase can use. If efforts to replace existing libraries do not use the same APIs, they will be incompatible with existing codebases. For example, miTLS has a fundamentally different API from OpenSSL, meaning that applications would have to be rewritten to use miTLS.
Lack of expertise and familiarity. Formal verification is not a part of standard development workflows. As a result, many developers will struggle to understand, debug, or modify formally verified code written by others. Especially in sectors like critical infrastructure with strict uptime requirements, organizations can be extremely conservative about the adoption of even critical security patches, let alone entirely new code.
Build system integration. Verified code may require different compilers, lack support for certain platforms, or need different linking configurations, which would require organizations to rework existing pipelines.
How much could AI help?
For lack of tooling or infrastructure:
AI could help develop new formal verification tools and infrastructure, although it is unclear how much it will help with thorny conceptual problems like verification of concurrent systems.
AI could help rewrite legacy code into languages that have verification tooling (e.g. see this paper). However, conservative organizations may resist porting entire codebases, the target language could be impractical in production, and verifying the correctness of the rewrite would be difficult (i.e. you would need formal semantics).
For the cost of using the tools:
AI could likely help to write and improve specifications by helping us to formalize what we want, which some researchers are already exploring. However, a fundamental problem remains: we need to know what we want.
Given recent advances in AI for mathematics, AI could make proofs a lot easier to write. Some existing lines of work aim to build AI systems that formally verify software.
AI could likely help rewrite formally verified code to ensure API compatibility and build system integration.
AI could make formal verification cheaper, but doing so only partially solves the cost distribution problem. For widely used open-source software, we may still run into the free-rider problem: everyone benefits from formal verification, but no single organization has a strong incentive to do the work.
For adoption:
AI could help improve the performance of formally verified code. For compilers, AI could help to verify that optimizations used in practice are (or are not) formally correct. More generally, as AWS engineers noted, formal verification can give developers the confidence to attempt optimizations they would otherwise consider too risky.
AI could reduce maintenance burden by helping update proofs and code for new features.
If AI coding tools start to integrate formal verification capabilities, verification could become a part of standard development processes. Indeed, adoption of AI coding tools is widespread amongst software developers. Still, formal verification requires adopting new workflows. Startups and new projects will adapt more easily, but organizations with legacy code or working with critical infrastructure tend to be conservative about new development practices.
What would be feasible to formally verify?
An especially binding constraint is the possibility that proof effort scales quadratically with codebase size. Given this constraint, and different assumptions about AI speed-up, what is the largest existing codebase that we could feasibly verify in the next few years?
Model
As discussed above, the limited empirical data suggests that proof effort scales quadratically with code size. If a reference project with L lines of code took P person-years to verify, then a new project with n lines of code would require roughly P × (n / L)² person-years.
Using seL4 as a reference (10 000 lines, 20 person-years),6 we can estimate the maximum feasible codebase size under different AI speed-up assumptions.
For reference, the Linux kernel was estimated in 2008 to have required 7 500 person-years of effort (17 years of calendar time). If we treat this estimate as an upper bound for a major, sustained effort, then:
Without AI assistance, codebases on the order of 100 000 lines seem potentially feasible.
With a 10x AI speed-up, codebases on the order of 500 000 lines seem potentially feasible.
With a 100x AI speed-up, codebases on the order of 1 million lines seem potentially feasible.
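The model is simple enough to sketch in a few lines of Python. The reference point (seL4: 10 000 lines, 20 person-years) and the 7 500 person-year budget are the figures cited in this section; the function name is mine.

```python
import math

# Quadratic-scaling model from the text: verifying n lines of code costs
# roughly P * (n / L)^2 person-years, divided by any AI speed-up, where the
# reference project (seL4) has L = 10,000 lines and P = 20 person-years.
REF_LINES, REF_EFFORT_PY = 10_000, 20

def max_feasible_lines(budget_py, speedup=1):
    """Largest codebase verifiable within a person-year budget at a given speed-up."""
    return REF_LINES * math.sqrt(budget_py * speedup / REF_EFFORT_PY)

# Budget: ~7,500 person-years, the 2008 effort estimate for the Linux kernel.
for s in (1, 10, 100):
    print(f"{s:>3}x speed-up: ~{max_feasible_lines(7_500, s):,.0f} lines")
```

This prints roughly 194 000, 612 000, and 1 937 000 lines for 1x, 10x, and 100x respectively, which round down to the order-of-100 000, 500 000, and 1 million figures above.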
How much could AI speed us up?
What degree of AI speed-up is plausible within the next few years? Benchmarks suggest large potential gains, but real-world evidence is more mixed. Two relevant reference classes are math and software engineering.
In math, benchmarks suggest speed-ups on the order of 100x. On AIME, the best models achieve above 90% accuracy in about 100 seconds. Given the three-hour time limit, the implicit speed-up is roughly 100x. AlphaGeometry2 solved a particular IMO 2024 problem within 19 seconds. With 90 minutes budgeted per question, the speed-up is 280x. AI systems are also getting faster every year: silver-medal performance took 3 days in 2024, and gold-medal performance took 4.5 hours in 2025. A key caveat is that math competition problems are clean and well-defined. Real-world formal verification involves messy codebases, ambiguous specifications, and extensive iteration.
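As an arithmetic sanity check on these figures, using the time budgets stated above:

```python
# Implied speed-ups from the competition-math results cited above.
aime_speedup = (3 * 60 * 60) / 100   # 3-hour exam vs. ~100 seconds of model time
imo_speedup = (90 * 60) / 19         # 90 minutes per problem vs. 19 seconds
print(round(aime_speedup), round(imo_speedup))  # 108 and 284
```

These come out to about 108x and 284x, matching the "roughly 100x" and "280x" figures in the text.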
Speed-ups have so far been more modest for software engineering. One study found a 56% speed-up, but another found a 19% slowdown. Anthropic researchers and engineers self-report a 50% productivity improvement with Claude Code, 2-3x more than the improvement recorded in the previous year. Extrapolating optimistically, we might hope for a 20x speed-up within three years.
Absent a major, sustained effort to accelerate formal verification with AI, speed-ups on the order of 3-10x over the next few years seem most plausible, but are far from guaranteed. Non-moonshot efforts to verify existing codebases should likely focus on those well below 500 000 lines of code.
Concrete projects
To get well below the 500 000-line threshold, formal verification efforts could potentially focus on smaller, critical portions of codebases. Full operating systems (Linux has ~40 million lines) and web browsers (Chrome has ~50 million lines) would be ruled out. Efforts could also focus on verifying the design of the system rather than the full implementation in code, as AWS engineers did.
Another option is to focus on widely-used, open-source libraries with fewer than 500 000 lines of code. Some potential candidates include:
Firecracker (50 000 lines): AWS’s Rust-based virtual machine monitor that powers serverless workloads on Lambda and Fargate.
libcurl (149 000 lines): A client-side URL transfer library supporting HTTP, HTTPS, FTP, and many other protocols. Powers the curl command-line tool and is embedded in billions of devices.
SQLite (156 000 lines): A self-contained, serverless, embedded relational database engine, deployed in smartphones, web browsers, and applications.
KVM (150 000 lines): The Linux kernel’s built-in hypervisor module, used by major cloud providers including AWS, Google Cloud, and Oracle Cloud.
zlib (55 000 lines): A general-purpose compression library used in the Linux kernel, OpenSSH, and many devices.
Note that formally verifying such projects would still be major, sustained efforts. With a 10x AI speed-up, a codebase with 100 000 lines would still take 200 person-years. Even if you could perfectly parallelize across 40 people, it would still take 5 years.
This analysis has other caveats. Our cost estimates used only one data point and cover proof effort, not the cost of writing formal specifications. This analysis also doesn’t give us the ROI of formal verification, especially relative to other interventions like rewriting code in memory-safe languages.7 Finally, verifying a distributed, open-source project would be a significantly different undertaking than verifying a piece of software that a small team wrote, with potentially different costs (e.g. coordinating with the maintainers).
So what?
AI could reduce many barriers to formal verification, but not all. Organizational inertia will likely persist, and absent a successful, major effort to dramatically accelerate formal verification with AI, verification costs will remain prohibitive for large-scale projects. As a rough upper bound in the absence of such a moonshot, formal verification seems feasible only for projects well under 500 000 lines of code.
Given these constraints, cyber resilience efforts (e.g. funders) have two kinds of bets available:
High-risk, high-reward bets to drastically advance AI capabilities for formal verification, which could make it feasible to verify projects with more than 1 million lines of code.
Lower-risk (but still resource-intensive) bets to verify existing codebases over the next several years, such as by:
Verifying high-level designs rather than full implementations.
Using available or emerging AI tools to help verify small (< 200 000 lines of code), widely-used codebases, with a concrete plan for adoption of verified code. Potential candidates include SQLite, libcurl, Firecracker, KVM, and zlib. A key challenge is how formal verification would interact with frequent codebase changes. Lessons learned could be transferred to larger-scale formal verification efforts.
For both types of bets, and to improve software security in the long run, we will need investments in:
Developer-friendly tooling, to integrate formal verification into standard workflows.
More benchmarks for writing formal specifications and proofs, to accelerate the process of automating these tasks.
Because even the relatively lower-risk bets may not succeed, cyber resilience will also require complementary approaches that can scale to larger codebases and deliver reliably on shorter timelines, especially for organizations with minimal investments in cybersecurity.
Acknowledgements
I’m grateful to the following people for feedback (and pushback) that greatly improved this piece: Twm Stone, Evan Miyazono, John Halstead, Matthew van der Merwe, Herbie Bradley, Elias Groll, Girish Sastry, Kamilė Lukošiūtė, Aidan Homewood, Tom Reed, and Gopal Sarma. All errors remain my own.
I use the term “software” loosely to include both protocol designs and implemented code.
Technically, the C code itself could be insecure, such as by having memory safety issues. But the translation of C into assembly is guaranteed to be correct.
Technically, if we’re verifying existing code, then the best we can usually do is prove that a mathematical representation of the code—derived from a formal semantics—implements the specification. The reason is that standards for programming languages are written in natural language, which can be ambiguous or incomplete. C, for example, leaves the behavior of signed integer overflow undefined. Different compilers or runtime environments could also interpret the same code differently. Writing new code in a verification-friendly language (e.g. Dafny) does not have this problem.
To get more intuition for where the difficulty comes from, consider the problem of sorting an array. A specification for an array-sorting algorithm A might be: for every pair of indices i < j, we have sorted(A)[i] <= sorted(A)[j]. Unfortunately, an implementation that always returns [0, 0, 0] would satisfy this specification for any three-element input. We must also impose the condition that the sorted array is a permutation of the original array. The lesson here is that edge cases matter, and it can be difficult to capture everything you care about in a specification.
Smaller components of the full library have seen some adoption, such as the cryptographic library HACL*.
I use seL4 as a more conservative reference point because compared to CompCert, its codebase is smaller (10k vs 46k) and it took more person-years of effort (20 vs 6).
Formal verification would cover vulnerabilities besides memory safety issues, but rewriting code into memory-safe languages is likely cheaper.

