The Agentic AI Second Eye, and What It Can't See
Originally published on our website.
This is the second piece in a three-part series on how we’ve put AI agents to work inside the firm. The first covered what the agents do for us. This one is about where they break, and what those failures reveal about us.
The verification reflex
As mentioned in part one, Natalie Hoya is our entire legal team. There is no associate down the hall to catch a mistake, no second lawyer to confirm she’s reading the right clause in the right version of the right document. So when she describes the agents as a “second eye,” she means it literally. But the phrase carries a condition: the eye only works because she’s still reading everything herself.
That instinct turns out to be universal at AVV, and nobody had to be told to develop it. For high-stakes work, the team has independently converged on the same habit: trust the agent to do the first pass, then check it as if it were done by a sharp but unproven junior analyst.
GP Eddie Thai is blunt about why the analogy holds. “I wouldn’t call it upping my game,” he says, comparing the agents to human analysts. “Even for human analysts and associates, you’d have to rigorously check where the information is coming from, how they came to the conclusions.” The difference now is speed: there’s no time gap and less risk of being misunderstood. The standards, however, haven’t changed. “I still have to keep my standard up,” he says, “and not fall into the trap of, okay, I can just rely on what AI is putting out without checking it. It’s still prone to error.”
Partner Hau Ly applies the same discipline to numbers headed for LPs. When Ava (our AI portfolio analyst) returns a figure for a report, Hau’s reflex is to push back before she believes it. “Give me the sources so I can verify,” she’ll say. Natalie frames the dependency more plainly: “The most important thing is that we have the right set of data, because otherwise Ava cannot give the data I want correctly.”
While AVV arrived at this by instinct, one of our LPs has turned it into a process. Francis Perelman, AI Program Manager at Capria Ventures, has spent the past 18 months building out the firm’s internal agents, and the discipline around them is formalized in a way ours isn’t yet.
At Capria, every agent runs against a set of written rules, and anything that would be written to a document, spreadsheet, or presentation has to be pre-approved by a person, either in Slack or by email, before it goes through. The team also runs evaluation systems that re-check an agent’s output whenever a prompt or input changes, so that regressions get caught. And there’s a standing expectation that every team member checks an agent’s output after using it, including for the agents that run on their own.
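To make the evaluation idea concrete, here is a minimal sketch of a regression check. It is our own illustration, not Capria’s system: the function name `find_regressions`, the placeholder questions and answers, and the stand-in agent are all assumptions. The idea is to keep a small set of answers a person has already approved, re-run the agent after a prompt or input change, and surface anything that differs for human review.

```python
from typing import Callable, Dict, List, Tuple

# An "agent" here is just any function from a question to an answer.
Agent = Callable[[str], str]


def find_regressions(
    golden: Dict[str, str], candidate: Agent
) -> List[Tuple[str, str, str]]:
    """Return (question, expected, actual) for every answer that changed."""
    changed = []
    for question, expected in golden.items():
        actual = candidate(question)
        if actual.strip() != expected.strip():
            changed.append((question, expected, actual))
    return changed


# Placeholder data: answers a person reviewed and approved earlier.
golden_answers = {
    "question A": "answer A",
    "question B": "answer B",
}


def agent_after_prompt_change(question: str) -> str:
    # Hypothetical stand-in for a real agent call; it drifts on one question.
    canned = {"question A": "answer A", "question B": "a different answer"}
    return canned[question]


for question, expected, actual in find_regressions(
    golden_answers, agent_after_prompt_change
):
    # Anything printed here goes to a person, not straight into a document.
    print(f"REVIEW {question!r}: expected {expected!r}, got {actual!r}")
```

Exact string comparison is the crudest possible check. A real setup would likely need a looser notion of "same answer", but the shape stays the same: the comparison flags the change and a person decides what it means.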
It’s the same reflex we’ve described, scaled into a routine, with the explicit goal of keeping the agents reliable. Seeing it codified elsewhere makes something clear: the double-checking isn’t a temporary phase on the way to trust. It’s the permanent cost of using these tools well.
This is the theme of this piece: the hardest problems we’ve run into aren’t really about the agents at all.
The mirror, part one: the agents show us our data
The clearest pattern in our first few weeks of working with Ava and Khoa (our AI generalist) is that the agents are far better at exposing our gaps than at hiding them. Their failures tend to be honest and often about our own data.
Hau ran into this while building a fund-level report. She asked Ava how many board seats and board observer positions AVV holds across Fund II, and Ava couldn’t find the answer. The instructive part is what that meant: not that the agent was incapable, but that the data wasn’t reliably there. “That points me to where a potential data gap is,” she says, and then to the harder question of how we keep that information current and accessible in the first place. As she puts it, the agents can only work with what we give them.
Partner Adrian Latortue ran into the same dynamic from the other direction. He prompted Khoa to verify the board observer and board seat rights we hold at a few companies, partly because he wasn’t sure our records were accurate. In checking, the agent flagged errors nobody had caught: internal links pointing to the wrong folders, and folders containing inaccurate material. In the end, Adrian went looking for one answer and found a structural problem in how some of our records were organized.
These are not stories about AI being smart. They’re stories about a system that tells you if something is a mess when you point it in that direction.
The mirror, part two: the agents show us ourselves
The mirror reflects more than data. It reflects habits and dependencies, too.
Adrian noted that because team members use the firm’s various databases in different ways, when an agent fails at something, the person best placed to diagnose it may assume the data simply isn’t there. In fact, the agent may just not know how to read it. He describes the trap as initially assuming “the data’s not there,” when in reality nobody may have taught the agent where to look.
This is something of a bottleneck at times, as the person who built an agent is often the only one who fully understands what it can and can’t do. For Adrian, that creates some lack of clarity around what we should or should not use the agents for. “It’s weird,” he says, “because it’s like, hey, use the agents. But by the way, don’t use the agents for this stuff.” Unless you built it, you don’t really know which side of the line you’re on.
In the case of traditional software, this uncertainty would have changed the value calculation. If a program couldn’t reliably handle simple things, it was harder to trust it for difficult tasks. Agentic AI, on the other hand, requires experimentation. If the results of a query aren’t what you expected, they can be addressed and fixed on the next run. Instead of being fully trusted out of the box, agents must be refined over time.
The costs nobody priced in
Two areas of friction will become more notable as we expand our use of agents: what all of this costs, and what it’s safe to run.
Some of it is literal token cost. Adrian recently spent time trying to download a deck through a link that wasn’t cooperating, and considered handing the problem to Khoa, until he realized that this might be an expensive way to solve a trivial task. “There are certain processes that you don’t want to run,” he says, “either because they’re inefficient, or they just cost a ton of tokens.” At the individual level, the numbers look tiny (pennies per task), but nobody has a clear view of how many times those tasks run, or how the total adds up across the firm over a month.
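The arithmetic behind that worry is simple enough to sketch. Every number below is made up for illustration; we do not yet have real figures for cost per run, runs per day, or headcount using the agents, which is exactly the gap. The function name `monthly_cost_cents` is ours.

```python
def monthly_cost_cents(
    cents_per_run: int,
    runs_per_person_per_day: int,
    people: int,
    working_days: int = 22,
) -> int:
    """Multiply a per-task cost out to a firm-wide monthly total, in cents."""
    return cents_per_run * runs_per_person_per_day * people * working_days


# Hypothetical inputs: 3 cents a task, 40 tasks a day, 10 people.
total = monthly_cost_cents(cents_per_run=3, runs_per_person_per_day=40, people=10)
print(f"${total / 100:,.2f} per month")  # prints "$264.00 per month"
```

Pennies per task become a real line item once the multipliers are filled in, and the multipliers are the part nobody is tracking yet.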
Some of it is plain inefficiency: Adrian describes running a process that got stuck for eight minutes, forcing him to start a new chat and then return to the old one to tell Khoa two or three times to stop. When a process freezes and you don’t know what that’s costing, the calculus around using it can shift.
There are also questions we simply haven’t answered yet, which we’d rather name than pretend we’ve solved. What should we deliberately not run on agents? The information an agent can reach is, by definition, available to the agent. So what’s available to whom, and what controls belong around that? How much should we actually be spending, and at what point does convenience stop being worth the cost? As Adrian notes, most companies eventually hit that wall, which is something we’d prefer to see coming.
Capria has encountered a similar safety question. As they’ve started experimenting with agents that operate more autonomously by running in the background and being reachable over Slack or WhatsApp, Francis draws a hard line at write access. Letting an agent create or modify files, he says, is something you only do once you’re “very certain that it’s working amazingly well.”
In other words, the more an agent can change on its own, the higher the bar it has to clear before being allowed to do that.
When the black box isn’t enough
For some work, the issue isn’t cost or data quality. It’s that we need to see the reasoning, and an agent’s confident answer isn’t good enough.
Adrian’s example is cap table math. Take the fully diluted ownership of a company that has done three SAFE rounds, where we invested in the second and added a follow-on check in the round after. That calculation has embedded logic and exceptions, and the stakes are high enough that a plausible-looking number isn’t sufficient. He’d rather run it in BitQuery than ask an agent, for a specific reason: he can see the calculation. “I can say, oh, this is an exception case, and I can see why it’s wrong,” he says. “Not that I need to actually fix it, but I can see the calculations. That’s really important.”
We could train Ava to do this, but the harder part, in Adrian’s words, is whether the agent could do it “consistently, reliably, for all companies in the portfolio.” Often, the answer is no, because the context it would need isn’t fully there. For a class of work where auditability matters more than fluency, the deterministic tool still wins. Knowing which class a given task belongs to is part of the skill we’re still building.
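What "seeing the calculation" buys you can be shown with a deliberately simplified sketch. This is not how BitQuery works and not our actual cap table logic. It assumes post-money SAFEs that convert at their valuation caps, where each check's stake before the priced round is the investment divided by the post-money cap, and it ignores discounts, option pool changes, pro rata rights, and dilution from the priced round itself. The function name and all dollar figures are hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple


def safe_ownership(
    checks: List[Tuple[str, int, int]]
) -> Tuple[Fraction, List[str]]:
    """Sum ownership across SAFE checks and keep every step visible.

    Each check is (label, investment, post_money_cap), in whole dollars.
    Assumes post-money SAFEs converting at the cap (a simplification).
    """
    steps = []
    total = Fraction(0)
    for label, investment, cap in checks:
        stake = Fraction(investment, cap)  # exact, no rounding drift
        total += stake
        steps.append(f"{label}: {investment:,} / {cap:,} = {float(stake):.2%}")
    steps.append(f"total before priced-round dilution: {float(total):.2%}")
    return total, steps


# Hypothetical figures: an initial check in the second SAFE round,
# then a follow-on check in the third.
total, steps = safe_ownership([
    ("second SAFE", 500_000, 10_000_000),
    ("third SAFE follow-on", 250_000, 25_000_000),
])
for line in steps:
    print(line)
```

The point is the list of steps, not the total. When a company turns out to be an exception case, a reviewer can see which line no longer matches the documents, which is the property Adrian is describing.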
An important upside of agents is their flexibility in areas where software was previously rigid. And, in circumstances where you need deterministic answers, those can be designed in.
What the friction is telling us
To be clear, none of this is an argument to pull back.
Every limitation in this piece points to something specific to build. The data gaps are a case for cleaning up our systems of record and the naming conventions around them, which Natalie and Binh have already started.
The bottleneck is a case for documenting what each agent can do so that knowledge doesn’t live in one person’s head.
The cost and security questions are a case for actual guardrails: a shared sense of what to run, what not to, and what to watch. Capria’s approach, described above, is a working example of what that can look like in practice. For us, Eddie has suggested building an adversarial agent that would take a first pass at checking another agent’s output. Even then, a human would stay in the loop, especially when important decisions are being made.
That’s the lesson underneath all of this, as Natalie puts it in four words: “Still need human work.” The agents have made us faster, and they’ve made our weaknesses easier to see. What they haven’t done, and we won’t let them do, is the judging.
That’s where the last piece in this series goes: past the work the agents do today, to what each of us wants to build next, and what the firm starts to look like when everyone has their own agents.
See you next week for part three.





