Key Takeaways
- The Prediction: Y Combinator's next major Request for Startups (RFS) will be for "Enterprise-Grade AI Agent Infrastructure"—the robust, stateful, and observable platforms required to run autonomous agents in production.
- The Data: Analysis of ~2,000 issues from the open-source agent framework OpenClaw reveals four macro-trends of failure: brittle channel integrations, fragmented tooling, flawed core orchestration, and unstable gateway infrastructure.
- The Gap: Open-source agent frameworks are excellent for prototyping but expose the fundamental infrastructure gaps for production workloads: state management, observability, and reliable orchestration.
- The Solution: The market needs a managed platform that abstracts away this complexity. This is precisely why we built Epsilla's Agent-as-a-Service platform, centered on a Semantic Graph for stateful context.
As founders, we're paid to see the signal in the noise. While the world is distracted by the capabilities of the next frontier models, the more critical, unsexy work of building the infrastructure to actually use them is being neglected. To get a data-driven view of this gap, my team pulled and analyzed nearly 2,000 issues from OpenClaw, a popular open-source agent framework. The results paint a clear picture of a system cracking under the weight of real-world complexity.
This isn't an indictment of OpenClaw. It's a map of the foundational problems every team building with Agentic AI will face. Before we dissect the data, let's establish a clear definition. AI Agent Infrastructure is the foundational layer of software and services—encompassing the agent runtime, state management, observability tooling, and security protocols—required to build, deploy, and operate reliable, autonomous AI agents at enterprise scale. It's the plumbing. And right now, everyone's plumbing is leaking.
The clustered issue data points directly to the next major opportunity, one I'm certain will be formalized as a YC RFS.
The Four Horsemen of Agent Instability: A Data-Driven Breakdown
We embedded and clustered the issues into four distinct macro-trends. Each cluster represents a fundamental pillar of infrastructure that is currently failing.
Cluster 1: The Last Mile is a Minefield (Channel Brittleness & State Loss)
Representative Issues:
- [Bug]: WhatsApp listener dies silently after ~2-3 min
- Message loss during WhatsApp channel restart (stale-socket reconnect)
- Webchat: messages lost during WebSocket reconnect (no client-side queue/ACK)

This cluster is a graveyard of broken connections and lost messages across WhatsApp, Telegram, and WebChat. The core problem is state management in an asynchronous world. Agents are inherently stateful, but the channels they operate on are not. An open-source framework trying to manage WebSocket state, session data, and message queues across multiple, flaky third-party APIs is a recipe for silent failure and data loss. Without an industrial-grade persistence and state management layer, your agent is flying blind.
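The "no client-side queue/ACK" complaint points at a well-known pattern: buffer outbound messages until the server acknowledges them, and replay anything still pending after a reconnect. A minimal sketch of that idea (the `AckedOutbox` class and its method names are hypothetical, not part of OpenClaw):

```python
import itertools
from collections import OrderedDict

class AckedOutbox:
    """Client-side outbox: a message stays queued until the server ACKs it,
    so a WebSocket reconnect can replay anything still in flight."""

    def __init__(self):
        self._seq = itertools.count(1)
        self._pending = OrderedDict()  # seq -> payload, insertion-ordered

    def enqueue(self, payload):
        """Queue a message; send (seq, payload) over the socket."""
        seq = next(self._seq)
        self._pending[seq] = payload
        return seq

    def ack(self, seq):
        """Server confirmed delivery; drop it from the replay buffer."""
        self._pending.pop(seq, None)

    def replay(self):
        """On reconnect, resend every un-ACKed message in original order."""
        return list(self._pending.items())
```

Without this buffer, any message sent into a stale socket simply vanishes, which is exactly the silent loss these issues describe.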
Cluster 2: The Cambrian Explosion of Tools (Feature Fragmentation)
Representative Issues:
- [Feature Request] GitHub Webhook Integration for Real-time Agent Task Triggering
- [Feature]: api.runtime.llm() — Plugin SDK inference method
- [Feature]: Pluggable Guardrail Provider Interface for tool authorization

This cluster is a wish list. Users are demanding integrations with everything from WeChat to GitHub, along with more sophisticated SDKs, security guardrails, and native capabilities like TTS and file sending. This highlights the need for a unified, extensible platform. Building one-off integrations is a dead end. The winning architecture will be a platform that provides a standardized interface—a Model Context Protocol (MCP)—for tools, data sources, and models to interoperate seamlessly, rather than a patchwork of plugins.
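The alternative to one-off plugins is a single contract every tool implements, with a registry that dispatches calls (and gives guardrails one choke point to hook). A toy illustration of that shape, not the actual MCP wire protocol:

```python
from typing import Any, Protocol

class Tool(Protocol):
    """One stable contract for every integration, instead of a
    bespoke plugin API per channel or service."""
    name: str
    def describe(self) -> dict[str, Any]: ...      # JSON-schema-style signature
    def invoke(self, args: dict[str, Any]) -> Any: ...

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def invoke(self, name: str, args: dict[str, Any]) -> Any:
        # A pluggable guardrail provider could authorize the call
        # here, before dispatch, for every tool at once.
        return self._tools[name].invoke(args)

class EchoTool:
    """Example integration satisfying the Tool protocol."""
    name = "echo"
    def describe(self) -> dict[str, Any]:
        return {"name": "echo", "args": {"text": "string"}}
    def invoke(self, args: dict[str, Any]) -> Any:
        return args["text"]
```

Because authorization, logging, and schema discovery live in the registry rather than in each plugin, adding the fiftieth integration costs the same as adding the fifth.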
Cluster 3: The Orchestration Nightmare (DevEx & Core Logic Flaws)
Representative Issues:
- [Bug]: Config section is horriby convoluted now (v2 UI)
- [Bug]: Subagent sessions_spawn resolves workspace from requester instead of target agentId
- [Bug]: model fallback doesn't work at all. also this can break long tasks from running.

These issues reveal the brutal complexity of core agent logic. We see convoluted configurations, context overflows, broken model fallbacks, and incorrect sub-agent behavior. This is the direct result of inadequate context management. When an agent's "memory" is just a simple vector store and a sliding context window, it cannot perform complex, multi-step tasks reliably. The developer experience (DevEx) suffers because developers are forced to manually manage state that the underlying system should handle automatically.
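Model fallback in particular is conceptually simple but easy to get wrong: a provider outage on the primary should degrade to the next model, not kill a long-running task. A minimal sketch of the pattern (the function and its signature are illustrative, not OpenClaw's API):

```python
def complete_with_fallback(prompt, models):
    """Try each model client in priority order; raise only if all fail.

    `models` is a list of callables, primary first. In production you'd
    catch provider-specific errors rather than bare Exception.
    """
    errors = []
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:
            # Record the failure and fall through to the next model,
            # so one provider outage never aborts the whole task.
            errors.append((getattr(model, "__name__", repr(model)), exc))
    raise RuntimeError(f"all models failed: {errors}")

def flaky_primary(prompt):
    raise TimeoutError("provider down")

def stable_fallback(prompt):
    return f"ok: {prompt}"
```

The key design choice is that failure context is accumulated and only surfaced when every option is exhausted, which is exactly what the broken-fallback issue says is missing.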
Cluster 4: The Black Box (Gateway Instability & Poor Observability)
Representative Issues:
- [Bug] openclaw logs fails with "gateway closed (1000)"
- [Bug]: gateway start crash due to OOM
- [Bug]: Intermittent gateway RPC/WebSocket failures (1000 close) break openclaw cron commands

This is the most critical failure point. The gateway—the central nervous system of the agent framework—is plagued by crashes, timeouts, and handshake failures. Worse, the diagnostic tools are failing, reporting contradictory information ("unreachable" vs. "ok"). You cannot build a production service on an infrastructure that is both unstable and impossible to observe. This is the clearest signal that a managed, highly-available, and transparent agent runtime is not a luxury; it's a prerequisite.
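"Dies silently" is the recurring theme in these gateway reports. The standard countermeasure is to make every reconnect attempt both bounded and loud: capped exponential backoff with jitter, and a structured log line per failure. A generic sketch of that pattern (not OpenClaw's actual reconnect code):

```python
import random
import time

def reconnect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0):
    """Retry a flaky gateway connection with capped exponential backoff
    plus jitter, logging each failure instead of dying silently."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            # Jittered delay: base * 2^attempt, capped, then randomized
            # so many clients don't reconnect in lockstep after an outage.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)
            print(f"gateway connect failed (attempt {attempt + 1}): "
                  f"{exc}; retrying in {delay:.2f}s")
            time.sleep(delay)
    raise ConnectionError(f"gateway unreachable after {max_attempts} attempts")
```

Backoff doesn't fix an OOM-prone gateway, but it turns "intermittent 1000 closes break cron commands" into a transient, observable blip rather than a hard failure.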
The Inevitable YC RFS: Enterprise-Grade AI Agent Infrastructure
These four clusters are not separate problems. They are symptoms of a single, massive gap in the market: the lack of true AI Agent Infrastructure. The progression from models to copilots to agents requires a leap in infrastructure, and the open-source world, for all its strengths in rapid prototyping, is not equipped to provide it.
This is the foundation of our thesis at Epsilla. We saw these problems coming.
The channel brittleness and state loss (Cluster 1) and the orchestration nightmare (Cluster 3) are symptoms of poor context management. That's why we built our platform on a Semantic Graph. It moves beyond simple vector retrieval to create a persistent, relational map of users, data, history, and tools. It provides the rich, stateful memory agents need to execute complex tasks without losing context.
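To make the contrast with plain vector retrieval concrete, here is a deliberately tiny toy of the idea: context stored as typed relationships you can traverse, rather than embeddings you can only rank by similarity. This is an illustration of the concept only, not Epsilla's actual implementation:

```python
from collections import defaultdict

class SemanticGraphToy:
    """Toy illustration: context as typed edges between entities
    (users, documents, tools), traversable by relationship."""

    def __init__(self):
        self._edges = defaultdict(list)  # node -> [(relation, node), ...]

    def relate(self, src, relation, dst):
        """Record a typed relationship, e.g. user 'opened' a ticket."""
        self._edges[src].append((relation, dst))

    def neighbors(self, node, relation=None):
        """Everything connected to `node`, optionally filtered by relation."""
        return [dst for rel, dst in self._edges[node]
                if relation is None or rel == relation]
```

The point of the toy: "which ticket did this user open?" is a one-hop traversal with an exact answer, whereas a pure vector store can only return the chunks that happen to embed nearby.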
The gateway instability and lack of observability (Cluster 4) are solved by our managed Agent-as-a-Service platform. We handle the runtime, the scaling, the security, and the monitoring, so our customers can focus on building agent capabilities, not debugging infrastructure.
Finally, the tooling fragmentation (Cluster 2) is addressed by AgentStudio and our work on a standardized Model Context Protocol (MCP). We provide a unified environment for building, testing, and deploying agents with a consistent interface for integrating tools and data, eliminating the plugin chaos.
Build on Bedrock, Not Sand
The data from OpenClaw is a warning. Founders in the Agentic AI space have a choice: spend the next 18 months painfully rebuilding this foundational infrastructure from scratch, or build on a platform designed for enterprise scale and reliability from day one.
When YC puts out the RFS for "AI Agent Infrastructure," the race will already have started. The question is whether you'll be at the starting line or still trying to patch your plumbing.
FAQ: AI Agent Infrastructure
Q1: What's the biggest mistake teams make when building AI agents today? A: They fixate on prompt engineering and model capabilities while completely underestimating the infrastructure required for reliable operation. An agent is a stateful application, not a script. It fails on brittle state management, poor observability, and unstable runtimes—problems that a better prompt can't solve.
Q2: Why can't we just use open-source frameworks like OpenClaw for production? A: They are fantastic for R&D and prototyping. However, as the data shows, they lack the enterprise-grade reliability, state management, and security guarantees required for production workloads. They expose the hard problems but don't provide the robust, managed solutions needed to solve them at scale.
Q3: How does Epsilla's Semantic Graph solve the agent context problem? A: It replaces fragile, short-term memory with a persistent, long-term understanding of relationships. Instead of just finding similar vectors, it maps how users, documents, and tool outputs are connected. This gives the agent deep, stateful context to handle complex, multi-turn tasks and reason effectively over time.

