Key Takeaways
- The prevailing multi-tool, strongly-typed Function Calling paradigm for AI agents is fundamentally inefficient, imposing unnecessary cognitive load on the LLM.
- A single run(command="...") tool, exposing all capabilities as a Unix-style Command Line Interface (CLI), is a vastly superior execution model.
- This CLI approach is native to LLMs, which are pre-trained on billions of lines of shell commands, making them more accurate and capable of complex, chained operations in a single tool call.
- While the CLI paradigm perfects agent execution, it requires an orchestration and memory layer like Epsilla's Semantic Graph and Agent-as-a-Service (AaaS) to provide the necessary context, governance, and long-term memory for enterprise-grade applications.
In the race to build truly autonomous AI agents, the industry has largely converged on a standard playbook: define a set of discrete, strongly-typed tools and use the LLM's function calling capabilities to orchestrate them. We've built intricate frameworks around search_web(), read_file(), send_email(), and so on. It feels like progress. But what if this entire paradigm is a premature optimization—a complex abstraction built on a flawed premise?
A recent, startling revelation from the former Backend Lead at Manus (prior to the Meta acquisition) suggests exactly that. After two years of building agents, he came to a conclusion that should give every agent developer pause: he completely abandoned the multi-tool function calling model. In its place, he adopted a radically simpler, more powerful paradigm: a single, pure-string run(command="...") tool that exposes all agent capabilities as a Unix-style command line interface.
This isn't just a minor architectural tweak. It's a fundamental rethinking of the agent-tool interface, grounded in a fifty-year-old design philosophy that, it turns out, is the native language of Large Language Models. As we build the future of agentic systems at Epsilla, this insight is not just compelling; it's a critical component of building efficient, scalable, and truly intelligent agents.
The Unix Parallel: A Half-Century of Convergent Evolution
To understand why this CLI-first approach is so powerful, we have to look back. Fifty years ago, the architects of Unix made a profound design decision: everything is a text stream. Programs don't exchange complex, structured data objects; they pipe plain text to one another. This philosophy gave rise to a suite of small, hyper-focused tools (cat, grep, wc, sort) that could be composed into infinitely complex workflows using the | operator. Success, failure, and errors were communicated through simple, standardized mechanisms like exit codes and stderr.
Half a century later, from a completely different technological starting point, LLMs arrived at an almost identical conclusion: everything is a token. LLMs don't think in JSON or binary structures; they process and generate sequences of text. Their "thoughts" are text, their "actions" are text, and the feedback they receive from the world must be converted back into text.
The convergence is stunning. The text-based system designed for human operators at a terminal—shell commands, man pages, pipes, and redirects—is not just usable by an LLM; it is its native interface. An LLM, in the context of tool use, is effectively a terminal operator with near-instantaneous speed and a memory pre-filled with nearly every shell command and CLI pattern ever published on the internet.
The core thesis of the *nix Agent is this: stop inventing new, complex tool interfaces. The perfect one has existed for fifty years, and the LLM already knows how to use it.
The Cognitive Tax of Multi-Tool Architectures
Most agent frameworks today present the LLM with a list of distinct tools: tools: [search_web, read_file, write_file, run_code, ...]. Before every action, the model must perform a high-stakes classification task: which of these N tools is the right one for this specific sub-task? What are its exact parameters? What is its JSON schema?
This imposes a significant cognitive load. The more tools you add, the harder the selection problem becomes, and the more the model's accuracy degrades. The LLM's precious context window and processing power are spent on the meta-task of "which tool should I use?" instead of the actual task of "what do I need to accomplish?"
The CLI approach elegantly sidesteps this entire problem. By providing a single tool, run(command="..."), we remove the classification burden. The model's task is no longer to choose between disparate, unrelated APIs. Instead, it becomes a task of string composition within a single, unified namespace—a task it is exceptionally good at.
Consider these examples:
- run(command="cat notes.md")
- run(command="see screenshot.png")
- run(command="memory search 'last quarter deployment issue'")
- run(command="clip sandbox bash 'python3 analyze.py'")
The LLM is still making a choice, but it's choosing between commands, not between entire API schemas. This is a fundamentally simpler and more natural operation for a text-based model.
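To make the contrast concrete, here is a minimal sketch of what the single-tool interface looks like on the wire: one function-calling schema with a single string parameter, plus a stub dispatcher. All names are illustrative assumptions, not Manus's or Epsilla's actual API.

```python
# Hypothetical sketch: the entire toolset collapses into one
# function-calling schema. The LLM's only job is to fill in `command`.
RUN_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "run",
        "description": (
            "Execute a command in the agent's Unix-style CLI. "
            "Commands may be chained with |, &&, || and ;."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The command line to execute.",
                }
            },
            "required": ["command"],
        },
    },
}


def run(command: str) -> str:
    """Dispatch the command string to the agent's executor (stubbed here).

    A real implementation would parse and route `command` to registered
    handlers; this stub only echoes it so the shape of the interface
    is visible.
    """
    return f"executed: {command}"
```

Instead of N schemas competing for the model's attention, the prompt carries exactly one, and all the variety lives inside the string the model already knows how to write.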
Why the CLI is the LLM's Mother Tongue
The most compelling argument for the CLI paradigm is found in the LLM's training data. GitHub, with its billions of lines of code, is saturated with shell commands in README files (pip install -r requirements.txt), CI/CD pipelines (make build && make test), and inline documentation. Stack Overflow solutions are littered with command-line recipes for debugging and system administration (cat /var/log/syslog | grep "Out of memory").
We don't need to teach an LLM how to use grep, curl, or wc. It already knows, probabilistically, from its training. This existing "knowledge" is a massive, untapped resource that traditional function calling ignores.
Let's contrast the two approaches with a simple task: read a log file and count the number of lines containing the word "ERROR".
Function Calling Method (3+ Tool Calls):
- read_file(path="/var/log/app.log") -> Returns the entire file content, potentially flooding the context window.
- search_text(text=<entire file content>, pattern="ERROR") -> Returns a list of matching lines.
- count_lines(text=<matching lines>) -> Returns the final count.
This is a brittle, chatty process requiring multiple round trips between the LLM and the tool executor.
CLI Method (1 Tool Call):
run(command="cat /var/log/app.log | grep ERROR | wc -l") -> Returns "42".
One call. One round trip. The composition of tools happens at the execution layer, not in the LLM's context window. This is possible because the Unix philosophy of piping stdout to stdin is a native form of composition. By implementing a simple chain parser that understands | (pipe), && (and), || (or), and ; (sequence), a single run command can encapsulate an entire, robust workflow, complete with fallbacks and conditional logic.
This isn't a special optimization; it's the default behavior of a well-designed system. The combinatorial power is immense, yet for the LLM, it's just a matter of generating a string it has seen thousands of times before.
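Such a chain parser can be surprisingly small. The toy version below registers a few in-process commands and interprets |, && , || and ; in Python. It is purely illustrative: no quoting, redirection, or subshells, and the command set is invented for the example.

```python
# Toy chain executor. Commands are plain functions of
# (args, stdin) -> (exit_code, stdout), registered in a table.
import re


def cmd_echo(args, stdin):
    return 0, " ".join(args) + "\n"


def cmd_grep(args, stdin):
    # Keep lines containing the pattern; exit 1 if nothing matched,
    # mirroring real grep's exit-code convention.
    lines = [l for l in stdin.splitlines() if args and args[0] in l]
    return (0 if lines else 1), "\n".join(lines) + ("\n" if lines else "")


def cmd_wc(args, stdin):
    # Only supports the `wc -l` form used in the article's example.
    return 0, f"{len(stdin.splitlines())}\n"


COMMANDS = {"echo": cmd_echo, "grep": cmd_grep, "wc": cmd_wc}


def run_pipeline(segment):
    """Run one `a | b | c` pipeline, feeding stdout into the next stdin."""
    code, stdin = 0, ""
    for part in segment.split("|"):
        name, *args = part.split()
        code, stdin = COMMANDS[name](args, stdin)
    return code, stdin


def run(command):
    """Interpret `;` sequences and `&&` / `||` short-circuit logic."""
    out = ""
    for seq in command.split(";"):
        if not seq.strip():
            continue
        # Split on && and ||, keeping the operators; single | stays
        # inside each chunk and is handled by run_pipeline.
        parts = re.split(r"(&&|\|\|)", seq)
        code, out = run_pipeline(parts[0].strip())
        for op, chunk in zip(parts[1::2], parts[2::2]):
            if (op == "&&" and code == 0) or (op == "||" and code != 0):
                code, out = run_pipeline(chunk.strip())
    return out
```

With this in place, run("echo hello world | grep hello | wc -l") resolves to a single round trip, and a fallback like cmd-a || cmd-b costs the LLM nothing extra to express.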
Epsilla: From Blazing-Fast Execution to Governed Orchestration
The CLI paradigm is a breakthrough for agent execution. It makes agents faster, more efficient, and more reliable at carrying out tasks. But raw execution speed is only half the battle. A flamethrower is a powerful tool, but without a trained operator who knows what to burn and what to spare, it's more dangerous than useful.
This is where the execution layer must meet the orchestration and memory layer. A CLI agent, for all its power, is still just a brilliant shell script executor. For it to function as a true business system, it needs three things the CLI itself cannot provide: memory, context, and governance.
This is the role of Epsilla. We see these hyper-efficient CLI agents as the "muscle" of the future agentic workforce. Epsilla provides the "brain" and "nervous system" that directs that muscle.
- Semantic Graph as Long-Term Memory: How does the agent know which command to run? How does it know that the relevant logs are at /srv/prod-api/logs/ and not /var/log/? How does it connect a customer support ticket to a specific user ID and a series of past failed API calls? This is the function of Epsilla's Semantic Graph. It provides the agent with a rich, interconnected model of the world—the relationships between data, users, systems, and past events. This graph informs the LLM, allowing it to generate a command that is not just syntactically correct, but semantically and contextually relevant.
- Agent-as-a-Service (AaaS) as Governance: In an enterprise setting, you can't have agents running arbitrary commands without oversight. Our AaaS platform acts as the orchestration layer. It manages fleets of these CLI agents, enforces permissions (Agent A can read logs, but Agent B can deploy code), maintains detailed audit trails of every command executed, and ensures that all actions align with business rules. It transforms a collection of powerful tools into a managed, secure, and auditable system.
- Model Context Protocol (MCP) as the Bridge: The bridge between the Semantic Graph's deep context and the CLI agent's execution prompt is the Model Context Protocol (MCP). MCP is our standard for efficiently injecting relevant information from the graph into the agent's context, ensuring that when it formulates its run(command="...") payload, it does so with the full weight of historical and relational knowledge.
The future of agentic AI isn't a choice between efficient execution and intelligent orchestration. It requires both. The Manus lead's insight perfects the execution layer. Epsilla provides the critical memory and governance layer that makes that execution meaningful. By combining the raw power of the Unix philosophy with the contextual intelligence of a semantic graph, we can finally move from building clever demos to deploying real, enterprise-grade autonomous systems.
FAQ: Agentic Interfaces and Function Calling
Isn't function calling better for handling structured data like JSON or YAML?
Not necessarily. The Unix philosophy has always included tools for structured data, like jq for JSON and yq for YAML. A command like run(command="cat data.json | jq '.user.name'") is often more concise and powerful for data extraction than forcing an LLM to navigate a complex, nested JSON schema via multiple function calls.
Does this CLI-first approach work with any Large Language Model?
Yes, in principle, it works with any capable LLM. However, its effectiveness is directly proportional to the amount of code and shell command data in the model's training set. Models with extensive training on sources like GitHub and Stack Overflow (such as GPT-5, Claude 4, and Llama 4) will demonstrate a much stronger "native" fluency with CLI patterns.
How is this different from just giving an agent access to a sandboxed shell?
While similar, the key difference is the curated and observable nature of the toolset. Instead of a raw shell with thousands of potentially dangerous binaries, the run command exposes a specific, registered set of commands (cat, grep, memory, see, etc.). This allows for fine-grained permissions, logging, and security, turning an open-ended shell into a governed, purpose-built agentic tool.
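A minimal sketch of what such a governed registry might look like: every command must be registered, each invocation is checked against per-agent permissions, and every attempt is audit-logged. The permission names, agent IDs, and command set here are hypothetical.

```python
# Hypothetical governed command registry: only registered commands run,
# each gated by a per-agent permission, with an append-only audit log.
from datetime import datetime, timezone

REGISTRY = {
    "cat": {"permission": "read_files"},
    "grep": {"permission": "read_files"},
    "deploy": {"permission": "deploy_code"},
}

AGENT_PERMISSIONS = {
    "agent-a": {"read_files"},                  # may read logs
    "agent-b": {"read_files", "deploy_code"},   # may also deploy
}

AUDIT_LOG = []


def authorize(agent_id: str, command_line: str) -> bool:
    """Return True iff the agent may run the command; log either way."""
    name = command_line.split()[0]
    entry = REGISTRY.get(name)
    allowed = (
        entry is not None
        and entry["permission"] in AGENT_PERMISSIONS.get(agent_id, set())
    )
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "command": command_line,
        "allowed": allowed,
    })
    return allowed
```

An unregistered binary simply has no entry, so it can never be authorized: the open-ended shell becomes an allowlist with an audit trail.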