System Brief

SG-03 / GO AGENT SYSTEMS

Go becomes useful for agent systems when the hard part is operating a bounded runtime with contracts, cancellation, policy, and durable state.

Most agent work starts in Python or TypeScript, and that is usually the right default. Those ecosystems are better suited to experiments, notebooks, fast-moving SDKs, and product integration.

Go becomes interesting later, when the agent stops being a demo and starts behaving like a service. At that point the hard problems are contracts, cancellation, policy, retries, concurrency, durable state, and knowing exactly when the loop should stop.

Take a customer support triage agent for a SaaS product. A user sends a message like “I was charged twice and your docs didn’t help.” The agent should search internal docs, inspect account state, decide whether the issue is informational or operational, open an escalation if money is involved, and stop cleanly whenever a human needs to review the next step.

This is a good example because the words are simple and the system is not. You hit planning, tool execution, policy, retries, concurrency, and state persistence almost immediately.

Why Go fits agent runtimes

Go gives you control over the parts that usually get slippery once the agent is attached to real systems: cheap concurrency for tool execution and streaming workflows, explicit interfaces for model and tool boundaries, first-class cancellation through context.Context, and a deployment story that stays boring in the good way. Static binaries are easy to ship. Service behavior tends to stay predictable. The operational parts of the system can remain legible even as the product behavior gets more complicated.

Most production failures in agent systems are not mysterious at all. A tool call takes too long. A retry path duplicates a write. A session loses state between steps. The model picks the right tool with the wrong payload. The loop keeps going because nobody decided when it should stop. Go helps because it nudges you toward visible control flow instead of hidden orchestration.

Here, that means you can see the execution path clearly. Did the agent choose search_docs first? Did lookup_billing_account time out? Did open_refund_case get blocked by policy? Did the session stop because a human approval gate was hit? Go makes it much easier to build a system that can answer those questions.

Why Go over Python or TypeScript

Python is still the best choice if the core of the product is experimentation, model iteration, evaluation notebooks, or close coupling to data science workflows. TypeScript is often the fastest choice if the agent lives mostly inside an existing web application stack and the operational demands are still light. Both are good languages.

Go becomes compelling when the center of gravity moves away from experimentation and toward service behavior. If the thing you are building looks like a planner attached to queues, timeouts, worker pools, policy checks, and durable session state, Go starts to feel unusually natural. context.Context, goroutines, simple deployment, and explicit interfaces all line up with the actual shape of the problem.

Use Python when the hard part is model work, use TypeScript when the hard part is product integration, and start looking at Go when the hard part is operating the runtime.

A running example: support triage

The support triage example is useful because it forces you past “what should the model say next.” A realistic turn might look like this:

  1. load the customer session and recent conversation
  2. search internal docs for relevant help content
  3. inspect billing history and account flags
  4. decide whether the issue can be answered, escalated, or blocked
  5. create the right follow-up artifact

Even that short list already contains several different trust levels. A docs search is cheap and safe. Reading billing history is sensitive but routine. Opening a refund or escalation case is a write with business consequences. That mix is exactly why the runtime matters more than the text generation.

Separate the model from the runtime

The first useful move is to stop thinking of “the agent” as one thing. In practice, you usually have three concerns that deserve to stay separate: the model proposes the next action, the runtime enforces budgets, policy, and persistence, and the tools perform real work at the system boundary.

Many prototypes collapse all three into a recursive loop around an LLM SDK. That feels efficient at first, but it gets brittle fast. Prompts start carrying policy logic. Tool adapters start carrying session state. Retries become coupled to model wording. Observability turns into transcript archaeology. Once that happens, the team is no longer operating a service. It is chasing emergent behavior across too many layers at once.

A better mental model is simpler. The model recommends the next step. The runtime owns the rules of execution. The tools are controlled capabilities behind that runtime.

In this system, the model might decide that the right sequence is search docs, inspect account, open escalation. The runtime is still the thing that decides whether those tools are available, how long they get to run, whether the write is authorized, and whether the session is done.

Start with an explicit runtime shape

Instead of starting with an open-ended recursive loop, start with a small state machine with visible phases. The runtime loads session state, builds the model input from the current context, asks for the next step, validates and approves any tool calls, executes the allowed work, persists the result, and then either stops or continues within a bounded step budget.

This shape gives you clear places to attach logs, retries, approval checks, latency measurement, and human review. It also gives the rest of the engineering system something stable to work with. Product can define approval states. Operations can measure failure rate by step. Security can review what categories of actions exist. None of that works well if the runtime is effectively “call the model again until something looks done.”
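One way to make those phases concrete is a small typed state machine. The sketch below is illustrative only: the Phase names and the Next function are assumptions, not part of any library, and a real runtime would carry session state alongside the phase.

```go
package main

import "fmt"

// Phase names one visible stage of a turn. The names are illustrative.
type Phase int

const (
	PhaseLoad Phase = iota
	PhasePlan
	PhaseValidate
	PhaseExecute
	PhasePersist
	PhaseDone
)

func (p Phase) String() string {
	return [...]string{"load", "plan", "validate", "execute", "persist", "done"}[p]
}

// Next returns the phase that follows p. After persisting, the runtime either
// stops (the planner said done, or the step budget is spent) or plans again.
func Next(p Phase, plannerDone bool, step, maxSteps int) Phase {
	switch p {
	case PhaseLoad:
		return PhasePlan
	case PhasePlan:
		return PhaseValidate
	case PhaseValidate:
		return PhaseExecute
	case PhaseExecute:
		return PhasePersist
	case PhasePersist:
		if plannerDone || step >= maxSteps {
			return PhaseDone
		}
		return PhasePlan
	default:
		return PhaseDone
	}
}

func main() {
	// Walk a session with a two-step budget and a planner that never says done.
	p, step := PhaseLoad, 1
	for p != PhaseDone {
		fmt.Println(step, p)
		next := Next(p, false, step, 2)
		if p == PhasePersist && next == PhasePlan {
			step++
		}
		p = next
	}
}
```

Because every transition goes through one function, each phase boundary is an obvious place to hang a log line, a metric, or an approval check.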

Here is the architecture I would want for the support triage agent before I touched prompts at all:

user message
    |
    v
[session loader] --> [planner] --> [policy gate] --> [tool executor]
    |                   |               |                 |
    |                   |               |                 +--> search_docs
    |                   |               |                 +--> lookup_billing_account
    |                   |               |                 +--> open_refund_case
    |                   |               |
    |                   |               +--> block / request human review
    |                   |
    +--> [event store] <-- [step results] <-- [worker jobs / queue]

The important thing about that diagram is not complexity. It is separability. Each part has a job, and each job is easier to observe and test than a single recursive model loop.

And once that architecture exists, you can log a real run in a way that is useful to humans instead of only to SDK internals:

[
  {
    "step": 1,
    "kind": "decision_recorded",
    "message": "Search docs and inspect billing state before taking action"
  },
  {
    "step": 1,
    "kind": "tool_result",
    "tool": "search_docs",
    "status": "ok",
    "matches": 3
  },
  {
    "step": 1,
    "kind": "tool_result",
    "tool": "lookup_billing_account",
    "status": "ok",
    "duplicate_charge_detected": true
  },
  {
    "step": 2,
    "kind": "tool_result",
    "tool": "open_refund_case",
    "status": "blocked",
    "reason": "refund_threshold_exceeded"
  },
  {
    "step": 2,
    "kind": "state_transition",
    "state": "awaiting_human_review"
  }
]

That kind of trace is what makes the runtime legible. It tells product, operations, and engineering what actually happened without making them reconstruct the story from prompt text alone.

Define a stable step contract

Before you worry too much about prompts, define the data shape of a turn. If the runtime is going to survive beyond a prototype, it needs stable objects for decisions, tool calls, results, and persisted events.

type Decision struct {
 Message   string     `json:"message"`
 ToolCalls []ToolCall `json:"tool_calls"`
 Done      bool       `json:"done"`
}

type ToolCall struct {
 ID    string          `json:"id"`
 Name  string          `json:"name"`
 Input json.RawMessage `json:"input"`
}

type ToolResult struct {
 CallID string          `json:"call_id"`
 Name   string          `json:"name"`
 Output json.RawMessage `json:"output"`
 Err    string          `json:"err,omitempty"`
}

type Event struct {
 SessionID string          `json:"session_id"`
 Step      int             `json:"step"`
 Kind      string          `json:"kind"`
 Payload   json.RawMessage `json:"payload"`
 At        time.Time       `json:"at"`
}

This is deliberately plain. It should be. If your runtime has a stable decision object, a stable tool result object, and a stable event log, you can change model vendors, prompts, or internal tool implementations without rewriting the whole service. You also make replay possible, because the runtime is no longer tightly coupled to an SDK-specific response shape.

In the triage example, that means the planner can change its wording, the billing service can change its raw response fields, and the event store can still preserve a stable operational history. That stability is what lets the system age well.
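A stable contract is easiest to keep stable if planner output is decoded strictly at the boundary. The following is a minimal sketch, assuming the planner returns JSON shaped like the Decision type above. ParseDecision is a hypothetical helper, and the rule that a decision must either be done or carry tool calls is an assumption made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// ToolCall and Decision mirror the contract types shown above.
type ToolCall struct {
	ID    string          `json:"id"`
	Name  string          `json:"name"`
	Input json.RawMessage `json:"input"`
}

type Decision struct {
	Message   string     `json:"message"`
	ToolCalls []ToolCall `json:"tool_calls"`
	Done      bool       `json:"done"`
}

// ParseDecision (hypothetical helper) decodes planner output strictly, so
// vendor-specific extra fields are rejected instead of leaking into the loop.
func ParseDecision(raw []byte) (Decision, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()

	var d Decision
	if err := dec.Decode(&d); err != nil {
		return Decision{}, fmt.Errorf("decode decision: %w", err)
	}
	// Assumed rule: a decision must either finish the session or request work.
	if !d.Done && len(d.ToolCalls) == 0 {
		return Decision{}, errors.New("decision has no tool calls and is not done")
	}
	for _, c := range d.ToolCalls {
		if c.ID == "" || c.Name == "" {
			return Decision{}, errors.New("tool call is missing id or name")
		}
	}
	return d, nil
}

func main() {
	raw := []byte(`{"message":"check billing","tool_calls":[{"id":"c1","name":"lookup_billing_account","input":{"account_id":"a-1"}}],"done":false}`)
	d, err := ParseDecision(raw)
	fmt.Println(len(d.ToolCalls), err)

	// An SDK-specific field that is not part of the contract is rejected.
	_, err = ParseDecision([]byte(`{"message":"hi","vendor_extra":true,"done":true}`))
	fmt.Println(err != nil)
}
```

Rejecting unknown fields is a deliberate trade-off: it is stricter than most SDK adapters, but it forces any vendor-specific translation to happen in one adapter rather than throughout the runtime.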

Keep tool contracts narrow

Tool interfaces should be boring too. If they are ambiguous to the runtime, they will be ambiguous to the model.

type Tool interface {
 Name() string
 Run(ctx context.Context, input json.RawMessage) (json.RawMessage, error)
}

That interface can stay small because the runtime should own the harder work around it: validate tool input before execution, normalize tool output afterward, classify failures as retryable, blocked, or terminal, and persist each transition as an event. If you let raw SDK responses or vendor payloads leak directly into the loop, you usually pay for it later in observability and debugging.

It is also worth resisting the urge to expose very granular tools. If the model sees fifteen variants of “update record,” “patch record,” “edit field,” and “apply delta,” then it has to infer policy from naming alone. That is a poor interface. Task-level operations are much easier to reason about.

For the support triage agent, I would rather expose just four tools:

  1. search_docs
  2. lookup_billing_account
  3. open_refund_case
  4. escalate_to_human

That surface is small enough to understand and specific enough to authorize. It is also much easier to explain to a reviewer than a thin wrapper over every internal support API.

I also want outcomes normalized quickly. Something like this is much easier to reason about than a pile of tool-specific payloads:

type ToolOutcome struct {
 Status  string `json:"status"` // ok | retryable_error | blocked
 Reason  string `json:"reason,omitempty"`
 Message string `json:"message,omitempty"`
 CaseID  string `json:"case_id,omitempty"`
}

var blocked = ToolOutcome{
 Status: "blocked",
 Reason: "refund_threshold_exceeded",
}

That is not a full domain model. It is enough structure for the runtime to know whether to continue, retry, or hand off.
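The continue, retry, or hand off decision can then be a plain function over the normalized outcome. This is a sketch under stated assumptions: the Action type, the NextAction function, and the choice to hand off on unrecognized statuses are illustrative, not an established convention.

```go
package main

import "fmt"

// ToolOutcome mirrors the normalized outcome shown above.
type ToolOutcome struct {
	Status  string `json:"status"` // ok | retryable_error | blocked
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
	CaseID  string `json:"case_id,omitempty"`
}

// Action is what the runtime does next. The names are hypothetical.
type Action string

const (
	ActionContinue Action = "continue"
	ActionRetry    Action = "retry"
	ActionHandOff  Action = "hand_off"
)

// NextAction maps an outcome to the runtime's next move. Unknown statuses
// hand off to a human rather than continue, as an assumed conservative default.
func NextAction(o ToolOutcome, retriesLeft int) Action {
	switch o.Status {
	case "ok":
		return ActionContinue
	case "retryable_error":
		if retriesLeft > 0 {
			return ActionRetry
		}
		return ActionHandOff
	default: // "blocked" and anything unrecognized
		return ActionHandOff
	}
}

func main() {
	blocked := ToolOutcome{Status: "blocked", Reason: "refund_threshold_exceeded"}
	fmt.Println(NextAction(blocked, 2))
	fmt.Println(NextAction(ToolOutcome{Status: "retryable_error"}, 1))
	fmt.Println(NextAction(ToolOutcome{Status: "ok"}, 0))
}
```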

Put policy in code, not only in prompts

Prompts can express intent. They should not carry enforcement. A lot of agent demos rely on instructions like “never make destructive changes without approval.” That is useful as guidance to the model, but it is not a system guarantee. The runtime should still be able to block or redirect sensitive actions before they execute. Maybe a write exceeds a dollar threshold. Maybe an account change is missing a confirmation token. Maybe a privileged action is outside the tenant scope the session is allowed to touch. Maybe the retry budget for the session has already been exhausted.

In Go, that usually becomes another explicit interface.

type Policy interface {
 Check(ctx context.Context, state SessionState, call ToolCall) error
}

Once that exists, the control path stays readable. The runtime asks the model for a decision, validates the payload, runs the policy check, executes or blocks the action, and persists the outcome. That is a much more defensible architecture than hoping the prompt phrasing is strong enough.

For example, the blocked result can be explicit and machine-readable:

{
  "tool": "open_refund_case",
  "status": "blocked",
  "reason": "refund_threshold_exceeded",
  "message": "Refunds above $250 require human approval"
}

In the running example, that might mean search_docs and lookup_billing_account run freely, but open_refund_case is blocked unless the account is verified and the amount is under a threshold. If the amount exceeds the threshold, the runtime can force the session into a blocked state and ask for human review.

Use concurrency with budgets, not optimism

Go makes parallel tool execution cheap. One of the easiest mistakes is to treat that as permission to fan out without limits.

If an agent can call search, billing, CRM, and internal data services in the same turn, the runtime needs clear boundaries around step deadlines, session-level concurrency, per-tool timeout defaults, bounded retries, and idempotency for write operations. Without those limits, “parallel” quickly turns into “difficult to operate.”

errgroup plus a semaphore is usually enough:

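// parentCtx carries the step deadline. The derived ctx is cancelled as soon
// as any call returns an error, which aborts the rest of the batch.
g, ctx := errgroup.WithContext(parentCtx)
sem := make(chan struct{}, 4) // session-level cap on concurrent tool calls

for _, call := range toolCalls {
 call := call // per-iteration copy; redundant from Go 1.22 onward
 g.Go(func() error {
  // Wait for a slot, but give up if the batch has already been cancelled.
  select {
  case sem <- struct{}{}:
  case <-ctx.Done():
   return ctx.Err()
  }
  defer func() { <-sem }()

  _, err := registry.Run(ctx, call)
  return err
 })
}

// Wait blocks until every call returns and reports the first non-nil error.
if err := g.Wait(); err != nil {
 // classify and persist failure
}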

The important detail is that the budget belongs to the runtime, not the tool. A search adapter may be perfectly happy to fan out twenty requests in parallel. Your session budget may still only allow four because downstream systems, write contention, or cost controls say otherwise.

In practice, it is sensible to run search_docs and lookup_billing_account concurrently because both are read paths. It is much less sensible to let the planner fire off multiple writes in parallel just because the runtime makes that easy.

Make every step cancellation-aware

context.Context matters here because agent systems have many legitimate reasons to stop work mid-flight. The user may cancel the task. The step timeout may expire. A sibling tool call may fail, causing the runtime to abort the batch. The service may be shutting down. All of those conditions are normal parts of operating a live system.

This only helps if tool implementations genuinely honor the context they receive. The runtime can set deadlines, but the underlying HTTP clients, database calls, and queue producers have to propagate that context all the way through. Otherwise the cancellation model looks good on paper and fails where it matters.

The control loop can stay ordinary:

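func (r *Runtime) RunSession(ctx context.Context, sessionID string) error {
 state, err := r.store.Load(ctx, sessionID)
 if err != nil {
  return err
 }

 startedAt := time.Now()

 for step := 1; step <= r.cfg.MaxSteps; step++ {
  // Session-level wall-clock budget, checked at the top of every step.
  if time.Since(startedAt) > r.cfg.MaxWallClock {
   return ErrWallClockExceeded
  }

  // Per-step deadline covering planning, persistence, and tool execution.
  // cancel is called explicitly on every exit path because a defer inside
  // the loop would not run until RunSession returns.
  stepCtx, cancel := context.WithTimeout(ctx, r.cfg.StepTimeout)

  decision, err := r.planner.Next(stepCtx, state)
  if err != nil {
   cancel()
   return err
  }

  // Store the decision itself so the step can be inspected or replayed later.
  payload, err := json.Marshal(decision)
  if err != nil {
   cancel()
   return err
  }

  if err := r.store.Append(stepCtx, Event{
   SessionID: sessionID,
   Step:      step,
   Kind:      "decision_recorded",
   Payload:   payload,
   At:        time.Now(),
  }); err != nil {
   cancel()
   return err
  }

  if decision.Done {
   cancel()
   return nil
  }

  results, err := r.executor.Run(stepCtx, state, decision.ToolCalls)
  cancel()
  if err != nil {
   return err
  }

  state = state.Apply(results)
 }

 return ErrStepLimitExceeded
}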

There is nothing exotic here. That is good. The loop is bounded, deadlines are visible, and persistence happens in the same control path as planning and execution.

Persist the loop like a workflow, not a chat transcript

A durable agent service needs a step log, not just a final answer. It should capture the session ID, step number, model request ID, each tool call's request and result, timing data, and the final state transition. When incidents happen, that event history matters more than the prompt text alone because it tells you whether the failure came from planning, tool selection, policy enforcement, latency, or state persistence.

Replay is one of the most underrated capabilities in agent infrastructure. If you can rerun a specific step with the same state and recorded or stubbed tool outputs, you can answer much sharper questions. Was the bug in planning, tool-output drift, or a timeout path? Did a new prompt change tool choice behavior? Without replay, teams mostly read transcripts and guess.

Use queues for long-running or side-effect-heavy work

Some actions should become jobs, not inline tool calls. Expensive data enrichment, bulk external synchronization, document generation, or multi-minute research workflows usually fit more naturally in a queued model.

In those cases, the agent should create work rather than wait beside it. The runtime records the requested action, validates it, enqueues a job with an idempotency key, and lets a worker perform the task. When that task completes, the worker emits an event that updates or resumes the session. Go handles this split well because the same service can expose synchronous handlers for short steps and worker processes for longer ones without changing the basic contracts.

Think of a complicated billing dispute that needs a ledger reconciliation job. The support agent should not sit in an inline request waiting for that work. It should dispatch a durable job, persist the handoff, and resume when the result is ready.
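The idempotency key is what makes that handoff safe to retry. Here is a minimal sketch, assuming the key is derived from session ID, step number, and job kind. The Queue type is an in-memory stand-in for a durable queue, and IdempotencyKey, Job, and the ledger_reconciliation kind are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Job is a unit of queued work. Key must be stable across retries.
type Job struct {
	Key       string
	SessionID string
	Kind      string
}

// IdempotencyKey derives a stable key from what identifies the request, so
// re-running the same step produces the same key. The inputs are an assumption.
func IdempotencyKey(sessionID string, step int, kind string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d|%s", sessionID, step, kind)))
	return hex.EncodeToString(sum[:])
}

// Queue is an in-memory stand-in for a durable queue that dedupes by key.
type Queue struct {
	mu   sync.Mutex
	seen map[string]struct{}
	jobs []Job
}

func NewQueue() *Queue { return &Queue{seen: make(map[string]struct{})} }

// Enqueue adds the job unless one with the same key was already accepted.
// It reports whether the job was newly enqueued.
func (q *Queue) Enqueue(j Job) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, dup := q.seen[j.Key]; dup {
		return false
	}
	q.seen[j.Key] = struct{}{}
	q.jobs = append(q.jobs, j)
	return true
}

// Len returns the number of accepted jobs.
func (q *Queue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.jobs)
}

func main() {
	q := NewQueue()
	job := Job{
		Key:       IdempotencyKey("sess-42", 3, "ledger_reconciliation"),
		SessionID: "sess-42",
		Kind:      "ledger_reconciliation",
	}
	fmt.Println(q.Enqueue(job)) // first dispatch is accepted
	fmt.Println(q.Enqueue(job)) // a retried dispatch of the same step is dropped
	fmt.Println(q.Len())
}
```

In a real deployment the dedupe check would live in the durable store, for example behind a unique constraint on the key, rather than in process memory.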

Common failure modes in Go agent runtimes

Go is a good fit for this kind of work, but it also makes some failure modes cheap to create if the team is sloppy.

One is the quiet goroutine leak. A worker starts, the session is cancelled, and some downstream call keeps running because the context was not propagated or respected. Another is unbounded fan-out. The planner produces more tool calls than expected and the runtime happily executes them because nobody put a session-level concurrency cap in place. A third is retry invisibility, where a transport helper retries under the hood and the agent layer no longer has a truthful picture of timing or duplication risk.

There is also a very Go-shaped version of “it mostly worked” where the code looks explicit but the budgets are not actually centralized. One package owns timeouts. Another owns retries. A third owns queue dispatch. Individually, each piece looks reasonable. Together, the session has no single source of truth for how much work it is allowed to do.

These are not arguments against Go. They are reminders that Go rewards explicit structure and punishes implicit structure.

Define approval and blocked states explicitly

Agents get much harder to operate when every unresolved case is treated as a generic error. It is usually better to distinguish three outcomes: success, retryable failure, and blocked pending review or approval.

The blocked state matters because many agent tasks are not actually failing. They are simply not authorized to continue alone. A refund may exceed policy threshold. A production change may require operator confirmation. A tool payload may be missing business context that only a human can supply. A step may be genuinely ambiguous and require someone to choose between multiple valid options.

Once blocked becomes a real state in the runtime, the rest of the product can respond to it intelligently. The UI can ask for review. The queue can pause work cleanly. Metrics can distinguish policy holds from operational failures instead of collapsing everything into “error.”

For a support triage agent, blocked is not an edge case. It is a primary outcome. The runtime should expect it, represent it, and make it easy for a human to take over.

Test the runtime at three levels

Teams often spend a lot of energy evaluating the model and not nearly enough testing the runtime around it. For a Go-based agent service, I usually want confidence at three levels:

  1. unit tests for tool adapters, payload validation, and policy checks
  2. integration tests for the step loop with a stubbed planner and tool registry
  3. replay or fixture tests for production-shaped sessions

The integration layer is especially valuable because it lets you verify orchestration behavior without depending on a live model call. You can check that the loop stops when Done is true, that a blocked tool never executes, that a timeout cancels sibling work, and that state persistence happens on every transition. If those behaviors are not deterministic in test, they will not be deterministic in production either.

I would absolutely want a fixture for “double charge complaint under threshold” and another for “double charge complaint over threshold.” Those are not model prompts. They are business pathways, and the runtime should make them testable as such.
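Those two fixtures can be expressed with a scripted planner and no model call at all. The sketch below uses a stripped-down stand-in for the runtime loop so that it runs on its own, and in a real project the same checks would live in go test functions against the actual runtime. scriptedPlanner, runFixture, Outcome, and doubleCharge are hypothetical names.

```go
package main

import "fmt"

// Decision is reduced to what the fixtures need.
type Decision struct {
	Tool        string
	AmountCents int64
	Done        bool
}

// scriptedPlanner replays a fixed list of decisions, one per step.
type scriptedPlanner struct {
	script []Decision
	next   int
}

func (p *scriptedPlanner) Next() Decision {
	if p.next >= len(p.script) {
		return Decision{Done: true}
	}
	d := p.script[p.next]
	p.next++
	return d
}

// Outcome is what a fixture asserts on.
type Outcome struct {
	Executed []string
	Final    string // done | awaiting_human_review | step_limit
}

// runFixture is a stand-in for the runtime loop: bounded steps, a refund
// threshold as the only policy, and no execution of blocked tools.
func runFixture(p *scriptedPlanner, thresholdCents int64, maxSteps int) Outcome {
	var out Outcome
	for step := 1; step <= maxSteps; step++ {
		d := p.Next()
		if d.Done {
			out.Final = "done"
			return out
		}
		if d.Tool == "open_refund_case" && d.AmountCents > thresholdCents {
			out.Final = "awaiting_human_review" // blocked before execution
			return out
		}
		out.Executed = append(out.Executed, d.Tool)
	}
	out.Final = "step_limit"
	return out
}

// doubleCharge scripts the business pathway for a double-charge complaint.
func doubleCharge(amountCents int64) *scriptedPlanner {
	return &scriptedPlanner{script: []Decision{
		{Tool: "search_docs"},
		{Tool: "lookup_billing_account"},
		{Tool: "open_refund_case", AmountCents: amountCents},
	}}
}

func main() {
	fmt.Printf("%+v\n", runFixture(doubleCharge(4000), 25000, 8))  // under threshold
	fmt.Printf("%+v\n", runFixture(doubleCharge(40000), 25000, 8)) // over threshold
}
```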

Define stop conditions early

Agents get weird when teams postpone the stop rules. The runtime should know its maximum step count, wall-clock budget, tool-call budget per step, blocked states that require human review, and explicit done conditions from the beginning.

Without those boundaries, an agent loop quietly turns into an unbounded workflow engine. That is usually where cost spikes and confusing failures begin. This is also where engineering honesty matters. Many tasks do not need an open-ended agent. They need one or two planning steps and a controlled execution path. If you can define a clean terminal condition, you should.

In the support triage example, “done” probably means one of four things: the user got an answer, the agent opened a tracked case, the runtime blocked for human review, or the request timed out and the user was told the system would follow up later. That is already enough. It does not need to be more open-ended than that.

Build a service, not a magic loop

Treat an agent in Go as a workflow service with model-assisted planning. Once you do that, the architecture becomes much clearer. You need an API layer that creates or resumes sessions, a planner that turns session state into the next structured decision, a runtime that enforces budgets and policy, a tool registry that executes approved operations, a store that persists events and session snapshots, and workers that handle queued or long-running tasks.

This is less romantic than the “autonomous agent” framing, but it is much closer to what survives contact with production. It is not a glamorous diagram. It is a reliable one.

If I were building this today

If I were starting the support triage agent in Go today, I would keep the first production version deliberately constrained:

  1. one planner interface with structured decision output
  2. four or five task-level tools only
  3. a central runtime config for step budgets, concurrency, and retries
  4. policy checks in code for every write path
  5. durable event logging from the first real deployment
  6. a blocked state that hands off cleanly to a human
  7. replay fixtures for the highest-risk production flows

That is enough to ship something real without pretending the runtime is smarter or more autonomous than it is.

Go works well here because it rewards explicit structure. Most agent systems need more of that than more magic.