The Instrumentation Guide for Observability in the Age of Coding Agents

Software instrumentation used to be a craft practiced close to the code. A developer wrote a feature, understood the control flow, knew the edge cases, and then decided where to add logs, metrics, traces, dashboards, and alerts. In mature engineering teams, that often meant using OpenTelemetry directly: creating spans around meaningful operations, adding attributes, recording exceptions, propagating trace context, and emitting metrics that reflected real service behavior.
In less mature systems, it meant something more improvised: a few logs around risky code paths, maybe a counter for errors, maybe a dashboard after the first production incident. The quality of the instrumentation depended heavily on the judgment, experience, and time pressure of the developer writing the code.
Then came auto-instrumentation and eBPF. That changed the baseline. OpenTelemetry eBPF Instrumentation, also known as OBI, can provide application and protocol observability without code changes, capturing network and protocol activity and producing OpenTelemetry data. But the OpenTelemetry documentation is explicit about the limitation: eBPF-based instrumentation does not replace language-level instrumentation when teams need custom spans, application-specific attributes, business events, or in-process telemetry that cannot be derived automatically. (OpenTelemetry)
That distinction matters. eBPF can often tell you that checkout-service called payments-api, how long the call took, and whether there was an HTTP or gRPC error. It usually cannot tell you whether the failure was a customer card decline, a provider timeout, a fraud-rule rejection, a retryable rate-limit event, a business validation error, or an SLO-impacting checkout failure.
Auto-instrumentation gives you the skeleton. Human-authored instrumentation adds the meaning. Now coding agents are changing the problem again.
Tools like Claude Code and Codex are not just autocomplete systems. Claude Code is described by Anthropic as an agentic coding tool that reads a codebase, edits files, runs commands, and integrates with development tools. (Claude) Codex is described by OpenAI as a cloud-based software engineering agent that can write features, answer questions about a codebase, fix bugs, and propose pull requests for review in a sandbox preloaded with the repository. (OpenAI)
The important shift is not that agents can generate code. The shift is that agents can increasingly generate complete changes.
They can inspect a repository, infer patterns, modify multiple files, run tests, and prepare a pull request. That means developers may be less directly involved in writing the implementation. They may be less directly involved in writing the instrumentation. But they are still responsible for supporting the software.
When production breaks, the on-call engineer cannot say, “The agent wrote that code.” The customer does not care. The incident commander does not care. The service still needs to explain itself. That is the new instrumentation challenge: How do we make sure agent-written code is not only functional, but observable, explainable, debuggable, and consistent with company standards?
The old failure mode was not just “we forgot to instrument this”
The classic observability failure was simple: someone shipped code and forgot to instrument it. That still happens. But in real production systems, instrumentation failure is usually more subtle. The code may contain logs. It may emit metrics. It may produce traces. There may even be a dashboard.
And yet, during an incident, the system still does not explain what is happening. That is because bad instrumentation can be worse than missing instrumentation. Missing instrumentation creates an obvious gap. Bad instrumentation creates false confidence.
Mistake 1: high-cardinality metrics
A developer adds a metric like this:
paymentFailures.add(1, {
  user_id: user.id,
  order_id: order.id,
  path: req.path,
  error_message: err.message,
  provider_response_code: response.rawCode,
});
At review time, this looks helpful. It contains a lot of detail. It feels rich. In production, it can be disastrous.
user_id, order_id, raw paths, and dynamic exception messages can create huge numbers of unique time series. The math compounds: 10,000 users times 100 paths times 50 distinct error messages is potentially 50 million series from a single counter. That increases cost, slows queries, makes dashboards harder to use, and can even make the metric unusable during the incident it was supposed to help debug.
The better version is usually boring and bounded:
paymentFailures.add(1, {
  provider: "stripe",
  operation: "authorize",
  reason: "provider_timeout",
  retryable: "true",
});
The detailed identifiers may still belong somewhere, but usually in traces or carefully controlled structured logs, not as metric labels. Metrics should aggregate. Traces and logs can carry request-level detail, subject to privacy and retention rules.
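Where request-level identifiers do still need to be captured, a trace attribute is usually a better home than a metric label. Here is a minimal sketch using the OpenTelemetry API; the attribute names are illustrative, not a standard, and would still have to pass the privacy rules discussed later:

import { trace } from "@opentelemetry/api";

// Request-level detail rides on the span, where it does not multiply
// time series. These attribute names are hypothetical examples.
const span = trace.getActiveSpan();
span?.setAttributes({
  "order.id": order.id,
  "payment.provider_response_code": response.rawCode,
});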
Coding agents are especially likely to make this mistake if the prompt says “add useful metrics” but does not define allowed labels. Agents often optimize for apparent usefulness. More fields look better unless the company teaches the agent that cardinality is a production constraint.
Mistake 2: wrong severity
Log severity is often treated as emotional emphasis rather than operational meaning.
- A declined credit card becomes error.
- A malformed user request becomes warn.
- A retry that succeeds becomes error.
- A background job that skips an already-processed record becomes warn.
- A noisy loop logs info thousands of times per minute.
This creates alert fatigue and confusion. Expected business outcomes look like system failures. Real failures are buried inside noise. The on-call engineer sees a wall of red and has to reverse-engineer which messages actually matter.
Severity should be defined by the company, not improvised in each file. A useful severity model might be:
- debug: temporary or local diagnostic detail
- info: meaningful lifecycle or business event, not a problem
- warn: unexpected or degraded behavior that was handled
- error: failed operation that affects correctness, user experience, or an SLO
- fatal: process or service cannot continue safely
With that model, a payment decline is probably info, not error. It is a valid business outcome. A payment provider timeout may be warn if the system retries and recovers, or error if the checkout operation fails. A database write failure during checkout is probably error. A process unable to start because configuration is invalid may be fatal.
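One way to make that model mechanical instead of improvised is a small helper that maps classified outcomes to severities. A sketch, assuming reason strings from a failure taxonomy like the one defined later in this piece:

type Severity = "debug" | "info" | "warn" | "error" | "fatal";

// Hypothetical mapping from bounded failure reasons to severity.
// "recovered" distinguishes a handled degradation from a failed operation.
function severityFor(reason: string, recovered: boolean): Severity {
  switch (reason) {
    case "approved":
    case "declined": // expected business outcomes
      return "info";
    case "provider_timeout":
    case "provider_rate_limited":
      return recovered ? "warn" : "error";
    default:
      return "error";
  }
}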
Without a standard, an agent will infer severity from nearby code. If nearby code is inconsistent, the agent will reproduce the inconsistency.
Mistake 3: spammy logs with little value
Many systems are full of logs like this:
logger.info("starting payment");
logger.info("calling provider");
logger.info("provider returned");
logger.info("processing payment response");
logger.info("finished payment");
This is noise disguised as instrumentation. These messages duplicate what traces already show. They do not explain the business outcome. They do not classify failure. They do not tell the responder what changed, what degraded, or what action to take.
A better log line is structured, sparse, and meaningful:
logger.warn("payment_authorization_failed", {
  operation: "checkout.payment.authorize",
  provider: "stripe",
  outcome: "failed",
  reason: "provider_timeout",
  retryable: true,
  trace_id: traceId,
});
This log line is useful because it has a stable event name, an operation, a provider, a bounded reason, retryability, and a trace ID. It can be searched, aggregated, linked to traces, and used during an incident.
The lesson is not “log more.” It is “log what changes the operator’s understanding.”
Mistake 4: misleading log messages
A misleading log is worse than a missing one.
Consider this:
logger.info("order_created", { orderId });
await publishOrderCreatedEvent(order);
If the database transaction has not committed yet, or if publishing the event fails afterward, the log may claim that an order was created when the workflow did not actually complete.
Or this:
logger.error("user_not_found", { userId });
But the actual cause was an expired session token, a tenant mismatch, or an upstream identity service timeout. The log points the responder in the wrong direction.
Coding agents can easily create misleading logs because they often summarize the local code path rather than the durable business outcome. A human who knows the domain might realize that “order created” is only true after commit and event publication. An agent needs that rule written down.
Good instrumentation should describe the operational truth, not the developer’s local guess.
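A safer version defers the claim until it is true and splits the workflow into events that each describe exactly what happened. A minimal sketch, assuming a hypothetical transactional db helper alongside the publisher from the example above:

// Only claim "order_created" after the write has durably committed.
await db.transaction(async (tx) => {
  await tx.orders.insert(order);
});
logger.info("order_created", { orderId });

// Publishing is a separate outcome and gets its own event.
try {
  await publishOrderCreatedEvent(order);
  logger.info("order_event_published", { orderId });
} catch (err) {
  logger.warn("order_event_publish_failed", { orderId, retryable: true });
}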
Mistake 5: missing correlation
Sometimes every signal exists, but none of them connect.
A dashboard shows an error spike. Logs show failures. Traces show slow spans. But the logs do not include trace IDs. The metric labels do not match the trace attributes. The dashboard cannot pivot to example traces. The runbook does not mention which log event to search for.
The result is fragmented observability.
Good instrumentation creates a path:
alert -> dashboard -> metric dimension -> trace exemplar -> span -> log event -> runbook -> mitigation
If a coding agent adds a metric but not the matching span attribute, or logs an event without trace context, the operational chain breaks.
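The cheapest way to keep that chain intact is to derive the correlation fields from a single source. A sketch using the OpenTelemetry API, assuming a structured logger like the ones shown earlier:

import { trace } from "@opentelemetry/api";

// Pull trace context from the active span so logs can be joined with
// traces and metric exemplars during an incident.
function withTraceContext(fields: Record<string, unknown>) {
  const ctx = trace.getActiveSpan()?.spanContext();
  return ctx ? { ...fields, trace_id: ctx.traceId, span_id: ctx.spanId } : fields;
}

logger.warn("payment_authorization_failed", withTraceContext({
  operation: "checkout.payment.authorize",
  reason: "provider_timeout",
}));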
Mistake 6: instrumentation detached from SLOs and user journeys
A service can emit hundreds of metrics and still fail to answer the important question:
Are users able to complete checkout?
System metrics are not enough. Generic request metrics are not enough. Even traces are not enough if they do not encode the business operation.
For checkout, the supportable questions are more specific:
- Are users failing before payment or after payment?
- Are failures concentrated in one payment provider?
- Are declines rising, or are provider errors rising?
- Are retries masking a dependency problem?
- Which failure reasons burn the checkout SLO?
- Which dashboard confirms rollback worked?
- Which runbook should the incident commander open?
This is where business-aware instrumentation matters. It connects telemetry to user journeys, not just infrastructure behavior.
Coding agents change the ownership model
In the old workflow, the developer who wrote the code usually knew where the operational risks were. They might forget to instrument them, but at least the knowledge was in their head.
In the agentic workflow, that assumption weakens.
A developer may ask an agent to add a queue consumer, implement a payment retry path, migrate a data model, or refactor an API route. The agent may touch files the developer has never read deeply. It may produce a correct-looking patch that passes tests. The developer may review the diff, but not internalize every branch and failure mode.
That creates a new operational gap:
- The agent authored the code.
- The developer approved the code.
- The production team supports the code.
- But nobody explicitly authored the operational story.
This is not just a theoretical concern. A 2026 empirical study of AI-generated code in the wild analyzed 304,362 verified AI-authored commits from 6,275 repositories and found 484,606 distinct issues introduced by AI-generated changes; 24.2% of tracked AI-introduced issues were still present in the latest repository revision. (arXiv) Another 2026 paper, AIRA, focuses on “quiet” AI-induced failures: code that preserves the appearance of functionality while degrading or concealing guarantees. (arXiv)
Instrumentation is directly related to that problem. Observability is how software tells the truth about its behavior. If agent-written code catches exceptions without recording them, returns friendly errors without classifying the failure, emits generic logs, or omits metrics from critical paths, it can appear healthy while becoming harder to operate.
Survey data points in the same direction. Sonar’s 2026 developer survey reported that 72% of developers who have tried AI use it daily and that AI accounts for 42% of committed code, while 96% of developers do not fully trust AI-generated code to be functionally correct. (SonarSource) Stack Overflow’s 2025 survey found that more developers actively distrust AI-tool accuracy than trust it, with only a small fraction reporting high trust. (Stack Overflow)
The observability implication is clear: as agents generate more code, teams need stronger ways to verify not only correctness, but supportability.
Why coding agents do not focus on instrumentation by default
Agents are usually asked to complete product or engineering tasks:
- Add refund support.
- Create a batch import job.
- Fix the checkout timeout bug.
- Refactor this endpoint.
- Add a retry policy for failed webhook delivery.
Those prompts describe functionality. They rarely describe operability.
They usually do not say:
- Which logger should be used?
- Which log schema applies?
- Which event names are allowed?
- Which metric names should be used?
- Which labels are forbidden?
- Which attributes are standardized?
- Which span names are required?
- Which failure taxonomy should be used?
- Which outcomes are expected business outcomes?
- Which outcomes affect the SLO?
- Which dashboard should change?
- Which runbook should be updated?
Without that context, the agent has to infer.
It may inspect nearby files. It may copy patterns. It may produce plausible instrumentation. But if the repository is inconsistent, the agent will inherit the inconsistency. If one file uses logger.info("payment failed"), another uses log.error("Payment error"), and a third uses console.log, the agent has no reliable way to know which pattern is correct.
The problem is not that agents cannot write good instrumentation.
The problem is that most repositories do not define “good instrumentation” in a form agents can reliably use.
Company standards are the missing input
The biggest unlock for agent-written instrumentation is not a better one-off prompt. It is company context.
A coding agent can produce much better instrumentation when it knows the organization’s standards before it writes code. That context should include:
- The approved logger
- The approved tracing library
- The approved metrics library
- Log event naming conventions
- Metric naming conventions
- Span naming conventions
- Allowed metric labels
- Forbidden metric labels
- Known low-cardinality attributes
- Known high-cardinality attributes
- Required fields on every log
- Required attributes on every span
- PII and secret-handling rules
- Failure reason taxonomy
- Severity-level definitions
- SLOs and critical user journeys
- Dashboard ownership
- Runbook locations
- Sampling rules
- Retention rules
- Alerting rules
This is not bureaucratic overhead. It is what lets agents produce consistent, explainable telemetry.
Consistency matters because observability is cross-service by nature. If every service invents its own names, attributes, and severity rules, operators have to translate during incidents. That slows down debugging.
Explainability matters because telemetry is not just data. It is the system’s narrative about itself. If the names, attributes, and failure reasons are stable, responders can understand the story quickly.
A practical service-level observability contract could look like this:
# .observability/service.yaml
service:
  name: checkout-service
  owner: payments-platform
  tier: customer-facing
  runtime: nodejs

logger:
  package: "@company/platform-logger"
  import: "import { logger } from '@company/platform-logger'"
  message_style: "event_name"
  message_case: "snake_case"
  required_fields:
    - service
    - operation
    - outcome
    - trace_id
  forbidden_fields:
    - email
    - raw_user_id
    - payment_token
    - card_number
    - authorization_header
    - request_body
    - response_body
    - provider_raw_response
  severity:
    debug: "temporary diagnostic detail"
    info: "successful lifecycle or expected business event"
    warn: "unexpected but handled degradation"
    error: "failed operation that affects correctness, user experience, or SLO"
    fatal: "process cannot continue safely"

tracing:
  standard: opentelemetry
  span_name_case: "dot.separated.operation"
  required_for:
    - inbound_http
    - outbound_http
    - database_query
    - queue_publish
    - queue_consume
    - scheduled_job
    - payment_provider_call
  required_span_attributes:
    - service.name
    - operation.name
    - outcome
  forbidden_span_attributes:
    - payment_token
    - email
    - raw_request_body
    - raw_response_body
    - authorization_header

metrics:
  prefix: checkout
  name_case: "dot.separated"
  allowed_labels:
    - operation
    - provider
    - outcome
    - reason
    - currency
    - retryable
  forbidden_labels:
    - user_id
    - order_id
    - session_id
    - email
    - exception_message
    - raw_path
    - request_id
  required:
    - name: checkout.payment.authorization.attempts
      type: counter
      labels: [provider, currency]
    - name: checkout.payment.authorization.failures
      type: counter
      labels: [provider, reason, retryable]
    - name: checkout.payment.authorization.duration
      type: histogram
      labels: [provider, outcome]

failure_taxonomy:
  payment_authorization:
    - approved
    - declined
    - invalid_request
    - provider_timeout
    - provider_rate_limited
    - provider_5xx
    - dependency_unavailable
    - unknown_error

slos:
  - name: checkout_success_rate
    affected_operations:
      - checkout.payment.authorize
      - checkout.order.create
  - name: checkout_latency
    target_p95_ms: 300

dashboards:
  - dashboards/checkout-service.json

runbooks:
  - runbooks/payment-authorization-failures.md
This file gives an agent a source of truth.
Instead of asking the agent to “add logging,” the team can ask it to follow the service contract. Instead of letting it invent metric labels, the allowed and forbidden labels are defined. Instead of letting it guess severity, the severity model is explicit. Instead of letting it create one-off failure strings, the failure taxonomy is known.
A concrete example: agent-written code that looks instrumented but is not supportable
Imagine this prompt:
Add payment authorization to checkout. If the payment provider fails, return a friendly error.
A coding agent might produce something like this:
export async function authorizeCheckout(req: CheckoutRequest) {
  try {
    logger.info("starting payment", {
      userId: req.userId,
      orderId: req.orderId,
    });

    const response = await paymentClient.authorize({
      amount: req.total,
      currency: req.currency,
      token: req.paymentToken,
    });

    if (!response.approved) {
      logger.error("payment failed", {
        userId: req.userId,
        orderId: req.orderId,
        reason: response.reason,
      });
      return { ok: false, message: "Payment was declined." };
    }

    logger.info("payment successful", {
      userId: req.userId,
      orderId: req.orderId,
    });

    return { ok: true, authorizationId: response.id };
  } catch (err) {
    logger.error("payment failed", {
      userId: req.userId,
      orderId: req.orderId,
      error: String(err),
    });
    return {
      ok: false,
      message: "We could not process your payment. Please try again.",
    };
  }
}
This code is not uninstrumented. It has logs. But it is not supportable.
It logs userId and orderId, which may violate company policy. It uses error for a card decline, which is probably an expected business outcome. It uses the same message, "payment failed", for different failure modes. It logs raw exception text. It does not classify provider failures. It does not create a span around the provider call. It does not emit low-cardinality metrics. It does not record latency. It does not link logs to traces. It does not distinguish user declines from provider outages. It does not create telemetry tests.
Now give the agent better context:
Implement payment authorization for checkout.
Before coding:
- Read `.observability/service.yaml`.
- Follow the logger and OpenTelemetry patterns in `src/platform/telemetry`.
- Use the `payment_authorization` failure taxonomy.
- Do not log user IDs, order IDs, payment tokens, raw provider payloads, request bodies, or exception messages.
- Payment declines are expected business outcomes and must not be logged as errors.
- Provider timeouts, provider 5xx, and dependency failures affect the checkout SLO.
Definition of done:
- Add a span named `checkout.payment.authorize`.
- Emit low-cardinality metrics for attempts, failures, and duration.
- Use only approved metric labels.
- Record exceptions on unexpected provider/system failures.
- Add telemetry tests for approved, declined, provider timeout, and provider 5xx.
- Update the checkout dashboard and payment authorization runbook if new signals are introduced.
Now the agent has a much better target.
The implementation may look more like this:
import { SpanStatusCode, trace, metrics } from "@opentelemetry/api";
import { logger } from "@company/platform-logger";
import { classifyPaymentFailure } from "./paymentFailureTaxonomy";

const tracer = trace.getTracer("checkout-service");
const meter = metrics.getMeter("checkout-service");

const attempts = meter.createCounter("checkout.payment.authorization.attempts");
const failures = meter.createCounter("checkout.payment.authorization.failures");
const duration = meter.createHistogram("checkout.payment.authorization.duration");

export async function authorizeCheckout(req: CheckoutRequest) {
  return tracer.startActiveSpan("checkout.payment.authorize", async (span) => {
    const start = performance.now();

    span.setAttributes({
      "operation.name": "checkout.payment.authorize",
      "payment.provider": "stripe",
      "checkout.currency": req.currency,
    });

    attempts.add(1, {
      provider: "stripe",
      currency: req.currency,
    });

    try {
      const response = await paymentClient.authorize({
        amount: req.total,
        currency: req.currency,
        token: req.paymentToken,
      });

      if (!response.approved) {
        span.setAttributes({
          outcome: "declined",
          "payment.failure_reason": "declined",
        });
        duration.record(performance.now() - start, {
          provider: "stripe",
          outcome: "declined",
        });
        logger.info("payment_authorization_declined", {
          operation: "checkout.payment.authorize",
          provider: "stripe",
          outcome: "declined",
          reason: "declined",
          trace_id: span.spanContext().traceId,
        });
        return { ok: false, message: "Payment was declined." };
      }

      span.setAttribute("outcome", "approved");
      duration.record(performance.now() - start, {
        provider: "stripe",
        outcome: "approved",
      });

      return { ok: true, authorizationId: response.id };
    } catch (err) {
      const failure = classifyPaymentFailure(err);

      span.recordException(err as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: failure.reason,
      });
      span.setAttributes({
        outcome: "failed",
        "payment.failure_reason": failure.reason,
        "payment.failure_retryable": failure.retryable,
      });

      failures.add(1, {
        provider: "stripe",
        reason: failure.reason,
        retryable: String(failure.retryable),
      });
      duration.record(performance.now() - start, {
        provider: "stripe",
        outcome: "failed",
      });

      logger.warn("payment_authorization_failed", {
        operation: "checkout.payment.authorize",
        provider: "stripe",
        outcome: "failed",
        reason: failure.reason,
        retryable: failure.retryable,
        trace_id: span.spanContext().traceId,
      });

      return {
        ok: false,
        message: "We could not process your payment. Please try again.",
      };
    } finally {
      span.end();
    }
  });
}
This is not just “more telemetry.” It is better telemetry. The span names the business operation. The metrics are low-cardinality. The log event names are stable. The severity reflects operational meaning. The failure reason comes from a taxonomy. Sensitive fields are excluded. The signals can be correlated through the trace ID.
The system now tells a clearer story.
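The classifyPaymentFailure helper imported above does quiet but important work: it is the single place where raw errors are translated into the bounded taxonomy. A minimal sketch of what it could look like; the error shapes are hypothetical and would need to be mapped to the real payment client:

// paymentFailureTaxonomy.ts
// Reason strings must match the failure_taxonomy in .observability/service.yaml.
type PaymentFailure = { reason: string; retryable: boolean };

export function classifyPaymentFailure(err: unknown): PaymentFailure {
  if (err instanceof Error && err.name === "TimeoutError") {
    return { reason: "provider_timeout", retryable: true };
  }
  const status = (err as { status?: number }).status;
  if (status === 429) return { reason: "provider_rate_limited", retryable: true };
  if (status && status >= 500) return { reason: "provider_5xx", retryable: true };
  if (status === 400) return { reason: "invalid_request", retryable: false };
  return { reason: "unknown_error", retryable: false };
}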
The agent needs more than source code
A coding agent can inspect the repository, but source code alone is not enough.
To write best-in-class instrumentation, the agent needs operational context. That context should be available in files the agent can discover and follow.
Useful context includes:
1. Repository instructions
How this repo expects agents to work.
2. Observability contract
Service-level telemetry rules, schemas, metrics, spans, logs, and failure taxonomies.
3. Existing telemetry helpers
Logger wrappers, tracing helpers, metric factories, middleware, exporters, test utilities.
4. Company naming conventions
Log event format, metric prefix, span naming style, attribute names, severity definitions.
5. Known attributes
Approved names for tenant, region, provider, operation, outcome, reason, retryability, feature flag, and deployment environment.
6. Forbidden attributes
PII, secrets, raw payloads, user identifiers, request bodies, response bodies, tokens, and high-cardinality labels.
7. Critical user journeys
Checkout, signup, login, ingestion, search, billing, notification delivery, or whatever the business depends on.
8. SLOs
Availability, latency, freshness, correctness, durability, queue lag, delivery success, conversion rate.
9. Dashboards and runbooks
Where new signals should appear and how responders should use them.
10. Telemetry tests
Patterns for asserting spans, metrics, logs, and the absence of forbidden fields (a sketch follows this list).
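Telemetry tests are the least familiar item on that list, so here is a minimal sketch of one, using the OpenTelemetry in-memory span exporter; stubPaymentClient and declinedRequest are hypothetical test helpers:

import { trace } from "@opentelemetry/api";
import {
  BasicTracerProvider,
  InMemorySpanExporter,
  SimpleSpanProcessor,
} from "@opentelemetry/sdk-trace-base";

// Capture finished spans in memory so the test can assert on them.
// Recent SDK versions take processors in the constructor; older ones
// used provider.addSpanProcessor instead.
const exporter = new InMemorySpanExporter();
const provider = new BasicTracerProvider({
  spanProcessors: [new SimpleSpanProcessor(exporter)],
});
trace.setGlobalTracerProvider(provider);

test("declined payment is an expected business outcome", async () => {
  stubPaymentClient({ approved: false, reason: "declined" });

  await authorizeCheckout(declinedRequest);

  const span = exporter
    .getFinishedSpans()
    .find((s) => s.name === "checkout.payment.authorize");

  expect(span?.attributes["outcome"]).toBe("declined");

  // Assert the absence of forbidden fields, not just the presence of good ones.
  expect(JSON.stringify(span?.attributes)).not.toContain(declinedRequest.paymentToken);
});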
This context can live in a few places:
- AGENTS.md
- CLAUDE.md
- .codex/instructions.md
- .observability/service.yaml
- docs/observability.md
- runbooks/*.md
- dashboards/*.json
- src/platform/telemetry/*
- test/telemetry/*
The exact filenames matter less than the principle: the standards must be close to the code, versioned with the code, and readable by agents during implementation and review.
How to direct coding agents to write better instrumentation
Do not ask for instrumentation as an afterthought.
A weak prompt is:
Build the refund feature and add some logs.
A better prompt is:
Build the refund feature.
Before coding:
- Read `AGENTS.md`.
- Read `.observability/service.yaml`.
- Inspect existing telemetry helpers in `src/platform/telemetry`.
- Identify the critical user journey and SLOs affected by this change.
Instrumentation requirements:
- Add spans around new inbound handlers, outbound calls, database writes, and queue publishes.
- Use only approved span attributes and metric labels.
- Emit metrics for attempts, failures, and duration.
- Use stable failure reasons from the service taxonomy.
- Use the company logger and log schema.
- Do not log PII, secrets, raw request bodies, raw response bodies, or exception messages.
- Use `info` for expected business outcomes, `warn` for handled degradation, and `error` for failed operations that affect correctness, user experience, or SLOs.
- Add telemetry tests for success, expected business failure, and unexpected system failure.
- Update dashboards and runbooks if new signals are introduced.
In the pull request description, include:
- New spans
- New metrics
- New log events
- Failure reasons
- SLOs affected
- Dashboard and runbook changes
- Known instrumentation gaps
That prompt changes the target. The agent is no longer optimizing only for code that works. It is optimizing for code that can be supported.
For larger changes, ask the agent to plan instrumentation before implementation:
Create an implementation plan before writing code.
The plan must include:
- Functional changes
- New or changed spans
- New or changed metrics
- New or changed structured logs
- Failure taxonomy updates
- Cardinality risks
- Privacy risks
- Dashboard changes
- Runbook changes
- Telemetry tests
Do not write code until the plan includes the observability section.
This matters because agents, like humans, are more likely to produce coherent instrumentation when it is part of the design rather than patched in afterward.
What developers need to become good at
Coding agents do not eliminate developer responsibility. They change the shape of it.
Developers need to become better at defining operational contracts.
That means knowing how to describe:
- Critical user journeys
- Expected business outcomes
- Unexpected system failures
- Failure taxonomies
- SLO impact
- Metric names and labels
- Span names and attributes
- Log schemas and severity rules
- Privacy and security constraints
- Dashboard expectations
- Runbook requirements
- Telemetry tests
- CI review rules
Developers also need observability literacy. They should understand traces, spans, metrics, logs, context propagation, semantic conventions, sampling, exemplars, baggage, collector pipelines, and the difference between metric labels and trace attributes.
They need schema discipline. A company-wide telemetry schema is what lets humans and agents produce consistent signals across services.
They need cardinality judgment. Agents will often add more fields unless told not to. Developers must define which data belongs in metrics, which belongs in traces, which belongs in logs, and which should not be emitted at all.
They need severity judgment. Without shared severity rules, every service speaks a different operational language.
They need privacy judgment. Instrumentation can leak secrets, PII, access tokens, prompts, model responses, provider payloads, and customer data.
They need review automation skills. The future of code review is not just reading diffs. It is maintaining the checks, contracts, and agent instructions that prevent bad diffs from being produced in the first place.
The future: instrumentation as an agent-readable contract
The old model was:
- Developer writes code.
- Developer remembers to add instrumentation.
- Reviewer maybe notices missing logs.
- SRE discovers the real gap during an incident.
The agentic model should be:
- Company defines observability standards.
- Repository exposes those standards to coding agents.
- Agent plans instrumentation before implementation.
- Agent writes code and telemetry together.
- Telemetry tests verify the contract.
- CI checks schema, cardinality, privacy, dashboards, and runbooks (a minimal lint is sketched after this list).
- Agentic review flags semantic instrumentation gaps.
- Incidents update the contract for the next change.
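The CI step does not need to be sophisticated to be useful. A minimal contract lint, assuming the yaml package and the service contract shape shown earlier:

import { readFileSync } from "node:fs";
import { parse } from "yaml";

// Fail the build if a required metric declares a label outside the
// allowed set defined in the observability contract.
const contract = parse(readFileSync(".observability/service.yaml", "utf8"));
const allowed = new Set<string>(contract.metrics.allowed_labels);

for (const metric of contract.metrics.required) {
  for (const label of metric.labels) {
    if (!allowed.has(label)) {
      throw new Error(`metric ${metric.name} uses disallowed label "${label}"`);
    }
  }
}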
This is the opportunity.
Coding agents will increase the amount of software teams can produce. Without better instrumentation practices, they may also increase the amount of software teams cannot explain.
But with the right context, agents can do something humans often fail to do consistently: apply the same observability standards every time, across every endpoint, queue, dependency, job, and failure path.
The goal is not more logs. The goal is not adding OpenTelemetry calls everywhere.
The goal is software that tells the truth when it fails.
In the age of coding agents, instrumentation has to become part of the definition of done: not tribal knowledge, not a cleanup task, not something added after the first incident, but a first-class contract that agents can read, implement, test, and review.


