By Eran Barlev - Part 1 of 3: A comprehensive guide to implementing observability with AWS and OpenTelemetry
Observability is crucial for modern cloud-native applications, providing deep insights into system performance, user experience, and business metrics. This guide will walk you through implementing a robust observability solution using Amazon Web Services (AWS) and OpenTelemetry, the industry standard for telemetry data collection.
In this first part, we'll cover the fundamentals of AWS and OpenTelemetry, then dive into the practical implementation of the AWS Distro for OpenTelemetry (ADOT) collector on Amazon EKS. You'll learn how to set up the infrastructure, configure the collector, and instrument your applications to send telemetry data to AWS services like CloudWatch and X-Ray.
What Is AWS?
Amazon Web Services (AWS) is the leading cloud computing platform, offering scalable infrastructure and services for storage, compute, networking, and more. It powers applications for startups and enterprises alike, providing the backbone for modern cloud-native development.
What Is OpenTelemetry?
OpenTelemetry (often abbreviated as OTel) is an open-source observability framework designed to standardize the collection of telemetry data such as logs, metrics, and traces. It enables developers and DevOps teams to collect, process, and export observability data from their applications to various backends.
At the heart of the OpenTelemetry ecosystem is the OTel Collector. The collector is a vendor-agnostic agent that receives telemetry data, processes it (e.g., filtering, batching, transforming), and exports it to the observability platform of your choice (e.g., Amazon CloudWatch, AWS X-Ray, Prometheus, or third-party services like Datadog).
How Do AWS and OpenTelemetry Work Together?
AWS provides a distribution of the OpenTelemetry Collector known as the AWS Distro for OpenTelemetry (ADOT). It's an AWS-supported version of the OTel Collector, preconfigured to integrate seamlessly with AWS services like:
- Amazon CloudWatch (for metrics and logs)
- AWS X-Ray (for distributed traces)
- Amazon ECS, EKS, and EC2 (for infrastructure observability)
This combination allows you to instrument your applications using OpenTelemetry SDKs and forward the telemetry data through the ADOT Collector to native AWS services, giving you deep visibility into system performance and behavior.
Getting Started Using OTel Collectors and AWS
Step 1: Choose Your Environment
Before you begin, decide where you want to run the ADOT Collector:
- Amazon EC2
- Amazon ECS (Fargate or EC2 launch type)
- Amazon EKS (Kubernetes)
For this guide, we'll use Amazon EKS as an example.
Step 2: Install the ADOT Collector on EKS
Step 2.1: Prerequisites
Before installing the ADOT collector, ensure you have:
- AWS CLI configured with appropriate permissions
- kubectl configured to access your EKS cluster
- Helm installed (version 3.x)
- An EKS cluster with the IAM OIDC provider associated (a quick verification sketch follows this list)
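If you are unsure about the OIDC provider, you can check and, if needed, associate it before continuing. A minimal sketch, assuming a cluster name and region of your own and that eksctl is available:
# Hypothetical placeholders - adjust to your environment
CLUSTER_NAME=your-eks-cluster-name
AWS_REGION=us-west-2
# Print the cluster's OIDC issuer URL (empty output means no OIDC provider yet)
aws eks describe-cluster --name $CLUSTER_NAME --region $AWS_REGION --query cluster.identity.oidc.issuer --output text
# Associate the IAM OIDC provider with the cluster (requires eksctl)
eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --region $AWS_REGION --approve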
Step 2.2: Create IAM Role for ADOT Collector
Create an IAM role that the ADOT collector will use to send data to CloudWatch and X-Ray:
# Create IAM policy for ADOT collector
cat <<EOF > adot-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"xray:PutTraceSegments",
"xray:PutTelemetryRecords",
"xray:GetSamplingRules",
"xray:GetSamplingTargets",
"xray:GetSamplingStatisticSummaries"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:PutLogEvents",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData"
],
"Resource": "*"
}
]
}
EOF
# Create the policy
aws iam create-policy \
--policy-name ADOTCollectorPolicy \
--policy-document file://adot-policy.json
# Create IAM role and attach policy (replace the region and cluster name with your own)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-west-2                     # Region where your EKS cluster runs
CLUSTER_NAME=your-eks-cluster-name
# Look up the cluster's OIDC provider ID once so the trust policy below stays readable
OIDC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --region $AWS_REGION \
  --query cluster.identity.oidc.issuer --output text | cut -d'/' -f5)
# Trust policy that lets the collector's Kubernetes service account assume the role (IRSA)
cat <<EOF > adot-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:aws-otel:aws-otel-collector"
        }
      }
    }
  ]
}
EOF
aws iam create-role \
  --role-name ADOTCollectorRole \
  --assume-role-policy-document file://adot-trust-policy.json
aws iam attach-role-policy \
--role-name ADOTCollectorRole \
--policy-arn arn:aws:iam::$ACCOUNT_ID:policy/ADOTCollectorPolicy
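To confirm the role and policy attachment went through, a couple of read-only checks (role and policy names as defined above):
# Verify the role exists and the ADOT policy is attached
aws iam get-role --role-name ADOTCollectorRole --query Role.Arn --output text
aws iam list-attached-role-policies --role-name ADOTCollectorRole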
Step 2.3: Install the ADOT Operator
Use Helm to install the AWS Distro for OpenTelemetry (ADOT) Operator. Depending on the chart version, the operator's admission webhooks may require cert-manager to be present in the cluster first:
# Add the AWS Helm repository
helm repo add aws-observability https://aws-observability.github.io/aws-otel-helm-charts
helm repo update
# Create namespace for ADOT
kubectl create namespace aws-otel
# Install the ADOT Operator
helm install adot-operator aws-observability/adot-operator \
--namespace aws-otel \
--set serviceAccount.create=true \
--set serviceAccount.name=adot-operator
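Before moving on, it's worth confirming the operator deployed cleanly; a quick check using the release and namespace names from above:
# Confirm the Helm release installed and the operator pod is healthy
helm status adot-operator -n aws-otel
kubectl get pods -n aws-otel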
Step 2.4: Create Collector Configuration
Create a ConfigMap with the ADOT collector configuration:
cat <<EOF > adot-collector-config.yaml
apiVersion: v1
kind: ConfigMap                          # Kubernetes resource type for storing configuration data
metadata:
  name: adot-collector-config            # Name of the ConfigMap
  namespace: aws-otel                    # Kubernetes namespace where this ConfigMap will be created
data:
  config.yaml: |                         # The actual OpenTelemetry collector configuration
    receivers:                           # Define how the collector receives telemetry data
      otlp:                              # OpenTelemetry Protocol receiver (standard protocol)
        protocols:
          grpc:                          # gRPC protocol endpoint for receiving data
            endpoint: 0.0.0.0:4317       # Listen on all interfaces, port 4317 (standard OTLP gRPC port)
          http:                          # HTTP protocol endpoint for receiving data
            endpoint: 0.0.0.0:4318       # Listen on all interfaces, port 4318 (standard OTLP HTTP port)
    processors:                          # Define how to process/transform telemetry data before exporting
      batch:                             # Batch processor groups multiple telemetry items together
        timeout: 1s                      # Maximum time to wait before sending a batch
        send_batch_size: 1024            # Maximum number of items in a batch
      resource:                          # Resource processor adds metadata to telemetry data
        attributes:
          - key: service.name            # Add service name attribute to all telemetry
            value: "my-service"          # Value for the service name
            action: upsert               # Create if it doesn't exist, update if it does
          - key: service.namespace       # Add namespace attribute
            value: "production"          # Environment/namespace value
            action: upsert
    exporters:                           # Define where to send the processed telemetry data
      awsemf:                            # CloudWatch EMF exporter for metrics
        region: us-west-2                # AWS region where CloudWatch is located
        namespace: "my-service"          # CloudWatch metric namespace
        log_group_name: "/aws/eks/my-cluster/application"  # Log group backing the EMF metrics
      awscloudwatchlogs:                 # CloudWatch Logs exporter for logs
        region: us-west-2                # AWS region where CloudWatch Logs is located
        log_group_name: "/aws/eks/my-cluster/application"  # CloudWatch log group name
        log_stream_name: "{PodName}"     # Dynamic log stream naming (pattern support depends on exporter version)
        endpoint: "https://logs.us-west-2.amazonaws.com"   # Optional CloudWatch Logs endpoint override
      awsxray:                           # AWS X-Ray exporter for distributed tracing
        region: us-west-2                # AWS region where X-Ray is located
        index_all_attributes: true       # Index trace attributes for filtering in the X-Ray console
    service:                             # Define which telemetry types to collect and how to process them
      pipelines:                         # Processing pipelines for different telemetry types
        traces:                          # Pipeline for distributed traces
          receivers: [otlp]              # Receive traces via OTLP protocol
          processors: [batch, resource]  # Process with batching and resource attribution
          exporters: [awsxray]           # Send traces to AWS X-Ray
        metrics:                         # Pipeline for metrics
          receivers: [otlp]              # Receive metrics via OTLP protocol
          processors: [batch, resource]  # Process with batching and resource attribution
          exporters: [awsemf]            # Send metrics to CloudWatch (EMF format)
        logs:                            # Pipeline for logs
          receivers: [otlp]              # Receive logs via OTLP protocol
          processors: [batch, resource]  # Process with batching and resource attribution
          exporters: [awscloudwatchlogs] # Send logs to CloudWatch Logs
EOF
kubectl apply -f adot-collector-config.yaml # Apply the ConfigMap to the cluster
Step 2.5: Deploy the ADOT Collector
First create the service account the collector pods will run as, then create the OpenTelemetryCollector resource. The collector configuration is inlined in the custom resource below, so the standalone ConfigMap from Step 2.4 serves as a reference you can reuse in other deployment modes.
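The trust policy in Step 2.2 authorizes the service account aws-otel-collector in the aws-otel namespace, so that service account must exist and carry the IRSA role annotation. A minimal sketch, assuming the ACCOUNT_ID shell variable from Step 2.2 is still set:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-collector               # Must match the subject in the IAM trust policy
  namespace: aws-otel
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::${ACCOUNT_ID}:role/ADOTCollectorRole  # IAM role from Step 2.2
EOF
With the service account in place, create the collector itself: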
cat <<EOF > adot-collector-deployment.yaml
apiVersion: opentelemetry.io/v1alpha1      # Custom resource API for the OpenTelemetry Operator
kind: OpenTelemetryCollector               # Custom resource type for deploying collectors
metadata:
  name: adot-collector                     # Name of the collector instance
  namespace: aws-otel                      # Kubernetes namespace
spec:
  mode: daemonset                          # Deploy as DaemonSet (one pod per node)
  serviceAccount: aws-otel-collector       # Service account created above (carries the IRSA role annotation)
  image: amazon/aws-otel-collector:latest  # AWS-provided collector image
  config: |                                # Inline collector configuration (alternative to a ConfigMap)
    receivers:                             # Define data sources
      otlp:                                # OpenTelemetry Protocol receiver
        protocols:
          grpc:                            # gRPC endpoint for high-performance data ingestion
            endpoint: 0.0.0.0:4317         # Listen on all network interfaces
          http:                            # HTTP endpoint for web-based data ingestion
            endpoint: 0.0.0.0:4318         # Standard OTLP HTTP port
    processors:                            # Data processing pipeline
      batch:                               # Group multiple telemetry items for efficient transmission
        timeout: 1s                        # Wait up to 1 second to form a batch
        send_batch_size: 1024              # Maximum items per batch
      resource:                            # Add contextual metadata to telemetry data
        attributes:
          - key: service.name              # Service identifier
            value: "my-service"            # Your service name
            action: upsert                 # Insert or update the attribute
          - key: service.namespace         # Environment identifier
            value: "production"            # Your environment name
            action: upsert
    exporters:                             # Data destinations
      awsemf:                              # Send metrics to CloudWatch (EMF format)
        region: us-west-2                  # Target AWS region
        namespace: "my-service"            # CloudWatch metric namespace
        log_group_name: "/aws/eks/my-cluster/application"  # Log group backing the EMF metrics
      awscloudwatchlogs:                   # Send logs to CloudWatch Logs
        region: us-west-2                  # Target AWS region
        log_group_name: "/aws/eks/my-cluster/application"  # CloudWatch log group
        log_stream_name: "{PodName}"       # Dynamic log stream naming (pattern support depends on exporter version)
      awsxray:                             # Send traces to X-Ray for distributed tracing
        region: us-west-2                  # Target AWS region
        index_all_attributes: true         # Index trace attributes for filtering in the X-Ray console
    service:                               # Telemetry processing configuration
      pipelines:                           # Define processing flows for different data types
        traces:                            # Distributed tracing data flow
          receivers: [otlp]                # Source: OTLP protocol
          processors: [batch, resource]    # Processing: batching + metadata enrichment
          exporters: [awsxray]             # Destination: AWS X-Ray
        metrics:                           # Metrics data flow
          receivers: [otlp]                # Source: OTLP protocol
          processors: [batch, resource]    # Processing: batching + metadata enrichment
          exporters: [awsemf]              # Destination: CloudWatch
        logs:                              # Log data flow
          receivers: [otlp]                # Source: OTLP protocol
          processors: [batch, resource]    # Processing: batching + metadata enrichment
          exporters: [awscloudwatchlogs]   # Destination: CloudWatch Logs
EOF
kubectl apply -f adot-collector-deployment.yaml # Deploy the collector to the cluster
Step 2.6: Verify the Deployment
Check that the ADOT collector is running properly:
# Check if the operator is running
kubectl get pods -n aws-otel
# Check that the OpenTelemetryCollector resource and its DaemonSet were created
# (the operator typically names the DaemonSet <collector-name>-collector)
kubectl get opentelemetrycollectors -n aws-otel
kubectl get daemonset -n aws-otel
# Check collector logs (adjust the label selector if your operator version labels pods differently)
kubectl logs -n aws-otel -l app.kubernetes.io/name=aws-otel-collector --tail=50
Step 3: Instrument Your Application
Now that the ADOT collector is running, you need to instrument your applications to send telemetry data to it. This involves installing the OpenTelemetry SDK for your programming language and configuring it to export data via OTLP (OpenTelemetry Protocol).
Step 3.1: Choose Your Language SDK
OpenTelemetry provides SDKs for multiple programming languages. Here are examples for the most common ones:
Python Example:
# Install the required packages
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
# Basic Python instrumentation
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
# Configure resource attributes (metadata about your service)
resource = Resource.create({
"service.name": "my-python-service", # Service identifier
"service.version": "1.0.0", # Version information
"service.namespace": "production", # Environment
"deployment.environment": "prod" # Deployment environment
})
# Set up tracing
trace.set_tracer_provider(TracerProvider(resource=resource)) # Create tracer provider with resource info
otlp_trace_exporter = OTLPSpanExporter(
endpoint="http://localhost:4317", # ADOT collector endpoint (gRPC)
insecure=True # Use HTTP instead of HTTPS for local development
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(otlp_trace_exporter) # Batch spans for efficient transmission
)
# Set up metrics
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(
endpoint="http://localhost:4317", # ADOT collector endpoint
insecure=True
),
export_interval_millis=5000 # Export metrics every 5 seconds
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[metric_reader]))
# Get tracer and meter for your application
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
# Create metric instruments once at module level, not on every request
request_counter = meter.create_counter(
    name="requests_total",                             # Metric name
    description="Total number of requests processed"   # Metric description
)

# Example usage in your application
def process_request(request_data):
    with tracer.start_as_current_span("process_request") as span:  # Create a span for this operation
        span.set_attribute("request.size", len(request_data))      # Add custom attributes
        request_counter.add(1, {"endpoint": "/api/process"})       # Increment counter with labels

        # Your business logic here
        result = perform_processing(request_data)

        span.set_attribute("result.status", "success")  # Add result to span
        return result
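If you would rather not wire exporters by hand, OpenTelemetry's Python auto-instrumentation can produce the same OTLP export through environment variables. A minimal sketch, assuming your entry point is a hypothetical app.py and the collector listens on localhost:4317:
# Install the distro and OTLP exporter packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentations for the libraries detected in your environment
opentelemetry-bootstrap -a install
# Run the application with auto-instrumentation enabled
OTEL_SERVICE_NAME=my-python-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py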
Node.js Example:
# Install the required packages
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-metrics \
  @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/resources @opentelemetry/semantic-conventions
// Basic Node.js instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
// Configure resource attributes
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-nodejs-service', // Service identifier
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0', // Version information
[SemanticResourceAttributes.SERVICE_NAMESPACE]: 'production', // Environment
'deployment.environment': 'prod' // Deployment environment
});
// Initialize the SDK
const sdk = new NodeSDK({
resource: resource, // Set resource attributes
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4317', // ADOT collector endpoint
headers: {}, // Optional headers
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://localhost:4317', // ADOT collector endpoint
}),
exportIntervalMillis: 5000, // Export metrics every 5 seconds
}),
});
// Start the SDK
sdk.start();
// Example usage in your application
const { trace, metrics } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');   // Get tracer instance
const meter = metrics.getMeter('my-service');   // Get meter instance

// Create metric instruments once at module level, not on every request
const requestCounter = meter.createCounter('requests_total', {
  description: 'Total number of requests processed'
});

async function processRequest(requestData) {
  return await tracer.startActiveSpan('process_request', async (span) => {
    span.setAttribute('request.size', requestData.length);  // Add custom attributes
    requestCounter.add(1, { endpoint: '/api/process' });    // Increment counter with labels
try {
// Your business logic here
const result = await performProcessing(requestData);
span.setAttribute('result.status', 'success'); // Add result to span
return result;
} catch (error) {
span.setAttribute('result.status', 'error'); // Mark as error
span.recordException(error); // Record the exception
throw error;
} finally {
span.end(); // End the span
}
});
}
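One detail worth adding in real services: flush and shut down the SDK when the process exits so the final batches reach the collector. A minimal sketch:
// Flush pending telemetry and shut down the SDK on termination
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry SDK shut down'))
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0));
});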
Step 3.2: Configure for Production
For production deployments, update the OTLP endpoint to point to your ADOT collector:
# Production configuration (Python) - point the exporter at the collector Service inside the cluster
# (the Operator typically exposes the collector as <collector-name>-collector; adjust to match your cluster)
otlp_trace_exporter = OTLPSpanExporter(
    endpoint="http://adot-collector-collector.aws-otel.svc.cluster.local:4317",  # Kubernetes service endpoint
    insecure=True  # Plaintext gRPC inside the cluster; configure TLS on the collector if you need encryption in transit
)

// Production configuration (Node.js)
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://adot-collector-collector.aws-otel.svc.cluster.local:4317',  // Kubernetes service endpoint
  }),
  // ... other configuration
});
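Both SDKs also honor the standard OpenTelemetry environment variables, so you can keep the endpoint out of application code entirely. A sketch of the relevant container env block in a Kubernetes Deployment, assuming the Operator's default <collector-name>-collector Service naming:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT      # Standard variable read by the OTLP exporters
    value: "http://adot-collector-collector.aws-otel.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME                # Overrides service.name without code changes
    value: "my-service"
  - name: OTEL_RESOURCE_ATTRIBUTES         # Extra resource attributes as key=value pairs
    value: "service.namespace=production,deployment.environment=prod"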
Step 4: View Your Data in AWS
Once your applications are instrumented and sending data to the ADOT collector, you can view the telemetry data in various AWS services.
Step 4.1: View Distributed Traces in AWS X-Ray
Access X-Ray Console:
- Open the AWS Management Console
- Navigate to X-Ray service
- Go to the Traces section
Key X-Ray Features:
- Service Map: Visual representation of your distributed system showing service dependencies
- Trace List: View individual traces with timing and error information
- Trace Details: Drill down into specific traces to see spans and timing
- Filtering: Filter traces by service, operation, status, or time range using filter expressions (a sample expression follows this list)
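As a concrete sketch, a filter expression along these lines (service name taken from the instrumentation examples above) narrows the trace list to slow requests from a single service:
service("my-python-service") AND duration > 1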
Example X-Ray Trace View:
Service Map:
[Frontend] → [API Gateway] → [Backend Service] → [Database]
Trace Details:
├── Frontend Request (100ms)
│ ├── API Gateway Processing (50ms)
│ │ ├── Backend Service Call (30ms)
│ │ │ └── Database Query (10ms)
│ │ └── Response Processing (20ms)
└── Frontend Response (100ms)
Step 4.2: View Metrics in Amazon CloudWatch
Access CloudWatch Console:
- Open the AWS Management Console
- Navigate to CloudWatch service
- Go to Metrics section
Custom Namespaces:
- Your application metrics will appear under custom namespaces
- Look for namespaces like my-service or production (a CLI check follows this list)
- Metrics include counters, gauges, and histograms from your application
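You can confirm which metrics have arrived straight from the CLI; a quick sketch using the namespace configured on the collector's metrics exporter:
# List the metrics published to the custom namespace
aws cloudwatch list-metrics --namespace my-service --region us-west-2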
Creating CloudWatch Dashboards:
# Example: Create a dashboard for your application metrics
aws cloudwatch put-dashboard \
--dashboard-name "MyService-Dashboard" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["my-service", "requests_total", "endpoint", "/api/process"]
],
"period": 300,
"stat": "Sum",
"region": "us-west-2",
"title": "Total Requests"
}
}
]
}'
CloudWatch Alarms:
# Example: Create an alarm for high error rates
aws cloudwatch put-metric-alarm \
--alarm-name "HighErrorRate" \
--alarm-description "Alert when error rate exceeds 5%" \
--metric-name "error_rate" \
--namespace "my-service" \
--statistic Average \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2
Step 4.3: View Logs in CloudWatch Logs
Access CloudWatch Logs:
- Open the AWS Management Console
- Navigate to CloudWatch service
- Go to Log groups section
- Find your log group: /aws/eks/my-cluster/application
Log Streams:
- Each pod creates its own log stream
- Log streams are named using the {PodName} pattern
- You can filter logs by pod, time range, or search terms (or tail the log group from the CLI, as shown below)
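For a quick look without the console, AWS CLI v2 can tail the log group directly; a sketch using the log group configured on the collector:
# Stream recent log events from the application log group
aws logs tail /aws/eks/my-cluster/application --follow --region us-west-2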
CloudWatch Logs Insights Queries:
# Example: Find all error logs in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Example: Count requests by endpoint
fields @timestamp, @message
| filter @message like /request/
| stats count() by @message

# Example: Find slow requests (>1 second), assuming your logs include a duration field in milliseconds
fields @timestamp, @message, duration
| filter duration > 1000
| sort duration desc
Summary
In this first part of our comprehensive guide, we've covered the essential foundations for implementing observability with AWS and OpenTelemetry:
What we accomplished:
- Understanding the basics of AWS and OpenTelemetry integration
- Setting up the ADOT collector on Amazon EKS with proper IAM configuration
- Instrumenting applications with the OpenTelemetry SDKs for Python and Node.js
- Configuring data export to CloudWatch and X-Ray
- Viewing telemetry data in AWS services
Key takeaways:
- The AWS Distro for OpenTelemetry (ADOT) provides a production-ready collector optimized for AWS services
- Proper IAM configuration is crucial for secure data transmission
- Application instrumentation follows OpenTelemetry standards and works across multiple languages
- AWS services like CloudWatch and X-Ray provide powerful visualization and analysis capabilities
Coming soon:
- In Part 2, we'll dive deep into Best Practices for Monitoring, covering comprehensive alerting strategies, dashboard creation, and operational excellence
- In Part 3, we'll explore Best Practices for Using OpenTelemetry and AWS, including security, performance optimization, and cost management
This foundation sets you up for a robust, scalable observability platform that can grow with your applications and provide deep insights into your system's performance and user experience.