# Monitoring & Observability - LangChain in Production
Learn how to monitor, log, and observe LangChain applications for reliability, performance, and troubleshooting
## Monitoring Overview
Monitoring is essential for production LangChain systems to ensure reliability, detect issues, and optimize performance. This guide covers metrics, logging, tracing, alerting, and observability patterns.
## Key Metrics to Track
- LLM Latency: Time taken for LLM calls
- Chain Throughput: Number of requests processed per second
- Error Rate: Failed requests, exceptions
- Resource Usage: CPU, memory, GPU utilization
- Vector DB Performance: Query latency, index health
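A minimal sketch of instrumenting these metrics with `prometheus_client` (assuming the library is installed; the metric names and the `run_chain_with_metrics` wrapper are illustrative, and `chain` stands in for any LangChain runnable):

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

LLM_LATENCY = Histogram("llm_latency_seconds", "Time taken for LLM calls")
REQUESTS = Counter("chain_requests_total", "Chain requests processed")
ERRORS = Counter("chain_errors_total", "Failed chain requests")

def run_chain_with_metrics(chain, inputs):
    """Wrap a chain invocation with latency, throughput, and error metrics."""
    REQUESTS.inc()
    start = time.monotonic()
    try:
        return chain.invoke(inputs)
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LLM_LATENCY.observe(time.monotonic() - start)

# Expose metrics for a Prometheus scraper (e.g., feeding Grafana) on :8000
start_http_server(8000)
```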
## Logging Best Practices
- Use structured logging (JSON, key-value pairs)
- Log request/response, errors, and performance data
- Integrate with log aggregators (ELK, Azure Monitor, AWS CloudWatch)
```python
import json
import logging

logger = logging.getLogger("langchain")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())  # without a handler, logs may be dropped

# Structured log example: emit one JSON object per request
def log_request(request, response, latency):
    log_entry = {
        "request": request,
        "response": response,
        "latency": latency,
    }
    logger.info(json.dumps(log_entry))
```
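To capture latency for every LLM call automatically, you can hook into LangChain's callback system. A minimal sketch, assuming `langchain_core` is installed; the handler name and the run-id-keyed timing dict are illustrative choices, not a LangChain convention:

```python
import json
import logging
import time

from langchain_core.callbacks import BaseCallbackHandler

logger = logging.getLogger("langchain")

class LatencyLoggingHandler(BaseCallbackHandler):
    """Log a structured entry with latency for every LLM call."""

    def __init__(self):
        self._start_times = {}  # run_id -> start timestamp

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.monotonic()

    def on_llm_end(self, response, *, run_id, **kwargs):
        start = self._start_times.pop(run_id, None)
        latency = time.monotonic() - start if start is not None else None
        logger.info(json.dumps({
            "event": "llm_call",
            "latency_seconds": latency,
            "generations": len(response.generations),
        }))
```

Pass an instance via `callbacks=[LatencyLoggingHandler()]` when invoking a chain or model.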
## Distributed Tracing

- Use tracing tools (OpenTelemetry, Jaeger, Azure Application Insights)
- Trace LLM calls, retrieval, and chain execution
- Visualize traces for bottleneck analysis
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider; swap ConsoleSpanExporter for an OTLP
# exporter when shipping traces to Jaeger or a cloud backend
trace.set_tracer_provider(TracerProvider())
span_processor = BatchSpanProcessor(ConsoleSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)
tracer = trace.get_tracer(__name__)

# Example trace: nest retrieval and LLM spans under one chain span
with tracer.start_as_current_span("chain_execution"):
    with tracer.start_as_current_span("retrieval"):
        pass  # retrieval logic here
    with tracer.start_as_current_span("llm_call"):
        pass  # LLM call logic here
```
## Alerting & Incident Response

- Set up alerts for high error rates, latency spikes, and resource exhaustion
- Use cloud alerting (Azure Monitor Alerts, AWS CloudWatch Alarms)
- Automate incident response and escalation
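As an example of cloud alerting, here is a hedged sketch of creating a CloudWatch alarm with boto3. It assumes you already publish a custom `ErrorRate` metric under a `LangChainApp` namespace (both names are illustrative), and the SNS topic ARN is a placeholder:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="langchain-high-error-rate",
    Namespace="LangChainApp",      # custom namespace you publish to (assumption)
    MetricName="ErrorRate",        # custom metric you publish (assumption)
    Statistic="Average",
    Period=60,                     # evaluate over 1-minute windows
    EvaluationPeriods=5,           # 5 consecutive breaching periods
    Threshold=0.05,                # alert above a 5% error rate
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
)
```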
## Observability Patterns
- Health checks and readiness probes
- Real-time dashboards (Grafana, Azure Dashboards)
- Automated anomaly detection
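Managed platforms offer anomaly detection out of the box, but the core idea can be shown in a few lines. A minimal sketch using a rolling z-score over latency samples (pure Python; the window size, warm-up count, and threshold are illustrative and should be tuned to your traffic):

```python
import statistics
from collections import deque

class LatencyAnomalyDetector:
    """Flag latency samples that deviate sharply from the recent baseline."""

    def __init__(self, window=100, z_threshold=3.0):
        self.samples = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, latency_seconds):
        """Return True if this sample is anomalous vs. the recent window."""
        anomalous = False
        if len(self.samples) >= 30:  # need enough history for a stable baseline
            mean = statistics.fmean(self.samples)
            stdev = statistics.stdev(self.samples)
            if stdev > 0 and (latency_seconds - mean) / stdev > self.z_threshold:
                anomalous = True
        self.samples.append(latency_seconds)
        return anomalous
```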
## Example: FastAPI Health Check Endpoint
```python
from fastapi import FastAPI

app = FastAPI()

# Liveness probe: returns 200 as long as the process is up
@app.get("/health")
def health():
    return {"status": "ok"}
```
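A liveness probe only proves the process is running. A hedged sketch of a companion readiness probe that also verifies downstream dependencies; `check_vector_db` is a hypothetical helper standing in for a cheap ping against your vector store:

```python
from fastapi import FastAPI, Response

app = FastAPI()

def check_vector_db() -> bool:
    # Hypothetical: issue a cheap query/ping against your vector store
    return True

# Readiness probe: return 503 so the orchestrator stops routing traffic
@app.get("/ready")
def ready(response: Response):
    if check_vector_db():
        return {"status": "ready"}
    response.status_code = 503
    return {"status": "degraded", "vector_db": "unreachable"}
```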
## Next Steps

Key Monitoring Takeaways:
- Track latency, throughput, errors, and resource usage
- Use structured logging and distributed tracing
- Set up alerts and automate incident response
- Build dashboards for real-time observability
- Continuously improve monitoring coverage