
Performance Optimization - Speed Up LangChain Applications

Learn advanced techniques to optimize the speed, efficiency, and scalability of LangChain applications for production workloads

⚑ Performance Optimization Overview

LangChain applications can be resource-intensive due to LLM calls, retrieval operations, and complex chains. This guide covers best practices for profiling, optimizing, and scaling LangChain systems.

πŸš€ Performance Optimization Pyramid

text
                    ⚑ PERFORMANCE OPTIMIZATION PYRAMID ⚑
                      (From code to infrastructure)

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                    INFRASTRUCTURE OPTIMIZATION                 β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Autoscaling         β€’ Load Balancing                    β”‚ β”‚
    β”‚  β”‚ β€’ Caching Layers      β€’ Vector DB Indexing                β”‚ β”‚
    β”‚  β”‚ β€’ Network Tuning      β€’ Hardware Acceleration             β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                 APPLICATION OPTIMIZATION                       β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Async Execution     β€’ Batching Requests                  β”‚ β”‚
    β”‚  β”‚ β€’ Memory Management   β€’ Efficient Chains                   β”‚ β”‚
    β”‚  β”‚ β€’ Prompt Engineering  β€’ Model Selection                    β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                    CODE OPTIMIZATION                           β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Profiling           β€’ Vectorization                      β”‚ β”‚
    β”‚  β”‚ β€’ Caching Results     β€’ Parallelism                        β”‚ β”‚
    β”‚  β”‚ β€’ Error Handling      β€’ Resource Cleanup                   β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ” Profiling and Benchmarking ​

🎯 Profiling LangChain Applications

python
import time
import tracemalloc
from functools import wraps

# Profiling decorator
def profile_function(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Profiling {func.__name__}...")
        tracemalloc.start()
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Execution time: {end_time - start_time:.3f}s")
        print(f"Memory usage: Current={current/1024:.1f}KB, Peak={peak/1024:.1f}KB")
        return result
    return wrapper

# Example usage
@profile_function
def run_chain_example():
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    
    # Mock LLM: a plain callable is coerced to a RunnableLambda by the | operator
    class MockLLM:
        def __call__(self, input_text):
            time.sleep(0.05)
            return "Mock response"
    
    prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
    chain = prompt | MockLLM() | StrOutputParser()
    result = chain.invoke({"topic": "Python"})
    print(f"Chain result: {result}")

run_chain_example()
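
For CPU-level hotspots, the standard library's cProfile and pstats modules can wrap a chain invocation. A minimal sketch that profiles the run_chain_example function defined above:

python
import cProfile
import pstats
import io

# Profile one invocation and report the functions with the highest cumulative time
profiler = cProfile.Profile()
profiler.enable()
run_chain_example()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # top 10 entries
print(stream.getvalue())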

πŸ§ͺ Benchmarking LLM Calls

python
import time
from langchain_openai import ChatOpenAI

# Benchmark LLM call
def benchmark_llm_call():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    prompt = "What is the capital of France?"
    start = time.time()
    response = llm.invoke(prompt)
    duration = time.time() - start
    print(f"LLM call duration: {duration:.2f}s")
    print(f"Response: {response}")

benchmark_llm_call()
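
A single call is a noisy measurement. The sketch below averages several runs and reports the spread; it assumes an OpenAI API key is configured in the environment, and the run count of 5 is arbitrary:

python
import time
import statistics
from langchain_openai import ChatOpenAI

def benchmark_llm_latency(prompt, runs=5):
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        llm.invoke(prompt)
        durations.append(time.perf_counter() - start)
    # Report mean and spread rather than a single sample
    print(f"mean={statistics.mean(durations):.2f}s "
          f"min={min(durations):.2f}s max={max(durations):.2f}s")

benchmark_llm_latency("What is the capital of France?")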

⚑ Application-Level Optimizations

πŸš€ Async Execution and Batching

python
import asyncio
from langchain_openai import ChatOpenAI

# Reuse one client across calls instead of constructing it per request
llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")

# Async LLM calls
async def async_llm_call(prompt):
    return await llm.ainvoke(prompt)

async def batch_llm_calls(prompts):
    tasks = [async_llm_call(prompt) for prompt in prompts]
    return await asyncio.gather(*tasks)

# Run the batch concurrently
prompts = [f"What is {city}?" for city in ["Paris", "London", "Berlin", "Rome"]]
results = asyncio.run(batch_llm_calls(prompts))
print([r.content for r in results])
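
LangChain runnables also expose batch and abatch helpers that fan calls out concurrently, with a concurrency cap set through the standard max_concurrency config option. A minimal sketch:

python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
prompts = [f"What is {city}?" for city in ["Paris", "London", "Berlin", "Rome"]]

# batch() runs the calls concurrently; cap in-flight requests to respect rate limits
responses = llm.batch(prompts, config={"max_concurrency": 4})
print([r.content for r in responses])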

🧠 Efficient Chains and Memory

python
from langchain.memory import ConversationTokenBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# Efficient chain with token-bounded memory
def efficient_conversation():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    # ConversationBufferMemory ignores max_token_limit; ConversationTokenBufferMemory
    # actually trims old turns once the history exceeds the limit
    memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=500)
    chain = ConversationChain(llm=llm, memory=memory)
    
    # Simulate conversation
    chain.memory.chat_memory.add_user_message("Hello!")
    chain.memory.chat_memory.add_ai_message("Hi! How can I help?")
    result = chain.invoke({"input": "Tell me about Python."})
    print(result)

efficient_conversation()
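
If strict token counting is not required, a sliding-window memory that keeps only the last k exchanges is an even cheaper option. A minimal sketch using ConversationBufferWindowMemory (k=3 is an arbitrary choice):

python
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
# Keep only the last 3 exchanges, so the prompt cannot grow without bound
memory = ConversationBufferWindowMemory(k=3)
chain = ConversationChain(llm=llm, memory=memory)
print(chain.invoke({"input": "Tell me about Python."}))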

🧠 Prompt Engineering for Speed

python
from langchain_core.prompts import ChatPromptTemplate

# Short, focused prompts mean fewer input tokens, which lowers both latency and cost
prompt = ChatPromptTemplate.from_template("Summarize: {text}")
short_text = "LangChain is a framework for building LLM applications."
result = prompt.format(text=short_text)
print(result)
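
Latency also scales with the number of generated tokens, so capping output length helps. A minimal sketch using the max_tokens parameter of ChatOpenAI (the one-sentence instruction and 100-token cap are illustrative values):

python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Cap the completion length so responses stay short and fast
llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=100)
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "LangChain is a framework for building LLM applications."}))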

πŸ—„οΈ Caching and Vectorization ​

πŸš€ Caching LLM and Retrieval Results

python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_llm_response(prompt):
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    return llm.invoke(prompt)

response = cached_llm_response("What is the capital of France?")
print(response)
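
LangChain also ships a global LLM cache that keys on the prompt and model parameters, so repeated identical calls skip the API entirely. A minimal in-memory sketch, assuming a recent langchain-core (a SQLite-backed cache from langchain_community is a drop-in alternative when persistence is needed):

python
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache
from langchain_openai import ChatOpenAI

# Enable a process-wide cache for identical (prompt, model parameters) pairs
set_llm_cache(InMemoryCache())

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
llm.invoke("What is the capital of France?")  # hits the API
llm.invoke("What is the capital of France?")  # served from the cache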

🧠 Vectorization and Efficient Retrieval

python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Efficient vector store
def efficient_vector_search():
    docs = [Document(page_content=f"Doc {i}") for i in range(100)]
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(docs, embeddings)
    results = vectorstore.similarity_search("Doc 42", k=5)
    print([doc.page_content for doc in results])

efficient_vector_search()
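
Re-embedding the corpus on every startup is often the real retrieval bottleneck. The sketch below persists the Chroma index to disk and reloads it later instead of rebuilding it; the ./chroma_index path is just an example:

python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Build the index once and persist it to disk
docs = [Document(page_content=f"Doc {i}") for i in range(100)]
Chroma.from_documents(docs, embeddings, persist_directory="./chroma_index")

# Later processes reload the prebuilt index instead of re-embedding every document
vectorstore = Chroma(persist_directory="./chroma_index", embedding_function=embeddings)
print([d.page_content for d in vectorstore.similarity_search("Doc 42", k=5)])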

πŸ—οΈ Infrastructure Optimizations ​

πŸš€ Autoscaling and Load Balancing

yaml
# Kubernetes Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langchain-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langchain-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

πŸš€ Vector DB Indexing and Hardware Acceleration

python
# Example: Pinecone managed vector DB (legacy pinecone-client v2 API; newer clients use Pinecone(api_key=...))
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("langchain-index")

# Insert and query vectors; indexing and the underlying hardware are handled by the managed service
vectors = [(f"id-{i}", [float(i)] * 128) for i in range(1000)]
index.upsert(vectors)
results = index.query(vector=[0.0] * 128, top_k=5)
print(results)

πŸ”— Next Steps

Continue with the deployment and monitoring guides.


Key Performance Takeaways:

  • Profiling and benchmarking identify bottlenecks
  • Async execution and batching speed up LLM calls
  • Efficient chains and memory reduce resource usage
  • Prompt engineering improves speed and quality
  • Caching and vectorization accelerate retrieval
  • Autoscaling and load balancing ensure reliability
  • Hardware acceleration boosts throughput
  • Continuous optimization is essential for production systems
