
Performance Optimization - Speed Up LangChain Applications

Learn advanced techniques to optimize the speed, efficiency, and scalability of LangChain applications for production workloads

⚑ Performance Optimization Overview

LangChain applications can be resource-intensive due to LLM calls, retrieval operations, and complex chains. This guide covers best practices for profiling, optimizing, and scaling LangChain systems.

πŸš€ Performance Optimization Pyramid

text
                    ⚑ PERFORMANCE OPTIMIZATION PYRAMID ⚑
                      (From code to infrastructure)

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                    INFRASTRUCTURE OPTIMIZATION                 β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Autoscaling         β€’ Load Balancing                    β”‚ β”‚
    β”‚  β”‚ β€’ Caching Layers      β€’ Vector DB Indexing                β”‚ β”‚
    β”‚  β”‚ β€’ Network Tuning      β€’ Hardware Acceleration             β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                 APPLICATION OPTIMIZATION                       β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Async Execution     β€’ Batching Requests                  β”‚ β”‚
    β”‚  β”‚ β€’ Memory Management   β€’ Efficient Chains                   β”‚ β”‚
    β”‚  β”‚ β€’ Prompt Engineering  β€’ Model Selection                    β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                    CODE OPTIMIZATION                           β”‚
    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
    β”‚  β”‚ β€’ Profiling           β€’ Vectorization                      β”‚ β”‚
    β”‚  β”‚ β€’ Caching Results     β€’ Parallelism                        β”‚ β”‚
    β”‚  β”‚ β€’ Error Handling      β€’ Resource Cleanup                   β”‚ β”‚
    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ” Profiling and Benchmarking ​

🎯 Profiling LangChain Applications

python
import time
import tracemalloc
from functools import wraps

# Profiling decorator
def profile_function(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Profiling {func.__name__}...")
        tracemalloc.start()
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Execution time: {end_time - start_time:.3f}s")
        print(f"Memory usage: Current={current/1024:.1f}KB, Peak={peak/1024:.1f}KB")
        return result
    return wrapper

# Example usage
@profile_function
def run_chain_example():
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    
    # Mock LLM: a plain callable is coerced to a RunnableLambda by the | operator
    class MockLLM:
        def __call__(self, input_text):
            time.sleep(0.05)
            return "Mock response"
    
    prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
    chain = prompt | MockLLM() | StrOutputParser()
    result = chain.invoke({"topic": "Python"})
    print(f"Chain result: {result}")

run_chain_example()
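
For CPU-level hotspots, the standard library's cProfile and pstats modules can wrap a chain invocation. A minimal sketch that profiles the run_chain_example function defined above:

python
import cProfile
import pstats
import io

# Profile one invocation and report the functions with the highest cumulative time
profiler = cProfile.Profile()
profiler.enable()
run_chain_example()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # top 10 entries
print(stream.getvalue())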

πŸ§ͺ Benchmarking LLM Calls

python
import time
from langchain_openai import ChatOpenAI

# Benchmark LLM call
def benchmark_llm_call():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    prompt = "What is the capital of France?"
    start = time.time()
    response = llm.invoke(prompt)
    duration = time.time() - start
    print(f"LLM call duration: {duration:.2f}s")
    print(f"Response: {response}")

benchmark_llm_call()
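
A single call is a noisy measurement. The sketch below averages several runs and reports the spread; it assumes an OpenAI API key is configured in the environment, and the run count of 5 is arbitrary:

python
import time
import statistics
from langchain_openai import ChatOpenAI

def benchmark_llm_latency(prompt, runs=5):
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        llm.invoke(prompt)
        durations.append(time.perf_counter() - start)
    # Report mean and spread rather than a single sample
    print(f"mean={statistics.mean(durations):.2f}s "
          f"min={min(durations):.2f}s max={max(durations):.2f}s")

benchmark_llm_latency("What is the capital of France?")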

⚑ Application-Level Optimizations

πŸš€ Async Execution and Batching

python
import asyncio
from langchain_openai import ChatOpenAI

# Reuse one client across calls instead of constructing it per request
llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")

# Async LLM calls
async def async_llm_call(prompt):
    return await llm.ainvoke(prompt)

async def batch_llm_calls(prompts):
    tasks = [async_llm_call(prompt) for prompt in prompts]
    return await asyncio.gather(*tasks)

# Run the batch concurrently
prompts = [f"What is {city}?" for city in ["Paris", "London", "Berlin", "Rome"]]
results = asyncio.run(batch_llm_calls(prompts))
print([r.content for r in results])
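
LangChain runnables also expose batch and abatch helpers that fan calls out concurrently, with a concurrency cap set through the standard max_concurrency config option. A minimal sketch:

python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
prompts = [f"What is {city}?" for city in ["Paris", "London", "Berlin", "Rome"]]

# batch() runs the calls concurrently; cap in-flight requests to respect rate limits
responses = llm.batch(prompts, config={"max_concurrency": 4})
print([r.content for r in responses])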

🧠 Efficient Chains and Memory

python
from langchain.memory import ConversationTokenBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

# Efficient chain with token-bounded memory
def efficient_conversation():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    # ConversationBufferMemory ignores max_token_limit; ConversationTokenBufferMemory
    # actually trims old turns once the history exceeds the limit
    memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=500)
    chain = ConversationChain(llm=llm, memory=memory)
    
    # Simulate conversation
    chain.memory.chat_memory.add_user_message("Hello!")
    chain.memory.chat_memory.add_ai_message("Hi! How can I help?")
    result = chain.invoke({"input": "Tell me about Python."})
    print(result)

efficient_conversation()
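
If strict token counting is not required, a sliding-window memory that keeps only the last k exchanges is an even cheaper option. A minimal sketch using ConversationBufferWindowMemory (k=3 is an arbitrary choice):

python
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
# Keep only the last 3 exchanges, so the prompt cannot grow without bound
memory = ConversationBufferWindowMemory(k=3)
chain = ConversationChain(llm=llm, memory=memory)
print(chain.invoke({"input": "Tell me about Python."}))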

🧠 Prompt Engineering for Speed

python
from langchain_core.prompts import ChatPromptTemplate

# Short, focused prompts mean fewer input tokens, which lowers both latency and cost
prompt = ChatPromptTemplate.from_template("Summarize: {text}")
short_text = "LangChain is a framework for building LLM applications."
result = prompt.format(text=short_text)
print(result)
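
Latency also scales with the number of generated tokens, so capping output length helps. A minimal sketch using the max_tokens parameter of ChatOpenAI (the one-sentence instruction and 100-token cap are illustrative values):

python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Cap the completion length so responses stay short and fast
llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=100)
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "LangChain is a framework for building LLM applications."}))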

πŸ—„οΈ Caching and Vectorization ​

πŸš€ Caching LLM and Retrieval Results

python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_llm_response(prompt):
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    return llm.invoke(prompt)

response = cached_llm_response("What is the capital of France?")
print(response)
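
LangChain also ships a global LLM cache that keys on the prompt and model parameters, so repeated identical calls skip the API entirely. A minimal in-memory sketch, assuming a recent langchain-core (a SQLite-backed cache from langchain_community is a drop-in alternative when persistence is needed):

python
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache
from langchain_openai import ChatOpenAI

# Enable a process-wide cache for identical (prompt, model parameters) pairs
set_llm_cache(InMemoryCache())

llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
llm.invoke("What is the capital of France?")  # hits the API
llm.invoke("What is the capital of France?")  # served from the cache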

🧠 Vectorization and Efficient Retrieval

python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Efficient vector store
def efficient_vector_search():
    docs = [Document(page_content=f"Doc {i}") for i in range(100)]
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(docs, embeddings)
    results = vectorstore.similarity_search("Doc 42", k=5)
    print([doc.page_content for doc in results])

efficient_vector_search()
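
Re-embedding the corpus on every startup is often the real retrieval bottleneck. The sketch below persists the Chroma index to disk and reloads it later instead of rebuilding it; the ./chroma_index path is just an example:

python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Build the index once and persist it to disk
docs = [Document(page_content=f"Doc {i}") for i in range(100)]
Chroma.from_documents(docs, embeddings, persist_directory="./chroma_index")

# Later processes reload the prebuilt index instead of re-embedding every document
vectorstore = Chroma(persist_directory="./chroma_index", embedding_function=embeddings)
print([d.page_content for d in vectorstore.similarity_search("Doc 42", k=5)])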

πŸ—οΈ Infrastructure Optimizations ​

πŸš€ Autoscaling and Load Balancing

yaml
# Kubernetes Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langchain-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langchain-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

πŸš€ Vector DB Indexing and Hardware Acceleration

python
# Example: Pinecone managed vector DB (legacy pinecone-client v2 API; newer clients use Pinecone(api_key=...))
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("langchain-index")

# Insert and query vectors; indexing and the underlying hardware are handled by the managed service
vectors = [(f"id-{i}", [float(i)] * 128) for i in range(1000)]
index.upsert(vectors)
results = index.query(vector=[0.0] * 128, top_k=5)
print(results)

πŸ”— Next Steps

Continue with the deployment and monitoring guides.


Key Performance Takeaways:

  • Profiling and benchmarking identify bottlenecks
  • Async execution and batching speed up LLM calls
  • Efficient chains and memory reduce resource usage
  • Prompt engineering improves speed and quality
  • Caching and vectorization accelerate retrieval
  • Autoscaling and load balancing ensure reliability
  • Hardware acceleration boosts throughput
  • Continuous optimization is essential for production systems
