Performance Optimization - Speed Up LangChain Applications
Learn advanced techniques to optimize the speed, efficiency, and scalability of LangChain applications for production workloads
Performance Optimization Overview
LangChain applications can be resource-intensive due to LLM calls, retrieval operations, and complex chains. This guide covers best practices for profiling, optimizing, and scaling LangChain systems.
Performance Optimization Pyramid
text
                PERFORMANCE OPTIMIZATION PYRAMID
                 (From code to infrastructure)
+--------------------------------------------------------------+
|                 INFRASTRUCTURE OPTIMIZATION                  |
|    - Autoscaling                - Load Balancing             |
|    - Caching Layers             - Vector DB Indexing         |
|    - Network Tuning             - Hardware Acceleration      |
+--------------------------------------------------------------+
                                |
+--------------------------------------------------------------+
|                   APPLICATION OPTIMIZATION                   |
|    - Async Execution            - Batching Requests          |
|    - Memory Management          - Efficient Chains           |
|    - Prompt Engineering         - Model Selection            |
+--------------------------------------------------------------+
                                |
+--------------------------------------------------------------+
|                      CODE OPTIMIZATION                       |
|    - Profiling                  - Vectorization              |
|    - Caching Results            - Parallelism                |
|    - Error Handling             - Resource Cleanup           |
+--------------------------------------------------------------+
Profiling and Benchmarking
Profiling LangChain Applications
python
import time
import tracemalloc
import cProfile
import pstats
from functools import wraps
# Profiling decorator
def profile_function(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Profiling {func.__name__}...")
        tracemalloc.start()
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Execution time: {end_time - start_time:.3f}s")
        print(f"Memory usage: Current={current/1024:.1f}KB, Peak={peak/1024:.1f}KB")
        return result
    return wrapper
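# The cProfile/pstats imports above support finer-grained, per-call profiling;
# a minimal sketch (profile_with_cprofile is an illustrative helper, not a
# LangChain API):
def profile_with_cprofile(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    # Show the 10 entries with the highest cumulative time
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
    return result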
# Example usage
@profile_function
def run_chain_example():
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    # Mock LLM (a plain callable is coerced into a Runnable when composed with |)
    class MockLLM:
        def __call__(self, prompt_value):
            time.sleep(0.05)
            return "Mock response"
    prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
    chain = prompt | MockLLM() | StrOutputParser()
    result = chain.invoke({"topic": "Python"})
    print(f"Chain result: {result}")
run_chain_example()
Benchmarking LLM Calls
python
import time
from langchain_openai import ChatOpenAI
# Benchmark LLM call
def benchmark_llm_call():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    prompt = "What is the capital of France?"
    start = time.time()
    response = llm.invoke(prompt)
    duration = time.time() - start
    print(f"LLM call duration: {duration:.2f}s")
    print(f"Response: {response.content}")
benchmark_llm_call()
Application-Level Optimizations
Async Execution and Batching
python
import asyncio
from langchain_openai import ChatOpenAI
# Async LLM calls
async def async_llm_call(llm, prompt):
    response = await llm.ainvoke(prompt)
    return response
async def batch_llm_calls(prompts):
    # Reuse a single client across all concurrent calls
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    tasks = [async_llm_call(llm, prompt) for prompt in prompts]
    results = await asyncio.gather(*tasks)
    return results
# Run batch
prompts = [f"What is {city}?" for city in ["Paris", "London", "Berlin", "Rome"]]
results = asyncio.run(batch_llm_calls(prompts))
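# Chat models are Runnables, so the built-in batch()/abatch() helpers are a
# simpler alternative to hand-rolled asyncio.gather (sketch reusing the prompts above):
batch_results = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo").batch(prompts)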
print(results)
Efficient Chains and Memory
python
from langchain.memory import ConversationTokenBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
# Efficient chain with token-limited memory
def efficient_conversation():
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    # ConversationTokenBufferMemory enforces max_token_limit; a plain
    # ConversationBufferMemory would grow without bound
    memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=500, return_messages=True)
    chain = ConversationChain(llm=llm, memory=memory)
    # Simulate conversation
    chain.memory.chat_memory.add_user_message("Hello!")
    chain.memory.chat_memory.add_ai_message("Hi! How can I help?")
    result = chain.invoke({"input": "Tell me about Python."})
    print(result)
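# A cheaper hedged alternative: keep only the last k exchanges with a
# sliding-window memory (k=3 is illustrative)
from langchain.memory import ConversationBufferWindowMemory
window_memory = ConversationBufferWindowMemory(k=3, return_messages=True)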
efficient_conversation()
Prompt Engineering for Speed
python
from langchain_core.prompts import ChatPromptTemplate
# Short, focused prompts are faster
prompt = ChatPromptTemplate.from_template("Summarize: {text}")
short_text = "LangChain is a framework for building LLM applications."
result = prompt.format(text=short_text)
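# Capping output length also reduces latency; a hedged sketch using ChatOpenAI's
# max_tokens parameter (the 100-token limit is illustrative):
from langchain_openai import ChatOpenAI
fast_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=100)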
print(result)
Caching and Vectorization
Caching LLM and Retrieval Results
python
from functools import lru_cache
@lru_cache(maxsize=128)
def cached_llm_response(prompt):
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")
    return llm.invoke(prompt)
response = cached_llm_response("What is the capital of France?")
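# LangChain also ships a process-wide LLM cache; a hedged sketch (import paths
# assume a recent langchain-core release):
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
set_llm_cache(InMemoryCache())  # repeated identical prompts reuse cached completions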
print(response.content)
Vectorization and Efficient Retrieval
python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
# Efficient vector store
def efficient_vector_search():
    docs = [Document(page_content=f"Doc {i}") for i in range(100)]
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(docs, embeddings)
    results = vectorstore.similarity_search("Doc 42", k=5)
    print([doc.page_content for doc in results])
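# Wrapping the store as a retriever with a capped k keeps downstream chains from
# fetching more documents than needed (a sketch; k=5 is illustrative):
def build_capped_retriever(vectorstore):
    return vectorstore.as_retriever(search_kwargs={"k": 5})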
efficient_vector_search()
Infrastructure Optimizations
Autoscaling and Load Balancing
yaml
# Kubernetes Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langchain-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langchain-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Vector DB Indexing and Hardware Acceleration
python
# Example: Pinecone as a managed vector DB; index tuning and hardware run on the
# service side (legacy pinecone-client init API; key and environment are placeholders)
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("langchain-index")
# Insert and query vectors
vectors = [(f"id-{i}", [float(i)] * 128) for i in range(1000)]
index.upsert(vectors=vectors)
results = index.query(vector=[0.0] * 128, top_k=5)
print(results)
Next Steps
Continue with deployment and monitoring:
- Deployment Strategies - Deploy optimized systems
- Monitoring and Observability - Monitor performance
- Cost Optimization - Reduce operational costs
Key Performance Takeaways:
- Profiling and benchmarking identify bottlenecks
- Async execution and batching speed up LLM calls
- Efficient chains and memory reduce resource usage
- Prompt engineering improves speed and quality
- Caching and vectorization accelerate retrieval
- Autoscaling and load balancing ensure reliability
- Hardware acceleration boosts throughput
- Continuous optimization is essential for production systems