MCP Streamable HTTP Transport: Building Stateless, Scalable MCP Deployments for Enterprise
Learn how MCP Streamable HTTP transport enables stateless, horizontally scalable MCP deployments for enterprise. Includes Docker, Kubernetes examples, and migration guides.
Key Takeaways
- Streamable HTTP replaces SSE's persistent connections with ordinary HTTP POST requests carrying JSON-RPC payloads
- Stateless, token-based sessions let any server instance handle any request — no sticky sessions required
- Standard load balancers, autoscalers, CDNs, and serverless platforms work out of the box
- STDIO remains the right choice for local servers; Streamable HTTP is the default for remote production deployments
---
The Evolution of MCP Transport: From STDIO to Streamable HTTP
When the Model Context Protocol launched, it supported a single transport mechanism: STDIO (Standard Input/Output). This was perfect for local development — your AI client spawns a subprocess, pipes JSON-RPC messages through stdin/stdout, and everything works beautifully on your laptop.
Then came SSE (Server-Sent Events), which enabled remote MCP servers. You could run an MCP server in the cloud and connect to it from anywhere. This unlocked entirely new deployment models, but it brought a problem that every backend engineer recognizes: stateful connections.
SSE requires long-lived HTTP connections. Each client maintains a persistent connection to a specific server instance. This means:
- Load balancers need sticky sessions to keep each client pinned to its instance
- Every open connection holds server memory for its entire lifetime
- Restarting or scaling down an instance disconnects every client attached to it
For a developer running one MCP server on their laptop, none of this matters. For an enterprise running MCP at scale — powering AI assistants for thousands of employees hitting dozens of MCP servers — these constraints become serious bottlenecks.
Streamable HTTP solves all of this.
What Changed in the MCP Specification
The Streamable HTTP transport was added in the 2025-03-26 revision of the MCP specification and has rapidly become the recommended transport for any production deployment. The key design principles:
1. Stateless by default — each request carries all necessary context
2. HTTP-native — works with any standard HTTP infrastructure
3. Streaming optional — supports both instant responses and streamed results
4. Backward compatible — SSE clients can connect to Streamable HTTP servers with minimal changes
> People Also Ask: Is STDIO transport deprecated?
> No. STDIO remains the best choice for local MCP servers that run as subprocesses on your machine. It's the simplest transport with zero network overhead. Streamable HTTP is designed for remote and distributed deployments. For understanding the tradeoffs, see our local vs remote MCP servers comparison.
---
How Streamable HTTP Works
Streamable HTTP is beautifully simple. At its core, it's just HTTP POST requests with JSON-RPC payloads. No WebSockets, no long-lived connections, no special protocols.
The Basic Flow
Client Server
| |
| POST /mcp |
| Content-Type: application/json|
| { "jsonrpc": "2.0", |
| "method": "tools/call", |
| "params": { ... }, |
| "id": 1 } |
|------------------------------->|
| |
| HTTP 200 |
| Content-Type: application/json|
| { "jsonrpc": "2.0", |
| "result": { ... }, |
| "id": 1 } |
|<-------------------------------|
| |
That's it. A standard HTTP POST with a JSON-RPC body, and a standard HTTP response with the result. Any HTTP client can speak this protocol. Any load balancer can route these requests. Any CDN can cache appropriate responses.
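In code, that exchange is a single POST from any HTTP-capable client. A minimal fetch-based sketch (the endpoint URL and tool name are placeholders, not part of any real deployment):

```typescript
// The entire protocol envelope: a JSON-RPC 2.0 request object.
const request = {
  jsonrpc: "2.0" as const,
  method: "tools/call",
  params: { name: "query_metrics", arguments: { service: "api" } },
  id: 1,
};

// POST it like any other HTTP call.
async function callTool(endpoint: string): Promise<unknown> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    body: JSON.stringify(request),
  });
  return res.json(); // standard JSON-RPC response: { jsonrpc, result, id }
}
```

No SDK is strictly required on the client side — anything that can send an HTTP POST can call an MCP tool.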
Streaming Responses
For long-running operations (database queries, code generation, complex computations), the server can stream results using chunked transfer encoding or SSE within the response:
Client Server
| |
| POST /mcp |
| Accept: text/event-stream |
| { "method": "tools/call", |
| "params": { "name": |
| "long_computation" } } |
|------------------------------->|
| |
| HTTP 200 |
| Content-Type: text/event-stream|
| |
| data: {"progress": 0.25} |
|<-------------------------------|
| data: {"progress": 0.50} |
|<-------------------------------|
| data: {"progress": 0.75} |
|<-------------------------------|
| data: {"result": {...}} |
|<-------------------------------|
| |
The client opts into streaming by sending Accept: text/event-stream. If the client sends Accept: application/json, the server buffers the complete response and returns it as a single JSON payload. This flexibility lets the same server support both interactive clients (that want progress updates) and batch clients (that just want the final result).
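The negotiation is easy to picture as two serializers over the same stream of chunks — an illustrative sketch, not the SDK's internal code:

```typescript
type JsonRpcResponse = { jsonrpc: "2.0"; result: unknown; id: number };

// Streaming path: each progress chunk becomes one SSE "data:" frame.
function toSseFrames(chunks: unknown[]): string {
  return chunks.map((c) => `data: ${JSON.stringify(c)}\n\n`).join("");
}

// Buffered path: the client only sees the final chunk, wrapped in JSON-RPC.
function toBufferedResponse(chunks: unknown[], id: number): JsonRpcResponse {
  return { jsonrpc: "2.0", result: chunks[chunks.length - 1], id };
}

const chunks = [{ progress: 0.25 }, { progress: 0.5 }, { result: { ok: true } }];
// Accept: text/event-stream → toSseFrames(chunks)
// Accept: application/json  → toBufferedResponse(chunks, 1)
```

The same tool handler produces the chunks either way; only the serialization at the HTTP boundary differs.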
Session Management Without State
The key innovation is how Streamable HTTP handles sessions. Instead of maintaining server-side session state, the protocol uses a session token pattern:
// First request — server creates a session
POST /mcp
{
"jsonrpc": "2.0",
"method": "initialize",
"params": { "clientInfo": { "name": "my-client" } },
"id": 1
}

// Response includes session token
HTTP 200
Mcp-Session-Id: sess_abc123
{
"jsonrpc": "2.0",
"result": {
"serverInfo": { "name": "my-server" },
"capabilities": { ... }
},
"id": 1
}
// Subsequent requests include the session token
POST /mcp
Mcp-Session-Id: sess_abc123
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": { ... },
"id": 2
}
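Client-side, carrying the session forward is nothing more than setting a header on each request. A sketch using the Fetch API (`withSession` is an illustrative helper, not an SDK function):

```typescript
// Build fetch options that attach the Mcp-Session-Id returned by initialize.
function withSession(sessionId: string | null, body: object): RequestInit {
  const headers = new Headers({ "Content-Type": "application/json" });
  if (sessionId) headers.set("Mcp-Session-Id", sessionId);
  return { method: "POST", headers, body: JSON.stringify(body) };
}

// Usage:
// fetch("https://mcp.company.com/mcp",
//   withSession("sess_abc123", { jsonrpc: "2.0", method: "tools/call", params: {}, id: 2 }));
```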
The session token can be an opaque identifier validated against shared storage (such as Redis), or a self-contained signed token (such as a JWT) that any instance can verify without server-side state.
For most enterprise deployments, the stateless JWT approach is ideal:
import jwt from 'jsonwebtoken';

const SECRET = process.env.SESSION_SECRET!;

function createSessionToken(clientInfo: ClientInfo): string {
  return jwt.sign({
    clientId: clientInfo.name,
    capabilities: clientInfo.capabilities,
    createdAt: Date.now()
  }, SECRET, { expiresIn: '24h' });
}

function validateSession(token: string): SessionData {
  return jwt.verify(token, SECRET) as SessionData;
}
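If you want the same stateless property without a JWT dependency, an HMAC-signed payload built on Node's crypto module works the same way — any instance holding the shared secret can verify the token. A sketch (the secret fallback is for illustration only; expiry handling is omitted):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const SECRET = process.env.SESSION_SECRET ?? "dev-only-secret";

// Serialize and sign the session payload; no server-side storage needed.
function signSession(payload: object): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = createHmac("sha256", SECRET).update(body).digest("base64url");
  return `${body}.${sig}`;
}

// Verify the signature in constant time, then decode the payload.
function verifySession(token: string): object | null {
  const [body, sig] = token.split(".");
  if (!body || !sig) return null;
  const expected = createHmac("sha256", SECRET).update(body).digest("base64url");
  const a = Buffer.from(sig), b = Buffer.from(expected);
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null;
  return JSON.parse(Buffer.from(body, "base64url").toString());
}
```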
> People Also Ask: Can Streamable HTTP handle server-initiated notifications?
> Yes, through two mechanisms. First, the server can include notifications in streamed responses. Second, clients can issue an HTTP GET to the MCP endpoint to open a server-to-client SSE stream that the server uses to push events. This is optional and doesn't affect the stateless nature of the core request/response flow.
---
Why Stateful Connections Became a Bottleneck
To understand why Streamable HTTP matters for enterprise, let's look at the real problems teams hit with SSE at scale.
The Sticky Session Problem
With SSE, each client maintains a persistent connection to one server instance. If you have 4 server instances behind a load balancer, client A connects to server 1 and must stay connected to server 1 for the entire session. This means:
- The load balancer must support and maintain session affinity
- Load distributes unevenly: long-lived sessions pile up on whichever instances happened to receive them
- You can't drain an instance for a deployment without severing every client attached to it
Memory Pressure
Each SSE connection consumes memory on the server:
1,000 concurrent connections × ~50KB per connection = ~50MB
10,000 concurrent connections × ~50KB per connection = ~500MB
100,000 concurrent connections × ~50KB per connection = ~5GB
That's just for holding connections, before any actual work is done.
The Reconnection Storm
When a server instance crashes or gets restarted, all connected clients must reconnect simultaneously. This creates a "thundering herd" effect that can cascade across your infrastructure.
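Clients defend against reconnection storms with jittered exponential backoff, spreading retry attempts over time instead of retrying in lockstep. A common "full jitter" sketch (the base and cap values are illustrative):

```typescript
// Full-jitter backoff: pick a uniform random delay in [0, min(cap, base * 2^attempt)).
// Randomizing the whole window prevents clients from reconnecting simultaneously.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```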
Enterprise Numbers
A typical enterprise deployment might look like 5,000 employees, each running an AI assistant that connects to roughly 5 MCP servers — about 25,000 concurrent SSE connections.
Managing 25,000 persistent SSE connections across a fleet of servers is a serious operational challenge. With Streamable HTTP, those 25,000 connections become 25,000 short-lived HTTP requests — something every web infrastructure team already knows how to handle.
---
Enterprise Deployment Patterns
Here's how to deploy MCP servers with Streamable HTTP at enterprise scale.
Pattern 1: Simple Load-Balanced Deployment
The most common pattern — multiple MCP server instances behind a standard load balancer:
┌─────────────────┐
│ Load Balancer │
│ (ALB/NLB/Nginx) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ MCP Server │ │ MCP Server │ │ MCP Server │
│ Instance 1 │ │ Instance 2 │ │ Instance 3 │
└───────────┘ └───────────┘ └───────────┘
No sticky sessions needed. Round-robin or least-connections load balancing works perfectly.
Nginx configuration:
upstream mcp_backend {
least_conn;
server mcp-server-1:3000;
server mcp-server-2:3000;
server mcp-server-3:3000;
}

server {
listen 443 ssl;
server_name mcp.company.com;
location /mcp {
proxy_pass http://mcp_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# Support streaming responses
proxy_buffering off;
proxy_cache off;
# Timeout for long-running tool calls
proxy_read_timeout 300s;
}
}
Pattern 2: Auto-Scaling with Kubernetes
For dynamic scaling based on load:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  labels:
    app: mcp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: your-registry/mcp-server:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          env:
            - name: SESSION_SECRET
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: session-secret
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: mcp-server
spec:
  selector:
    app: mcp-server
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Pattern 3: Multi-Region with Edge Routing
For global enterprises, deploy MCP servers in multiple regions with intelligent routing:
┌─────────────────┐
│ Global DNS / │
│ Edge Router │
└────────┬────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌──────┴──────┐ ┌─────┴──────┐ ┌─────┴──────┐
│ US-East │ │ EU-West │ │ AP-South │
│ Cluster │ │ Cluster │ │ Cluster │
│ (3 pods) │ │ (3 pods) │ │ (2 pods) │
└─────────────┘ └────────────┘ └────────────┘
Since Streamable HTTP is stateless, requests can be routed to the nearest healthy region without worrying about session affinity.
> People Also Ask: What about latency compared to SSE?
> For individual tool calls, Streamable HTTP adds the overhead of HTTP connection setup per request (typically 1-5ms with HTTP/2 and connection reuse). For most MCP operations, this is negligible compared to the tool execution time itself. The trade-off is well worth it for the operational simplicity at scale.
---
Building a Streamable HTTP MCP Server
Here's a complete implementation using the TypeScript SDK:
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import express from "express";
import { z } from "zod";

const app = express();
app.use(express.json());

// Create the MCP server
const mcpServer = new McpServer({
  name: "enterprise-tools",
  version: "2.0.0"
});

// Register tools
mcpServer.tool(
  "query_metrics",
  "Query application metrics from the monitoring system",
  {
    service: z.string().describe("Service name"),
    metric: z.string().describe("Metric name"),
    timeRange: z.string().describe("Time range (1h, 6h, 24h, 7d)")
  },
  async ({ service, metric, timeRange }) => {
    const data = await queryPrometheus(service, metric, timeRange);
    return {
      content: [{
        type: "text",
        text: JSON.stringify(data, null, 2)
      }]
    };
  }
);

// Mount the MCP endpoint in stateless mode: a fresh transport per
// request, so any instance can serve any client
app.post('/mcp', async (req, res) => {
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined // stateless — no Mcp-Session-Id issued
  });
  res.on('close', () => transport.close());
  await mcpServer.connect(transport);
  await transport.handleRequest(req, res, req.body);
});
// Health check for load balancers
app.get('/health', (req, res) => {
res.json({ status: 'healthy', uptime: process.uptime() });
});
app.get('/ready', (req, res) => {
// Check downstream dependencies
const ready = checkDependencies();
res.status(ready ? 200 : 503).json({ ready });
});
app.listen(3000, () => {
console.log('MCP server listening on port 3000 (Streamable HTTP)');
});
Dockerizing Your MCP Server
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
USER node
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Build and run:
docker build -t mcp-server:latest .
docker run -p 3000:3000 -e SESSION_SECRET=your-secret mcp-server:latest
For production deployments, see our MCP deployment and DevOps guide for CI/CD pipelines and infrastructure-as-code patterns.
---
Transport Comparison: STDIO vs SSE vs Streamable HTTP
Here's a comprehensive comparison to help you choose the right transport for your use case:
STDIO
Best for: Local development, CLI tools, single-user desktop apps
| Aspect | Details |
|--------|---------|
| Connection type | Process stdin/stdout |
| Network required | No |
| Scalability | Single process |
| Load balancing | N/A |
| Session management | Implicit (process lifetime) |
| Deployment complexity | Minimal |
| Latency | ~0ms (IPC) |
| Use case | IDE plugins, local tools |
SSE (Server-Sent Events)
Best for: Small-scale remote deployments, real-time push scenarios
| Aspect | Details |
|--------|---------|
| Connection type | Persistent HTTP connection |
| Network required | Yes |
| Scalability | Limited by connection count |
| Load balancing | Requires sticky sessions |
| Session management | Connection-based |
| Deployment complexity | Moderate |
| Latency | ~1-5ms |
| Use case | Small teams, prototypes |
Streamable HTTP
Best for: Production deployments, enterprise scale, multi-region
| Aspect | Details |
|--------|---------|
| Connection type | Standard HTTP request/response |
| Network required | Yes |
| Scalability | Unlimited horizontal scaling |
| Load balancing | Any standard load balancer |
| Session management | Token-based (stateless) |
| Deployment complexity | Standard web deployment |
| Latency | ~1-10ms |
| Use case | Enterprise, production, APIs |
Decision Framework
Is your MCP server local only?
→ Yes → Use STDIO
→ No → Is it for < 100 concurrent users?
→ Yes → SSE is fine, Streamable HTTP is better
→ No → Use Streamable HTTP
For more on MCP architecture decisions, see our MCP architecture deep dive.
> People Also Ask: Can I support multiple transports simultaneously?
> Yes! The MCP SDKs let you expose the same server over multiple transports. This is common during migration — you keep SSE for existing clients while adding Streamable HTTP for new ones. The server logic is transport-agnostic.
---
Migrating from SSE to Streamable HTTP
If you have existing SSE-based MCP servers, migration is straightforward.
Step 1: Update Your SDK
npm install @modelcontextprotocol/sdk@latest
Step 2: Add the Streamable HTTP Transport
Keep your existing SSE endpoint and add Streamable HTTP alongside it:
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";

// Existing SSE endpoint (keep for backward compatibility).
// Track transports by session ID so POSTed messages reach the right one.
const sseTransports: Record<string, SSEServerTransport> = {};

app.get('/sse', async (req, res) => {
  const sseTransport = new SSEServerTransport('/messages', res);
  sseTransports[sseTransport.sessionId] = sseTransport;
  res.on('close', () => delete sseTransports[sseTransport.sessionId]);
  await server.connect(sseTransport);
});

app.post('/messages', async (req, res) => {
  const sseTransport = sseTransports[req.query.sessionId as string];
  if (!sseTransport) return res.status(400).send('Unknown session');
  await sseTransport.handlePostMessage(req, res, req.body);
});

// New Streamable HTTP endpoint (stateless)
app.post('/mcp', async (req, res) => {
  const httpTransport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined
  });
  res.on('close', () => httpTransport.close());
  await server.connect(httpTransport);
  await httpTransport.handleRequest(req, res, req.body);
});
Step 3: Update Client Configurations
Update client configs to point to the new endpoint:
{
"mcpServers": {
"my-server": {
"transport": "streamable-http",
"url": "https://mcp.company.com/mcp",
"headers": {
"Authorization": "Bearer ${MCP_TOKEN}"
}
}
}
}
Step 4: Remove SSE After Migration
Once all clients have migrated, remove the SSE endpoints and their associated state management code.
---
Performance Optimization for Enterprise Scale
Connection Pooling
With HTTP/2, many concurrent MCP requests multiplex over a single TCP connection, so per-request connection setup largely disappears. On the client side, construct the transport as usual and let the underlying HTTP stack negotiate HTTP/2 via ALPN (the URL is a placeholder):
// Client-side: the HTTP stack negotiates HTTP/2 (ALPN), multiplexing
// concurrent requests over one connection
const transport = new StreamableHTTPClientTransport(
  new URL("https://mcp.company.com/mcp")
);
Response Caching
For idempotent tools (read-only queries, static data), implement caching:
import { createHash } from 'crypto';

const cache = new Map();

function getCacheKey(method: string, params: any): string {
  return createHash('sha256')
    .update(JSON.stringify({ method, params }))
    .digest('hex');
}

app.post('/mcp', async (req, res) => {
  const { method, params } = req.body;
  // Serve read-only tool calls from cache when possible
  if (method === 'tools/call' && isReadOnly(params?.name)) {
    const key = getCacheKey(method, params);
    const cached = cache.get(key);
    if (cached && cached.expiry > Date.now()) {
      return res.json(cached.result);
    }
    // Intercept the outgoing JSON body so the fresh result can be cached
    const originalJson = res.json.bind(res);
    res.json = (body) => {
      cache.set(key, { result: body, expiry: Date.now() + 60_000 }); // 1 min TTL
      return originalJson(body);
    };
  }
  await transport.handleRequest(req, res, server);
});
Rate Limiting
Protect your MCP servers from abuse:
import rateLimit from 'express-rate-limit';

const mcpLimiter = rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 100, // 100 requests per minute per client
keyGenerator: (req) => {
const session = req.headers['mcp-session-id'];
return session || req.ip;
},
message: {
jsonrpc: "2.0",
error: { code: -32000, message: "Rate limit exceeded" }
}
});
app.post('/mcp', mcpLimiter, async (req, res) => {
// Handle request
});
Observability
Add structured logging and metrics for production monitoring:
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('mcp-server');
const requestCounter = meter.createCounter('mcp.requests.total');
const requestDuration = meter.createHistogram('mcp.request.duration');
app.post('/mcp', async (req, res) => {
const start = Date.now();
const method = req.body.method;
try {
await transport.handleRequest(req, res, server);
requestCounter.add(1, { method, status: 'success' });
} catch (err) {
requestCounter.add(1, { method, status: 'error' });
throw err;
} finally {
requestDuration.record(Date.now() - start, { method });
}
});
For detailed performance tuning, see our MCP performance optimization guide.
---
Security for Enterprise Streamable HTTP Deployments
Authentication
Use standard HTTP authentication — bearer tokens, mutual TLS, or API keys:
app.post('/mcp', async (req, res) => {
const authHeader = req.headers.authorization;
if (!authHeader || !authHeader.startsWith('Bearer ')) {
return res.status(401).json({
jsonrpc: "2.0",
error: { code: -32000, message: "Authentication required" }
});
}

const token = authHeader.split(' ')[1];
const user = await validateToken(token);
if (!user) {
return res.status(403).json({
jsonrpc: "2.0",
error: { code: -32000, message: "Invalid token" }
});
}
// Attach user context for authorization in tool handlers
req.mcpUser = user;
await transport.handleRequest(req, res, server);
});
Authorization
Implement per-tool authorization based on user roles:
mcpServer.tool("delete_production_data", /* ... */, async (args, context) => {
if (!context.user.roles.includes('admin')) {
throw new Error("Insufficient permissions");
}
// Proceed with deletion
});
Audit Logging
Log every MCP tool call for compliance:
app.post('/mcp', async (req, res) => {
const { method, params } = req.body;
if (method === 'tools/call') {
await auditLog.write({
timestamp: new Date().toISOString(),
user: req.mcpUser.email,
tool: params.name,
arguments: params.arguments,
sourceIp: req.ip
});
}
// Handle request
});
For a comprehensive security guide, read our MCP security best practices article.
---
Real-World Case Study: Scaling to 10 Million Daily Requests
A large financial services company migrated their MCP infrastructure from SSE to Streamable HTTP. Here's what changed:
Before (SSE):
After (Streamable HTTP):
The migration took 3 weeks, with 1 week of dual-transport overlap for client migration.
---
Frequently Asked Questions
Is Streamable HTTP compatible with existing MCP clients?
Most modern MCP clients (Claude Desktop 2.x+, ChatGPT, VS Code Copilot) support Streamable HTTP natively. Older clients that only support SSE will need updates. The SDK makes it easy to support both transports during migration.
How does Streamable HTTP handle long-running tool calls?
For tools that take more than a few seconds, the server can either: (1) stream progress updates using chunked transfer encoding / SSE within the response, or (2) return immediately with a task ID and let the client poll for completion. The streaming approach is preferred for interactive use.
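Option (2) is easy to sketch with an in-memory task registry (names are illustrative; production would use a shared store such as Redis so any instance can answer the poll):

```typescript
import { randomUUID } from "node:crypto";

type Task = { status: "running" | "done" | "failed"; result?: unknown };
const tasks = new Map<string, Task>();

// Kick off the long-running work and return a task ID immediately.
function startTask(work: () => Promise<unknown>): string {
  const id = randomUUID();
  tasks.set(id, { status: "running" });
  work()
    .then((result) => tasks.set(id, { status: "done", result }))
    .catch(() => tasks.set(id, { status: "failed" }));
  return id;
}

// The client polls with the task ID until status flips to "done".
function pollTask(id: string): Task | undefined {
  return tasks.get(id);
}
```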
Can I use Streamable HTTP with serverless functions (Lambda, Cloud Functions)?
Yes, and this is one of the biggest advantages. Since each request is independent, MCP servers can run as serverless functions. This provides automatic scaling and pay-per-use pricing. Be aware of cold start latency for infrequently used tools.
What happens if the server crashes mid-request?
The client receives an HTTP error and can retry the request against any server instance. Since there's no session state to lose, retries are safe for idempotent tools. For non-idempotent tools, implement idempotency keys.
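Idempotency keys can be sketched as a result cache keyed by a client-supplied identifier (in-memory here for illustration; across instances you'd back this with a shared store):

```typescript
// Replay-safe execution: if a retry carries the same idempotency key,
// return the stored result instead of re-running the side effect.
const completed = new Map<string, unknown>();

async function runIdempotent<T>(key: string, operation: () => Promise<T>): Promise<T> {
  if (completed.has(key)) return completed.get(key) as T;
  const result = await operation();
  completed.set(key, result);
  return result;
}
```

The client generates the key once per logical operation (not per retry), so a retried request after a crash maps back to the original attempt.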
How do I handle file uploads through Streamable HTTP?
Large file uploads should use multipart form data or a separate upload endpoint that returns a file reference. The tool call then uses the file reference rather than embedding the file content in the JSON-RPC payload.
Does Streamable HTTP support WebSockets?
No, and intentionally so. WebSockets would reintroduce the stateful connection problems that Streamable HTTP was designed to solve. The streaming response pattern provides similar real-time capabilities without persistent connections.
What's the maximum request/response size?
There's no protocol-level limit, but practical limits apply. Most HTTP infrastructure handles up to 10MB request bodies comfortably. For larger payloads, use streaming or chunked transfers. Configure your reverse proxy accordingly.
How do I monitor Streamable HTTP MCP servers?
Use standard HTTP monitoring tools — Prometheus, Grafana, Datadog, New Relic. The request/response pattern maps perfectly to standard HTTP metrics (request rate, latency percentiles, error rates). This is much simpler than monitoring long-lived SSE connections.
Can Streamable HTTP work behind a CDN?
Yes, for read-only tool responses that can be cached. Configure your CDN to cache based on the request body hash. Write operations should bypass the CDN. This can dramatically reduce load for tools that return relatively static data.
What about gRPC as an alternative transport?
Google proposed a gRPC transport for MCP in early 2026. gRPC offers excellent performance and strong typing but requires HTTP/2 and adds complexity. For most teams, Streamable HTTP provides the best balance of simplicity and scalability.
---
Getting Started Today
If you're building MCP servers for production, Streamable HTTP should be your default transport choice for any remote deployment. The combination of stateless architecture, standard HTTP infrastructure, and horizontal scalability makes it the clear winner for enterprise use.
Start by updating your MCP SDK, add a Streamable HTTP endpoint alongside your existing transport, validate with your clients, and then retire the old transport. The migration path is smooth, and the operational benefits are immediate.
For a complete enterprise MCP deployment strategy, check out our MCP for enterprise guide.