#TL;DR
When you use scoped_session with FastAPI’s default settings, sync endpoints run inside a ThreadPoolExecutor, and sessions get permanently bound to threads. When connections held by idle threads get killed by MySQL’s wait_timeout, you see Broken Pipe errors. Replacing scopefunc with a ContextVar-based one isolates sessions per request without changing any existing code.
#An intermittent Broken Pipe error
```
(2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")
```
This error showed up once or twice a month, almost always overnight or on the first request after sunrise. A retry always cleared it, and a month later it would come back at the same hour. We’d been pushing it down the priority list for a while. Then one Tuesday it spiked to ten times its usual rate, and we finally sat down and chased it. The pattern was clear:
- It happens when traffic is quiet for a stretch and then a request comes in
- It lines up exactly with MySQL’s `wait_timeout` (default 8 hours)
- It’s not one specific endpoint — it shows up across different APIs
- `pool_pre_ping=True` was already on and it still didn’t go away
#What we suspected first
The puzzling part was that pool_pre_ping=True was already on. It’s the SQLAlchemy-recommended option that validates dead connections at the moment of checkout, and we’d had it enabled in production for as long as anyone could remember. The errors came back anyway.
So we started with MySQL’s wait_timeout. Default is 28800 seconds — 8 hours. Our traffic rarely went quiet that long, or so we thought, until we noticed that between 4am and 8am some endpoints were producing exactly that length of silence. We pushed wait_timeout to 16 hours. The following week, the same thing happened.
We added pool_recycle=3600. Marginal improvement. That was when the word checkout finally caught in my head. pool_pre_ping only fires at the moment of checkout. What if, in our case, no checkout was happening at all? That was the first real lead.
Looking back, pool_pre_ping had been doing its job all along — for the connections that actually went through checkout. Those were the cases it caught. The ones it missed were the connections where checkout never happened in the first place. That was the real problem.
#The collision between scoped_session’s design assumption and FastAPI
#scoped_session was designed for the WSGI-era “1 request = 1 thread” model
You have to understand the context scoped_session was created in to see the problem. Flask, Django, Pyramid, and other traditional Python web frameworks are WSGI-based (Web Server Gateway Interface). WSGI servers (Gunicorn, uWSGI) assign each HTTP request to a single worker thread, and when the request finishes, the thread is released. 1 request = 1 thread is a guaranteed property.
The SQLAlchemy docs spell out this assumption:
“the majority of Python web frameworks utilize threads in a simple way, such that a particular web request is received, processed and completed within the scope of a single worker thread.”
— SQLAlchemy Docs: Using Thread-Local Scope with Web Applications
Under that assumption, “associating a session with a thread” is effectively the same as “associating it with a request.” So scoped_session defaults to ThreadLocalRegistry, which is built on threading.local(), keeping one session per thread. If you pass a custom scopefunc, it switches to a dictionary-based ScopedRegistry instead — and that’s the mechanism this post’s fix exploits. In the WSGI era, the default was enough.
#Why it breaks down in modern frameworks
FastAPI, Starlette, and other ASGI-based (Asynchronous Server Gateway Interface) frameworks have a fundamentally different concurrency model.
- WSGI: one thread per request. Threads and requests are 1:1.
- ASGI: event-loop based. `async def` endpoints run directly on the event loop; `def` endpoints run on a `ThreadPoolExecutor` thread pool.
The key difference is that threads in the pool get reused. In WSGI, the thread is released when the request finishes. FastAPI is different. anyio’s CapacityLimiter caps concurrent execution at 40 by default, and the same threads serve many requests in sequence. One thread handles request A, sits idle for a bit, handles request B, then handles request C. In an environment where one thread carries multiple requests across its lifetime, managing sessions through threading.local() leaves the session bound to the thread and stuck there.
SQLAlchemy itself acknowledges this limitation:
“It is however strongly recommended that the integration tools provided with the web framework itself be used, if available, instead of scoped_session. In particular, while using a thread local can be convenient, it is preferable that the Session be associated directly with the request, rather than with the current thread.”
— SQLAlchemy Docs: Using Thread-Local Scope with Web Applications
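The thread-reuse behavior is easy to reproduce with nothing but the standard library. This sketch (names are illustrative) simulates two "requests" landing on the same pooled thread and shows `threading.local()` state leaking from the first into the second:

```python
# Demonstrates how a thread pool reuses threads, so threading.local()
# state set during one "request" is still visible to the next one.
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def handle_request(request_id: str) -> str:
    # The first request on this thread creates the "session"; later
    # requests on the same thread silently inherit it.
    if not hasattr(local, "session"):
        local.session = f"session-for-{request_id}"
    return local.session

# One worker thread, standing in for a single idle pool slot.
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(handle_request, "req-A").result()
    second = pool.submit(handle_request, "req-B").result()

print(first)   # session-for-req-A
print(second)  # session-for-req-A  <- req-B got req-A's session
```

Swap `max_workers=1` for a fresh `threading.Thread` per call and the leak disappears, which is exactly the WSGI assumption the docs describe.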
```mermaid
graph TB
    Client[Client Request] --> FastAPI[FastAPI<br/>async event loop]
    FastAPI -->|"def endpoint<br/>(sync)"| TPE["ThreadPoolExecutor<br/>(thread pool)"]
    FastAPI -->|"async endpoint"| EL[Runs on event loop]
    TPE --> T1[Thread-1]
    TPE --> T2[Thread-2]
    TPE --> TN[Thread-N]
    T1 -->|"threading.local()"| S1["Session-A<br/>(permanently bound)"]
    T2 -->|"threading.local()"| S2["Session-B<br/>(permanently bound)"]
    TN -->|"threading.local()"| SN["Session-N<br/>(permanently bound)"]
    style S1 fill:#fee,stroke:#c00,color:#18181b
    style S2 fill:#fee,stroke:#c00,color:#18181b
    style SN fill:#fee,stroke:#c00,color:#18181b
```
If the thread pool is sized at 40, you end up with up to 40 sessions, each one bound permanently to a thread.
#“But I’m calling close() — why is Broken Pipe still happening?”
This is the part that surprises everyone the first time. You’re clearly calling session.close(), so why aren’t the connections being cleaned up? The answer is in the difference between close() and remove(). Inside a scoped_session context, the two do fundamentally different things.
From the SQLAlchemy docs:
“The `scoped_session.remove()` method first calls `Session.close()` on the current Session, which has the effect of releasing any connection/transactional resources owned by the Session first, then discarding the Session itself.”
To put it plainly:
```mermaid
graph LR
    subgraph "session.close()"
        direction TB
        C1[Session-A] -->|"connection returned to pool"| CP1[Connection Pool]
        C1 -.->|"still in registry"| REG1["registry#91;thread_1#93; = Session-A ❌"]
    end
    subgraph "session_factory.remove()"
        direction TB
        C2[Session-A] -->|"connection returned to pool"| CP2[Connection Pool]
        C2 -->|"removed from registry"| REG2["registry#91;thread_1#93; deleted ✅"]
    end
    style REG1 fill:#fee,stroke:#c00,color:#18181b
    style REG2 fill:#efe,stroke:#0a0,color:#18181b
```
- `close()` = return the connection. The session object is still in the registry. The next call returns the same session.
- `remove()` = run `close()`, then delete the session from the registry. The next call creates a new one.
As the docs state, remove() runs close() and then drops the session object itself out of the registry. Identity map state, accumulated error state — all of it gets reset cleanly.
But the whole convenience of scoped_session is that you don’t have to manage this explicitly. You can just call session_factory() and get a session back without ever touching close() or remove(). That convenience is poison in FastAPI. Most code never calls close() explicitly, so the Session keeps holding its connection and stays attached to the thread.
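The registry semantics can be sketched with a toy, stdlib-only version (all names here are illustrative, not SQLAlchemy’s):

```python
# Toy registry keyed by thread, mirroring what scoped_session does,
# to show why close() alone leaves the session in place.
import threading

class ToySession:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True  # "connection returned to the pool"

registry = {}

def get_session():
    key = threading.get_ident()
    if key not in registry:
        registry[key] = ToySession()
    return registry[key]

def remove():
    # close() first, then drop the session object from the registry
    key = threading.get_ident()
    if key in registry:
        registry.pop(key).close()

s1 = get_session()
s1.close()        # close(): the session object is STILL in the registry
s2 = get_session()
print(s1 is s2)   # True: the same (closed) session comes back

remove()          # remove(): close() + delete from the registry
s3 = get_session()
print(s1 is s3)   # False: a brand-new session
```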
#The full path to a Broken Pipe
```mermaid
sequenceDiagram
    participant C as Client
    participant T as Thread-3
    participant S as Session-A<br/>(scoped)
    participant M as MySQL
    Note over C,M: 22:00 — last request
    C->>T: GET /api/data
    T->>S: session_factory() → returns Session-A
    S->>M: query via Connection-X
    M-->>C: Response
    Note over S: Request done. close() never called.<br/>Session-A still holds Connection-X.
    Note over C,M: ⏳ 8 hours pass (no requests on Thread-3)
    Note over M: 06:00 — MySQL wait_timeout fires
    M-xS: Connection-X closed server-side<br/>(client doesn't notice)
    Note over C,M: 09:00 — a new request lands on Thread-3
    C->>T: GET /api/users
    T->>S: session_factory() → same Session-A returned
    S->>M: tries to query through the still-held Connection-X
    M--xS: ❌ Broken Pipe!
    Note over T: (2006, "MySQL server has gone away")
```
Because the Session is holding the connection directly, no checkout happens against the pool. pool_pre_ping only fires at checkout time, so it never gets the chance to do its job in this scenario.
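A toy model (illustrative names only, not SQLAlchemy internals) makes the gap concrete: the ping hook lives in the checkout path, and a session that caches its connection only walks that path once.

```python
# Toy model of why pool_pre_ping can't help here: the ping runs at
# checkout, and a session holding a connection never checks out again.
class ToyPool:
    def __init__(self):
        self.checkouts = 0
    def checkout(self):
        self.checkouts += 1   # a real pool would run pre_ping here
        return "connection"

class ToySession:
    def __init__(self, pool):
        self.pool = pool
        self.conn = None
    def query(self):
        if self.conn is None:           # only the FIRST query checks out
            self.conn = self.pool.checkout()
        return self.conn                # later queries reuse the held conn

pool = ToyPool()
session = ToySession(pool)
session.query()        # checkout #1: the ping would fire
session.query()        # no checkout: a dead connection goes undetected
print(pool.checkouts)  # 1
```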
#You are not alone — related reports
While tracking this down, I found that the same symptom keeps coming up around FastAPI + SQLAlchemy. If you’re seeing the same thing, these threads are useful starting points.
Several FastAPI GitHub discussions:
- Discussion #8017 — comparing scoped_session vs Dependency vs Middleware approaches, with a ContextVar + scopefunc fix shared
- Discussion #6628 — the deadlock case for DI-style sessions when the thread pool fills up waiting for connections
A comment from the SQLAlchemy side noting the surge of FastAPI-related concurrency issues:
“we seem to be getting a flurry of concurrency issues involving FastAPI very suddenly. These errors, particularly the ‘lost connection’ error, are often the side effect of conditions that were ultimately caused by concurrency issues.”
— zzzeek, sqlalchemy/sqlalchemy Discussion #8891
The conditions where this typically shows up:
- Using `scoped_session` without specifying scopefunc (default: `threading.local()`)
- Sync endpoints (`def`) running on the ThreadPoolExecutor
- A database like MySQL/MariaDB that kills idle connections via `wait_timeout`
- Uneven traffic patterns that leave some threads idle for long stretches
#The fix: replace scopefunc with ContextVar
#The core idea
If you pass a custom scopefunc, scoped_session switches from ThreadLocalRegistry to ScopedRegistry. The structure is just a dictionary.
```python
# SQLAlchemy ScopedRegistry, simplified
class ScopedRegistry:
    def __init__(self, createfunc, scopefunc):
        self.createfunc = createfunc
        self.scopefunc = scopefunc
        self.registry = {}

    def __call__(self):
        key = self.scopefunc()  # what this key is — that's everything
        if key not in self.registry:
            self.registry[key] = self.createfunc()
        return self.registry[key]
```
If you set the key to a per-request unique value, then even when several requests run on the same thread, each one gets its own session.
```mermaid
graph LR
    subgraph "Before: thread ID as key"
        direction TB
        SF1["scopefunc()"] -->|"Thread-1 ID"| R1["registry#91;thread_1#93; = Session-A"]
        SF1 -->|"Thread-2 ID"| R2["registry#91;thread_2#93; = Session-B"]
    end
    subgraph "After: request ID as key"
        direction TB
        SF2["scopefunc()"] -->|"req-a1b2"| RR1["registry#91;req-a1b2#93; = Session-X"]
        SF2 -->|"req-e5f6"| RR2["registry#91;req-e5f6#93; = Session-Y"]
    end
    style R1 fill:#fee,stroke:#c00,color:#18181b
    style R2 fill:#fee,stroke:#c00,color:#18181b
    style RR1 fill:#efe,stroke:#0a0,color:#18181b
    style RR2 fill:#efe,stroke:#0a0,color:#18181b
```
#Implementation: 3 files, minimal change
#ContextVar module (new)
```python
# context_session.py
from contextvars import ContextVar, Token

_session_context: ContextVar[str] = ContextVar("session_context")

def set_session_context(context_id: str) -> Token:
    return _session_context.set(context_id)

def get_session_context() -> str:
    return _session_context.get()

def reset_session_context(token: Token) -> None:
    _session_context.reset(token)
```
#Database class (one-line change)
```python
from sqlalchemy import orm
from sqlalchemy.engine import Engine

from context_session import get_session_context

class Database:
    def __init__(self, engine: Engine) -> None:
        self.session_factory = orm.scoped_session(
            orm.sessionmaker(autocommit=False, autoflush=False, bind=engine),
            scopefunc=get_session_context,  # ← this one line is the whole fix
        )
```
#Middleware (new)
```python
from uuid import uuid4

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

from context_session import set_session_context, reset_session_context

class ContextSessionMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        context_id = str(uuid4())[:8]
        token = set_session_context(context_id)
        try:
            response = await call_next(request)
            return response
        finally:
            # Order matters: remove() before reset.
            # remove() internally calls scopefunc() to find the session for
            # the current context. If reset runs first, scopefunc() returns
            # a different value and the wrong session gets removed.
            # (`database` is the module-level Database instance, `app` the FastAPI app.)
            database.session_factory.remove()  # find and remove the current context's session
            reset_session_context(token)       # then reset the context

app.add_middleware(ContextSessionMiddleware)
```
UseCase, Repository, endpoint code — none of it needs to change.
#Request lifecycle after the fix
```mermaid
sequenceDiagram
    participant C as Client
    participant MW as ContextSession<br/>Middleware
    participant T as Thread-3
    participant S as Session<br/>(per-request)
    participant M as MySQL
    C->>MW: GET /api/data
    MW->>MW: set_session_context(uuid)
    MW->>T: call_next(request)
    T->>S: session_factory()
    Note over S: scopefunc() → uuid<br/>not in registry → new Session
    S->>M: run query
    M-->>S: result
    S-->>T: result
    T-->>MW: Response
    MW->>MW: session_factory.remove()
    Note over S: removed from registry ✅
    MW->>MW: reset_session_context(token)
    MW-->>C: Response
```
Even if a different request lands on the same thread, its context_id is different, so a fresh session gets created. When the request finishes, remove() cleans it up completely.
The key precondition for this pattern: FastAPI/Starlette runs sync endpoints through anyio’s run_in_threadpool. Internally, anyio uses Python’s standard Context.run(), which means the ContextVar from the call site is copied verbatim into the worker thread. The context_id set in the middleware reaches session_factory() inside the endpoint with the same value.
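That propagation can be checked with the standard library alone. `Context.run()` is the mechanism anyio uses to enter the worker thread; the variable names below are illustrative:

```python
# A ContextVar set in the caller is visible inside a worker thread when
# the call is wrapped with Context.run(), as anyio's run_in_threadpool does.
import contextvars
from concurrent.futures import ThreadPoolExecutor

request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id")

def read_in_worker() -> str:
    return request_id.get()

request_id.set("req-a1b2")
ctx = contextvars.copy_context()  # snapshot of the caller's context

with ThreadPoolExecutor(max_workers=1) as pool:
    # The worker thread runs read_in_worker INSIDE the copied context.
    seen = pool.submit(ctx.run, read_in_worker).result()

print(seen)  # req-a1b2
```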
The day after deploying this, MySQL server has gone away errors dropped to zero. They’ve been zero for a month now. No memory change (remove() cleans up every request). No P99 latency change either (the ContextVar overhead is negligible).
#Why we didn’t switch to the DI pattern
The FastAPI docs recommend the Depends(get_db) pattern.
```python
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/users")
def get_users(db: Session = Depends(get_db)):
    return db.query(User).all()
```
The session lifecycle is explicit and isolation is perfect. But if your existing codebase injects scoped_session globally through a DI container, the migration cost is high. Every UseCase and Repository would need to change how it receives the session, and the entire test suite would need a rewrite.
The DI approach also has its own FastAPI issue. As Discussion #6628 reports, when the thread pool fills up waiting for connections, the finally block of a dependency may not get a thread to run on, causing a deadlock.
Replacing scopefunc keeps the existing structure intact and only fixes the underlying problem. It also aligns with SQLAlchemy’s recommendation to “associate the Session directly with the request.”
#Other frameworks converge on the same mechanism
Every major framework eventually arrives at the same mechanism — a registry keyed by whatever scopefunc returns. The only thing that varies is what the key is.
| Implementation | scopefunc | Key unit |
|---|---|---|
| scoped_session default | threading.local() | thread |
| Flask-SQLAlchemy | Flask request context | Flask request |
| async_scoped_session | ContextVar.get() / current_task() | async task |
| Galaxy Project | ContextVar (fallback: threading.get_ident) | request |
| This post’s fix | ContextVar.get() | request |
In FastAPI Discussion #8017, people converged on the same conclusion: “generate a UUID per request, store it in a ContextVar, use it as the scopefunc — works perfectly.” The Galaxy Project (a large bioinformatics platform) runs the same pattern in production on FastAPI.
Mike Bayer (zzzeek), the creator of SQLAlchemy, also recommends in GitHub discussions that asyncio environments use Python’s standard library contextvars instead of threading.local().
#Things to watch out for
#1. threading.local() is only safe when threads are destroyed
According to the SQLAlchemy docs, a threading.local() object is garbage-collected when its thread is destroyed. In environments that create and discard threads, you can get away without remove(). But FastAPI’s ThreadPoolExecutor does not destroy threads — it reuses them. In that environment, remove() is mandatory, and if you can’t guarantee the call, replacing scopefunc itself is the safer answer.
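A quick stdlib sketch of the difference: when every request gets a brand-new thread (the WSGI-style model this section describes), there is never leftover `threading.local()` state to clean up.

```python
# Each fresh thread starts with empty threading.local() state, so
# skipping remove() is survivable when threads are created and destroyed.
import threading

local = threading.local()
seen = []

def handle():
    seen.append(hasattr(local, "session"))  # was there leftover state?
    local.session = "s"

for _ in range(3):
    t = threading.Thread(target=handle)     # new thread per "request"
    t.start()
    t.join()

print(seen)  # [False, False, False]: every fresh thread starts clean
```

Run the same `handle` three times on a single pooled thread instead and the second and third calls would see `True`, which is the FastAPI situation.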
#2. pool_pre_ping and pool_recycle are not the real fix
```python
engine = create_engine(
    "mysql+pymysql://...",
    pool_pre_ping=True,
    pool_recycle=3600,
)
```
These settings detect dead connections at the connection pool layer. But there’s a hard limit: pool_pre_ping only fires at the moment a connection is checked out from the pool.
In the scoped_session pattern, close() is often never called explicitly. When that happens, the Session is holding the connection directly, no checkout against the pool ever occurs, and pool_pre_ping never gets a chance to run. Even when you do call close(), the Session itself stays in the registry and gets returned again, so the underlying problem (sessions bound to threads) doesn’t go away.
The real fix is replacing scopefunc. pool_pre_ping is a useful defensive backup to keep alongside it, not a substitute.
#3. Skipping remove() leads to a memory leak
You must call remove() in the finally block of the middleware. With the thread-based approach, you’d only ever have as many sessions as threads (a few dozen). With the ContextVar-based approach, you can have as many as concurrent requests. If remove() is missed, the ScopedRegistry dictionary grows without bound.
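The leak is easy to picture with a toy registry (illustrative, not SQLAlchemy code): a per-request key with no `remove()` leaves one dead entry behind per request.

```python
# Toy registry keyed per request; nothing ever deletes the entries.
registry = {}

def handle_request(request_id: str) -> None:
    registry[request_id] = object()  # "session" created for this request
    # ... the request finishes, but remove() is never called

for i in range(1000):
    handle_request(f"req-{i}")

print(len(registry))  # 1000: one leaked entry per request
```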
#4. Know BaseHTTPMiddleware’s limitations
The example uses BaseHTTPMiddleware. It’s concise to write, but Starlette has a known limitation: the downstream app runs in a separate task, so streaming responses can misbehave. The pattern of setting a ContextVar in the middleware and reading it downstream works fine, but if you need streaming in production, consider migrating to a pure ASGI middleware.
#5. Manage session creation and cleanup entirely in the middleware
FastAPI can run middleware and dependencies in different contexts. If you set/reset the ContextVar in middleware but call remove() from a dependency, remove() may run after the context has already been reset. This issue is also reported in dependency-injector Discussion #493. It’s safest to manage the entire session lifecycle inside the middleware.
#Summary
| Aspect | Default scoped_session | scopefunc + ContextVar |
|---|---|---|
| Session isolation unit | thread | request |
| Session management | implicit (threading.local) | implicit (ContextVar) |
| Broken Pipe risk | yes (idle threads) | none (cleaned on request end) |
| Existing code changes | none | scopefunc 1 line + middleware |
| UseCase/Repository changes | none needed | none needed |
| Debugging difficulty | high (manifests 8 hours later) | low (immediate) |
| greenlet/async compat | limited | compatible |
What we’d bound in the wrong place wasn’t the code — it was the space the session lived in. We had carried “thread equals request,” a piece of WSGI-era legacy, straight into an ASGI environment. The fix is one line. Finding the right place to put that line took days.
If you’re running scoped_session on FastAPI and you sometimes catch errno 32 in the early morning, before reaching for pool_pre_ping, ask yourself first — where is my session bound?
#References
- SQLAlchemy 2.0 Docs — Contextual/Thread-local Sessions — scoped_session design assumption, close() vs remove(), custom scopefunc
- fastapi/fastapi Discussion #8017 — scoped_session vs Dependency vs Middleware comparison, ContextVar + scopefunc fix
- fastapi/fastapi Discussion #6628 — DI session deadlock report
- sqlalchemy/sqlalchemy Discussion #8891 — surge of FastAPI concurrency issues, comment from zzzeek