How One App Held an Availability Group Hostage

The Ghost in the TempDB: How One App Held an Availability Group Hostage

In the world of SQL Server, TempDB is often called the "public dumping ground." We expect it to be busy, but we also expect it to clean up after itself. Recently, we encountered a critical incident where a 16GB database caused a TempDB explosion that threatened an entire High Availability cluster.

The culprit? Not a massive JOIN, but a combination of Always On Availability Groups (AG) and a "zombie" application.

The Crisis: Disk Space Exhaustion

It started with a standard alert:

DB error: OLE DB or ODBC error: Could not allocate a new page for database 'TempDB' because of insufficient disk space in filegroup 'DEFAULT'.

Initial checks showed that the space wasn't being eaten by user-created temporary tables or massive internal worktables (sorts/hashes). Instead, it was the Version Store that was growing uncontrollably.

The Architecture: Why Always On AG Changes the Game

To understand this, we have to look at how secondary replicas work. When you use Readable Secondaries, SQL Server automatically maps all queries to Snapshot Isolation.

This is done to ensure that:

Read queries on the secondary don't block the Redo Thread (which applies changes from the primary).
Users see a consistent version of the data.

To achieve this, SQL Server stores old versions of rows in the Version Store (within TempDB). This works perfectly... until a transaction stays open too long.

The Investigation: Identifying the "Zombie"

By querying sys.dm_tran_active_snapshot_database_transactions, we found the root cause. We identified sessions from the ANeT TimeComp application (specifically via login Anet_xxx from host) that were holding transactions open for an incredible amount of time.

Key Evidence:

Transaction Longevity: We found Session with a Longest Transaction Running Time of over 40,000 seconds (approx. 11 hours).
The Log Anomaly: The user database had a data file of 16GB, but its Transaction Log was 60GB and nearly full. This is a classic symptom of a transaction preventing log truncation.
The Stuck Thread: Even after network blips, some sessions remained in a DB_STARTUP state, effectively acting as "ghosts" that the SQL Server garbage collector could not bypass.

The "Garbage Collection" Trap

The Version Store in TempDB operates on a simple rule: You cannot delete a version of a row if there is a transaction older than that version still active.

In our case, because the ANeT TimeComp application failed to commit or roll back its transactions, the SQL Server Garbage Collector was paralyzed. It couldn't clean up any newer row versions because that 11-hour-old transaction was still "active" in the eyes of the engine. Consequently, TempDB just kept growing until the physical disk was exhausted.

The Tactical Fix: Kill Jobs and Async Stats

While waiting for a vendor fix, we implemented a two-pronged defense:

The "Guillotine" Job: An automated SQL Agent job that monitors transaction duration for the specific application and kills any session exceeding a safe threshold. This forced the rollback and allowed the Version Store to finally shrink.
Asynchronous Statistics: We noticed that UPDATE STATISTICS (especially in one specific table) was occasionally getting tangled in these long locks. We toggled the database setting:

ALTER DATABASE [YOUR_DB] SET AUTO_UPDATE_STATISTICS_ASYNC ON;
GO

This allowed the engine to update stats in the background without blocking or being blocked by the application's synchronization issues.

Final Resolution and Lessons Learned

The issue was finally classified as an application-level bug where network instability caused the application to lose track of its state without closing its SQL connections.

The vendor later provided a hotfix, which was implemented as part of the plan to migrate this application to a different environment.

Master Takeaways for DBAs:

Don't just monitor space: Monitor the age of the oldest active transaction (sys.dm_tran_database_transactions).
Understand AG overhead: Readable secondaries rely heavily on TempDB. If your TempDB is on a small drive, a single poorly written app can take down your HA strategy.

Log vs. Data Ratio: If your Log is 4x larger than your Data, you don't have a "big database" you have a "big problem" with transaction management.