SQL Server has encountered 1234 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file


You might have come across this message in your DBA life. If it is a recurring issue then do not ignore it, even though it appears as an informational message in the SQL Server error log.

Since SQL Server 2008 this is officially classified as an error, error 833: this message indicates that SQL Server has issued a read or write request to disk, and that the request has taken longer than 15 seconds to return. This error is reported by SQL Server and indicates a problem with the I/O subsystem.

How does it work?

In short: I/O requests are taking too long on the server where the SQL Server database files are located. Digging deeper, this is handled by Buffer Management, which holds the primary responsibility of ensuring that the database engine can store and retrieve data while the disk I/O keeps up with the demand.

For obvious reasons a disk I/O operation consumes many resources and can take a long time to finish; this is where Buffer Management comes into action by handling the operations efficiently. The key components of this management are the buffer manager and the buffer cache. The buffer manager is responsible for managing access and updates to database pages, while the buffer cache (you can also refer to it as the buffer pool) works to reduce database file I/O.
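To see the buffer cache at work, here is a minimal sketch (standard DMV; it needs VIEW SERVER STATE permission) that shows how many pages each database currently holds in memory:

-- Pages currently held in the buffer cache, per database
-- (each page is 8 KB, hence the * 8 / 1024 to get MB)
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_pages DESC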


When buffer operations slow down, the buffer manager identifies any I/O request that has been outstanding for at least 15 seconds and reports it to the error log as error 833. The additional text for the 833 is as follows:

SQL Server has encountered %d occurrence(s) of I/O requests taking longer than %d seconds to complete on file [%ls] in database [%ls] (%d).

The OS file handle is 0x%p. The offset of the latest long I/O is: %#016I64x.

Though this is classified as an error, for the moment it stands as a warning because it does not indicate a problem within SQL Server itself. However, as a DBA practice, if you have observed this message more than 5 times in a day then no doubt the underlying disk subsystem is not good! As usual, BOL clearly indicates the reasons for long I/O: it often indicates a SQL Server workload that is too intense for the disk subsystem. An inadequate disk subsystem may be indicated when:

Multiple long I/O messages appear in the error log during a heavy SQL Server workload.
Perfmon counters show long disk latencies, long disk queues, or no disk idle time (a DMV-based check of the latency part is sketched below).
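If Perfmon data is not at hand, a rough sketch of the same latency check can be run from inside SQL Server. Note that these figures are cumulative since the files were opened, so treat them as long-run averages rather than a live reading:

-- Average I/O latency per database file since startup
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_latency_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY avg_read_latency_ms DESC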

Not to blame SQL Server alone: long I/O is often caused by a disk component (driver, controller, or firmware) that delays or postpones servicing of the I/O requests sent out by SQL Server. In addition to the SQL Server error log, it is essential when diagnosing this error to go through the Windows Event Viewer logs –> System log for any hardware-related messages.
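While you are checking the System log, it also helps to count how often the message has already been hit. xp_readerrorlog is undocumented but widely used; a minimal sketch, assuming the usual parameter order (log number, log type, search string):

-- Search the current SQL Server error log (0) for the long-I/O message
-- (second argument: 1 = SQL Server log, 2 = SQL Agent log)
EXEC sys.xp_readerrorlog 0, 1, N'longer than 15 seconds'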

How to find the root cause?

Remember Dynamic Management Views (DMVs), our good old friend, and in particular sys.dm_os_wait_stats. An archive post here, SQL Server latch waits – index contention & performance issues, shows how DMVs can help you get this information. Mind you, this DMV shows the time for waits that have completed since the last restart of the SQL Server service or the last execution of DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);, not the current status of waits. BOL reference: specific types of wait times during query execution can indicate bottlenecks or stall points within the query. Similarly, high wait times or wait counts server-wide can indicate bottlenecks or hot spots in query interactions within the server instance. For example, lock waits indicate data contention by queries; page I/O latch waits indicate slow I/O response times; page latch update waits indicate incorrect file layout. So to find out the further status of wait stats, run this query:

SELECT * FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_type

and

SELECT * FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.dm_io_pending_io_requests AS pir
    ON vfs.file_handle = pir.io_handle
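A slightly narrowed version of that join can make the picture clearer. The sketch below keeps only the diagnostic columns; an io_pending value of 1 means the request is still sitting with the operating system/disk stack, which points away from SQL Server itself:

-- Pending I/O requests mapped back to database files
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       pir.io_type,
       pir.io_pending,
       pir.io_pending_ms_ticks
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.dm_io_pending_io_requests AS pir
    ON vfs.file_handle = pir.io_handle
ORDER BY pir.io_pending_ms_ticks DESC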

How to Troubleshoot?

There are a few steps and precautionary measures to take that will reduce the occurrence of this issue. Though it is general practice to run an Anti-Virus (AV) tool on all Enterprise servers, it is good to explain to your System Administrator that the SQL Server binaries and database-related folders should be excluded from the AV scan. Never, ever place SQL Server database data or log files on a drive that is marked to compress its contents.
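To make the System Administrator's job easier, you can hand over the exact folders to exclude. A minimal sketch using sys.master_files:

-- Physical location of every data and log file on the instance,
-- so the parent folders can be added to the AV exclusion list
SELECT DB_NAME(database_id) AS database_name,
       type_desc,
       physical_name
FROM sys.master_files
ORDER BY database_name, type_desc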

When this error occurs on TEMPDB it is a bigger problem, and more action is needed immediately. For further diagnostics enable trace flag 1118 (if in doubt, see KBA936185 and Paul's blog post on T1118), as one of the CSS engineers explained:

Trace flag 1118 forces uniform extent allocations instead of mixed page allocations. The trace flag is commonly used to assist in TEMPDB scalability by avoiding SGAM and other allocation contention points. SQL Server 2008 optimized mixed extent allocation behavior, reducing the need for trace flag 1118 and the contention on SGAM(s).
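For reference, a quick sketch of turning the flag on globally and verifying it (the -1 argument applies it to all sessions; to persist it across restarts add -T1118 as a startup parameter instead):

-- Enable trace flag 1118 for the whole instance
DBCC TRACEON (1118, -1)

-- Confirm it is active
DBCC TRACESTATUS (1118, -1)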

By now you should be able to find the root cause of this issue, along with a hardware check from the storage engineers to see if any of the disk components are faulty.

Finally, if you see that this error message is thrown at a particular point of time in the day, then capture the activity with a server-side trace or identify the query that is causing excessive I/O (use the dm_exec DMVs), referring to the (not long ago) post: SQL Server wait types – keep a close watch on IO_COMPLETION and WRITELOG :: sqlperf.
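As a starting point for that, a minimal sketch with the dm_exec DMVs that lists the cached statements doing the most physical reads (figures are cumulative since each plan was cached):

-- Top 10 cached statements by physical reads
SELECT TOP (10)
       qs.total_physical_reads,
       qs.total_logical_writes,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(st.text)
                     ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_physical_reads DESC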

There is more to discuss on DBA practices and performance tuning topics in my book, SQL Server 2008 R2 Administration Cookbook.