I use SQL Agent a ton. It is like my go to place to schedule jobs and tasks. Problem is, there are pieces “missing”. Like if a job hangs, you can have it auto stop after a given amount of time. It doesn’t alert on changes to a job, etc, etc.
I asked on twitter what the DBA’s think, @SQLRockstar thinks using OpsMgr is the way, which probably would be great, if I had it. I need a quick and dirty solution.
What I did was query the sysjobs, sysjobactivity, and sysjobhistory tables to get some kind of “look” into the running jobs. That way, if a job that is normally 45 minutes has been running for 5 hours, I should be able to catch it early on, just trying to be more proactive instead of reactive.
SELECT j.job_id AS 'JobId', name AS 'JobName', start_execution_date AS 'StartTime', stop_execution_date AS 'StopTime', avgruntimeonsucceed, DATEDIFF(s,start_execution_date,GETDATE()) AS 'CurrentRunTime', CASE WHEN stop_execution_date IS NULL THEN DATEDIFF(ss,start_execution_date,stop_execution_date) ELSE 0 END 'ActualRunTime', CASE WHEN stop_execution_date IS NULL THEN 'JobRunning' WHEN DATEDIFF(ss,start_execution_date,stop_execution_date) > (AvgRunTimeOnSucceed + AvgRunTimeOnSucceed * .05) THEN 'LongRunning-History' ELSE 'NormalRunning-History' END 'JobRun', CASE WHEN stop_execution_date IS NULL THEN CASE WHEN DATEDIFF(ss,start_execution_date,GETDATE()) > (AvgRunTimeOnSucceed + AvgRunTimeOnSucceed * .05) THEN 'LongRunning-NOW' ELSE 'NormalRunning-NOW' END ELSE 'JobAlreadyDone' END AS 'JobRunning' FROM msdb.dbo.sysjobactivity ja INNER JOIN msdb.dbo.sysjobs j ON ja.job_id = j.job_id INNER JOIN ( SELECT job_id, AVG ((run_duration/10000 * 3600) + ((run_duration%10000)/100*60) + (run_duration%10000)%100) + STDEV ((run_duration/10000 * 3600) + ((run_duration%10000)/100*60) + (run_duration%10000)%100) AS 'AvgRuntimeOnSucceed' FROM msdb.dbo.sysjobhistory WHERE step_id = 0 AND run_status = 1 GROUP BY job_id) art ON j.job_id = art.job_id WHERE (stop_execution_date IS NULL) OR (DATEDIFF(ss,start_execution_date,stop_execution_date) > 60 AND CAST(LEFT(start_execution_date,11) AS DATETIME) = CAST(LEFT(GETDATE(),11) AS DATETIME)) ORDER BY start_execution_date DESC
The inner query looks at the history to get the average runtime, and I added the standard deviation to that to make it more realistic, in case you have a one off that throws the average off. Of course this all depends on you having more than 3 entries in your job history as well.
I then just compare the average to the current runtime (if the job is executing) and to the actual (if it is complete).
You could use this query in a report, to view throughout the day to catch long running jobs, you could also use it to just give you the one that are currently running and over the time threshold, and alert you, so you can catch them right away.
I tested this on a few servers running SQL 2005, but I am guessing it will work on SQL 2008 as well. Find and kill those long running jobs!!