Web Role Crash Dumps
The Problem
We recently hit a problem at work whereby one of our Azure Web Role instances suddenly spiked to ~100% Processor(_Total) and then just sat there, unresponsive to any incoming HTTP requests.
After closer inspection it appeared the role instance was not entirely unresponsive; it was still performing its duties (logging, sending heartbeat signals back to the fabric controller and processing HTTP requests), but because its CPU was maxed out it was taking so long to process requests that they were timing out. For this reason the instance was never flagged as ‘unresponsive’ by the Azure Fabric Controller and consequently was never recycled.
This caused us quite a headache. We run two web role instances (for failover and scalability reasons), but the Azure load balancer that distributes HTTP traffic across them appears to be quite basic (round robin) and has no awareness that one of the instances has become, to all intents and purposes, unresponsive. The ‘hung’ instance therefore continued to receive roughly half of all incoming HTTP traffic, rendering our cloud app effectively unavailable.
Furthermore, because the role instance was still sending heartbeat signals back to the Fabric Controller, it was not automatically recycled, which meant we had to recycle it manually.
The big problem was how to debug the issue, since we couldn’t recreate it on the dev fabric.
What we did
We use the Azure Diagnostics framework pretty extensively throughout our cloud app (which is how we could tell the problem was a CPU spike), but the usual approach of adding more logging wasn’t much use, since we had no idea where the CPU spike was originating from.
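For reference, the Processor(_Total) samples only end up in the WADPerformanceCountersTable because the diagnostics agent is told to collect and transfer them. I haven’t reproduced our exact setup here, but with the SDK 1.x API it looks something like this sketch (the sample rate and transfer period shown are illustrative):

// Illustrative sketch of the perf counter setup (SDK 1.x diagnostics API).
// The counter specifier is the one the watchdog later reads back from table storage;
// the sample rate, transfer period and connection string setting name are examples.
DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
{
    CounterSpecifier = @"\Processor(_Total)\% Processor Time",
    SampleRate = TimeSpan.FromSeconds(30)
});

// Push the samples to the WADPerformanceCountersTable every minute.
config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

DiagnosticMonitor.Start("DiagnosticsConnectionString", config);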
What would be ideal is a crash dump of the web role VM instance taken while its CPU was spiking; we could then inspect the call stacks, thread counts, heap and so on in WinDbg and isolate the cause. But it’s quite rare to get a web role crash dump organically, since any unhandled exception that bubbles up is dealt with by the default unhandled exception handling in Global.asax.
To work around this, we started a new Thread in the WebRole’s OnStart() method which monitored the Processor(_Total) perf counter for the web role instance.
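The wiring in the WebRole looks roughly like this (a sketch rather than our exact OnStart; the CrashDumps.EnableCollection call is presumably how you’d tell the diagnostics agent you want full dumps rather than mini dumps):

public override bool OnStart()
{
    // Sketch: enable full crash dump collection so the diagnostics agent captures
    // a dump when the process dies with an unhandled exception.
    CrashDumps.EnableCollection(true);

    // Run the CPU watchdog on a background thread so it never keeps the host alive by itself.
    var watchdog = new Thread(MonitorProcessForCrashDump) { IsBackground = true };
    watchdog.Start();

    return base.OnStart();
}

The monitoring loop itself: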
/// <summary>
/// Monitors the WaWebHost process for crash dump.
/// if CPU util >= 90%(?) Generate Mini (proc) Crash Dump
/// </summary>
private static void MonitorProcessForCrashDump()
{
    Thread.Sleep(TimeSpan.FromMinutes(5));

    string cpuPerfCounterName = RoleEnvironment.GetConfigurationSettingValue("CrashDumpPerfCounterName");
    double cpuThreshold = double.Parse(RoleEnvironment.GetConfigurationSettingValue("CrashDumpThreshold"));

    while (true)
    {
        Thread.Sleep(TimeSpan.FromMinutes(1));

        DateTime now = DateTime.UtcNow;
        RoleInstance roleInstance = RoleEnvironment.CurrentRoleInstance;

        var values = DiagnosticsHelper.GetPerfCountersForPeriod(now.AddMinutes(-5), now, roleInstance.Role.Name, cpuPerfCounterName);
        if (values.Any())
        {
            double avg = DiagnosticsHelper.GetAverage(values.Select(pc => pc.CounterValue).ToList());
            LoggingHelper.LogInfomation(string.Format("Crash Dump Average = {0}", avg));

            if (avg > cpuThreshold)
            {
                UnresponsiveRoleException ure = new UnresponsiveRoleException();
                LoggingHelper.LogException("UNRESPONSIVE WEB ROLE INSTANCE - RESTARTING", ure, RoleEnvironment.CurrentRoleInstance.Id);
                throw ure;
            }
        }
    }
}
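GetAverage isn’t shown here; it’s nothing more exciting than a mean over the sampled counter values, something like:

// Trivial helper, included only for completeness; assumes CounterValue is a double.
public static double GetAverage(List<double> counterValues)
{
    return counterValues.Count == 0 ? 0d : counterValues.Average();
}

The helper that pulls the counter samples back out of the diagnostics table storage is below.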
/// <summary>
/// Gets the perf counters for period.
/// </summary>
/// <param name="from">From.</param>
/// <param name="to">To.</param>
/// <param name="roleName">Name of the role.</param>
/// <param name="counterName">Name of the counter.</param>
/// <returns></returns>
public static List<PerformanceTableEntity> GetPerfCountersForPeriod(DateTime from, DateTime to, string roleName, /* string instanceID, */ string counterName)
{
    var svc = ServiceHelpers.CreateAzureTableService();
    string npk = null, nrk = null;
    List<PerformanceTableEntity> perfRows = new List<PerformanceTableEntity>();

    do
    {
        var query = svc.CreateQuery<PerformanceTableEntity>(PerformanceCountersTableName)
            .Where(de => de.PartitionKey.CompareTo("0" + from.ToUniversalTime().Ticks) >= 0
                && de.PartitionKey.CompareTo("0" + to.ToUniversalTime().Ticks) <= 0
                && de.Role == roleName // && de.RoleInstance == instanceID && de.DeploymentId == deploymentId
                && de.CounterName == counterName)
            .Take(1000) as DataServiceQuery<PerformanceTableEntity>;

        if (!string.IsNullOrEmpty(npk) && !string.IsNullOrEmpty(nrk))
        {
            query = query
                .AddQueryOption("NextPartitionKey", npk)
                .AddQueryOption("NextRowKey", nrk);
        }

        var response = query.Execute() as QueryOperationResponse<PerformanceTableEntity>;

        npk = response.Headers.ContainsKey("x-ms-continuation-NextPartitionKey") ? response.Headers["x-ms-continuation-NextPartitionKey"] : string.Empty;
        nrk = response.Headers.ContainsKey("x-ms-continuation-NextRowKey") ? response.Headers["x-ms-continuation-NextRowKey"] : string.Empty;

        perfRows.AddRange(response.ToList());
    } while (!string.IsNullOrEmpty(npk) && !string.IsNullOrEmpty(nrk));

    return perfRows;
}
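PerformanceTableEntity is just our entity class over the rows in the WADPerformanceCountersTable; based on the properties the query uses, its shape is roughly:

// Rough sketch of the entity the query binds to. TableServiceEntity (from the
// StorageClient library) supplies PartitionKey, RowKey and Timestamp; the rest
// mirror the standard WADPerformanceCountersTable columns.
public class PerformanceTableEntity : TableServiceEntity
{
    public long EventTickCount { get; set; }
    public string DeploymentId { get; set; }
    public string Role { get; set; }
    public string RoleInstance { get; set; }
    public string CounterName { get; set; }
    public double CounterValue { get; set; }
}

Finally, Application_Error in Global.asax makes sure the watchdog’s exception isn’t swallowed: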
/// <summary>
/// Handles the Error event of the Application control.
/// </summary>
/// <param name="sender">The source of the event.</param>
/// <param name="e">The <see cref="System.EventArgs"/> instance containing the event data.</param>
protected void Application_Error(object sender, EventArgs e)
{
    Exception lastError = Server.GetLastError();
    LoggingHelper.LogException("Global Exception caught", lastError, ServiceHelpers.GetAuthenticatedUsername());

    // Rethrow the watchdog's exception so it escapes ASP.NET's default error handling.
    if (lastError is UnresponsiveRoleException)
    {
        throw lastError;
    }
}
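UnresponsiveRoleException itself is nothing special; a minimal version is just a custom exception type that nothing else in the application ever catches:

// Minimal sketch of the marker exception; the important thing is simply that
// no other code in the application handles it.
[Serializable]
public class UnresponsiveRoleException : Exception
{
    public UnresponsiveRoleException()
        : base("Web role instance CPU has exceeded the configured threshold.")
    {
    }
}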
As the exception is uncaught, it eventually bubbles all the way up, which triggers the diagnostics monitor agent to take a full crash dump of the web role instance.
Conclusion
This meant we had a crash dump to analyse, with the added bonus that the Web Role instance was marked as unhealthy and automatically recycled by the Fabric Controller.
If you’re interested, the cause of the CPU spike was some dodgy server side image resizing we were doing, but that’s another topic and another post.