MOO | Refer a Friend

The Problem

We recently hit a problem at work whereby one of our Azure Web Role instances suddenly spiked to ~100% Processor(_Total) and then proceeded to just sit there being unresponsive to any incoming HTTP requests.

After closer inspection, it appeared the role instance was not entirely unresponsive; it was still performing its duties (logging, sending heartbeat signals back to the fabric controller and processing HTTP requests) but because its CPU was maxed out, it was taking so long to process requests they were timing out. For this reason, the instance did not register as ‘unresponsive’ by the Azure Fabric controller and subsequently recycled.

This caused us quite a headache because we have 2 web roles (for failover and scalability reasons) but the Azure load balancer which sends the HTTP traffic to the web role instances appears to be quite basic (round robin) and is not aware that one of the role instances has become (to all intents and purposes) unresponsive, so the ‘hung’ instance continued to receive approximately half of all incoming HTTP traffic rendering our cloud app effectively unavailable.

Furthermore, because the role instance was still sending heartbeat signals back to the Fabric Controller the instance was not automatically recycled, which meant we had to do it manually.

The big problem was how to debug this issue. We couldn’t recreate the issue on the dev fabric.

What we did

We use the Azure Diagnostics framework pretty extensively throughout our cloud app (which is how we can tell the problem is a CPU spike), but the usual steps of logging etc were not ideal, since we had no idea where the CPU spike was originating from.

What would be ideal is a crash dump of the web role VM instance while its CPU was spiking, then we could inspect the call stacks/number of threads/heap etc using WinDbg and isolate the cause. But it’s quite rare to get a web role crash dump organically since any unhandled exceptions that bubble up are handled by Global.asax’s default unhandled Exception handler.

To work around this, we started a new Thread on the WebRole OnStart() method which monitored the Processor(_Total) perf counter for the web role instance.

 /// <summary>
/// Monitors the WaWebHost process for crash dump.
/// if CPU util >= 90%(?) Generate Mini (proc) Crash Dump
/// </summary>
private static void MonitorProcessForCrashDump()
{
    Thread.Sleep(TimeSpan.FromMinutes(5));
 
    string cpuPerfCounterName = RoleEnvironment.GetConfigurationSettingValue("CrashDumpPerfCounterName");
 
    double cpuThreshold = double.Parse(RoleEnvironment.GetConfigurationSettingValue("CrashDumpThreshold"));
 
    while (true)
    {
        Thread.Sleep(TimeSpan.FromMinutes(1));
 
        DateTime now = DateTime.UtcNow;
 
        RoleInstance roleInstance = RoleEnvironment.CurrentRoleInstance;
 
        var values = DiagnosticsHelper.GetPerfCountersForPeriod(now.AddMinutes(-5), now, roleInstance.Role.Name, cpuPerfCounterName);
 
        if (values.Any())
        {
            double avg = DiagnosticsHelper.GetAverage(values.Select(pc => pc.CounterValue).ToList());
 
            LoggingHelper.LogInfomation(string.Format("Crash Dump Average = {0}", avg));
 
            if (avg > cpuThreshold)
            {
                UnresponsiveRoleException ure = new UnresponsiveRoleException();
 
                LoggingHelper.LogException("UNRESPONSIVE WEB ROLE INSTANCE - RESTARTING", ure, RoleEnvironment.CurrentRoleInstance.Id);
 
                throw ure;
            }
        }
    }
}
 

 /// <summary>
/// Gets the perf counters for period.
/// </summary>
/// <param name="from">From.</param>
/// <param name="to">To.</param>
/// <param name="roleName">Name of the role.</param>
/// <param name="counterName">Name of the counter.</param>
/// <returns></returns>
public static List<PerformanceTableEntity> GetPerfCountersForPeriod(DateTime from, DateTime to, string roleName/*, string instanceID, */,string counterName)
{
    var svc = ServiceHelpers.CreateAzureTableService();
 
    string npk = null, nrk = null;
 
    List<PerformanceTableEntity> perfRows = new List<PerformanceTableEntity>();
 
    do
    {
        var query = svc.CreateQuery<PerformanceTableEntity>(PerformanceCountersTableName)
            .Where(de => de.PartitionKey.CompareTo("0" + from.ToUniversalTime().Ticks) >= 0
                      && de.PartitionKey.CompareTo("0" + to.ToUniversalTime().Ticks) <= 0
                      && de.Role == roleName// && de.RoleInstance == instanceID //&& de.DeploymentId == deploymentId
                      && de.CounterName == counterName).Take(1000) as DataServiceQuery<PerformanceTableEntity>;
 
        if (!string.IsNullOrEmpty(npk) && !string.IsNullOrEmpty(nrk))
        {
            query = query
                .AddQueryOption("NextPartitionKey", npk)
                .AddQueryOption("NextRowKey", nrk);
        }
 
        var response = query.Execute() as QueryOperationResponse<PerformanceTableEntity>;
 
        npk = (response.Headers.ContainsKey("x-ms-continuation-NextPartitionKey")) ? response.Headers["x-ms-continuation-NextPartitionKey"] : string.Empty;
        nrk = (response.Headers.ContainsKey("x-ms-continuation-NextRowKey")) ? response.Headers["x-ms-continuation-NextRowKey"] : string.Empty;
 
        perfRows.AddRange(response.ToList());
 
    } while (!string.IsNullOrEmpty(npk) && !string.IsNullOrEmpty(nrk));
 
    return perfRows;
}
 

If an average in excess of a given threshold (say 90%) is detected for the Processor(_Total) perf counter, a specific Exception was thrown. This exception is then of course caught in Global.asax unhandled exception handler where it is subsequently re thrown.

 /// <summary>
/// Handles the Error event of the Application control.
/// </summary>
/// <param name="sender">The source of the event.</param>
/// <param name="e">The <see cref="System.EventArgs"/> instance containing the event data.</param>
protected void Application_Error(object sender, EventArgs e)
{
    LoggingHelper.LogException("Global Exception caught", Server.GetLastError(), ServiceHelpers.GetAuthenticatedUsername());
 
    if (Server.GetLastError().GetType().Name == typeof(UnresponsiveRoleException).Name.ToString())
    {
        throw Server.GetLastError();
    }
}
 

As the exception is uncaught it finally bubbles all the way up which triggers the diagnostics monitor agent to take a full crash dump of the web role instance.

Conclusion

This meant we had a crash dump to analyse and it had the added bonus that the Web Role instance was marked as unhealthy and automatically recycled by the Fabric Controller.

If you’re interested, the cause of the CPU spike was some dodgy server side image resizing we were doing, but that’s another topic and another post.

3 comments:

Anonymous said...: Great post - I wrote about this on my blog @ http://blogs.msdn.com/b/mikekelly/archive/2010/06/17/leveraging-azure-diagnostics-to-troubleshoot-intermittent-problems.aspx. Thanks for the detail!; 17 June 2010 at 12:53
Steven Nagy said...: Really interesting problem and creative solution.; 6 August 2010 at 15:24
Anonymous said...: Fao: Joe Wardell - I am recruiting on behalf of a leading financial Brokerage in Central London We are looking for a Senior .NET Developer with Key Skills: .NET, C#, ASP.NET, ASP.NET MVC 3, SQL Server, WCSF with MVP, WCF, WPF and Silverlight with MVVM

This is an excellent opportunity to break in to the Financial Services sector. The organisation offers excellent training opportunities. We are looking for keen technologists at heart.

We are also looking for a junior .NET Developer for this oraganisation – 2-3 years of Development experience in C# and someone who has a keen desire to learn new technologies, Silverlight, MVC, WCF, MVVM.

Kind Regards,

Shazia Hussain
Director, Enterprise Developer Group
Mob: 07595115808
shazia@e-developergroup.com; 5 August 2011 at 04:05

azure miscellany

Web Role Crash Dumps

The Problem

What we did

Conclusion

3 comments:

Post a Comment