Monday, November 30, 2009

400 Years of the Telescope

If you're interested in astronomy then you might enjoy this PBS documentary, 400 Years of the Telescope.

This beautifully filmed show summarizes the history of the telescope, beginning with Galileo's first experiments with spyglasses and continuing up to the Hubble Space Telescope and radio astronomy. Along the way it (briefly) describes recent technologies such as adaptive optics and techniques for building very large mirrors.

While not a deeply technical resource, it is a nice way to spend an hour. In my case, I watched this on my iPhone while waiting for my son at a doctor's appointment and it was much better than reading out-of-date magazines - well worth the $1.99 cost.

Watching this made me want to work in an observatory someday; maybe I've seen the movie Contact too many times (if that is even possible) :-).

Sunday, November 29, 2009

N is a Number

Earlier this year I read the book "Linked: The New Science of Networks" which prominently featured the mathematician Paul Erdős. The book mentioned a documentary about Erdős entitled "N is a Number" that I was curious to see. Thanks to Netflix, I finally got the chance last weekend.

Most importantly, this movie is about Erdős "the person" and not his mathematics. While the film discusses his mathematical activities and accomplishments, it makes no material attempt to explain them.

With that focus in mind, I thought this was a nice documentary. The footage of Erdős was very entertaining and, I suspect, conveyed an accurate sense of his personality. I found the interviews with his various acquaintances equally entertaining - especially the elderly English women who recalled Erdős as a young man. This focus on Erdős' human qualities really helped cast him as an "exceptional person" instead of an "unrelatable genius".

As is often the case with people of substance, Erdős suffered hardship and had a number of significant life experiences - especially during WWII. Perhaps this explains the generosity and thoughtfulness toward others that was recounted a number of times in the documentary.

I was most surprised to learn of Erdős' nomadic lifestyle - for most of his life he had no permanent home or job. Instead, he followed his interests around the world, attending conferences and staying with (very) supportive friends. If an acquaintance on a different continent came up with an interesting problem, Erdős would hop on a plane with all of his worldly possessions in two small suitcases and often only a small amount of money in his pockets. I found his nomadic and minimalistic lifestyle both impressive and foreign as I can't imagine being without a "home" to return to. I wonder if this is an insight into the personal sacrifice needed to achieve greatness beyond having raw talent alone.

Clearly Erdős was a genius, which I already understood from reading other references before watching this film. The depth and breadth of his work was unique and will likely be unmatched for a long time to come. But, putting his mathematical achievements aside, Erdős still seemed like an impressive, inspirational, and great man - definitely a role model worth reflecting on.

Tuesday, November 24, 2009

Bug Hunt Wrap-up

To wrap up my "Bug Hunts" series, I thought it worthwhile to quickly reflect on what made these particular bugs so special.

In the series I posted about five bugs.

Since these five represent only a small subset of all the bugs that I have diagnosed during my career, I have to ask myself - what makes these bugs so special?

To me, the common thread among them is that their resolution involved a leap of thought based on skill, imagination, concentration, and an intuitive understanding of the associated technologies. In each case the answer was not self-evident from the information available and became clear only after deep thought produced an incredibly rewarding "ah-ha" moment. The short-term elation from these "ah-ha" moments is one reason why these bugs hold a special place in my memories.

However, I think the deeper answer to my question is that these bugs were milestone events that, to varying degrees, demonstrated and validated the skills that I had worked hard to acquire. The longer-term feeling of "mastery" and confidence produced by these moments was the reward required to make my past efforts satisfying and to motivate further development of my skills.

In many ways, this conclusion correlates well with Malcolm Gladwell's comments on satisfying work in his book Outliers. From page 49:

Those three things - autonomy, complexity, and a connection between effort and reward - are, most people agree, the three qualities that work has to have if it is to be satisfying.

The lesson that I draw from this reflection is that it is important to periodically seek out validating experiences to demonstrate acquired skill, obtain satisfaction, and refuel a continuous self-development effort.

Saturday, November 21, 2009

The Case of the Corrupted Cache Line

To cap off (for now) my "Bug Hunts" series of posts, I will attempt to describe the hardest problem that I have solved to date. It was a very complicated problem, one I found challenging to describe without going into excessive detail, but I think the discussion below does an adequate job.

In prior posts, I described the ccNUMA system that I worked on during my first industry job. After the (relative) success of the first system, a second generation system was developed that consisted of numerous incremental improvements - faster processors, better memory controllers, faster interconnect ASICs, a few coherency protocol optimizations, etc. In general, the pre-production testing went smoothly and the system started shipping with the fullest of confidence.

Shortly after going GA, however, a coherency protocol problem appeared in the field. The problem only seemed to occur on the largest (eight-node) second generation systems and had a long mean time between failures even under very heavy workloads. After ruling out the obvious suspects (those coherency protocol optimizations I mentioned), we instrumented an eight-node system in the lab and started the bug hunt.

To describe the failure, I need to further explain the SCI coherency protocol used in these systems. Like other distributed shared memory protocols, the SCI protocol segmented the system's memory into fixed-size chunks called "lines" (64 bytes in SCI's case). Being a ccNUMA machine, the lines were distributed across all of the nodes in the system. For each line of memory in the system there was a single "home-node" that contained the associated physical memory and served as the coordination point for all accesses to it. Mapping tables initialized at boot time identified the home-node for each line of memory, and that mapping remained constant while the machine was running. "Remote-nodes" could access the lines from home-nodes and cache them via the SCI protocol and interconnect.
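
To make the home-node concept concrete, here is a minimal Python sketch of such a boot-time mapping. It assumes a simple contiguous partitioning of the physical address space across nodes - the real mapping tables could have been arbitrary - and all names and constants are illustrative.

    LINE_SIZE = 64          # SCI coherency line size in bytes
    NUM_NODES = 8           # largest second generation configuration
    MEM_PER_NODE = 4 << 30  # 4GB of main memory per node

    def line_number(addr: int) -> int:
        """Index of the 64-byte memory line containing this physical address."""
        return addr // LINE_SIZE

    def home_node(addr: int) -> int:
        """Home node for an address, assuming contiguous per-node regions."""
        return (addr // MEM_PER_NODE) % NUM_NODES

    # Example: an address in the second 4GB region is homed on node 1.
    assert home_node(5 << 30) == 1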

One big challenge in using the SCI protocol was that it assumed an architecture where the memory and processors were located on separate nodes. In this model, home-nodes were passive and simply responded to protocol requests from remote nodes - the protocol didn't provide a means for home-nodes to obtain access to their own lines cached elsewhere in the system. Unfortunately, our system used a different architecture with each node containing both processors and memory which put it at odds with the standard SCI model. As a result, a workaround was needed in our system to allow home-nodes to obtain access to their own memory lines.

The chosen workaround was to essentially allow home-nodes to temporarily adopt multiple personalities. In order to regain access to a local memory line cached elsewhere in the system, a home-node would:

  1. Temporarily create a fictitious remote node persona
  2. Participate in the coherency protocol as a remote node to obtain the desired access rights
  3. Act as a remote-node performing an eviction to remove the fictitious-remote-persona from the coherency sharing-chain (SCI used a distributed linked list to identify the remote-nodes with cached copies of each line).

While this dual-personality solution worked, the need to simultaneously track both the home and fake-remote-persona protocol states significantly increased the complexity of our implementation (foreshadow alert!).
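
As a toy illustration of those three steps, the sketch below treats the sharing chain as a simple list of caching nodes: the home-node briefly appears in it under a fake persona and then removes itself. This is purely illustrative - the real protocol involved rich per-node state machines, not a shared list.

    # The distributed linked list ("sharing chain") of nodes caching a line,
    # reduced to a plain Python list for illustration.
    sharing_chain = ["Node1"]        # Node1 holds a modified copy of the line

    def home_node_gets_read_copy() -> str:
        persona = "Home(fake-remote)"
        sharing_chain.insert(0, persona)   # steps 1-2: join the chain as a remote reader
        # ... data is fetched, the line becomes shared, local request completes ...
        sharing_chain.remove(persona)      # step 3: evict the persona from the chain
        return "read-only copy"

    home_node_gets_read_copy()
    assert sharing_chain == ["Node1"]      # the fictitious persona is gone again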

Another important aspect of the system was the way that the per-line access-rights information was maintained. The Coherency ASIC was responsible for maintaining the SCI state (which indicated the access rights) for each line of local memory and remote memory cached in the L3 cache. Because of the amount of main memory (4GB per node) and L3 cache (512MB per node), the SCI state information was stored in an external DRAM attached to the Coherency ASIC. Storing the SCI information in this way presented a problem in that it would have taken too long to query it for each transaction on the processor bus (which was necessary to maintain coherency). To get around this problem, an SRAM attached to the FSB ASIC was used to cache the access-rights information for recently accessed lines (both local and remote). Using the SRAM, the FSB ASIC could quickly determine if a processor bus operation could be allowed to continue (node had sufficient access) or needed to be postponed (node didn't have sufficient access). Although the SRAM allowed the majority of processor bus operations to complete undelayed, it burdened the Coherency ASIC with the additional task of keeping the access-rights cached in SRAM consistent with the SCI tags (another foreshadow alert!).
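
The fast-path decision the SRAM enabled can be sketched in a few lines of Python. The rights encoding, the names, and the dictionary-as-SRAM are all assumptions for illustration; a miss or insufficient rights postponed the bus operation while the Coherency ASIC consulted the DRAM-backed SCI tags.

    from enum import Enum

    class Rights(Enum):
        NONE = 0       # no access: a valid copy may be modified elsewhere
        READ_ONLY = 1  # shared read copy present locally
        FULL = 2       # exclusive: no remote copies exist

    sram_rights: dict[int, Rights] = {}   # per-line rights for recently accessed lines

    def bus_op_allowed(line: int, needs_write: bool) -> bool:
        """Fast-path check for a processor bus operation against one line."""
        rights = sram_rights.get(line, Rights.NONE)   # treat a miss as "no access"
        if needs_write:
            return rights is Rights.FULL
        return rights in (Rights.READ_ONLY, Rights.FULL)

    sram_rights[0x100] = Rights.READ_ONLY
    assert bus_op_allowed(0x100, needs_write=False)      # read proceeds undelayed
    assert not bus_op_allowed(0x100, needs_write=True)   # write must be postponed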

With those additional explanations out of the way…

Our initial investigation into the failure revealed that the primary symptom was an inconsistency between the SCI and SRAM states for a local memory line. Specifically:

  • The SCI state indicated that the line was not cached anywhere else in the system - the local node had full access rights to the line.
  • The SRAM state indicated that the line was cached in a modified state elsewhere in the system - the local node had no access rights to the line.

Oh bother. This inconsistency was obviously incorrect and the ASICs were designed to generate a panic when it occurred to prevent data corruption (which access-rights information was correct??).
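
Expressed as a predicate (with invented boolean summaries of the two states), the illegal combination looks like this; the ASICs panicked precisely because neither answer could be trusted:

    def states_consistent(sci_no_remote_copies: bool, sram_no_access: bool) -> bool:
        # If SCI says no other node caches the line, the home node must have
        # full rights, so the SRAM cannot simultaneously claim "no access".
        return not (sci_no_remote_copies and sram_no_access)

    assert not states_consistent(True, True)   # the observed panic condition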

At first, we couldn't think of how such an inconsistency could occur. After much effort we managed to get a synthetic workload running in the lab that could induce the failure approximately once per day. However, because of the system's size (eight nodes), the large number of busses-of-interest (3 per node, 24 total), and a limitation of only having two logic analyzers, it was extremely difficult to get a clear picture of the failure. At best, and only if we were lucky, we could capture a trace of a couple of bus operations being performed against the failing line. It was clear that we stood little chance of capturing a definitive, complete trace of the failure - it was time to put on our thinking caps.

As luck would have it, this failure emerged at the end of a financial quarter and there was a substantial order being prepared for shipment. Management was unwilling to ship the order until the problem was resolved which meant there was a large amount of revenue at risk. All of a sudden this problem got upper-management's full attention - not good.

Given the problem's complexity and the substantial amount of executive attention, my mentor (the MicroKid) and I isolated ourselves in his office and began brainstorming how the state inconsistency could occur. After a while, we identified a couple of areas in the microcode that could lead to such a problem, but neither of us could think of a sequence of events to manifest it. After a lot of joint discussion, we decided to brainstorm separately to parallelize the effort.

For three days I sat in my cube listening to music, staring at trace fragments, and assembling in my mind all the possible interactions that could cause the state inconsistency we were observing. I don't think I have ever concentrated that hard on a single thing for that long at any other time in my life. Finally, everything "clicked" and I knew how the state inconsistency was occurring.

The failure involved a complex interaction between three nodes with one being the associated memory-line's home node. The following is a (simplified!) description of the failure:

  1. The initial condition was that the line was cached in a modified state in a remote node (Node1) and the home-node was attempting to get read access. To accomplish this task, the home-node created its fake-remote-persona and performed the necessary operations to obtain a read-copy of the line. Once it had a copy of the line, the home-node set the SRAM access rights to "read-only" and completed the local processor's request.
  2. In order to remove the fake-remote-persona from the SCI sharing chain, the home-node began an eviction operation to get rid of the now unneeded second personality by sending a request to Node1.
  3. Just as the home-node began performing the eviction operation for its fake-remote-persona, Node1 also began an operation to evict the line from its L3 cache. According to the SCI protocol, such eviction races were resolved by allowing the "older" remote node (Node1) to proceed ahead of the "younger" remote node (which in this case was the home-node's fake-persona).
  4. While the eviction race condition between the home-node and Node1 was being resolved, a third node (Node2) started an operation to obtain read access to the same line which completed successfully.
  5. Upon completing the eviction operation with Node1, the home-node detected that Node2 had obtained read-access to the line (Step 4) which meant that it now had to perform an eviction operation with Node2 in order to finish removing the fake-remote-persona from the SCI sharing chain. This required the home-node to downgrade the SRAM access rights to "no-access" to guard against potential SCI operations that could invalidate the local copy. The home-node then resumed its eviction effort by informing Node2 that it was leaving the sharing chain.
  6. Node2 received the home-node's fake-remote-persona's eviction request and sent a response.
  7. Immediately after it sent the response in step 6, Node2 began an operation to evict the line from its L3 cache.
  8. Due to an extremely unlikely sequence of events, Node2's eviction request from step 7 was received and processed by the home-node while Node2's response from step 6 was still in-flight. In order for this to have happened, a request had to bypass a response in the interconnect, which was highly improbable due to the priority given to sending responses over requests.

At the end of this sequence, the SCI state was correct in indicating that the line was not cached anywhere else in the system. The problem was that the microcode didn't explicitly check for the request-bypassing-response order of events that occurred in step 8. As a result, the microcode failed to update the SRAM information with the correct access rights thus leading to the problem. Over a decade later, I still get a brain cramp from thinking about this sequence of events.
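
As a toy reconstruction of the missed case (all names and structure here are hypothetical), consider the handler for a remote eviction request. Once the sharing chain empties, the home-node again has full rights and the SRAM must be upgraded to match; the original microcode only reached that update via the expected response-then-request ordering, so the bypass in step 8 left the SRAM stuck at "no-access" while SCI said "home only".

    state = {
        "sharing_chain": ["Node2"],   # after step 5: only Node2 still caches the line
        "sram_rights": "no-access",   # downgraded by the home-node in step 5
    }

    def handle_remote_eviction_request(node: str) -> None:
        state["sharing_chain"].remove(node)    # SCI side: unlink the evicting node
        if not state["sharing_chain"]:
            # No remote copies remain, so SCI now reads "home only" and the
            # SRAM must be restored to full rights. This is the update that
            # the request-bypassing-response ordering caused to be skipped.
            state["sram_rights"] = "full"

    handle_remote_eviction_request("Node2")
    assert state["sram_rights"] == "full"   # with the fix; the bug left "no-access"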

Once the microcode bug was identified it took less than five minutes to fix. Luckily, I figured out and fixed the problem on the Friday afternoon before the end of the quarter so the big order sitting on the docks was able to ship, the company got to recognize the revenue in the quarter, and the account's sales team got a hefty commission. For my efforts, my management compensated me with, wait for it, a cup of coffee. Seriously, after "saving" millions in revenue I got a $2.00 cup of coffee from my Director. Oh well, luckily I wasn't in the role for the money :-).

And that, to date, is the hardest problem that I have ever solved.

The following diagram summarizes the failure sequence described above. Note that to simplify the diagram I avoided using the actual SCI commands and states involved as their obscure nature would have required too much explanation. Recall that each lab trace contained, at-best, only one or two of the included operations. To solve the problem, I more or less had to imagine all of the possible interactions that could be occurring and then identify this sequence as a failing case.

Thursday, November 12, 2009

The Case of the Deadlocked Queues

Updated on 12.4.2009 with a diagram.

In a previous post I described my first industry job where I got to work on the coherency subsystem for an enterprise class ccNUMA server. For my "Bug Hunts" series of posts I thought I would describe a complex problem that I root-caused during pre-production testing.

During the alpha silicon testing, the systems started hanging under heavy load. Our initial investigations discovered that operations on the processor (FSB) and inter-ASIC (F2CBus) busses appeared to get stuck waiting for various pre-grant signals to activate. The situation looked bad: it appeared that we had a multi-ASIC deadlock on our hands.

In simple terms, the coherency subsystem on each node had two responsibilities:

  • accept requests from the local processors, send requests to other nodes to perform the necessary coherency & memory operations on the remote processor busses, and return responses to the local processors.
  • accept requests from other nodes on behalf of remote processors, perform the necessary coherency & memory operations on the local processor bus, and return responses to the other nodes for the remote processors.

To guarantee forward progress, the coherency subsystem had one golden rule - requests shall never block responses. At the transport layer this rule was enforced through the use of two independent virtual channels with priority given to the response channel. Within the ASICs, the golden rule was maintained by carefully designed scheduling and priority logic. Theoretically the golden rule should have prevented deadlocks from occurring - so much for theory.
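
The transport-layer half of the golden rule is easy to sketch: two independent channels with the response channel always drained first. Channel and message names here are assumptions for illustration.

    from collections import deque

    request_vc: deque = deque()    # virtual channel for requests
    response_vc: deque = deque()   # virtual channel for responses

    def next_message():
        """Scheduler: responses have strict priority over requests."""
        if response_vc:
            return response_vc.popleft()   # responses always make progress
        if request_vc:
            return request_vc.popleft()    # requests move only when no responses wait
        return None

    request_vc.append("read-request")
    response_vc.append("read-response")
    assert next_message() == "read-response"   # the response goes first

Inside the ASICs, of course, the rule also depended on the carefully designed scheduling and priority logic - which is exactly where things broke down.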

Since this was a real system, our view was limited to the activity observed on the busses - we had no visibility into what was happening inside the ASICs. To get around the visibility problem, I worked closely with a friend on the verification team to create a script that reproduced, in simulation, the activity observed in the lab. Through these simulation tests I gained an insight into what was happening inside the ASICs. In the meantime, I consulted with the ASIC designers to better understand some of the unexpected behaviors discovered during the investigation.

I can still remember well the meeting when everything suddenly "clicked" in my head. We had brought together both ASIC teams (which were geographically distributed) to discuss the problem and attempt an explanation. To focus the conversation, I asked a series of very specific "what if" questions that finally led to an "ah-ha" moment.

As with most complex failures, a number of smaller problems combined to cause the deadlock. The diagram included in my prior post may provide more context for the explanation that follows.

The first problem was that the FSB ASIC violated the golden rule by strictly ordering, in a single queue, requests from remote processors and responses to local processors. During the investigation it was discovered that this design change was made to resolve a race condition between evicting modified data from the L3 cache (accomplished via a remote processor request) and installing the replacing data into the cache (accomplished via a local processor response). Oops.

The second problem was much more subtle: there was no ordering rule between local and remote responses! The microsequencer inside the Coherency ASIC was responsible for processing all coherency requests, both from this and other nodes, and generating the associated responses. The microsequencer treated all out-bound responses equally and used a common queue to buffer them for transmission. This single out-bound response queue essentially created a cycle in the path of queues between the Coherency and FSB ASICs - a response blocked at the head of this queue would prevent all other microsequencer responses from making forward progress.

The third problem was that the microsequencer could issue more requests than it had buffers to queue out-bound responses. This meant that the microsequencer relied completely on out-bound responses making forward progress for it to be able to receive responses for the requests it issued.

Together, these three problems resulted in a deadlock when:

  1. The coherency microsequencer sent a burst of local processor responses to the FSB ASIC. These actions freed coherency engine resources but consumed FSB ASIC resources preventing it from accepting more responses.
  2. The coherency engine's response queue filled up with an FSB-ASIC-bound local processor response at the head of the queue. At this point, the resource consumption described in (1) prevented the response at the head of the queue from proceeding.
  3. Shortly thereafter, the FSB ASIC completed a number of Coherency ASIC requests and sent responses to the microsequencer. Because the microsequencer's out-bound response queue was blocked by (2), it couldn't send the remote responses generated by the FSB ASIC responses. This condition prevented the microsequencer from accepting more responses and caused the FSB ASIC's single request/response queue to stall.
  4. The local processor responses already in the FSB ASIC queue from (1) were unable to make progress due to the stall described in (3). As a result, the FSB ASIC could not accept more responses to alleviate the stall described in (2).

The result: neither ASIC could make progress until the other did - a classic deadlock.
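
Viewed abstractly, steps (1)-(4) form a cycle in a wait-for graph: each queue's head waits on space that only another stalled queue can free. A tiny sketch, with queue names invented to paraphrase the steps above:

    waits_on = {
        "coherency_response_queue": "fsb_shared_queue",         # (2) waits on FSB resources consumed in (1)
        "fsb_shared_queue": "microsequencer_responses",         # (4) stalled behind (3)
        "microsequencer_responses": "coherency_response_queue", # (3) blocked by (2)
    }

    def has_cycle(graph: dict) -> bool:
        """Each node has one outgoing edge, so just follow the chain."""
        for start in graph:
            seen, node = set(), start
            while node in graph:
                if node in seen:
                    return True
                seen.add(node)
                node = graph[node]
        return False

    assert has_cycle(waits_on)   # the deadlock shows up as a cycle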

In reality, the actual deadlock scenario was much more complicated than this - it required a complex sequence of events, specific event timings, and an interplay with non-coherent inter-node accesses (i.e. accesses to another node's PCI space). However, the simplified explanation above captures the salient points - in particular the need for an additional ordering rule between local and remote responses.

Once the problem was understood, the fix was relatively easy. Luckily we had programmatic control over the depth of various ASIC queues that allowed us to configure the system to avoid the deadlock condition without having to re-spin silicon - and then there was much rejoicing.

Although the deadlock had been solved the story didn't end there for me. I was so fascinated by this problem that I spent a significant amount of time over the next few months writing a program capable of analyzing a system of queues to discover such deadlocks. 10K lines of Perl code later, I had a program that took as input a queuing model expressed in a domain-specific language that I designed for the task. The model language was rich enough to represent such things as different kinds of tokens, tokens spawning new tokens, global token limits, conditional routing rules for tokens, and the sharing of limited resources. The program used this information to find patterns of tokens in queues that resulted in deadlock conditions. Although I didn't know it at the time, I essentially used a constraint propagation method to reduce the search space to a reasonable size. Using the program, I was able to "predict" the deadlock condition found in the lab and validate the configuration changes made to avoid it. Although the tool never got used again, the sense of personal accomplishment from developing it made the effort worthwhile.
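
The core idea behind such a tool can be shown in a drastically scaled-down Python sketch: model bounded queues with routing rules for tokens, enumerate reachable states, and flag states where tokens remain but none can move. The original's DSL, token spawning, global limits, and constraint propagation are all omitted; the queue names, token kinds, and state encoding are invented for illustration.

    CAPACITY = {"A": 1, "B": 1}           # queue name -> maximum depth
    ROUTE = {("A", "req"): "B",           # (queue, token kind) -> destination queue
             ("B", "rsp"): "A"}           # tokens with no route simply leave the system

    def moves(state):
        """Yield successor states: pop a head token, push it to its destination."""
        for qname, tokens in state.items():
            if not tokens:
                continue
            kind, rest = tokens[0], tokens[1:]
            dest = ROUTE.get((qname, kind))
            if dest is None:
                yield {**state, qname: rest}                       # token exits
            elif len(state[dest]) < CAPACITY[dest]:
                yield {**state, qname: rest, dest: state[dest] + (kind,)}

    def find_deadlocks(initial):
        """Depth-first search of reachable states for stuck configurations."""
        seen, stack, deadlocks = set(), [initial], []
        while stack:
            state = stack.pop()
            key = tuple(sorted(state.items()))
            if key in seen:
                continue
            seen.add(key)
            successors = list(moves(state))
            if not successors and any(state.values()):
                deadlocks.append(state)    # tokens present but nothing can move
            stack.extend(successors)
        return deadlocks

    # Two full queues whose head tokens each need space in the other: stuck.
    print(find_deadlocks({"A": ("req",), "B": ("rsp",)}))

A real analyzer would search from the system's idle state and prune aggressively (the constraint propagation mentioned above) to keep the state space tractable.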

My history with this deadlock finally came to an end when, for the third generation system, I redesigned the coherency engine to fix this and other troublesome bugs. The opportunity to fix this problem myself really brought a fulfilling sense of closure to the experience.

This deadlock bug is one of my favorite bug hunts of all time. It was a complex problem that required a deep understanding of the system, a holistic view of its behavior, and the ability to see how the many micro-behaviors interacted to cause the deadlock. Furthermore, it led to the simulation and ASIC re-design activities which were equally rewarding experiences.

This, however, wasn't the hardest bug that I ever root caused. That will be the topic of the next Bug Hunt post once I figure out how to describe it.

UPDATE

To clarify this post I decided to add a simple diagram of the deadlock condition. The actual failure was much more complex but hopefully this diagram captures the salient points from the description above - the impact of the shared queue for requests+responses and the need to route local and remote responses differently.

Friday, November 6, 2009

A Lucky Career Start

Updated on 12.4.2009 with a better diagram of the system's architecture

I have two more bugs in mind for my "bug hunt" series but before I can post about them I need to provide a little background information about my first real industry job.

After graduate school, I had the rare opportunity to help build a large-scale, enterprise-class ccNUMA server. At the time, 1996, ccNUMA architectures were leading-edge technology and ours was one of the first to utilize the Intel platform.

The system's architecture was based on "building block" nodes each containing four Pentium Pro processors, 4GB of main memory, and 12 PCI slots. Up to eight nodes could be connected together using a high-speed interconnect to create a system with 32 processors, 32GB of main memory, and 96 PCI slots. The system ran a proprietary version of UNIX carefully tuned to work efficiently on its specific architecture. At its maximum size the system consumed two 19-inch racks - big iron indeed.

The high-speed interconnect was based on the IEEE Scalable Coherent Interface (SCI) standard and was responsible for creating a single, cache coherent memory space out of the combined node resources. Through the interconnect, any processor could access the memory on any of the nodes - albeit at variable latency based on their relative locations in the system. In addition to the processor caches, each node contained a large capacity "L3" cache to store data fetched from other nodes to avoid, as much as possible, the significantly greater "remote" access latency. The SCI interconnect inter-operated with the processor and "L3" caches to ensure that accesses to cached data were handled correctly. Together, these capabilities are what made the system a cache-coherent non-uniform memory access (ccNUMA) architecture. SCI was a ring based protocol and our system used two counter-rotating rings to halve the average node-to-node latency, double the bandwidth, and provide some measure of redundancy (not fault-tolerance, just reboot level redundancy if a ring failed).
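
The latency claim for the counter-rotating rings is easy to sanity-check numerically: with one ring a message travels (j - i) mod N hops, while with two rings it can take the shorter direction. A quick Python check for an eight-node system (illustrative arithmetic only, ignoring protocol details):

    N = 8  # nodes on the ring

    one_ring  = [(j - i) % N for i in range(N) for j in range(N) if i != j]
    two_rings = [min((j - i) % N, (i - j) % N)
                 for i in range(N) for j in range(N) if i != j]

    print(sum(one_ring) / len(one_ring))     # 4.0 average hops with a single ring
    print(sum(two_rings) / len(two_rings))   # ~2.29 average hops with both rings

So the average node-to-node distance drops from 4.0 hops to about 2.29 - roughly the halving described above.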

The ccNUMA subsystem consisted of four custom ASICs, two designed in-house and the others jointly developed with a partner company. One of those ASICs was dedicated to maintaining the SCI coherency protocol and used a microsequencer and microcode combination to provide a programmable, high-performance coherency engine. The microsequencer utilized a VLIW-like instruction set to maximize parallel operation and minimize the control store's size.

I joined the project just after its start and my initial responsibilities were to become an SCI expert, be able to program the coherency microsequencer, and optimize the SCI coherency protocol implementation to match our machine's specific architecture. After completing those tasks ahead of schedule, I went on to do a wide variety of things including writing verification scripts for the coherency ASIC, implementing the BIOS responsible for initializing the SCI subsystem, serving as a systems engineer for the coherency subsystem, root-causing complicated bugs in all four ASICs, and writing a variety of programs to automatically identify failure root causes from simulation, emulation, and production logs. In the subsequent two follow on projects, my responsibilities continued to grow and culminated in an architect role for the third generation system.

As a computer-obsessed young engineer just starting a career this job was a dream come true. On a daily basis, I had direct interaction with ASICs, logic board design, exotic computer architectures, complicated caching protocols, low-level coding, and advanced operating systems. Stuff that I had only read about in textbooks was now literally at my fingertips to be studied, understood, and played with.

It was the people that I got to work with, though, that really made this job special. Everyone on the project shared an all-hands-on-deck, add-value-where-you-can, succeed-at-all-costs attitude that created an invigorating environment to work in. My principal mentor was a MicroKid from The Soul of a New Machine from whom I learned a great, great deal (and still do!). Thanks to his guidance and support I got to "touch" nearly every part of the system and felt encouraged to continuously take on new challenges. To this day I still recall the feeling of endless possibilities and excitement of knowing that my skills were improving daily.

Truly a lucky way to start my career.

UPDATE

To clarify this post, I decided to add a simplified diagram of the system's architecture. The actual architecture was more complicated, for example some of the ASICs depicted were actually multiple devices, but the diagram should sufficiently convey the overall design.