Track 2 Session 5
9:10 to 10:10 a.m. Wednesday
June 10, 2009
How Statistical Modeling of Field Reliability Failures Guided
Efforts on Problem Resolution
In 1999, Sun Microsystems began experiencing significant
field failures in new servers. A major effort ensued to identify the cause of these
unexpected failures: extensive testing, failure analysis of returned circuit
boards, replacement with new parts, visits to customer sites, data collection,
consultation with suppliers, and frequent meetings of engineers and
management. Progress in understanding and isolating the root cause of the
failing e-cache memory was slow. Boards returned from the field, some of which had been
damaged in transit, were tested, and typically no trouble was found (NTF).
In early 2000, analysis of field reliability data acquired
from a large datacenter showed that a homogeneous Poisson process model fit the data
extremely well. Model predictions were remarkably consistent with the eventual results.
The assumptions inherent in the model, in turn, led to an understanding of the root cause
of the failures. The source was identified and the necessary resolution steps were
initiated. This presentation will discuss the statistical basis of the model, its
agreement with the data, the consequences and implications of the model, and the steps
taken to remediate the problem. We will also describe the failure mechanism, the final
solution that completely eliminated the failures, and the benefits in improved
company-wide reliability that resulted from these efforts.
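As background on the model named above: a homogeneous Poisson process (HPP) assumes a constant failure rate, so the number of failures in any interval is Poisson distributed with mean equal to rate times exposure. The following is a minimal sketch of that idea only, not the analysis from the presentation. The monthly counts, the fleet size, and all function names are hypothetical values chosen for illustration.

```python
import math
import statistics


def hpp_rate_mle(total_failures, total_exposure):
    """Maximum-likelihood estimate of a constant HPP failure rate.

    total_exposure is accumulated system-time (e.g. system-months).
    """
    if total_exposure <= 0:
        raise ValueError("total_exposure must be positive")
    return total_failures / total_exposure


def poisson_pmf(k, mean):
    """Probability of exactly k failures when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_failures(rate, n_systems, duration):
    """Expected failure count for a fleet observed over `duration`."""
    return rate * n_systems * duration


# Hypothetical data (assumption): failures per month across a fixed fleet.
monthly_failures = [4, 6, 5, 3, 7, 5]
n_systems = 1000  # hypothetical fleet size

# Rate in failures per system-month.
rate = hpp_rate_mle(sum(monthly_failures), n_systems * len(monthly_failures))

# Expected failures across the fleet in the next month under the HPP.
next_month = expected_failures(rate, n_systems, 1)

# Crude consistency check: for Poisson counts the variance-to-mean ratio
# should be near 1; values far above 1 suggest clustering or a trend.
dispersion = statistics.variance(monthly_failures) / statistics.mean(monthly_failures)

print(f"rate per system-month: {rate:.4f}")
print(f"expected failures next month: {next_month:.1f}")
print(f"P(no failures next month): {poisson_pmf(0, next_month):.4f}")
print(f"dispersion index: {dispersion:.2f}")
```

A constant rate implies that failures show no trend with system age or calendar time, which is the kind of model assumption that can point toward or away from particular failure mechanisms.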
Key Words: Reliability of Repairable Systems, Statistical Modeling,
Homogeneous Poisson Process, e-Cache Memory Failures
David C. Trindade
Sun Microsystems, Inc.
San Jose, California