Reliability and Maintainability Symposium: ARS, North America

Track 2 Session 5
9:10 to 10:10 a.m. Wednesday June 10, 2009

How Statistical Modeling of Field Reliability Failures Guided Efforts on Problem Resolution

In 1999, Sun Microsystems began experiencing significant field failures in new servers. A major effort ensued to identify the cause of these unexpected failures: extensive testing, failure analysis of returned circuit boards, replacement with new parts, visits to customer sites, data collection, consultation with suppliers, and frequent team meetings of engineers and management. Progress in understanding and isolating the root cause of the failing e-cache memory was slow. Boards returned from the field, some of which had been damaged in transit, were tested, and typically no trouble could be found (NTF).

In early 2000, analysis of field reliability data acquired from a large datacenter showed that a homogeneous Poisson process (HPP) model fit the data extremely well. Model predictions were remarkably consistent with the eventual results. Furthermore, the assumptions inherent in the model led to an understanding of the root cause of the failures. The source was identified and the necessary resolution steps were initiated. This presentation will discuss the statistical basis of the model, its agreement with the data, the consequences and implications of the model, and the steps taken to remediate the problem. We will also describe the failure mechanism, the final solution that completely eliminated the failures, and the benefits in improved company-wide reliability that resulted from these efforts.
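
For readers unfamiliar with the model, the following is a minimal Python sketch of what fitting a homogeneous Poisson process to field data can look like. It is an illustration only, not the analysis used in the presentation. It assumes failures across a pool of systems occur at a constant rate, so the count in any exposure window is Poisson distributed with mean equal to rate times exposure. The function names and the numbers in the usage lines are hypothetical and are not taken from the Sun data.

```python
import math


def hpp_rate_mle(n_failures, total_system_time):
    """Maximum-likelihood estimate of the constant failure rate of a
    homogeneous Poisson process: observed failures divided by the
    cumulative operating time summed over all systems."""
    if total_system_time <= 0:
        raise ValueError("total_system_time must be positive")
    return n_failures / total_system_time


def expected_failures(rate, system_time):
    """Expected number of failures over a given exposure under the model."""
    return rate * system_time


def poisson_pmf(k, mean):
    """Probability of exactly k failures when the Poisson mean is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_most(k, mean):
    """Probability of k or fewer failures (Poisson cumulative distribution)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


# Hypothetical numbers, for illustration only (not from the presentation):
# 12 failures observed over 400 system-months of operation.
rate = hpp_rate_mle(12, 400.0)
mean_next = expected_failures(rate, 100.0)  # forecast for 100 more system-months
print(rate, mean_next, prob_at_most(1, mean_next))
```

Comparing such forecasts with the failure counts actually observed later is one simple way to judge whether a constant-rate model agrees with the data.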

Key Words: Reliability of Repairable Systems, Statistical Modeling, Homogeneous Poisson Process, e-Cache Memory Failures

David C. Trindade
Sun Microsystems, Inc.
San Jose, California


Organized by ReliaSoft Corporation