Failure prediction is one of the key challenges that must be mastered for a new area of fault tolerance techniques: the proactive handling of faults. A prediction is a statement about what will or might happen in the future. A failure is defined as “an event that occurs when the delivered service deviates from correct service.” The main point is that a failure refers to misbehavior that can be observed by the user, who may be either a human or another computer system. Things may go wrong inside the system, but as long as they do not result in incorrect output (including the case of no output at all), there is no failure. Failure prediction is therefore about assessing the risk of failure for some time in the future. In my approach, failures are predicted by analyzing error events that have occurred in the system. Because not all events that have ever occurred can be processed, only events within a time interval called the embedding time are used. Failure probabilities are computed not for a single point in time in the future, but for a time interval called the prediction interval.
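The two time intervals can be illustrated with a minimal sketch. It only shows how events might be selected relative to the current time; it is not the prediction model itself, which is not specified here. The function names, the use of sorted numeric timestamps, and the choice to start the prediction interval immediately at the current time are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def embedding_window(error_times, now, embedding_time):
    """Return the error-event timestamps in [now - embedding_time, now].

    Assumes error_times is sorted in ascending order (hypothetical input format).
    """
    lo = bisect_left(error_times, now - embedding_time)
    hi = bisect_right(error_times, now)
    return error_times[lo:hi]

def failure_in_prediction_interval(failure_times, now, prediction_interval):
    """Return True if any failure falls in (now, now + prediction_interval].

    Assumption: the prediction interval starts right at `now`.
    Assumes failure_times is sorted in ascending order.
    """
    lo = bisect_right(failure_times, now)
    hi = bisect_right(failure_times, now + prediction_interval)
    return hi > lo

# Hypothetical timestamps in seconds.
error_times = [10, 40, 95, 100, 130]
failure_times = [150, 400]

# Error events available to the predictor at time 100 with a 60 s embedding time.
recent_errors = embedding_window(error_times, now=100, embedding_time=60)

# Ground truth for that prediction: does a failure occur within the next 60 s?
failure_ahead = failure_in_prediction_interval(failure_times, now=100, prediction_interval=60)
```

In this sketch, `recent_errors` is what a predictor would be allowed to look at, and `failure_ahead` is the outcome whose probability it would try to estimate.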
Part of the book: Failure Analysis and Prevention