The Difficulty with Software
This section introduces the particular difficulties that software, and complexity, bring. It is partly
an extract from B. Littlewood et al., The use of computers in safety-critical applications: Final report of
the study group on the safety of operational computer systems, HSE Books, ISBN 0 7176 1620 7, 1998.
What makes computer systems special?
One of the most notable differences between most conventional mechanical and electrical systems, and computer systems, lies in
the essentially discontinuous behaviour of discrete logic, and in particular of software[1].
This shows itself in two main ways.
In the first place, when a computer system is tested on a particular input and found to work correctly, one cannot always be
certain that it will work on any other input, even one that is 'close' to the one tested. In other branches of engineering, on
the other hand, one can usually assume continuity. So, for example, if a structure such as a bridge survives a load of 10 tonnes,
it is often reasonable to claim that it would survive loads less than this. Because of the discrete, logical nature of software
there are usually no simple equivalents of 'load' and 'less than'. This weakens the claims that can be made from software testing,
since it does not allow us to extrapolate with certainty from a small set of successful tests to infer that the software will perform
correctly elsewhere. Of course, one usually cannot test all possible inputs, since the number of these is almost invariably astronomically
large.
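The loss of continuity can be illustrated with a minimal sketch. The function below is hypothetical and not taken from the report: it is assumed to compute the midpoint of two unsigned 16-bit values, and it contains a deliberate fault in which the sum is truncated to 16 bits before halving. Two inputs that are as 'close' as they can be give a correct and an incorrect result respectively, so a successful test on one says little about the other.

```python
def midpoint_u16(a: int, b: int) -> int:
    """Midpoint of two unsigned 16-bit values (hypothetical example).

    Deliberate design fault: the sum is truncated to 16 bits before
    halving, mimicking fixed-width arithmetic that silently wraps.
    """
    return ((a + b) & 0xFFFF) // 2


def reference_midpoint(a: int, b: int) -> int:
    """Fault-free reference, used only to judge the result."""
    return (a + b) // 2


passing_case = (32767, 32768)  # sum is 65535, which just fits in 16 bits
failing_case = (32768, 32768)  # sum is 65536, which wraps to 0

# The first case agrees with the reference; its nearest neighbour does not.
passes = midpoint_u16(*passing_case) == reference_midpoint(*passing_case)
fails = midpoint_u16(*failing_case) != reference_midpoint(*failing_case)

# Number of distinct input pairs for two 16-bit arguments.
input_pairs = (2 ** 16) ** 2

print(passes, fails, input_pairs)
```

Even this two-argument function has more than four thousand million possible input pairs, and every test in which the sum stays below 65536 passes while revealing nothing about the region in which the fault lies.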
The second problem arises when software is changed. In conventional engineering it is often possible to justify a claim that a change
to a system will be certain to be beneficial because there is usually well established theoretical and empirical evidence that can be used to
analyse the impacts on other parts of the system. For example, if a component in the bridge is replaced by one that is stronger but otherwise
identical, it is often reasonable to claim that the overall structure will have been strengthened (or at least not made weaker) - i.e. its
tendency to failure will have decreased. (Caution is required, of course, since impact analysis may show that such a change may transfer
stresses elsewhere and cause another component to fail earlier, thus decreasing the overall reliability of the bridge.) One generally cannot
so easily have confidence in the benevolence of a software change, and there are many well-attested examples of software changes whose impacts
were not properly analysed and which had catastrophic effects upon system behaviour. For example, a change made to the software in telephone
switches in the US several years ago was regarded as sufficiently 'small' as not to require testing: in fact, it contained a fault that brought down
the long-distance telephone system of the Eastern seaboard for several hours.
Similar remarks apply even to those software changes that are merely
intended to remove particular faults that have been identified in a program. From a reliability point of view, the concept of replacing a failed
component by another one that is known to be working (but otherwise identical) is generally familiar, and in these circumstances it is reasonable
to assume that the structure will be restored to the state it was in before the component failed. In particular, the repaired structure will be as
reliable as it was before it failed. The nearest analogy to this in software is that of reloading a piece of software which has been corrupted -
we can assume that the reloaded software will be as reliable (or otherwise) as the original load. Software 'maintenance', on the other hand, refers
to changes in the software design that are either intended to correct a fault, or are in response to a change in the specification because the
software does not perform as the user wishes. This is a completely different set of circumstances from the replacement of an item of failed hardware,
which merely recovers the original system functionality. The removal of a software fault constitutes a change to the design of the system. It is
therefore much harder, and sometimes impossible, to be certain that such a change has not introduced a new fault which has made the program less
reliable than it was before: there are many recorded examples of 'small' changes, supposedly just fixing faults, causing serious reductions in
reliability.
These difficulties associated with software changes are of concern not only to the designers of these systems, whose main aim is
the achievement of reliability, but also to those with a responsibility for assessing software reliability and its impact upon plant safety.
Thus every change must be analysed to establish that it is correct and that its impact on all aspects of the system is understood. In principle, if not
always in practice, the disciplines and tools that software engineering provides for change control (see Section 7) can ensure that the required
analysis is performed and the necessary understanding obtained. However, in many cases the only safe course is to treat a program that has been
changed as if it were a new program, for which evaluation of reliability or safety must begin afresh.
It is the nature of software that failures
of a program can only occur as a result of design faults[2] - what are commonly called bugs. Non-software-based systems can also suffer from design
faults, of course, and a recent study of failures and accidents in pipework and vessels under the Seveso directive found that 23% of accidents were
due to design faults (a further 3% were due to inadequate human factors review).
A particularly interesting class of software design fault is the
so-called 'millennium bug'. Many programs will not be able to handle the transition between the years 1999 and 2000 because they represent the year
by using only two digits - they might, for example, treat 2000 as if it were 1900 with potentially catastrophic consequences. What is striking about
the problem is its widespread nature: since representation of time is needed in most programs, a large proportion of all programs is likely to be
affected. The cost of checking to determine whether software is susceptible to this problem, and of making appropriate changes, is reported to be
billions of dollars world-wide.
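The fault can be sketched in a few lines. The function names below are hypothetical, and the code assumes a program that stores only the last two digits of the year and always expands them into the twentieth century.

```python
def expand_year(yy: int) -> int:
    """Expand a two-digit year, assuming the century is always 19xx.

    This assumption is the design fault: 00 is read as 1900, not 2000.
    """
    return 1900 + yy


def interval_years(start_yy: int, end_yy: int) -> int:
    """Years elapsed between two dates stored as two-digit years."""
    return expand_year(end_yy) - expand_year(start_yy)


# From 1965 to 1999 the calculation is correct.
before_rollover = interval_years(65, 99)

# From 1965 to 2000 the end year is stored as 00 and read as 1900.
after_rollover = interval_years(65, 0)

print(before_rollover, after_rollover)
```

The program behaves correctly for every end year up to 1999, so no amount of operating experience before the transition would have revealed the fault.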
Design faults pose particular difficulties to those responsible for building and assessing safety-critical
computer systems. Although good design practices might be expected to minimise the number of faults that find their way into a program,
there are no general procedures that allow us to avoid them completely. Indeed, since software systems often are designed to provide much
more complex functionality than the conventionally engineered systems they replace, they are more prone to design faults. We must therefore
assume that any program that is of a reasonable size will contain bugs.[3]
Given that any program will contain bugs, a major concern is
the unpredictability of the outcome when one of these is triggered. Traditional hardware systems embody much of their functionality within
the components which comprise the systems. The failure modes of these components are relatively few and can be dependably predicted -
hence the system impacts can be analysed a priori. It is also relatively easy to devise tests which can prove the satisfactory functioning
of these components within a system prior to permitting its operational usage. When equivalent functionality is invested in software, the
possible failure modes become extremely difficult to bound, and hence to test. Similarly, complete analysis of the possible failure modes
in terms of their expected effects upon the overall system is not practicable.
Finding and removing design faults in software is difficult
for reasons that are given above. Error detection and recovery provisions at some higher level of the system can help to mitigate the effects
of residual design faults, but achieving a level of fault tolerance that will entirely mask the effects of such faults is much more difficult.
Clearly, since any design fault will be reproduced in identical copies of a program, simple notions of redundancy based on component replication,
such as those used to protect against random failures of hardware systems, cannot be applied. Software diversity - where two or more versions of
a program are developed 'independently' and their output adjudicated at execution time - has been used with some success in several industries to
achieve a useful degree of fault masking. However, there is considerable evidence from several experiments and from recent theoretical work that
the benefits fall far short of what could be expected had statistical independence of failures of the versions been achieved. Since independence
cannot be claimed, it is necessary for the assessor to measure the degree of dependence that is present in order to evaluate the reliability of
the fault-tolerant diverse system.[4] There are no agreed mechanisms for assessing the degree of dependence of two pieces of software, other than
testing them back-to-back and looking for coincident failures: if the aim is to demonstrate the achievement of a specific reliability, this will
require as many tests as would be needed for a single version.
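The gap between independent and dependent failure can be made concrete with a small back-to-back comparison. Everything in the sketch below is an assumption made for illustration: the input space is a finite range in which each input is equally likely, the two 'versions' are toy functions, and their faults have been placed in overlapping regions by hand. A coincident failure is taken here to mean that both versions give a wrong result on the same input.

```python
from fractions import Fraction

DOMAIN = range(1000)  # hypothetical input space, each input equally likely


def oracle(x: int) -> int:
    """Correct result, used only to judge the two versions."""
    return 2 * x


def version_a(x: int) -> int:
    """First version: deliberately faulty for inputs 0..9."""
    return 2 * x + 1 if x < 10 else 2 * x


def version_b(x: int) -> int:
    """Second version: deliberately faulty for inputs 5..14."""
    return x + x - 1 if 5 <= x < 15 else x + x


# Back-to-back run: record the inputs on which each version is wrong.
fail_a = {x for x in DOMAIN if version_a(x) != oracle(x)}
fail_b = {x for x in DOMAIN if version_b(x) != oracle(x)}

p_a = Fraction(len(fail_a), len(DOMAIN))
p_b = Fraction(len(fail_b), len(DOMAIN))

# Observed probability that both versions fail on the same input.
p_both = Fraction(len(fail_a & fail_b), len(DOMAIN))

# What would be claimed if the failures were statistically independent.
p_if_independent = p_a * p_b

print(p_a, p_b, p_both, p_if_independent)
```

In this contrived case each version fails on 1 input in 100, independence would predict coincident failure on 1 input in 10,000, and the observed figure is 1 in 200. The exhaustive run is only possible because the input space is tiny; for a real system the same comparison would have to rest on a sample of tests.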
Finally, we must mention that software-based systems pose some of their most
challenging intellectual problems to their designers in coping with tight timing constraints and concurrent activities of a number of inter-connected
computers. Not only is it very hard to solve these problems, but convincing an independent assessor that they have been solved, and thus will not be
a source of failure in operation, can itself be an immense technical challenge.
Software state space
The behaviour of software-based systems is often conceptualised in terms of the program state space:

The state space is defined from the program internal state and the state of the input vectors to the program. Execution of the program on inputs
presented by the environment can be seen as defining trajectories in the state space. Many systems can themselves lead to changes in the
environment and provide coupling between state space instances. Defects in the program are represented by regions - often of complex geometry -
that produce erroneous states and can subsequently lead to other failed states. In general the state space is characterised by two important features: