Brief Excerpts from Warren Weaver’s Introduction to:
Claude Shannon’s The Mathematical Theory of Communication


2.2. Information

The word information, in this theory, is used in a special sense that must not be confused with its ordinary usage. In particular, information must not be confused with meaning.

In fact, two messages, one of which is heavily loaded with meaning and the other of which is pure nonsense, can be exactly equivalent, from the present viewpoint, as regards information. It is this, undoubtedly, that Shannon means when he says that the semantic aspects of communication are irrelevant to the engineering aspects. But this does not mean that the engineering aspects are necessarily irrelevant to the semantic aspects.

To be sure, this word information in communication theory relates not so much to what you do say, as to what you could say.

That is, information is a measure of one’s freedom of choice when one selects a message. If one is confronted with a very elementary situation where he has to choose one of two alternative messages, then it is arbitrarily said that the information, associated with this situation, is unity. Note that it is misleading (although often convenient) to say that one or the other message conveys unit information. The concept of information applies not to the individual messages (as the concept of meaning would), but rather to the situation as a whole, the unit information indicating that in this situation one has an amount of freedom of choice, in selecting a message, which it is convenient to regard as a standard or unit amount.

The two messages between which one must choose, in such a selection, can be anything one likes. One might be the text of the King James Version of the Bible, and the other might be Yes. The transmitter might code these two messages so that zero is the signal for the first and one the signal for the second; or so that a closed circuit (current flowing) is the signal for the first, and an open circuit (no current flowing) the signal for the second. Thus the two positions, closed and open, of a simple relay, might correspond to the two messages.

To be somewhat more definite, the amount of information is defined, in the simplest cases, to be measured by the logarithm of the number of available choices. It being convenient to use logarithms to the base 2, rather than common or Briggs’ logarithm to the base 10, the information, when there are only two choices, is proportional to the logarithm of 2 to the base 2. But this is unity; so that a two-choice situation is characterized by information of unity, as has already been stated above. This unit of information is called a bit, this word, first suggested by John W. Tukey, being a condensation of binary digit. When numbers are expressed in the binary system there are only two digits, namely 0 and 1; just as ten digits, 0 to 9 inclusive, are used in the decimal number system which employs 10 as a base. Zero and one may be taken symbolically to represent any two choices, as noted above; so that binary digit or bit is natural to associate with the two-choice situation which has unit information.

If one has available say 16 alternative messages among which he is equally free to choose, then since 16 = 2^4, so that log2 16 = 4, one says that this situation is characterized by 4 bits of information.
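As a minimal sketch of this arithmetic (assuming only that the alternatives are equally likely), the same numbers can be computed directly in a few lines of Python:

    import math

    def bits_of_information(num_choices):
        # Information of a situation with equally likely alternatives,
        # measured by the logarithm (to the base 2) of the number of choices.
        return math.log2(num_choices)

    print(bits_of_information(2))    # two alternative messages -> 1.0 bit
    print(bits_of_information(16))   # sixteen alternative messages -> 4.0 bits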

It doubtless seems queer, when one first meets it, that information is defined as the logarithm of the number of choices. But in the unfolding of the theory, it becomes more and more obvious that logarithmic measures are in fact the natural ones. At the moment, only one indication of this will be given. It was mentioned above that one simple on-or-off relay, with its two positions labeled, say, 0 and 1 respectively, can handle a unit information situation, in which there are but two message choices. If one relay can handle unit information, how much can be handled by say three relays? It seems very reasonable to want to say that three relays could handle three times as much information as one. And this indeed is the way it works out if one uses the logarithmic definition of information. For three relays are capable of responding to 2^3 or 8 choices, which symbolically might be written as 000, 001, 011, 010, 100, 110, 101, 111, in the first of which all three relays are open, and in the last of which all three relays are closed. And the logarithm to the base 2 of 2^3 is 3, so that the logarithmic measure assigns three units of information to this situation, just as one would wish. Similarly, doubling the available time squares the number of possible messages, and doubles the logarithm; and hence doubles the information if it is measured logarithmically.
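The additivity that makes the logarithmic measure natural can be sketched the same way; the three relays and the squaring of the message count are taken from the paragraph above, while the code itself is merely illustrative:

    import itertools
    import math

    # Three on-or-off relays: enumerate every joint configuration.
    configurations = list(itertools.product("01", repeat=3))
    print(len(configurations))             # 8 configurations: 000, 001, ..., 111
    print(math.log2(len(configurations)))  # 3.0 bits, one per relay

    # Doubling the available length squares the number of possible messages,
    # which doubles the logarithmic measure of information.
    n_short, n_long = 8, 8 ** 2
    print(math.log2(n_short), math.log2(n_long))  # 3.0 and 6.0 bits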

Now let us return to the idea of information. When we have an information source which is producing a message by successively selecting discrete symbols (letters, words, musical notes, spots of a certain size, etc.), the probability of choice of the various symbols at one stage of the process being dependent on the previous choices (i.e., a Markoff process), what about the information associated with this procedure?
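Before turning to that question, a minimal sketch of such a source may help; the two-symbol alphabet and the transition probabilities below are assumed purely for illustration:

    import random

    # Assumed transition probabilities: P(next symbol | previous symbol).
    transitions = {
        "A": {"A": 0.9, "B": 0.1},
        "B": {"A": 0.4, "B": 0.6},
    }

    def markoff_source(length, start="A"):
        # Successively select discrete symbols, each choice depending on
        # the symbol chosen at the previous stage.
        symbol, message = start, []
        for _ in range(length):
            symbols, weights = zip(*transitions[symbol].items())
            symbol = random.choices(symbols, weights=weights)[0]
            message.append(symbol)
        return "".join(message)

    print(markoff_source(30))   # e.g. a run such as AAAAABBBAAAAAABBAA...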

The quantity which uniquely meets the natural requirements that one sets up for information turns out to be exactly that which is known in thermodynamics as entropy. It is expressed in terms of the various probabilities involved--those of getting to certain stages in the process of forming messages, and the probabilities that, when in those stages, certain symbols be chosen next. The formula, moreover, involves the logarithm of probabilities, so that it is a natural generalization of the logarithmic measure spoken of above in connection with simple cases.
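In the simplest case, where successive symbols are chosen independently with probabilities p_i, the quantity in question takes the form H = - sum of p_i log2 p_i; a brief sketch, with assumed probabilities:

    import math

    def entropy(probabilities):
        # H = -sum(p * log2(p)): the information, in bits per symbol, of a
        # source whose symbols occur independently with these probabilities.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))     # 1.0 bit: the two-choice situation above
    print(entropy([0.9, 0.1]))     # about 0.47 bits: much less freedom of choice
    print(entropy([0.25] * 4))     # 2.0 bits: four equally likely symbols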

To those who have studied the physical sciences, it is most significant that an entropy-like expression appears in the theory as a measure of information. Introduced by Clausius nearly one hundred years ago, closely associated with the name of Boltzmann, and given deep meaning by Gibbs in his classic work on statistical mechanics, entropy has become so basic and pervasive a concept that Eddington remarks, "The law that entropy always increases - the second law of thermodynamics - holds, I think, the supreme position among the laws of Nature."

In the physical sciences, the entropy associated with a situation is a measure of the degree of randomness, or of shuffledness if you will, in the situation; and the tendency of physical systems to become less and less organized, to become more and more perfectly shuffled, is so basic that Eddington argues that it is primarily this tendency which gives time its arrow - which would reveal to us, for example, whether a movie of the physical world is being run forward or backward.

Thus when one meets the concept of entropy in communication theory, he has a right to be rather excited -- a right to suspect that one has hold of something that may turn out to be basic and important. That information be measured by entropy is, after all, natural when we remember that information, in communication theory, is associated with the amount of freedom of choice we have in constructing messages. Thus for a communication source one can say, just as he would also say it of a thermodynamic ensemble, "This situation is highly organized, it is not characterized by a large degree of randomness or of choice -- that is to say, the information (or the entropy) is low." We will return to this point later, for unless I am quite mistaken, it is an important aspect of the more general significance of this theory.

Having calculated the entropy (or the information, or the freedom of choice) of a certain information source, one can compare this to the maximum value this entropy could have, subject only to the condition that the source continue to employ the same symbols. The ratio of the actual to the maximum entropy is called the relative entropy of the source. If the relative entropy of a certain source is, say, .8, this roughly means that this source is, in its choice of symbols to form a message, about 80 per cent as free as it could possibly be with these same symbols. One minus the relative entropy is called the redundancy. This is the fraction of the structure of the message which is determined not by the free choice of the sender, but rather by the accepted statistical rules governing the use of the symbols in question. It is sensibly called redundancy, for this fraction of the message is in fact redundant in something close to the ordinary sense; that is to say, this fraction of the message is unnecessary (and hence repetitive or redundant) in the sense that if it were missing the message would still be essentially complete, or at least could be completed.
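A brief sketch of this ratio, assuming a small four-symbol source with made-up frequencies (the entropy helper repeats the one sketched above):

    import math

    def entropy(probabilities):
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # Assumed source: four symbols used with these (unequal) frequencies.
    probabilities = [0.5, 0.25, 0.15, 0.10]

    actual_entropy = entropy(probabilities)
    maximum_entropy = math.log2(len(probabilities))   # all four symbols equally likely
    relative_entropy = actual_entropy / maximum_entropy
    redundancy = 1 - relative_entropy

    print(round(relative_entropy, 2))   # about 0.87: 87 per cent as free as possible
    print(round(redundancy, 2))         # about 0.13: 13 per cent redundant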

It is most interesting to note that the redundancy of English is just about 50 per cent, so that about half of the letters or words we choose in writing or speaking are under our free choice, and about half (although we are not ordinarily aware of it) are really controlled by the statistical structure of the language.

Apart from more serious implications, which again we will postpone to our final discussion, it is interesting to note that a language must have at least 50 per cent of real freedom (or relative entropy) in the choice of letters if one is to be able to construct satisfactory crossword puzzles. If it has complete freedom, then every array of letters is a crossword puzzle. If it has only 20 per cent of freedom, then it would be impossible to construct crossword puzzles in such complexity and number as would make the game popular. Shannon has estimated that if the English language had only about 30 per cent redundancy, then it would be possible to construct three-dimensional crossword puzzles.

2.5. Noise

How does noise affect information? Information is, we must steadily remember, a measure of one’s freedom of choice in selecting a message. The greater this freedom of choice, and hence the greater the information, the greater is the uncertainty that the message actually selected is some particular one. Thus greater freedom of choice, greater uncertainty, greater information go hand in hand.

If noise is introduced, then the received message contains certain distortions, certain errors, certain extraneous material, that would certainly lead one to say that the received message exhibits, because of the effects of the noise, an increased uncertainty. But if the uncertainty is increased, the information is increased, and this sounds as though the noise were beneficial!

It is generally true that when there is noise, the received signal exhibits greater information--or better, the received signal is selected out of a more varied set than is the transmitted signal. This is a situation which beautifully illustrates the semantic trap into which one can fall if he does not remember that information is used here with a special meaning that measures freedom of choice and hence uncertainty as to what choice has been made. It is therefore possible for the word information to have either good or bad connotations. Uncertainty which arises by virtue of freedom of choice on the part of the sender is desirable uncertainty. Uncertainty which arises because of errors or because of the influence of noise is undesirable uncertainty.

It is thus clear where the joker is in saying that the received signal has more information. Some of this information is spurious and undesirable and has been introduced via the noise. To get the useful information in the received signal we must subtract out this spurious portion.

However clever one is with the coding process, it will always be true that after the signal is received there remains some undesirable (noise) uncertainty about what the message was; and this undesirable uncertainty--this equivocation--will always be equal to or greater than H(x) - C. Furthermore, there is always at least one code which is capable of reducing this undesirable uncertainty, concerning the message, down to a value which exceeds H(x) - C by an arbitrarily small amount.

The most important aspect, of course, is that the minimum undesirable or spurious uncertainties cannot be reduced further, no matter how complicated or appropriate the coding process. This powerful theorem gives a precise and almost startlingly simple description of the utmost dependability one can ever obtain from a communication channel which operates in the presence of noise.

One practical consequence, pointed out by Shannon, should be noted. Since English is about 50 per cent redundant, it would be possible to save about one-half the time of ordinary telegraphy by a proper encoding process, provided one were going to transmit over a noiseless channel. When there is noise on a channel, however, there is some real advantage in not using a coding process that eliminates all of the redundancy. For the remaining redundancy helps combat the noise. This is very easy to see, for just because of the fact that the redundancy of English is high, one has, for example, little or no hesitation about correcting errors in spelling that have arisen during transmission.
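A minimal sketch of how retained redundancy combats noise, using a simple repetition code rather than anything Shannon proposes, with an assumed 10 per cent error rate:

    import random

    def noisy_channel(bits, error_probability=0.1):
        # Flip each transmitted bit independently with the given probability.
        return [bit ^ 1 if random.random() < error_probability else bit for bit in bits]

    def encode(bits, repeat=3):
        # Deliberate redundancy: send every bit of the message three times.
        return [bit for bit in bits for _ in range(repeat)]

    def decode(received, repeat=3):
        # A majority vote over each group of repeats corrects isolated errors.
        groups = (received[i:i + repeat] for i in range(0, len(received), repeat))
        return [1 if sum(group) > repeat // 2 else 0 for group in groups]

    message = [random.randint(0, 1) for _ in range(1000)]
    without_redundancy = noisy_channel(message)
    with_redundancy = decode(noisy_channel(encode(message)))
    print(sum(a != b for a, b in zip(message, without_redundancy)))  # roughly 100 errors
    print(sum(a != b for a, b in zip(message, with_redundancy)))     # typically under 40

The repetition code is wasteful compared with the codes Shannon's theorem guarantees; it is used here only because the mechanism is easy to see.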

This is a theory so general that one does not need to say what kinds of symbols are being considered--whether written letters or words, or musical notes, or spoken words, or symphonic music, or pictures. The theory is deep enough so that the relationships it reveals indiscriminately apply to all these and to other forms of communication. This means, of course, that the theory is sufficiently imaginatively motivated so that it is dealing with the real inner core of the communication problem--with those basic relationships which hold in general, no matter what special form the actual case may take.

It is an evidence of this generality that the theory contributes importantly to, and in fact is really the basic theory of, cryptography, which is, of course, a form of coding. In a similar way, the theory contributes to the problem of translation from one language to another, although the complete story here clearly requires consideration of meaning, as well as of information. Similarly, the ideas developed in this work connect so closely with the problem of the logical design of great computers that it is no surprise that Shannon has just written a paper on the design of a computer which would be capable of playing a skillful game of chess. And it is of further direct pertinence to the present contention that this paper closes with the remark that either one must say that such a computer thinks, or one must substantially modify the conventional implication of the verb to think.

Thirdly, it seems highly suggestive for the problem at all levels that error and confusion arise and fidelity decreases when, no matter how good the coding, one tries to crowd too much over a channel (i.e., H > C). Here again a general theory at all levels will surely have to take into account not only the capacity of the channel but also (even the words are right!) the capacity of the audience. If one tries to overcrowd the capacity of the audience, it is probably true, by direct analogy, that you do not, so to speak, fill the audience up and then waste only the remainder by spilling. More likely, and again by direct analogy, if you overcrowd the capacity of the audience you force a general and inescapable error and confusion.

Fourthly, it is hard to believe that levels B and C do not have much to learn from, and do not have the approach to their problems usefully oriented by, the development in this theory of the entropic ideas in relation to the concept of information.

The concept of information developed in this theory at first seems disappointing and bizarre--disappointing because it has nothing to do with meaning, and bizarre because it deals not with a single message but rather with the statistical character of a whole ensemble of messages, bizarre also because in these statistical terms the two words information and uncertainty find themselves to be partners.

........


END