Wednesday, November 30, 2011

On exa, zetta, and beyond

Anyone who lives in a metric system country knows what "kilo" means. A kilogram is 1000 grams, a kilometer is 1000 meters. Of course frequencies are measured in kilohertz, and in the computer world we have kilobits and kilobytes (although we are never quite sure whether that means 1000 or 1024!).

Most people even know that "mega" means a million. Power stations output megawatts of electricity, FM radios receive at megahertz frequencies, and atomic bombs deliver megatons. For years our disks were measured in megabytes, and for most of us our Internet connections are in megabits (although we are not quite sure whether that is 1,000,000 or 1024*1024!).

People with state-of-the-art computers are aware that giga means a (US) billion (a thousand million), and that tera means a thousand of those, but only because disk capacities have increased so rapidly. When you ask people what comes next, you tend to get puzzled looks. Most people aren't even sure whether a billion means a thousand million or a million million, so don't expect them to be experts in anything bigger than that!

Up to now only astrophysicists were interested in such large numbers, but with global data traffic increasing at over 30% per year, networking people are getting accustomed to them as well.

For those who are interested, the next prefixes are peta (10^15), exa (10^18), zetta (10^21), and finally yotta (10^24). The last two names were only formally accepted in 1991. For those who prefer powers of two, the IEC has standardized kibi (Ki) for 2^10, mebi (Mi) for 2^20, gibi (Gi) for 2^30, tebi (Ti) for 2^40, etc., although these terms don't seem to have caught on.
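The gap between the decimal and binary prefixes widens with each step, which is why the ambiguity matters more for terabyte disks than it ever did for kilobyte files. A quick sketch:

```python
# Decimal (SI) prefixes versus the corresponding IEC binary prefixes.
SI = {"kilo": 10**3, "mega": 10**6, "giga": 10**9, "tera": 10**12}
IEC = {"kibi": 2**10, "mebi": 2**20, "gibi": 2**30, "tebi": 2**40}

# The binary prefix overshoots its decimal cousin by another ~2.4% per step.
for si, iec in zip(SI, IEC):
    print(f"{iec}/{si} = {IEC[iec] / SI[si]:.4f}")
# kibi/kilo = 1.0240  ...  tebi/tera = 1.0995
```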

Several years ago I heard that the total amount of printed information in the world's libraries does not exceed a few hundred petabytes. On the other hand, present estimates are that global IP traffic now amounts to about 30 exabytes per month, or about ten times the world's accumulated printed knowledge every day. By the middle of this decade it should surpass 100 exabytes per month, i.e., about the entire world's printed knowledge per hour.
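A back-of-the-envelope check of that comparison, taking "a few hundred petabytes" at a low-end 100 PB (my assumption):

```python
PB, EB = 10**15, 10**18
monthly_traffic = 30 * EB               # estimated global IP traffic per month
daily_traffic = monthly_traffic / 30    # about 1 EB per day
printed_knowledge = 100 * PB            # low-end estimate of all printed material
print(daily_traffic / printed_knowledge)   # → 10.0
```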

These datarates, and particularly their time derivatives, present the telecommunications community with major challenges. We have grown accustomed to sophisticated applications that transfer massive amounts of data. A prime example is the new breed of cellphone voice/meaning recognition that sends copious amounts of raw data back to huge servers for processing. Such applications can only continue to be efficiently and inexpensively provided if the transport infrastructure can keep up with the demand for datarates.

And that infrastructure is simply not scaling up quickly enough. We haven't found ways to continually increase the number of bits/sec we can put into long-distance fiber to compensate for the >30% annual increase in demand (although new research into mode division multiplexing may help). Moore's law is only marginally sufficient to cover the increases in raw computation power needed for routers, but we will need Koomey's law for power consumption (MIPS per unit of energy doubles every year and a half) to continue unabated as well. And we haven't even been able to transition away from IPv4 after all of its addresses were exhausted!

If we don't find ways to make the infrastructure scale, then keeping up with exponential increases in demand will require exponential increase in cost.


Wednesday, November 23, 2011

My new CTO job

As you all probably know, I have changed job titles.
I am now RAD's Chief Technology Officer instead of (or perhaps in addition to?) Chief Scientist.

Our previous CTO, Prof. Daniel Kofman, is still in touch with the company. However, he is a bit busy since in addition to his position as Professor at Telecom ParisTech (formerly ENST), he has been appointed by France's Minister of Research and Innovation as director of LINCS (Laboratory of Information, Networking, and Communication Sciences), a new research center in Paris.

So, what will I be doing? Well, I will no longer be managing any R&D teams. The physical layer DSL chip development department I used to run closed many years ago, and last year my DSP software development department was dissolved as well. With my new appointment my HW/FPGA/Innovations department has transitioned to the newly formed Hardware and Innovations department, and my software team is moving to the new Advanced Technologies department. The Algorithmic Research department will still report to me.

I will continue to be responsible for tracking fundamental technology trends, and to steer RAD's participation in standardization forums (IETF, ITU, MEF, BBF, etc.). I will be working with academic research groups here in Israel, and perhaps abroad as well. I will be spending more time on IPR work - over the last few years this work has tended to be more defensive than creative. I will be doing more lecturing and more writing, and will function as editor in chief of the RAD Series on Essentials of Telecommunications (more on that some other time).

And I hope to have more time to blog.


Thursday, November 17, 2011

MPLS-TP update

At the MPLS Working Group meeting this week it was announced that the core set of MPLS-TP RFCs has been completed.

Indeed, we now have (I hope that I haven't missed too many):
•RFC 5586 MPLS Generic Associated Channel (G-ACh and GAL)
•RFC 5654 Requirements of an MPLS Transport Profile
•RFC 5718 An In-Band Data Communication Network for MPLS-TP
•RFC 5860 Requirements for OAM in MPLS Transport Networks
•RFC 5921 A Framework for MPLS in Transport Networks
•RFC 5950 Network Management Framework for MPLS-TP
•RFC 5951 Network Management Requirements for MPLS-TP
•RFC 5960 MPLS-TP Data Plane Architecture
•RFC 5994 Application of Ethernet Pseudowires to MPLS Transport Networks
•RFC 6370 MPLS-TP Identifiers
•RFC 6371 OAM Framework for MPLS-TP
•RFC 6372 MPLS-TP Survivability Framework
•RFC 6373 MPLS-TP Control Plane Framework
•RFC 6374 Packet Loss and Delay Measurement for MPLS Networks
•RFC 6375 Packet Loss and Delay Measurement Profile for MPLS-TP
•RFC 6378 MPLS-TP Linear Protection
•RFC 6425 Detecting Data-Plane Failures in Point-to-Multipoint MPLS - Extensions to LSP Ping
•RFC 6426 MPLS On-Demand Connectivity Verification and Route Tracing
•RFC 6427 MPLS Fault Management Operations, Administration, and Maintenance (OAM)
•RFC 6428 Proactive Connectivity Verification, Continuity Check, and Remote Defect Indication for the MPLS-TP
•RFC 6435 MPLS Transport Profile Lock Instruct and Loopback Functions

In addition, before the IETF meeting the ITU issued a statement reasserting that the IETF holds the pen on MPLS-TP.

It seems that the game is over.


Wednesday, November 16, 2011

The notorious IP checksum algorithm

I have been asked several times to explain the checksum calculation used in the IP suite (IPv4, TCP and UDP all utilize the same checksum algorithm).

RFC 791, which defines IPv4, gives the checksum algorithm as follows:
The checksum field is the 16 bit one's complement of the one's
complement sum of all 16 bit words in the header. For purposes of
computing the checksum, the value of the checksum field is zero.
and the algorithm description was further updated in RFCs 1071, 1141, and 1624.

RFC 791 further states
This is a simple to compute checksum and experimental evidence
indicates it is adequate, but it is provisional and may be replaced by a CRC procedure, depending on further experience.

Back in 1981 when the RFC was written, Jon Postel already realized that this algorithm is very limited in its error detection capabilities (see below), but at the time CRC computation was too expensive computationally.

RFC 793, which defines TCP, says
The checksum field is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header and text.

while RFC 768 for UDP says the same thing, but leaves a loophole
Checksum is the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the UDP header, and the data, padded with zero octets at the end (if necessary) to make a multiple of two octets

If the computed checksum is zero, it is transmitted as all ones (the equivalent in one's complement arithmetic). An all zero transmitted checksum value means that the transmitter generated no checksum (for debugging or for higher level protocols that don't care).

On this latter issue, RFC 1180 “A TCP/IP Tutorial” adds
An incoming IP packet with an IP header type field indicating "UDP" is passed up to the UDP module by IP. When the UDP module receives the UDP datagram from IP it examines the UDP checksum. If the checksum is zero, it means that checksum was not calculated by the sender and can be ignored. Thus the sending computer's UDP module may or may not generate checksums. If Ethernet is the only network between the 2 UDP modules communicating, then you may not need checksumming. However, it is recommended that checksum generation always be enabled because at some point in the future a route table change may send the data across less reliable media.

and RFC 1122 “Requirements for Internet Hosts” adds
Some applications that normally run only across local area networks have chosen to turn off UDP checksums for efficiency. As a result, numerous cases of undetected errors have been reported. The advisability of ever turning off UDP checksumming is very controversial.

IPv6, as defined in RFC 2460, doesn’t bother with a header checksum, but closes the UDP loophole
Unlike IPv4, when UDP packets are originated by an IPv6 node,
the UDP checksum is not optional. That is, whenever originating a UDP packet, an IPv6 node must compute a UDP checksum over the packet and the pseudo-header, and, if that computation yields a result of zero, it must be changed to hex FFFF for placement in the UDP header. IPv6 receivers must discard UDP packets containing a zero checksum, and should log the error.

So, how precisely does the IP checksum algorithm work, and why is it designed this way?

The simplest method to protect against bit errors would be to xor bytes (or 16-bit words) together. This method suffers from the disadvantage that two bit errors in the same column cancel out, leaving no trace. Checksums are slightly stronger, since they add words together instead of xoring them. Thus two bit errors in the same column indeed leave that column correct in the sum, but the carry into the next column will be different (provided both bits flip in the same direction; if one flips from 0 to 1 and the other from 1 to 0, even the sum misses the error).
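A tiny illustration of the difference (the word values are arbitrary):

```python
# Flip the same bit position (bit 8, 0->1 in both) in two 16-bit words.
words = [0x1234, 0x5678]
corrupted = [w ^ 0x0100 for w in words]

xor_check = lambda ws: ws[0] ^ ws[1]
sum_check = lambda ws: ws[0] + ws[1]

print(xor_check(words) == xor_check(corrupted))  # True:  XOR misses the errors
print(sum_check(words) == sum_check(corrupted))  # False: the carry out of that column differs
```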

Why does the IP checksum algorithm take the ones complement after adding together all of the words? Since this is a one-to-one transformation, it obviously doesn’t reduce the number of undetected errors. It does, however, protect against one special case – that of all zeros. If somehow the entire packet were wiped out and replaced by all zeros, the sum would still be OK (sum of zeros is zero). By flipping the result we catch this kind of bug.

To compute the IP checksum of some sequence of an even number of bytes (if the length is odd one pads with a zero byte), one groups the bytes in pairs which are considered as 16-bit words. Were one to have a computer that employs ones complement arithmetic, the algorithm would be simple to describe. One adds all of these words together, and returns the negative of this sum.

Unfortunately, ones complement machines are no longer in vogue, and essentially all computers now use twos complement representation. Ones complement and twos complement agree on how to represent positive numbers - they have a zero in the MSB. They also agree that negative numbers have a one in the MSB, but disagree about all the rest. Neither simply sets the MSB (that’s called “signed magnitude” representation). In ones complement machines the negative of a positive number (its ones complement) is made by flipping all its bits. In twos complement machines one flips all the bits and then adds one. Note that in ones complement representation there are two zeros – positive zero is all zeros and negative zero is all ones. Twos complement has only one zero - all zeros (all ones means -1).

Because of the difference in representation, the addition algorithms are also somewhat different for the two machine types. Twos complement machines add bits from LSB to MSB, and discard any carry from the MSB. Ones complement addition similarly adds the bits, but if a carry remains from the MSB it is added back to the LSB.
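On a twos complement machine the end-around carry has to be simulated by folding the overflow bits back into the low word. A minimal sketch of the whole algorithm in Python:

```python
def ip_checksum(data: bytes) -> int:
    """Internet checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # treat byte pairs as 16-bit words
    while total >> 16:                         # end-around carry: fold overflow back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                     # ones complement of the sum

# A well-known sample IPv4 header with its checksum field zeroed:
header = bytes.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
print(hex(ip_checksum(header)))                # → 0xb861
```

Verification is symmetric: summing the header with the correct checksum in place yields 0xFFFF, so running the same function over the complete header returns zero.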

So if everyone uses twos complement arithmetic today, why does the IP checksum algorithm use ones complement addition and ones complement negation? Well, perhaps when the checksum algorithm was chosen ones complement machines were more common (sigh).

More importantly, ones complement arithmetic has two (minor) advantages.

The first has to do with big- and little-endian conventions. Saying that a machine uses twos complement arithmetic still doesn’t completely pin things down. When building larger integers from bytes big endian machines place the higher order bytes to the left of the lower ones, thus if A and B are bytes, AB means A*256+B. Little-endian machines do the opposite – AB means B*256+A. Ones complement arithmetic has an interesting characteristic – addition is the same for big-endian and little-endian machines. This is not the case for twos complement arithmetic due to the discarding of the MSB carries.

For example,
in ones complement FF.FF+02.00=02.00 while FF.FF+00.02=00.02
in twos complement FF.FF+02.00=01.FF while FF.FF+00.02=00.01
Thus one can write generic IP checksum code that operates directly on 16-bit words and runs correctly on both little-endian and big-endian machines, without knowing which kind of machine it is running on and without compilation conditionals (#ifdef).
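This can be demonstrated on a single machine by reading the same bytes as 16-bit words in both orders (a sketch; the helper names are mine):

```python
def oc_sum(data: bytes, big_endian: bool) -> int:
    """Ones complement sum of 16-bit words, reading bytes in either order."""
    total = 0
    for i in range(0, len(data), 2):
        hi, lo = (data[i], data[i + 1]) if big_endian else (data[i + 1], data[i])
        total += (hi << 8) | lo
    while total >> 16:                  # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return total

data = bytes.fromhex("deadbeef0102")
be, le = oc_sum(data, True), oc_sum(data, False)
byteswap = lambda w: ((w & 0xFF) << 8) | (w >> 8)
print(le == byteswap(be))               # True: the two sums are byte swaps of each other
```

Since each machine also stores the 16-bit result back to memory in its own byte order, the checksum bytes that end up in the packet are identical either way.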

Another reason for ones complement is that it is slightly better at catching errors. Remember that twos complement addition discards MSB carries, so two bit errors in MSB positions are not caught, while ones complement propagates these carries back to the LSBs, thus catching this type of error. The difference is minor (for large TCP or UDP payloads the percentage of two bit errors missed by XORing is 6.25%, twos complement summing misses 3.32%, while ones complement summing only misses 3.125%).

Do these two small advantages justify the added complexity of using ones complement arithmetic? Probably not, but it is too late to change. With the greater computational power now available, stronger error detection algorithms should now be implemented. However, when IP is sent via Ethernet, it enjoys Ethernet's Frame Check Sequence, which is not only a CRC rather than a checksum, but is 32 bits in length! This makes the IP checksums superfluous.