Sunday, May 20, 2012

What is time?

There is an anecdote about a foreigner visiting London asking a man on the street “what is time?” and receiving the answer “I’m sorry, but I am not a philosopher”.

I don’t want to discuss here the philosophical or physical questions of what time is, but rather what we mean by time in telecommunications applications. In particular, we frequently hear the terms “UTC”, “GPS time”, “NTP time”, and “1588 time”, and I would like to clarify what these terms mean.

Everything starts with the question “what is a second?”. Until 1960 the second’s duration was based on the rotation of the Earth; specifically, the second was defined as 1/86,400 (i.e., 1/(24*60*60)) of a mean solar day. Unfortunately, the Earth’s rotation is slowing down due to tidal friction, and so between 1960 and 1967 the second was redefined as a particular fraction of the duration of the year 1900. Since it is hard to reproduce the year 1900 in the lab, the second was finally linked to a stable, reproducible, physical phenomenon, namely the radiation emitted when an electron transitions between the two hyperfine levels of the ground state of the cesium 133 atom. A cesium atomic clock need only count 9,192,631,770 oscillations of this radiation and declare that a second has passed. (Cesium is chosen because all of its 55 electrons except the outermost one are in stable shells, minimizing their effect on the outermost electron.)

Even such a stable phenomenon as the hyperfine transition is somewhat subject to variation (due to contaminants, undesired fields, and General Theory of Relativity corrections due to height above sea level), leading to variability on the order of a nanosecond or two per day. In order to remove even this small variability, the TAI international time scale (TAI stands for “Temps Atomique International” or International Atomic Time), maintained by the International Bureau of Weights and Measures (BIPM) in Paris, is defined as the weighted average of over 300 atomic clocks located around the world (the higher weightings going to the more stable clocks).

TAI is precisely defined, but has become entirely divorced from the Earth’s rotation. Were we to adopt only TAI, the time of day would slowly lose its connection with the position of the sun in the sky, and after a long enough time we would be having breakfast at 12 noon. In order to keep civil time tied to the Earth’s rotation, UTC is defined. UTC stands for Coordinated Universal Time (the order of the letters is a compromise between several languages), and it replaced older time standards such as “GMT”. It is defined in ITU-R Recommendation TF.460-6 to be TAI adjusted by leap seconds introduced to compensate for the changing of the Earth’s rotational velocity. When to introduce leap seconds is now determined by the International Earth Rotation and Reference Systems Service (IERS). While leap seconds can be either positive or negative, and can be introduced at the end of any month, there have only been positive ones (corresponding to the slowing down of the Earth’s rotation) and they have only been introduced on the last day of June or December. There are presently proposals to eliminate leap seconds entirely (in which case TAI would be abolished), and perhaps introduce leap hours should the need arise.

UTC is now exactly 34 seconds behind TAI, because of a 10 second offset introduced in 1972 when the present system was adopted, and 24 positive leap seconds that have been declared since then. The next leap second will be at the end of June 2012, increasing the difference to 35 seconds.
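
To make the bookkeeping concrete, here is a minimal Python sketch using only the numbers quoted above:

    # TAI - UTC consists of the 10 second offset adopted in 1972
    # plus one second for every positive leap second declared since then.
    INITIAL_OFFSET_1972 = 10   # seconds
    LEAP_SECONDS_SO_FAR = 24   # positive leap seconds declared up to May 2012

    def tai_minus_utc(extra_leap_seconds=0):
        """TAI - UTC in seconds, optionally after future leap seconds."""
        return INITIAL_OFFSET_1972 + LEAP_SECONDS_SO_FAR + extra_leap_seconds

    print(tai_minus_utc())    # 34 today
    print(tai_minus_utc(1))   # 35 after the leap second at the end of June 2012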

Actually there are several versions of Universal Time. UT0 and UT1 are found by observing the motion of stars (UT0) or distant quasars (UT1), as well as from laser ranging of the Moon and artificial Earth satellites (such as GPS satellites). UT1R and UT2R are smoothed versions of UT1, filtered to remove periodic and stochastic variations in the Earth’s rotation. UT2R is smoother than UT1, and any variations left in it are because of erratic changes in the Earth’s rotation, due to plate tectonics and climate change.

So, what kind of time do we use in GPS and our time distribution protocols?

The time of day reported by GPS, which is often called “GPS time”, is not UTC. Every GPS satellite has several on-board atomic clocks, and these clocks are set according to the master clock at the US Naval Observatory (USNO) in Washington, DC. “GPS time” does not include leap seconds, but GPS satellites periodically transmit a UTC offset message for this purpose (the GPS-UTC offset field is 8 bits and can thus accommodate 255 leap seconds, which should be sufficient for several hundred years). Once thus compensated, the resulting time is within tens of nanoseconds of UTC. However, it can take over 10 minutes until you receive an offset message.
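
As a concrete (if simplified) illustration, here is a Python sketch of the correction a receiver applies once it has the broadcast offset. The value 15 is my addition: it is the GPS-UTC offset that was in effect in May 2012 (GPS time was aligned with UTC in January 1980 and has ignored leap seconds ever since), and it becomes 16 after the June 2012 leap second.

    # Simplified sketch: converting a "GPS time" seconds count to UTC using
    # the leap second offset broadcast by the satellites.
    GPS_UTC_OFFSET = 15   # seconds, as broadcast in May 2012 (16 after June 30, 2012)

    def gps_to_utc_seconds(gps_seconds):
        """Remove the accumulated leap seconds from a GPS seconds count."""
        return gps_seconds - GPS_UTC_OFFSET

    # Until the offset message has actually been decoded (which, as noted, can
    # take over 10 minutes), a receiver must rely on a previously stored value.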

It is interesting that the on-board atomic clocks must be corrected for relativistic effects. Since the satellites are moving at high speeds with respect to an observer on the ground, the Special Theory of Relativity predicts that the on-board clocks will seem to be running about 7 microseconds per day slower than were they stationary with respect to the observer. On the other hand, the General Theory of Relativity predicts that because the satellite is high above the Earth, and thus experiences a weaker gravitational field, the on-board clocks will seem to be running faster by about 45 microseconds per day. The net relativistic correction is about 38 microseconds per day. After compensating for relativistic effects, the accuracy of time derived from a good GPS receiver is about 50 nanoseconds.
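
These numbers are easy to reproduce. The following back-of-the-envelope Python sketch uses the standard values for the Earth’s gravitational parameter and the roughly 26,560 km GPS orbital radius (neither of which appears above, so treat them as my own inputs), and recovers the 7, 45, and 38 microsecond figures:

    from math import sqrt

    # Back-of-the-envelope check of the relativistic rates quoted above.
    GM      = 3.986004418e14   # Earth's gravitational parameter [m^3/s^2]
    R_EARTH = 6.371e6          # mean Earth radius [m]
    R_ORBIT = 2.656e7          # GPS orbital radius, about 26,560 km [m]
    C       = 299792458.0      # speed of light [m/s]
    DAY     = 86400.0          # seconds per day

    v = sqrt(GM / R_ORBIT)                                # orbital speed, ~3.9 km/s

    special = -(v**2) / (2 * C**2) * DAY                  # clock seems slower: ~ -7.2 us/day
    general = GM * (1/R_EARTH - 1/R_ORBIT) / C**2 * DAY   # weaker gravity: ~ +45.7 us/day

    print(f"special relativity: {special*1e6:+.1f} us/day")
    print(f"general relativity: {general*1e6:+.1f} us/day")
    print(f"net correction:     {(special+general)*1e6:+.1f} us/day")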

NTP (and that includes SNTP) distributes UTC (i.e., it does take leap seconds into account) and specifies UTC in seconds since Jan 1, 1900. The NTP 64-bit timestamp consists of 32 bits of whole seconds (about 136 years until roll-over) and 32 bits of fractional seconds (about 233 picoseconds of resolution). However, the accuracy of any specific NTP server depends on the stratum of its reference clock. Of course, the time a particular NTP client obtains also depends on the network between the client and the NTP server. You can expect an NTP client to be within tens of milliseconds of its server on a LAN, but only within hundreds of milliseconds over the Internet. However, NTP allows a client to track several servers, and thus improve its accuracy.
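
For concreteness, here is a minimal Python sketch of building a 64-bit NTP timestamp from an ordinary UNIX (1970-based) clock reading; the 2,208,988,800 second constant is simply the number of seconds between the 1900 and 1970 epochs:

    import time

    # The NTP epoch is Jan 1, 1900; the UNIX epoch is Jan 1, 1970.
    NTP_UNIX_OFFSET = 2208988800          # seconds between the two epochs

    def unix_to_ntp_timestamp(unix_time):
        """Pack a UNIX time (seconds since 1970) into a 64-bit NTP timestamp."""
        ntp_time = unix_time + NTP_UNIX_OFFSET
        seconds  = int(ntp_time)                       # 32 bits of whole seconds
        fraction = int((ntp_time - seconds) * 2**32)   # 32 bits of fraction (~233 ps steps)
        return (seconds << 32) | fraction

    print(hex(unix_to_ntp_timestamp(time.time())))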

IEEE 1588 distributes TAI, with the PTP epoch being 1 January 1970 (essentially the UNIX epoch, but on the TAI timescale). Since TAI does not include leap seconds, 1588 time is now ahead of UTC by 34 seconds (soon to be 35), an offset that 1588 distributes alongside the time itself. The 1588v2 10-byte timestamp consists of 48 bits of whole seconds and 32 bits of nanoseconds. Once again, the precise time accuracy depends on the type of grandmaster to which the 1588 master is synchronized. The big difference between 1588 and NTP is the possibility of on-path support in the network. If you have Boundary Clocks (BCs) or Transparent Clocks (TCs) in your network, the time error should be very small (perhaps a microsecond or less). 1588 can't simultaneously track multiple masters, but it can choose the best one from a list.
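
And again for concreteness, a sketch of the 1588v2 timestamp layout and the UTC conversion; the 34 second value is the TAI-UTC offset in effect in May 2012, which a real implementation would take from the distributed offset rather than hard-code:

    # Minimal sketch of a 1588v2 (PTPv2) timestamp: 48 bits of whole TAI
    # seconds since the PTP epoch (1 January 1970), plus 32 bits of nanoseconds.
    CURRENT_UTC_OFFSET = 34   # TAI - UTC in seconds as of May 2012 (35 after June 30)

    def pack_ptp_timestamp(tai_seconds, nanoseconds):
        """Pack into the 10-byte on-the-wire format (big endian)."""
        return tai_seconds.to_bytes(6, "big") + nanoseconds.to_bytes(4, "big")

    def ptp_to_utc_seconds(tai_seconds):
        """Recover a UTC seconds count from a 1588 seconds count."""
        return tai_seconds - CURRENT_UTC_OFFSET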

So that's what we mean by time.

Y(J)S

Tuesday, May 8, 2012

iPhone Storms

A few years ago RAD’s president Zohar Zisapel asked me to accompany him to a meeting with another Israeli company concerning possible cooperation on an important issue. On our way I asked him what this important issue was. He replied “the iPhone problem” and I immediately understood.

He informed me that he had been in the US the previous week, and although he carried a Blackberry and not an iPhone, he had experienced inability to connect to the network even for voice calls, calls dropping in the middle, cell breathing (which he graphically described as the signal strength bars undulating up and down), and of course inability to connect to data services. Once back in Tel Aviv, he had contacted companies with whom RAD could cooperate in trying to solve the problem.

I had seen many reports on the problems AT&T was experiencing in New York and San Francisco since the introduction of Apple’s iPhone, but had not known it was really that bad. Obviously the iPhone brought significantly increased bandwidth usage due to users being “always on” and consuming more video streaming and other high-datarate services rather than just sporadically sending an email or downloading a file. However, networks in other parts of the world with many different kinds of smartphones were not experiencing such catastrophic failures; in fact, many operators with whom I had spoken were not observing any problems at all!

What could be causing these problems? There were really only three possibilities:
  1. lack of resources in the air interface (known as spectrum crunch or spectral exhaustion),
  2. under-provisioning of the backhaul network,
  3. failure of the signaling servers (due to what are known as signaling storms);
and if the second item was the problem (or at least a major chunk of it), then RAD was uniquely positioned to help.

Why did we expect the second problem to be at the root of the matter? Well, the backhaul network is extremely cost sensitive, and increasing bandwidth there was an expensive and time-consuming task. We didn’t expect the air interface to be already congested (although we expected the spectrum to eventually become exhausted) since AT&T had already deployed HSPA+. We ruled out signaling as the major issue, since denser networks of smartphones were not experiencing similar problems.

Of course we now know that we were completely wrong, and that signaling server failure was the major problem. The explanation was intimately related to the slim design of the iPhone, and to the fact that Americans had never adopted text and multimedia messaging as avidly as Europeans did.

To understand what went wrong and how the issue was eventually solved, I need to explain 3G Radio Resource Control (RRC) states. The RRC protocol is the control plane between the 3G network and the UE (User Equipment, e.g., cellphone). It is responsible for handling many interactions, such as locating the UE, waking it up, establishing/releasing connections for voice and data, and sending SMSes.

The UE can be in one of five possible RRC states, called Idle, URA_PCH, Cell_PCH, Cell_FACH, and Cell_DCH. In the Idle state the UE is only known to the network by its IMSI (its subscriber identity), and only listens to system broadcasts and paging information. It only very rarely transmits (and even then only location updates) and barely uses its receiver (waking up periodically to check whether it has been paged). Battery drain is thus extremely low. At the other extreme is the Cell_DCH (Dedicated Channel) state. Here the UE is using a dedicated high-speed data channel, and may be consuming 100 times more battery power. In between are the PCH states, where the UE is connected but still relatively inactive, consuming only a little battery power, and the Cell_FACH state, where the UE is using shared channels for exchanging small bursts of data, consuming perhaps half of what it would consume in Cell_DCH.
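
To keep the states straight, here is a small Python sketch. The relative power numbers are only illustrative placeholders reflecting the rough proportions just described (Cell_DCH around 100 times Idle, Cell_FACH about half of Cell_DCH); they are not measurements of any real handset.

    from enum import Enum

    class RRCState(Enum):
        IDLE      = "Idle"        # known only by subscriber identity; almost no battery drain
        URA_PCH   = "URA_PCH"     # connected but paging-only; very low drain
        CELL_PCH  = "Cell_PCH"    # connected but paging-only; very low drain
        CELL_FACH = "Cell_FACH"   # shared channels for small bursts of data
        CELL_DCH  = "Cell_DCH"    # dedicated high-speed channel; highest drain

    # Illustrative relative battery drain (Idle = 1), matching only the rough
    # proportions in the text, not measurements of any particular handset.
    RELATIVE_POWER = {
        RRCState.IDLE:      1,
        RRCState.URA_PCH:   2,
        RRCState.CELL_PCH:  2,
        RRCState.CELL_FACH: 50,    # about half of Cell_DCH
        RRCState.CELL_DCH:  100,   # roughly 100 times Idle
    }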

Now, a UE in the Cell_PCH state that needs to send a short data packet (e.g., an application keepalive) needs to transition to Cell_FACH. It does this by sending a single signaling message and receiving a single reply. After sending its data packet, the UE will only drop back to Cell_PCH after a relatively long timeout (several seconds), and in the meantime will be wasting battery power. In order to conserve battery power many manufacturers, starting with RIM in its Blackberry, but more notably Apple in the iPhone and various manufacturers of Android devices, devised a trick. The UE sends an SCRI (Signaling Connection Release Indication) message, a message that was intended to convey that some unexpected error has occurred in the UE, and that the network should immediately release its connection. The UE drops into the Idle state, with almost no battery drain. However, the network effectively forgets it, and the next time the UE needs to transmit something, it has to go from the Idle state to Cell_FACH, which is a signaling-intensive (over 25 messages) and lengthy operation.

The consequences of this trick were not very apparent when it was only used by Blackberry handsets, which are mainly used for email and occasional short data transfers. On the other hand, iPhone users tend to continually pull and push data, watch and stream videos, and are generally “always on”. In addition, the iPhone’s iconic slimness meant that Apple couldn’t use anything larger than a 1400 mAh battery, so Apple was particularly aggressive in sending SCRIs. Finally, in the US, where SMS had never been as popular as in Europe, the signaling infrastructure was woefully undersized for millions of iPhones disconnecting and reconnecting to the network.

The initial resolution involved increasing server resources and freeing up bandwidth for signaling channels. The eventual solution was a signaling enhancement in 3GPP Release 8 called Fast Dormancy, which Apple adopted towards the end of 2010. This enhancement enables the UE to transition quickly from FACH state to PCH, rather than to Idle as in the trick. Thus the network remembers the UE, and it can rapidly transition back and forth between FACH and PCH states.
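
A toy calculation shows why the trick hurt the network and why Fast Dormancy fixes it. The message counts are just the rough figures quoted above (two messages for a PCH/FACH transition, over 25 to come up from Idle), so take the output as illustrative only:

    # Toy comparison of the signaling load per short data burst, using only the
    # rough message counts mentioned in the text (illustrative, not exact).
    MSGS_PCH_TO_FACH  = 2    # one request plus one reply
    MSGS_IDLE_TO_FACH = 25   # "over 25 messages" to set up a connection from Idle
    MSG_SCRI          = 1    # the release indication itself

    def signaling_per_burst(strategy):
        """Signaling messages needed to send one short burst and go back to sleep."""
        if strategy == "timeout":         # just wait for the timeout back to Cell_PCH
            return MSGS_PCH_TO_FACH
        if strategy == "scri_trick":      # send SCRI, drop to Idle, reconnect next time
            return MSG_SCRI + MSGS_IDLE_TO_FACH
        if strategy == "fast_dormancy":   # Release 8: SCRI with cause, drop to Cell_PCH
            return MSG_SCRI + MSGS_PCH_TO_FACH
        raise ValueError(strategy)

    bursts_per_hour = 60   # e.g., an application keepalive every minute (illustrative)
    for s in ("timeout", "scri_trick", "fast_dormancy"):
        print(s, signaling_per_burst(s) * bursts_per_hour, "messages per hour")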

Of course, iPhones are not alone in having caused signaling storms. In mid 2011 the Android port of Angry Birds caused significant signaling traffic that stressed networks until an update solved the problem, and in January 2012 NTT Docomo suffered a 4½ hour outage in Tokyo due to an Android application that overloaded the signaling plane.

And according to many reports, spectral exhaustion is right around the corner.

Y(J)S