
Tuesday, May 8, 2012

iPhone Storms

A few years ago RAD’s president Zohar Zisapel asked me to accompany him to a meeting with another Israeli company concerning possible cooperation on an important issue. On our way I asked him what this important issue was. He replied “the iPhone problem”, and I immediately understood.

He informed me that he had been in the US the previous week, and although he carried a Blackberry and not an iPhone, he had experienced an inability to connect to the network even for voice calls, calls dropping mid-conversation, cell breathing (which he graphically described as the signal-strength bars undulating up and down), and of course an inability to connect to data services. Once back in Tel Aviv, he had contacted companies with whom RAD could cooperate in trying to solve the problem.

I had seen many reports on the problems AT&T was experiencing in New York and San Francisco since the introduction of Apple’s iPhone, but had not known it was really that bad. Obviously the iPhone brought significantly increased bandwidth usage due to users being “always on” and consuming more video streaming and other high-datarate services rather than just sporadically sending an email or downloading a file. However, networks in other parts of the world with many different kinds of smartphones were not experiencing such catastrophic failures; in fact, many operators with whom I had spoken were not observing any problems at all!

What could be causing these problems? There were really only three possibilities:
  1. lack of resources in the air interface (known as spectrum crunch or spectral exhaustion),
  2. under-provisioning of the backhaul network,
  3. failure of the signaling servers (due to what are known as signaling storms);
and if the second item was the problem (or at least a major chunk of it), then RAD was uniquely positioned to help.

Why did we expect the second item to be at the root of the problem? Well, the backhaul network is extremely cost sensitive, and increasing bandwidth there was an expensive and time-consuming task. We didn’t expect the air interface to be congested yet (although we expected the spectrum to eventually become exhausted), since AT&T had already deployed HSPA+. We ruled out signaling as the major issue, since denser networks of smartphones were not experiencing similar problems.

Of course we now know that we were completely wrong, and that signaling server failure was the major problem. The explanation was intimately related to the slim design of the iPhone, and to the fact that Americans had never adopted text and multimedia messaging as avidly as Europeans did.

To understand what went wrong and how the issue was eventually solved, I need to explain 3G Radio Resource Control (RRC) states. RRC is the control-plane protocol between the 3G network and the UE (User Equipment, e.g., a cellphone). It is responsible for many interactions, such as locating the UE, waking it up, establishing and releasing connections for voice and data, and sending SMSes.

The UE can be in one of five possible RRC states, called Idle, URA_PCH, Cell_PCH, Cell_FACH, and Cell_DCH. In Idle mode the UE is known to the network only by its IMSI (its subscriber identity), and only listens to system broadcasts and paging information. It only very rarely transmits (and even then only location updates) and barely uses its receiver (waking up periodically to check whether it has been paged). Battery drain is thus extremely low. At the other extreme is the Cell_DCH (Dedicated Channel) state. Here the UE is using a dedicated high-speed data channel, and may be consuming 100 times more battery power. In between are the PCH states, where the UE is connected but still relatively inactive, consuming only a little battery power; and the FACH state, where the UE is using shared channels to exchange small bursts of data, consuming perhaps half of what it would consume in DCH.
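
To make the state ladder concrete, here is a minimal Python sketch of the five RRC states and their rough relative battery drain. The figures are only illustrative, on the scale suggested above (Idle = 1, DCH about 100 times that, FACH about half of DCH); the PCH values are my own assumption.

from enum import Enum

class RrcState(Enum):
    IDLE = "Idle"            # known only by subscriber identity; listens to paging
    URA_PCH = "URA_PCH"      # connected but inactive, tracked per URA
    CELL_PCH = "Cell_PCH"    # connected but inactive, tracked per cell
    CELL_FACH = "Cell_FACH"  # shared channels, small bursts of data
    CELL_DCH = "Cell_DCH"    # dedicated high-speed channel

# Rough relative battery drain (Idle = 1); DCH ~100x and FACH ~half of DCH
# follow the text above, the PCH figures are an assumption on the same scale.
RELATIVE_POWER = {
    RrcState.IDLE: 1,
    RrcState.URA_PCH: 2,
    RrcState.CELL_PCH: 2,
    RrcState.CELL_FACH: 50,
    RrcState.CELL_DCH: 100,
}

if __name__ == "__main__":
    for state in RrcState:
        print(f"{state.value:10s} ~{RELATIVE_POWER[state]}x Idle power")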

Now, a UE in the Cell_PCH state that needs to send a short data packet (e.g., an application keepalive) must transition to Cell_FACH. It does this by sending a single signaling message and receiving a single reply. After sending its data packet, the UE drops back to Cell_PCH only after a relatively long timeout (several seconds), and in the meantime it wastes battery power. In order to conserve battery power many manufacturers, starting with RIM in its Blackberry, but more notably Apple in the iPhone and various manufacturers of Android devices, devised a trick. The UE sends an SCRI (Signaling Connection Release Indication), a message that was intended to convey that some unexpected error has occurred in the UE and that the network should immediately release its connection. The UE drops into the Idle state, with almost no battery drain. However, the network effectively forgets it, and the next time the UE needs to transmit something, it must go from the Idle state to FACH, which is a signaling-intensive (over 25 messages) and lengthy operation.
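
A back-of-the-envelope sketch of what the trick costs the network: the message counts (two messages for a PCH-to-FACH promotion, over 25 for Idle-to-FACH) are the figures quoted above, while the keepalive interval is an assumed, illustrative value.

# Illustrative sketch of the signaling cost of the SCRI "trick".
PCH_TO_FACH_MESSAGES = 2     # one request, one reply
IDLE_TO_FACH_MESSAGES = 25   # full connection re-establishment

def signaling_load(keepalives_per_hour: int, uses_scri_trick: bool) -> int:
    """Signaling messages per hour generated by periodic keepalives."""
    per_keepalive = IDLE_TO_FACH_MESSAGES if uses_scri_trick else PCH_TO_FACH_MESSAGES
    return keepalives_per_hour * per_keepalive

if __name__ == "__main__":
    # A handset sending a keepalive every 5 minutes (12 per hour):
    print("staying in PCH:", signaling_load(12, uses_scri_trick=False), "msgs/hour")
    print("SCRI to Idle:  ", signaling_load(12, uses_scri_trick=True), "msgs/hour")
    # Multiply the second figure by a few million handsets to see why the
    # signaling servers buckled.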

The consequences of this trick were not very apparent when it was only used by Blackberry handsets, which are mainly used for email and occasional short data transfers. iPhone users, on the other hand, tend to continually pull and push data, watch and stream videos, and are generally “always on”. In addition, the iPhone’s iconic slimness meant that it couldn’t accommodate anything larger than a 1400 mAh battery, so Apple was particularly aggressive in sending SCRIs. Finally, in the US, where SMS had never been as popular as in Europe, the signaling infrastructure was woefully undersized for millions of iPhones disconnecting and reconnecting to the network.

The initial resolution involved increasing server resources and freeing up bandwidth for signaling channels. The eventual solution was a signaling enhancement in 3GPP Release 8 called Fast Dormancy, which Apple adopted towards the end of 2010. This enhancement enables the UE to transition quickly from FACH state to PCH, rather than to Idle as in the trick. Thus the network remembers the UE, and it can rapidly transition back and forth between FACH and PCH states.
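
The difference is easy to see side by side. In this illustrative comparison the message counts follow the figures above, while the resting-power figures are assumptions on the same relative scale used earlier.

# Pre-Release-8 trick (drop to Idle) vs. Release 8 Fast Dormancy (drop to Cell_PCH).
SCENARIOS = {
    # name: (resting state, messages to resume data transfer, relative resting power)
    "SCRI trick (pre-R8)": ("Idle", 25, 1),
    "Fast Dormancy (R8)":  ("Cell_PCH", 2, 2),
}

if __name__ == "__main__":
    for name, (rest, msgs, power) in SCENARIOS.items():
        print(f"{name:22s} rests in {rest:9s} "
              f"~{msgs} signaling msgs to resume, ~{power}x Idle power while resting")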

Of course, iPhones are not alone in having caused signaling storms. In mid-2011 the Android port of Angry Birds caused significant signaling traffic that stressed networks until an update solved the problem, and in January 2012 NTT Docomo suffered a 4½-hour outage in Tokyo due to an Android application that overloaded the signaling plane.

And according to many reports, spectral exhaustion is right around the corner.

Y(J)S

Wednesday, September 8, 2010

Deployment, R&D, and protocols

In my last entry I discussed why the last mile is a bandwidth bottleneck while the backhaul network is a utilization bottleneck. Since I was discussing the access network I did not delve into the core, but it is clear that the core is where the rates are highest, and where the traffic is the most diverse in nature.

Based on these facts, we can enumerate the critical issues for deployment and R&D investment in each of these segments. For the last mile the most important deployment issue is maximizing the data-rate over existing infrastructures, and the area for technology improvement is data-rate enhancement for these infrastructures.

For the backhaul network the deployment imperative is congestion control, while development focuses on OAM and control plane protocols to minimize congestion and manage performance and faults.

For the core network the most costly deployment issue is providing large-capacity, fast, and redundant forwarding elements, along with rich connectivity. Future developments involve a huge range of topics, from optimized packet formats (MPLS) through routing protocols, to management plane functionality.

A further consequence of these different critical issues is the preference of protocols used in each of these segments. In the last mile efficiency is critical, but there is little need for complex connectivity, so physical-layer framing protocols rule. As there may be a need for multiplexing or inverse multiplexing, one sometimes sees non-trivial use of higher-layer protocols; however, these are usually avoided. For example, Ethernet has long had an inefficient inverse multiplexing mechanism (LAG), but for DSL links this is being replaced with the more efficient sub-Ethernet PAF (EFM bonding), alongside physical-layer (m-pair) bonding.
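
For the curious, here is a toy Python sketch of the fragment-and-reassemble idea behind PAF-style bonding: each fragment carries a sequence number so the far end can rebuild the frame in order regardless of which pair carried it. The fragment size and the simple round-robin scheduler are my own simplifications, not the 802.3ah rules.

from itertools import cycle

def fragment(frame: bytes, size: int):
    """Split a frame into (sequence_number, payload) fragments."""
    return [(seq, frame[i:i + size])
            for seq, i in enumerate(range(0, len(frame), size))]

def transmit(frame: bytes, num_pairs: int, frag_size: int = 64):
    """Distribute the fragments round-robin over the bonded pairs."""
    pairs = [[] for _ in range(num_pairs)]
    scheduler = cycle(range(num_pairs))
    for frag in fragment(frame, frag_size):
        pairs[next(scheduler)].append(frag)
    return pairs

def reassemble(pairs) -> bytes:
    """Merge fragments from all pairs back into the original frame by sequence number."""
    frags = sorted((f for p in pairs for f in p), key=lambda f: f[0])
    return b"".join(payload for _, payload in frags)

if __name__ == "__main__":
    frame = bytes(range(256)) * 4  # a 1024-byte stand-in for an Ethernet frame
    assert reassemble(transmit(frame, num_pairs=3)) == frame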

In the backhaul network carrier-grade Ethernet has replaced ATM as the dominant protocol, although MPLS-TP advocates are proposing it for this segment. Carrier-grade Ethernet acquired all the required fault and performance mechanisms with the adoption of Y.1731, while the MEF has worked hard in developing the needed shaping, policing, and scheduling mechanisms.

In the core the IP suite is sovereign. MPLS was originally developed to accelerate IP forwarding, but advances in algorithms and hardware have made IPv4 forwarding remarkably fast. IP caters to a diverse set of traffic types, and the large number of RFCs attests to the richness of available functionality.

Of course it is sometimes useful to use different protocols. A service provider that requires out-of-footprint connectivity might prefer IP backhaul to Ethernet. An operator with regulatory constraints might prefer a pure Ethernet (PBBN) core to an IP one. Yet, understanding the nature and constraints of each of the segments helps us weigh the possibilities.

Y(J)S

Thursday, August 26, 2010

Bandwidth and utilization bottlenecks

Let us consider an end-to-end data transport path that can be decomposed into the following segments:
* end-to-end path = LAN + access network + core network + access network + LAN
There may be distinct service providers for each of these segments, and so many different decompositions may make sense from a business perspective. Yet the access network, and its decomposition into components
* access network = last mile + backhaul network
are useful constructs for more fundamental reasons.

These reasons emanate from the concepts of bandwidth and bandwidth utilization (the ratio of required to available bandwidth). In general:
1) LAN and core have high bandwidth, while the last mile has low bandwidth.
2) LAN and core enjoy low utilizations, while the backhaul network suffers from high utilization.
Let's see why.

LANs are the most geographically constrained of the segments, and thus physics enables them to effortlessly run at high bandwidth. On the other hand, LANs handle only their owner’s traffic, and thus the required bandwidth is low compared with that available. And if the bandwidth requirements increase, it is a relatively simple and inexpensive matter for the owner to upgrade switches or cabling. So utilization is low.

Core networks have the highest bandwidth requirements, and are geographically unconstrained. This is indeed challenging; however, the challenge is actually financial rather than physical. Physics allows transporting any quantity of digital data over any distance without error; it just exacts a monetary penalty when both bandwidth and distance are large. Since it is the core function of core network operators to provide this transport, the monetary penalty of high bandwidth is borne. Whenever trends show that bandwidth is becoming tight, network engineering comes into play – that is, either some of the traffic is rerouted or the network infrastructure is upgraded.

Shannon’s capacity law universally restricts the bandwidth of DSL, radio, cable or PON links used in the last mile. However, utilization is usually not a problem as customers purchase bandwidth that is commensurate with their needs, and understand that it is worthwhile to upgrade their service bandwidth as these needs increase.

On the other hand, the backhaul network is a true utilization bottleneck. Frequently the access provider does not own the backhaul infrastructure, and instead purchases capped bandwidth. Since the backhaul is shared infrastructure, overprovisioning its rings or trees would seriously inflate OPEX. Even when the infrastructure is owned by the provider, adding new segments involves purchasing right-of-way or paying license fees for microwave links.

So, the sole bandwidth bottleneck is the last mile, while the sole utilization bottleneck is the backhaul network. Understanding these facts is critical for proper network design.
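
To put some (entirely invented) numbers behind this, here is a small sketch that computes utilization for each segment; only the qualitative pattern matters, not the figures themselves.

# Illustrative figures only; chosen to show the pattern argued above.
SEGMENTS = {
    # segment: (available bandwidth in Mbit/s, typically required bandwidth in Mbit/s)
    "LAN":       (1000, 50),
    "last mile": (20, 15),
    "backhaul":  (200, 180),
    "core":      (100000, 40000),
}

if __name__ == "__main__":
    for name, (available, required) in SEGMENTS.items():
        utilization = required / available
        print(f"{name:10s} needs {required:>6} of {available:>7} Mbit/s "
              f"-> utilization {utilization:.0%}")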

Y(J)S

Thursday, August 19, 2010

The access network equation

My last entry provoked several emails on the subject of the terms last/first mile vs. access networks. While answering these emails I found it useful to bring in an additional term – the backhaul network. Since these discussions took place elsewhere, I thought it would be best to summarize my explanation here.

Everyone knows what a LAN is and what a core network is. Simply put, the access network sits between the LAN or user and the core. For example, when a user connects a home or office LAN to the Internet via a DSL link, we have a LAN communicating over an access network with the Internet core. Similarly, when a smartphone user browses the Internet over the air interface to a neighboring cellsite, the phone connects over an access network to the Internet core.

However, the access network itself naturally divides into two segments, based on fundamental physical constraints. In the first example the DSL link can’t extend further than a few kilometers, due to the electrical properties of twisted copper pairs. In the second case, when the user strays from the cell served by the base-station, the connection is reassigned to a neighboring cell, due to the electromagnetic properties of radio waves. Such distance-limited media constitute the last mile (or first mile, if you prefer).

DSLAMs and base-stations are examples of first aggregation points; they terminate last mile segments from multiple users and connect them to the core network. Since the physical constraints compel the first aggregation point to be physically close to its end-users, it will usually be physically remote from the core network. So an additional backhaul segment is needed to connect the first aggregation point to the core. Sometimes additional second aggregation points are used to aggregate multiple first aggregation points, and so on. In any case, we label the set of backhaul links and associated network elements the backhaul network.

We can sum this discussion up in a single equation:
* access network = last mile + backhaul network

I’ll discuss the consequences of this equation in future blog entries.

Y(J)S