The Scalability of Mesh Networks: Part II Protocols and Practice
In Part I of this series, we looked at the scalability of mesh networks in a fundamental and theoretical sense, that is: is there a limit? We saw that whether there is a limit or not depends to a large extent on the nature of traffic flows as the network grows. In particular, if the traffic is sufficiently local, then the network can scale without limit, and if not, there exists some limit — however large — beyond which the network cannot be operated.
For designing and deploying real-world mesh networks, though, the question of “does the network scale without limit” is probably less important than the question of “does the network scale to what I need to scale to”? The answers can be different: an inherently unscalable network, i.e., one where the traffic is not local, may yet scale to the number required for a specific application. Conversely, a fundamentally scalable network may have a limit in practice due to the overhead produced by networking protocols.
In-practice or “symptotic” scalability is the maximum number of nodes a given network can reasonably operate at.
In this part, we explore the topic of the in-practice scalability first introduced in this paper : how many nodes does a specific mesh network scale to? What factors determine the scalability limits, and how can we engineer the network to maximally approach that limit? Unlike the asymptotic scalability studied in Part 1, we now seek the actual non-asymptotic expressions. Such an approach is referred to as the symptotic framework (cancelling the two negatives “non” and “a”!) in this paper.
As in Part I, we shall work through a transportation analogy so as to intuitively understand the underlying concepts. But this time we consider a version of the Washington Beltway — a circular highway encircling the D.C area. The simplicity and symmetry of the shape eases computation that would be too hard for the level of this article if we used the Manhattan grid of Part 1.
Our version of the Beltway is shown in the figure below. It has N equally spaced entry/exit points, and can carry W cars per minute on each stretch between entry/exits. Cars enter the highway at a certain rate L per minute from each entry ramp. The circular Beltway is analogous to a type of network called a ring network: exits are nodes arranged in the form of a ring, the stretches of highway between exits are the links between adjacent nodes, lanes are channels, and cars are messages.
Each car entering the highway is bound for a uniformly randomly chosen exit, and has the choice of going clockwise or counterclockwise to reach its destination. The choice needs to be made not only based on the relative locations of the entry and exit points, but also on the traffic situation.
There are many ways to monitor traffic and accidents, but let us suppose that this is being done using special vehicles run by the transportation authority — we call them monitors. In the figure below, regular cars are shown in red, and the monitors in yellow. We assume there are M monitors per minute injected into the Beltway at each entry point that go all the way around the Beltway before exiting (more on this overkill later).
Now for the question: how scalable is the imaginary Beltway? Specifically, as we increase the number of entry/exit points (N), with each ramp still admitting L cars/min and M monitors/min, at what point will it choke up to the point that on-ramp queues will build up indefinitely? Note that we are interested not only in “is there a limit / does it scale”, but also in “what is the limit / what does it scale TO”?
Working through some simple math for this will be conceptually helpful, so let us roll up our sleeves and do it. First, at the limit, the available car-carrying capacity equals the used capacity. Assuming no traffic-slowing congestion on the highway or ramps, the available capacity is simply W*N since there are N stretches each with W cars/min capacity. To calculate the used capacity, we need to know — similar to part 1 — the average distance travelled by each car. Assuming that the Beltway is large enough and cars choose their exits randomly enough, the regular (non-monitor) traffic on each stretch should eventually stabilize to about the same amount, and hence each car will use the shortest path. Since every exit is at most N/2 exits away from an entry point, and each exit within these is equally probable, the average distance travelled is roughly N/4.
Thus, there are L cars/min from each of N points travelling on average N/4 stretches, giving a total of L*N*(N/4) capacity usage per minute. Further, there are M monitors injected per minute at each of the N entry points, going all of the N stretches (M*N*N). Therefore, “at capacity”, that is, when usage is equal to what is available, the following should hold:
W*N = L*N*(N/4) + M*N*N
Solving for N yields N = 4*W / (L + 4*M)
If a picture is worth a thousand words, a formula is worth a thousand pictures, as quoted by the legendary computer scientist E.W. Dijkstra. From the above formula, it is easy to see that scalability increases in linear proportion to the road capacity W, and decreases as either the rate of admitted cars L decreases or the number of monitors M increases. All this makes intuitive sense. We also observe that the monitors M are weighted more than L, so their effects are different.
Note that the formula not only shows that this version of the Beltway with uniformly randomly chosen exits is not scalable in the fundamental sense — for any non-zero L or M, N is finite — but also to what it will scale to. For example, if there are 10 lanes with 30 cars/min per lane capacity (W=300), 40 cars and 2 monitors/min per entry (L=40, M=2), then per the formula the Beltway scales up to N = 25 exits.
By analogy, the formula also holds for the scalability of a ring mesh network where N is the number of nodes, W is the radio bitrate, L is the offered load, i.e., the number of messages generated by each node, and M is the number of topology monitoring messages (called control messages) generated by the routing protocol to figure out which “direction” or next hop to forward the data message. All units are in messages per minute.
In-practice scalability can be estimated as function of bitrate, message load, protocols and topology using numerical models.
While this formula only applies to ring mesh networks, more complex formulas for several other networks are derived in this paper, including a line network, a “clique” network, a randomized Manhattan grid, and mobile convoys. The formulas include the overhead of routing and access protocols to compute scalability, and verifies the formula with simulation. However, for an arbitrary network topology, it is very difficult to compute the scalability mathematically and therefore one typically resorts to simulations.
Actual formulas also allow us to do impact analysis, that is, roughly estimate the impact of improving some aspect of the network without actually doing so. Impact analysis for mesh networks is studied rigorously in this paper, but for this article, we apply it to answer a simple question: if I want to scale to more nodes, is it more effective to decrease the message load, increase the bitrate, or decrease the protocol overhead?
Let us do this for the ring network topology using the Beltway analogy and the formula we derived for it. Using the example worked out earlier as the base case, the table below computes the scalability improvement for a 2x “improvement” in each of the parameters — note that “improvement” means doubling for road capacity (W) and halving for the others.
We observe that while doubling the bitrate W doubles the scalability, halving the message rate only increases it by a factor of 1.7x. Also, halving the number of monitors does little to improve the scalability — just 9%.
But suppose we consider a circular road whose capacity is an order of magnitude (10x) less than the Beltway, say like the Kathmandu Ring Road in Nepal, but with the same assumptions, including monitors. It can obviously not carry the same level of traffic as the Beltway, but what if the traffic load is also reduced by an order of magnitude? Then, it is reasonable to expect that the scalability remains about the same since both are cut by the same amount. Is this true?
No, it is not. Per the formula, dropping W from 300 to 30 and L from 40 to 20, gives a scalability of only 10 exits, much lower than the 25 exits we saw for the original numbers. This is due to the presence of the monitors. Since the monitor traffic remains the same, when the road capacity drops, the monitors occupy a much larger fraction of the capacity. So even though the entry rate L drops by 10x, the total traffic on the road doesn’t drop as much. In other words, the monitors — whose purpose is to serve the other cars — end up utilizing much of the capacity themselves.
A less extreme case of this issue is illustrated in the figure, where both the number of lanes and regular cars are cut down by 4x, but the number of monitors is the same. Notice how more than half of the capacity is occupied by monitors — reducing by 10x would make it even worse.
This is pretty much what happens in low-capacity long-range short-burst mesh networks where the radio bitrate is traded off for range. Such networks are invaluable in public safety and outdoor Internet-of-Things (IoT) applications amongst others. In such networks, the use of networking protocols designed for higher-capacity networks result in routing control messages (analogous to monitors) taking up a large fraction of the network capacity, thereby limiting the scalability. Indeed, as shown in this paper, well-known routing protocols when run on low bitrate networks exhibit congestion collapse above a certain size.
Impact analysis shows that scalability is especially sensitive to protocol overhead in low capacity mesh networks.
Clearly, the reason for the above situation is that the monitor cars weren’t trimmed down in proportion to the actual traffic. To do so is not straightforward since monitoring needs don’t necessarily reduce with capacity. But first, does trimming overhead even help — after all, we say that it only gave us a 9% lift in the 10-lane case. But re-doing the impact analysis with the new base case of a lower-capacity lower-traffic Kathmandu Ring Road shows that it does now give a 50% improvement – so well worth doing!
Doubling the road capacity (the bitrate of a mesh network device) also gives hefty gains, but comes with tradeoffs in cost, range or battery life. Restricting car entry (originated messages) moves the penalty to the user, and should be avoided. This leaves the reduction of protocol overhead as the best option to improve scalability.
There has been quite a bit of research over the past several decades towards reducing the routing control overhead in mesh networks. Broadly speaking, the ideas attempt to intelligently limit when the control messages (monitors) are triggered, or how many hops the control messages are sent. For example, you may have noted in the Beltway analogy that sending the monitors all the way around seemed like an overkill — a single monitor could pick up traffic information on all the stretches, or each monitor could go only a certain distance.
Although this seems obvious for the stationary ring network, it is difficult to cut down the overhead significantly in general for a large, mobile network with arbitrary and dynamic topology. First, topology changes need to be detected in a timely manner, and hence one cannot reduce the monitor rate without penalty. Second, control messages cannot be easily localized since a network partition far away may influence how we select our route from the get go, and so we need “global” information in a timely manner. Indeed, one of the IETF standards called OLSR works just like the Beltway monitors — each node periodically generates an update about its links and broadcasts it to every other node. While there are many improvements to OLSR such as Hazy-Sighted Link State and B.A.T.M.A.N, and there exist other approaches such as on-demand routing (e.g. AODV used in the IoT standard Zigbee), none of them cut down the overhead nearly enough to make long-range low-bitrate networks scalable.
How then does one scale low-bitrate mesh networks? At goTenna, faced with this problem, we have invented a new approach that simply eliminates control messages. Instead, we piggyback topology monitoring information as part of data messages themselves. In the transportation analogy, this is like tasking the regular cars themselves with reporting traffic information, much like the Waze crowd-sourced traffic app does, except that ALL cars have that responsibility (messages, unlike car drivers, don’t have free will!). This is illustrated in the figure below — there are no yellow monitor cars, but each red car is equipped with monitoring equipment (represented in the figure by a yellow top half). As you can see, there is a lot more available capacity now, and therefore, higher scalability.
In terms of a real mesh network protocol, piggybacking monitoring on messages amounts to adding some information fields to the header of a message. This does increase the message size, but the overhead in doing so turns out to be far less than using explicit control packets. In terms of our formula, this is equivalent to setting M=0 but adding a small number of bits ε to L, that is N = 4*W/(L + ε). Further, this ε is a constant, and unlike M, does not grow with size and only comes into play when there is data traffic — thus saving energy when there is none. It improves security by eliminating a key attack vector — control spoofing. The zero-control-packet approach underpins several protocols within the goTenna Aspen Grove suite which does the meshing for the goTenna Pro line of products. The protocols address many details to take the above basic idea to fully functional protocols as described here and here.
Leveraging decades of research and expertise, goTenna has pioneered a zero-control-packet approach that enables unprecedented scaling.
Finally, although we have anchored our discussion of scalability entirely around unicast (point-to-point) communications, the same principles and methods apply to broadcast communications in a mesh network as well, as detailed here.
Hope that this journey through the roads of Manhattan, D.C, and Kathmandu has helped in a better understanding of the intricacies of mesh network scalability! Please feel free to post your thoughts in the comments section and continue the discussion.
Comments