Scaling: How We Process 10^30 Network Traffic Flows

What do we do?

At Forward Networks we build digital twins for computer networks. Enterprises with large networks use our software to build a software copy of their network and use that for searching, verifying, and predicting behavior of their network. It is not a simulation. It is a mathematically accurate model of their network.

Why is it a hard problem?

Large enterprise networks contain thousands of devices (switches, routers, firewalls, load balancers, etc). Each of these devices can have complex behaviors. Now imagine a large graph with thousands of nodes where each node represents one of these devices and the links between nodes show how they are connected. You need to model exactly how traffic originating from edge devices is propagated through the network.

To do so, you need to understand the exact behavior of each device in handling different packets. A typical enterprise network not only includes different types of devices (routers, firewalls, etc), but they are built by different vendors (Cisco, Arista, Juniper, etc) and even for the same device type from the same vendor, you typically see many different firmware versions. To build a mathematically accurate model you need to model every corner case and a lot of these are not even documented by vendors.

At Forward we have built an automated testing infrastructure for inferring forwarding behavior of devices. We purchase or lease these devices; put them in our lab; inject various types of traffic to them and observe how these devices behave.

Where are we today?

I’m proud to show off publicly today, for the first time, that we can process networks with more than 45,000 devices on a single box (e.g. a single ec2 instance). Here is a screenshot of an example network with about 45k devices:

Some of our customers send us their obfuscated data to help us identify performance bottlenecks and further improve performance. It is a win-win scenario. Our software gets better over time and they get even faster processing time. The data is fully obfuscated in that every IP and MAC address is randomly changed to a different address and every name is also converted to a random name and these mappings are irreversible. These changes do not materially change the overall behavior of the model and the obfuscated data is still representative of the complexity and diversity of network behaviors of the original network. The network in the above example is built from those data.

This network includes more than 10^30 flows. Each flow shows how a group of similar packets traverses the network. For example one flow might show how email traffic originating from a specific host and destined to another host starts from a datacenter then goes through several backbone devices and finally arrives at the destination data center. 

Each of these flows can be complex. If we were to spend 1 microsecond to compute each of these flows, it would still take us more than 10^17 years to compute this. But with a lot of hard engineering work, algorithmic optimizations and performance optimizations we are able to process this network in under an hour and we are capable of processing this on a single box. You don’t need a massive cluster for such computation. The best part is that the majority of the computation scales linearly. So, if customers want faster processing speed or higher search and verification throughput they can use our cluster version and scale based on their requirements.

 

How long did it take us to get here?

Forward Networks was founded in July 2013. Our founders are Stanford PhD grads and as a result the very first test data that we got was a 16 device collection from part of the Stanford network. I joined Forward in Sep 2014 after spending a couple of years building and working with large distributed systems in Facebook and Microsoft. I started leading the effort to scale our computation to be able to finish the computation of that 16 device network in a reasonable amount of time and it took us about two years to get there (Mar 2015).

Then almost every year we were able to process a 10x larger network. Today, we have tested our software on a very complex network with 45k devices. We are currently working on further optimization and scaling efforts and our projection is to get to 100k devices in Dec 2020. The following graph shows our progress our last couple of months and the projection till Dec 2020 on logarithmic scale:

Lessons learned

It takes time to build complex enterprise software

As I mentioned above, we started with data of a very small network. As we made our software better, faster and more scalable, we were able to go to customers with larger networks to get the next larger dataset; find the next set of bottlenecks and work on those. We had to rewrite or significantly change the computation core of our software multiple times because as we got access to larger data we would see patterns that we hadn’t anticipated before.

Could we have reduced the time it took us to get here if we had access to large data on early days of our start up? Yes. Was it feasible? No. Why would a large enterprise spend the time to install our software, configure their security policies to allow our collector to connect to thousands of devices in their production network to pull their configs and send the data to a tiny startup that doesn’t have a proven product yet? It is only going to happen if it is a win-win situation. Every time we got access to the next larger dataset from a customer, we optimized our software based on that and went to other customers that had networks of that size where our software was already capable of processing all or majority of their network and when they shared this data with us. We would either find new data patterns that needed to be optimized or combine all the data we had received from customers to build larger datasets for scale testing and improvements. It is a cycle and it takes time and patience to build complex enterprise software.

Customers with large networks typically have much more strict security policies which means that they wouldn’t share their data with us. This is why we had to spend the time and build data obfuscation capabilities in our software to allow them to obfuscate their data and share the result with us which would reveal the performance bottlenecks without sharing their actual data. Some customers have such strict policies that even that is not possible and for those we have built tools that aggregate overall statistics which are typically useful for narrowing down the root cause of performance bottlenecks.

When selling enterprise software, customers typically don’t spend a large amount of money on a software platform if they’re not 100% sure that it would work for them. There is typically a trial or proof-of-concept period where they install the software and evaluate it in their environment. In our early years, we worked very hard with our first few trial customers to make the software work well for them. There were cases which didn’t end up in immediate purchase but their data gave us invaluable insight in improving our software.

 

On-prem software should work on minimal hardware

These days it is pretty easy to provision an instance in AWS, Azure or other cloud providers with 1TB or more RAM. But you would be amazed to know how many times we have had to wait for weeks or months for some customers to provision a single on-prem instance with 128GB or 256GB RAM. Large enterprises typically allow provisioning small instances pretty quickly. But as soon as your software needs a more powerful instance, there can be a lot of bureaucracy to get it done. And remember, during the initial interactions with customers, you want them to start using your software quickly to finish the proof-of-concept period. During this time, they are still evaluating your software and they haven’t yet seen the value in their environment. So, if someone in a large organization opens a ticket to the infra teams to provision a software he/she wants to try, it may not be among the highest priority tickets that would get resolved.

At Forward Networks, we have learned to be very careful with any new tool, framework or dependency we add to our system. In fact our resource requirements are so low that our developers run the entire stack on their laptops which is very critical for fast debugging and quick iterations.

We have also spent a lot of engineering time and effort on making this possible. Here are some of the high level approaches:

  • Avoiding repeated computation.
  • Deduping data structures in memory and on disk.
  • Lazy computation: delaying processing to when it is actually needed.
  • Making core data structures as compact as possible and with very low serialization and access overheads (with our in-house serialization implementation with ideas similar to Google’s FlatBuffers).
  • Using fastutil for fast and memory efficient collections in Java. We even improved its performance and added support for immutable structures in our fork of it.
  • Profiling to detect and optimize performance of actual bottlenecks.

When you need to scale to 1000x or 10000x, you can’t simply use a cluster with 1000 nodes. Even if it is possible, there is no economic justification to that. You have to do the hard engineering work to get the same done with minimal resources. Majority of our customers run our software on a single box. But we also provide the cluster version for those customers that want to ensure high availability or have more concurrent users and want to have higher search or compute throughput. 

One of our customers was telling us that they had to provision and operate a few racks of servers for another software (in the same space as us but not exactly our competitor) and how they were pleased and amazed on what our software delivers with such low requirements. Of course not only this can speed up adoption of the software, it saves customers money and allows you as a software vendor to have better margins.

 

Open source tools are not always the answer

In the early years of our startup, we were using off-the-shelf platforms and tools like Elasticsearch and Apache Spark for various usages. Over time it became clear that while these platforms are generic enough to be applicable to a wide range of applications, they weren’t a great fit when you need to have major customizations that are critical to your application.

For example, initially we were computing all end to end network behaviors and were indexing and storing them in Elasticsearch. But later it became clear that it is computationally infeasible to pre-compute all such behaviors to be able to store them in Elasticsearch and even if it was possible, such an index would be enormous in size. We had to switch to a lazy computation approach where we would pre-compute just enough data that would be needed to perform quick searches and at search time we would do the rest of the computation that was specific to user query. 

Initially we were trying to write plugins or customizations for Elasticsearch to adapt it to such a lazy computation approach but soon it became clear that it just won’t work and we had to create our own homegrown distributed compute and search platform.

 

Moving fast without breaking things needs sophisticated testing

Every month we release one major release of our software. Currently, each of these releases includes about 900 changes (git commits); and this is just going to increase as we hire more engineers. At this rate of change, we have to have a lot of testing in place to make sure we don’t have regressions in our releases. 

Every git commit is required to be verified by Jenkins jobs that run thousands of unit and integration tests to ensure there are no regressions. Any test failure would prevent the change from getting merged. In addition to these tests, we also use Error Prone to detect common bugs and Checkstyle to enforce a consistent coding style.

We also have many periodic tests that every few hours run more expensive tests against latest merged changes. These tests typically take a few hours to complete and hence it is not feasible to run them on individual changes. Instead when they detect issues, we use git bisect to identify the root causes. Not only these periodic tests check for correctness, they also ensure there are no performance regressions. These tests upload their performance results to SignalFx and we receive alerts on Slack and email if there are significant changes.

 

Are we done?

While we believe we have already built a product that is a significant step forward on how networks are managed and operated, our journey is 1% complete. Our vision is to become the essential platform for the whole network experience and we have just started in that direction. If this is something that interests you please join us. We are hiring for key  positions across several departments. Note that having prior networking experience  is not a requirement for most of our software engineering positions.

 

If you operate a large-scale complex network, please request a demo to see how our software can de-risk your network operations and return massive business value.