Watch. Catch. Patch.

A previous article covered the challenges faced when debugging network protocol problems.
In this article we reveal some of the solutions that we are developing to minimize the work that is currently required before getting to the most interesting part: the investigation and correction of the problem.


One frequently overlooked issue in network protocol debugging is the quality of the data. It is not rare to discover, after a few hours of work, that the data provided or captured are in fact irrelevant to the problem at hand. That can happen because the person or mechanism that captured them made assumptions about the problem, made a mistake during the capture, or misunderstood the problem altogether. All these issues can be reduced if the person doing the analysis can capture the data with as few interposing layers as possible, eliminating unnecessary human, software or hardware intermediaries.

To achieve this goal, we recommend our own device, called an E-mim (patent pending). This is an enhanced network tap that is light enough to be installed hanging between two Ethernet cables. It is powered by the USB 3.0 cable connecting it to the computer that manages the data, and when not powered it simply acts as if the two Ethernet cables were directly connected. This allows it to be left in place between debugging sessions, even with devices requiring Power over Ethernet.

This device already exists as an alpha release prototype, and we plan to have at least one more alpha prototype and one beta prototype before starting a Kickstarter campaign for the final product, sometime in the last quarter of 2015.


A normal network protocol debugging cycle consists of capturing and examining a set of packets that contains clues to the problem being diagnosed. This is an iterative process: it starts with an overview of all the packets exchanged, then filters out the packets that are not relevant and examines the details of those that are, and ends by pinpointing the source of the problem.
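The iterative narrowing described above can be sketched in a few lines of Python. The packet records below are hypothetical stand-ins for whatever a capture tool would actually decode; only the overview-filter-inspect pattern is the point:

```python
# Hypothetical already-decoded packets (the dict layout is an
# assumption, not the format of any real capture tool).
packets = [
    {"no": 1, "proto": "DNS", "src": "10.0.0.5", "info": "query example.com"},
    {"no": 2, "proto": "TCP", "src": "10.0.0.5", "info": "SYN"},
    {"no": 3, "proto": "TCP", "src": "10.0.0.9", "info": "RST"},
]

# Step 1: overview -- how many packets of each protocol were exchanged.
overview = {}
for p in packets:
    overview[p["proto"]] = overview.get(p["proto"], 0) + 1

# Step 2: filter down to what looks relevant (here, TCP resets).
suspects = [p for p in packets if p["proto"] == "TCP" and "RST" in p["info"]]

# Step 3: examine the details of the few packets that remain.
for p in suspects:
    print(p["no"], p["src"], p["info"])
```

Each pass of the real cycle refines the filter in step 2 until the remaining packets explain the problem.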

The E-mim device is connected through its USB port to a computer running Windows, Mac OS or Linux. The person investigating the problem uses the Nephelion software installed on this computer to interact with the E-mim device. Although the most common tasks can be executed with simple commands, Nephelion is, at its core, a powerful developer tool that can be used with any language that runs on a Java Virtual Machine.

With the increasing adoption of encryption in network protocols, traditional tools will slowly lose their usefulness, as they won't be able to look at the packets flowing through the network. In contrast, Nephelion is capable of receiving encryption keys from devices that are under the control of its user, so problems can be debugged without weakening the security of the whole network.


The previous step provides at best a sense of what went wrong, but it can never result in absolute certainty. The problem can be marked as resolved only after the software that created the issue is fixed and tested. But software takes time to fix, which may mean months or even years if the source code is not directly available, as is the case for most network devices.

The E-mim device is, in fact, more than a simple network tap in that it can also modify network packets on the fly. This is how the diagnosis of a network problem can be immediately put to the test by patching the faulty packets directly as they travel through the network.
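Any in-path device that rewrites a packet must also repair the affected checksums before forwarding it. As an illustration only (a simplified sketch assuming a 20-byte IPv4 header, not Nephelion's actual implementation), here is what patching a single header field involves:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """Standard IPv4 header checksum: one's-complement sum of 16-bit words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total > 0xFFFF:                    # fold carries back in
        total = (total >> 16) + (total & 0xFFFF)
    return ~total & 0xFFFF

def patch_ttl(ip_header: bytes, new_ttl: int) -> bytes:
    """Rewrite the TTL of a 20-byte IPv4 header and fix the checksum,
    as any device modifying packets in flight must do."""
    hdr = bytearray(ip_header)
    hdr[8] = new_ttl                         # TTL lives at byte offset 8
    hdr[10:12] = b"\x00\x00"                 # zero the checksum field first
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A receiver validates the result by summing the whole header, checksum included; a correct header folds to zero.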

By providing a simple way to watch, catch and patch network protocol problems directly where and when they happen, Nephelion and the E-mim device provide a unique solution to the increasing complexity of network protocols.

The art of debugging network protocol problems

In network protocol analysis, many have learnt the hard way not to trust the various logs and other debugging aids that programmers put in their code, but to look directly at the real data exchanged on the wire. Unfortunately, that requires a lot more time and involvement.

A network protocol debugging cycle is pretty much always the same: Reproduce the problem, collect network packets on the wire, analyze them, fix the problem. Let’s look at the issues surrounding each individual step.

Reproducing the problem

Being able to reproduce a problem is the cornerstone of the art of debugging, be it hardware, software or network based. The time spent on solving a problem multiplied by the quantity of available information related to this problem is pretty much a constant, so accumulating as much information as possible about the problem is a sure way to speed up its resolution. Nothing permits accumulating more data than being able to reproduce a problem on demand, so that should be the first step in debugging a problem.

Unfortunately there are already a lot of obstacles there, ranging from the unwillingness of system administrators to run experiments in production to the difficulty of building a meaningful setup in the confines of a lab.

In the worst cases the only solution is simply to capture as much data as possible and to hope that the problem will manifest itself while data is being collected – and that the data will contain enough information to solve the problem. This is rarely the case, and at best one can hope to get enough clues to decide what kind of data needs to be collected the next time the problem happens.

Constantly collecting packets just to be sure that the correct data will be collected when a problem happens also creates security and privacy issues.

Collecting data

Collecting the data is always fraught with issues. We already hinted that logs should never, ever be trusted to contain a faithful view of what is happening in the network, because the data contained in the logs were selected by the programmer of the code, and programmers have their own biases about how the network protocol is supposed to work. The best solution is to use a program like tcpdump or dumpcap (the capture component of Wireshark) to capture the real packets directly on the network, but then the question is where to run this program.

If the system that generates or consumes the packets is accessible, then running the packet collection program there seems to be a good choice, until one realizes that it only indicates that the packets were sent to the network card, not that they were really sent down the wire.

Using a network tap (which can be an Ethernet hub, a mirroring port on a switch or a real network tap) is a better option, as we now have the certainty that the packets collected really crossed the network tap. One issue is that some types of network tap need to be inserted into the network, which requires cutting the link momentarily – something that can be a no-no in production. In some cases a network tap can also slow down the link it is installed in, which is another concern when using it in production.

On top of that, collecting packets from any network element requires resources to do so: bandwidth, CPU and storage. Bandwidth and CPU act as bottlenecks, which may result in some packets not being captured. Even if capturing everything were possible, data storage can still be a problem.
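A back-of-the-envelope calculation shows why storage becomes a problem so quickly; all figures below (link speed, average packet size, snap length) are illustrative assumptions:

```python
# Cost of capturing everything on a saturated 1 Gbit/s link.
# All figures are illustrative assumptions.
link_bits_per_s = 1_000_000_000
bytes_per_day = link_bits_per_s / 8 * 86_400       # 86,400 seconds in a day

# Truncating each packet to a "snap length" keeps only the headers
# and shrinks the capture proportionally.
avg_packet_size = 800   # assumed average packet size, in bytes
snaplen = 96            # assumed truncation length, in bytes
reduction = snaplen / avg_packet_size

print(f"{bytes_per_day / 1e12:.1f} TB/day; snaplen keeps {reduction:.0%}")
# → 10.8 TB/day; snaplen keeps 12%
```

Even at 12% of full size, a day of unfiltered capture exceeds a terabyte – hence the pressure to filter at capture time.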

The common solutions to these problems are to reduce the number of packets captured and to reduce the size of the packets stored. The question now is: what are the right criteria for doing this? If not enough data are captured, there is a high risk that the relevant information will not be there, mandating running more collections with different filters. Murphy’s law being what it is, the missing data are probably the ones needed to solve the problem. On the other hand, if too much data are collected then, in addition to the risk of not capturing everything, the amount of data may be too large to be processed effectively.

The problem is even worse when the people responsible for capturing the data are not the people analyzing them, as they may have different ideas of what the problem is. They would certainly filter the data differently according to their biases, requiring more debugging cycles than necessary.

Analyzing the data

Analyzing the packets collected generally requires a graphical tool like Wireshark (born Ethereal). A file containing the collected packets is loaded into the program, which displays them in a form that is close to how they are described in the various standard documents that define the network protocols. Then by filtering packets and by applying some of the analysis tools available, one can narrow down the reasons for the problem.

The first issue is the quantity of data to process. The less filtering was applied during the capture, the bigger the file will probably be, and the longer it will take to find the few packets that are related to the problem (think of finding a specific needle in a stack of needles). Wireshark is not capable of running a filter or analysis tool using the multiple cores or processors that may be available, so buying a more expensive computer is not even an option there. In some cases, a file has to be split into manageable chunks just to load it, which complicates the task of finding the problem.

Moreover, the query language used in Wireshark is limited to Boolean and comparison operators on fields in individual packets. There is no provision to select packets using a higher level of understanding of how the network protocol works. For instance, one cannot create a query that would select all the packets that compose a transaction with retransmission, or all the packets that deviate from the protocol standard, etc. But even without entering into that level of sophistication, if a Wireshark developer has not written a dissector for the protocol being debugged (which may be because it is a proprietary protocol or because few people use it), then Wireshark will not be of much help.
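To make the limitation concrete, here is the kind of stateful, cross-packet query a per-packet Boolean filter cannot express – flagging retransmissions by remembering what earlier packets contained. The `(flow, seq)` tuples are a hypothetical pre-parsed representation of TCP packets:

```python
# A stateful query: a packet is a retransmission if its flow has
# already carried the same sequence number earlier in the capture.
def find_retransmissions(packets):
    seen = set()
    retransmitted = []
    for index, (flow, seq) in enumerate(packets):
        if (flow, seq) in seen:
            retransmitted.append(index)
        seen.add((flow, seq))
    return retransmitted
```

Deciding whether packet n qualifies requires remembering packets 0 through n-1, which is exactly what a filter evaluated on one packet at a time cannot do.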

There are protocol analysis tools that can be used to do these sorts of things, but they are limited to what the Wireshark developers thought was useful. This is not specific to Wireshark though, as this is a common shortcoming of graphical user interfaces: in exchange for making the program quickly and easily usable by a large base of users, its usefulness is restricted to the top use cases without providing options for problems that do not fit these use cases.

Thankfully, Wireshark is free software, so there is an option to break out the C compiler and extend Wireshark itself. Doing so is probably not a problem for someone who has already exhausted the built-in possibilities of Wireshark, but it certainly adds a lot of time and effort to a debugging session.

One additional irritating issue with Wireshark is that the language used to select the packets to display is different from the language used to select what packets to capture.

Fixing the problem

Now that the problem is understood, it is time to fix it. In a perfect world, everything would be tested in a lab, where the source code of all the network elements involved in the problem is available for modification. Considering the huge task of tracking down the correct source code and the tools required to build and install it on the network elements, it is clear we are not living in a perfect world.

In the real world, we rarely have a fully equipped lab to reproduce problems and we rarely have access to the source code of the network elements. So, unless the problem was a bug in a piece of software under the direct control of the person debugging it, one has to wait for the vendor to recognize and fix the issue, which can take years. Things being what they are, short-term solutions in the form of workarounds are generally applied – solutions that do not increase the stability or maintainability of the whole system.

Nephelion, a better network protocol debugging tool

Nephelion is the codename of a project whose goal is to solve some of the problems listed in this article using a combination of hardware and software.

Follow this blog for more updates during the countdown to the release of our first product, sometime at the end of 2015.