The art of debugging network protocol problems

In network protocol analysis, many have learnt the hard way not to trust the various logs and other debugging aids that programmers put in their code, but to look directly at the real data exchanged on the wire. Unfortunately, that requires a lot more time and involvement.

A network protocol debugging cycle is pretty much always the same: reproduce the problem, collect network packets on the wire, analyze them, and fix the problem. Let’s look at the issues surrounding each individual step.

Reproducing the problem

Being able to reproduce a problem is the cornerstone of the art of debugging, whether the problem is hardware, software, or network based. The time spent solving a problem multiplied by the quantity of available information about it is pretty much a constant, so accumulating as much information as possible is a sure way to speed up resolution. Nothing makes it easier to accumulate data than being able to reproduce a problem on demand, so that should be the first step in debugging a problem.

Unfortunately there are already plenty of obstacles here, ranging from the unwillingness of system administrators to run experiments in production to the difficulty of building a meaningful setup in the confines of a lab.

In the worst cases the only solution is simply to capture as much data as possible and hope that the problem will manifest itself while the data are being collected, and that the data will contain enough information to solve the problem. This is rarely the case, and at best one can hope to get enough clues to decide what kind of data to collect the next time the problem happens.

Constantly collecting packets just to be sure that the correct data will be collected when a problem happens also creates security and privacy issues.

Collecting data

Collecting the data is always fraught with issues. We already hinted that logs should never, ever be trusted to contain a faithful view of what is happening in the network, because the data in the logs were selected by the programmer of the code, and programmers have their own biases about how the network protocol is supposed to work. The best solution is to use a program like tcpdump or dumpcap (the capture component of Wireshark) to capture the real packets directly on the network, but then the question is where to run this program.
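To make the idea concrete, here is a minimal capture sketch in Python using the scapy library rather than the command-line tools; the interface name, packet count, and output file are placeholder assumptions, and capturing requires raw-socket privileges:

    from scapy.all import sniff, wrpcap

    # Capture 1000 packets on the given interface ("eth0" is a placeholder).
    packets = sniff(iface="eth0", count=1000)

    # Save them in pcap format so they can be opened in Wireshark later.
    wrpcap("capture.pcap", packets)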

If the system that generates or consumes the packets is accessible, then running the packet collection program there seems to be a good choice, until one realizes that it only proves that the packets were handed to the network card, not that they were really sent down the wire.

Using a network tap (which can be an Ethernet hub, a mirroring port on a switch, or a real network tap) is a better option, as we then have the certainty that the packets collected really crossed the tap. One issue is that some types of network tap need to be inserted into the network, requiring the link to be cut momentarily, which can be a no-no in production. In some cases a network tap can also slow down the link it is installed on, another concern for production use.

On top of that, collecting packets from any network element requires resources: bandwidth, CPU, and storage. Bandwidth and CPU act as bottlenecks, which may result in some packets not being captured. Even if capturing everything were possible, storing the data can still be a problem.

The common solutions to these problems are to reduce the number of packets captured and to reduce the size of the packets stored. The question then is: what are the right criteria for doing this? If not enough data are captured, there is a high risk that the relevant information will not be there, forcing additional collection runs with different filters. Murphy’s law being what it is, the missing data are probably exactly the ones needed to solve the problem. On the other hand, if too much data are collected then, in addition to the risk of not capturing everything, the amount of data may be too large to process effectively.
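Here is a hedged sketch of both levers, again with Python and scapy: a BPF capture filter reduces how many packets are kept, and truncating each frame to a snap length, in the spirit of tcpdump’s -s option, reduces how much of each packet is stored (the interface, filter, and 96-byte snap length are arbitrary assumptions):

    from scapy.all import Ether, sniff, wrpcap

    SNAPLEN = 96  # keep the headers, drop most of the payload

    # The BPF filter limits which packets are captured at all.
    packets = sniff(iface="eth0", filter="tcp port 443", count=500)

    # Truncating each frame limits how much of each packet is stored;
    # re-parsing the truncated bytes keeps the result writable as pcap.
    truncated = [Ether(bytes(p)[:SNAPLEN]) for p in packets]
    wrpcap("filtered_truncated.pcap", truncated)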

The problem is even worse when the people responsible for capturing the data are not the people analyzing them, as they may have different ideas of what the problem is. They will almost certainly filter the data differently according to their own biases, requiring more debugging cycles than necessary.

Analyzing the data

Analyzing the collected packets generally requires a graphical tool like Wireshark (born Ethereal). A file containing the collected packets is loaded into the program, which displays them in a form close to how they are described in the various standards documents that define the network protocols. Then, by filtering packets and applying some of the available analysis tools, one can narrow down the cause of the problem.

The first issue is the quantity of data to process. The less filtering was applied during the capture, the bigger the file will be, and the longer it will take to find the few packets related to the problem (think of finding a specific needle in a stack of needles). Wireshark cannot run a filter or analysis tool across the multiple cores or processors that may be available, so buying a more powerful computer is not even an option there. In some cases, a file has to be split into manageable chunks just to load it at all, which further complicates finding the problem.
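Splitting can at least be done as a streaming operation, so the whole file never has to fit in memory; here is a sketch in Python with scapy, where the input file name and chunk size are placeholder assumptions:

    from scapy.utils import PcapReader, PcapWriter

    CHUNK = 100_000           # packets per output file
    writer, index = None, 0

    # PcapReader streams packets one at a time instead of loading the file.
    for count, pkt in enumerate(PcapReader("huge.pcap")):
        if count % CHUNK == 0:            # start a new chunk
            if writer is not None:
                writer.close()
            index += 1
            writer = PcapWriter("chunk-%03d.pcap" % index)
        writer.write(pkt)

    if writer is not None:
        writer.close()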

Moreover, the query language used in Wireshark is limited to Boolean and comparison operators on the fields of individual packets. There is no provision for selecting packets using a higher-level understanding of how the network protocol works. For instance, one cannot write a query that selects all the packets that compose a transaction with retransmissions, or all the packets that deviate from the protocol standard. And even without entering into that level of sophistication, if no Wireshark developer has written a dissector for the protocol being debugged (perhaps because it is proprietary, or because few people use it), then Wireshark will not be of much help.
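As an illustration of the kind of higher-level query the display filter language cannot express, here is a deliberately simplistic sketch in Python with scapy that flags likely TCP retransmissions by remembering, per flow, the sequence numbers already seen (no window or keep-alive handling; the input file name is a placeholder):

    from collections import defaultdict
    from scapy.all import IP, TCP, rdpcap

    seen = defaultdict(set)   # flow tuple -> set of (seq, payload length)

    for pkt in rdpcap("capture.pcap"):
        if IP in pkt and TCP in pkt:
            tcp = pkt[TCP]
            flow = (pkt[IP].src, tcp.sport, pkt[IP].dst, tcp.dport)
            key = (tcp.seq, len(tcp.payload))
            # The same sequence number with a non-empty payload showing up
            # twice in a flow is a hint of a retransmission.
            if key in seen[flow] and len(tcp.payload) > 0:
                print("possible retransmission:", flow, "seq", tcp.seq)
            seen[flow].add(key)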

Wireshark does provide analysis tools for some of these sorts of things, but they are limited to what the Wireshark developers thought was useful. This is not specific to Wireshark, though, as it is a common shortcoming of graphical user interfaces: in exchange for making the program quickly and easily usable by a large base of users, its usefulness is restricted to the top use cases, without providing options for problems that do not fit those use cases.

Thankfully, Wireshark is free software, so there is an option to break out the C compiler and extend Wireshark itself. Doing so is probably not a problem for someone who has already exhausted the built-in possibilities of Wireshark, but it certainly adds a lot of time and effort to a debugging session.

One additional irritating issue with Wireshark is that the language used to select which packets to display is different from the language used to select which packets to capture: for example, the capture filter "tcp port 80" corresponds to the display filter "tcp.port == 80".

Fixing the problem

Now that the problem is understood, it is time to fix it. In a perfect world, everything would be tested in a lab where the source code of all the network elements involved in the problem is available for modification. Considering the huge task of tracking down the correct source code, together with the tools required to build it and install it on the network elements, it is clear we are not living in a perfect world.

In the real world, we rarely have a fully equipped lab in which to reproduce problems, and we rarely have access to the source code of the network elements. So, unless the problem is a bug in a piece of software under the direct control of the person debugging it, one has to wait for the vendor to recognize and fix the issue, which can take years. Things being what they are, short-term workarounds are generally applied instead, solutions that rarely improve the stability or maintainability of the whole system.

Nephelion, a better network protocol debugging tool

Nephelion is the codename of a project whose goal is to solve some of the problems listed in this article using a combination of hardware and software.

Follow this blog for more updates during the countdown to the release of our first product, sometime at the end of 2015.

2 thoughts on “The art of debugging network protocol problems”

  1. That’s a really interesting problem to solve. In two former companies I used to run tcpdump in combination with Unix scripts to scan through the logs.

    Storage was the biggest issue, especially because I sometimes had to analyse audio issues from the content of VoIP packets. I had set up a system to store a few hours of traffic at a time, meaning that a problem had to be reported during that small capture window for the captures to be of any use. Granted, these were not protocol issues, which is what you’re trying to address.

    Have you thought about applying Big Data techniques to detect network protocol issues in very large volumes of production traffic?

    Good luck with your product.

  2. Nephelion does not differentiate between pure protocol issues and, say, audio call quality issues. Both are problems to be solved, and a tool aimed at solving problems should help wherever the analysis of the problem brings us. It will be possible to do both: verify protocol conformance, recalculate MOS scores, and more. A future post will explain a little bit about that.

    As for the Big Data techniques, it is interesting to note that this project started a few years back as a Big Data project. I was at the time very frustrated by how slow Wireshark was at processing huge capture files. Using MapReduce to process a capture file is possible, although there are two obstacles. The first is that the capture file format is definitely not suited to parallel processing; even pcapng makes it unnecessarily difficult to process packets in parallel, and I have given up, for now, on trying to fix that. But that is secondary to the fact that processing a packet capture is far from easily parallelizable. Network packets are not really independent of each other, as very few protocols are stateless. That means packets need to be selected, reordered, and grouped together to rebuild some kind of context, which is then used to make sense of more captured packets (think of all the DNS/STUN/TURN/SIP packets that need to be processed together to get a chance of finding the RTP/RTCP packets that belong to a SIP call). A small sketch of this grouping idea follows at the end of this reply.

    Since then, Nephelion has evolved into a realtime debugging tool, so processing huge capture files is no longer its main purpose. But it kept from its Big Data origins the ability to use all the CPU cores in parallel in its processing pipeline.
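    Here is a hedged sketch of that grouping step in Python with scapy: packets are bucketed by a normalized address/port pair so that both directions of a flow share one key, and each bucket then becomes an independent unit of work for a worker pool (the file name is a placeholder, and real protocols such as a SIP call spanning DNS/STUN/TURN/RTP need far more elaborate grouping keys):

        from collections import defaultdict
        from scapy.all import IP, TCP, UDP, rdpcap

        flows = defaultdict(list)

        for pkt in rdpcap("capture.pcap"):
            if IP not in pkt:
                continue
            l4 = pkt[TCP] if TCP in pkt else pkt[UDP] if UDP in pkt else None
            if l4 is None:
                continue
            # Sort the endpoints so both directions of a flow share one key.
            ends = sorted([(pkt[IP].src, l4.sport), (pkt[IP].dst, l4.dport)])
            flows[tuple(ends)].append(pkt)

        # Each bucket can now be processed in parallel with the others.
        print(len(flows), "flows ready for parallel processing")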
