Ticket #843 (closed patch: fixed)

Opened 2 years ago

Last modified 4 months ago

plugins fw_conntrack and fw_forwarded_local cause severe network lags on big firewalls

Reported by: feiner.tom
Assigned to: kenyon
Priority: normal
Milestone: Munin 1.4.4
Component: plugins
Version: 1.2.5
Severity: normal
Keywords:
Cc: Sven Hartge <sven@svenhartge.de>

Description (Last modified by feiner.tom)

Forwarded from: http://bugs.debian.org/565565

Hi.

Some days ago I noticed a very severe problem with the fw_conntrack and fw_forwarded_local plugins on one of my firewalls.

When the system exceeded about 20,000 conntrack entries, both plugins would interrupt all data flow through this system for about 5 to 10 seconds, long enough for a failover mechanism to kick into action.

I can manually reproduce this by simply using "cat /proc/net/ip_conntrack" or "cat /proc/net/nf_conntrack".

Now look at the runtimes in comparison with the usage of "conntrack -L":

root@fw01-1:~# time cat /proc/net/nf_conntrack | wc -l
5657

real    0m0.608s
user    0m0.010s
sys     0m0.600s
root@fw01-1:~# time cat /proc/net/ip_conntrack | wc -l
5703

real    0m0.580s
user    0m0.000s
sys     0m0.580s
root@fw01-1:~# time conntrack -L |wc -l
5481

real    0m0.071s
user    0m0.050s
sys     0m0.020s

Even on an unloaded system, reading the files in /proc takes more than half a second, while the conntrack command takes only about a tenth of that time.
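For reference, the ratio can be worked out directly from the "real" times in the transcript above. This is only a small arithmetic sketch; the dictionary keys and the helper name percent_of are made up for the illustration.

```python
# Wall-clock ("real") times in seconds, copied from the transcript above.
real_seconds = {
    "cat /proc/net/nf_conntrack": 0.608,
    "cat /proc/net/ip_conntrack": 0.580,
    "conntrack -L": 0.071,
}


def percent_of(baseline, candidate="conntrack -L"):
    """Return the candidate's runtime as a percentage of the baseline's."""
    return 100.0 * real_seconds[candidate] / real_seconds[baseline]


for source in ("cat /proc/net/nf_conntrack", "cat /proc/net/ip_conntrack"):
    print(f"conntrack -L needs {percent_of(source):.1f}% of the time of {source}")
```

On these numbers, conntrack -L needs roughly 12% of the time taken by either /proc file.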

With more and more connections in the conntrack table the times scale exponentially when using the files in /proc, while "conntrack -L" nearly stays the same.

The disturbing problem is the total halt of all network operations during the cat from /proc, while conntrack -L does not interrupt anything.

While "cat /proc/net/ip_conntrack" does no harm on small systems, bigger and loaded systems will be severely impacted by this problem.

For the fw_conntrack and fw_forwarded_local plugins found in 1.2.5 (pre-1.4) you can simply replace the "cat /proc/net/ip_conntrack" command with a "conntrack -L", because the formats of both are identical.
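Because both sources share the same line layout, the same counting code can consume either one. The following is a minimal sketch of such a counter, not the plugin's actual code: the sample entries use made-up addresses, and the function name count_entries and the choice of counted categories are assumptions for illustration.

```python
from collections import Counter

# Illustrative entries in the layout shared by /proc/net/ip_conntrack and
# "conntrack -L". Addresses and ports are invented for this example.
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.0.2.10 sport=40000 dport=443 src=192.0.2.10 dst=10.0.0.5 sport=443 dport=40000 [ASSURED] use=1
tcp      6 100 TIME_WAIT src=10.0.0.6 dst=192.0.2.10 sport=40001 dport=80 src=192.0.2.10 dst=10.0.0.6 sport=80 dport=40001 [ASSURED] use=1
udp      17 25 src=10.0.0.5 dst=192.0.2.53 sport=5353 dport=53 src=192.0.2.53 dst=10.0.0.5 sport=53 dport=5353 use=1
"""


def count_entries(lines):
    """Count entries in total, per TCP state, and per non-TCP protocol."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts["total"] += 1
        protocol = fields[0]
        if protocol == "tcp" and len(fields) > 3:
            counts[fields[3].lower()] += 1  # fourth field is the TCP state
        else:
            counts[protocol] += 1
    return counts


print(dict(count_entries(SAMPLE.splitlines())))
```

The function only needs an iterable of lines, so it does not care whether they came from the /proc file or from the conntrack command.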

With /proc/net/nf_conntrack this is not yet possible.

Regards, Sven.

Attachments

fw_conntrack (3.5 kB) - added by alext on 09/23/11 17:35:08.
fw_forwarded_local (1.9 kB) - added by alext on 09/23/11 17:42:06.

Change History

01/17/10 07:55:10 changed by feiner.tom

  • description changed.

01/17/10 08:06:35 changed by feiner.tom

Hi,

Just a quick note: the conntrack binary is not installed by default on my Debian and Ubuntu systems, so using it instead of cat /proc/net/ip_conntrack will break things for many users.

Tom

01/18/10 13:16:13 changed by janl

But those users are probably not in need of a fw_ plugin?

01/18/10 13:16:20 changed by janl

  • milestone changed from Munin 1.5 to Munin 1.4.4.

01/18/10 13:55:15 changed by SvenHartge

The best way might be to check for a conntrack binary and use it if available, falling back to the old method if none is found. The fallback could log a warning to the node log about the problems that arise when reading /proc/net/ip_conntrack directly.
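The fallback logic described above could look roughly like the sketch below. It is an illustration only, not code from either plugin: the function name choose_source, the injectable which and readable parameters, and the warning wording are all assumptions, and the warning goes to stderr here rather than to a real node log.

```python
import os
import shutil
import sys

# Fallback files, tried in this order when no conntrack binary is found.
PROC_FILES = ("/proc/net/nf_conntrack", "/proc/net/ip_conntrack")


def choose_source(which=shutil.which,
                  readable=lambda path: os.access(path, os.R_OK)):
    """Return ("command", argv), ("file", path), or None if nothing is usable."""
    binary = which("conntrack")
    if binary:
        return ("command", [binary, "-L"])
    for path in PROC_FILES:
        if readable(path):
            # Warn, because reading these files can stall traffic on big tables.
            print(f"warning: conntrack binary not found, reading {path}; "
                  "this can stall traffic on large conntrack tables",
                  file=sys.stderr)
            return ("file", path)
    return None


print(choose_source())
```

Passing the lookup functions as parameters keeps the decision testable on machines that have neither the binary nor the /proc files.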

As for the Debian packages: a "Recommends: conntrack" or "Suggests: conntrack" would be in order, I think.

09/23/11 17:35:08 changed by alext

  • attachment fw_conntrack added.

09/23/11 17:36:52 changed by alext

I've taken the liberty of rewriting these two plugins in Perl.

They now try to use conntrack, /proc/net/nf_conntrack and /proc/net/ip_conntrack in order.

I've also fixed fw_conntrack to count natted connections correctly (#725).
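For readers unfamiliar with how a NATed entry shows up in the conntrack table: one common heuristic is to compare the original-direction addresses with the reply-direction addresses, which mirror each other unless an address was rewritten. The sketch below illustrates that heuristic only. It is not taken from the attached plugin or from #725, it ignores port-only translation, and the function names and sample addresses are invented.

```python
TUPLE_KEYS = ("src", "dst", "sport", "dport")


def parse_tuples(line):
    """Split an entry into original-direction and reply-direction key/value pairs."""
    original, reply = {}, {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or key not in TUPLE_KEYS:
            continue  # state, flags, counters and so on
        # The first occurrence of a key is the original direction,
        # the second occurrence is the reply direction.
        target = reply if key in original else original
        target[key] = value
    return original, reply


def is_natted(line):
    """True if the reply addresses do not mirror the original addresses."""
    original, reply = parse_tuples(line)
    for direction in (original, reply):
        if "src" not in direction or "dst" not in direction:
            return False  # incomplete entry, cannot decide
    return original["src"] != reply["dst"] or original["dst"] != reply["src"]


# Made-up entries: the first has its source rewritten, the second does not.
natted = ("tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.0.2.10 sport=40000 "
          "dport=443 src=192.0.2.10 dst=203.0.113.1 sport=443 dport=40000 [ASSURED] use=1")
plain = ("tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.0.2.10 sport=40000 "
         "dport=443 src=192.0.2.10 dst=10.0.0.5 sport=443 dport=40000 [ASSURED] use=1")
print(is_natted(natted), is_natted(plain))
```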

09/23/11 17:42:06 changed by alext

  • attachment fw_forwarded_local added.

09/23/11 20:01:24 changed by janl

  • type changed from defect to patch.

We should remember to check whether the new plugin generates the same time series in the same way (counter vs. gauge). If it does, it can keep the same name and replace the old one "in place".
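One way to run that check is to compare the field types in the "config" output of the old and new plugin. The sketch below assumes the usual Munin config layout ("field.attribute value" lines, with GAUGE as the type when none is given). The two sample outputs and the function name field_types are hypothetical, not the real output of either plugin.

```python
def field_types(config_text):
    """Map each data field in a plugin's config output to its type."""
    types = {}
    for line in config_text.splitlines():
        name, _, value = line.partition(" ")
        if name.startswith("graph_") or "." not in name:
            continue  # graph-level attribute, not a data field
        field, attribute = name.split(".", 1)
        types.setdefault(field, "GAUGE")  # Munin's type when none is given
        if attribute == "type":
            types[field] = value.strip()
    return types


# Hypothetical config outputs, for illustration only.
OLD_CONFIG = """\
graph_title Connections through firewall
established.label Established
established.type GAUGE
total.label Total
"""
NEW_CONFIG = """\
graph_title Connections through firewall
established.label Established
total.label Total
"""

print(field_types(OLD_CONFIG) == field_types(NEW_CONFIG))
```

If the two mappings match, the new plugin should be able to keep writing to the existing data files under the same name.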

09/23/11 20:31:16 changed by alext

That was my intention when writing them. The config output should be identical to the original ones. The data output should also be the same.

If there are differences, let me know and I'll fix them. I'm currently using these as a drop-in replacement in my network, and the graphs haven't changed at all. They continue as before (except that the NATed line is now correct).

01/30/12 00:43:04 changed by kenyon

  • owner changed from nobody to kenyon.
  • status changed from new to assigned.

01/30/12 01:49:50 changed by kenyon

  • status changed from assigned to closed.
  • resolution set to fixed.

Fixed in r4619, thanks!