Ticket #877 (closed defect: fixed)

Opened 2 years ago

Last modified 3 months ago

if_ plugin reports abnormal values (petabits!)

Reported by: dooblem Assigned to: nobody
Priority: normal Milestone: Munin 2.0
Component: plugins Version: 1.4.3
Severity: normal Keywords:
Cc:

Description

Hello all,

On some servers, we have problems with the if_ plugin reporting huge abnormal values (see the attached graph).

I've seen: http://munin.projects.linpro.no/wiki/HowToWritePlugins#DERIVEvs.COUNTER

It seems that the problem is the COUNTER rrd type used by the plugin. Patching the plugin to replace the COUNTER type with the DERIVE type solved the problem.

On the attached graph, the peaks occur at the same time as we reboot our server, which we do on a weekly basis. When we reboot the server, the /proc/net/dev counters are reset. I think this is the cause of the problem.
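To illustrate the reboot behaviour described above, here is a minimal Python sketch of how the two data source types turn a counter reset into a rate. It is a simplification, assuming (as far as I understand RRDtool) that COUNTER treats any decrease as a 32-bit wrap first and then as a 64-bit wrap, and that DERIVE with a minimum of 0 stores a negative rate as unknown. The function names and sample values are hypothetical.

```python
WRAP_32 = 2**32
WRAP_64 = 2**64

def counter_rate(prev, cur, seconds):
    """Rate the way a COUNTER-style source would see it: a decrease is assumed to be a wrap."""
    delta = cur - prev
    if delta < 0:
        delta += WRAP_32              # first guess: the counter wrapped at 32 bits
    if delta < 0:
        delta += WRAP_64 - WRAP_32    # still negative: assume it wrapped at 64 bits
    return delta / seconds

def derive_rate(prev, cur, seconds, minimum=0):
    """Rate the way a DERIVE-style source with a minimum would see it: None means unknown."""
    rate = (cur - prev) / seconds
    return None if rate < minimum else rate

# Hypothetical byte counts read 300 s apart, with a reboot in between.
before_reboot = 5_000_000_000_000
after_reboot = 1_000_000

spike_bits = counter_rate(before_reboot, after_reboot, 300) * 8
print(f"COUNTER: {spike_bits / 1e15:.0f} Pbit/s")
print("DERIVE :", derive_rate(before_reboot, after_reboot, 300))
```

Under these assumptions the COUNTER path reports a spike in the hundreds of petabits per second, while the DERIVE path drops the sample as unknown.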

Why not patch the if_ plugin to use the DERIVE type?

Thanks in advance, and for all your work on this great project! Marc

Attachments

if_bond0-month.png (15.9 kB) - added by dooblem on 03/04/10 15:47:45.

Change History

03/04/10 15:47:45 changed by dooblem

  • attachment if_bond0-month.png added.

04/14/11 09:36:33 changed by bldewolf

There are two directions to go with this:

1) Change COUNTER to DERIVE, add min value

  • pro: Discards bad data caused by rollover detection during reboots for both 32/64-bit systems
  • con: 32-bit systems at <Gbit link speeds benefit from rollover detection

2) Add max value for when it isn't successfully detected

  • pro: Discards bad data from mistaken rollovers, while allowing real rollovers to still occur
  • con: a 64-bit counter on a reboot can be mistaken for a 32-bit rollover, messing with 64-bit graphs
  • con: have to use a magic number (for my testing I used 1 Pb)
  • con: detection doesn't exist in most of the platform-specific plugins, so they'd always use magic numbers
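The first con of option 2 can be sketched in Python. This is only an illustration under assumptions: the wrap handling is a simplified model of COUNTER, the ceiling treats the "1 Pb" magic number as 1 Pbit/s converted to bytes per second, and the function name and sample values are hypothetical.

```python
WRAP_32 = 2**32
WRAP_64 = 2**64
MAX_RATE = 1e15 / 8  # assumed ceiling: 1 Pbit/s expressed in bytes per second

def counter_rate_with_max(prev, cur, seconds, maximum=MAX_RATE):
    """COUNTER-style rate with a ceiling: rates above the maximum become unknown (None)."""
    delta = cur - prev
    if delta < 0:
        delta += WRAP_32              # assume a 32-bit wrap first
    if delta < 0:
        delta += WRAP_64 - WRAP_32    # otherwise assume a 64-bit wrap
    rate = delta / seconds
    return None if rate > maximum else rate

# Reboot with a large 64-bit counter: the bogus rate exceeds the ceiling and is discarded.
large = counter_rate_with_max(5_000_000_000_000, 1_000_000, 300)

# Reboot with a 64-bit counter still below 2**32: the reset looks like a 32-bit wrap
# and yields a plausible but wrong rate that the ceiling cannot catch.
small = counter_rate_with_max(3_000_000_000, 1_000_000, 300)

print("large counter:", large)
print("small counter:", small)
```

In this model the second case passes the ceiling with a rate of a few megabytes per second, which is the kind of bad data that would quietly end up in 64-bit graphs.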

Also, neither of these routes causes loss of information (both trunk and 1.4-stable can apply these settings while preserving existing data).

I originally was leaning more towards 2, but then I ran into the second con which definitely makes it worse. I feel bad leaving the 32-bit systems out in the cold, but if they're using gig it's likely they won't get reliable information from if_ anyway (if you push gig full tilt, a 32-bit byte counter wraps in about 34 seconds).
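The wrap time mentioned above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, assuming the interface counter counts bytes, the link runs at full line rate, and 1 Gbit/s means 10^9 bits per second; the function name is hypothetical.

```python
def seconds_to_wrap(counter_bits, link_bits_per_second):
    """Seconds for a byte counter of the given width to wrap at full line rate."""
    bytes_per_second = link_bits_per_second / 8
    return 2**counter_bits / bytes_per_second

for label, speed in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
    wrap = seconds_to_wrap(32, speed)
    print(f"{label}: 32-bit byte counter wraps after {wrap:.1f} s")
```

With these assumptions a gigabit link wraps a 32-bit counter in roughly half a minute, so several wraps can fit into one five-minute polling interval and cannot be reconstructed. At 100 Mbit/s the wrap takes longer than five minutes, which is why rollover detection is still useful below gigabit speeds.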

So the best course of action seems to be changing if_ from COUNTER to DERIVE. Is anything wrong with my reasoning? I'll wait a few days before pushing the change.

04/14/11 11:52:04 changed by dooblem

Hello,

Thanks for your response.

We are using Munin on hundreds of servers. All of them have the if_ plugin patched with the DERIVE type, and we have seen no problems.

We patched it one year ago.

Marc

04/15/11 06:42:37 changed by bldewolf

  • status changed from new to closed.
  • resolution set to fixed.

Thanks for your feedback. Sorry for the delay on the bug, but it's good to know a change like this has seen plenty of time in production.

I also got positive feedback from Nicolai on this, so I went ahead and fixed it in 1.4-stable and trunk in r4163.

04/15/11 09:55:37 changed by dooblem

Thanks a lot, bldewolf. Marc