Ticket #569 (closed defect: fixed)

Opened 4 years ago

Last modified 3 months ago

munin-update gets awfully confused when a plugin goes awol

Reported by: tore Assigned to: janl
Priority: high Milestone: Munin 1.4.3
Component: master Version: 1.2.3
Severity: major Keywords:
Cc:

Description (Last modified by janl)

I'm running Munin 1.2.3 with the -r911:912 patch (for huge RRD files) applied. Maybe this has been fixed since then, but since it's quite a severe error (data loss) I'm submitting a ticket just to make sure. It had been working fine until I deleted a plugin (/etc/munin/plugins/vlan_bandwidth_apressen-underskog) from the node and forgot to restart munin-node, so the node kept listing it as an available plugin. After that the server process started gathering completely bogus data for a number of the plugins on that host (curiously enough, not all of them). It logged things like this:

Feb 28 09:00:18 [32629] - Unable to update hmg9.no.linpro.net -> vr0.hmg9.no.linpro.net -> vlan_bandwidth_braastadklynga -> out_transit: No such field (no "label" field defined when running plugin with "config").
Feb 28 09:00:18 [32629] - Unable to update hmg9.no.linpro.net -> vr0.hmg9.no.linpro.net -> vlan_bandwidth_braastadklynga -> out_telenor: No such field (no "label" field defined when running plugin with "config").
Feb 28 09:00:18 [32629] - Unable to update hmg9.no.linpro.net -> vr0.hmg9.no.linpro.net -> vlan_bandwidth_braastadklynga -> out_nix: No such field (no "label" field defined when running plugin with "config").

The vlan_bandwidth_* plugins never had fields named out_{transit,telenor,nix} - so it doesn't really make any sense. The plugin that was removed did have those fields though. Compare:

config vlan_bandwidth_braastadklynga
host_name vr0.hmg9.no.linpro.net
graph_order in out
graph_category Bandwidth
graph_args --base 1000
graph_vlabel bps in (-) / out (+)
graph_title Braastadklynga dom0 og ilo (702)
in.label bps
in.cdef in,8,*
in.skipdraw 1
in.type DERIVE
in.min 0
out.label bps
out.cdef out,8,*
out.type DERIVE
out.min 0
out.negative in
.
fetch vlan_bandwidth_braastadklynga
in.value 568914
out.value 748979
.
config vlan_bandwidth_splitout_apressen
host_name vr0.hmg9.no.linpro.net
graph_category Bandwidth
graph_args --base 1000 -l 0
graph_vlabel bps out
graph_total Total
graph_title A-Pressen Interaktiv AS (509) outgoing (split)
out_transit.label transit (except telenor)
out_transit.cdef out_transit,8,*
out_transit.type DERIVE
out_transit.min 0
out_transit.draw AREA
out_telenor.label transit (telenor)
out_telenor.cdef out_telenor,8,*
out_telenor.type DERIVE
out_telenor.min 0
out_telenor.draw STACK
out_nix.label nix1
out_nix.cdef out_nix,8,*
out_nix.type DERIVE
out_nix.min 0
out_nix.draw STACK
.
fetch vlan_bandwidth_splitout_apressen
out_transit.value 513406364719
out_telenor.value 3906463
out_nix.value 89245664210
.

Those are wildcard plugins (the part after the last underscore changes; the fields and other config do not). Another weird thing is that it also created RRD files for the "out" field for some vlan_bandwidth_splitout_ plugins (which don't have any field called "out").
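The log messages above are what one would expect if a "fetch" reply from one plugin were checked against the "config" reply of another. The following Python sketch illustrates that mismatch using a few lines from the transcripts above. It is only an illustration: `parse_fields` and `unlabeled_fields` are hypothetical helpers, not part of Munin, and how munin-update actually pairs replies is not shown in this ticket.

```python
def parse_fields(lines, attribute):
    """Collect field names from lines such as 'out.label bps'."""
    fields = set()
    for line in lines:
        if not line.strip():
            continue
        key = line.split(None, 1)[0]
        name, dot, attr = key.partition(".")
        # Global settings like 'graph_title ...' have no dot and are skipped.
        if dot and attr == attribute:
            fields.add(name)
    return fields


def unlabeled_fields(config_lines, fetch_lines):
    """Return fetched fields that have no 'label' in the config reply."""
    labeled = parse_fields(config_lines, "label")
    fetched = parse_fields(fetch_lines, "value")
    return sorted(fetched - labeled)


# Excerpts from the transcripts in this ticket.
config_braastadklynga = [
    "graph_order in out",
    "in.label bps",
    "in.type DERIVE",
    "out.label bps",
    "out.negative in",
]
fetch_braastadklynga = ["in.value 568914", "out.value 748979"]
fetch_splitout = [
    "out_transit.value 513406364719",
    "out_telenor.value 3906463",
    "out_nix.value 89245664210",
]

# Matching replies: every fetched field has a label.
print(unlabeled_fields(config_braastadklynga, fetch_braastadklynga))
# -> []

# Mismatched replies: the same fields that appear in the log messages.
print(unlabeled_fields(config_braastadklynga, fetch_splitout))
# -> ['out_nix', 'out_telenor', 'out_transit']
```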

I've got all the logs, and a copy of the RRD files and the generated HTML at stat:~tore/munin-bug/. Not attaching it here due to its semi-confidential nature as well as the size (around 600M in total).

I know that the values gathered are completely bogus (they're not based on the correct values somehow). For instance, the vlan_bandwidth_nix1 plugin has always had 0 for both the in and out fields - it's actually impossible for it to have anything else due to our network layout (so it's indeed quite useless). However, during the period the error lasted, the graph did display activity.

Tore

Change History

02/27/09 10:48:35 changed by janl

  • owner changed from nobody to kjellm.

Kjell Magne: This is "FYI" priority.

10/21/09 00:33:07 changed by janl

  • version deleted.
  • description changed.
  • milestone set to Munin 1.4.

10/28/09 16:09:08 changed by janl

  • owner changed from kjellm to janl.
  • priority changed from normal to high.
  • status changed from new to assigned.

11/06/09 13:34:12 changed by janl

  • owner changed from janl to holger.
  • status changed from assigned to new.
  • version set to 1.2.3.
  • milestone deleted.

I can't reproduce this bug in 1.4-alpha. I'm leaving the bug and assigning it to a debian person.

(in reply to: ↑ description ) 11/19/09 22:56:19 changed by rbanffy

It logged things like this:

Something very similar is happening here, but removing the offending plugin did not solve the problem. The resulting files look like this:

svti00.ig.corp-df_inode-assured-g.rrd
svti00.ig.corp-df_inode-established-g.rrd
svti00.ig.corp-df_inode-fin_wait-g.rrd
svti00.ig.corp-df_inode-nated-g.rrd
svti00.ig.corp-df_inode-syn_sent-g.rrd
svti00.ig.corp-df_inode-time_wait-g.rrd
svti00.ig.corp-df_inode-total-g.rrd
svti00.ig.corp-df_inode-udp-g.rrd
svti00.ig.corp-df-tmpfs-g.rrd
svti00.ig.corp-df-udev-g.rrd
svti00.ig.corp-entropy-volt1-g.rrd
svti00.ig.corp-entropy-volt2-g.rrd
svti00.ig.corp-entropy-volt3-g.rrd
svti00.ig.corp-entropy-volt4-g.rrd
svti00.ig.corp-entropy-volt5-g.rrd
svti00.ig.corp-entropy-volt6-g.rrd
svti00.ig.corp-entropy-volt7-g.rrd
svti00.ig.corp-entropy-volt8-g.rrd
svti00.ig.corp-exim_mailqueue-Airflow_Temperature_Cel-g.rrd
svti00.ig.corp-exim_mailqueue-Calibration_Retry_Count-g.rrd
svti00.ig.corp-exim_mailqueue-Current_Pending_Sector-g.rrd
svti00.ig.corp-exim_mailqueue-down-c.rrd
svti00.ig.corp-exim_mailqueue-Multi_Zone_Error_Rate-g.rrd
svti00.ig.corp-exim_mailqueue-Offline_Uncorrectable-g.rrd
svti00.ig.corp-exim_mailqueue-Power_Cycle_Count-g.rrd
svti00.ig.corp-exim_mailqueue-Power_On_Hours-g.rrd
svti00.ig.corp-exim_mailqueue-Raw_Read_Error_Rate-g.rrd
svti00.ig.corp-exim_mailqueue-Reallocated_Event_Count-g.rrd
svti00.ig.corp-exim_mailqueue-Reallocated_Sector_Ct-g.rrd
svti00.ig.corp-exim_mailqueue-Seek_Error_Rate-g.rrd
svti00.ig.corp-exim_mailqueue-smartctl_exit_status-g.rrd
svti00.ig.corp-exim_mailqueue-Spin_Retry_Count-g.rrd
svti00.ig.corp-exim_mailqueue-Spin_Up_Time-g.rrd
svti00.ig.corp-exim_mailqueue-Start_Stop_Count-g.rrd
svti00.ig.corp-exim_mailqueue-Temperature_Celsius-g.rrd
svti00.ig.corp-exim_mailqueue-UDMA_CRC_Error_Count-g.rrd
svti00.ig.corp-exim_mailqueue-up-c.rrd
svti00.ig.corp-forks-max-g.rrd
svti00.ig.corp-forks-used-g.rrd
svti00.ig.corp-fw_forwarded_local-idle-d.rrd
svti00.ig.corp-fw_forwarded_local-iowait-d.rrd
svti00.ig.corp-fw_forwarded_local-irq-d.rrd
svti00.ig.corp-fw_forwarded_local-nice-d.rrd
svti00.ig.corp-fw_forwarded_local-softirq-d.rrd

There are also a couple of nonsensical messages in the munin-update.log:

Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt2: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt3: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt4: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt5: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt6: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt7: No such field (no "label" field defined when running plugin with "config").
Nov 19 19:47:34 [6782] - Unable to update ig.corp -> svti00.ig.corp -> processes -> volt8: No such field (no "label" field defined when running plugin with "config").

Any extra info I can give you? Is there any way to totally reset the munin-node for this machine only and let it start over? I tried deleting all svti00*.rrd files and, while the data is gone (it's not a big deal with this server), the errors persist.

I am using munin and munin-node straight out of the Debian packages with no patches whatsoever (Debian built-in excepted):

ii  munin                     1.2.6-10~lenny1           network-wide graphing framework (grapher/gatherer)
ii  munin-node                1.2.6-10~lenny1           network-wide graphing framework (node)
ii  munin-plugins-extra       1.2.6-10~lenny1           network-wide graphing framework (user contributed plugins for node

12/04/09 15:41:57 changed by ossi

Hello,

At the moment I am running Munin from the svn repository at revision 3172, and I seem to have the same problem. It does not correlate with any particular plugin. An example of the effect would be:

munin-update.log:

2009/12/04 15:15:48 Missing required attribute 'label' for data source 'getattr' in service vmstat on ...
2009/12/04 15:15:48 Missing required attribute 'label' for data source 'setattr' in service vmstat on ...
2009/12/04 15:15:48 Missing required attribute 'label' for data source 'readdir' in service vmstat on ...
2009/12/04 15:15:48 Missing required attribute 'label' for data source 'rename' in service vmstat on ...
2009/12/04 15:15:48 Missing required attribute 'label' for data source 'rmdir' in service vmstat on ...
2009/12/04 15:15:48 Missing required attribute 'label' for data source 'symlink' in service vmstat on ...

The data sources come from the nfs plugins (I think). Furthermore, when that happens (and it does not happen on every update, just most of the time), the graph output gets mixed up too.

I am not really sure what logs or further info would be helpful. Please contact me if you need anything.

12/04/09 23:11:42 changed by janl

  • milestone set to Munin 1.4.2.

Ossi: Hi. I can't see your email address. Could you please email me at janl@redpill-linpro.com or the users list? I'm unable to reproduce the problem, and it really needs to be debugged with a debugger.

12/07/09 10:45:14 changed by janl

  • owner changed from holger to janl.

12/14/09 09:54:02 changed by janl

  • status changed from new to assigned.

One down: Better timeout handling in munin-update to avoid munin-node and munin-update getting confused.
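As a rough illustration of what such timeout handling can look like, here is a minimal Python sketch. It is not Munin's actual code (Munin is written in Perl), and `NodeConnection`, `NodeTimeout`, and the 5-second default are hypothetical names and values chosen for this example. The idea it shows: replies are terminated by a line containing only ".", and if a reply does not finish in time, the connection is abandoned rather than reused, so a late reply cannot be mistaken for the answer to the next command.

```python
import socket


class NodeTimeout(Exception):
    """The node did not finish a reply within the allowed time."""


class NodeConnection:
    """Hypothetical wrapper around one connection to a munin-node."""

    def __init__(self, sock, timeout=5.0):
        self.sock = sock
        self.sock.settimeout(timeout)  # applies to each recv() call
        self.buf = b""
        self.usable = True

    def read_reply(self):
        """Return the lines of one reply, up to the terminating '.' line."""
        if not self.usable:
            raise NodeTimeout("connection was abandoned after a timeout")
        lines = []
        while True:
            # Consume every complete line already in the buffer.
            while b"\n" in self.buf:
                raw, self.buf = self.buf.split(b"\n", 1)
                line = raw.decode("ascii", "replace").rstrip("\r")
                if line == ".":
                    return lines
                lines.append(line)
            try:
                chunk = self.sock.recv(4096)
            except socket.timeout:
                # A late reply may still arrive on this socket. Reading on
                # would attribute it to the next plugin, so give up on the
                # whole connection instead.
                self.usable = False
                self.sock.close()
                raise NodeTimeout("reply not terminated in time")
            if not chunk:
                self.usable = False
                raise ConnectionError("node closed the connection mid-reply")
            self.buf += chunk
```

A caller would send "config <plugin>" or "fetch <plugin>", call `read_reply()`, and on `NodeTimeout` open a fresh connection before asking about any further plugins.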

03/08/10 13:17:06 changed by janl

  • status changed from assigned to closed.
  • resolution set to fixed.
  • milestone changed from Munin 1.4.4 to Munin 1.4.3.

This seems to be definitely solved in 1.4.3.