Ticket #539 (closed defect: fixed)

Opened 4 years ago

Last modified 3 weeks ago

munin-update and cron can cause gaps in graphs on large infrastructures

Reported by: holoway
Assigned to: snide
Priority: high
Milestone: Munin 2.0
Component: master
Version:
Severity: critical
Keywords:
Cc:

Description

As munin is currently architected, if munin-cron does not finish in under 5 minutes, you will have gaps in all of your graphs for the next 5-minute period. The first workaround to this problem is breaking munin-cron into two steps: one for munin-update and munin-limits, the other for munin-graph and munin-html. This solves the issue in the 90% case, as making the graphs normally takes far more time than munin-update does.

However, in a sufficiently large infrastructure, you will probably have some nodes in distress. In this case, a single node may take more than 5 minutes to respond in munin-update, which still blocks out the next 5-minute period in the graphs.
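
To make the failure mode concrete, here is a minimal Python sketch of a simplified model, not munin's actual code. It assumes cron fires every 300 seconds and that a run which finds the single global lock still held exits without updating anything. The function name `skipped_slots` and the example durations are hypothetical.

```python
PERIOD = 300  # seconds between cron invocations (5 minutes)

def skipped_slots(durations, period=PERIOD):
    """Return the start times of cron slots that produce no update.

    durations[i] is how long the update would take if it started in slot i.
    Simplified model: a slot is skipped when an earlier run still holds
    the single global lock at the moment cron fires.
    """
    skipped = []
    lock_free_at = 0  # time at which the global lock is released
    for i, duration in enumerate(durations):
        start = i * period
        if start < lock_free_at:
            # Lock still held by an earlier run: this slot records nothing.
            skipped.append(start)
            continue
        lock_free_at = start + duration
    return skipped

# One distressed node stretches the second run to 400 seconds.
print(skipped_slots([120, 400, 120, 120]))  # prints [600]
```

In this model the run that starts at 300 seconds holds the lock until 700 seconds, so the slot at 600 seconds is lost for every node, not just the slow one.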

To solve this problem, I propose a few changes to munin:

1. We should make the poller (munin-cron) no longer run from cron. This will solve the issue of Cron outsmarting the poller. In its place, we should have a well-timed daemon that handles munin's polling and update tasks.

2. We should make munin-update have a per-node lock, instead of a single global lock. This will enable munin-update to run multiple times in parallel, with only slow-responding nodes being afflicted with gaps in the graphs (as opposed to the entire infrastructure.)
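
The two proposals above can be sketched roughly as follows. This is an illustration under stated assumptions, not munin code: the helper names (`next_tick`, `try_lock_node`, `poll_once`), the lock file naming, and the use of `flock` are all hypothetical choices, and the sketch is Unix-only because it relies on Python's `fcntl` module.

```python
import fcntl
import os

PERIOD = 300  # polling interval in seconds (5 minutes)

def next_tick(now, period=PERIOD):
    """Return the next multiple of `period` strictly after `now`."""
    return (int(now) // period + 1) * period

def try_lock_node(lock_dir, node):
    """Try to take an exclusive, non-blocking lock for a single node.

    Returns an open file descriptor on success (close it to release the
    lock), or None if another update of the same node is still running.
    The lock file name is a hypothetical convention.
    """
    path = os.path.join(lock_dir, f"munin-update-{node}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

def poll_once(nodes, lock_dir, update):
    """Update every node whose lock is free; return the nodes skipped."""
    skipped = []
    for node in nodes:
        fd = try_lock_node(lock_dir, node)
        if fd is None:
            skipped.append(node)  # earlier update of this node still running
            continue
        try:
            update(node)
        finally:
            os.close(fd)  # closing the descriptor releases the flock
    return skipped
```

A daemon built on this would sleep until `next_tick(time.time())` and then start a poll, so runs stay aligned to the 5-minute grid without relying on cron. Because each node has its own lock, a new run that overlaps an old one skips only the nodes that are still being updated.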

Thoughts?

Change History

11/12/07 16:17:46 changed by janl

  • owner changed from nobody to janl.
  • status changed from new to assigned.

We have planned architectural enhancements that will fix this.

Nicolai

(follow-up: ↓ 3 ) 15/01/08 10:42:47 changed by janl

  • status changed from assigned to new.

There is a fork of munin addressing this (Moonin); it will be integrated with munin.

(in reply to: ↑ 2 ; follow-up: ↓ 4 ) 21/07/08 19:06:28 changed by do

Replying to janl:

There is a fork of munin addressing this (Moonin); it will be integrated with munin.

I am very interested in moonin. Is there some kind of documentation available for this? Browsing through the svn I did not find anything :(

thanks.

(in reply to: ↑ 3 ) 29/10/08 22:20:32 changed by btm

Replying to do:

I am very interested in moonin. Is there some kind of documentation available for this? Browsing through the svn I did not find anything :(

http://github.com/adamhjk/moonin/tree/master

27/02/09 10:43:38 changed by janl

  • owner changed from janl to kjellm.

09/10/09 16:11:30 changed by r4v1

When will munin have a fix for this issue?

Moonin does not provide any documentation, or at least I cannot find any, so I would appreciate a solution that comes directly from the munin developers. We do not want to use a third-party solution that could cause incompatibilities with newer munin versions and leave us unable to migrate the old graphs.

Our local munin installation checks a huge number of servers and their plugins, and the gaps are getting bigger and bigger as we add new hosts.

Would it be possible, for example, to split just munin-graph into several parallel processes that can use all cores of the system? We have already split munin-cron into two parts, but the gaps still appear.

21/10/09 00:47:05 changed by janl

  • owner changed from kjellm to janl.
  • version deleted.
  • milestone set to Munin 1.5.

Munin 1.4 will still have this misfeature. A few things address it, though: better SNMP scalability, (probably) parallel munin-graph, and (probably) more granular locks for munin-update, so that only slow nodes will be locked out from updating the rrd files.

The more fundamental fixes will come later (spooling nodes and super quick munin-update runs).

05/12/09 00:02:46 changed by janl

  • status changed from new to assigned.
  • milestone changed from Munin 1.5 to Munin 1.4.2.

Note: We'll never be able to converge Moonin and Munin, it seems. It's less work to let munin go its own way, and one thing we do not have enough of is developer time.

We can ensure that munin-update completes in 4.5 minutes by adding a "global" timeout.
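
As a rough illustration of what such a global timeout could look like (a sketch only, not the code that went into munin), the Python below polls nodes concurrently and stops waiting after 270 seconds. The names `update_all` and `fetch`, the worker count, and the thread-pool approach are all assumptions. `cancel_futures` requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, wait

GLOBAL_TIMEOUT = 270  # 4.5 minutes, in seconds

def update_all(nodes, fetch, timeout=GLOBAL_TIMEOUT, workers=8):
    """Run fetch(node) for every node, giving up on stragglers at the deadline.

    Returns (finished, abandoned) as two sorted lists of node names.
    Nodes whose fetch raised an exception appear in neither list.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {pool.submit(fetch, node): node for node in nodes}
    done, not_done = wait(futures, timeout=timeout)
    # Do not block on slow nodes; work that has not started yet is cancelled.
    # Threads already running cannot be killed, so a real poller would also
    # need socket-level timeouts on the connection to each node.
    pool.shutdown(wait=False, cancel_futures=True)
    finished = sorted(futures[f] for f in done if f.exception() is None)
    abandoned = sorted(futures[f] for f in not_done)
    return finished, abandoned
```

With a budget shorter than the 5-minute cron period, the run returns before the next one starts, and only the nodes listed in `abandoned` miss that period's data.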

14/12/09 10:19:35 changed by janl

  • priority changed from normal to high.

15/12/09 00:21:08 changed by janl

Fix munin_update plugin to autoconf'igure better. Install warning and critical limits in it to warn of slow nodes.

30/12/09 14:40:15 changed by janl

Update to check if trac sends email.

30/12/09 14:47:29 changed by janl

and again

08/03/10 13:16:15 changed by janl

  • milestone changed from Munin 1.4.4 to Munin 1.4.5.

17/01/12 11:21:47 changed by snide

  • owner changed from janl to snide.
  • status changed from assigned to new.
  • milestone changed from Munin 1.4.7 to Munin 2.0.

trunk offers a new timeout code.

17/01/12 11:21:51 changed by snide

  • status changed from new to closed.
  • resolution set to fixed.