nanog mailing list archives

Re: Monitoring system recommendation


From: "Crier, Brent" <Brent.Crier () nsight com>
Date: Tue, 7 Jun 2016 12:32:50 +0000

We use Zabbix here pretty heavily, monitoring roughly 10,000 hosts, 13,000 interfaces, and a myriad of services.

-Brent


On Jun 7, 2016, at 2:42 AM, Mikael Falkvidd <mikael.falkvidd () op5 com> wrote:


On Monday, June 6, 2016, Manuel Marín <mmg () transtelco net> wrote:

Dear Nanog community

We are currently planning to upgrade our monitoring system (Opsview) due to
scalability issues, and I was wondering what you would recommend for
monitoring 5000 hosts and 35000 services. We would like to use a monitoring
system that is compatible with the Nagios plugin format; however, we are not
sure if systems like Icinga/Shinken/Op5 are the way to go.

Is anyone using systems like Op5 or Icinga2 for monitoring > 5000 hosts?
Would you recommend commercial systems like Sevone, Zabbix, etc. instead of
open source ones?


We (op5) have customers running > 50,000 hosts and > 300,000 services. So
5,000 hosts is generally not a problem.

As mentioned by Jeff, the forking model *can* become a problem. Small
binaries that don't load a lot of libraries fork pretty fast. A test we made
some time ago showed a 15-minute load average peak of 3.89 (on 24
cores/hyperthreads) when checking 100,000 services every 5 minutes. Check
latencies were 0.8 seconds max and 0.002 seconds avg. Average CPU load was 15%.

Specs for the machine used:
Dell PowerEdge R620
2x Intel Xeon E5-2620
24 GB ram
Dell PERC H710 hardware RAID card
RAID10 on 4x300GB 15kRPM SAS drives

So a single (now almost vintage) server can handle about 300 plugin
executions per second without breaking a sweat. Scaling up is definitely a
possibility, but scaling out (using Mod-Gearman, mk or Merlin, all open
source) is available as well.
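
As a rough check on the arithmetic behind that figure, here is a minimal Python sketch. The 100,000 services and 5-minute interval come from the test described above; applying the same 5-minute interval to the 35,000 services in the original question is an assumption, since no interval was stated there.

```python
def checks_per_second(services: int, interval_seconds: float) -> float:
    """Average plugin executions per second for a given check interval."""
    return services / interval_seconds


# Figures from the test described above: 100,000 services every 5 minutes.
tested = checks_per_second(100_000, 5 * 60)

# ASSUMPTION: the 35,000 services from the original question are also
# checked every 5 minutes; the thread does not state the interval.
planned = checks_per_second(35_000, 5 * 60)

print(f"tested load:  {tested:.0f} checks/s")    # 333
print(f"planned load: {planned:.0f} checks/s")   # 117
print(f"planned / tested: {planned / tested:.0%}")  # 35%
```

Under that assumption, the planned load is roughly a third of what the test machine handled, though this says nothing about how heavy the individual plugins are.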

Complex plugins, for example check_vmware_api, which loads the large VMware
Perl SDK, can get you in trouble though. I suggest you run a test with the
plugin mix you are planning to use.
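
One way to run such a test is to time the fork/exec cost of each plugin in isolation before loading a full monitoring server. The following is only a sketch: the function name `time_plugin` is made up for illustration, and the command being timed is a stand-in (an empty Python process) that you would replace with your real plugin command lines.

```python
import statistics
import subprocess
import sys
import time


def time_plugin(argv, runs=20):
    """Run a plugin command `runs` times, one after another, and report
    wall-clock timings in seconds. The exit code is deliberately ignored,
    because Nagios-style plugins use non-zero codes for WARNING/CRITICAL."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(durations),
        "max": max(durations),
    }


# PLACEHOLDER command: starts a Python interpreter that does nothing.
# Substitute the argv of each plugin in your planned mix.
placeholder_cmd = [sys.executable, "-c", "pass"]

result = time_plugin(placeholder_cmd, runs=5)
print(f"mean {result['mean']:.4f}s, max {result['max']:.4f}s")
# Sequential executions per second on one core, as a very rough ceiling.
print(f"about {1 / result['mean']:.0f} sequential executions per second")
```

Serial timings like these only indicate per-execution startup overhead; they do not replace a test at the real check rate, where many plugins run concurrently.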

If scaling out is not an option, and you want to stay in the Nagios/Naemon
world, a custom worker can be developed to get rid of the loading overhead.
Documentation is available at
http://www.naemon.org/documentation/developer/workers.html

Full disclosure: I work as development team lead at op5

best regards
Mikael Falkvidd
