Commit 74cb196: Merge pull request #318 from netdata/learn-content-to-blogs (creating blogs for content on Learn which doesn't fit there). 2 parents ebfe5c6 + 4612c63. 6 files changed, +643 -0 lines.
---
slug: netdata-processes-monitoring-comparison-with-console-tools
title: "Netdata processes monitoring and its comparison with other console-based tools"
authors: satya
tags: [processes, top, htop, atop, glances, application-monitoring, apm]
keywords: [processes, top, htop, atop, glances, application-monitoring, apm]
image: ./img/stacked-netdata.png
---

![netdata-prometheus-grafana-stack](./img/stacked-netdata.png)

Netdata reads `/proc/<pid>/stat` for all processes once per second and extracts `utime` and
`stime` (user and system CPU utilization), much like all the console tools do.

But it also extracts `cutime` and `cstime`, which account for the user and system time of the
exited children of each process. By keeping a map of the whole process tree in memory, it can
assign the right time to every process, taking
into account all its exited children.
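
For illustration, extracting these fields can be sketched in Python (Netdata itself does this in C, inside `apps.plugin`); the sample `stat` line below is fabricated:

```python
def parse_stat(line):
    """Extract utime, stime, cutime, cstime (in clock ticks) from a
    /proc/<pid>/stat line. The comm field is wrapped in parentheses and
    may itself contain spaces, so split only after the last ')'."""
    rest = line[line.rindex(')') + 2:].split()
    # Fields after comm: state ppid pgrp session tty_nr tpgid flags
    # minflt cminflt majflt cmajflt utime stime cutime cstime ...
    return tuple(int(x) for x in rest[11:15])

# Fabricated stat line for a hypothetical pid 12789 running bash:
sample = ("12789 (bash) S 1 12789 12789 34816 12789 4194304 1500 "
          "900000 0 0 120 840 350000 610000 20 0 1 0 331385 15339520 1045")
print(parse_stat(sample))  # (120, 840, 350000, 610000)
```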

This is tricky: a process may run for an hour, and once it exits its parent should not be
charged the whole hour of CPU time in a single second. You have to subtract the CPU time
already reported for it in prior iterations.

It is even trickier, because walking through the entire process tree takes some time itself. So,
if you sum the CPU utilization of all processes, you might end up with more CPU time than the
reported total CPU time of the system. Netdata solves this by adapting the per-process CPU
utilization to the total of the system.
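
The adaptation step can be sketched as follows (a simplified Python illustration of the idea, not Netdata's actual code; the pid-to-CPU-time map is fabricated):

```python
def scale_to_system(per_proc, system_total):
    """If the summed per-process CPU time exceeds what the system
    reports for the same interval (because the tree walk itself took
    time), scale every process down proportionally."""
    total = sum(per_proc.values())
    if total <= system_total or total == 0:
        return dict(per_proc)
    factor = system_total / total
    return {pid: t * factor for pid, t in per_proc.items()}

# Raw sum is 110, but the system only reported 100 for the interval;
# after scaling, the per-process values add up to the system total.
scaled = scale_to_system({1: 55.0, 2: 33.0, 3: 22.0}, 100.0)
print(round(sum(scaled.values()), 6))  # 100.0
```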

<!--truncate-->

## Comparison with console tools

SSH to a server running Netdata and execute this:

```sh
while true; do ls -l /var/run >/dev/null; done
```

On most systems `/var/run` is a `tmpfs` mount, so there is nothing to stop this command
from entirely consuming one of the CPU cores of the machine.

As we will see below, **none** of the console performance monitoring tools can report that this
command is using 100% CPU. They do, of course, report that the CPU is busy, but **they fail to
identify the process that consumes so much CPU**.

Here is what common Linux console monitoring tools report:
### top

`top` reports that `bash` is using just 14%.

If you check the total system CPU utilization, it says there is no idle CPU at all, but `top`
fails to provide a breakdown of the CPU consumption in the system. The sum of the CPU
utilization of all processes reported by `top` is 15.6%.

```
top - 18:46:28 up 3 days, 20:14, 2 users, load average: 0.22, 0.05, 0.02
Tasks: 76 total, 2 running, 74 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.8 us, 65.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 1.3 hi, 0.3 si, 0.0 st
KiB Mem : 1016576 total, 244112 free, 52012 used, 720452 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 753712 avail Mem

  PID USER    PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
12789 root    20  0  14980  4180 3020 S 14.0  0.4  0:02.82 bash
    9 root    20  0      0     0    0 S  1.0  0.0  0:22.36 rcuos/0
  642 netdata 20  0 132024 20112 2660 S  0.3  2.0 14:26.29 netdata
12522 netdata 20  0   9508  2476 1828 S  0.3  0.2  0:02.26 apps.plugin
    1 root    20  0  67196 10216 7500 S  0.0  1.0  0:04.83 systemd
    2 root    20  0      0     0    0 S  0.0  0.0  0:00.00 kthreadd
```

### htop

Exactly like `top`, `htop` provides an incomplete breakdown of the system CPU utilization.

```
CPU[||||||||||||||||||||||||100.0%] Tasks: 27, 11 thr; 2 running
Mem[||||||||||||||||||||85.4M/993M] Load average: 1.16 0.88 0.90
Swp[                         0K/0K] Uptime: 3 days, 21:37:03

  PID USER    PRI NI  VIRT   RES  SHR S CPU% MEM%    TIME+ Command
12789 root     20  0 15104  4484 3208 S 14.0  0.4 10:57.15 -bash
 7024 netdata  20  0  9544  2480 1744 S  0.7  0.2  0:00.88 /usr/libexec/netd
 7009 netdata  20  0  138M 21016 2712 S  0.7  2.1  0:00.89 /usr/sbin/netdata
 7012 netdata  20  0  138M 21016 2712 S  0.0  2.1  0:00.31 /usr/sbin/netdata
  563 root     20  0  308M  202M 202M S  0.0 20.4  1:00.81 /usr/lib/systemd/
 7019 netdata  20  0  138M 21016 2712 S  0.0  2.1  0:00.14 /usr/sbin/netdata
```

### atop

`atop` also fails to break down CPU usage.

```
ATOP - localhost 2016/12/10 20:11:27 ----------- 10s elapsed
PRC | sys 1.13s  | user 0.43s  | #proc 75     | #zombie 0    | #exit 5383   |
CPU | sys 67%    | user 31%    | irq 2%       | idle 0%      | wait 0%      |
CPL | avg1 1.34  | avg5 1.05   | avg15 0.96   | csw 51346    | intr 10508   |
MEM | tot 992.8M | free 211.5M | cache 470.0M | buff 87.2M   | slab 164.7M  |
SWP | tot 0.0M   | free 0.0M   |              | vmcom 207.6M | vmlim 496.4M |
DSK | vda        | busy 0%     | read 0       | write 4      | avio 1.50 ms |
NET | transport  | tcpi 16     | tcpo 15      | udpi 0       | udpo 0       |
NET | network    | ipi 16      | ipo 15       | ipfrw 0      | deliv 16     |
NET | eth0 ----  | pcki 16     | pcko 15      | si 1 Kbps    | so 4 Kbps    |

  PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/600
12789  0.98s  0.40s    0K    0K    0K  336K -- -  S 14% bash
    9  0.08s  0.00s    0K    0K    0K    0K -- -  S  1% rcuos/0
 7024  0.03s  0.00s    0K    0K    0K    0K -- -  S  0% apps.plugin
 7009  0.01s  0.01s    0K    0K    0K    4K -- -  S  0% netdata
```

### glances

And the same is true for `glances`. The system runs at 100%, but `glances` reports only 13.5%
CPU utilization for the `bash` process.

Note also that, being a `python` program, `glances` itself uses 1.6% CPU while it runs.

```
localhost                                 Uptime: 3 days, 21:42:00

CPU  [100.0%]  CPU 100.0%    MEM  23.7%   SWAP 0.0%    LOAD 1-core
MEM  [ 23.7%]  user:  30.9%  total: 993M  total: 0     1 min:  1.18
SWAP [  0.0%]  system: 67.8% used:  236M  used:  0     5 min:  1.08
               idle:   0.0%  free:  757M  free:  0     15 min: 1.00

NETWORK  Rx/s Tx/s  TASKS 75 (90 thr), 1 run, 74 slp, 0 oth
eth0     168b  2Kb
eth1       0b   0b  CPU% MEM%   PID USER    NI S Command
lo         0b   0b  13.5  0.4 12789 root     0 S -bash
                     1.6  2.2  7025 root     0 R /usr/bin/python /u
DISK I/O  R/s  W/s   1.0  0.0     9 root     0 S rcuos/0
vda1        0   4K   0.3  0.2  7024 netdata  0 S /usr/libexec/netda
                     0.3  0.0     7 root     0 S rcu_sched
FILE SYS  Used Total 0.3  2.1  7009 netdata  0 S /usr/sbin/netdata
/ (vda1) 1.56G 29.5G 0.0  0.0    17 root     0 S oom_reaper
```

### Why does this happen?

All the console tools report usage based on the processes found running *at the moment they
examine the process tree*. So, they see just one `ls` command, which is actually very quick
and has minor CPU utilization. But the shell is spawning hundreds of them, one after another
(much like shell scripts do).

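The `cutime`/`cstime` accounting described earlier can be observed from any POSIX process: the kernel charges the CPU time of exited, waited-for children to their parent. A small Python demonstration (not Netdata code; the CPU-burning child is a stand-in for the `ls` loop above):

```python
import os
import subprocess
import sys

# Spawn a short-lived child that burns some CPU, then exits.
# A sampling tool that inspects the process list right now would
# almost certainly miss it -- but the kernel adds its CPU time to
# the parent's cutime/cstime, which os.times() exposes.
before = os.times()
subprocess.run([sys.executable, "-c", "sum(x * x for x in range(10**7))"],
               check=True)
after = os.times()

# children_user (cutime) grew, even though the child is already gone.
print(after.children_user > before.children_user)  # True
```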
### What does Netdata report?
147+
148+
The total CPU utilization of the system:
149+
150+
![image](https://cloud.githubusercontent.com/assets/2662304/21076212/9198e5a6-bf2e-11e6-9bc0-6bdea25befb2.png)
151+
<br/>***Figure 1**: The system overview section at Netdata, just a few seconds after the command was run*
152+
153+
And at the applications `apps.plugin` breaks down CPU usage per application:
154+
155+
![image](https://cloud.githubusercontent.com/assets/2662304/21076220/c9687848-bf2e-11e6-8d81-348592c5aca2.png)
156+
<br/>***Figure 2**: The Applications section at Netdata, just a few seconds after the command was run*
157+
158+
So, the `ssh` session is using 95% CPU time.
159+
160+
Why `ssh`?
161+
162+
`apps.plugin` groups all processes based on its configuration file.
163+
The default configuration has nothing for `bash`, but it has for `sshd`, so Netdata accumulates
164+
all ssh sessions to a dimension on the charts, called `ssh`. This includes all the processes in
165+
the process tree of `sshd`, **including the exited children**.
166+
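That grouping can be sketched like this (a simplified, hypothetical Python illustration; Netdata's `apps.plugin` is written in C and driven by its own configuration file, and the process table below is fabricated). Each process is charged to the group of its nearest configured ancestor:

```python
# Configured groups: process name -> chart dimension.
GROUPS = {"sshd": "ssh", "netdata": "netdata"}

# Toy process table: pid -> (ppid, comm).
procs = {
    1:     (0, "systemd"),
    600:   (1, "sshd"),
    12789: (600, "bash"),    # our CPU-burning shell
    13001: (12789, "ls"),    # short-lived child
    642:   (1, "netdata"),
}

def group_of(pid):
    """Walk up the process tree until a configured group matches."""
    while pid in procs:
        ppid, comm = procs[pid]
        if comm in GROUPS:
            return GROUPS[comm]
        pid = ppid
    return "other"

print(group_of(13001))  # ssh
print(group_of(642))    # netdata
```

So the whole `sshd` subtree, including `bash` and every short-lived `ls`, lands in the `ssh` dimension.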

> Distributions based on `systemd` provide another way to get CPU utilization per user session
> or running service: control groups (cgroups), commonly used as part of containers.
> `apps.plugin` does not use these mechanisms. The process grouping made by `apps.plugin` works
> on any Linux, `systemd`-based or not.