Fluentd性能问题排查

| 标签 Fluentd  浏览次数: -

版本信息

  • OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
  • kernel: Linux hostname 3.10.0-693.11.1.el7.x86_64 #1 SMP Fri Oct 27 05:39:05 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
  • ruby: ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
  • td-agent: td-agent 1.3.3

现象说明

在配置文件中配置了大量的重复日志采集(比如 /var/log/messages* 重复配置了10份),ruby安装后没有做任何的调优和参数限制,监听的日志文件数量为136,td-agent运行了两天内存使用就达到了5.2G,cpu占有率间歇性持续走高,大概是两分钟一次会飙升到100% 。

  • 配置文件内容

配置文件内容

以下的配置文件的文本内容

# /etc/td-agent/conf.d/messages.conf
<source>
  @type tail
  path /var/log/messages
  exclude_path []
  path_key tailed_path
  pos_file /var/log/td-agent/messages.messages.pos
  tag messages

  format multiline
  multiline_flush_interval 5s
  format_firstline /\w+ \d+ \d+:\d+:\d+/
  format1 /^(?<log_time>\w+ \d+ \d+:\d+:\d+) (?<hostname>\S*) (?<message>.*)/

  time_format %b %d %H:%M:%S
</source>

<filter messages>
  @type record_transformer
  <record>
    hostname xxx
    ip x.x.x.x
    log_type messages

  </record>
</filter>

<filter messages>
  @type dio
  keys log_time
</filter>

<match messages>
  @type copy
  @include messages.conf.d/*.conf
</match>
# /etc/td-agent/conf.d/messages.conf.d/x.x.x.x_10015.conf
<store>
  @type elasticsearch
  host x.x.x.x
  port 10015

  time_key time
  flush_interval 2s
  buffer_queue_limit 4096
  buffer_chunk_limit 1024m
  num_threads 4
  logstash_format true
</store>
  • 配置采集日志数量

配置采集日志数量

  • CPU内存使用情况

CPU内存使用情况

寻找原因

  • 查看日志

当发现这个现象后,第一反应是在查看 /var/log/td-agent/td-agent.log 日志,相关警告信息如下。

2018-01-10 09:12:51 +0800 [info]: #0 detected rotation of /var/log/xxxx_exporter.log.6; waiting 5 seconds
2018-01-10 09:12:51 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test2" time=2018-01-08 13:06:29.829007891 +0800 record={"log_time"=>1515387560, "level"=>"info", "message"=>"output:  PING x.x.x.x (x.x.x.x) 56(84) bytes of data.\\n\\n--- x.x.x.x ping statistics ---\\n1 packets transmitted, 0 received, 100% packet loss, time 0ms\\n\\n\" source=\"helper.go:101", "tailed_path"=>"/var/log/xxxx_exporter.log.3", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:12:52 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx" time=2018-01-08 13:48:39.054929007 +0800 record={"log_time"=>1515389254, "level"=>"info", "message"=>"output:  PING 192.168.1.88 (192.168.1.88) 56(84) bytes of data.\\n\\n--- 192.168.1.88 ping statistics ---\\n1 packets transmitted, 0 received, 100% packet loss, time 0ms\\n\\n\" source=\"helper.go:101", "tailed_path"=>"/var/log/xxxx_exporter.log.9", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:12:52 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test5" time=2018-01-08 17:58:35.875962904 +0800 record={"level"=>"INFO", "log_time"=>1515401906, "module"=>"util", "message"=>"cmd: tail -n 161 /var/log/messages", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:12:52 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test6" time=2018-01-08 16:40:35.797752215 +0800 record={"level"=>"INFO", "log_time"=>1515385707, "module"=>"util", "message"=>"cmd output:\n    ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.1", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:12:52 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test5" time=2018-01-08 13:23:58.640995044 +0800 record={"log_time"=>1515388531, "level"=>"info", "message"=>"command:  su - grid -c 'sqlplus '/as sysasm''\" source=\"helper.go:100", "tailed_path"=>"/var/log/xxxx_exporter.log.4", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:13 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test5" time=2018-01-08 15:43:33.195488773 +0800 record={"level"=>"INFO", "log_time"=>1515377090, "module"=>"mxutils", "message"=>"Mxvote cmd: dmsetup table mxvoting\n     output: 0 12582456 linear 147:0 0\n    \n     error message: ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.2", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:13 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test2" time=2018-01-08 16:58:28.176962647 +0800 record={"level"=>"INFO", "log_time"=>1515368606, "module"=>"util", "message"=>"cmd output:\n    ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.4", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:13 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 18:01:06.570689759 +0800 record={"level"=>"INFO", "log_time"=>1515395454, "module"=>"util", "message"=>"cmd output:\n    ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.2", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:13 +0800 [info]: #0 detected rotation of /var/log/xxxx_exporter.log.8; waiting 5 seconds
2018-01-10 09:13:14 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx" time=2018-01-08 12:43:42.215172692 +0800 record={"log_time"=>1515386471, "level"=>"info", "message"=>"stderr:  \" source=\"helper.go:102", "tailed_path"=>"/var/log/xxxx_exporter.log.1", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:14 +0800 [info]: #0 detected rotation of /var/log/xxxx_exporter.log.1; waiting 5 seconds
2018-01-10 09:13:14 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test6" time=2018-01-08 15:41:33.081963752 +0800 record={"level"=>"INFO", "log_time"=>1546856351, "module"=>"util", "message"=>"cmd output:\n    ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.9", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:14 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 14:13:02.271125703 +0800 record={"log_time"=>1515390592, "level"=>"info", "message"=>"output:  \" source=\"helper.go:101", "tailed_path"=>"/var/log/xxxx_exporter.log.10", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:14 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 13:37:48.928853772 +0800 record={"log_time"=>1515388695, "level"=>"info", "message"=>"stderr:  \" source=\"helper.go:102", "tailed_path"=>"/var/log/xxxx_exporter.log.9", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:14 +0800 [info]: #0 detected rotation of /usr/local/xxxx/logs/xxxx.log.7; waiting 5 seconds
2018-01-10 09:13:35 +0800 [info]: #0 detected rotation of /var/log/xxxx_exporter.log; waiting 5 seconds
2018-01-10 09:13:35 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test4" time=2018-01-08 16:15:38.040842413 +0800 record={"level"=>"ERROR", "log_time"=>1546854427, "module"=>"util", "message"=>"cmd error msg:\n    Invalid entry length (16). Fixed up to 11.\n    None", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.9", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:35 +0800 [info]: #0 detected rotation of /var/log/xxxx_exporter.log.9; waiting 5 seconds
2018-01-10 09:13:36 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 16:21:04.504520330 +0800 record={"level"=>"INFO", "log_time"=>1515381249, "module"=>"util", "message"=>"cmd output:\n    ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.2", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:36 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 12:38:02.403610777 +0800 record={"log_time"=>1515385348, "level"=>"info", "message"=>"output:  0000:04:00.1\\n\" source=\"helper.go:101", "tailed_path"=>"/var/log/xxxx_exporter.log.7", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:36 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test3" time=2018-01-08 15:42:38.089622109 +0800 record={"level"=>"INFO", "log_time"=>1515345911, "module"=>"mxutils", "message"=>"Mxvote cmd: ssh -q -o ConnectTimeout=3 172.16.132.19 drbdsetup role 0\n     output: Primary/Primary\n    \n     error message: ", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.6", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
2018-01-10 09:13:36 +0800 [warn]: #0 dump an error event: error_class=TypeError error="no implicit conversion of Fixnum into String" location="/opt/td-agent/embedded/lib/ruby/2.1.0/time.rb:325:in `_parse'" tag="xxxx-test4" time=2018-01-08 16:47:27.462415874 +0800 record={"level"=>"INFO", "log_time"=>1515378538, "module"=>"util", "message"=>"cmd: dmidecode|grep Vendor", "tailed_path"=>"/usr/local/xxxx/logs/xxxx.log.2", "hostname"=>"xxxx", "ip"=>"x.x.x.x", "log_type"=>"xxxx"}
  • 查看issue

紧接着就是在github中搜索相关issue,找到一个关于td-agnet内存疯长的issue。根据issue提示,增加了buffer等相关配置,内存增加问题得到了解决

CPU问题排查方向

  • 设置enable_watch_timer
  • 修改buffer配置参数是否能生效
  • 检查tail相同文件是否会有竟态条件repeat
  • dio等插件计算是否有影响
  • 检查正则是否合理

结论

在修改各种参数和配置后,发现只有把在配置文件中去除@type dio部分配置后才不会再出现。

在查阅大量资料后推测,因为fluent-plugin-dio-1.0.1.gem插件会把日志中的不同格式时间转成时间戳,在这个过程中导致了CPU使用率较高。

由于@timestamp时间和日志时间非常接近,完全可以替代日志中的时间作为筛选条件进行查询。所以把fluent-plugin-dio-1.0.1.gem插件移除即可。

参考


上一篇 flunetd日志监控/告警实践     下一篇 Python使用cx_Oracle用非业务IP查询数据
目录导航