Capturing jstack dumps from a Java application
sdmei
When a Java application's CPU usage fluctuates, how can it be analyzed after the fact?
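My current approach is to run jstack automatically based on CPU load and upload the resulting file to OSS.
Environment: Alibaba Cloud + k8s + springcloud + prometheus + oss
The container image has Python 2.7 installed.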
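The following files are involved:
OSS-related files: the OSS client (ossutil64) and its config file (ossutilconfig);
The logic lives mainly in the script mon_cpu_jstack.py: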
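- Two arguments: the CPU threshold and the profile environment (prod, test, etc.);
- The check runs every 10 s (the first 3 minutes are skipped, because high CPU load during application startup is normal);
- CPU usage is read with top; if it exceeds the threshold, jstack is run;
- If the thread consuming the most CPU is a GC thread, jmap output is collected as well;
- jstack runs at most 3 times per hour (more than that adds nothing);
- The generated file is uploaded to OSS with the OSS client;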
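The decision behind these rules (startup grace period, threshold, at most three dumps per hour) can be sketched as a pair of pure functions. This is only an illustration, not the script itself: the names `should_dump` and `record_dump` and the timestamp-based startup check are assumptions made here, whereas the script below counts loop iterations instead.

```python
import time

WINDOW_SECONDS = 3600   # rate-limit window: one hour
MAX_DUMPS = 3           # at most three dumps per window
STARTUP_GRACE = 180     # ignore the first three minutes after start


def should_dump(cpu_pct, threshold, started_at, recent_dumps, now=None):
    """Return True when a dump should be taken at time `now`.

    recent_dumps holds the timestamps of earlier dumps (hypothetical helper,
    not part of the original script).
    """
    if now is None:
        now = time.time()
    if now - started_at < STARTUP_GRACE:
        return False    # application is still warming up
    if cpu_pct <= threshold:
        return False    # load is not above the threshold
    in_window = [t for t in recent_dumps if now - t <= WINDOW_SECONDS]
    return len(in_window) < MAX_DUMPS


def record_dump(recent_dumps, now):
    """Return a newest-first history trimmed to MAX_DUMPS entries."""
    return ([now] + list(recent_dumps))[:MAX_DUMPS]
```

Keeping the decision free of side effects makes it possible to check the hourly limit without waiting for real time to pass.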
# more mon_cpu_jstack.py
# -*- coding: utf-8 -*-
import time
import commands as cm
import os,sys
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)

if len(sys.argv) != 3:
    print 'Usage: ' + sys.argv[0] + ' [cpu_threshold] [profiles]'
    sys.exit(1)

cpu_th = float(sys.argv[1])
profiles = sys.argv[2]
base_dir = '/opt/perf'
time_list = []

def oss_upload(app_name,logfile):
    comm_upload = base_dir + '/ossutil64 -c ' + base_dir + '/ossutilconfig cp ' + logfile + ' oss://k8s-jstack-log/' + profiles + '/' + app_name + '/'
    comm_delete = 'rm -f ' + logfile
    os.system(comm_upload)
    time.sleep(1)
    os.system(comm_delete)

def mon_and_catch():
    global time_list
    # up to 3 times hourly
    if len(time_list) == 3:
        if time.time() - time_list[-1] <= 3600:
            return
    host_name = cm.getoutput('hostname')
    java_pid = cm.getoutput('top -b -n1 | grep java|awk \'{printf $1}\'')
    if (java_pid == ''):
        print 'no java_pid'
        return
    cpu_pct_str = cm.getoutput('top -b -p ' + java_pid + ' -n1 | tail -1 |awk \'{printf $9}\'')
    cpu_pct = float(cpu_pct_str)
    time_str = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime())
    log_name = base_dir + '/' + host_name + '_' + time_str + '_cpu_' + cpu_pct_str.split('.')[0] + '.log'
    if(cpu_pct > cpu_th):
        time_list.insert(0, time.time())
        time_list = time_list[:3]
        os.system('top -Hbp' + java_pid + ' -n1 >> ' + log_name)
        os.system('echo >> ' + log_name)
        os.system('jstack -l ' + java_pid + ' >> '+log_name)
        # if cpu is used by gc, exec jmap
        top_thread_id = cm.getoutput('cat ' + log_name + ' | grep java | head -1 | awk \'{print $1}\'')
        top_thread_id_hex = "0x" + cm.getoutput('printf \'%x\n\' ' + top_thread_id) + " "
        top_thread_gc = cm.getoutput('cat ' + log_name + ' | grep "' + top_thread_id_hex + '" | grep "GC" | wc -l')
        if top_thread_gc == '1':
            os.system('echo >> ' + log_name)
            os.system('jmap -histo:live ' + java_pid + ' | head -100 >> ' + log_name)
        # get app_name from hostname
        hostname_list = host_name.split('-')
        app_name_list = hostname_list[:len(hostname_list)-2]
        app_name='-'.join(app_name_list)
        if app_name[-1] in ['1','2']:
            app_name = app_name[:len(app_name)-1]
        oss_upload(app_name, log_name)

if __name__ == '__main__':
    i = 0
    while (True):
        # skip at startup
        if i > 18:
            mon_and_catch()
        time.sleep(10)
        i = i + 1
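The GC check in the script relies on one fact: top -H reports thread ids in decimal, while jstack prints the same id in hexadecimal as nid=0x.... A minimal sketch of that matching step in plain Python, without shelling out to grep, might look like the following. The helper names `thread_nid` and `is_gc_thread` are made up for illustration, and looking for "GC" on the thread line mirrors the script's heuristic rather than any official jstack contract.

```python
import re


def thread_nid(thread_id):
    """Format a decimal Linux thread id the way jstack prints it (0x...)."""
    return "0x%x" % int(thread_id)


def is_gc_thread(jstack_text, thread_id):
    """Return True if the jstack line carrying this thread's nid mentions GC.

    Hypothetical helper: same heuristic as the script's grep pipeline.
    """
    # \b stops 0x1a2b from also matching a longer id such as 0x1a2b3
    pattern = re.compile(r"nid=" + re.escape(thread_nid(thread_id)) + r"\b")
    for line in jstack_text.splitlines():
        if pattern.search(line):
            return "GC" in line
    return False
```

For example, thread id 6699 from top corresponds to nid=0x1a2b in the jstack output; if that line names a GC thread, the heap histogram from jmap is likely to be more useful than the stack traces.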
poststart.sh calls the Python script above (the profile is read directly from an environment variable):
# more poststart.sh
cpu_th=$1
nohup python -u /opt/perf/mon_cpu_jstack.py ${cpu_th} ${spring_profiles_active} > /opt/perf/poststart.log 2>&1 &
exit 0
The script is started automatically through the postStart hook:
The resulting files in OSS:
First-level directory: profiles
Second-level directory: application name
Third level: the jstack files themselves
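The layout above follows from how the script derives the application name from the pod hostname: it drops the last two dash-separated parts and then a trailing 1 or 2. A small sketch of that mapping is shown here. The function names and the example hostnames in the usage note are hypothetical; the bucket name k8s-jstack-log is the one used in the script.

```python
def app_name_from_hostname(host_name):
    """Drop the last two '-'-separated parts, then a trailing '1' or '2'."""
    parts = host_name.split("-")
    app_name = "-".join(parts[:-2])
    # Guard against an empty name (added in this sketch; the script has no guard)
    if app_name and app_name[-1] in ("1", "2"):
        app_name = app_name[:-1]
    return app_name


def oss_destination(bucket, profile, host_name):
    """Build the OSS directory a dump from this host is uploaded to."""
    return "oss://%s/%s/%s/" % (bucket, profile, app_name_from_hostname(host_name))
```

With a hostname such as order-service-7d9f8b6c5-x2k4q and the profile prod, the destination would be oss://k8s-jstack-log/prod/order-service/.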
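The number at the end of the file name is the CPU usage at the time of the dump as reported by top; for example, 166 means 166%, i.e. 1.66 cores.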
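When browsing many dumps, it can help to decode the file names programmatically. The sketch below assumes the naming pattern produced by the script (hostname, timestamp, then _cpu_ and the integer CPU percentage); `parse_dump_name` is a hypothetical helper, not part of the original tooling.

```python
import os
import re

# <hostname>_<YYYY-mm-dd>_<HH-MM-SS>_cpu_<percent>.log, as built by the script
NAME_RE = re.compile(
    r"^(?P<host>.+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})"
    r"_cpu_(?P<cpu>\d+)\.log$"
)


def parse_dump_name(path):
    """Split a dump file name into host, timestamp and CPU usage.

    Returns None when the name does not follow the expected pattern.
    """
    match = NAME_RE.match(os.path.basename(path))
    if match is None:
        return None
    cpu_pct = int(match.group("cpu"))
    return {
        "host": match.group("host"),
        "timestamp": match.group("date") + "_" + match.group("time"),
        "cpu_pct": cpu_pct,
        "cores": cpu_pct / 100.0,   # 100% in top equals one full core
    }
```

Sorting the parsed entries by `cores` is a quick way to find the dumps taken under the heaviest load.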