Ansari's Technical Notes: Linux kernel - crash utility to analyze vmcore

What is crash?

A tool for interactively analyzing the state of a system from a kernel crash dump generated by kdump, netdump, diskdump, LKCD, xendump, or kvmdump.

Which packages are required?

yum install crash kernel-debuginfo-$(uname -r) kernel-debuginfo-common-x86_64-$(uname -r)
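
The command above uses $(uname -r), so it installs the debuginfo for the kernel that is currently running. If the vmcore came from a different kernel, the package names need that kernel's release string instead. A minimal sketch, assuming the same package naming as the command above; the helper name debuginfo_install_cmd is made up for illustration, and it only prints the command rather than running it:

```shell
# Print the yum command for a given kernel release (hypothetical helper).
# Assumption: package names follow the same pattern as the command above.
debuginfo_install_cmd() {
    release="$1"
    echo "yum install crash kernel-debuginfo-${release} kernel-debuginfo-common-x86_64-${release}"
}

# Example: release string of the kernel that produced the vmcore.
debuginfo_install_cmd "2.6.32-358.el6.x86_64"
```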

How to check that the kernel crash file (vmcore) is valid?

If the command below prints the date on which the kernel crash dump was created and then exits, the vmcore is valid.

# crash -st /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore
Tue Mar 10 04:02:55 2015
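
For scripted checks, the same test can be wrapped in a function. This is only a sketch: the function names are invented here, CRASH_BIN is an assumed override variable (handy for trying the wrapper with a stub), and the date pattern assumes the format shown above, optionally with a time zone name before the year.

```shell
# Return 0 if any line on stdin looks like the timestamp printed by "crash -st".
# Assumption: format is "Tue Mar 10 04:02:55 2015", optionally with a
# time zone name before the year.
looks_like_crash_date() {
    grep -Eq '^[A-Z][a-z]{2} [A-Z][a-z]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}( [A-Z]+)? [0-9]{4}$'
}

# Hypothetical wrapper: run "crash -st" and check what it printed.
# Arguments: path to vmlinux, path to vmcore.
vmcore_is_valid() {
    vmlinux="$1"
    vmcore="$2"
    "${CRASH_BIN:-crash}" -st "$vmlinux" "$vmcore" 2>/dev/null | looks_like_crash_date
}
```

It could then be used as: vmcore_is_valid /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore && echo "vmcore looks valid"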

Is the vmcore created by the crash valid?

It takes a couple of minutes to load the kernel and reach the crash prompt, provided the vmcore is valid and the debuginfo package matches the kernel that crashed. The startup banner shows the number of CPUs, the date of the crash, uptime, load average, number of tasks, node name, kernel release, memory size, the panic message, and the task that was running when the system panicked.

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore
crash 5.1.8-2.el5_9
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/2.6.18-308.8.2.el5/vmlinux
DUMPFILE: vmcore
CPUS: 2
DATE: Tue Mar 10 04:02:55 2015
UPTIME: 1 days, 00:09:41
LOAD AVERAGE: 9.29, 4.35, 2.80
TASKS: 205
NODENAME: dev001.example.com
RELEASE: 2.6.18-308.8.2.el5
VERSION: #1 SMP Tue May 29 11:54:17 EDT 2012
MACHINE: x86_64 (2699 Mhz)
MEMORY: 3.9 GB
PANIC: "Kernel panic - not syncing: out of memory. panic_on_oom is selected"
PID: 3419
COMMAND: "sshd"
TASK: ffff81013b2e5860 [THREAD_INFO: ffff81012f594000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
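
When the summary above has been saved to a file, single fields can be pulled out for reports. A small sketch, assuming the "NAME: value" layout shown above (possibly indented); sys_field is a hypothetical helper name:

```shell
# Print the value of one field (e.g. PANIC, RELEASE) from crash's summary
# header read on stdin. Hypothetical helper; assumes "NAME: value" lines.
sys_field() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # header lines may be indented
            prefix = name ": "
            if (index(line, prefix) == 1) {   # exact field name at line start
                print substr(line, length(prefix) + 1)
                exit
            }
        }'
}
```

Because the match is on the full field name followed by a colon, sys_field CPU returns the panicking CPU and is not confused by the CPUS line.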

How to use crash commands?

-log : dump the kernel log_buf contents in chronological order. The most interesting information is usually at the end of the log: kernel thread dumps, memory state, swap usage, etc.

crash> log
.....
lowmem_reserve[]: 0 3000 4010 4010
Node 0 DMA32 free:10008kB min:6052kB low:7564kB high:9076kB active:1483204kB inactive:1420176kB present:3072160kB pages_scanned:10951420 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Node 0 Normal free:1980kB min:2036kB low:2544kB high:3052kB active:442232kB inactive:458364kB present:1034240kB pages_scanned:1716084 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 4*4kB 3*8kB 4*16kB 1*32kB 3*64kB 4*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 2*4096kB = 10056kB
Node 0 DMA32: 0*4kB 5*8kB 1*16kB 1*32kB 1*64kB 1*128kB 4*256kB 1*512kB 0*1024kB 0*2048kB 2*4096kB = 10008kB
Node 0 Normal: 5*4kB 5*8kB 0*16kB 0*32kB 2*64kB 4*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1980kB
Node 0 HighMem: empty
10029 pagecache pages
Swap cache: add 127866954, delete 127858677, find 188220155/188834341, race 0+1692
Free swap = 0kB
Total swap = 2031608kB
Free swap: 0kB
1310720 pages of RAM
332619 reserved pages
20496 pages shared
8412 pages swap cached
Kernel panic - not syncing: out of memory. panic_on_oom is selected

INFO: task kjournald:602 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D ffff810009004420 0 602 49 631 578 (L-TLB)

INFO: task nagios:20831 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nagios D ffff810009004420 0 20831 3763 20729 (NOTLB)
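
In the per-zone free-list lines of the log above, each term is "number of free blocks * block size", and the terms add up to the free memory of the zone. For DMA32: 5*8 + 1*16 + 1*32 + 1*64 + 1*128 + 4*256 + 1*512 + 2*4096 = 10008 kB, which matches "free:10008kB" on the DMA32 line. A sketch for checking such lines from a saved log; buddy_total_kb is a hypothetical helper name:

```shell
# Sum the "count*sizekB" terms of each free-list line on stdin and print the
# total in kB, one result per input line. Hypothetical helper for lines like
#   Node 0 DMA32: 0*4kB 5*8kB ... = 10008kB
buddy_total_kb() {
    awk '{
        total = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+[*][0-9]+kB$/) {
                split($i, part, "[*]")            # part[1] = count, part[2] = "<size>kB"
                total += part[1] * (part[2] + 0)  # "+ 0" keeps only the numeric prefix
            }
        }
        print total
    }'
}
```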

-ps : display process status at the time of the crash. Active tasks are marked with a leading >

crash> ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff80319b60 RU 0.0 0 0 [swapper]
0 1 1 ffff8101047460c0 RU 0.0 0 0 [swapper]
1 0 1 ffff8101047347a0 IN 0.0 10372 484 init
2 1 0 ffff810104734040 IN 0.0 0 0 [migration/0]
2979 1 1 ffff81013f5827a0 IN 0.0 28664 356 restorecond
> 3419 1 1 ffff81013b2e5860 RU 0.0 60808 456 sshd
2988 19639 0 ffff810130580820 IN 0.0 5876 388 vmstat
2990 1 1 ffff81013cb267a0 IN 0.0 5932 472 syslogd
crash>
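
On a busy system the ps listing is long, so it can help to filter a saved copy for the active tasks only. A minimal sketch, assuming the column layout shown above where the last column is the command name; active_tasks is a hypothetical helper name:

```shell
# Print "PID COMM" for every task marked active (leading ">") in saved
# "ps" output read on stdin. Hypothetical helper.
active_tasks() {
    awk '/^>/ {
        sub(/^> */, "")     # drop the marker so the PID becomes field 1
        print $1, $NF
    }'
}
```

For the listing above this prints "3419 sshd".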

-bt : display a kernel stack backtrace. bt -a shows the stack traces of all active tasks.

crash> bt
PID: 3419 TASK: ffff81013b2e5860 CPU: 1 COMMAND: "sshd"
#0 [ffff81012f595aa0] crash_kexec at ffffffff800b099c
#1 [ffff81012f595b60] panic at ffffffff80093989
#2 [ffff81012f595c50] out_of_memory at ffffffff800caa5d
#3 [ffff81012f595ca0] __alloc_pages at ffffffff8000f612
#4 [ffff81012f595d10] read_swap_cache_async at ffffffff80032415
#5 [ffff81012f595d50] swapin_readahead at ffffffff800d0777
#6 [ffff81012f595da0] __handle_mm_fault at ffffffff800092d9
#7 [ffff81012f595e60] do_page_fault at ffffffff80067202
#8 [ffff81012f595f50] error_exit at ffffffff8005dde9
RIP: 00002b3e2f66a25a RSP: 00007fff86964840 RFLAGS: 00010206
RAX: 0000000000000000 RBX: 00002b3e2f9261d8 RCX: 00002b3e2f9261d8
RDX: 00002b3e4a2363e0 RSI: 000000000000000a RDI: 0000000000000001
RBP: 00007fff86964860 R8: 00002b3e4a2363e0 R9: 000000000000000a
R10: 00007fff86964e10 R11: 0000000000000246 R12: 00002b3e4a236570
R13: 00007fff86964e50 R14: 0000000000000001 R15: 00002b3e2cd76078
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
crash>
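
The trace above reads from the bottom up: a page fault led to a swap-in, the page allocation failed, out_of_memory was called, and the kernel panicked. To get just that call chain out of saved bt output, something like the following can be used. It is a sketch that assumes the "#N [address] function at address" frame layout shown above; bt_functions is a hypothetical helper name:

```shell
# Print the function name of each frame in saved "bt" output read on stdin,
# in the order bt lists them (innermost frame first). Hypothetical helper.
bt_functions() {
    awk '$1 ~ /^#[0-9]+$/ && $4 == "at" { print $3 }'
}
```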

-bt <pid> : backtrace a specific PID

crash> bt <pid>
crash> bt -f <pid>

-bt -f : display all stack data contained in a frame; this option can be used to determine the arguments passed to each function

crash> bt -f
#0 [ffff81012f595aa0] crash_kexec at ffffffff800b099c
#1 [ffff81012f595b60] panic at ffffffff80093989
#2 [ffff81012f595c50] out_of_memory at ffffffff800caa5d

-kmem : show details of a kernel memory address, e.g. for frame #2 above

crash> kmem ffffffff800caa5d
ffffffff800caa5d (T) out_of_memory+593 ../debug/kernel-2.6.18/linux-2.6.18-308.8.2.el5.x86_64/mm/oom_kill.c: 506

PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffff810100009c30 2ca000 0 0 1 400
crash>

-whatis : search symbol table for data or type information

crash> whatis crash_kexec
void crash_kexec(struct pt_regs *);

crash> whatis panic
void panic(const char *, ...);

crash> whatis out_of_memory
void out_of_memory(struct zonelist *, gfp_t, int, int);

-Some more commands

crash> sys|egrep -i "cpu|date|uptime|load|tasks|node|release|machine|memory|panic"
DUMPFILE: /cores/retrace/tasks/152431766/crash/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Wed Sep 9 07:54:53 2015
UPTIME: 322 days, 11:00:25
LOAD AVERAGE: 0.13, 0.13, 0.13
TASKS: 755
NODENAME: ms00456
RELEASE: 2.6.32-358.el6.x86_64
MACHINE: x86_64 (2494 Mhz)
MEMORY: 192 GB
PANIC: "Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details."

crash> rd -a 0xffffffff8201c001 100
ffffffff8201c000: HP
ffffffff8201c004: P70
ffffffff8201c008: 2.8
ffffffff8201c010: 12/20/2013 <<<
ffffffff8201c01c: HP
ffffffff8201c020: ProLiant DL380p Gen8

crash> dis -rl 0xffffffffa00574ca
0xffffffffa00574c8 : callq 0xffffffff8150cf21
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/watchdog/hpwdt.c: 495

crash> mod |grep -E "NAME|"
MODULE NAME SIZE OBJECT FILE
ffffffffa00581a0 hpwdt 7094 /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/watchdog/hpwdt.ko.debug

crash> px ((struct module *)0xffffffffa00581a0)->name
$2 = "hpwdt\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

crash> px ((struct module *)0xffffffffa00581a0)->version
$3 = 0xffff88301782cb20 "1.3.0"

crash> px ((struct module *)0xffffffffa00581a0)->srcversion
$4 = 0xffff8830194fe4c0 "87D39D97B9E0A6F667C8671"

crash> px ((struct module *)0xffffffffa00581a0)->gpgsig_ok
$5 = 0x1

crash> px notify_die
notify_die = $6 =
{int (enum die_val, const char *, struct pt_regs *, long, int, int)} 0xffffffff8109cbd0

crash> eval -b 00000000000000f1
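
The -b option of eval indicates which bit positions are set in the resulting value. No output is shown for the command above, so here is the same calculation as a shell sketch, assuming the argument is read as hexadecimal f1 (decimal 241); bits_set is a hypothetical helper name and does not reproduce crash's output format:

```shell
# Print the bit positions that are set in a hexadecimal value, highest first.
# Hypothetical helper; the value is passed without a "0x" prefix.
bits_set() {
    value=$((0x$1))
    bit=63
    out=""
    while [ "$bit" -ge 0 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="$out$bit "
        fi
        bit=$((bit - 1))
    done
    echo "${out% }"     # trim the trailing space
}
```

bits_set 00000000000000f1 prints "7 6 5 4 0", since f1 is 11110001 in binary.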

-Other crash commands

help : list all supported crash commands
help <command> : show the available options and details of a crash command, e.g. 'help bt'
net : show the network interfaces and their configured IP addresses
mount : show mount points and mount options at the time of the crash
sys : system-specific information, exactly what you see when crash starts
swap : swap usage
task : task structure of the running task. In the example above, it is sshd with PID 3419
runq : display the tasks on the run queues of each CPU
set -v : display the current state of internal crash variables

How to run crash analysis unattended and save the output to a text file?

# cat > inputfile.txt
sys
mount
net
swap
log
ps
runq
bt
bt -a
bt -g
bt -t
bt -al
bt -f
foreach bt
task
kmem
kmem -i
kmem -S
exit

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore -i inputfile.txt > crash-analysys.txt
# less crash-analysys.txt
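
The "cat > inputfile.txt" step above expects the commands to be typed or pasted and ended with Ctrl-D. For a cron job or a script, the command file can be written with a here-document instead. A sketch under those assumptions: both function names are invented for illustration, and the command list is a shortened version of the one above.

```shell
# Write the crash command file without typing it interactively.
# Hypothetical helper; the command list is a subset of the one above.
write_crash_input() {
    cat > "$1" <<'EOF'
sys
log
ps
bt -a
kmem -i
exit
EOF
}

# Run crash in batch mode and save everything it prints.
# Arguments: vmlinux, vmcore, command file, output file.
run_crash_batch() {
    crash "$1" "$2" -i "$3" > "$4" 2>&1
}
```

Keeping "exit" as the last line of the command file matters: without it crash stays at its prompt after the last command instead of returning to the shell.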

- References
http://people.redhat.com/anderson/crash_whitepaper/
http://www.dedoimedo.com/computers/crash-book.html
https://codeascraft.com/2012/03/30/kernel-debugging-101/

Comments:

Uthavumkarangal, September 8, 2015 at 12:13 AM
Hi Ansari, thanks for the article. Where can we get the /usr/lib/debug/lib/modules/$(uname -r)/vmlinux file? Sundaram
Nasimuddin Ansari, September 8, 2015 at 12:50 AM
Hi Uthavumkarangal,

(1) http://debuginfo.centos.org/

(2) https://access.redhat.com/solutions/9907

For Red Hat Enterprise Linux 5.8+, 6 and 7
With the release of RHEL 6 the debuginfo packages are no longer provided via the Red Hat public FTP site. They have instead moved to Red Hat Network (RHN) classic or Red Hat Satellite for download.

Tuesday, March 31, 2015
