Monday, April 28, 2014

Steps to delete a Nagios poller from the Centreon MySQL database

You may need to delete a Nagios poller from the Centreon database in the following cases
(assuming you are using MySQL as your Centreon database):


  1. For some reason there are duplicate pollers (two Nagios poller IDs for the same Nagios poller hostname).
  2. You are not able to delete a Nagios poller from the Centreon UI.
  3. You have service checks associated with Centreon hosts monitored through a Nagios poller that was removed without deleting the associated service checks.


>> Take a MySQL DB backup

# cd /var/lib/mysql/backup && mkdir new
#  innobackupex /var/lib/mysql/backup/new/
#  innobackupex --apply-log /var/lib/mysql/backup/new/2014-05-31_05-01-09/
# chown -R mysql:mysql /var/lib/mysql/backup
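
>> (Optional) Check the innobackupex output for its success message. A minimal sketch, assuming the output of the commands above was saved to /var/tmp/innobackupex.log (a hypothetical path):

# grep 'completed OK' /var/tmp/innobackupex.log   # innobackupex normally prints 'completed OK!' on success
# ls /var/lib/mysql/backup/new/                   # note the timestamped backup directory used with --apply-log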

>> Find the Nagios poller in the MySQL DB

# mysql
mysql> connect centreon_status;
mysql> select * from nagios_instances;
+-------------+------------------+----------------------+
| instance_id | instance_name    | instance_description |
+-------------+------------------+----------------------+
|          99 | nagiospoller99   |                      |
+-------------+------------------+----------------------+
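
If you suspect duplicate pollers (case 1 above), a query along these lines lists instance names that appear more than once:

mysql> select instance_name, count(*) from nagios_instances group by instance_name having count(*) > 1;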

>> Take a backup of the records being deleted.

# mysqldump centreon_status nagios_instances nagios_services nagios_hoststatus nagios_servicestatus --where 'instance_id=99' --compact -n -t > instancebackup.sql 

>> Delete the poller from the centreon_status tables

# mysql
mysql> connect centreon_status;
mysql> delete from nagios_instances where instance_id=99;
mysql> delete from nagios_services where instance_id=99;
mysql> delete from nagios_hoststatus where instance_id=99;
mysql> delete from nagios_servicestatus where instance_id=99;
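
>> Optionally, verify that no rows remain for the deleted instance, for example:

mysql> select count(*) from nagios_instances where instance_id=99;
mysql> select count(*) from nagios_servicestatus where instance_id=99;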

>> You are done !
>> In case you want to restore DB

- To restore partial backup

# mysql centreon_status < instancebackup.sql   # the dump has no USE statement, so specify the database

- To restore full backup

service mysql stop 
cd /var/lib/mysql 
df -hP /var/lib/mysql # check if you have enough space
mkdir backup/old 
mv * backup/old 
mv backup/new/DATEOFBACKUP/* . 
chown -R mysql:mysql /var/lib/mysql 
service mysql start 

Reference one and two

Wednesday, April 23, 2014

Puppet code: Stop a service when a filesystem is not mounted

Say you do not want crond to run when the /mnt filesystem is not mounted.

LOGIC

Is the /mnt filesystem mounted?
YES
     Ensure the File resource is created
     Notify the crond service (assuming the crond service is defined elsewhere)
NO
     Do not do anything


CODE

exec { '/bin/echo':
    # runs (as a no-op) only when /mnt appears in /proc/mounts
    onlyif  => '/bin/cat /proc/mounts | /bin/grep -q \' /mnt \'',
    require => File['cron_file'];
}

file { 'cron_file':
    path    => '/etc/cron.d/custom_crontab',
    ensure  => file,
    source  => "puppet:///${module}/${sub_module}/custom_crontab",
    notify  => Service['crond'],
    owner   => 'root',
    group   => 'root',
    mode    => '0644';
}
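
To sanity-check the onlyif condition by hand on a node (a minimal shell sketch; the space-padded pattern matches the mount point field in /proc/mounts):

# grep -q ' /mnt ' /proc/mounts && echo '/mnt is mounted' || echo '/mnt is NOT mounted'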

Refer : this discussion

Wednesday, April 16, 2014

Adding IP resource in Parallel Veritas Cluster Group

There may be a case where we need to add an IP resource to one of the parallel groups in a VCS cluster. The IP should be unique for each system and must come up on its designated system. Below is an example of how to add such an IP resource.


group A_parallel_group (
        SystemList = { system1 = 0, system2 = 1 }
        Parallel = 1
        AutoStartList = { system1, system2 }
        )

        IP app_ip(
                Device @system1 = "eth2"
                Device @system2 = "eth3"
                Address @system1 = "10.10.10.100"
                Address @system2 = "10.10.10.200"
                NetMask = "255.255.255.0"
                )
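
The same resource can be added from the CLI using per-system (localized) attribute values; a sketch, assuming the group and systems above already exist:

haconf -makerw
hares -add app_ip IP A_parallel_group
hares -local app_ip Device
hares -local app_ip Address
hares -modify app_ip Device eth2 -sys system1
hares -modify app_ip Device eth3 -sys system2
hares -modify app_ip Address "10.10.10.100" -sys system1
hares -modify app_ip Address "10.10.10.200" -sys system2
hares -modify app_ip NetMask "255.255.255.0"
hares -modify app_ip Enabled 1
haconf -dump -makero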

Friday, April 11, 2014

Configuring proftpd as failover group in Veritas Cluster

>> Here is the relevant part of the config file if you want to edit main.cf manually. Note the default values AutoFailOver = 1 (true) and Parallel = 0 (false); hence, if the service fails on one node, it will start on the other node.

group proftpd_grp (
        SystemList = { system_1 = 0, system_2 = 1 }
        AutoStartList = { system_1, system_2 }
        OnlineRetryLimit = 3
        )

        Application proftpd_service (
                StartProgram = "/etc/init.d/proftpd start"
                PidFiles = { "/var/run/proftpd/proftpd.pid" }
                StopProgram = "/etc/init.d/proftpd stop"
                )

        IP proftpd_ip (
                Device = eth0
                Address = "192.168.10.99"
                NetMask = "255.255.255.0"
                )

        requires group some_other_group online local firm
        proftpd_service requires proftpd_ip


>> Here are the CLI commands for the same.

haconf -makerw
hagrp -add proftpd_grp
hagrp -modify proftpd_grp SystemList  system_1 0 system_2 1
hagrp -modify proftpd_grp AutoStartList  system_1 system_2
hagrp -modify proftpd_grp OnlineRetryLimit 3
hagrp -modify proftpd_grp SourceFile "./main.cf"
hares -add proftpd_service Application proftpd_grp
hares -modify proftpd_service StartProgram "/etc/init.d/proftpd start"
hares -modify proftpd_service PidFiles  "/var/run/proftpd/proftpd.pid"
hares -modify proftpd_service StopProgram "/etc/init.d/proftpd stop"
hares -modify proftpd_service User root
hares -modify proftpd_service UseSUDash 0
hares -modify proftpd_service Enabled 1
hares -add proftpd_ip IP proftpd_grp
hares -modify proftpd_ip Device eth0
hares -modify proftpd_ip Address "192.168.10.99"
hares -modify proftpd_ip NetMask "255.255.255.0"
hares -modify proftpd_ip PrefixLen 1000
hares -modify proftpd_ip Enabled 1
hares -link proftpd_service proftpd_ip
haconf -dump -makero
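
Once the configuration is saved, the group can be brought online and verified, for example:

hagrp -online proftpd_grp -sys system_1
hagrp -state proftpd_grp
hares -state proftpd_ip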

Thursday, April 3, 2014

Using blktrace, blkparse and btt to analyze block device performance in Linux

Collect blktrace
(Caution: it will increase load on the system)

mount -t debugfs debugfs /sys/kernel/debug
df /sys/kernel/debug
cd /var/tmp
mkdir blktrace_data
cd blktrace_data
# It will create one file per device per CPU
blktrace -d /dev/sdbs /dev/sdq  # dump binary data for one or more disks
blkrawverify sdbs  # it will create sdbs.verify.out file
less sdbs.verify.out
grep invalid  sdbs.verify.out
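
A bounded capture is often safer on a busy system; blktrace can stop on its own after a fixed number of seconds with -w, for example:

blktrace -d /dev/sdbs -w 30   # trace /dev/sdbs for 30 seconds, then stop
ls sdbs.blktrace.*            # one output file per CPU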

For many devices, use blkparse to combine the blktrace data and then use btt to create a report.

# combine all the files into one binary  time-ordered stream of traces
blkparse -i sdbs -d bp.sdbs.bin # .blktrace.* not required
btt -A -i bp.sdbs.bin > bp.sdbs.txt
less bp.sdbs.txt


blkparse sdbs > bp.sdbs.txt  # text file
less bp.sdbs.txt


# iostat-like output
btt -I bp.sdbs.iostat -i bp.sdbs.bin
less bp.sdbs.iostat


Interpreting Information


Note: all times below are in milliseconds.
Q2Q — time between requests sent to the block layer
Q2G — how long it takes from the time a block I/O is queued to the time it gets a request allocated for it
G2I — how long it takes from the time a request is allocated to the time it is Inserted into the device's queue
Q2M — how long it takes from the time a block I/O is queued to the time it gets merged with an existing request
I2D — how long it takes from the time a request is inserted into the device's queue to the time it is actually issued to the device
M2D — how long it takes from the time a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request

Q------->G------------>I--------->M------------------->D----------------------------->C
|-Q time-|-Insert time-|
|--------- merge time ------------|-merge with other IO|
|--------------------scheduler time--------------------|--driver,adapter,storage time--|

|----------------------- await time in iostat output ----------------------------------|

  • If Q2Q is much larger than Q2C, it means the application is not issuing I/O in rapid succession; thus, any performance problems you have may not be related to the I/O subsystem at all.
  • If D2C is very high, the device is taking a long time to service requests. This can indicate that the device is simply overloaded (which may be because it is a shared resource), or that the workload sent down to the device is sub-optimal.
  • If Q2G is very high, it means that there are a lot of requests queued concurrently. This could indicate that the storage is unable to keep up with the I/O load.
  • await in iostat output = Q2C = Q2I + I2D + D2C
  • Q2I + I2D == scheduler time
  • The I2D time can include a lot of apparent extra time due to plug and unplug events (not shown above), which are used to improve merging of I/O within the scheduler sort queue.
  • D2C time covers driver time, adapter time, transport time, and storage service time (and back).
  • So if D2C/Q2C is approaching 1, it means the percentage of time spent in the storage components is high.
  • If D2C times are high, the underlying transport structure of the storage needs to be examined, such as switch counters or the maintenance interfaces of the storage boxes themselves.

  • References:
    * A good hp doc
    * Redhat doc
    * Redhat article
    * btt user guide
    * blktrace user guide

Tuesday, April 1, 2014

How to change IO scheduler and LUN IO queue depth

To change scheduler to cfq for a LUN

# echo cfq > /sys/block/sdxx/queue/scheduler
# cat /sys/block/sdxx/queue/scheduler
  noop anticipatory deadline [cfq]


To change the scheduler to deadline for all paths of all multipath devices

# multipath -l|awk '$3~/^sd/ {print $3}'|while read dev; do echo deadline > /sys/block/$dev/queue/scheduler; done
# multipath -l| awk '$3~/^sd/ {print $3}'|while read dev; do echo -n "$dev "; cat /sys/block/$dev/queue/scheduler; done

Add elevator=deadline to the kernel line in grub.conf to retain it over reboot.
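
On RHEL-style systems, grubby can append the parameter to every kernel entry instead of editing grub.conf by hand (a sketch; verify the resulting kernel line afterwards):

# grubby --update-kernel=ALL --args="elevator=deadline"
# grep elevator /boot/grub/grub.conf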


To change queue depth from 16 to 32 for a QLogic HBA

# cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth
16
# echo 32 > /sys/module/qla2xxx/parameters/ql2xmaxqdepth
# cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth

Modify /etc/modprobe.conf to retain it over reboot:
options qla2xxx ql2xmaxqdepth=32 qlport_down_retry=14 ql2xloginretrycount=30
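
To confirm the effective per-LUN queue depth after the change, the per-device sysfs value can be checked (a small sketch):

# for d in /sys/block/sd*/device/queue_depth; do echo -n "$d: "; cat $d; done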


To monitor the LUN queue

# cat  $(echo /proc/scsi/sg/device{_hdr,s})|more
host   chan    id      lun     type    opens   qdepth  busy   online
3       0       0       12      0       1       32     5       1

If busy is getting close to qdepth, we need to increase the LUN qdepth. We should not increase qdepth too much, though!
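
To flag devices whose busy count is approaching their queue depth, something along these lines can help (a rough sketch; qdepth and busy are columns 7 and 8 of /proc/scsi/sg/devices):

# awk '$8 >= 0.8 * $7 {print "near qdepth:", $0}' /proc/scsi/sg/devices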

The column headers in 'device_hdr' are given below. If a device is not present (and one is present after it), then a line of "-1" entries is output. Each entry is separated by whitespace (currently a tab):
host            host number (indexes 'hosts' table, origin 0)
chan            channel number of device
id              SCSI id of device
lun             Logical Unit number of device
type            SCSI type (e.g. 0->disk, 5->cdrom, 6->scanner)
opens           number of opens (by sd, sr, st and sg) at this time
qdepth          maximum queue depth supported by device
busy            number of commands being processed by host for this device
online          1 indicates device is in normal online state, 0->offline