Monday, April 28, 2014

Steps to delete a Nagios poller from the Centreon MySQL database

You may need to delete a Nagios poller from the Centreon database in the following cases
(assuming you are using MySQL as your Centreon database):


  1. For some reason there are duplicate pollers (two Nagios poller IDs for the same Nagios poller hostname).
  2. You are not able to delete a Nagios poller from the Centreon UI.
  3. You have service checks associated with Centreon hosts monitored through a Nagios poller that was removed without deleting the associated service checks.


>> Take a MySQL DB backup

# cd /var/lib/mysql/backup && mkdir new
#  innobackupex /var/lib/mysql/backup/new/
#  innobackupex --apply-log /var/lib/mysql/backup/new/2014-05-31_05-01-09/
# chown -R mysql:mysql /var/lib/mysql/backup
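
>> (Optional) Check the innobackupex output for its success message. A minimal sketch, assuming the output of the commands above was saved to /var/tmp/innobackupex.log (a hypothetical path):

# grep 'completed OK' /var/tmp/innobackupex.log   # innobackupex normally prints 'completed OK!' on success
# ls /var/lib/mysql/backup/new/                   # note the timestamped backup directory used with --apply-log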

>> Find the Nagios poller in the MySQL DB

# mysql
mysql> connect centreon_status;
mysql> select * from nagios_instances;
+-------------+------------------+----------------------+
| instance_id | instance_name    | instance_description |
+-------------+------------------+----------------------+
|          99 | nagiospoller99   |                      |
+-------------+------------------+----------------------+
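
If you suspect duplicate pollers (case 1 above), a query along these lines lists instance names that appear more than once:

mysql> select instance_name, count(*) from nagios_instances group by instance_name having count(*) > 1;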

>> Take a backup of the records being deleted.

# mysqldump centreon_status nagios_instances nagios_services nagios_hoststatus nagios_servicestatus --where 'instance_id=99' --compact -n -t > instancebackup.sql 

>> Delete the poller from the centreon_status tables

# mysql
mysql> connect centreon_status;
mysql> delete from nagios_instances where instance_id=99;
mysql> delete from nagios_services where instance_id=99;
mysql> delete from nagios_hoststatus where instance_id=99;
mysql> delete from nagios_servicestatus where instance_id=99;
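
>> Optionally, verify that no rows remain for the deleted instance, for example:

mysql> select count(*) from nagios_instances where instance_id=99;
mysql> select count(*) from nagios_servicestatus where instance_id=99;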

>> You are done !
>> In case you want to restore DB

- To restore partial backup

# mysql centreon_status < instancebackup.sql   # the dump has no USE statement, so specify the database

- To restore full backup

service mysql stop 
cd /var/lib/mysql 
df -hP /var/lib/mysql # check if you have enough space
mkdir backup/old 
mv * backup/old 
mv backup/new/DATEOFBACKUP/* . 
chown -R mysql:mysql /var/lib/mysql 
service mysql start 

Reference one and two

Wednesday, April 23, 2014

Puppet code: Stop a service when a filesystem is not mounted

Say you do not want crond to run when the /mnt filesystem is not mounted.

LOGIC

Is the /mnt filesystem mounted?
YES
     Ensure the File resource is created
     Notify the crond service (assuming the crond service is defined elsewhere)
NO
     Do not do anything


CODE

exec { '/bin/echo':
    # runs (as a no-op) only when /mnt appears in /proc/mounts
    onlyif  => '/bin/cat /proc/mounts | /bin/grep -q \' /mnt \'',
    require => File['cron_file'];
}

file { 'cron_file':
    path    => '/etc/cron.d/custom_crontab',
    ensure  => file,
    source  => "puppet:///${module}/${sub_module}/custom_crontab",
    notify  => Service['crond'],
    owner   => 'root',
    group   => 'root',
    mode    => '0644';
}
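
To sanity-check the onlyif condition by hand on a node (a minimal shell sketch; the space-padded pattern matches the mount point field in /proc/mounts):

# grep -q ' /mnt ' /proc/mounts && echo '/mnt is mounted' || echo '/mnt is NOT mounted'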

Refer : this discussion

Wednesday, April 16, 2014

Adding IP resource in Parallel Veritas Cluster Group

There may be a case where we need to add an IP resource to one of the parallel groups in a VCS cluster. The IP should be unique for each system and must come up on its designated system. Below is an example of how to add such an IP resource.


group A_parallel_group (
        SystemList = { system1 = 0, system2 = 1 }
        Parallel = 1
        AutoStartList = { system1, system2 }
        )

        IP app_ip(
                Device @system1 = "eth2"
                Device @system2 = "eth3"
                Address @system1 = "10.10.10.100"
                Address @system2 = "10.10.10.200"
                NetMask = "255.255.255.0"
                )
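
The same resource can be added from the CLI using per-system (localized) attribute values; a sketch, assuming the group and systems above already exist:

haconf -makerw
hares -add app_ip IP A_parallel_group
hares -local app_ip Device
hares -local app_ip Address
hares -modify app_ip Device eth2 -sys system1
hares -modify app_ip Device eth3 -sys system2
hares -modify app_ip Address "10.10.10.100" -sys system1
hares -modify app_ip Address "10.10.10.200" -sys system2
hares -modify app_ip NetMask "255.255.255.0"
hares -modify app_ip Enabled 1
haconf -dump -makero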

Friday, April 11, 2014

Configuring proftpd as failover group in Veritas Cluster

>> Here is the relevant part of the config file if you want to edit main.cf manually. Note the default values AutoFailOver = 1 (true) and Parallel = 0 (false); hence, if the service fails on one node, it will start on the other node.

group proftpd_grp (
        SystemList = { system_1 = 0, system_2 = 1 }
        AutoStartList = { system_1, system_2 }
        OnlineRetryLimit = 3
        )

        Application proftpd_service (
                StartProgram = "/etc/init.d/proftpd start"
                PidFiles = { "/var/run/proftpd/proftpd.pid" }
                StopProgram = "/etc/init.d/proftpd stop"
                )

        IP proftpd_ip (
                Device = eth0
                Address = "192.168.10.99"
                NetMask = "255.255.255.0"
                )

        requires group some_other_group online local firm
        proftpd_service requires proftpd_ip


>> Here are the CLI commands for the same.

haconf -makerw
hagrp -add proftpd_grp
hagrp -modify proftpd_grp SystemList  system_1 0 system_2 1
hagrp -modify proftpd_grp AutoStartList  system_1 system_2
hagrp -modify proftpd_grp OnlineRetryLimit 3
hagrp -modify proftpd_grp SourceFile "./main.cf"
hares -add proftpd_service Application proftpd_grp
hares -modify proftpd_service StartProgram "/etc/init.d/proftpd start"
hares -modify proftpd_service PidFiles  "/var/run/proftpd/proftpd.pid"
hares -modify proftpd_service StopProgram "/etc/init.d/proftpd stop"
hares -modify proftpd_service User root
hares -modify proftpd_service UseSUDash 0
hares -modify proftpd_service Enabled 1
hares -add proftpd_ip IP proftpd_grp
hares -modify proftpd_ip Device eth0
hares -modify proftpd_ip Address "192.168.10.99"
hares -modify proftpd_ip NetMask "255.255.255.0"
hares -modify proftpd_ip PrefixLen 1000
hares -modify proftpd_ip Enabled 1
hares -link proftpd_service proftpd_ip
haconf -dump -makero
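
Once the configuration is saved, the group can be brought online and verified, for example:

hagrp -online proftpd_grp -sys system_1
hagrp -state proftpd_grp
hares -state proftpd_ip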

Thursday, April 3, 2014

Using blktrace, blkparse and btt to analyze block device performance in Linux

Collect blktrace
(Caution: it will increase load on the system)

mount -t debugfs debugfs /sys/kernel/debug
df /sys/kernel/debug
cd /var/tmp
mkdir blktrace_data
cd blktrace_data
# It will create one file per device per CPU
blktrace -d /dev/sdbs /dev/sdq  # dump binary data for one or more disks
blkrawverify sdbs  # it will create sdbs.verify.out file
less sdbs.verify.out
grep invalid  sdbs.verify.out
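
A bounded capture is often safer on a busy system; blktrace can stop on its own after a fixed number of seconds with -w, for example:

blktrace -d /dev/sdbs -w 30   # trace /dev/sdbs for 30 seconds, then stop
ls sdbs.blktrace.*            # one output file per CPU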

For many devices, use blkparse to combine the blktrace data and then use btt to create a report.

# combine all the files into one binary  time-ordered stream of traces
blkparse -i sdbs -d bp.sdbs.bin # .blktrace.* not required
btt -A -i bp.sdbs.bin > bp.sdbs.txt
less bp.sdbs.txt


blkparse sdbs > bp.sdbs.txt  # text file
less bp.sdbs.txt


# iostat-like output
btt -I bp.sdbs.iostat -i bp.sdbs.bin
less bp.sdbs.iostat


Interpreting Information


Note: all times below are in milliseconds.
Q2Q — time between requests sent to the block layer
Q2G — how long it takes from the time a block I/O is queued to the time it gets a request allocated for it
G2I — how long it takes from the time a request is allocated to the time it is Inserted into the device's queue
Q2M — how long it takes from the time a block I/O is queued to the time it gets merged with an existing request
I2D — how long it takes from the time a request is inserted into the device's queue to the time it is actually issued to the device
M2D — how long it takes from the time a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request

Q------->G------------>I--------->M------------------->D----------------------------->C
|-Q time-|-Insert time-|
|--------- merge time ------------|-merge with other IO|
|--------------------scheduler time--------------------|--driver,adapter,storage time--|

|----------------------- await time in iostat output ----------------------------------|

  • If Q2Q is much larger than Q2C, it means the application is not issuing I/O in rapid succession; thus, any performance problems you have may not be related to the I/O subsystem at all.
  • If D2C is very high, the device is taking a long time to service requests. This can indicate that the device is simply overloaded (which may be because it is a shared resource), or that the workload sent down to the device is sub-optimal.
  • If Q2G is very high, it means that there are a lot of requests queued concurrently. This could indicate that the storage is unable to keep up with the I/O load.
  • await in iostat output = Q2C = Q2I + I2D + D2C
  • Q2I + I2D == scheduler time
  • The I2D time can include a lot of apparent extra time due to plug and unplug events (not shown above), which are used to improve merging of I/O within the scheduler sort queue.
  • D2C time covers driver time, adapter time, transport time, and storage service time (and back).
  • So if D2C/Q2C is approaching 1, it means the percentage of time spent in the storage components is high.
  • If D2C times are high, the underlying transport structure of the storage needs to be examined, such as switch counters or the maintenance interfaces of the storage boxes themselves.

  • References:
    * A good hp doc
    * Redhat doc
    * Redhat article
    * btt user guide
    * blktrace user guide

Tuesday, April 1, 2014

How to change IO scheduler and LUN IO queue depth

To change scheduler to cfq for a LUN

# echo cfq > /sys/block/sdxx/queue/scheduler
# cat /sys/block/sdxx/queue/scheduler
  noop anticipatory deadline [cfq]


To change the scheduler to deadline for all paths of all multipath devices

# multipath -l|awk '$3~/^sd/ {print $3}'|while read dev; do echo deadline > /sys/block/$dev/queue/scheduler; done
# multipath -l| awk '$3~/^sd/ {print $3}'|while read dev; do echo -n "$dev "; cat /sys/block/$dev/queue/scheduler; done

Add elevator=deadline to the kernel line in grub.conf to retain it over reboot.
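
On RHEL-style systems, grubby can append the parameter to every kernel entry instead of editing grub.conf by hand (a sketch; verify the resulting kernel line afterwards):

# grubby --update-kernel=ALL --args="elevator=deadline"
# grep elevator /boot/grub/grub.conf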


To change queue depth from 16 to 32 for a QLogic HBA

# cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth
16
# echo 32 > /sys/module/qla2xxx/parameters/ql2xmaxqdepth
# cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth

Modify /etc/modprobe.conf to retain it over reboot:
options qla2xxx ql2xmaxqdepth=32 qlport_down_retry=14 ql2xloginretrycount=30
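
To confirm the effective per-LUN queue depth after the change, the per-device sysfs value can be checked (a small sketch):

# for d in /sys/block/sd*/device/queue_depth; do echo -n "$d: "; cat $d; done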


To monitor the LUN queue

# cat  $(echo /proc/scsi/sg/device{_hdr,s})|more
host   chan    id      lun     type    opens   qdepth  busy   online
3       0       0       12      0       1       32     5       1

If busy is getting close to qdepth, we need to increase the LUN qdepth. We should not increase qdepth too much, though!
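
To flag devices whose busy count is approaching their queue depth, something along these lines can help (a rough sketch; qdepth and busy are columns 7 and 8 of /proc/scsi/sg/devices):

# awk '$8 >= 0.8 * $7 {print "near qdepth:", $0}' /proc/scsi/sg/devices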

The column headers in 'device_hdr' are given below. If a device is not present (and one is present after it), then a line of "-1" entries is output. Each entry is separated by whitespace (currently a tab):
host            host number (indexes 'hosts' table, origin 0)
chan            channel number of device
id              SCSI id of device
lun             Logical Unit number of device
type            SCSI type (e.g. 0->disk, 5->cdrom, 6->scanner)
opens           number of opens (by sd, sr, st and sg) at this time
qdepth          maximum queue depth supported by device
busy            number of commands being processed by host for this device
online          1 indicates device is in normal online state, 0->offline