Monday, September 28, 2015

Veritas Replication stopped: paused due to network disconnection (dcm resynchronization)

VVR replication stopped with the status below. Seeing enabled, attached and consistent is a good sign: it means we are in a state from which replication can be recovered/resumed quickly.

Status (on both Primary and Secondary):


# vradmin -g data_dg repstatus data_rvg
Replicated Data Set: data_rvg
Primary:
  Host name:                  primary-vvr-host
  RVG name:                   data_rvg
  DG name:                    data_dg
  RVG state:                  enabled for I/O
  Data volumes:               1
  VSets:                      0
  SRL name:                   srl_volume
  SRL size:                   24.54 G
  Total secondaries:          1
Secondary:
  Host name:                  secondary-vvr-host
  RVG name:                   data_rvg
  DG name:                    data_dg
  Data status:                consistent, behind
  Replication status:         resync paused due to network disconnection (dcm resynchronization)
  Current mode:               asynchronous
  Logging to:                 DCM (contains  4567890  Kbytes) (SRL protection logging)


# vxprint -Pl
Disk group: data_dg

Rlink:    rlk_secondary-vvr-host
info:     timeout=500 rid=0.1076
          latency_high_mark=10000 latency_low_mark=9950
          bandwidth_limit=none
state:    state=ACTIVE
          synchronous=off latencyprot=off srlprot=autodcm
assoc:    rvg=data_rvg
          remote_host=secondary-vvr-host IP_addr=10.50.60.70 port=4145
          remote_dg=data_dg
          remote_dg_dgid=1397347157.35.secondary-host
          remote_rvg_version=unknown
          remote_rlink=rlk_primary-vvr-host
          remote_rlink_rid=0.1078
          local_host=primary-vvr-host IP_addr=10.20.30.58 port=4145
protocol: TCP/IP
flags:    write enabled attached consistent disconnected asynchronous dcm_logging
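
As a quick way to read the flags line above, here is a minimal sketch that reports whether the flags of interest are present. `rlink_summary` is a hypothetical helper name, not a VVR command, and it assumes a single rlink as in this setup. Flags are matched as whole words so that `inconsistent` is not mistaken for `consistent`.

```shell
# rlink_summary (hypothetical helper): reads `vxprint -Pl` output on stdin
# and reports, for the first flags line, whether each flag is present.
rlink_summary() {
    flags=$(grep '^flags:' | head -n 1)
    for f in enabled attached consistent disconnected; do
        # Pad with spaces so only whole words match
        # (e.g. "inconsistent" must not count as "consistent").
        case " $flags " in
            *" $f "*) echo "$f=yes" ;;
            *)        echo "$f=no" ;;
        esac
    done
}

# Usage on a VVR host (not run here):
#   vxprint -Pl | rlink_summary
```

For the output shown above, all four flags would be reported as `yes`, which matches the "paused but recoverable" situation described in this post.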

Possible Cause


A network disconnection between the Primary and Secondary sites at some point in time


Solution

CHECK VVR HEALTH


--Which ports is VVR using?

# /sbin/vrport
heartbeat=4145
vradmind=8199
vxrsyncd=8989
data=Anonymous-Ports
 
 
--What is the daemon status?

# /etc/init.d/vras-vradmind.sh status
VxVM VVR  V-5-4-0 VRAS daemon running: [  OK  ]
VxVM VVR  V-5-4-0 vvr_stats running: [  OK  ]
# /usr/sbin/vxstart_vvr status
VxVM VVR INFO V-5-2-3935 Using following ports:
heartbeat: 4145
vradmind: 8199
vxrsyncd: 8989
data: Anonymous-Ports
To change, see vrport(1M) command
VxVM VVR vxnetd INFO V-5-1-15103  Cannot create IPv6 socket Address family not supported by protocol
VxVM VVR  V-5-2-5942 Starting Communication daemon: [  OK  ]

 
--Are the ports up and listening on both the Primary and Secondary sites?

# netstat -nap|grep -Ew "4145|8199|8989"
  tcp        0      0 0.0.0.0:8199                0.0.0.0:*                   LISTEN      2080/vradmind
  tcp        0      0 0.0.0.0:4145                0.0.0.0:*                   LISTEN      -
  tcp        0      0 0.0.0.0:8989                0.0.0.0:*                   LISTEN      2161/in.vxrsyncd
  tcp        0      0 10.20.30.67:8199            10.20.30.57:47574           ESTABLISHED 2080/vradmind
  udp        0      0 0.0.0.0:4145                0.0.0.0:*                               -
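
The netstat check above can be turned into a small filter that prints any expected port with no TCP LISTEN socket. This is only a sketch: `missing_listeners` is a hypothetical helper name, and it assumes the Linux `netstat -nap` column layout shown above (local address in column 4, state in column 6).

```shell
# missing_listeners (hypothetical helper): reads netstat output on stdin
# and prints each port given as an argument that has no TCP LISTEN socket.
missing_listeners() {
    input=$(cat)
    for port in "$@"; do
        if ! printf '%s\n' "$input" |
            awk -v p="$port" '
                $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
                END { exit !found }'
        then
            echo "$port"
        fi
    done
}

# Usage on each node (not run here):
#   netstat -nap | missing_listeners 4145 8199 8989
```

No output means every listed port has a listener. The UDP socket on 4145 is not covered by this filter.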
 
 
--Can the Primary and Secondary ping each other by remote_host (the name used for replication)?

# vxprint -Pl|grep remote_host # to find remote_host
# ping <remote_host>
 
 
--Are the ports reachable from primary -> secondary and from secondary -> primary?

# nc -zvw3 <remote_host> 4145
# nc -zvw3 <remote_host> 8199
# nc -zvw3 <remote_host> 8989
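
The three nc checks can be wrapped in a loop so the same test is easy to repeat from both sides. A minimal sketch: `check_vvr_ports` is a hypothetical helper name, and it assumes an `nc` that supports `-z` and `-w`. Note that `nc -z` as used here tests TCP only.

```shell
# check_vvr_ports (hypothetical helper): first argument is the remote host,
# remaining arguments are the TCP ports to probe.
# Returns non-zero if any port is unreachable.
check_vvr_ports() {
    host=$1
    shift
    rc=0
    for port in "$@"; do
        if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (run on each side against the other side's remote_host):
#   check_vvr_ports secondary-vvr-host 4145 8199 8989
```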
 
 

FIX THE PROBLEM

--If the VVR health checks above fail, fix those issues first.
If health is good, stop VVR on the Secondary node

# /usr/sbin/vxstart_vvr stop
 
--Wait 60 seconds or so, then stop VVR on the Primary node (same command)
 
--Start VVR on the Secondary node

# /usr/sbin/vxstart_vvr start
 
--Wait 60 seconds, then start VVR on the Primary node (same command)
 
--Check whether the status is now connected. The flags should look something like the line below, including connected and resync_started:

# vxprint -Pl |grep flags
flags:    write enabled attached inconsistent cant_sync connected asynchronous dcm_logging resync_started
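
Instead of re-running the check by hand, the flags can be polled until the rlink reports connected. This is a sketch under assumptions: `wait_for_rlink_connected` is a hypothetical helper name, the retry count and delay are arbitrary defaults, and a single rlink is assumed as in this setup.

```shell
# wait_for_rlink_connected (hypothetical helper)
# Arguments: max attempts (default 30), seconds between attempts (default 10).
wait_for_rlink_connected() {
    tries=${1:-30}
    delay=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -w matters: a plain grep for "connected" also matches "disconnected".
        if vxprint -Pl | grep '^flags:' | grep -qw 'connected'; then
            echo "rlink connected"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "rlink still not connected after $tries checks" >&2
    return 1
}

# Usage on the Primary after restarting VVR (not run here):
#   wait_for_rlink_connected 30 10
```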

# vradmin -g data_dg repstatus data_rvg|tail -5

  Data status:                inconsistent
  Replication status:         resync in progress (dcm resynchronization)
  Current mode:               asynchronous
  Logging to:                 DCM (contains  14461920  Kbytes) (SRL protection logging)
  Timestamp Information:      N/A


--You may need to resync the RVG

# vradmin -g data_dg resync  data_rvg

--If all is good, tcpdump will show data being transferred to the remote site (use the NIC that carries replication traffic)

# tcpdump -i ethX port 4145

--The link status will look something like this

# vxrlink -g data_dg status rlk_secondary-vvr-host -i 1
  VxVM VVR vxrlink INFO V-5-1-12887 DCM is in use on rlink rlk_secondary-vvr-host. DCM contains 14298240 Kbytes (1%) of the Data Volume(s).
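
To watch progress without reading the full message each time, the Kbytes figure can be pulled out of each status line. In the outputs in this post the DCM figure goes from 14461920 to 14298240 Kbytes, so a shrinking number presumably means the resync is advancing. `dcm_kbytes` is a hypothetical helper name, and the pattern assumes the message wording shown above.

```shell
# dcm_kbytes (hypothetical helper): reads vxrlink status lines on stdin and
# prints the number that follows "DCM contains" on each matching line.
dcm_kbytes() {
    sed -n 's/.*DCM contains  *\([0-9][0-9]*\) Kbytes.*/\1/p'
}

# Usage on the Primary (not run here):
#   vxrlink -g data_dg status rlk_secondary-vvr-host -i 1 | dcm_kbytes
```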

