Commands: LVM vs. VxVM

   Task                  LVM                                  VxVM
1. Scan/refresh disks    fdisk -l                             vxdctl enable
2. Initialize a disk     pvcreate /dev/sdb*                   vxdisksetup -i disk_0
3. Create a group        vgcreate oravg /dev/sdb*             vxdg init oradg disk_0 disk_1
4. Create a volume       lvcreate -L 3G -n ora_lv oravg       vxassist -g oradg make oravol 3g
5. Make a filesystem     mkfs.ext3 /dev/oravg/ora_lv          mkfs.vxfs /dev/vx/rdsk/oradg/oravol
6. Mount                 mount -t ext3 /dev/oravg/ora_lv /oravol
                         mount -t vxfs /dev/vx/dsk/oradg/oravol /oravol
LVM                   Veritas equivalent
fdisk                 vxdctl enable
pvcreate              vxdisksetup -i lun
vgcreate              vxdg -g dg_name adddisk device=lun
lvcreate              vxassist -g
mkfs                  mkfs -F vxfs
mount                 mount
/etc/fstab            /etc/fstab
Linux   = native multipathing
Solaris = native multipathing
Veritas = Veritas Dynamic Multipathing (DMP)
Veritas Cluster Server
For the end of the week, we’re going to continue with the theme of sparse-but-hopefully-useful information: quick little “crib sheets” (preceded by paragraphs and paragraphs of stilted ramblings by the lunatic who pens this blog’s content 😉 For this Friday, we’re going to come back around and take a look at Veritas Cluster Server (VCS) troubleshooting. If you’re interested in more specific examples of problems, solutions and suggestions with regards to VCS, check out all the VCS-related posts from the past year or so. Hopefully you’ll be able to find something useful in our archives, as well. These simple suggestions should work equally well for Unix as well as Linux, if you choose to go the VCS route rather than some less costly one 🙂
And, here we go again; quick, pointed bullets of info. Bite-sized bits of troubleshooting advice that focus on solving the problem, rather than understanding it. That sounds awful, I know, but, sometimes, you have to get things done and, let’s face it, if it’s the job or your arse, who cares about the why? Leave that for philosophers and academics. Plus, since you fix problems so fast, you’ll have plenty of time to read up on the ramifications of your actions later 😉
The setup: Your site is down. It’s a small cluster configuration with only two nodes and redundant nic’s, attached network disk, etc. All you know is that the problem is with VCS (although it’s probably indirectly due to a hardware issue). Something has gone wrong with VCS and it’s, obviously, not responding correctly to whatever terrible accident of nature has occurred. You don’t have much more to go on than that. The person you receive your briefing from thinks the entire clustered server setup (hardware, software, cabling, power, etc.) is a bookmark in IE 😉
Now, one by one, in a fashion that zigs on purpose, but has a tendency to zag, here are a few things to look at right off the bat when assessing a situation like this one. Perhaps next week, we’ll look into more advanced troubleshooting (and, of course, you can find lots of specific “weird VCS problem” solutions in our VCS archives)
1. Check if the cluster is working at all.
Log into one of the cluster nodes as root (or a user with equivalent privilege – who shouldn’t exist 😉 and run
host1 # hastatus -summary
or
host1 # hasum <-- both do the same thing, basically
Ex:
host1 # hastatus -summary
-- SYSTEM STATE
-- System State Frozen
A host1 RUNNING 0
A host2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService host1 Y N OFFLINE
B ClusterService host2 Y N ONLINE
B SG_NIC host1 Y N ONLINE
B SG_NIC host2 Y N OFFLINE
B SG_ONE host1 Y N ONLINE
B SG_ONE host2 Y N OFFLINE
B SG_TWO host1 Y N OFFLINE
B SG_TWO host2 Y N OFFLINE
Clearly, your situation is bad: A normal VCS status should indicate that all nodes in the cluster are “RUNNING” (which these are). However, it should also show all service groups as being ONLINE on at least one of the nodes, which isn't the case above with SG_TWO (Service Group 2).
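The "is every group ONLINE somewhere?" scan above is easy to automate. Here's a minimal sketch: a helper that reads `hastatus -summary` output and prints any service group with no ONLINE instance on any node. The sample text is the output from this post, not a live cluster, and the function name is my own invention.

```shell
# Print service groups that are not ONLINE on any node.
# Feed it the output of `hastatus -summary` on stdin.
offline_groups() {
  awk '$1 == "B" { seen[$2]; if ($NF == "ONLINE") online[$2] }
       END { for (g in seen) if (!(g in online)) print g }'
}

# Sample from the post: ClusterService is ONLINE on host2, SG_TWO nowhere.
sample='B ClusterService host1 Y N OFFLINE
B ClusterService host2 Y N ONLINE
B SG_TWO host1 Y N OFFLINE
B SG_TWO host2 Y N OFFLINE'

printf '%s\n' "$sample" | offline_groups
```

On a live node you'd pipe the real thing: `hastatus -summary | offline_groups`.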
2. Check for cluster communication problems. Here we want to determine if a service group is failing because of any heartbeat failure (The VCS cluster, that is, not another administrator 😉
Check on GAB first, by running:
host1 # gabconfig -a
Ex:
host1 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 01
Port h gen 3a1505 membership 01
This output is okay. You would know you had a problem at this point if any of the following conditions were true:
If no port “a” memberships were present (0 and 1 above), this could indicate a problem with gab or llt (looked at next)
If no port "h" memberships were present (0 and 1 above), this could indicate a problem with had.
If starting llt causes it to stop immediately, check your heartbeat cabling and llt setup.
Try starting gab, if it's down, with:
host1 # /etc/init.d/gab start
If you're running the command on a node that isn't operational, gab won't be seeded, which means you'll need to force it, like so:
host1 # /sbin/gabconfig -c -x
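The two port checks above can be scripted as a quick sanity test on `gabconfig -a` output. A hedged sketch: the helper name is mine, and the sample text stands in for the live command.

```shell
# Check whether a given GAB port (e.g. "a" for GAB, "h" for had)
# appears in `gabconfig -a` output read from stdin.
gab_port_present() {  # usage: gab_port_present <port-letter> < output
  grep -q "^Port $1 gen .* membership"
}

# Sample output from the post.
sample='GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 01
Port h gen 3a1505 membership 01'

printf '%s\n' "$sample" | gab_port_present a && echo "port a present: GAB is up"
printf '%s\n' "$sample" | gab_port_present h && echo "port h present: had is up"
```

Live, that would be `gabconfig -a | gab_port_present a || echo "check gab/llt"`.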
3. Check on LLT, now, since there may be something wrong there (even though it wasn't indicated above)
LLT will most obviously present as a crucial part of the problem if your "hastatus -summary" gives you a message that it "can't connect to the server." This will prompt you to check all cluster communication mechanisms (some of which we've already covered).
First, bang out a quick:
host1 # lltconfig
on the command line to see if llt is running at all.
If llt isn't running, be sure to check your console and system messages file (syslog, possibly messages, and any logs in /var/log/VRTSvcs/... - usually the "engine log" is worth a quick look). As a rule, I usually do
host1 # ls -tr
when I'm in the VCS log directory to see which log got written to last, and work backward from there. This puts the most recently updated file last in the listing. My assumption is that any pertinent errors got written to one of the fresher log files 🙂 Look in these logs for any messages about bad llt configurations or files, such as /etc/llttab, /etc/llthosts and /etc/VRTSvcs/conf/sysname. Also, make sure those three files contain valid entries that "match" <-- This is very important. If you refer to the same facility by 3 different names, even though they all point back to the same IP, VCS can become addled and drop the ball.
Examples of invalid entries in LLT config files would include "node numbers" outside the range of 0 to 31 and "cluster numbers" outside the range of 0 to 255.
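Those two ranges are easy to check mechanically. A minimal sketch, assuming the standard file formats (node number then name in llthosts, `set-cluster N` in llttab); the function name and demo file paths are made up for illustration.

```shell
# Validate the numeric ranges called out above:
# node numbers in llthosts must be 0-31, set-cluster in llttab 0-255.
check_llt_ranges() {  # usage: check_llt_ranges llthosts_file llttab_file
  awk '$1 !~ /^[0-9]+$/ || $1 > 31 { bad = 1 } END { exit bad }' "$1" \
    || { echo "invalid node number in $1"; return 1; }
  awk '$1 == "set-cluster" && ($2 !~ /^[0-9]+$/ || $2 > 255) { bad = 1 }
       END { exit bad }' "$2" \
    || { echo "invalid cluster number in $2"; return 1; }
  echo "llt ranges look sane"
}

# Demo copies mirroring the examples later in this post.
printf '0 host1\n1 host2\n' > /tmp/llthosts.demo
printf 'set-node host1\nset-cluster 100\n' > /tmp/llttab.demo
check_llt_ranges /tmp/llthosts.demo /tmp/llttab.demo
```

Point it at /etc/llthosts and /etc/llttab (or saved copies) on a suspect node.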
Now, if LLT "is" running, check its status, like so:
host # lltstat -nvv <-- This will let you know if llt on the separate nodes within the cluster can communicate with one another.
Of course, verify physical connections, as well. Also, see our previous post on dlpiping for more low-level-connection VCS troubleshooting tips.
Ex:
host1 # lltstat -vvn
LLT node information:
Node State Link Status Address
0 prsbn012 OPEN
ce0 DOWN
ce1 DOWN
HB172.1 UP 00:03:BA:9D:57:91
HB172.2 UP 00:03:BA:0E:F1:DE
HB173.1 UP 00:03:BA:9D:57:92
HB173.2 UP 00:03:BA:0E:D0:BE
1 prsbn015 OPEN
ce3 UP 00:03:BA:0E:CE:09
ce5 UP 00:03:BA:0E:F4:6B
HB172.1 UP 00:03:BA:9D:5C:69
HB172.2 UP 00:03:BA:0E:CE:08
HB173.1 UP 00:03:BA:0E:F4:6A
HB173.2 UP 00:03:BA:9D:5C:6A
host1 # cat /etc/llttab <-- pardon the lack of low-pri links. We had to build this cluster on the cheap 😉
set-node /etc/VRTSvcs/conf/sysname
set-cluster 100
link ce0 /dev/ce:0 - ether 0x1051 -
link ce1 /dev/ce:1 - ether 0x1052 -
exclude 7-31
host1 # cat /etc/llthosts
0 host1
1 host2
host1 # cat /etc/VRTSvcs/conf/sysname
host1
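The "match" requirement from above (sysname must agree with llthosts) can also be automated. A sketch with a hypothetical helper name; file paths are parameters so you can run it against copies.

```shell
# Check that the name in the sysname file appears as a node name
# in the llthosts file, per the "entries must match" rule above.
sysname_matches_llthosts() {  # usage: sysname_matches_llthosts sysname_file llthosts_file
  name=$(cat "$1")
  awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' "$2"
}

# Demo files mirroring the post's examples.
printf 'host1\n' > /tmp/sysname.demo
printf '0 host1\n1 host2\n' > /tmp/llthosts.demo2
sysname_matches_llthosts /tmp/sysname.demo /tmp/llthosts.demo2 \
  && echo "sysname matches llthosts"
```

On a live node: `sysname_matches_llthosts /etc/VRTSvcs/conf/sysname /etc/llthosts`.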
If llt is down, or you think it might be the problem, either start it or restart it with:
host1 # /etc/init.d/llt.rc start
or
host1 # /etc/init.d/llt.rc stop
host1 # /etc/init.d/llt.rc start
And, that's where we'll end it today. There's still a lot more to cover (we haven't even given the logs more than their minimum due), but that's for next week.
Until then, have a pleasant and relaxing weekend 🙂
Veritas Cluster Server (VCS) Command line
VCS can be divided into two important parts:
Cluster Communication:
Low Latency Transport (LLT) and Group Membership Services/Atomic Broadcast (GAB) are responsible for heartbeat and cluster communication.
LLT status
lltconfig -a list – List all the MAC addresses in cluster
lltstat -l – Lists information about each configured LLT link
lltstat [-nvv|-n] – Verify status of links in cluster
Starting and stopping LLT
lltconfig -c – Start the LLT service
lltconfig -U – stop the LLT running
GAB status
gabconfig -a – List memberships; verify if GAB is operating
gabdiskhb -l – Check the disk heartbeat status
gabdiskx -l – lists all the exclusive GAB disks and their membership information
Starting and stopping GAB
gabconfig -c -n seed_number – Start the GAB
gabconfig -U – Stop the GAB
HAD:
Stands for High Availability daemon. HAD is responsible for all the cluster functionality.
The commands for Veritas start with “ha” meaning high availability. For example, ‘hastart’, ‘hastop’, ‘hares’ etc. Listed below are commands sorted by category which are used for most day to day operation/management of VCS.
Cluster Status
hastatus -summary – Outputs the status of cluster
hasys -display – Displays the cluster operation status
Start or Stop services
hastart [-force|-stale] – ‘force’ is used to load local configuration
hasys -force 'system' – start the cluster using config file from the mentioned “system”
hastop -local [-force|-evacuate] – ‘local’ option will stop the service only on the system you type the command
hastop -sys 'system' [-force|-evacuate] – ‘sys’ stops had on the system you specify
hastop -all [-force] – ‘all’ stops had on all systems in the cluster
Change VCS Configuration online
haconf -makerw – makes VCS configuration in read/write mode
haconf -dump -makero – Dumps the configuration changes
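Every online configuration change has to be bracketed by those two commands, or the config is left writable. A dry-run sketch that just prints the full sequence around whatever ha* command you pass, for review before pasting it in; the function name and the `my_res` resource are placeholders.

```shell
# Print (don't run) a complete makerw / change / dump-makero sequence.
plan_config_change() {  # usage: plan_config_change <ha-command...>
  echo "haconf -makerw"
  echo "$*"
  echo "haconf -dump -makero"
}

plan_config_change hares -modify my_res Critical 0
```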
Agent Operations
haagent -start agent_name -sys system – Starts an agent
haagent -stop agent_name -sys system – Stops an agent
Cluster Operations
haclus -display – Displays cluster information and status
haclus -enable LinkMonitoring – Enables heartbeat link monitoring in the GUI
haclus -disable LinkMonitoring – Disables heartbeat link monitoring in the GUI
Add and Delete Users
hauser -add user_name – Adds a user with read/write access
hauser -add VCSGuest – Adds a user with read-only access
hauser -modify user_name – Modifies a users password
hauser -delete user_name – Deletes a user
hauser -display [user_name] – Displays all users if username is not specified
System Operations
hasys -list – List systems in the cluster
hasys -display – Get detailed information about each system
hasys -add system – Add a system to cluster
hasys -delete system – Delete a system from cluster
Resource Types
hatype -list – List resource types
hatype -display [type_name] – Get detailed information about a resource type
hatype -resources type_name – List all resources of a particular type
hatype -add resource_type – Add a resource type
hatype -modify .... – Set the value of static attributes
hatype -delete resource_type – Delete a resource type
Resource Operations
hares -list – List all resources
hares -dep [resource] – List a resource’s dependencies
hares -display [resource] – Get detailed information about a resource
hares -add resource_type service_group – Add a resource
hares -modify resource attribute_name value – Modify the attributes of the new resource
hares -delete resource – Delete a resource
hares -online resource -sys systemname – Bring a resource online on the given system
hares -offline resource -sys systemname – Take a resource offline on the given system
hares -probe resource -sys system – Cause a resource’s agent to immediately monitor the resource on a particular system
hares -clear resource [-sys system] – Clear a faulted resource
hares -local resource attribute_name value – Make a resource’s attribute value local
hares -global resource attribute_name value – Make a resource’s attribute value global
hares -link parent_res child_res – Specify a dependency between two resources
hares -unlink parent_res child_res – Remove the dependency relationship between two resources
Service Group Operations
hagrp -list – List all service groups
hagrp -resources [service_group] – List a service group’s resources
hagrp -dep [service_group] – List a service group’s dependencies
hagrp -display [service_group] – Get detailed information about a service group
hagrp -online groupname -sys systemname – Start a service group and bring its resources online
hagrp -offline groupname -sys systemname – Stop a service group and take its resources offline
hagrp -switch groupname -to systemname – Switch a service group from one system to another
hagrp -freeze service_group [-persistent] – Gets into maintenance mode. Freeze a service group. This will disable online and offline operations
hagrp -unfreeze service_group [-persistent] – Take the service group out of maintenance mode
hagrp -enable service_group [-sys system] – Enable a service group
hagrp -disable service_group [-sys system] – Disable a service group
hagrp -enableresources service_group – Enable all the resources in a service group
hagrp -disableresources service_group – Disable all the resources in a service group
hagrp -link parent_group child_group relationship – Specify the dependency relationship between two service groups
hagrp -unlink parent_group child_group – Remove the dependency between two service groups
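The freeze/unfreeze pair above is the usual maintenance-mode bracket. A dry-run helper that prints the sequence for a given group instead of running it, so it can be reviewed first; the function name and `SG_ONE` are placeholders from this post's examples.

```shell
# Print (don't run) the maintenance-mode bracket for a service group.
plan_freeze_maintenance() {  # usage: plan_freeze_maintenance service_group
  echo "hagrp -freeze $1 -persistent"
  echo "# ...perform maintenance on $1..."
  echo "hagrp -unfreeze $1 -persistent"
}

plan_freeze_maintenance SG_ONE
```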
VCS Startup Process
Please verify that the cables are set up for the heartbeat network. You can tcpdump from one server NIC’s MAC address to another to verify connectivity.
Step 1:
LLT (Low Latency Transport) starts up first, brought up with the “lltconfig -c” command. It reads the /etc/llttab and /etc/llthosts files and establishes the heartbeat network. The heartbeat network is a private network where VCS status information is exchanged by all systems within a VCS cluster. These networks require each system in the cluster to have a dedicated NIC, connected to a private hub. VCS requires a minimum of two dedicated communication channels between each system in a cluster. LLT is a low-overhead networking protocol that runs in the kernel. Because it runs in the kernel, it is capable of handling kernel-to-kernel communications.
Examples of files are as below:
#cat /etc/llthosts
0 node0
1 node1
.
.
n nodeN
In the example below, Linux systems will have interface names such as “eth0”/“eth1”. If using a different device, replace “ce” with “qfe0”/“qfe1”, etc.
#cat /etc/llttab
set-node
set-cluster
link ce2 /dev/ce:2 - ether - -
link ce3 /dev/ce:3 - ether - -
link-lowpri ce4 /dev/ce:4 - ether - -
start
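Since VCS wants at least two dedicated links (as noted above), counting the link lines in llttab is a cheap pre-flight check. A sketch; the helper name and demo file are made up, and the demo mirrors the example llttab above.

```shell
# Count link/link-lowpri entries in a given llttab file.
count_llt_links() {
  awk '$1 == "link" || $1 == "link-lowpri" { n++ } END { print n + 0 }' "$1"
}

# Demo file mirroring the example above.
cat > /tmp/llttab.links.demo <<'EOF'
set-node host1
set-cluster 100
link ce2 /dev/ce:2 - ether - -
link ce3 /dev/ce:3 - ether - -
link-lowpri ce4 /dev/ce:4 - ether - -
EOF

links=$(count_llt_links /tmp/llttab.links.demo)
[ "$links" -ge 2 ] && echo "$links links configured: OK"
```

Run it against /etc/llttab on each node; fewer than 2 means you're one cable failure away from trouble.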
Verification of startup can be done using the “lltstat -n” command. “*” marks the local node (the one on which the command was run)
#lltstat -n
Node State Links
* 0
1
2
.
.
n
Step 2:
GAB (Group Membership Services/Atomic Broadcast) starts next. It executes /etc/gabtab and checks for other GABs to establish cluster membership. GAB runs over Low Latency Transport (LLT) and uses broadcasts to distribute cluster configuration information and ensure that each system has a synchronized view of the cluster, including the state of each system, service group, and resource.
# cat /etc/gabtab
/sbin/gabconfig -c -n5 # for a 5 node cluster
GAB can be started using “gabconfig -c” and verified by using “gabconfig -a”. Below is example output for a 5-node cluster. Port ‘a’ runs the GAB service and port ‘h’ runs the VCS daemon (HAD)
# gabconfig -a
GAB Port Memberships
=========================================
Port a gen 11ff05 membership 012345
Port h gen 11ff09 membership 012345
Step 3:
After both LLT and GAB are loaded, hashadow starts, which in turn loads HAD (High Availability Daemon). HAD reads /etc/VRTSvcs/conf/config/main.cf, types.cf, and all included .cf files mentioned in the main.cf file.
HAD checks whether other HADs are available and registers with GAB. If there are no other HADs, it loads main.cf into HAD memory. The same process happens when HAD starts on the other nodes: the HAD on the first node loads main.cf and the other .cf files from the local system, and all other HADs load their configuration from the first HAD.
After starting up, HAD knows all the service groups and resources from main.cf. It calls the respective agents to check whether the resources are currently online or offline and, based on main.cf, onlines or offlines the service groups on the respective nodes.
Cluster is started up with “hastart” command. The status can be verified using “hastatus -sum”
VCS Logfile: /var/
Setup SAN disk for use in a Linux Veritas cluster
For this particular exercise we’re going to go through the entire process of provisioning disk for use in a VCS cluster.
We will use EMC Symmetrix disk zoned and masked to a RHEL 4u6 host as the foundation.
Get the disk(s) presented to the host, observing that each is visible down multiple paths.
# inq -showvol
Inquiry utility, Version V7.3-771 (Rev 0.0) (SIL Version V6.3.0.0 (Edit Level 771)
Copyright (C) by EMC Corporation, all rights reserved.
For help type inq -h.
-----------------------------------------------------------------------------
DEVICE :VEND :PROD :REV :SER NUM :Volume :CAP(kb)
-----------------------------------------------------------------------------
/dev/sda :EMC :SYMMETRIX :5771 :0123456789 : 00617: 2880
/dev/sdb :EMC :SYMMETRIX :5771 :0123456789 : 00204: 35654400
/dev/sdc :EMC :SYMMETRIX :5771 :0123456789 : 00206: 35654400
/dev/sdd :EMC :SYMMETRIX :5771 :0123456789 : 00208: 35654400
/dev/sde :EMC :SYMMETRIX :5771 :0123456789 : 0020A: 35654400
/dev/sdf :EMC :SYMMETRIX :5771 :0123456789 : 0020C: 35654400
/dev/sdg :EMC :SYMMETRIX :5771 :0123456789 : 0020E: 35654400
/dev/sdh :EMC :SYMMETRIX :5771 :0123456789 : 00210: 35654400
/dev/sdi :EMC :SYMMETRIX :5771 :0123456789 : 00212: 35654400
/dev/sdj :EMC :SYMMETRIX :5771 :0123456789 : 00214: 35654400
/dev/sdk :EMC :SYMMETRIX :5771 :0123456789 : 00263: 35654400
/dev/sdl :EMC :SYMMETRIX :5771 :0123456789 : 00265: 35654400
/dev/sdm :EMC :SYMMETRIX :5771 :0123456789 : 00267: 35654400
/dev/sdn :EMC :SYMMETRIX :5771 :0123456789 : 00269: 35654400
/dev/sdo :EMC :SYMMETRIX :5771 :0123456789 : 0026B: 35654400
/dev/sdp :EMC :SYMMETRIX :5771 :0123456789 : 00617: 2880
/dev/sdq :EMC :SYMMETRIX :5771 :0123456789 : 00204: 35654400
/dev/sdr :EMC :SYMMETRIX :5771 :0123456789 : 00206: 35654400
/dev/sds :EMC :SYMMETRIX :5771 :0123456789 : 00208: 35654400
/dev/sdt :EMC :SYMMETRIX :5771 :0123456789 : 0020A: 35654400
/dev/sdu :EMC :SYMMETRIX :5771 :0123456789 : 0020C: 35654400
/dev/sdv :EMC :SYMMETRIX :5771 :0123456789 : 0020E: 35654400
/dev/sdw :EMC :SYMMETRIX :5771 :0123456789 : 00210: 35654400
/dev/sdx :EMC :SYMMETRIX :5771 :0123456789 : 00212: 35654400
/dev/sdy :EMC :SYMMETRIX :5771 :0123456789 : 00214: 35654400
/dev/sdz :EMC :SYMMETRIX :5771 :0123456789 : 00263: 35654400
/dev/sdaa :EMC :SYMMETRIX :5771 :0123456789 : 00265: 35654400
/dev/sdab :EMC :SYMMETRIX :5771 :0123456789 : 00267: 35654400
/dev/sdac :EMC :SYMMETRIX :5771 :0123456789 : 00269: 35654400
/dev/sdad :EMC :SYMMETRIX :5771 :0123456789 : 0026B: 35654400
See what disks Veritas can see.
vxdisk -o alldgs list
Initialize the disk for the first time. This needs to be repeated for each individual disk.
/etc/vx/bin/vxdisksetup -i DEVICE format=cdsdisk
See if the initialization worked correctly.
# vxdisk -o alldgs list
DEVICE TYPE DISK GROUP STATUS
EMC0_0 auto:cdsdisk - (dg_grp) online
EMC0_1 auto:cdsdisk - (dg_grp) online
EMC0_2 auto:cdsdisk - (dg_grp) online
EMC0_3 auto:cdsdisk - (dg_grp) online
EMC0_4 auto:cdsdisk - (dg_grp) online
EMC0_5 auto:cdsdisk - (dg_grp) online
EMC0_6 auto:cdsdisk - (dg_grp) online
EMC0_7 auto:cdsdisk - (dg_grp) online
EMC0_8 auto:cdsdisk - (dg_grp) online
EMC0_9 auto:cdsdisk - (dg_grp) online
EMC0_10 auto:cdsdisk - (dg_grp) online
EMC0_11 auto:cdsdisk - (dg_grp) online
EMC0_12 auto:cdsdisk - (dg_grp) online
EMC0_13 auto:cdsdisk - (dg_grp) online
cciss/c0d0 auto:none - - online invalid
All device(s) (e.g. EMC0_n) now show as online.
Create the disk group.
vxdg init dg_name dg_internal_name01=DEVICE
The dg_name is the name of your disk group while dg_internal_name01 is the name of the first disk. In our case dg_internal_name01=EMC0_0.
Add any additional disk to the disk group.
vxdg -g dg_name adddisk dg_internal_name02=EMC0_n+1
Note that EMC0_n+1 is the next free disk that you are attempting to add. So dg_internal_name02=EMC0_1 (remember we started with EMC0_0).
To create the volume:
vxassist -g dg_name make lv_name [size] dg_internal_nameN
Repeat as necessary.
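The init-then-add loop above is repetitive across 14 disks, so here's a dry-run sketch that prints the commands for a list of devices instead of executing them; review the output, then paste it in. The disk group and device names follow this post's placeholders.

```shell
# Print (don't run) the vxdisksetup/vxdg commands for a set of devices:
# first device initializes the group, the rest are added to it.
plan_dg() {  # usage: plan_dg dg_name dev1 dev2 ...
  dg=$1; shift; i=0
  for dev in "$@"; do
    i=$((i + 1))
    name=$(printf 'dg_internal_name%02d' "$i")
    echo "/etc/vx/bin/vxdisksetup -i $dev format=cdsdisk"
    if [ "$i" -eq 1 ]; then
      echo "vxdg init $dg $name=$dev"
    else
      echo "vxdg -g $dg adddisk $name=$dev"
    fi
  done
}

plan_dg dg_name EMC0_0 EMC0_1
```

Swap in the real group name and the full `EMC0_0 ... EMC0_13` list when you're satisfied with the plan.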
Finally, create the file system:
mkfs -t vxfs /dev/vx/rdsk/dg_name/lv_name
At this point the volumes are now available to be defined as a mount resource in VCS.