This article will demonstrate setting up a simple RHCS (Red Hat Cluster Suite) two-node cluster, with an end goal of having a 50GB LUN shared between two servers, thus providing clustered shared storage to both nodes. This will enable applications running on the nodes to write to a shared filesystem, perform correct locking, and ensure filesystem integrity.
This type of configuration is central to many active-active application setups, where both nodes share a central content or configuration repository.
For this article, two RHEL 6.1 nodes running on physical hardware (IBM blades) were used. Each node has multiple paths back to the 50GB SAN LUN presented, and multipathd will be used to manage path failover and reinstatement in the event of interruption.
Validating Hardware
Prior to building our cluster, it is imperative that the appropriate kernel module(s) have been loaded. Using QLogic 2xxx HBAs, running lsmod should yield something like:
    # lsmod | grep ql
    qla2xxx               365773  0
    scsi_transport_fc      52002  1 qla2xxx
Each of the physical servers has two HBAs installed. Whilst most HBA manufacturers offer software to check the status of the HBAs (for example, QLogic offer SANSurfer), I prefer to check the output of the dmesg command, or /var/log/dmesg, for appropriate detection messages. The correct detection of two QLogic HBAs by the OS should look something like the following:
    # grep -i ql /var/log/dmesg
    QLogic Fibre Channel HBA Driver: 8.03.07.03.06.1-k
    qla2xxx 0000:24:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32
    qla2xxx 0000:24:00.0: Found an ISP2532, irq 32, iobase 0xffffc90013dd0000
    qla2xxx 0000:24:00.0: irq 48 for MSI/MSI-X
    qla2xxx 0000:24:00.0: irq 49 for MSI/MSI-X
    qla2xxx 0000:24:00.0: Configuring PCI space...
    qla2xxx 0000:24:00.0: setting latency timer to 64
    qla2xxx 0000:24:00.0: qla2xxx 0000:24:00.0: qla2xxx 0000:24:00.0: qla2xxx 0000:24:00.0: qla2xxx 0000:24:00.0: Allocated (64 KB) for FCE...
    qla2xxx 0000:24:00.0: Allocated (64 KB) for EFT...
    qla2xxx 0000:24:00.0: Allocated (1350 KB) for firmware dump...
    scsi1 : qla2xxx Configure NVRAM parameters... Verifying loaded RISC code... firmware: requesting ql2500_fw.bin FW: Loading via request-firmware...
    qla2xxx 0000:24:00.0: QLogic Fibre Channel HBA Driver: 8.03.07.03.06.1-k
    QLogic QMI2572 - QLogic 4Gb Fibre Channel Expansion Card (CIOv) for IBM BladeCenter
    qla2xxx 0000:24:00.1: PCI INT B -> GSI 42 (level, low) -> IRQ 42
    qla2xxx 0000:24:00.1: Found an ISP2532, irq 42, iobase 0xffffc90013da2000
    qla2xxx 0000:24:00.1: irq 50 for MSI/MSI-X
    qla2xxx 0000:24:00.1: irq 51 for MSI/MSI-X
    qla2xxx 0000:24:00.1: Configuring PCI space...
    qla2xxx 0000:24:00.1: setting latency timer to 64
    qla2xxx 0000:24:00.1: Configure NVRAM parameters...
    qla2xxx 0000:24:00.1: Verifying loaded RISC code...
    qla2xxx 0000:24:00.1: FW: Loading via request-firmware...
    qla2xxx 0000:24:00.1: Allocated (64 KB) for FCE...
    qla2xxx 0000:24:00.1: Allocated (64 KB) for EFT...
    qla2xxx 0000:24:00.1: Allocated (1350 KB) for firmware dump...
    scsi2 : qla2xxx
    qla2xxx 0000:24:00.1: QLogic Fibre Channel HBA Driver: 8.03.07.03.06.1-k
    QLogic QMI2572 - QLogic 4Gb Fibre Channel Expansion Card (CIOv) for IBM BladeCenter
    qla2xxx 0000:24:00.0: LOOP UP detected (4 Gbps).
Once you are happy that the Operating System has successfully detected the HBAs and loaded the appropriate kernel modules, you can proceed. If the HBAs were installed after Operating System installation, you should ensure that you follow the steps provided with your HBA documentation to have them made available to the Operating System. Most common HBAs already have appropriate modules bundled with the OS, so it may just be a case of enabling/configuring them via a file under /etc/modprobe.d/ (RHEL 6 no longer uses /etc/modprobe.conf).
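If you want to script the module check rather than eyeball it, something along these lines should do. This is only a sketch: module_loaded is a helper name I have made up, and it simply looks for an exact module name in the first column of lsmod output.

```shell
# Succeeds if the kernel module named in $1 appears in `lsmod` output read
# from stdin. module_loaded is a hypothetical helper, not part of any package.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit(found ? 0 : 1) }'
}

# Usage on a live system:
#   lsmod | module_loaded qla2xxx && echo "qla2xxx is loaded"
```

Matching on the whole first column avoids false positives from modules that merely list qla2xxx as a dependent in the last column.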
Multipath Configuration
The next step is to configure multipathd. We use multipathd to manage multipath I/O between each node and its storage. The actual multipathd configuration will vary depending on which SAN or other storage technology is being used, and thus should be configured according to your storage array documentation. Our servers connect back to an IBM SAN Volume Controller (product 2145), which leads to a multipath configuration as follows:
    # vi /etc/multipath.conf
    defaults {
        polling_interval    30
        failback            immediate
        no_path_retry       5
        rr_min_io           100
        path_checker        tur
        user_friendly_names yes
    }
    devices {
        device {
            vendor                "IBM"
            product               "2145"
            path_grouping_policy  group_by_prio
        }
        device {
            vendor                "IBM"
            product               "1750500"
            path_grouping_policy  group_by_prio
        }
        device {
            vendor                "IBM"
            product               "2107900"
            path_grouping_policy  group_by_serial
        }
        device {
            vendor                "IBM"
            product               "2105800"
            path_grouping_policy  group_by_serial
        }
    }
Once configured, start multipathd:
    # service multipathd start
Once started, verify that all storage paths are available with the multipath -ll command:
    # multipath -ll
    mpathb (360050768018e82bd3800000000000297) dm-8 IBM,2145
    size=50G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='round-robin 0' prio=50 status=active
    | |- 1:0:0:0 sdb 8:16  active ready running
    | |- 1:0:2:0 sdd 8:48  active ready running
    | |- 2:0:0:0 sdf 8:80  active ready running
    | `- 2:0:2:0 sdh 8:112 active ready running
    `-+- policy='round-robin 0' prio=10 status=enabled
      |- 1:0:1:0 sdc 8:32  active ready running
      |- 1:0:3:0 sde 8:64  active ready running
      |- 2:0:1:0 sdg 8:96  active ready running
      `- 2:0:3:0 sdi 8:128 active ready running
Excellent - all paths are online. You can obtain a similar list (and access to far more functionality and configuration commands) from the multipathd -k interactive command prompt, using the list paths subcommand.
    multipathd> list paths
    hcil    dev dev_t pri dm_st  chk_st dev_st  next_check
    0:1:1:0 sda 8:0   1   undef  ready  running orphan
    1:0:0:0 sdb 8:16  50  active ready  running XXXXXXXX.. 101/120
    1:0:1:0 sdc 8:32  10  active ready  running XXXXXXXX.. 101/120
    1:0:2:0 sdd 8:48  50  active ready  running XXXXXXXX.. 101/120
    1:0:3:0 sde 8:64  10  active ready  running XXXXXXXX.. 101/120
    2:0:0:0 sdf 8:80  50  active ready  running XXXXXXXX.. 101/120
    2:0:1:0 sdg 8:96  10  active ready  running XXXXXXXX.. 101/120
    2:0:2:0 sdh 8:112 50  active ready  running XXXXXXXX.. 101/120
    2:0:3:0 sdi 8:128 10  active ready  running XXXXXXXX.. 101/120
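For a quick pass/fail check on both nodes, the multipath -ll output can be reduced to a single ready/total figure. The following is a rough sketch that assumes the output layout shown earlier in this section; path_summary is a name of my own choosing.

```shell
# Reads `multipath -ll` output on stdin and prints "<ready>/<total>" paths.
# A path line is recognised by its tree marker ("|-" or "`-") followed by an
# H:C:T:L address; path-group header lines ("|-+- policy=...") are skipped.
# path_summary is a hypothetical helper name.
path_summary() {
    awk '
        /[|`]- [0-9]+:[0-9]+:[0-9]+:[0-9]+ / {
            total++
            if (/active ready running/) ready++
        }
        END { printf "%d/%d\n", ready, total }
    '
}

# Usage on a live system:
#   multipath -ll | path_summary
```

Against the multipath -ll output above this would report 8/8; anything less than the full count is worth investigating before continuing.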
Once you have confirmed storage availability and that all paths are active from both nodes, you can configure multipathd to start automatically on system boot:
    # chkconfig --add multipathd
    # chkconfig --level 2345 multipathd on
    # chkconfig --list multipathd
LUN Partitioning
We will create a single partition on the LUN of type 8e (Linux LVM) - this will house a Clustered Logical Volume Manager (CLVM) physical volume. Perform this step from a single node only, substituting the appropriate device path in place of mpathb if needed:
    # sfdisk /dev/mapper/mpathb <<_EOF_
    0,,8e
    _EOF_
Ensure that the appropriate device nodes/links have been created:
    # ls -l /dev/mapper/mpathb*
    lrwxrwxrwx 1 root root      7 Aug 21 11:52 /dev/mapper/mpathb -> ../dm-8
    brw-rw---- 1 root disk 253, 9 Aug 21 15:27 /dev/mapper/mpathbp1
On the second node, run partprobe, and check that the new partition is detected, and the appropriate device nodes/links have been created:
    # partprobe
    # ls -l /dev/mapper/mpathd*
    lrwxrwxrwx 1 root root 7 Aug 21 15:41 /dev/mapper/mpathd -> ../dm-0
    lrwxrwxrwx 1 root root 7 Aug 21 15:41 /dev/mapper/mpathdp1 -> ../dm-1
As you can see, the devices have been created with different device names on each node - depending on the current udev state and configuration on the server in question.
We can, however, define aliases within /etc/multipath.conf to assign fixed, consistent device names to the multipathed device(s) on every node.
Find the WWID for your LUN as follows:
    # multipath -l | sed -n 's/^.*(\([0-9a-f]*\)).*$/\1/p'
Use the WWID you’ve gleaned to define the alias in /etc/multipath.conf as follows on all cluster nodes:
    multipaths {
        multipath {
            wwid  "360050768018e82bd3800000000000297"
            alias mpathb
        }
    }
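With more than a couple of nodes, typing the stanza by hand gets tedious. A small sketch of how it could be generated follows; list_wwids wraps the same sed expression used above, and both function names are my own invention. It only prints the stanza, and merging it into an existing multipaths section is left to you.

```shell
# Extracts WWIDs (the hex string in parentheses) from `multipath -l` output
# on stdin, using the same sed expression as shown above.
list_wwids() {
    sed -n 's/^.*(\([0-9a-f]*\)).*$/\1/p'
}

# Prints a multipaths stanza binding the WWID in $1 to the alias in $2.
# Both helper names are hypothetical.
alias_stanza() {
    printf 'multipaths {\n    multipath {\n        wwid  "%s"\n        alias %s\n    }\n}\n' "$1" "$2"
}

# Usage on a live system:
#   multipath -l | list_wwids
#   alias_stanza 360050768018e82bd3800000000000297 mpathb
```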
Force a multipath devmap reload:
    # multipath -r
All nodes will now have the same device nodes and links created under /dev/mapper/mpathb* for the base LUN and the partition created earlier.
Further Preparation
Before proceeding with the cluster configuration, several other prerequisite tasks must be performed. It is imperative that date and time are synchronised across the cluster for correct operation. First, check that NTP is peering correctly, and that the date/time are correct:
    # ntpq -p
    # date
If this returns an error, or the date and time are not correct, configure NTP appropriately.
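When reading the ntpq -p output, the line to look for is the one prefixed with an asterisk, which marks the peer currently selected for synchronisation. A minimal sketch of that check, with ntp_synced being a helper name I have made up:

```shell
# Succeeds if `ntpq -p` output on stdin contains a peer line starting with
# "*", the tally code for the currently selected synchronisation source.
# ntp_synced is a hypothetical helper name.
ntp_synced() {
    grep -q '^\*'
}

# Usage on a live system:
#   ntpq -p | ntp_synced || echo "no synchronisation peer selected yet"
```

Bear in mind that ntpd can take several minutes after startup to select a peer, so a failure immediately after starting the service is not necessarily a misconfiguration.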
If the date/time is correct and synchronisation is occurring correctly, sync the time back to the hardware clock:
    # hwclock --systohc
As an extra precaution against DNS failure, add entries for both nodes to each node’s /etc/hosts file:
    # vi /etc/hosts
    192.168.0.1 node1
    192.168.0.2 node2
These two safeguards will help to ensure that the cluster operates smoothly.
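To confirm that both names really are present on each node, a check along these lines may help. It is a sketch only: host_in_hosts_file is a made-up helper, and it matches names and aliases but not IP addresses.

```shell
# Succeeds if the hostname in $1 appears as a name or alias on a non-comment
# line of the hosts file in $2 (defaults to /etc/hosts).
# host_in_hosts_file is a hypothetical helper name.
host_in_hosts_file() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }               # skip whole-line comments
        {
            sub(/#.*/, "")                      # drop trailing comments
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}

# Usage on a live system:
#   for n in node1 node2; do
#       host_in_hosts_file "$n" || echo "$n is missing from /etc/hosts"
#   done
```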
Software Installation
With RHEL, ensure that you have correctly registered your system either via rhn_register, or rhnreg_ks if you prefer keeping things on the command line. With CentOS, this step is not required. If using RHEL, you’ll need to log into Red Hat Network, and apply your Resilient Storage entitlements to both nodes at this time. If you haven’t purchased Resilient Storage entitlements, this step will obviously fail - go and spend your dollars before returning to this article.
The following packages, and their dependencies, are required:
- gfs2-utils: utilities for managing the global filesystem (GFS2)
- lvm2-cluster: cluster extensions for the userland logical volume management tools
- openais: the OpenAIS standards-based cluster framework executive and APIs
- cman: Red Hat Cluster Manager
- modcluster: Red Hat Cluster Suite remote management
- rgmanager: open source HA resource group failover for Red Hat Cluster
Install them via yum as follows:
    # yum install -y gfs2-utils lvm2-cluster cman \
    > modcluster rgmanager openais
Verify that all packages have been correctly installed:
    # rpm -q gfs2-utils lvm2-cluster cman modcluster rgmanager openais
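rpm -q reports each absent package with a line of the form "package NAME is not installed". If you would rather see only the gaps, a filter like the following sketch (missing_packages is my own name for it) reduces the output to just the missing names.

```shell
# Reads `rpm -q` output on stdin and prints the name of each package that
# rpm reports as "package <name> is not installed".
# missing_packages is a hypothetical helper name.
missing_packages() {
    sed -n 's/^package \(.*\) is not installed$/\1/p'
}

# Usage on a live system:
#   rpm -q gfs2-utils lvm2-cluster cman modcluster rgmanager openais \
#       | missing_packages
```

No output means every package in the list was found.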
Now that the cluster framework and all supporting packages are installed, we can proceed to cluster configuration.
Cluster Configuration
The values supplied here will vary depending upon your site configuration. I find the easiest method to configure the cluster is to modify the cluster configuration file (/etc/cluster/cluster.conf) directly. There are tools (command line and GUI) available to create and edit this file, however I find a quick bit of vi-hackery the easiest way to get this job done.
Create /etc/cluster/cluster.conf on each node with the following contents (of course, substituting appropriate values depending on your configuration):
    # vi /etc/cluster/cluster.conf
    <?xml version="1.0"?>
    <cluster config_version="3" name="test-gfs-cluster">
      <fence_daemon post_fail_delay="0" post_join_delay="3"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1">
          <fence>
            <method name="scsi">
              <device name="scsi_1" key="1"/>
            </method>
          </fence>
          <unfence>
            <device name="scsi_1" key="1" action="on"/>
          </unfence>
        </clusternode>
        <clusternode name="node2" nodeid="2" votes="1">
          <fence>
            <method name="scsi">
              <device name="scsi_1" key="2"/>
            </method>
          </fence>
          <unfence>
            <device name="scsi_1" key="2" action="on"/>
          </unfence>
        </clusternode>
      </clusternodes>
      <cman expected_votes="1" two_node="1" broadcast="yes">
      </cman>
      <fencedevices>
        <fencedevice agent="fence_scsi" name="scsi_1" devices="/dev/mapper/mpathbp1"/>
      </fencedevices>
      <rm>
        <failoverdomains/>
        <resources/>
      </rm>
    </cluster>
A few of these configuration directives are worth looking at further:
- Ensure that both nodes have unique nodeid values. Failure to do so will result in a split-brain cluster, essentially running two single-node clusters.
- Use the same hostnames for each clusternode as defined in /etc/hosts and DNS.
- Provide a partition to the devices element for the fence_scsi agent, instead of using the base device. In our case, this is /dev/mapper/mpathbp1.
- For ease of configuration (and as this is the only cluster on this subnet), broadcast is used for cluster heartbeat.
- This is a two-node cluster, so a few workarounds are required for correct quorum operation: set the two_node flag to 1 (enabling it), and set expected_votes to 1. This configuration circumvents the need for qdiskd, or a third node, for correct quorum establishment.
Ensure that the fence and unfence methods are correct, or the cluster will fail to fence correctly, and again nodes will not form quorum or join/leave the cluster correctly.
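Two of the points above, unique nodeid values and the two-node quorum flags, are easy to get wrong when copying the file between nodes, so they can be worth checking mechanically. The sketch below uses plain text matching rather than a real XML parser, and assumes the cman attributes sit on one line as in the file above; both function names are made up for illustration.

```shell
# Prints any nodeid value that occurs more than once in the cluster.conf
# file given as $1. No output means all nodeids are unique.
# duplicate_nodeids is a hypothetical helper name.
duplicate_nodeids() {
    grep -o 'nodeid="[0-9]*"' "$1" | tr -dc '0-9\n' | sort | uniq -d
}

# Succeeds if the <cman> element in $1 sets both two_node="1" and
# expected_votes="1" (assumes the attributes are on a single line).
# two_node_quorum_ok is a hypothetical helper name.
two_node_quorum_ok() {
    awk '/<cman / && /two_node="1"/ && /expected_votes="1"/ { ok = 1 }
         END { exit(ok ? 0 : 1) }' "$1"
}

# Usage on a live system:
#   duplicate_nodeids /etc/cluster/cluster.conf
#   two_node_quorum_ok /etc/cluster/cluster.conf || echo "check <cman> flags"
```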
If, during testing, a node does not fence correctly, you can manually acknowledge the failed fencing operation with fence_ack_manual. This will allow a two node cluster to form with a single node from cold startup if the second node is in an inconsistent or failed state. You can check for fencing (and cluster in general) log messages in /var/log/messages:
    Jan 22 22:44:26 node1 fenced[1433]: fencing node node2
    Jan 22 22:44:26 node1 fenced[1433]: fence node2 success
Validate the configuration before enabling it with ccs_config_validate:
    # ccs_config_validate
If ccs_config_validate doesn’t return errors, the configuration file is well-formed and valid against the cluster schema, and the cluster is ready for its initial startup.
Cluster Startup
First, start the cluster manager, cman. This is the core cluster service, and will spawn various low-level required daemons (fenced, the fencing daemon; corosync, the core cluster engine; etc.). All steps should, unless otherwise noted, be performed on both nodes. Start cman via the service command:
    # service cman start
Next, start rgmanager. No actual resources or resource groups are required for GFS2, but this daemon is included and started for completeness of the RHCS stack:
    # service rgmanager start
Enable the correct LVM locking_type for clustering. This updates the value of locking_type in /etc/lvm/lvm.conf from its default value of 1 (local file-based locking) to 3 (built-in clustered locking).
    # lvmconf --enable-cluster
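It is worth confirming that the change actually landed before starting clvmd. A small sketch that reads the active setting back out of lvm.conf follows; lvm_locking_type is a helper name of my own, and it ignores commented-out lines.

```shell
# Prints the active (uncommented) locking_type value from the lvm.conf file
# given as $1 (defaults to /etc/lvm/lvm.conf).
# lvm_locking_type is a hypothetical helper name.
lvm_locking_type() {
    sed -n 's/^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "${1:-/etc/lvm/lvm.conf}"
}

# Usage on a live system:
#   [ "$(lvm_locking_type)" = 3 ] || echo "clustered locking is not enabled"
```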
Start the clustered LVM daemon:
    # service clvmd start
If no errors are experienced at this point, the core cluster is ready for use. Enable all cluster services to start automatically on system boot:
    # for service in cman rgmanager clvmd; do
    > chkconfig ${service} on
    > chkconfig --list ${service}
    > done
LVM Configuration
In order to provide a logical volume for the creation of our GFS2 filesystem, we must first create a new LVM physical volume on our shared storage (/dev/mapper/mpathbp1). Do this with the pvcreate command. These steps must be performed from a single node:
    # pvcreate /dev/mapper/mpathbp1
Create a new volume group, vg_shared, and ensure that you specify -c y to mark the volume group as clustered:
    # vgcreate -c y vg_shared /dev/mapper/mpathbp1
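Next, create an appropriately sized logical volume for your GFS2 filesystem. Our partition is nominally 50GB in size; if LVM reports insufficient free extents (the partition table and LVM metadata consume a little space), size the volume with -l 100%FREE instead of -L 50G: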
    # lvcreate -L 50G -n lv_gfs_01 vg_shared
On the other node, ensure that the logical volume is available:
    # lvdisplay /dev/vg_shared/lv_gfs_01
If there are any issues, rescan the various LVM components:
    # pvscan && vgscan && lvscan
If, during lvscan, the new volume is listed as inactive, run the following command to activate it:
    # lvchange -a y /dev/vg_shared/lv_gfs_01
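If there are several logical volumes on the node, picking the inactive ones out of lvscan output by eye is error-prone. The sketch below assumes the usual lvscan layout, with the state in the first column and the quoted device path in the second; inactive_lvs is a name I have made up.

```shell
# Reads `lvscan` output on stdin and prints the device path of each logical
# volume whose state column is "inactive" (surrounding quotes removed).
# inactive_lvs is a hypothetical helper name.
inactive_lvs() {
    awk '$1 == "inactive" { print $2 }' | tr -d "'"
}

# Usage on a live system:
#   lvscan | inactive_lvs | while read -r lv; do lvchange -a y "$lv"; done
```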
The logical volume is now ready to receive the GFS2 filesystem.
GFS Configuration
Use the mkfs.gfs2 command (invoked here through the mkfs -t gfs2 wrapper) to create the GFS2 filesystem. Ensure that lock_dlm is used for the locking protocol, and the first part of the LockTableName (specified with -t <clustername>:<fsname>) matches the cluster name defined in /etc/cluster/cluster.conf. Again, run this from one node only:
    # mkfs -t gfs2 -p lock_dlm -t test-gfs-cluster:lv_gfs_01 \
    > -j 4 /dev/vg_shared/lv_gfs_01
I created 4 journals which will allow 4 nodes to mount the filesystem. Additional journals can be added at a later date with the gfs2_jadd command should more nodes be required. Adding a journal will consume additional space on the GFS2 filesystem, and that should be taken into account when sizing the volumes appropriately.
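Two details of the mkfs invocation lend themselves to a quick sanity check: the cluster half of the lock table name, and the space the journals will take. The following sketch reads the cluster name straight from cluster.conf using text matching rather than an XML parser, and assumes the mkfs.gfs2 default journal size of 128MB unless told otherwise. All three function names are made up for illustration.

```shell
# Prints the name attribute of the <cluster> element in the cluster.conf
# file given as $1 (defaults to /etc/cluster/cluster.conf).
cluster_name() {
    sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}"
}

# Prints the lock table name "<clustername>:<fsname>" for the filesystem
# name in $1, reading the cluster name from the file in $2.
lock_table_name() {
    printf '%s:%s\n' "$(cluster_name "$2")" "$1"
}

# Prints the space in MB consumed by $1 journals of $2 MB each
# (assumption: 128 MB per journal, the mkfs.gfs2 default, when $2 is omitted).
journal_space_mb() {
    echo $(( $1 * ${2:-128} ))
}

# Usage on a live system:
#   lock_table_name lv_gfs_01     # value to pass to mkfs -t gfs2 ... -t
#   journal_space_mb 4            # MB set aside for four default journals
```

With the four journals created above and the default journal size, 512MB of the filesystem is set aside for journalling.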
Test mounting the volume on both nodes.
    # mkdir -p /shared/tmpmount
    # mount -t gfs2 \
    > -o noatime,nodiratime /dev/vg_shared/lv_gfs_01 /shared/tmpmount
    # mount | grep gfs2
If the test is successful, unmount the filesystem:
    # umount /shared/tmpmount
Update /etc/fstab on both nodes with the appropriate filesystem configuration. Ensure that you do NOT allow the system to fsck the filesystem on boot (leave the sixth field, the fsck pass number, at 0), otherwise it may attempt to check a filesystem mounted by another node. Also ensure that the noatime and nodiratime mount options are specified. This will significantly increase the performance of the GFS2 filesystem by disabling updates of file/directory access times, which are not usually required.
    # vi /etc/fstab
    <append>
    /dev/vg_shared/lv_gfs_01 /shared/tmpmount gfs2 defaults,noatime,nodiratime 0 0
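Since a boot-time fsck of a filesystem mounted elsewhere is the dangerous case, it may be worth checking the fstab on each node for GFS2 entries with a non-zero final field. A minimal sketch, with gfs2_fsck_enabled being a made-up helper name:

```shell
# Prints the mount point of every gfs2 entry in the fstab file given as $1
# (defaults to /etc/fstab) whose sixth field, the fsck pass number, is
# present and not 0. No output means no gfs2 entry will be checked at boot.
# gfs2_fsck_enabled is a hypothetical helper name.
gfs2_fsck_enabled() {
    awk '$1 !~ /^#/ && $3 == "gfs2" && $6 != "" && $6 != 0 { print $2 }' \
        "${1:-/etc/fstab}"
}

# Usage on a live system:
#   gfs2_fsck_enabled
```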
Mount the filesystem on both nodes:
    # mount -a
    # mount | grep gfs2
The filesystem is now mounted on both nodes, and is correctly locked and clustered.
You can now enable the automatic startup of GFS2:
    # service gfs2 status
    # chkconfig gfs2 on
    # chkconfig --list gfs2
Final Validation
Reboot both cluster nodes, and validate correct operation, reviewing system boot messages:
    # shutdown -r now
Once both nodes are back up, run the following commands to verify cluster status:
    # clustat
    # cman_tool nodes
    # service gfs2 status
    # df -hT /shared/tmpmount
If no issues are noted, you are done! You probably want a more sensible mountpoint than /shared/tmpmount - but this is being done in a lab environment and is suitable for my needs.
Check /var/log/messages should any issues be evident, and resolve them.
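For a scripted post-reboot check, the membership column of cman_tool nodes is the most telling: a status of M means the node is a cluster member. The sketch below assumes the usual column layout (node ID, status, incarnation, join time, name); non_member_nodes is a helper name I have invented.

```shell
# Reads `cman_tool nodes` output on stdin and prints the name (last column)
# of every node whose status column is not "M" (member). Lines that do not
# start with a numeric node ID, such as the header, are ignored.
# non_member_nodes is a hypothetical helper name.
non_member_nodes() {
    awk '$1 ~ /^[0-9]+$/ && $2 != "M" { print $NF }'
}

# Usage on a live system:
#   cman_tool nodes | non_member_nodes
```

No output means both nodes have joined the cluster.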
Conclusion
This article has walked through the preparation of shared storage, the installation of Red Hat Cluster Suite and the Global Filesystem, the configuration of a simple two-node cluster, and the creation and mounting of a cluster-wide shared filesystem.
RHCS is a very complex suite of software, capable of meeting the most demanding high-availability requirements. If you want to learn more, consult the appropriate Red Hat documentation.