ODS (Online DiskSuite)

Sun's volume manager has many names

ODS is a disk storage management solution, which offers

Raid Levels

The disk management software offers the common raid levels

raid 0 (Striping)

A number of disks are combined, either by striping data across them or by concatenating them, to give the appearance of one very large disk.

Advantages
   Improved performance
   Can Create very large Volumes

Disadvantages
   Not highly available (if one disk fails, the volume fails)

raid 1 (Mirroring)

A single disk is mirrored by another disk; if one disk fails, the system is unaffected because it can use the mirror.

Advantages
   Improved performance
   Highly Available (if one disk fails the mirror takes over)

Disadvantages
   Expensive (requires double the number of disks)

raid 5

Raid stands for Redundant Array of Inexpensive Disks. In raid 5, data is striped with parity across 3 or more disks. If one of the disks fails, the data on the failed disk is reconstructed from the remaining data and the parity information.

Advantages
   Improved performance (read only)
   Not expensive

Disadvantages
   Slow write operations (caused by having to create the parity bit)
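
The parity reconstruction described for raid 5 can be illustrated with a small shell sketch. This is only an illustration of the XOR idea, not how ODS is implemented: each "disk" is a single byte held in a made-up variable (d1, d2, d3), and it assumes a POSIX shell with arithmetic expansion (ksh or bash, not the old Bourne shell).

```shell
#!/bin/sh
# Illustration only: each "data block" is a single byte, and parity is the
# bitwise XOR of all blocks. Real raid 5 works on whole stripe units.

# xor_parity A B C ... -> prints the XOR of all its arguments
xor_parity() {
    result=0
    for value in "$@"; do
        result=$(( result ^ value ))
    done
    echo "$result"
}

d1=165   # data on disk 1 (hypothetical value)
d2=60    # data on disk 2 (hypothetical value)
d3=15    # data on disk 3 (hypothetical value)

# Parity is calculated on every write, which is why raid 5 writes are slow
parity=$(xor_parity "$d1" "$d2" "$d3")

# Pretend disk 2 has failed: rebuild its data from the survivors plus parity
rebuilt=$(xor_parity "$d1" "$d3" "$parity")

echo "parity=$parity original_d2=$d2 rebuilt_d2=$rebuilt"
```

Because XOR is its own inverse, XORing the surviving blocks with the parity always gives back the missing block, whichever disk is lost.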

Metadevice and Metadevice Database

A metadevice is a name for a group of physical slices that appear as a single logical (virtual) device. By default a maximum of 128 metadevices is supported, but this can be adjusted by editing /kernel/drv/md.conf and changing the nmd parameter (1024 maximum).

A metadevice database (otherwise known as the state database) stores information about the ODS configuration and tracks changes made to it; this database is what makes the ODS configuration persistent across reboots. The database has multiple copies known as replicas (a minimum of 3 is required), which ensures that the database is always valid. You should spread the replicas across different disks in case a disk fails, which reduces single points of failure. The database is never more than 10MB and is generally stored on a single slice of each disk.

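ODS uses a majority consensus algorithm to determine whether a replica is corrupted. When changes are made, each replica is updated in turn in case a power failure happens during the update, so when the system starts, the state agreed on by the majority of replicas is used. The algorithm guarantees the following:

   The system stays running as long as at least half of the replicas are available
   The system panics if fewer than half of the replicas are available
   The system will not reboot into multi-user mode unless a majority (half + 1) of the replicas is available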
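
The thresholds behind the majority consensus rule can be sketched as a small shell function. This is a rough model based on the behaviour Sun documents for state database replicas, not ODS code; the function name replica_state and its output strings are made up for this example.

```shell
#!/bin/sh
# replica_state TOTAL AVAILABLE
# Rough model of the documented replica rules:
#   - a majority (half + 1) is needed to boot into multi-user mode
#   - at least half is needed for a running system to stay up
#   - with fewer than half available the system panics
replica_state() {
    total=$1
    available=$2
    majority=$(( total / 2 + 1 ))

    if [ "$available" -ge "$majority" ]; then
        echo "ok: boots multi-user"
    elif [ $(( available * 2 )) -ge "$total" ]; then
        echo "degraded: keeps running but will not reboot into multi-user"
    else
        echo "panic"
    fi
}

# Six replicas spread over three disks (two per disk)
replica_state 6 4    # one disk lost
replica_state 6 3    # exactly half left
replica_state 6 2    # two disks lost
```

This is also why replicas should be spread over at least three disks: with replicas on only two disks, losing one disk can never leave a majority.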

Hot Spares

ODS uses a hot spare pool, which is a collection of disk slices reserved by ODS that are used automatically when a disk slice fails. Hot spares provide increased data protection; however, I have very rarely used them as I normally replace a failed disk pretty quickly. See the Sun Documentation for detailed information on hot spares.

Growing/Shrinking Filesystem

Expanding a filesystem under ODS is possible, although not without problems. Shrinking a filesystem under ODS is not possible; normally you create a new, smaller filesystem, copy the data across, and then cut over to the new filesystem.

This is one area where the Veritas volume manager excels, as it is very easy to grow and shrink a filesystem.

Filesystem Logging

ODS uses translogs to log changes made to the filesystem; if the system crashes, the log is replayed, which avoids an fsck (which can take a long time depending on the size of the filesystem). Newer versions of Solaris also offer UFS logging. Here is a list of the advantages and disadvantages of both:

ODS logging

UFS logging

My preference is to use UFS logging, and since its introduction in Solaris 7 I have only ever used this.

Naming Convention

There is no set standard for what you call your metadevices, but I have my own convention and undoubtedly there are many others.

The main metadevice (raid 0, 1 or 5), which is where the filesystem will be placed, always ends in 0, for example d0, d10, d20, d30, d40, etc.

A sub-mirror ends in either 1 (first sub-mirror) or 2 (second sub-mirror), for example d1 and d2, d11 and d12, d21 and d22, etc.
A raid slice ends in 1..n (where n depends on the number of disks), for example d1 & d2 & d3, d21 & d22 & d23, etc.

So, for example, a mirror named d10 would have the sub-mirrors d11 and d12.

This is my own preference and you are welcome to use your own naming convention.
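
The naming convention for mirrors can be written down as a small shell helper. This is just a sketch: the function name mirror_names is made up for this example, and it assumes a POSIX shell with arithmetic expansion.

```shell
#!/bin/sh
# mirror_names BASE
# Prints the mirror metadevice name followed by its two sub-mirrors,
# following the convention that the mirror number ends in 0.
mirror_names() {
    base=$1
    case "$base" in
        *0) ;;
        *)  echo "mirror number must end in 0: $base" >&2
            return 1 ;;
    esac
    echo "d$base d$(( base + 1 )) d$(( base + 2 ))"
}

mirror_names 0
mirror_names 10
```

A helper like this is mainly useful in build scripts, where it keeps the mirror and sub-mirror numbers consistent across a large number of filesystems.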

File Locations

ODS uses a number of different files; below are the most useful ones:

/kernel/drv/md.conf

This file is the ODS device driver configuration file. The only modifiable field is 'nmd', which represents the number of metadevices supported by the driver. If you change this file you must reboot the system for the changes to take effect.

In a configuration that uses a lot of devices I increase this to the maximum 1024.
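
As a rough sketch, the nmd change can be made with sed. The example below works on a scratch copy rather than the real /kernel/drv/md.conf, and the sample line is an assumption about what a stock md.conf entry looks like; check your own file before changing anything, and keep a backup.

```shell
#!/bin/sh
# Work on a scratch copy so nothing on the real system is touched.
conf=$(mktemp)

# Assumed layout of the driver entry in md.conf (check your own file)
cat > "$conf" <<'EOF'
name="md" parent="pseudo" nmd=128 md_nsets=4;
EOF

# Replace whatever nmd value is present with 1024
sed 's/nmd=[0-9][0-9]*/nmd=1024/' "$conf" > "$conf.new" && mv "$conf.new" "$conf"

new_line=$(cat "$conf")
echo "$new_line"

rm -f "$conf"
```

On a real system the same sed expression would be applied to a copy of /kernel/drv/md.conf, the result checked by eye, and the system rebooted afterwards.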

/etc/lvm/mddb.cf

This file keeps track of the metadevice state database replicas; each replica has a unique entry in this file. You can display the file using 'cat' but do not edit it manually.

/etc/lvm/md.tab

This file is used by the metainit, metadb and metahs commands. Each entry contains the rest of the command line that would otherwise be passed to metainit, metadb or metahs.

This file can be edited manually or populated by the command 'metastat -p'
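
Since an md.tab entry is simply the rest of the command line, entries for metadevices can be derived from the metainit commands themselves. Here is a minimal sketch; the function name metainit_to_mdtab is made up, and it only handles plain commands where the metadevice name directly follows metainit (no options such as -f in front).

```shell
#!/bin/sh
# metainit_to_mdtab
# Reads metainit command lines on stdin and prints the matching md.tab
# entries, i.e. everything after the command name. Other lines are dropped.
metainit_to_mdtab() {
    sed -n 's/^[ ]*metainit[ ][ ]*//p'
}

metainit_to_mdtab <<'EOF'
metainit d11 1 1 c2t0d0s0
metainit d12 1 1 c3t0d0s0
metainit d10 -m d11
EOF
```

The output lines have the same shape as the metadevice entries that 'metastat -p' prints for an existing configuration.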

/etc/lvm/md.cf

This file is a backup copy of the ODS configuration in md.tab format and is used for disaster recovery purposes; it is updated automatically.

Meta Commands

I am not going to explain in detail how ODS works but simply supply a list of commands that I use regularly. If you want a more detailed explanation then I suggest you refer to the Sun Documentation.

Metadatabase Commands
Create

metadb -a -f -c 3 c0t0d0s6 c1t0d0s6 c2t0d0s6

-a - attach metadatabase to device
-f - create the initial metadatabase and force deletion of replicas below the minimum of one
-c - specifies the number of replicas to be placed on each device

Add

metadb -a -c 3 c3t0d0s6

Remove

metadb -d c3t0d0s6

Display

metadb -i

Repairing

# The only way to repair a replica is to delete all the replicas on the device and
# recreate them

# First confirm that the replicas are corrupted and you have the device name
metadb -i

# Delete the corrupted replicas and reboot
metadb -d c3t0d0s6
reboot

# Now recreate them making sure you have 3 copies
metadb -a -c 3 c3t0d0s6
metadb -i
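
Spotting a corrupted replica in the 'metadb -i' listing can be scripted. The sketch below assumes that problem states are shown as uppercase letters in the flags column (for example W for write errors) and that the device name is the last field on the line; the function name find_bad_replicas and the sample listing are made up, so compare them with the output on your own system first.

```shell
#!/bin/sh
# find_bad_replicas
# Reads 'metadb -i' style output on stdin and prints the device of every
# replica whose flags contain an uppercase (problem) letter.
# Assumed line layout: <flags...> <first blk> <block count> <device>
find_bad_replicas() {
    LC_ALL=C awk '$NF ~ /^\/dev\// {
        flags = ""
        for (i = 1; i <= NF - 3; i++) flags = flags $i
        if (flags ~ /[A-Z]/) print $NF
    }'
}

# Hypothetical sample listing; on a real system use: metadb -i | find_bad_replicas
find_bad_replicas <<'EOF'
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t0d0s6
     a    p  luo        16              1034            /dev/dsk/c1t0d0s6
      W   p  l          16              1034            /dev/dsk/c3t0d0s6
EOF
```

Only lines ending in a /dev/ path are examined, so the header and the flag legend that 'metadb -i' prints are ignored.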

Metadevice Commands
Create Concatenated device

metainit d0 3 1 c1t0d0s0 1 c2t0d0s0 1 c3t0d0s0

d0 - metadevice name
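3 - total number of stripes (here each stripe is a single slice, so the three slices are concatenated)
1 c1t0d0s0 - number of slices in this stripe, followed by the device name (repeated for each slice)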

Create stripe metadevice

metainit d0 1 2 c1t0d0s0 c2t0d0s0 -i 64k

d0 - metadevice name
1 - total number of stripes
2 c1t0d0s0 c2t0d0s0 - number of slices to be added to stripe followed by device name
-i 64k - interlace (stripe unit) size

Create Mirror metadevice

# First create two metadevices (these will become sub-mirrors)
metainit d11 1 1 c2t0d0s0
metainit d12 1 1 c3t0d0s0

# Then create the mirror metadevice using the metadevice d11 (now called a sub-mirror)
metainit d10 -m d11

# Then attach the second sub-mirror, using the metadevice d12 created above, to the mirror d10
metattach d10 d12

# Display the mirrored metadevice and confirm that the mirror has completed its resync operation;
# this may take a long time depending on the size of the mirror device
metastat d10
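
Waiting for the resync to finish can be scripted rather than checked by hand. The sketch below assumes that metastat reports a line containing "Resync in progress" while a mirror is resyncing; the function name resync_in_progress is made up, and the exact wording should be checked against your own metastat output.

```shell
#!/bin/sh
# resync_in_progress
# Reads metastat output on stdin; returns 0 (true) while a resync is
# still being reported and non-zero once it is not.
resync_in_progress() {
    grep -i "resync in progress" > /dev/null
}

# On a real system this could be used as:
#   while metastat d10 | resync_in_progress; do sleep 60; done
#   echo "d10 resync complete"

# Demonstration with a hypothetical line of metastat output
if echo "    Resync in progress: 15 % done" | resync_in_progress; then
    echo "still resyncing"
fi
```

Reading from stdin keeps the check separate from the metastat call, so it can be tried out on saved output before it is used on a live system.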

Create Raid 5 metadevice

# When creating a raid 5 metadevice you need a minimum of 3 slices

metainit d10 -r c1t0d0s0 c2t0d0s0 c3t0d0s0

-r - specifies that it is a raid 5 configuration

Mirroring the root filesystem

# Let's say you want to mirror the main disk, which has the following filesystems configured; we will be using
# c1t0d0 as the new mirror disk
#
# We hope to achieve the following device configuration
# d0 - mirrored metadevice which contains the root filesystem
#    d1 - a sub-mirror metadevice of d0 (c0t0d0s0)
#    d2 - a sub-mirror metadevice of d0 (c1t0d0s0)
#
# If either c0t0d0s0 or c1t0d0s0 fails the other will take over, thus the system will continue to work as normal

# The first step is to make sure the partition information is the same on the new mirror disk (c1t0d0);
# the command below copies the partition information to the new mirror device

prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2

# Then we want to install the boot block on the new mirror device; this allows you to boot from this disk
# should the other disk fail

installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0

# Create the first metadevice, which will become a sub-mirror of d0
# NOTE: although we are using the existing root slice this does not delete any data; we have also
# specified the -f (force) option as the filesystem is mounted

metainit -f d1 1 1 c0t0d0s0

# Create the second metadevice, which will become the other sub-mirror of d0; we do not need the -f (force)
# option as there is no filesystem on the new device

metainit d2 1 1 c1t0d0s0

# At this point we have two metadevices, d1 (contains the root filesystem) and d2 (the new disk);
# we now create the mirror metadevice d0

metainit d0 -m d1

# We now have to update the /etc/system and /etc/vfstab with the new root metadevice information

metaroot d0

# Now reboot the server so that the new mirror metadevice is mounted and the kernel parameters for ODS
# are loaded, we lock the filesystem before rebooting making sure all buffers have been written to the
# filesystem

lockfs -fa
reboot

# Once the server has been rebooted attach the second sub-mirror

metattach d0 d2

# The bigger the root filesystem, the longer the resyncing of the two sub-mirrors will take

metastat d0

# Once the sub-mirrors are synced you have a root filesystem that is highly available; you can now perform
# the same task with other filesystems such as /var, swap, /usr, etc

Other ODS Commands
Display Metadatabase

metadb -i

Display Metadevices

metastat

Display metadevice in md.tab format

metastat -p


ODS Errors

A list of some of the more common errors of ODS

"no such file or directory error" when trying to configure a metadevice

# Update the nmd parameter in the /kernel/drv/md.conf file; I normally increase this to its maximum of 1024.

Metadevice in maintenance state

# Disks do go bad from time to time. There is a difference between a total disk failure and a
# disk with bad data blocks, but if you replace the disk and use the same disk slice then the same
# command is used

# First try to access the disk via format; if you can, run an analyze on the disk to repair/map out any bad
# data blocks

format -> select disk -> anal -> read

# If you cannot access the disk via format then physically replace the disk, then run the command below
# to repair ODS; you must do this for each metadevice configured on that disk

metareplace -e d0 c1t0d0s0

# If you want to replace the disk with a different disk then run

metareplace d0 c1t0d0s0 <new device name>

# Again confirm that the disk has re-synced

metastat d0