OCFS2(7) OCFS2 Manual Pages OCFS2(7)

NAME
OCFS2 − A Shared-Disk Cluster File System for Linux

INTRODUCTION
OCFS2 is a file system. It allows users to store and retrieve data. The data is stored in files that are orga-
nized in a hierarchical directory tree. It is a POSIX compliant file system that supports the standard inter-
faces and the behavioral semantics as spelled out by that specification.

It is also a shared disk cluster file system, one that allows multiple nodes to access the same disk at the
same time. This is where the fun begins as allowing a file system to be accessible on multiple nodes opens a
can of worms. What if the nodes are of different architectures? What if a node dies while writing to the file
system? What data consistency can one expect if processes on two nodes are reading and writing concur-
rently? What if one node removes a file while it is still being used on another node?

Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very well defined. It
behaves on all nodes exactly like a local file system. If a file is removed, the directory entry is removed but
the inode is kept as long as it is in use across the cluster. When the last user closes the descriptor, the inode
is marked for deletion.

The data consistency model follows the same principle. It works as if the two processes that are running on
two different nodes are running on the same node. A read on a node gets the last write irrespective of the IO
mode used. The modes can be buffered, direct, asynchronous, splice or memory mapped IOs. It is fully
cache coherent.

Take for example the REFLINK feature that allows a user to create multiple write-able snapshots of a file.
This feature, like all others, is fully cluster-aware. A file being written to on multiple nodes can be safely
reflinked on another node. The snapshot created is a point-in-time image of the file that includes both the
file data and all its attributes (including extended attributes).

It is a journaling file system. When a node dies, a surviving node transparently replays the journal of the
dead node. This ensures that the file system metadata is always consistent. It also defaults to ordered data
journaling to ensure the file data is flushed to disk before the journal commit, to remove the small possibil-
ity of stale data appearing in files after a crash.

It is architecture and endian neutral. It allows concurrent mounts on nodes with different processors like
x86, x86_64, IA64 and PPC64. It handles little and big endian, 32-bit and 64-bit architectures.

It is feature rich. It supports indexed directories, metadata checksums, extended attributes, POSIX ACLs,
quotas, REFLINKs, sparse files, unwritten extents and inline-data.

It is fully integrated with the mainline Linux kernel. The file system was merged into Linux kernel 2.6.16
in early 2006.

It is quickly installed. It is available with almost all Linux distributions. The file system is on-disk com-
patible across all of them.

It is modular. The file system can be configured to operate with other cluster stacks like Pacemaker and
CMAN along with its own stack, O2CB.

It is easily configured. The O2CB cluster stack configuration involves editing two files, one for cluster lay-
out and the other for cluster timeouts.

It is very efficient. The file system consumes very few resources. It is used to store virtual machine images
in limited memory environments like Xen and KVM.

In summary, OCFS2 is an efficient, easily configured, modular, quickly installed, fully integrated and com-
patible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-com-
pliant, shared disk cluster file system.

OVERVIEW
OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high per-
formance and high availability.

As it provides local file system semantics, it can be used with almost all applications. Cluster-aware appli-
cations can make use of cache-coherent parallel I/Os from multiple nodes to scale out applications easily.
Other applications can make use of the clustering facilities to fail over running applications in the event of
a node failure.

The notable features of the file system are:

Tunable Block size
The file system supports block sizes of 512, 1K, 2K and 4K bytes. 4KB is almost always recom-
mended. This feature is available in all releases of the file system.

Tunable Cluster size
A cluster size is also referred to as an allocation unit. The file system supports cluster sizes of 4K,
8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recommended.
However, a larger value is recommended for volumes hosting mostly very large files like database
files, virtual machine images, etc. A large cluster size allows the file system to store large files
more efficiently. This feature is available in all releases of the file system.
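
For example, a volume intended for virtual machine images could be formatted with the default 4KB
block size and a 1MB cluster size (the device name is illustrative):

# mkfs.ocfs2 -b 4K -C 1M /dev/sdc1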

Endian and Architecture neutral
The file system can be mounted concurrently on nodes having different architectures: 32-bit and
64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64, s390x). This feature is available
in all releases of the file system.

Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
The file system supports all modes of I/O for maximum flexibility and performance. It also sup-
ports cluster-wide shared writeable mmap(2). The support for buffered, direct and asynchronous
I/O is available in all releases. The support for splice I/O was added in Linux kernel 2.6.20 and for
shared writeable mmap(2) in 2.6.23.

Multiple Cluster Stacks
The file system includes a flexible framework to allow it to function with userspace cluster stacks
like Pacemaker (pcmk) and CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
stack.

The support for o2cb cluster stack is available in all releases.

The support for no cluster stack, or local mount, was added in Linux kernel 2.6.20.

The support for userspace cluster stack was added in Linux kernel 2.6.26.

Journaling
The file system supports both ordered (default) and writeback data journaling modes to provide
file system consistency in the event of power failure or system crash. It uses JBD2 in Linux kernel
2.6.28 and later. It used JBD in earlier kernels.
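
For example, a volume could be mounted with writeback data journaling instead of the ordered default
(the device and mount point are illustrative):

# mount -o data=writeback /dev/sda1 /ocfs2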

Extent-based Allocations
The file system allocates and tracks space in ranges of clusters. This is unlike block based file sys-
tems that have to track each and every block. This feature allows the file system to be very effi-
cient when dealing with both large volumes and large files. This feature is available in all releases
of the file system.

Sparse files
Sparse files are files with holes. With this feature, the file system delays allocating space until a
write is issued to a cluster. This feature was added in Linux kernel 2.6.22 and requires enabling
on-disk feature sparse.
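
For example, writing a single 4KB block at a 1GB offset creates a file with an apparent size of 1GB while
allocating just one cluster (the file name is illustrative):

$ dd if=/dev/zero of=sparsefile bs=4K count=1 seek=262143
$ du -h --apparent-size sparsefile; du -h sparsefile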

Unwritten Extents
An unwritten extent is also referred to as user pre-allocation. It allows an application to request a
range of clusters to be allocated, but not initialized, within a file. Pre-allocation allows the file sys-
tem to optimize the data layout with fewer, larger extents. It also provides a performance boost,
delaying initialization until the user writes to the clusters. This feature was added in Linux kernel
2.6.23 and requires enabling on-disk feature unwritten.
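
For example, fallocate(1) issues such a pre-allocation request from the command line (the size and file
name are illustrative):

$ fallocate -l 1G prealloc.dbf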

Hole Punching
Hole punching allows an application to remove arbitrary allocated regions within a file. Creating
holes, essentially. This is more efficient than zeroing the same extents. This feature is especially
useful in virtualized environments as it allows a block discard in a guest file system to be con-
verted to a hole punch in the host file system thus allowing users to reduce disk space usage. This
feature was added in Linux kernel 2.6.23 and requires enabling on-disk features sparse and
unwritten.
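
For example, fallocate(1) can punch a hole over a region that is no longer needed (the offsets and file
name are illustrative):

$ fallocate --punch-hole --offset 4M --length 128M vm.img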

Inline-data
Inline data is also referred to as data-in-inode as it allows storing small files and directories in the
inode block. This not only saves space but also has a positive impact on cold-cache directory and
file operations. The data is transparently moved out to an extent when it no longer fits inside the
inode block. This feature was added in Linux kernel 2.6.24 and requires enabling on-disk feature
inline-data.

REFLINK
REFLINK is also referred to as fast copy. It allows users to atomically (and instantly) copy regular
files. In other words, create multiple writeable snapshots of regular files. It is called REFLINK
because it looks and feels more like a (hard) link(2) than a traditional snapshot. Like a link, it is a
regular user operation, subject to the security attributes of the inode being reflinked and not to the
super user privileges typically required to create a snapshot. Like a link, it operates within a file
system. But unlike a link, it links the inodes at the data extent level allowing each reflinked inode
to grow independently as and when written to. Up to four billion inodes can share a data extent.
This feature was added in Linux kernel 2.6.32 and requires enabling on-disk feature refcount.

Allocation Reservation
File contiguity plays an important role in file system performance. When a file is fragmented on
disk, reading and writing to the file involves many seeks, leading to lower throughput. Contiguous
files, on the other hand, minimize seeks, allowing the disks to perform IO at the maximum rate.

With allocation reservation, the file system reserves a window in the bitmap for all extending files
allowing each to grow as contiguously as possible. As this extra space is not actually allocated, it
is available for use by other files if the need arises. This feature was added in Linux kernel 2.6.35
and can be tuned using the mount option resv_level.
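
The reservation window can be tuned at mount time; resv_level accepts values 0 (reservations off)
through 8, with 2 as the default. A hypothetical mount:

# mount -o resv_level=4 /dev/sda1 /ocfs2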

Indexed Directories
An indexed directory allows users to perform quick lookups of a file in very large directories. It
also results in faster creates and unlinks and thus provides better overall performance. This feature
was added in Linux kernel 2.6.30 and requires enabling on-disk feature indexed-dirs.

File Attributes
This refers to EXT2-style file attributes, such as immutable, modified using chattr(1) and queried
using lsattr(1). This feature was added in Linux kernel 2.6.19.
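
For example, to mark a file immutable and verify the flag (the file name is arbitrary):

# chattr +i myfile
# lsattr myfile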

Extended Attributes
An extended attribute refers to a name:value pair that can be associated with file system objects
like regular files, directories, symbolic links, etc. OCFS2 allows associating an unlimited number
of attributes per object. The attribute names can be up to 255 bytes in length, terminated by the
first NUL character. While it is not required, printable names (ASCII) are recommended. The
attribute values can be up to 64 KB of arbitrary binary data. These attributes can be modified and
listed using standard Linux utilities setfattr(1) and getfattr(1). This feature was added in Linux
kernel 2.6.29 and requires enabling on-disk feature xattr.
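
For example, to attach an attribute and read it back (the name and value are arbitrary):

$ setfattr -n user.location -v "rack42" myfile
$ getfattr -n user.location myfile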

Metadata Checksums
This feature allows the file system to detect silent corruptions in all metadata blocks like inodes
and directories. This feature was added in Linux kernel 2.6.29 and requires enabling on-disk fea-
ture metaecc.

POSIX ACLs and Security Attributes
POSIX ACLs allow assigning fine-grained discretionary access rights for files and directories.
This security scheme is a lot more flexible than the traditional file access permissions, which
impose a strict user-group-other model.

Security attributes allow the file system to support other security regimes like SELinux, SMACK,
AppArmor, etc.

Both these security extensions were added in Linux kernel 2.6.29 and require enabling on-disk
feature xattr.

User and Group Quotas
This feature allows setting up usage quotas on a per-user and per-group basis using the standard
utilities like quota(1), setquota(8), quotacheck(8), and quotaon(8). This feature was added in
Linux kernel 2.6.29 and requires enabling on-disk features usrquota and grpquota.
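
A hypothetical sequence that mounts a volume with quotas and sets block limits for one user (the names
and limits are illustrative):

# mount -o usrquota,grpquota /dev/sdb1 /ocfs2
# setquota -u jeff 1048576 2097152 0 0 /ocfs2
$ quota -u jeff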

Unix File Locking
The Unix operating system has historically provided two system calls to lock files: flock(2), or
BSD locking, and fcntl(2), or POSIX locking. OCFS2 extends both file locks to the cluster. File
locks taken on one node interact with those taken on other nodes.

The support for clustered flock(2) was added in Linux kernel 2.6.26. All flock(2) options are sup-
ported, including the kernel's ability to cancel a lock request when an appropriate kill signal is
received by the user. This feature is supported with all cluster stacks including o2cb.

The support for clustered fcntl(2) was added in Linux kernel 2.6.28. But because it requires group
communication to make the locks coherent, it is only supported with userspace cluster stacks,
pcmk and cman and not with the default cluster stack o2cb.
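
For example, flock(1) can serialize a task across the cluster by locking a file on a shared OCFS2 volume
(the path and command are illustrative):

$ flock /ocfs2/batch.lock -c "run-nightly-batch"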

Comprehensive Tools Support
The file system has a comprehensive EXT3-style toolset that tries to use similar parameters for
ease-of-use. It includes mkfs.ocfs2(8) (format), tunefs.ocfs2(8) (tune), fsck.ocfs2(8) (check),
debugfs.ocfs2(8) (debug), etc.

Online Resize
The file system can be dynamically grown using tunefs.ocfs2(8). This feature was added in Linux
kernel 2.6.25.

RECENT CHANGES
The O2CB cluster stack has a global heartbeat mode. It allows users to specify heartbeat regions that are
consistent across all nodes. The cluster stack also allows online addition and removal of both nodes and
heartbeat regions.

o2cb(8) is the new cluster configuration utility. It is an easy to use utility that allows users to create the
cluster configuration on a node that is not part of the cluster. It replaces the older utility o2cb_ctl(8), which
has been deprecated.

ocfs2console(8) has been obsoleted.

o2info(8) is a new utility that can be used to provide file system information. It allows non-privileged
users to see the enabled file system features, block and cluster sizes, extended file stat, free space fragmen-
tation, etc.

o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely lightweight utility that logs messages to
the system logger once the heartbeat delay exceeds the warn threshold. This utility is useful in identifying
volumes encountering I/O delays.

debugfs.ocfs2(8) has some new commands. net_stats shows the o2net message times between various
nodes. This is useful in identifying nodes that are slowing down the cluster operations. stat_sysdir allows
the user to dump the entire system directory, which can be used to debug issues. grpextents dumps the
complete free space fragmentation in the cluster group allocator.

mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount, extended-slotmap and clusterinfo
feature flags by default, in addition to the older defaults, sparse, unwritten and inline-data.

mount.ocfs2(8) allows users to specify the level of cache coherency between nodes. By default the file
system operates in full coherency mode that also serializes the direct I/Os. While this mode is technically
correct, it limits the I/O throughput in a clustered database. This mount option allows the user to limit the
cache coherency to only the buffered I/Os to allow multiple nodes to do concurrent direct writes to the
same file. This feature works with Linux kernel 2.6.37 and later.
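
A hypothetical mount that relaxes coherency to the buffered I/Os (the device and mount point are
illustrative):

# mount -o coherency=buffered /dev/sda1 /ocfs2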

COMPATIBILITY
The OCFS2 development team goes to great lengths to maintain compatibility. It attempts to maintain both
on-disk and network protocol compatibility across all releases of the file system. It does so even while
adding new features that entail on-disk format and network protocol changes. To do this successfully, it fol-
lows a few rules:

1. The on-disk format changes are managed by a set of feature flags that can be turned on and off. The
file system in kernel detects these features during mount and continues only if it understands all the
features. Users encountering this have the option of either disabling that feature or upgrading the file
system to a newer release.

2. The latest release of ocfs2-tools is compatible with all versions of the file system. All utilities detect
the features enabled on disk and continue only if they understand all the features. Users encountering
this have to upgrade the tools to a newer release.

3. The network protocol version is negotiated by the nodes to ensure all nodes understand the active
protocol version.

FEATURE FLAGS
The feature flags are split into three categories, namely, Compat, Incompat and RO Compat.

Compat, or compatible, is a feature that the file system does not need to fully understand to safely
read/write to the volume. An example of this is the backup-super feature that added the capability
to backup the super block in multiple locations in the file system. As the backup super blocks are
typically not read nor written to by the file system, an older file system can safely mount a volume
with this feature enabled.

Incompat, or incompatible, is a feature that the file system needs to fully understand to read/write
to the volume. Most features fall under this category.

RO Compat, or read-only compatible, is a feature that the file system needs to fully understand to
write to the volume. Older software can safely read a volume with this feature enabled. An exam-
ple of this would be user and group quotas. As quotas are manipulated only when the file system is
written to, older software can safely mount such volumes in read-only mode.

The list of feature flags, the version of the kernel it was added in, the earliest version of the tools
that understands it, etc., is as follows:

Feature Flags          Kernel Version   Tools Version     Category    Hex Value
backup-super           All              ocfs2-tools 1.2   Compat      1
strict-journal-super   All              All               Compat      2
local                  Linux 2.6.20     ocfs2-tools 1.2   Incompat    8
sparse                 Linux 2.6.22     ocfs2-tools 1.4   Incompat    10
inline-data            Linux 2.6.24     ocfs2-tools 1.4   Incompat    40
extended-slotmap       Linux 2.6.27     ocfs2-tools 1.6   Incompat    100
xattr                  Linux 2.6.29     ocfs2-tools 1.6   Incompat    200
indexed-dirs           Linux 2.6.30     ocfs2-tools 1.6   Incompat    400
metaecc                Linux 2.6.29     ocfs2-tools 1.6   Incompat    800
refcount               Linux 2.6.32     ocfs2-tools 1.6   Incompat    1000
discontig-bg           Linux 2.6.35     ocfs2-tools 1.6   Incompat    2000
clusterinfo            Linux 2.6.37     ocfs2-tools 1.8   Incompat    4000
unwritten              Linux 2.6.23     ocfs2-tools 1.4   RO Compat   1
grpquota               Linux 2.6.29     ocfs2-tools 1.6   RO Compat   2
usrquota               Linux 2.6.29     ocfs2-tools 1.6   RO Compat   4

To query the features enabled on a volume, do:

$ o2info --fs-features /dev/sdf1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten

ENABLING AND DISABLING FEATURES

The format utility, mkfs.ocfs2(8), allows a user to enable and disable specific features using the fs-
features option. The features are provided as a comma separated list. The enabled features are
listed as is. The disabled features are prefixed with no. The example below shows the file system
being formatted with sparse disabled and inline-data enabled.

# mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1

After formatting, the users can toggle features using the tune utility, tunefs.ocfs2(8). This is an
offline operation. The volume needs to be unmounted across the cluster. The example below shows
the sparse feature being enabled and inline-data disabled.

# tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1

Care should be taken before enabling and disabling features. Users planning to use a volume with
an older version of the file system will be better off not enabling newer features, as disabling them
later may not succeed.

An example would be disabling the sparse feature; this requires filling every hole. The operation
can only succeed if the file system has enough free space.

DETECTING FEATURE INCOMPATIBILITY

Say one tries to mount a volume with an incompatible feature. What happens then? How does one
detect the problem? How does one know the name of that incompatible feature?

To begin with, one should look for error messages in dmesg(8). Mount failures that are due to an
incompatible feature will always result in an error message like the following:

ERROR: couldn’t mount because of unsupported optional features (200).

Here the file system is unable to mount the volume due to an unsupported optional feature. That
means the feature is an Incompat feature. By referring to the table above, one can then
deduce that the user failed to mount a volume with the xattr feature enabled. (The value in the
error message is in hexadecimal.)

Another example of an error message due to incompatibility is as follows:

ERROR: couldn’t mount RDWR because of unsupported optional features (1).

Here the file system is unable to mount the volume in the RW mode. That means the feature
is a RO Compat feature. Another look at the table and it becomes apparent that the volume had
the unwritten feature enabled.

In both cases, the user has the option of disabling the feature. In the second case, the user has the
choice of mounting the volume in the RO mode.

GETTING STARTED
The OCFS2 software is split into two components, namely, kernel and tools. The kernel component
includes the core file system and the cluster stack, and is packaged along with the kernel. The tools compo-
nent is packaged as ocfs2-tools and needs to be specifically installed. It provides utilities to format, tune,
mount, debug and check the file system.

To install ocfs2-tools, refer to the package handling utility of your distribution.

The next step is selecting a cluster stack. The options include:

A. No cluster stack, or local mount.

B. In-kernel o2cb cluster stack with local or global heartbeat.

C. Userspace cluster stacks pcmk or cman.

The file system allows changing cluster stacks easily using tunefs.ocfs2(8). To list the cluster stacks
stamped on the OCFS2 volumes, do:

# mounted.ocfs2 -d
Device     Stack  Cluster     F  UUID                              Label
/dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
/dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
/dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
/dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
/dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch

NON-CLUSTERED OR LOCAL MOUNT

To format an OCFS2 volume as a non-clustered (local) volume, do:

# mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1

To convert an existing clustered volume to a non-clustered volume, do:

# tunefs.ocfs2 --fs-features=local /dev/sda1

Non-clustered volumes do not interact with the cluster stack. One can have both clustered and
non-clustered volumes mounted at the same time.

While formatting a non-clustered volume, users should consider the possibility of later converting
that volume to a clustered one. If there is a possibility of that, then the user should add enough
node-slots using the -N option. Adding node-slots during format creates journals with large
extents. If created later, the journals will be fragmented, which is not good for performance.

CLUSTERED MOUNT WITH O2CB CLUSTER STACK

Only one of the two heartbeat modes can be active at any one time. Changing heartbeat modes is an
offline operation.

Both heartbeat modes require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to be populated as
described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5) respectively. The only difference in setup
between the two modes is that global requires heartbeat devices to be configured whereas local
does not.

Refer to o2cb(7) for more information.

LOCAL HEARTBEAT
This is the default heartbeat mode. The user needs to populate the configuration files as
described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this mode, the cluster stack
heartbeats on all mounted volumes. Thus, one does not have to specify heartbeat devices
in cluster.conf.

Once configured, the o2cb cluster stack can be onlined and offlined as follows:

# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK

# service o2cb offline
Clean userdlm domains: OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

GLOBAL HEARTBEAT
The configuration is similar to local heartbeat. The one additional step in this mode is that
heartbeat devices also need to be configured.

These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled on
disk. These volumes can later be mounted and used as clustered file systems.

The steps to format a volume with global heartbeat enabled are listed in o2cb(7). Also
listed there are the steps to list all volumes with the cluster stack stamped on disk.

In this mode, the heartbeat is started when the cluster is onlined and stopped when the
cluster is offlined.

# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK

# service o2cb offline
Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
Heartbeat dead threshold: 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Heartbeat mode: Global
Checking O2CB heartbeat: Active
77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
Nodes in O2CB cluster: 92 96

CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK

Configure and online the userspace stack pcmk or cman before using tunefs.ocfs2(8) to update
the cluster stack on disk.

# tunefs.ocfs2 --update-cluster-stack /dev/sdd1
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y

Refer to the cluster stack documentation for information on starting and stopping the cluster stack.

FILE SYSTEM UTILITIES
This section lists the utilities that are used to manage the OCFS2 file systems. This includes tools to for-
mat, tune, check, mount, and debug the file system. Each utility has a man page that lists its capabilities in
detail.

mkfs.ocfs2(8)
This is the file system format utility. All volumes have to be formatted prior to use. As this util-
ity overwrites the volume, use it with care. Double check to ensure the volume is not in use on any
node in the cluster.

As a precaution, the utility will abort if the volume is locally mounted. It also detects whether an
OCFS2 volume is in use across the cluster. But these checks are not comprehensive and can be overridden. So
use it with care.

While it is not always required, the cluster should be online.

tunefs.ocfs2(8)
This is the file system tune utility. It allows users to change certain on-disk parameters like label,
uuid, number of node-slots, volume size and the size of the journals. It also allows turning on and
off the file system features as listed above.

This utility requires the cluster to be online.
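
For example, to change the volume label and increase the number of node-slots (the values are
illustrative):

# tunefs.ocfs2 -L "newlabel" -N 8 /dev/sda1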

fsck.ocfs2(8)
This is the file system check utility. It detects and fixes on-disk errors. All the check codes and
their fixes are listed in fsck.ocfs2.checks(8).

This utility requires the cluster to be online to ensure the volume is not in use on another node and
to prevent the volume from being mounted for the duration of the check.

mount.ocfs2(8)
This is the file system mount utility. It is invoked indirectly by the mount(8) utility.

This utility detects the cluster status and aborts if the cluster is offline or does not match the cluster
stamped on disk.

o2cluster(8)
This is the file system cluster stack update utility. It allows the users to update the on-disk cluster
stack to the one provided.

This utility only updates the disk if the utility is reasonably assured that the file system is not in
use on any node.

o2info(1)
This is the file system information utility. It provides information like the features enabled on disk,
block size, cluster size, free space fragmentation, etc.

It can be used by both privileged and non-privileged users. Users having read permission on the
device can provide the path to the device. Other users can provide the path to a file on a mounted
file system.

debugfs.ocfs2(8)
This is the file system debug utility. It allows users to examine all file system structures including
walking directory structures, displaying inodes, backing up files, etc., without mounting the file
system.

This utility requires the user to have read permission on the device.

o2image(8)
This is the file system image utility. It allows users to copy the file system metadata skeleton,
including the inodes, directories, bitmaps, etc. As it excludes data, the resulting image is dramati-
cally smaller than the file system itself.

The image file created can be used in debugging on-disk corruptions.
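
A typical invocation copies the metadata into an image file that can then be examined or shared (the
paths are illustrative):

# o2image /dev/sda1 /tmp/sda1.img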

mounted.ocfs2(8)
This is the file system detect utility. It detects all OCFS2 volumes in the system and lists their
labels, uuids and cluster stacks.

O2CB CLUSTER STACK UTILITIES
This section lists the utilities that are used to manage the O2CB cluster stack. Each utility has a man page
that lists its capabilities in detail.

o2cb(8)
This is the cluster configuration utility. It allows users to update the cluster configuration by
adding and removing nodes and heartbeat regions. This utility is used by the o2cb init script to
online and offline the cluster.

This is a new utility and replaces o2cb_ctl(8) which has been deprecated.

ocfs2_hb_ctl(8)
This is the cluster heartbeat utility. It allows users to start and stop local heartbeat. This utility is
invoked by mount.ocfs2(8) and should not be invoked directly by the user.

o2hbmonitor(8)
This is the disk heartbeat monitor. It tracks the elapsed time since the last heartbeat and logs warn-
ings once that time exceeds the warn threshold.

FILE SYSTEM NOTES
This section includes some notes that may prove helpful to the user.

BALANCED CLUSTER
A cluster is a computer. This is a fact and not a slogan. What this means is that an errant node in
the cluster can affect the behavior of other nodes. If one node is slow, the cluster operations will
slow down on all nodes. To prevent that, it is best to have a balanced cluster. This is a cluster that
has equally powered and loaded nodes.

The standard recommendation for such clusters is to have identical hardware and software across
all the nodes. However, that is not a hard and fast rule. After all, we have taken the effort to ensure
that OCFS2 works in a mixed architecture environment.

If one uses OCFS2 in a mixed architecture environment, try to ensure that the nodes are equally
powered and loaded. The use of a load balancer can assist with the latter. Power refers to the num-
ber of processors, speed, amount of memory, I/O throughput, network bandwidth, etc. In reality,
having equally powered heterogeneous nodes is not always practical. In that case, make the lower
node numbers more powerful than the higher node numbers. The O2CB cluster stack favors lower
node numbers in all of its tiebreaking logic.

This is not to suggest you should add a single core node in a cluster of quad cores. No amount of
node number juggling will help you there.

FILE DELETION
In Linux, rm(1) removes the directory entry. It does not necessarily delete the corresponding
inode. By removing the directory entry, it gives the illusion that the inode has been deleted. This
puzzles users when they do not see a corresponding up-tick in the reported free space. The reason
is that inode deletion has a few more hurdles to cross.

First is the hard link count. This indicates the number of directory entries pointing to that inode.
As long as a directory entry is linked to that inode, it cannot be deleted. The file system has to wait
for that count to drop to zero.

The second hurdle is the POSIX semantics allowing files to be unlinked even while they are in use.
In OCFS2, that translates to in use across the cluster. The file system has to wait for all processes
across the cluster to stop using the inode.

Once these two conditions are met, the inode is deleted and the freed bits are flushed to disk on the
next sync.

This assumes that the inode was not reflinked. If it was, then the deletion would only release space
that was private to the inode. Shared space would only be released when the last inode using it is
deleted.

Users interested in following the trail can use debugfs.ocfs2(8) to view the node specific system
files orphan_dir and truncate_log. Once the link count is zero, an inode is moved to the

orphan_dir. After deletion, the freed bits are added to the truncate_log, where they remain until the
next sync, during which the bits are flushed to the global bitmap.
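
For example, the orphan directory for slot 0 can be listed as follows (the device name is illustrative):

# debugfs.ocfs2 -R "ls //orphan_dir:0000" /dev/sda1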

DIRECTORY LISTING
ls(1) may be a simple command, but it is not cheap. What is expensive is not the part where it
reads the directory listing, but the second part where it reads all the inodes, also referred to as an
inode stat(2). If the inodes are not in cache, this can entail disk I/O. Now, while a cold cache
inode stat(2) is expensive in all file systems, it is especially so in a clustered file system. It needs to
take a lock on each node, pure overhead when compared to a local file system.

A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on EXT3.

In other words, the second ls(1) will be quicker than the first. However, it is not guaranteed. Say
you have a million files in a file system and not enough kernel memory to cache all the inodes. In
that case, each ls(1) will involve some cold cache stat(2)s.

ALLOCATION RESERVATION
Allocation reservation allows multiple concurrently extending files to grow as contiguously as pos-
sible. One way to demonstrate its functioning is to run a script that extends multiple files in a cir-
cular order. The script below does that by writing one hundred 4KB chunks to four files, one after
another.

$ for i in $(seq 0 99);
> do
> for j in $(seq 4);
> do
> dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
> done;
> done;

When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with 100
extents each. That is full fragmentation. As the files are being extended one after another, the on-
disk allocations are fully interleaved.

$ filefrag file1 file2 file3 file4
file1: 100 extents found
file2: 100 extents found
file3: 100 extents found
file4: 100 extents found

When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents each. That
is a lot fewer than before. Fewer extents mean more on-disk contiguity and that always leads to
better overall performance.

$ filefrag file1 file2 file3 file4
file1: 7 extents found
file2: 7 extents found
file3: 7 extents found
file4: 7 extents found

REFLINK OPERATION
This feature allows a user to create a writeable snapshot of a regular file. In this operation, the file
system creates a new inode with the same extent pointers as the original inode. Multiple inodes are

thus able to share data extents. This adds a twist to file system administration because none of the
existing file system utilities in Linux expect this behavior. du(1), a utility used to compute file
space usage, simply adds the blocks allocated to each inode. As it does not know about shared
extents, it overestimates the space used. Say we have a 5GB file in a volume having 42GB free.

$ ls -l
total 5120000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile

$ du -m myfile*
5000 myfile

$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2

If we were to reflink it 4 times, we would expect the directory listing to report five 5GB files, but
the df(1) to report no loss of available space. du(1), on the other hand, would report the disk usage
to climb to 25GB.

$ reflink myfile myfile-ref1
$ reflink myfile myfile-ref2
$ reflink myfile myfile-ref3
$ reflink myfile myfile-ref4

$ ls -l
total 25600000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4

$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2

$ du -m myfile*
5000 myfile
5000 myfile-ref1
5000 myfile-ref2
5000 myfile-ref3
5000 myfile-ref4
25000 total

Enter shared-du(1), a shared extent-aware du. This utility reports the shared extents per file in
parentheses and the overall footprint. As expected, it lists the overall footprint at 5GB. One can
view the details of the extents using shared-filefrag(1). Both these utilities are available at
http://oss.oracle.com/~smushran/reflink-tools/. We are currently in the process of pushing the
changes to the upstream maintainers of these utilities.

$ shared-du -m -c --shared-size myfile*
5000 (5000) myfile
5000 (5000) myfile-ref1
5000 (5000) myfile-ref2
5000 (5000) myfile-ref3
5000 (5000) myfile-ref4
25000 total
5000 footprint

# shared-filefrag -v myfile
Filesystem type is: 7461636f
File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 2247937 8448
1 8448 2257921 2256384 30720
2 39168 2290177 2288640 30720
3 69888 2322433 2320896 30720
4 100608 2354689 2353152 30720
7 192768 2451457 2449920 30720
. . .
37 1073408 2032129 2030592 30720 shared
38 1104128 2064385 2062848 30720 shared
39 1134848 2096641 2095104 30720 shared
40 1165568 2128897 2127360 30720 shared
41 1196288 2161153 2159616 30720 shared
42 1227008 2193409 2191872 30720 shared
43 1257728 2225665 2224128 22272 shared,eof
myfile: 44 extents found

DATA COHERENCY
One of the challenges in a shared file system is data coherency when multiple nodes are writing to
the same set of files. NFS, for example, provides close-to-open data coherency that results in the
data being flushed to the server when the file is closed on the client. This leaves open a wide win-
dow for stale data being read on another node.

A simple test to check the data coherency of a shared file system involves concurrently appending
the same file. Like running "uname -a >>/dir/file" using a parallel distributed shell like dsh or
pconsole. If coherent, the file will contain the results from all nodes.

# dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"

# cat /ocfs2/test
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

OCFS2 is a fully cache coherent cluster file system.

DISCONTIGUOUS BLOCK GROUP
Most file systems pre-allocate space for inodes during format. OCFS2 dynamically allocates this
space when required.

However, this dynamic allocation has been problematic when the free space is very fragmented,
because the file system required the inode and extent allocators to grow in contiguous fixed-size
chunks.

The discontiguous block group feature takes care of this problem by allowing the allocators to

grow in smaller, variable-sized chunks.

This feature was added in Linux kernel 2.6.35 and requires enabling on-disk feature discontig-bg.

BACKUP SUPER BLOCKS
A file system super block stores critical information that is hard to recreate. In OCFS2, it stores
the block size, cluster size, and the locations of the root and system directories, among other
things. As this block is close to the start of the disk, it is very susceptible to being overwritten by
an errant write. Say, dd if=file of=/dev/sda1.

Backup super blocks are copies of the super block. These blocks are dispersed in the volume to
minimize the chances of being overwritten. On the small chance that the original gets corrupted,
the backups are available to scan and fix the corruption.

mkfs.ocfs2(8) enables this feature by default. Users can disable this by specifying
--fs-features=nobackup-super during format.

o2info(1) can be used to view whether the feature has been enabled on a device.

# o2info --fs-features /dev/sdb1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten

In OCFS2, the super block is on the third block. The backups are located at the 1G, 4G, 16G,
64G, 256G and 1T byte offsets. The actual number of backup blocks depends on the size of
the device. The super block is not backed up on devices smaller than 1GB.

fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6. Users can specify any backup with the
-r option to recover the volume. The example below uses the second backup. If successful,
fsck.ocfs2(8) overwrites the corrupted super block with the backup.

# fsck.ocfs2 -f -r 2 /dev/sdb1
fsck.ocfs2 1.8.0
[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
Checking OCFS2 filesystem in /dev/sdb1:
Label: webhome
UUID: B3E021A2A12B4D0EB08E9E986CDC7947
Number of blocks: 13107196
Block size: 4096
Number of clusters: 13107196
Cluster size: 4096
Number of slots: 8

/dev/sdb1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.

SYNTHETIC FILE SYSTEMS
The OCFS2 development effort included two synthetic file systems, configfs and dlmfs. It also
makes use of a third, debugfs.

configfs
configfs has since been accepted as a generic kernel component and is also used by net-
console and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the cluster,
details of the heartbeat device, cluster timeouts, and so on to the in-kernel node manager.
The o2cb init script mounts this file system at /sys/kernel/config.

dlmfs dlmfs exposes the in-kernel o2dlm to user space. While it was developed primarily
for OCFS2 tools, it has seen usage by others looking to add a cluster locking dimension
in their applications. Users interested in doing the same should look at the libo2dlm
library provided by ocfs2-tools. The o2cb init script mounts this file system at /dlm.

debugfs
OCFS2 uses debugfs to expose its in-kernel information to user space. For example, list-
ing the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can access
the information by mounting the file system at /sys/kernel/debug. To automount, add the
following to /etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0 0

DISTRIBUTED LOCK MANAGER
One of the key technologies in a cluster is the lock manager, which maintains the locking state of
all resources across the cluster. An easy implementation of a lock manager involves designating
one node to handle everything. In this model, if a node wanted to acquire a lock, it would send the
request to the lock manager. However, this model has a weakness: the lock manager's death causes
the cluster to seize up.

A better model is one where all nodes manage a subset of the lock resources. Each node maintains
enough information for all the lock resources it is interested in. In the event of a node death, the
remaining nodes pool the information to reconstruct the lock state maintained by the dead node.
In this scheme, the locking overhead is distributed amongst all the nodes. Hence, the term distrib-
uted lock manager.

O2DLM is a distributed lock manager. It is based on the specification titled "Programming Lock-
ing Applications" written by Kristin Thomas and is available at the following link.
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

DLM DEBUGGING
O2DLM has a rich debugging infrastructure that allows it to show the state of the lock manager,
all the lock resources, among other things. The figure below shows the dlm state of a nine-node
cluster that has just lost three nodes: 12, 32, and 35. It can be ascertained that node 7, the recovery
master, is currently recovering node 12 and has received the lock states of the dead node from all
other live nodes.

# cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001 Key: 0x10748e61
Thread Pid: 24542 Node: 7 State: JOINED
Number of Joins: 1 Joining Node: 255
Domain Map: 7 31 33 34 40 50
Live Map: 7 31 33 34 40 50
Lock Resources: 48850 (439879)

MLEs: 0 (1428625)
Blocking: 0 (1066000)
Mastery: 0 (362625)
Migration: 0 (0)
Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty
Purge Count: 0 Refs: 1
Dead Node: 12
Recovery Pid: 24543 Master: 7 State: ACTIVE
Recovery Map: 12 32 35
Recovery Node State:
7 - DONE
31 - DONE
33 - DONE
34 - DONE
40 - DONE
50 - DONE

The figure below shows the state of a dlm lock resource that is mastered (owned) by node 25, with
6 locks in the granted queue and node 26 holding the EX (writelock) lock on that resource.

# debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1


Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
Refs: 8 Locks: 6 On Lists: None
Reference Map: 26 27 28 94 95
Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
Granted 94 NL -1 94:3169409 2 No No None
Granted 28 NL -1 28:3213591 2 No No None
Granted 27 NL -1 27:3216832 2 No No None
Granted 95 NL -1 95:3178429 2 No No None
Granted 25 NL -1 25:3513994 2 No No None
Granted 26 EX -1 26:3512906 2 No No None

The figure below shows a lock from the file system perspective. Specifically, it shows a lock that is
in the process of being upconverted from a NL to EX. Locks in this state are referred to in the
file system as busy locks and can be listed using the debugfs.ocfs2 command, "fs_locks -B".

# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1


Lockres: M000000000000000000000b9aba12ec Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
Pending Action: Convert Pending Unlock Action: None
Requested Mode: Exclusive Blocking Mode: No Lock
PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
EX > Gets: 1 Fails: 0 Waits Total: 544us Max: 544us Avg: 544185ns
Disk Refreshes: 1

With this debugging infrastructure in place, users can debug hang issues as follows:

* Dump the busy fs locks for all the OCFS2 volumes on the node with hanging processes. If
no locks are found, then the problem is not related to O2DLM.

* Dump the corresponding dlm lock for all the busy fs locks. Note down the owner (master)
of all the locks.

* Dump the dlm locks on the master node for each lock.

At this stage, one should note that the hanging node is waiting to get an AST from the master. The
master, on the other hand, cannot send the AST until the current holder has down converted that
lock, which it will do upon receiving a Blocking AST. However, a node can only down convert if
all the lock holders have stopped using that lock. After dumping the dlm lock on the master node,
identify the current lock holder and dump both the dlm and fs locks on that node.

The trick here is to see whether the Blocking AST message has been relayed to file system. If not,
the problem is in the dlm layer. If it has, then the most common reason would be a lock holder, the
count for which is maintained in the fs lock.

At this stage, printing the list of processes helps.

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

Make a note of all D state processes. At least one of them is responsible for the hang on the first
node.

The challenge then is to figure out why those processes are hanging. Failing that, at least get
enough information (like alt-sysrq t output) for the kernel developers to review. What to do next
depends on where the process is hanging. If it is waiting for the I/O to complete, the problem
could be anywhere in the I/O subsystem, from the block device layer through the drivers to the
disk array. If the hang concerns a user lock (flock(2)), the problem could be in the user's applica-
tion. A possible solution could be to kill the holder. If the hang is due to tight or fragmented mem-
ory, free up some memory by killing non-essential processes.

The thing to note is that the symptom for the problem was on one node but the cause is on another.
The issue can only be resolved on the node holding the lock. Sometimes, the best solution will be
to reset that node. Once killed, the O2DLM recovery process will clear all locks owned by the
dead node and let the cluster continue to operate. As harsh as that sounds, at times it is the only
solution. The good news is that, by following the trail, you now have enough information to file a
bug and get the real issue resolved.

NFS EXPORTING
OCFS2 volumes can be exported as NFS volumes. This support is limited to NFS version 3, which
translates to Linux kernel version 2.4 or later.

If the version of the Linux kernel on the system exporting the volume is older than 2.6.30, then the
NFS clients must mount the volumes using the nordirplus mount option. This disables the READ-
DIRPLUS RPC call to work around a bug in NFSD, detailed in the following link:

http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
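
A hypothetical client-side mount using that option (the server path and mount point are illustrative):

# mount -t nfs -o vers=3,nordirplus server:/ocfs2 /mnt/ocfs2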

Users running NFS version 2 can export the volume after having disabled subtree checking (mount
option no_subtree_check). Be warned, disabling the check has security implications (documented
in the exports(5) man page) that users must evaluate on their own.

FILE SYSTEM LIMITS
OCFS2 has no intrinsic limit on the total number of files and directories in the file system. In gen-
eral, it is only limited by the size of the device. But there is one limit imposed by the current
filesystem. It can address at most four billion clusters. A file system with 1MB cluster size can go
up to 4PB, while a file system with a 4KB cluster size can address up to 16TB.
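
The limit follows directly from the 32-bit cluster addressing:

maximum volume size = 2^32 clusters x cluster size
                      2^32 x 4KB = 16TB
                      2^32 x 1MB = 4PB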

SYSTEM OBJECTS
The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as system
files. These are grouped in a system directory. These files and directories are not accessible via the
file system interface but can be viewed using the debugfs.ocfs2(8) tool.

To list the system directory (referred to as double-slash), do:

# debugfs.ocfs2 -R "ls -l //" /dev/sde1


66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 .
66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 ..
67 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 bad_blocks
68 -rw-r--r-- 1 0 0 1179648 19-Jul-2011 13:36 global_inode_alloc
69 -rw-r--r-- 1 0 0 4096 19-Jul-2011 14:35 slot_map
70 -rw-r--r-- 1 0 0 1048576 19-Jul-2011 13:36 heartbeat
71 -rw-r--r-- 1 0 0 53686960128 19-Jul-2011 13:36 global_bitmap
72 drwxr-xr-x 2 0 0 3896 25-Jul-2011 15:05 orphan_dir:0000
73 drwxr-xr-x 2 0 0 3896 19-Jul-2011 13:36 orphan_dir:0001
74 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0000
75 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0001
76 -rw-r--r-- 1 0 0 121634816 19-Jul-2011 13:36 inode_alloc:0000
77 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 inode_alloc:0001
78 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:36 journal:0000
79 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:37 journal:0001
80 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0000
81 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0001
82 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0000
83 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0001

The file names that end with numbers are slot specific and are referred to as node-local system
files. The set of node-local files used by a node can be determined from the slot map. To list the
slot map, do:

# debugfs.ocfs2 -R "slotmap" /dev/sde1


Slot# Node#
0 32
1 35
2 40
3 31
4 34
5 33

For more information, refer to the OCFS2 support guides available in the Documentation section
at http://oss.oracle.com/projects/ocfs2.

HEARTBEAT, QUORUM, AND FENCING
Heartbeat is an essential component in any cluster. It is charged with accurately designating nodes
as dead or alive. A mistake here could lead to a cluster hang or a corruption.

o2hb is the disk heartbeat component of o2cb. It periodically updates a timestamp on disk, indicat-
ing to others that this node is alive. It also reads all the timestamps to identify other live nodes.
Other cluster components, like o2dlm and o2net, use the o2hb service to get node up and down
events.

The quorum is the group of nodes in a cluster that is allowed to operate on the shared storage.
When there is a failure in the cluster, nodes may be split into groups that can communicate in their
groups and with the shared storage but not between groups. o2quo determines which group is
allowed to continue and initiates fencing of the other group(s).

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will
fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that
other nodes won't be stuck trying to access its resources.

o2cb uses a machine reset to fence. This is the quickest route for the node to rejoin the cluster.

PROCESSES

[o2net] One per node. It is a work-queue thread started when the cluster is brought on-line and
stopped when it is off-lined. It handles network communication for all mounts. It gets the
list of active nodes from O2HB and sets up a TCP/IP communication channel with each
live node. It sends regular keep-alive packets to detect any interruption on the channels.

[user_dlm]
One per node. It is a work-queue thread started when dlmfs is loaded and stopped when it
is unloaded (dlmfs is a synthetic file system that allows user space processes to access the
in-kernel dlm).

[ocfs2_wq]
One per node. It is a work-queue thread started when the OCFS2 module is loaded and
stopped when it is unloaded. It is assigned background file system tasks that may take
cluster locks like flushing the truncate log, orphan directory recovery and local alloc
recovery. For example, orphan directory recovery runs in the background so that it does
not affect recovery time.

[o2hb-14C29A7392]
One per heartbeat device. It is a kernel thread started when the heartbeat region is popu-
lated in configfs and stopped when it is removed. It writes every two seconds to a block in
the heartbeat region, indicating that this node is alive. It also reads the region to maintain
a map of live nodes. It notifies subscribers like o2net and o2dlm of any changes in the
live node map.

[ocfs2dc]
One per mount. It is a kernel thread started when a volume is mounted and stopped when
it is unmounted. It downgrades locks in response to blocking ASTs (BASTs) requested by
other nodes.

[jbd2/sdf1-97]
One per mount. It is part of JBD2, which OCFS2 uses for journaling.

[ocfs2cmt]
One per mount. It is a kernel thread started when a volume is mounted and stopped when
it is unmounted. It works with kjournald2.

[ocfs2rec]
It is started whenever a node has to be recovered. This thread performs file system recovery
by replaying the journal of the dead node. It is scheduled to run after dlm recovery has
completed.

[dlm_thread]
One per dlm domain. It is a kernel thread started when a dlm domain is created and
stopped when it is destroyed. This thread sends ASTs and blocking ASTs in response to
lock level convert requests. It also frees unused lock resources.

[dlm_reco_thread]
One per dlm domain. It is a kernel thread that handles dlm recovery when another node
dies. If this node is the dlm recovery master, it re-masters every lock resource owned by
the dead node.

[dlm_wq]
One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking tasks.

FUTURE WORK
File system development is a never-ending cycle. Faster and larger disks, faster and more numerous
processors, larger caches, etc. keep changing the sweet spot for performance, forcing developers
to rethink long held beliefs. Add to that new use cases, which force developers to be innovative in
providing solutions that meld seamlessly with existing semantics.

We are currently looking to add features like transparent compression, transparent encryption,
delayed allocation, multi-device support, etc. as well as work on improving performance on newer
generation machines.

If you are interested in contributing, email the development team at ocfs2-devel@oss.oracle.com.

ACKNOWLEDGEMENTS
The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack, are Joel Becker,
Zach Brown, Mark Fasheh, Jan Kara, Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.

Other developers who have contributed to the file system via bug fixes, testing, etc. are Wim Coekaerts,
Srinivas Eeda, Coly Li, Jeff Mahoney, Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wen-
gang Wang.

The members of the Linux Cluster community including Andrew Beekhof, Lars Marowsky-Bree, Fabio
Massimo Di Nitto and David Teigland.

The members of the Linux File system community including Christoph Hellwig and Chris Mason.

The corporations that have contributed resources for this project including Oracle, SUSE Labs, EMC,
Emulex, HP, IBM, Intel and Network Appliance.

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8)
o2cluster(8) o2image(8) o2info(1) o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.clus-
ter.conf(5) tunefs.ocfs2(8)

AUTHOR
Oracle Corporation


COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 23


o2cb(7) OCFS2 Manual Pages o2cb(7)

NAME
o2cb − Default cluster stack of the OCFS2 file system.
SYNOPSIS
o2cb is the default cluster stack of the OCFS2 file system. It is an in-kernel cluster stack that includes a
node manager (o2nm) to keep track of the nodes in the cluster, a disk heartbeat agent (o2hb) to detect node
live-ness, a network agent (o2net) for intra-cluster node communication and a distributed lock manager
(o2dlm) to keep track of lock resources. It also includes a synthetic file system, dlmfs, to allow applica-
tions to access the in-kernel dlm.

CONFIGURATION
The stack is configured using the o2cb(8) cluster configuration utility and operated (online/offline/status)
using the o2cb init service.

CLUSTER CONFIGURATION

It has two configuration files. One for the cluster layout (/etc/ocfs2/cluster.conf) and the other for
the cluster timeouts, etc. (/etc/sysconfig/o2cb). More information about these two files can be
found in ocfs2.cluster.conf(5) and o2cb.sysconfig(5).

The o2cb cluster stack supports two heartbeat modes, namely, local and global. Only one heart-
beat mode can be active at any one time.

Local heartbeat refers to disk heartbeating on all shared devices. In this mode, the heartbeat is
started during mount and stopped during umount. This mode is easy to set up as it does not require
configuring heartbeat devices. The one drawback in this mode is the overhead on servers having a
large number of OCFS2 mounts. For example, a server with 50 mounts will have 50 heartbeat
threads. This is the default heartbeat mode.
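
For instance, on a node with several local heartbeat mounts, the per-device heartbeat threads can be
observed as follows (the thread names shown are illustrative):

# ps -e -o comm | grep o2hb
o2hb-14C29A7392
o2hb-77D95EF51C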

Global heartbeat, on the other hand, refers to heartbeating on specific shared devices. These
devices are normal OCFS2 formatted volumes that could also be mounted and used as clustered
file systems. In this mode, the heartbeat is started during cluster online and stopped during cluster
offline. While this mode can be used for all clusters, it is strongly recommended for clusters having
a large number of mounts.

More information on disk heartbeat is provided below.

KERNEL CONFIGURATION

Two sysctl values need to be set for o2cb to function properly. The first, panic_on_oops, must be
enabled to turn a kernel oops into a panic. If a kernel thread required for o2cb to function crashes,
the system must be reset to prevent a cluster hang. If it is not set, another node may not be able to
distinguish whether a node is unable to respond or slow to respond.

The other related sysctl parameter is panic, which specifies the number of seconds after a panic
that the system will be auto-reset. Setting this parameter to zero disables autoreset; the cluster will
require manual intervention. This is not preferred in a cluster environment.

To manually enable panic on oops and set a 30 sec timeout for reboot on panic, do:

# echo 1 > /proc/sys/kernel/panic_on_oops


# echo 30 > /proc/sys/kernel/panic

To enable the above on every boot, add the following to /etc/sysctl.conf:


kernel.panic_on_oops = 1
kernel.panic = 30

OS CONFIGURATION

The o2cb cluster stack also requires iptables (firewalling) to be either disabled or modified to
allow network traffic on the private network interface. The port used by o2cb is specified in
/etc/ocfs2/cluster.conf.
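
For example, assuming the default port 7777 and a private interconnect on eth1 (both illustrative; the
actual port is the one listed in cluster.conf), a rule like the following could be added:

# iptables -A INPUT -i eth1 -p tcp --dport 7777 -j ACCEPT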

DISK HEARTBEAT
O2CB uses disk heartbeat to detect node liveness. The disk heartbeat thread, o2hb, periodically reads and
writes to a heartbeat file in an OCFS2 file system. Its write payload contains a sequence number that it incre-
ments in each write. This allows other nodes reading the same heartbeat file to detect the change and asso-
ciate that with a live node. Conversely, a node whose sequence number has stopped changing is marked as
a possible dead node. Possible, not confirmed, because the cause could just be slow I/Os.

To differentiate between a dead node and one that has slow I/Os, O2CB has a disk heartbeat threshold
(timeout). Only nodes whose sequence number has not incremented for that duration are marked dead.

However, that node may not be dead but just experiencing slow I/O. To handle this, the heartbeat thread
keeps track of the time elapsed since its last completed write. If that time exceeds the timeout, it forces a
self-fence. It does so to prevent other nodes from marking it as dead while it is still alive.

This self-fencing scheme has proven to be very reliable as it relies on kernel timers and pci bus reset. Exter-
nal fencing, while attractive, is rarely as reliable as it relies on external hardware and software that is prone
to failure due to misconfiguration, etc.

Having said that, O2CB disk heartbeat has had its share of problems with self-fencing. Nodes experiencing
slow I/O on just one of multiple devices have had to initiate a self-fence.

This is because in the default local heartbeat scheme, nodes in a cluster may not be heartbeating on the
same set of devices.

The global heartbeat mode addresses this shortcoming by introducing a scheme that forces all nodes to
heartbeat on the same set of devices. In this scheme, a node experiencing a slowdown in I/O on a device
may not need to initiate self-fence. It will only have to do so if it encounters slowdown on 50% or more of
the heartbeat devices. In a cluster with 3 heartbeat regions, a slowdown in 1 region will be tolerated. In a
cluster with 5 regions, a slowdown in 2 will be tolerated.

It is for this reason that this mode is recommended for users that have 3 or more OCFS2 mounts.

O2CB allows up to 32 heartbeat regions to be configured in the global heartbeat mode.

ONLINE CLUSTER MODIFICATION


The O2CB cluster stack allows adding and removing nodes in an online cluster when run in the global
heartbeat mode. Use the o2cb(8) utility to make the changes in the configuration and (re)online the cluster
using the o2cb init script. The user must do the same on all nodes in the cluster. The cluster will not allow
any new cluster mounts if the node configuration on all nodes is not the same.

The removal of nodes will only succeed if that node is no longer in use. If the user removes an active node
from the configuration, the re-online will fail.

The cluster stack also allows adding and removing heartbeat regions in an online cluster. Use the o2cb(8)
utility to make the changes in the configuration file and (re)online the cluster using the o2cb init script. The
user must do the same on all nodes in the cluster. The cluster will not allow any new cluster mounts if the
heartbeat region configuration on all nodes is not the same.

The removal of heartbeat regions will only succeed if the active heartbeat region count is greater than 3.
This is to protect against edge conditions that can destabilize the cluster.
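
As a sketch, adding a node to an online cluster running in global heartbeat mode might look as follows,
repeated on every node in the cluster (the node name, IP address and number are illustrative):

$ o2cb add-node webcluster node11 --ip 192.168.0.111 --number 11
# service o2cb online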

GETTING STARTED
The first step in configuring o2cb is deciding whether to set up local or global heartbeat. If global heartbeat,
then one has to format at least one heartbeat device.

To format an OCFS2 volume with global heartbeat enabled, do:

# mkfs.ocfs2 --cluster-stack=o2cb --cluster-name=webcluster --global-heartbeat -L "hbvol1" /dev/sdb1

Once formatted, set up /etc/ocfs2/cluster.conf following the example provided in ocfs2.cluster.conf(5).

If local heartbeat, then one can set up cluster.conf without any heartbeat devices. The next step is starting
the cluster.

To online the cluster stack, do:

# service o2cb online


Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK

Once the cluster stack is online, new OCFS2 volumes can be formatted normally without specifying the
cluster stack information. mkfs.ocfs2(8) will pick up that information automatically.

# mkfs.ocfs2 -L "datavol" /dev/sdc1

Meanwhile existing volumes can be converted to the new cluster stack using the tunefs.ocfs2(8) utility.

# tunefs.ocfs2 --update-cluster-stack /dev/sdd1


Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM
BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y

Another utility, mounted.ocfs2(8), is useful for listing all the OCFS2 volumes along with the cluster stack
information.

To get a list of OCFS2 volumes, do:

# mounted.ocfs2 -d
Device Stack Cluster F UUID Label
/dev/sdb1 o2cb webcluster G DCDA2845177F4D59A0F2DCD8DE507CC3 hbvol1
/dev/sdc1 None 23878C320CF3478095D1318CB5C99EED localmount
/dev/sdd1 o2cb webcluster G 8AB016CD59FC4327A2CDAB69F08518E3 webvol
/dev/sdg1 o2cb webcluster G 77D95EF51C0149D2823674FCC162CF8B logsvol
/dev/sdh1 o2cb webcluster G BBA1DBD0F73F449384CE75197D9B7098 scratch

The o2cb init script can also be used to check the status of the cluster, offline the cluster, etc.

To check the status of the cluster stack, do:

# service o2cb status


Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
Heartbeat dead threshold: 62
Network idle timeout: 60000
Network keepalive delay: 2000
Network reconnect delay: 2000
Heartbeat mode: Global
Checking O2CB heartbeat: Active
77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
DCDA2845177F4D59A0F2DCD8DE507CC3 /dev/sdk1
BBA1DBD0F73F449384CE75197D9B7098 /dev/sdh1
Nodes in O2CB cluster: 6 7 10
Active userdlm domains: ovm

To offline and unload the cluster stack, do:

# service o2cb offline


Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

# service o2cb unload


Clean userdlm domains: OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unloading module "ocfs2_stack_o2cb": OK

SEE ALSO
o2cb(8) o2cb.sysconfig(5) ocfs2.cluster.conf(5) o2hbmonitor(8)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2011 Oracle. All rights reserved.

Version 1.8.2 August 2011 4


o2cb(8) OCFS2 Manual Pages o2cb(8)

NAME
o2cb − Cluster registration utility for the O2CB cluster stack.
SYNOPSIS
o2cb [--config-file=path] [-h|--help] [-v|--verbose] [-V|--version] COMMAND [ARGS]

DESCRIPTION
o2cb(8) is used to add, remove and list the information in the O2CB cluster configuration file. This utility
is also used to register and unregister the cluster, as well as start and stop global heartbeat.

The default location of the configuration file, /etc/ocfs2/cluster.conf, can be overridden using the --config-
file option.

OPTIONS
--config-file config-file
Specify a path to the configuration file. If not provided, it will use the default path of
/etc/ocfs2/cluster.conf.

-v, --verbose
Verbose mode.

-h, --help
Help.

-V, --version
Show version and exit.

O2CB COMMANDS
add-cluster cluster-name
Adds a cluster to the configuration file. The O2CB configuration file can hold multiple clusters.
However, only one cluster can be active at any time.

remove-cluster cluster-name
Removes a cluster from the configuration file. This command removes all the nodes and heartbeat
regions assigned to the cluster.

add-node cluster-name node-name [--ip ip-address] [--port port] [--number node-number]


Adds a node to the cluster in the configuration file. It accepts three optional arguments. If not pro-
vided, the ip-address defaults to the one assigned to the node-name, port to 7777, and node-num-
ber to the lowest unused node number.

remove-node cluster-name node-name


Removes a node from the cluster in the configuration file.

add-heartbeat cluster-name [uuid|device]


Adds a heartbeat region to the cluster in the configuration file.

remove-heartbeat cluster-name [uuid|device]


Removes a heartbeat region from the cluster in the configuration file.


heartbeat-mode cluster-name [local|global]


Sets the heartbeat mode for the cluster in the configuration file.

list-clusters
Lists all the cluster names in the configuration file.

list-cluster cluster-name --oneline


Lists all the nodes and heartbeat regions associated with the cluster in the configuration file.

list-nodes cluster-name --oneline


Lists all the nodes associated with the cluster in the configuration file.

list-heartbeats cluster-name --oneline


Lists all the heartbeat regions associated with the cluster in the configuration file.

register-cluster cluster-name
Registers the cluster listed in the configuration file with configfs. If called when the cluster is
already registered, it will update configfs with the current configuration.

unregister-cluster cluster-name
Unregisters the cluster from configfs.

start-heartbeat cluster-name
Starts global heartbeat on all regions for the cluster as listed in the configuration file. If repeated, it
will start heartbeat on new regions and stop it on regions since removed. It will silently exit if global
heartbeat has not been enabled.

stop-heartbeat cluster-name
Stops global heartbeat on all regions for the cluster. It will silently exit if global heartbeat has not
been enabled.

cluster-status [cluster-name]
Shows whether the given cluster is offline or online. If no cluster is provided, it shows the cur-
rently active cluster, if any.

EXAMPLE
To create a cluster, mycluster, having two nodes, node1 and node2, do:

$ o2cb add-cluster mycluster


$ o2cb add-node mycluster node1 --ip 10.10.10.1
$ o2cb add-node mycluster node2 --ip 10.10.10.2

To specify a global heartbeat device, /dev/sda1, do:

$ o2cb add-heartbeat mycluster /dev/sda1

To enable global heartbeat, do:

$ o2cb heartbeat-mode mycluster global
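
To then register the cluster with configfs and check its status, one might do:

$ o2cb register-cluster mycluster
$ o2cb cluster-status mycluster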


SEE ALSO
o2cb(7) o2cb.sysconfig(5) ocfs2.cluster.conf(5)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2010, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3


/etc/ocfs2/cluster.conf(5) OCFS2 Manual Pages /etc/ocfs2/cluster.conf(5)

NAME
/etc/ocfs2/cluster.conf − Cluster configuration file for the o2cb cluster stack.
SYNOPSIS
The cluster layout of the o2cb cluster stack is specified in /etc/ocfs2/cluster.conf. It lists the name of the
cluster, the nodes comprising that cluster and its heartbeat regions. The cluster stack expects this file to be
the same on all nodes in that cluster.

This file should be populated using the o2cb(8) cluster configuration utility. A sample of the same is shown
in the example section.

DESCRIPTION
The configuration file is divided into three types of stanzas, each with a list of parameters and values. The
three stanza types are cluster, node and heartbeat. While a configuration file can store definitions of multi-
ple clusters, the o2cb cluster stack allows only one cluster to be active at any one time. The name of this
active cluster is stored in /etc/sysconfig/o2cb [o2cb.sysconfig(5)].

The cluster stanza specifies the name of the cluster, the number of nodes and the heartbeat mode. The cluster
name can include up to 16 alphanumeric characters [0-9A-Za-z]. No special characters are allowed.

Parameters Description
node_count Number of nodes in the cluster
heartbeat_mode local or global heartbeat
name Cluster name (up to 16 alphanumeric chars [0-9A-Za-z])

The node stanza specifies the node name that is part of the cluster along with its IPv4 address, port and
node number. The node name must match the hostname. The domain name is not required. For example,
appserver1.company.com can be appserver1. The IPv4 address need not be the one associated with that
hostname. That is, any valid IPv4 address on that node can be used. The o2cb cluster stack will not attempt
to match the node name (hostname) with the specified IPv4 address. A low-latency private interconnect
address is recommended for best performance.

Parameters Description
ip_port IPv4 port
ip_address IPv4 address (private interconnect recommended)
number Node number (0 - 254)
name Node name (hostname without the domain name)
cluster Cluster name (should match the name in the cluster stanza)

The heartbeat stanza specifies the global heartbeat region UUIDs. A cluster can have up to 32 heartbeat
regions. This is an optional stanza and is only required if the global heartbeat mode is enabled. In other
words, the regions are only used if heartbeat_mode = global is in the cluster stanza. If not, this stanza is
ignored.

Parameters Description
region Heartbeat region UUID
cluster Cluster name (should match the name in the cluster stanza)


While manual editing is not recommended, users doing so must follow the format strictly. The stanza
should start at the first column and end with a colon. The parameters must start after a tab. A blank line
must demarcate each stanza. Care should be taken to avoid stray whitespace.

EXAMPLE
The example below illustrates populating a cluster.conf with a cluster called webcluster, having 3 nodes and
3 global heartbeat regions, using the o2cb(8) utility.

$ o2cb add-cluster webcluster

$ o2cb add-node webcluster node7 --ip 192.168.0.107 --number 7


$ o2cb add-node webcluster node6 --ip 192.168.0.106 --number 6
$ o2cb add-node webcluster node10 --ip 192.168.0.110 --number 10

$ o2cb add-heartbeat webcluster /dev/sdg1


$ o2cb add-heartbeat webcluster /dev/sdk1
$ o2cb add-heartbeat webcluster /dev/sdh1

$ o2cb heartbeat-mode webcluster global

$ o2cb list-cluster webcluster


heartbeat:
region = 77D95EF51C0149D2823674FCC162CF8B
cluster = webcluster

heartbeat:
region = DCDA2845177F4D59A0F2DCD8DE507CC3
cluster = webcluster

heartbeat:
region = BBA1DBD0F73F449384CE75197D9B7098
cluster = webcluster

node:
ip_port = 7777
ip_address = 192.168.0.107
number = 7
name = node7
cluster = webcluster

node:
ip_port = 7777
ip_address = 192.168.0.106
number = 6
name = node6
cluster = webcluster

node:
ip_port = 7777
ip_address = 192.168.0.110
number = 10
name = node10
cluster = webcluster


cluster:
node_count = 3
heartbeat_mode = global
name = webcluster

SEE ALSO
o2cb(7) o2cb(8) o2cb.sysconfig(5)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3


/etc/sysconfig/o2cb(5) OCFS2 Manual Pages /etc/sysconfig/o2cb(5)

NAME
/etc/sysconfig/o2cb − Cluster configuration file for the o2cb cluster stack.
SYNOPSIS
The configuration file /etc/sysconfig/o2cb stores the active cluster stack, its name and the various cluster
timeouts for the o2cb cluster stack.

DESCRIPTION
This file can be populated using the o2cb init script. An example of the same is illustrated in the examples
section.

The list of configurable parameters in this file are:

O2CB_STACK
Name of the cluster stack. The possible values are o2cb, pcmk and cman. o2cb is the default
cluster stack of the OCFS2 file system. pcmk (Pacemaker) and cman (rgmanager) are the two
other cluster stacks that are supported by the same file system.

O2CB_BOOTCLUSTER
Name of the active cluster. While /etc/ocfs2/cluster.conf can hold descriptions of multiple clusters,
only one can be active at any one time. The name of that active cluster is specified here. The name
itself can be up to 16 alphanumeric characters [0-9A-Za-z] with no special characters.

The remaining configurable parameters (cluster timeouts) are only relevant for the o2cb cluster stack. These
cluster timeouts are used by the o2cb cluster stack to determine whether a node is dead or alive. The default
timeouts are just a guide and may need to be tweaked depending on the hardware the software is running
on.

The various cluster timeouts for the o2cb cluster stack are:
O2CB_HEARTBEAT_THRESHOLD
The disk heartbeat timeout is the number of two second iterations before a node is considered
dead. The exact formula used to convert the timeout in seconds to the number of iterations is as
follows:

O2CB_HEARTBEAT_THRESHOLD = (((timeout in seconds) / 2) + 1)

For example, to specify a 60 sec timeout, set it to 31. For 120 secs, set it to 61. The default for this
timeout is 60 secs (O2CB_HEARTBEAT_THRESHOLD = 31).

While it defaults to 60 secs, multipath users typically set it to 120 secs.

O2CB_IDLE_TIMEOUT_MS
The network idle timeout specifies the time in milliseconds before a network connection is consid-
ered dead. While it defaults to 30000 ms, network bonding users typically set it to 60000 ms.

O2CB_KEEPALIVE_DELAY_MS
The network keepalive specifies the maximum delay in milliseconds before a keepalive packet is
sent to another node to check whether it is alive or not. It defaults to 2000 ms.

O2CB_RECONNECT_DELAY_MS
The network reconnect specifies the minimum delay in milliseconds between repeated connect
attempts. It defaults to 2000 ms.


EXAMPLE
The example below illustrates populating the o2cb sysconfig file using the o2cb init script.

$ service o2cb configure


Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.


The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets (’[]’). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [n]: y


Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]: webcluster
Specify heartbeat dead threshold (>=7) [31]: 62
Specify network idle timeout in ms (>=5000) [30000]: 60000
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK

$ cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# ’dpkg-reconfigure ocfs2-tools’.
#

# O2CB_ENABLED: ’true’ means to load the driver on boot.


O2CB_ENABLED=true

# O2CB_STACK: The name of the cluster stack backing O2CB.


O2CB_STACK=o2cb

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.


O2CB_BOOTCLUSTER=webcluster

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.


O2CB_HEARTBEAT_THRESHOLD=62

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.


O2CB_IDLE_TIMEOUT_MS=60000

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent


O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts


O2CB_RECONNECT_DELAY_MS=2000


SEE ALSO
o2cb(7) o2cb(8) ocfs2.cluster.conf(5)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3


mkfs.ocfs2(8) OCFS2 Manual Pages mkfs.ocfs2(8)

NAME
mkfs.ocfs2 − Creates an OCFS2 file system.
SYNOPSIS
mkfs.ocfs2 [−b block−size] [−C cluster−size] [−L volume−label] [−M mount-type] [−N num-
ber−of−nodes] [−J journal−options] [−−fs−features=[no]sparse...] [−−fs−feature−level=feature−level]
[−T filesystem−type] [−−cluster−stack=stackname] [−−cluster−name=clustername] [−−global−heart-
beat] [−FqvV] device [blocks-count]
DESCRIPTION
mkfs.ocfs2 is used to create an OCFS2 file system on a device, usually a partition on a shared disk. In order
to prevent data loss, mkfs.ocfs2 will not format an existing OCFS2 volume if it detects that it is mounted
on another node in the cluster. This tool requires the cluster service to be online.

OPTIONS
−b, −−block−size block−size
Valid block size values are 512, 1K, 2K and 4K bytes per block. If omitted, a value will be heuris-
tically determined based on the expected usage of the file system (see the −T option). A block size
of 512 bytes is never recommended. Choose 1K, 2K or 4K.

−C, −−cluster−size cluster−size


Valid cluster size values are 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. If omitted, a
value will be heuristically determined based on the expected usage of the file system (see the −T
option). For volumes expected to store large files, like database files, a cluster size of 128K or
more is recommended, though one can opt for a smaller size as long as that value is not smaller than
the database block size. For others, use 4K.

−F, −−force
For existing OCFS2 volumes, mkfs.ocfs2 ensures the volume is not mounted on any node in the
cluster before formatting. For that to work, mkfs.ocfs2 expects the cluster service to be online.
Specify this option to disable this check.

−J, −−journal-options options


Create the journal using options specified on the command−line. Journal options are comma sepa-
rated, and may take an argument using the equals (’=’) sign. The following options are supported:

size=journal−size
Create a journal of size journal−size. Minimum size is 4M. If omitted, a value is heuris-
tically determined based upon the file system size.

block32
Use a standard 32bit journal. The journal will be able to access up to 2^32-1 blocks. This
is the default. It has been the journal format for OCFS2 volumes since the beginning.
The journal is compatible with all versions of OCFS2. Prepending no is equivalent to the
block64 journal option.

block64
Use a 64bit journal. The journal will be able to access up to 2^64-1 blocks. This allows
large filesystems that can extend to the theoretical limits of OCFS2. It requires a new-
enough filesystem driver that uses the new journalled block device, JBD2. Prepending no
is equivalent to the block32 journal option.


−L, −−label volume−label


Set the volume label for the file system. This is useful for mounting−by−label. Limit the label to
under 64 bytes.

−M, −−mount mount−type


Valid types are local and cluster. Local mount allows users to mount the volume without the clus-
ter overhead and works only with OCFS2 bundled with Linux kernels 2.6.20 or later. Defaults to
cluster.

−N, −−node−slots number−of−node−slots


Valid number ranges from 1 to 255. This number specifies the maximum number of nodes that can
concurrently mount the partition. If omitted, the number defaults to 8. The number of slots can be
later tuned up or down using tunefs.ocfs2.

−T filesystem−type
Specify how the filesystem is going to be used, so that mkfs.ocfs2 can choose optimal filesystem
parameters for that use. The supported filesystem types are:

mail Appropriate for file systems that will host lots of small files.

datafiles
Appropriate for file systems that will host a relatively small number of very large files.

vmstore
Appropriate for file systems that will host virtual machine images.

−−fs−features=[no]sparse...
Turn specific file system features on or off. A comma separated list of feature flags can be pro-
vided, and mkfs.ocfs2 will try to create the file system with those features set according to the list.
To turn a feature on, include it in the list. To turn a feature off, prepend no to the name. Choices
here will override individual features set via the −−fs−feature−level option. Refer to the section
titled feature compatibility before selecting specific features. The following flags are supported:

backup-super
mkfs.ocfs2, by default, makes up to 6 backup copies of the super block at offsets 1G, 4G,
16G, 64G, 256G and 1T depending on the size of the volume. This can be useful in dis-
aster recovery. This feature is fully compatible with all versions of the file system and
generally should not be disabled.

local Create the file system as a local mount, so that it can be mounted without a cluster stack.

sparse Enable support for sparse files. With this, OCFS2 can avoid allocating (and zeroing) data
to fill holes. Turn this feature on if you can, otherwise extending files and some writes might be
less performant.

unwritten
Enable unwritten extents support. With this turned on, an application can request that a
range of clusters be pre-allocated within a file. OCFS2 will mark those extents with a spe-
cial flag so that expensive data zeroing doesn’t have to be performed. Reads and writes to
a pre-allocated region act as reads and writes to a hole, except a write will not fail due to lack of
data allocation. This feature requires sparse file support to be turned on.

inline-data
Enable inline-data support. If this feature is turned on, OCFS2 will store small files and
directories inside the inode block. Data is transparently moved out to an extent when it no
longer fits inside the inode block. In some cases, this can also make a positive impact on
cold-cache directory and file operations.

extended-slotmap
The slot-map is a hidden file on an OCFS2 fs which is used to map mounted nodes to sys-
tem file resources. The extended slot map allows a larger range of possible node numbers,
which is useful for userspace cluster stacks. If required, this feature is automatically
turned on by mkfs.ocfs2.

metaecc
Enables metadata checksums. With this enabled, the file system computes and stores the
checksums in all metadata blocks. It also computes and stores an error correction code
capable of fixing single bit errors.

refcount
Enables creation of reference counted trees. With this enabled, the file system allows
users to create inode-based snapshots and clones known as reflinks.

xattr Enable extended attributes support. With this enabled, users can attach name:value pairs
to objects within the file system. In OCFS2, the names can be up to 255 bytes in length,
terminated by the first NUL byte. While it is not required, printable names (ASCII) are
recommended. The values can be up to 64KB of arbitrary binary data. Attributes can be
attached to all types of inodes: regular files, directories, symbolic links, device nodes, etc.
This feature is required for users wanting to use extended security facilities like POSIX
ACLs or SELinux.

usrquota
Enable user quota support. With this feature enabled, the file system will track the amount
of space and the number of inodes (files, directories, symbolic links) each user owns. It is
then possible to limit the maximum amount of space or number of inodes a user can have. See
the documentation of the quota-tools package for more details.

grpquota
Enable group quota support. With this feature enabled, the file system will track the amount
of space and the number of inodes (files, directories, symbolic links) each group owns. It is
then possible to limit the maximum amount of space or number of inodes a group can have. See
the documentation of the quota-tools package for more details.

indexed-dirs
Enable directory indexing support. With this feature enabled, the file system creates an
indexed tree for non-inline directory entries. For large scale directories, directory entry
lookup performance from the indexed tree is faster than from the legacy directory blocks.

discontig-bg
Enables discontiguous block groups. With this feature enabled, the file system is able to
grow the inode and the extent allocators even when there is no contiguous free chunk available.
It allows the file system to grow the allocators in smaller (discontiguous) chunks.

clusterinfo
Enables storing the cluster stack information in the superblock. This feature is needed to
support userspace cluster stacks and the global heartbeat mode in the o2cb cluster stack.
If needed, this feature is automatically turned on by mkfs.ocfs2.

−−fs−feature−level=feature−level
Choose from a set of pre-determined file-system features. This option is designed to allow users to
conveniently choose a set of file system features which fits their needs. There is no downside to
trying a set of features which your module might not support - if it won't mount the new file
system, simply reformat at a lower level. Feature levels can be fine-tuned via the −−fs−features
option. Currently, there are 3 types of feature levels:

max-compat
Chooses fewer features but ensures that the file system can be mounted from older ver-
sions of the OCFS2 module.

default The default feature set tries to strike a balance between providing new features and main-
taining compatibility with relatively recent versions of OCFS2. It currently enables
sparse, unwritten, inline-data, xattr, indexed-dirs, discontig-bg, refcount, extended-
slotmap and clusterinfo.

max-features
Choose the maximum amount of features available. This will typically provide the best
performance from OCFS2 at the expense of creating a file system that is only compatible
with very recent versions of the OCFS2 kernel module.

−−cluster−stack
Specify the cluster stack. This option is normally not required as mkfs.ocfs2 chooses the currently
active cluster stack. It is required only if the cluster stack is not online and the user wishes to use a
stack other than the default, o2cb. Other supported cluster stacks are pcmk (Pacemaker) and cman
(rgmanager). Once set, OCFS2 will only allow mounting the volume if the active cluster stack and
cluster name match those specified on-disk.

−−cluster−name
Specify the name of the cluster. This option is mandatory if the user has specified a cluster−stack.
This name is restricted to a max of 16 characters. Additionally, the o2cb cluster stack allows only
alpha-numeric characters.

−−global−heartbeat
Enable the global heartbeat mode of the o2cb cluster stack. This option is not required if the o2cb
cluster stack with global heartbeat is online as mkfs.ocfs2 will detect the active stack. However, if
the cluster stack is not up, then this option is required along with cluster−stack and cluster−name.
For more, refer to o2cb(7).

−−no-backup-super
This option is deprecated, please use --fs-features=nobackup-super instead.


−n, --dry-run
Display the heuristically determined values without overwriting the existing file system.

−q, −−quiet
Quiet mode.

−U uuid
Specify a custom UUID in the plain (2A4D1C581FAA42A1A41D26EFC90C1315) or traditional
(2a4d1c58-1faa-42a1-a41d-26efc90c1315) format. This option is not recommended because the
file system uses the UUID to uniquely identify a file system. If more than one file system were to
have the same UUID, one is very likely to encounter erratic behavior, if not outright file system
corruption.

−v, −−verbose
Verbose mode.

−V, −−version
Print version and exit.

blocks-count
Usually mkfs.ocfs2 automatically determines the size of the given device and creates a file system
that uses all of the available space on the device. This optional argument specifies that the file sys-
tem should only consume the given number of file system blocks (see -b) on the device.
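
For example, assuming a 4K block size, the following would restrict the file system to the first 1G of
the device (262144 blocks x 4K; the label and device are illustrative):

# mkfs.ocfs2 -b 4K -L "smallvol" /dev/sdc1 262144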

FEATURE COMPATIBILITY
This section lists the file system features that have been added to the OCFS2 file system and the version that
it first appeared in. The table below lists the versions of the mainline Linux kernel and ocfs2-tools. Users
should use this information to enable only those features that are available in the file system that they are
using. Before enabling new features, users are advised to review the section titled feature values.

Feature Kernel Version Tools Version


local Linux 2.6.20 ocfs2-tools 1.2
sparse Linux 2.6.22 ocfs2-tools 1.4
unwritten Linux 2.6.23 ocfs2-tools 1.4
inline-data Linux 2.6.24 ocfs2-tools 1.4
extended-slotmap Linux 2.6.27 ocfs2-tools 1.6
metaecc Linux 2.6.29 ocfs2-tools 1.6
grpquota Linux 2.6.29 ocfs2-tools 1.6
usrquota Linux 2.6.29 ocfs2-tools 1.6
xattr Linux 2.6.29 ocfs2-tools 1.6
indexed-dirs Linux 2.6.30 ocfs2-tools 1.6
refcount Linux 2.6.32 ocfs2-tools 1.6
discontig-bg Linux 2.6.35 ocfs2-tools 1.6
clusterinfo Linux 2.6.37 ocfs2-tools 1.8

Users can query the features enabled in the file system as follows:

# tunefs.ocfs2 -Q "Label: %V\nFeatures: %H %O\n" /dev/sdg1


Label: apache_files_10


Features: sparse inline-data unwritten

FEATURE VALUES
This section lists the hex values that are associated with the file system features. This information is useful
when debugging mount failures that are due to feature incompatibility. When a user attempts to mount an
OCFS2 volume that has features enabled that are not supported by the running file system software, it will
fail with an error like:

ERROR: couldn’t mount because of unsupported optional features (200).

By referring to the table below, it becomes apparent that the user attempted to mount a volume with the
xattr (extended attributes) feature enabled with a version of the file system software that did not support it.
At this stage, the user has the option of either upgrading the file system software, or, disabling that on-disk
feature using tunefs.ocfs2.
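
In the example above, the xattr feature could be disabled as follows (the device is illustrative):

# tunefs.ocfs2 --fs-features=noxattr /dev/sdg1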

Some features allow the file system to be mounted with an older version of the software provided the mount
is read-only. If a user attempts to mount such a volume in a read-write mode, it will fail with an error like:

ERROR: couldn’t mount RDWR because of unsupported optional features (1).

This error indicates that the volume had the unwritten RO compat feature enabled. This volume can be
mounted by an older file system software only in the read-only mode. In this case, the user has the option
of either mounting the volume with the ro mount option, or, disabling that on-disk feature using
tunefs.ocfs2.
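
For instance, to clear the unwritten feature so that the older software can mount the volume read-write,
one might run (the device is illustrative):

# tunefs.ocfs2 --fs-features=nounwritten /dev/sdg1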

Feature Category Hex value


local Incompat 8
sparse Incompat 10
inline-data Incompat 40
extended-slotmap Incompat 100
xattr Incompat 200
indexed-dirs Incompat 400
metaecc Incompat 800
refcount Incompat 1000
discontig-bg Incompat 2000
clusterinfo Incompat 4000
unwritten RO Compat 1
usrquota RO Compat 2
grpquota RO Compat 4

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cb(7) o2cluster(8) o2image(8)
o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation


COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 7


mount.ocfs2(8) OCFS2 Manual Pages mount.ocfs2(8)

NAME
mount.ocfs2 − mount an OCFS2 filesystem
SYNOPSIS
mount.ocfs2 [−vn] [−o options] device dir
DESCRIPTION
mount.ocfs2 mounts an OCFS2 filesystem at dir. It is usually invoked indirectly by the mount(8) com-
mand.

OPTIONS
_netdev
Indicates that the file system resides on a device that requires network access (used to prevent the
system from attempting to mount these filesystems until the network has been enabled on the sys-
tem). mount.ocfs2(8) transparently appends this option during mount. However, users mounting
the volume via /etc/fstab must explicitly specify this mount option to delay the system from
mounting the volume until after the network has been enabled.

noatime
The file system will not update access time.

relatime
The file system will update atime only if the on-disk atime is older than mtime or ctime.

strictatime,atime_quantum=nrsec
The file system will always perform atime updates, but the minimum update interval is specified
by atime_quantum, which defaults to 60 secs. Set it to zero to always update atime. These two
options need to be used together.

[no]acl Enables / disables POSIX ACLs (access control lists) support.

[no]user_xattr
Enables / disables extended user attributes.

commit=nrsec
Sync all data and metadata every nrsec seconds. The default value is 5 seconds. Zero means
default.

data=[ordered|writeback]
Specifies the handling of file data during metadata journalling.

ordered
This is the default mode. Data is flushed to disk before the corresponding meta-data is
committed to the journal.

writeback
Data ordering is not preserved - data may be flushed to disk after the corresponding
meta-data is committed to the journal. This is rumored to be the higher-throughput
option. While it guarantees internal file system integrity, it can allow old data to appear in
files after a crash and journal recovery.


errors=[remount-ro|panic]
Specifies the behavior when an on-disk corruption is encountered.

remount-ro
This is the default mode. The file system is remounted read-only.

panic The system is halted via panic.

localflocks
This disables cluster-aware flock(2).

coherency=[full|buffered]
Specifies the extent of coherency for the cached file data across the cluster. This mount option
works with Linux kernel 2.6.37 and later.

full This is the default mode. The file system ensures the cached file data is coherent across
the cluster for all IO modes.

buffered
The file system only ensures the cached file data coherency for buffered mode IOs. It
does not perform IO serialization for direct IOs. This allows multiple nodes to perform
concurrent direct IOs to the same file. This is the recommended mode for volumes host-
ing database files.
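
For example, a volume hosting database files might be mounted as follows (the device and mount point
are illustrative):

# mount -o _netdev,coherency=buffered,noatime /dev/sdd1 /dbvol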

resv_level=level
Specifies the level of allocation reservation for files. The higher the value, the more aggressive it
is. Valid values range from 0 (reservation off) to 8 (maximum space for reservation). It defaults to
2. This mount option works with Linux kernel 2.6.35 and later.

dir_resv_level=level
By default, directory reservation scales with file reservation. Users should rarely need to change
this value. If the file allocation reservation is turned off, this option will have no effect. This mount
option works with Linux kernel 2.6.35 and later.

inode64
Indicates that the file system can create inodes at any location in the volume, including those
which will result in inode numbers greater than 4 billion.

[no]intr
Specifies whether a signal can interrupt IOs. It is disabled by default.

ro Mount the file system read-only.

rw Mount the file system read-write.

NOTES
To mount and umount an OCFS2 volume, do:

# mount /dev/sda1 /mount/path


...
# umount /mount/path

Users mounting a clustered volume should be aware of the following:

1. The cluster stack must be online for a clustered mount to succeed.

2. The clustered mount operation is not instantaneous; it must wait for the node to join the DLM
domain.

3. Likewise, clustered umount is also not instantaneous; it involves migrating all mastered lock-
resources to the other nodes in the cluster.

If the mount fails, detailed errors can be found via dmesg(8). These might include incorrect cluster configu-
ration (say, a missing node or incorrect IP address) or a firewall interfering with o2cb network traffic.
Check the configuration as listed in o2cb(7) or the man page of the active cluster stack.

To auto-mount volumes on startup, the file system tools include an ocfs2 init service. This runs after the
o2cb init service has started the cluster. The ocfs2 init service mounts all OCFS2 volumes listed in
/etc/fstab.

# chkconfig --add o2cb


o2cb 0:off 1:off 2:on 3:on 4:off 5:on 6:off

# chkconfig --add ocfs2


ocfs2 0:off 1:off 2:on 3:on 4:off 5:on 6:off

$ cat /etc/fstab
...
/dev/sda1 /u01 ocfs2 _netdev,defaults 0 0
...

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mounted.ocfs2(8) o2cb(7) o2cluster(8) o2image(8)
o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3


mounted.ocfs2(8) OCFS2 Manual Pages mounted.ocfs2(8)

NAME
mounted.ocfs2 − Detects all OCFS2 volumes on a system.
SYNOPSIS
mounted.ocfs2 [−d] [−f] [device]
DESCRIPTION
mounted.ocfs2 is used to detect OCFS2 volume(s) on a system. When run without specifying a device, it
scans all the partitions listed in /proc/partitions.

OPTIONS
−d Lists the OCFS2 volumes along with their labels and uuids. It also lists the cluster stack, cluster
name and the cluster flags. The possible cluster stacks are o2cb, pcmk and cman. None indicates
a local mount or a non-clustered volume. A G cluster flag indicates global-heartbeat for the o2cb
cluster stack.

−f Lists the OCFS2 volumes along with the list of nodes that have mounted the volume.

NOTES
As this utility gathers information without taking any cluster locks, the information listed in the full detect
mode could be stale. This is only problematic for volumes that were not cleanly umounted by the last node.
Such volumes will show up mounted (as per this utility) on one or more nodes but are in fact not mounted
on any node. Such volumes are awaiting slot-recovery which is auto-performed on the next mount (or file
system check).

EXAMPLES
To view the list of OCFS2 volumes, do:

# mounted.ocfs2 -d
Device Stack Cluster F UUID Label
/dev/sdc1 None 23878C320CF3478095D1318CB5C99EED localmount
/dev/sdd1 o2cb webcluster G 8AB016CD59FC4327A2CDAB69F08518E3 webvol
/dev/sdg1 o2cb webcluster G 77D95EF51C0149D2823674FCC162CF8B logsvol
/dev/sdh1 o2cb webcluster G BBA1DBD0F73F449384CE75197D9B7098 scratch
/dev/sdk1 o2cb webcluster G DCDA2845177F4D59A0F2DCD8DE507CC3 hb1

To view the list of nodes that have potentially (see notes) mounted the OCFS2 volumes, do:

# mounted.ocfs2 -f
Device Stack Cluster F Nodes
/dev/sdc1 None
/dev/sdd1 o2cb webcluster G node1, node3, node10
/dev/sdg1 o2cb webcluster G node1, node3, node10
/dev/sdh1 o2cb webcluster G Not mounted
/dev/sdk1 o2cb webcluster G node1, node3, node10

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mount.ocfs2(8) o2cluster(8) o2image(8) o2info(1)
tunefs.ocfs2(8)

AUTHORS
Oracle Corporation


COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 2


tunefs.ocfs2(8) OCFS2 Manual Pages tunefs.ocfs2(8)

NAME
tunefs.ocfs2 − Change OCFS2 file system parameters.
SYNOPSIS
tunefs.ocfs2 [−−cloned−volume[=new-label]] [−−fs−features=list−of−features] [−J journal-options] [−L
volume-label] [−N number-of-node-slots] [−Q query-format] [−ipqnSUvVy] [−−backup-super]
[−−list−sparse] device [blocks-count]

DESCRIPTION
tunefs.ocfs2(8) is used to adjust OCFS2 file system parameters on disk. The tool expects the cluster to be
online as it needs to take the appropriate cluster locks to write safely to disk.

OPTIONS
−−cloned−volume[=new-label]
Change the volume UUID (auto-generated) and the label, if provided, of a cloned OCFS2 volume.
This option does not perform volume cloning. It only changes the UUID and label on a cloned
volume so that it can be mounted on the node that has the original volume mounted.

−−fs−features=[no]sparse...
Turn specific file system features on or off. tunefs.ocfs2(8) will attempt to enable or disable the
feature list provided. To enable a feature, include it in the list. To disable a feature, prepend no to
the name. For a list of feature names, refer to mkfs.ocfs2(8).

−J, −−journal−options options


Modify the journal using options specified on the command−line. Journal options are comma sep-
arated, and may take an argument using the equals (’=’) sign. For a list of possible options, refer to
mkfs.ocfs2(8).

−L, −−label volume−label


Change the volume label of the file system. Limit the label to under 64 bytes.

−N, −−node−slots number−of−node−slots


Valid number ranges from 1 to 255. This number specifies the maximum number of nodes that can
concurrently mount the partition. Use this to increase or decrease the number of node slots. One
reason to decrease could be to release the space consumed by the journals for those slots.

−S, −−volume−size
Grow the size of the OCFS2 file system. If blocks-count is not specified, tunefs.ocfs2(8) extends
the volume to the current size of the device.
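
For example, to grow the file system to use all of a newly extended device, one might do (the device is
illustrative):

# tunefs.ocfs2 -S /dev/sda1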

−Q, −−query query−format


Query the file system for its attributes like block size, label, etc. Query formats are modified ver-
sions of the standard printf(3) formatting. The format is made up of static strings (which may
include standard C character escapes for newlines, tabs, and other special characters) and printf(3)
type formatters. The list of type specifiers is as follows:
B Block size in bytes
T Cluster size in bytes
N Number of node slots
R Root directory block number
Y System directory block number
P First cluster group block number
V Volume label
U Volume uuid
M Compat flags
H Incompat flags
O RO Compat flags

−q, −−quiet
Quiet mode.

−U, −−uuid−reset[=new-uuid]
Reset the volume UUID of the file system. If not provided, the utility will auto generate it. For
custom UUID, specify in either the plain (2A4D1C581FAA42A1A41D26EFC90C1315) or the
traditional (2a4d1c58-1faa-42a1-a41d-26efc90c1315) format. Users specifying custom UUIDs
must be careful to ensure that no two volumes have the same UUID. If more than one file system
were to have the same UUID, one is very likely to encounter erratic behavior, if not outright
file system corruption.

−v, −−verbose
Verbose mode.

−V, −−version
Show version and exit.

−y, −−yes
Always answer Yes in interactive command line.

−n, −−no
Always answer No in interactive command line.

−−backup−super
Backs up the superblock to fixed offsets (1G, 4G, 16G, 64G, 256G and 1T) on disk. This option is
useful for backing up the superblock on volumes where the user either explicitly disallowed it
while formatting, or used a version of mkfs.ocfs2(8) (1.2.2 or older) that did not provide this
facility.

−−list-sparse
Lists the files having holes. This option is useful when disabling the sparse feature.

−−update-cluster-stack
Updates the on-disk cluster information to match the running cluster. Users looking to update the on-
disk cluster stack without starting the new cluster should use the o2cluster(8) utility.

blocks-count
During resize, tunefs.ocfs2(8) automatically determines the size of the given device and grows the
file system such that it uses all of the available space on the device. This optional argument


specifies that the file system should be extended to consume only the given number of file system
blocks on the device.

EXAMPLES
# tunefs.ocfs2 -Q "UUID = %U\nNumSlots = %N\n" /dev/sda1
UUID = CBB8D5E0C169497C8B52A0FD555C7A3E
NumSlots = 4

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8)
o2cluster(8) o2image(8) o2info(1)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3


o2cluster(8) OCFS2 Manual Pages o2cluster(8)

NAME
o2cluster − Change cluster stack stamped on an OCFS2 file system.
SYNOPSIS
o2cluster [−o|−−show−ondisk] [−r|−−show−running] [−u|−−update[=<clusterstack>]] [−hvVyn]
[device]

DESCRIPTION
o2cluster is used to change the cluster stack stamped on an OCFS2 file system. It is also used to list the
active cluster stack and the one stamped on-disk. This utility does not expect the cluster to be online. It only
updates the file system if it is reasonably assured that it is not in use on any other node. Clean journals
imply the file system is not in use. This utility aborts if it detects even one dirty journal.

Before using this utility, the user should use other means to ensure that the volume is not in use, and more
importantly, not about to be put in use. While clean journals imply the file system is not in use, there is a
tiny window after the check and before the update during which another node could mount the file system
using the older cluster stack.

If a dirty journal is detected, it implies one of two scenarios: either the file system is mounted on another
node, or the last node to have it mounted crashed. There is no way, short of joining the cluster, for the
utility to differentiate between the two. Considering this utility is targeted at scenarios in which the user
is looking to change the on-disk cluster stack, this becomes a chicken-and-egg problem.

If one were to run into this scenario, the user should manually re-confirm that the file system is not in use
on another node and then run fsck.ocfs2(8). It will update the on-disk cluster stack to the active cluster
stack and do a complete file system check.
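
That is, after confirming that the volume is not in use anywhere (the device is illustrative):

# fsck.ocfs2 /dev/sda1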

SPECIFYING CLUSTER STACK


The cluster stack can be specified in one of two forms. The first is the keyword default, denoting the
original classic o2cb cluster stack with local heartbeat. The second is a triplet with the stack name, the
cluster name and the cluster flags separated by commas, like o2cb,mycluster,global.

The valid stack names are o2cb, pcmk, and cman.

The cluster name can be up to 16 characters. The o2cb stack further restricts the names to contain only
alphanumeric characters.

The valid flags for the o2cb stack are local and global, denoting the two heartbeat modes. The only valid
flag for the other stacks is none.

OPTIONS
−o|−−show−ondisk
Shows the cluster stack stamped on-disk.

−r|−−show−running
Shows the active cluster stack.

−u|−−update[=<clusterstack>]
Updates the on-disk cluster stack with the one provided. If no cluster stack is provided, the utility
detects the active cluster stack and stamps it on-disk.


−v, −−verbose
Verbose mode.

−V, −−version
Show version and exit.

−y, −−yes
Always answer Yes in interactive command line.

−n, −−no
Always answer No in interactive command line.

EXAMPLES
# o2cluster -r
o2cb,myactivecluster,global

# o2cluster -o /dev/sda1
o2cb,mycluster,global

# o2cluster --update=o2cb,yourcluster,global /dev/sdb1


Changing the clusterstack from o2cb,mycluster,global to o2cb,yourcluster,global. Continue? y
Updated successfully.

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8)
o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2011, 2012 Oracle. All rights reserved.

debugfs.ocfs2(8) OCFS2 Manual Pages debugfs.ocfs2(8)

NAME
debugfs.ocfs2 − OCFS2 file system debugger.
SYNOPSIS
debugfs.ocfs2 [−f cmdfile] [−R command] [−s backup] [−nwV?] [device]
debugfs.ocfs2 −l [tracebit ... [allow|off|deny]] ...
debugfs.ocfs2 −d, −−decode lockname
debugfs.ocfs2 −e, −−encode lock_type block_num [generation | parent]

DESCRIPTION
The debugfs.ocfs2 program is an interactive file system debugger useful in displaying on-disk OCFS2
filesystem structures on the specified device.

OPTIONS
−d, −−decode lockname
Display the information encoded in the lockname.

−e, −−encode lock_type block_num [generation | parent]


Display the lockname obtained by encoding the arguments provided.

−f, −−file cmdfile


Executes the debugfs commands in cmdfile.

−i, −−image
Specifies that the device is an o2image file created by the o2image(8) tool.

−l [tracebit ... [allow|off|deny]] ...


Control OCFS2 filesystem tracing by enabling and disabling trace bits. Run debugfs.ocfs2 -l without
any arguments to get the list of all trace bits.

−n, −−noprompt
Hide prompt.

−R, −−request command


Executes a single debugfs command.

−s, −−superblock backup−number


mkfs.ocfs2 makes up to 6 backup copies of the superblock at offsets 1G, 4G, 16G, 64G, 256G and
1T, depending on the size of the volume. Use this option to specify the backup, 1 through 6, to use
to open the volume.
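
For example, to open a volume using its second backup superblock (device name illustrative):

# debugfs.ocfs2 -s 2 /dev/sda1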

−w, −−write
Opens the filesystem in RW mode. By default the filesystem is opened in RO mode.

−V, −−version
Display version and exit.

−?, −−help
Displays help and exit.
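
As an illustrative non-interactive use, the −R option can execute a single command, such as displaying
the superblock (device name illustrative):

# debugfs.ocfs2 -R "stats" /dev/sda1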

SPECIFYING FILES
Many debugfs.ocfs2 commands take a filespec as an argument to specify an inode (as opposed to a path-
name) in the filesystem which is currently opened by debugfs.ocfs2. The filespec argument may be speci-
fied in two forms. The first form is an inode number or lockname surrounded by angle brackets, e.g., <32>.
The second form is a pathname; if the pathname is prefixed by a forward slash (’/’), then it is interpreted
relative to the root of the filesystem which is currently opened by debugfs.ocfs2. If not, the path is inter-
preted relative to the current working directory as maintained by debugfs.ocfs2, which can be modified
using the command cd. If the pathname is prefixed by a double forward slash (’//’), then it is interpreted rel-
ative to the root of the system directory of the filesystem opened by debugfs.ocfs2.
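
For example, each of the following refers to an object in the currently opened filesystem (prompt, inode
number and paths illustrative):

debugfs: stat <32>
debugfs: stat /dir1/file1
debugfs: ls //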

LOCKNAMES
Locknames are specially formatted strings used by the file system to uniquely identify objects in the filesys-
tem. Most locknames used by OCFS2 are generated using the inode number and its generation number and
can be decoded using the decode command or used directly in place of an inode number in commands
requiring a filespec. Like inode numbers, locknames need to be enclosed in angle brackets, e.g.,
<M000000000000000040c40c044069cf>. Use the encode command to generate all possible locknames for
an object.
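
For example, using the sample lockname shown above (values illustrative):

debugfs: decode <M000000000000000040c40c044069cf>
debugfs: encode /dir1/file1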

COMMANDS
This is a list of the commands which debugfs.ocfs2 supports.

bmap filespec logical_block


Display the physical block number corresponding to the logical block number logical_block in the
inode filespec.

cat filespec
Dump the contents of inode filespec to stdout.

cd filespec
Change the current working directory to filespec.

chroot filespec
Change the root directory to be the directory filespec.

close Close the currently opened filesystem.

controld dump
Display information obtained from ocfs2_controld.

curdev Show the currently open device.

decode <lockname>
Display the inode number encoded in the lockname.

dirblocks <filespec>
Display the directory blocks associated with the given filespec.

dlm_locks [−f <file>] [−l] [<lockname(s)>]...


Display the status of all lock resources in the o2dlm domain that the file system is a member of.
This command expects the debugfs filesystem to be mounted as mount -t debugfs debugfs
/sys/kernel/debug. Use lockname(s) to limit the output to the given lock resources, -l to include
contents of the lock value block and -f <file> to specify a saved copy of
/sys/kernel/debug/o2dlm/<DOMAIN>/locking_state.

dump [−p] filespec outfile


Dump the contents of the inode filespec to the output file outfile. If the -p is given, set the owner,
group, timestamps and permissions information on outfile to match those of filespec.

dx_dump filespec
Display the indexed directory information for the given directory.

dx_leaf <block#>
Display the contents of the given indexed directory leaf block.

dx_root <block#>
Display the contents of the given indexed directory root block.

dx_space filespec
Display the directory free space list.

encode filespec
Display the lockname for the filespec.

extent <block#>
Display the contents of the extent structure at block#.

findpath [<lockname>|<inode#>]
Display the pathname for the inode specified by lockname or inode#. This command does not dis-
play all the hard-linked paths for the inode.

frag filespec
Display the inode’s extents-to-clusters ratio.

fs_locks [-f <file>] [-l] [-B] [<lockname(s)>]...


Display the status of all locks known by the file system. This command expects the debugfs
filesystem to be mounted as mount -t debugfs debugfs /sys/kernel/debug. Use lockname(s) to limit
the output to the given lock resources, -B to limit the output to only the busy locks, -l to include
contents of the lock value block and -f <file> to specify a saved copy of
/sys/kernel/debug/ocfs2/<UUID>/locking_state.

group <block#>
Display the contents of the group descriptor at block#.

grpextents <block#>
Display free extents in the chain group.

hb Display the contents of the heartbeat system file.

help, ? Print the list of commands understood by debugfs.ocfs2.

icheck block# ...


Display the inodes that use the one or more blocks specified on the command line. If the inode is
a regular file, also display the corresponding logical block offset.

lcd directory
Change the current working directory of the debugfs.ocfs2 process to the directory on the native
filesystem.

locate [<lockname>|<inode#>] ...


Display all pathnames for the inode(s) specified by locknames or inode#s.

logdump [-T] slot#


Display the contents of the journal for slot slot#. Use -T to limit the output to just the summary of
the inodes in the journal.

ls [−l] filespec
Print the listing of the files in the directory filespec. The −l flag will list files in the long format.

net_stats [interval [count]]


Display net statistics.

ncheck [<lockname>|<inode#>] ...


See locate.

open device
Open the filesystem on device.

quit, q Quit debugfs.ocfs2.

rdump [−v] filespec outdir


Recursively dump directory filespec and all its contents (including regular files, symbolic links and
other directories) into the outdir which should be an existing directory on the native filesystem.

refcount [−e] filespec


Display the refcount block, and optionally its tree, of the specified inode.

slotmap
Display the contents of the slotmap system file.

stat [−t|−T] filespec


Display the contents of the inode structure for the filespec. The -t ("traverse") option selects tra-
versal of the inode’s metadata. The extent tree, chain list, or other extra metadata will be dumped.
This is the default. The -T option turns off traversal to reduce the I/O required when basic inode
information is needed.

stat_sysdir
Display the contents of all objects in the system directory.

stats [−h] [−s backup−number]


Display the contents of the superblock. Use −s to display a specific backup superblock. Use −h to
hide the inode.

xattr [-v] <filespec>


Display extended attributes associated with the given filespec.

ACKNOWLEDGEMENT
This tool has been modelled after debugfs, a debugging tool for ext2.

SEE ALSO
fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8)
o2image(8) o2info(1) ocfs2(7) tunefs.ocfs2(8)

AUTHOR
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

o2image(8) OCFS2 Manual Pages o2image(8)

NAME
o2image − Copy or restore OCFS2 file system meta-data
SYNOPSIS
o2image [−r] [−I] device image-file
DESCRIPTION
o2image copies the OCFS2 file system meta-data from the device to the specified image-file. This image
file contains the file system skeleton that includes the inodes, directory names and file names. It does not
include any file data.

This image file can be useful in debugging certain problems that are not reproducible otherwise, such as
on-disk corruption. It could also be used to analyse the file system layout in an aging file system with an
eye towards improving performance.

As the image-file contains a copy of all the meta-data blocks, it can be a large file. By default, it is created
in a packed format, in which all meta-data blocks are written back-to-back. With the −r option, the user
could choose to have the file in the raw (or sparse) format, in which the blocks are written to the same offset
as they are on the device.

debugfs.ocfs2 understands both formats.

o2image also has the option, −I, to restore the meta-data from the image file onto the device. This option
will rarely be useful to end-users and has been written specifically for developers and testers.

OPTIONS
−r Copies the meta-data to the image-file in the raw format. Use this option only if the destination file
system supports sparse files. If unsure, do not use this option and let the tool create the image-file
in the packed format.

−I Restores meta-data from the image-file onto the device. CAUTION: This option could corrupt
the file system.

−i Interactive mode - before writing out the image file, print its size and ask whether to proceed. This
setting only applies when ’-I’ is not specified. It can be useful when the file system holding the
image is low on disk space and the user might need to free up space once the target image size is
calculated.

EXAMPLES
Copies meta-data blocks from the /dev/sda1 device to the sda1.out file.

# o2image /dev/sda1 sda1.out

Copies meta-data blocks from sda1.out onto the /dev/sda1 device. As this command over-writes an exist-
ing volume, please use with CAUTION.

# o2image -I /dev/sda1 sda1.out

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8)
o2cluster(8) o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2007, 2012 Oracle. All rights reserved.

o2hbmonitor(8) OCFS2 Manual Pages o2hbmonitor(8)

NAME
o2hbmonitor − Monitors disk heartbeat in the O2CB cluster stack
SYNOPSIS
o2hbmonitor [−w percent] [−ivV]
DESCRIPTION
o2hbmonitor is a utility to monitor the disk heartbeat in the o2cb cluster stack. It tracks the time elapsed
since the last heartbeat and logs messages once it exceeds the warn threshold.

By default, it runs as a daemon and logs messages to the system logger. It can be started at any time and
stopped using kill(1). It does not affect the functioning of the heartbeat thread. It is typically automatically
started during cluster online and stopped during cluster offline by the o2cb init script.

This utility expects the debugfs file system to be mounted at /sys/kernel/debug.

OPTIONS
−w percent
Warn threshold percent. It is the percentage of the idle threshold. It defaults to 50%.

−i Interactive mode. It works as a daemon by default. This mode is typically only used for debug-
ging.

−v Verbose mode. It logs messages only to the system logger by default. In this mode it also logs the
messages to stdout.

−V Displays version.
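
For example, to log warnings once 75% of the idle threshold has elapsed (value illustrative):

# o2hbmonitor -w 75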

NOTES
This utility works with Linux kernel 2.6.37 and later.

SEE ALSO
o2cb(7)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2010, 2012 Oracle. All rights reserved.

ocfs2_hb_ctl(8) OCFS2 Manual Pages ocfs2_hb_ctl(8)

NAME
ocfs2_hb_ctl − Starts and stops the O2CB local heartbeat on a given device.
SYNOPSIS
ocfs2_hb_ctl -S -d device service
ocfs2_hb_ctl -S -u uuid service
ocfs2_hb_ctl -K -d device service
ocfs2_hb_ctl -K -u uuid service
ocfs2_hb_ctl -I -d device
ocfs2_hb_ctl -I -u uuid
ocfs2_hb_ctl -P -d device [-n io_priority]
ocfs2_hb_ctl -P -u uuid [-n io_priority]
ocfs2_hb_ctl -h

DESCRIPTION
ocfs2_hb_ctl starts and stops local heartbeat on an OCFS2 device. Users are strongly urged not to use
this tool directly. It is automatically invoked by mount.ocfs2(8) and other tools that require heartbeat noti-
fications.

This utility only operates in the local heartbeat mode. It fails silently when run in global heartbeat mode.
More information on the heartbeat modes can be found in o2cb(7).

The tool accepts devices specified either by name or by UUID. The service argument denotes the
application that is requesting the heartbeat notification.

OPTIONS
−S Starts the heartbeat.

−K Stops the heartbeat.

−I Prints the heartbeat reference counts for that heartbeat region.

−d Specify region by device name.

−u Specify region by device uuid.

−n Adjust IO priority for the heartbeat thread. This option calls the ionice tool to set its IO scheduling
class to realtime with scheduling class data as provided. This option is usable only with the O2CB
cluster stack.

−h Displays help and exit.
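
For example, to print the heartbeat reference counts for a region by device name (device name
illustrative):

# ocfs2_hb_ctl -I -d /dev/sda1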

SEE ALSO
mount.ocfs2(8) o2cb(7) o2cb(8) o2cb.sysconfig(5) ocfs2.cluster.conf(5) o2cluster(8)

AUTHORS
Oracle Corporation

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

fsck.ocfs2(8) OCFS2 Manual Pages fsck.ocfs2(8)

NAME
fsck.ocfs2 − Check an OCFS2 file system.
SYNOPSIS
fsck.ocfs2 [ −pafFGnuvVy ] [ −b superblock block ] [ −B block size ] device
DESCRIPTION
fsck.ocfs2 is used to check an OCFS2 file system.

device is the file where the file system is stored (e.g. /dev/sda1). It will almost always be a device file but a
regular file will work as well.

OPTIONS
−a This option does the same thing as the -p option. It is provided for backwards compatibility only:
it is suggested that people use the -p option whenever possible.

−b superblock block
Normally, fsck.ocfs2 will read the superblock from the first block of the device. This option speci-
fies an alternate block that the superblock should be read from. (Use −r instead of this option.)

−B blocksize
The block size, specified in bytes, can range from 512 to 4096. A value of 0, the default, is used to
indicate that the blocksize should be automatically detected.

−D Optimize directories in filesystem. This option causes fsck.ocfs2 to coalesce the directory entries
in order to improve the filesystem performance.

−f Force checking even if the file system is clean.

−F By default fsck.ocfs2 will check with the cluster services to ensure that the volume is not in-use
(mounted) on any node in the cluster before proceeding. -F skips this check and should only be
used when it can be guaranteed that the volume is not mounted on any node in the cluster. WARN-
ING: If the cluster check is disabled and the volume is mounted on one or more nodes, file
system corruption is very likely. If unsure, do not use this option.

−G Usually fsck.ocfs2 will silently assume inodes whose generation number does not match the gen-
eration number of the super block are unused inodes. This option causes fsck.ocfs2 to ask the user
if these inodes should in fact be marked unused.

−n Give the ’no’ answer to all questions that fsck will ask. This guarantees that the file system will
not be modified and the device will be opened read-only. The output of fsck.ocfs2 with this option
can be redirected to produce a record of a file system’s faults.

−p Automatically repair ("preen") the file system. This option will cause fsck.ocfs2 to automatically
fix any problem that can be safely corrected without human intervention. If there are problems
that require intervention, the descriptions will be printed and fsck.ocfs2 will exit with the value 4
logically or’d into the exit code. (See the EXIT CODE section.) This option is normally used by
the system’s boot scripts.

−P Show progress.

−r backup-number
mkfs.ocfs2 makes up to 6 backup copies of the superblock at offsets 1G, 4G, 16G, 64G, 256G and
1T, depending on the size of the volume. Use this option to specify the backup, 1 through 6, to use
to recover the superblock.
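
For example, to recover the superblock from the second backup (device name illustrative):

# fsck.ocfs2 -r 2 /dev/sda1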

−t Show I/O statistics. If this option is specified twice, it shows the statistics on a pass by pass basis.

−y Give the ’yes’ answer to all questions that fsck will ask. This will repair all faults that fsck.ocfs2
finds but will not give the operator a chance to intervene if fsck.ocfs2 decides that it wants to dras-
tically repair the file system.

−v This option causes fsck.ocfs2 to produce a very large amount of debugging output.

−V Print version information and exit.

EXIT CODE
The exit code returned by fsck.ocfs2 is the sum of the following conditions:
0 − No errors
1 − File system errors corrected
2 − File system errors corrected, system should be rebooted
4 − File system errors left uncorrected
8 − Operational error
16 − Usage or syntax error
32 − fsck.ocfs2 canceled by user request
128 − Shared library error
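
For example, a script can run a preen pass and act on the exit code (device name and output
illustrative):

# fsck.ocfs2 -p /dev/sda1
# echo $?
0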

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8)
o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation. This man page entry derives some text, especially the exit code summary, from
e2fsck(8) by Theodore Y. Ts’o <tytso@mit.edu>.

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

fsck.ocfs2.checks(8) OCFS2 Manual Pages fsck.ocfs2.checks(8)

NAME
fsck.ocfs2.checks − Consistency checks that fsck.ocfs2(8) performs and its means for fixing inconsistencies.
DESCRIPTION
fsck.ocfs2(8) is used to check an OCFS2 file system. It performs many consistency checks and will offer to
fix faults that it finds. This man page lists the problems it may find and describes their fixes. The problems
are indexed by the error number that fsck.ocfs2(8) emits when it describes the problem and asks if it should
be fixed.

The prompts are constructed such that answering ’no’ results in no changes to the file system. This may
result in errors later on that stop fsck.ocfs2(8) from proceeding.
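
For example, to record a file system's faults without fixing anything, a forced read-only pass can be run
first (device name illustrative):

# fsck.ocfs2 -f -n /dev/sda1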

CHECKS
EB_BLKNO
Extent blocks contain a record of the disk block where they are located. An extent block was found at a
block that didn’t match its recorded location.

Answering yes will update the data structure in the extent block to reflect its real location on disk.

EB_GEN
Extent blocks are created with a generation number to match the generation number of the volume at the
time of creation. An extent block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the extent block is from a previous file
system. The extent block will be ignored and the file that contains it will lose the data it referenced.

EB_GEN_FIX
Extent blocks are created with a generation number to match the generation number of the volume at the
time of creation. An extent block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the extent block is incorrect and that the extent block
is valid. The generation number in the block is updated to match the generation number in the volume.

EXTENT_MARKED_UNWRITTEN
An extent record has the UNWRITTEN flag set, but the filesystem feature set does not include unwritten
extents.

Answering yes clears the UNWRITTEN flag. This is safe to do, as the feature is disabled anyway.

EXTENT_MARKED_REFCOUNTED
An extent record has the REFCOUNTED flag set, but neither the filesystem nor the file has the REF-
COUNTED flag set.

Answering yes clears the REFCOUNTED flag.

EXTENT_BLKNO_UNALIGNED
The block that marks the start of an extent should always fall on the start of a cluster. An extent was found
that starts part-way into a cluster.

Answering yes moves the start of the extent back to the start of the addressed cluster. This may add data to
the middle of the file that contains this extent.

EXTENT_CLUSTERS_OVERRUN
An extent was found which claims to contain clusters which are beyond the end of the volume.

Answering yes clamps the extent to the end of the volume. This may result in a reduced file size for the file
that contains the extent, but it couldn’t have addressed those final clusters anyway. One can imagine this
problem arising if there are problems shrinking a volume.

EXTENT_EB_INVALID
Deep extent trees are built by forming a tree out of extent blocks. An extent tree references an invalid
extent block.

Answering yes stops the tree from referencing the invalid extent block. This may truncate data from the file
which contains the tree.

EXTENT_LIST_DEPTH
Extent lists contain a record of their depth in the tree. An extent list was found whose recorded depth
doesn’t match the position it has in the tree.

Answering yes updates the depth field in the list to match the tree on disk.

EXTENT_LIST_COUNT
The number of entries in an extent list is bounded by either the size of the inode or the size of the block
which contains it. An extent list was found which claims to have more entries than would fit in its con-
tainer.

Answering yes updates the count field in the extent list to match the container. Answering no to this ques-
tion may stop further fixes from being done because the count value can not be trusted.

EXTENT_LIST_FREE
The number of free entries in an extent list must be less than the total number of entries in the list. A list
was found which claims to have more free entries than possible entries.

Answering yes sets the number of free entries in the list equal to the total possible entries.

EXTENT_BLKNO_RANGE
An extent record was found which references a block which can not be referenced by an extent. The refer-
enced block is either very early in the volume, and thus reserved, or beyond the end of the volume.

Answering yes removes this extent record from the tree. This may remove data from the file which owns
the tree but any such data was inaccessible.

CHAIN_CPG
The bitmap inode indicates a different clusters-per-group value than the group descriptor. This value is
typically static; it is only modified by tunefs during a volume resize, and even then only on volumes having
a single cluster group.

Answering yes updates the clusters per group on the bitmap inode to the corresponding value in the group
descriptor.

SUPERBLOCK_CLUSTERS
The super block indicates a different total clusters value than the global bitmap. This is only possible due to
a failed volume resize operation.

Answering yes updates the total clusters in the super block to the value specified in the global bitmap.

FIXED_CHAIN_CLUSTERS
The global bitmap inode was repaired, resulting in a change to the total cluster count of the filesystem.

Answering yes updates the total clusters in the super block to the value specified in the global bitmap.

GROUP_UNEXPECTED_DESC
The group descriptors that make up the global bitmap chain allocator reside at predictable locations on disk.
A group descriptor was found in the global bitmap allocator which isn’t at one of these locations and so
shouldn’t be in the allocator.

Answering yes removes this descriptor from the global bitmap allocator.

GROUP_EXPECTED_DESC
The group descriptors that make up the global bitmap chain allocator reside at predictable locations on disk.
A group descriptor at one of these locations was not linked into the global bitmap allocator.

Answering yes will relink this group into the allocator.

GROUP_GEN
A group descriptor was found with a generation number that doesn’t match the generation number of the
volume.

Answering yes sets the group descriptor’s generation equal to the generation number in the volume.

GROUP_PARENT
Group descriptors contain a pointer to the allocator inode which contains the chain they belong to. A group
descriptor was found in an allocator inode that doesn’t match the descriptor’s parent pointer.

Answering yes updates the group descriptor’s parent pointer to match the inode it resides in.

GROUP_DUPLICATE
Group descriptors contain a pointer to the allocator inode which contains the chain they belong to. A group
descriptor was found in two allocator inodes so it may be duplicated.

Answering yes removes the group descriptor from the current allocator inode.

GROUP_BLKNO
Group descriptors have a field which records their block location on disk. A group descriptor was found at
a given location but is recorded as being located somewhere else.

Answering yes updates the group descriptor’s recorded location to match where it actually is found on disk.

GROUP_CHAIN
Group descriptors are found in a number of different singly-linked chains in an allocator inode. A group
descriptor records the chain number that it is linked in. A group descriptor was found whose chain field
doesn’t match the chain it was found in.

Answering yes sets the group descriptor’s chain field to match the chain it is found in.

GROUP_FREE_BITS
A group descriptor records the number of bits in its bitmap that are free. A group descriptor was found
which claims to have more free bits than are valid in its bitmap.

Answering yes decreases the number of recorded free bits so that it equals the total number of bits in the
group descriptor’s bitmap.

CHAIN_COUNT
The chain list embedded in an inode is limited by the block size and the number of bytes consumed by the
rest of the inode. A chain list header was found which claimed that there are more entries in the list than
could fit in the inode.

Answering yes resets the header’s cl_count member to the maximum size allowed by the block size after
accounting for the space consumed by the inode.

CHAIN_NEXT_FREE
This is identical to CHAIN_COUNT except that it is testing and fixing the pointer to the next free list entry
recorded in the cl_next_free_rec member instead of the total number of entries.

CHAIN_EMPTY
Chain entries need to be packed such that there are no chains without descriptors before the chain that is
marked as free by the chain header. A chain without descriptors was found before the chain that was
marked free.

Answering yes will remove the unused chain and shift the remaining chains forward in the list.

CHAIN_I_CLUSTERS
Chain allocator inodes have an i_clusters value that represents the number of clusters used by the allocator.
An inode was found whose i_clusters value doesn’t match the number of clusters its chains cover.

Answering yes updates i_clusters in the inode to reflect what was actually found by walking the chain.

CHAIN_I_SIZE
Chain allocator inodes multiply the number of bytes per cluster by their i_clusters value and store it in
i_size. An inode was found which didn’t have the correct value in its i_size.

Answering yes updates i_size to be the product of i_clusters and the cluster size. Nothing else uses this
value, and previous versions of tools didn’t calculate it properly, so don’t be too worried if this error
appears.

CHAIN_GROUP_BITS
The inode that contains an embedded chain list has fields which record the total number of bits covered by
the chain as well as the amount free. These fields didn’t match what was found in the chain.

Answering yes updates the fields in the inode to reflect what was actually found by walking the chain.

CHAIN_HEAD_LINK_RANGE
The header that starts a chain tried to reference a group descriptor at a block number that couldn’t be valid.

Answering yes will clear the reference to this invalid block and truncate the chain that it started.

CHAIN_LINK_GEN
A reference was made to a group descriptor whose generation number doesn’t match the generation of the
volume.

Answering yes to this question implies that the group descriptor is invalid and the chain is truncated at the
point that it referred to this invalid group descriptor. Answering no to this question considers the group
descriptor as valid and its generation may be fixed.

CHAIN_LINK_MAGIC
Chains are built by chain headers and group descriptors which are linked together by block references. A
reference was made to a group descriptor at a given block but a valid group descriptor signature wasn’t
found at that block.

Answering yes clears the reference to this invalid block and truncates the chain at the point of the reference.

CHAIN_LINK_RANGE
Chains are built by chain headers and group descriptors which are linked together by block references. A
reference to a block was found which can’t possibly be valid because it was either too small or extended
beyond the volume.

Answering yes truncates the chain in question by zeroing the invalid block reference. This shortens the
chain in question and could result in more fixes later if the part of the chain that couldn’t be referenced was
valid at some point.

CHAIN_BITS
A chain’s header contains members which record the total number of bits in the chain as well as the number
of bits that are free. After walking through a chain it was found that the number of bits recorded in its
header doesn’t match what was found by totalling up the group descriptors.

Answering yes updates the c_total and c_free members of the header to reflect what was found in the group
descriptors in the chain.

DISCONTIG_BG_DEPTH
A discontiguous block group has an extent list which records all the clusters allocated to it. Discontiguous
block groups only support extent lists with a tree depth of 0. A block group claims to have a tree depth
greater than 0.

Answering yes will set the tree depth of the extent list to 0.

DISCONTIG_BG_COUNT
A discontiguous block group has an extent list which records all the clusters allocated to it. A block group
claims to have more records than can actually fit.

Answering yes will set the record count to the maximum possible.

DISCONTIG_BG_REC_RANGE
Block groups set aside clusters to be used for metadata. A discontiguous block group claims to contain
clusters beyond the end of the volume.

Answering yes will remove the block group.

DISCONTIG_BG_CORRUPT_LEAVES
A discontiguous block group has an extent list which records all the clusters allocated to it. A group has
more than one extent claiming to have an impossible number of clusters.

Answering yes will remove the block group.

DISCONTIG_BG_CLUSTERS
Extent records in a discontiguous block group were found having more clusters allocated than a block
group can have.

Answering yes will remove the block group.

DISCONTIG_BG_LESS_CLUSTERS
Extent records in a discontiguous block group were found having fewer clusters allocated than a block group
can have.

Answering yes will remove the block group.

DISCONTIG_BG_NEXT_FREE_REC
A discontiguous block group has an extent list which records all the clusters allocated to it. A group was
found with fewer filled in extents than it claims to have. The filled in extents describe a complete and cor-
rect group.

Answering yes will set the used extent count to the number of filled extents.

DISCONTIG_BG_LIST_CORRUPT
A discontiguous block group has an extent list which records all the clusters allocated to it. The group
claims to have more extents than is possible, and the existing extents contain errors.

Answering yes will remove the block group.

DISCONTIG_BG_REC_CORRUPT
A discontiguous block group has an extent list which records all the clusters allocated to it. A group was
found with one extent claiming too many clusters but the sum of the remaining extents is equal to the total
clusters a group must have.

Answering yes will remove the block group.

DISCONTIG_BG_LEAF_CLUSTERS
A discontiguous block group has an extent list which records all the clusters allocated to it. A group was
found with one extent claiming too many clusters, but the remaining extents are correct.

Answering yes will set the number of the clusters on the broken extent to the difference between the total
clusters a group must have and the sum of the remaining extents.

INODE_ALLOC_REPAIR
The inode allocator did not accurately reflect the set of inodes that are free and in use in the volume.

Answering yes will update the inode allocator bitmaps. Each bit that doesn’t match the state of its inode
will be inverted.

INODE_SUBALLOC
Each inode records the node whose allocator is responsible for the inode. An inode was found in a given
node’s allocator but the inode itself claimed to belong to a different node.

Answering yes will correct the inode to point to the node’s allocator that it belongs to.

LALLOC_SIZE
Each node has a local allocator contained in a block that is used to allocate clusters in batches. A node’s
local allocator claims to reflect more bytes than are possible for the volume’s block size.

Answering yes decreases the local allocator’s size to reflect the volume’s block size.

LALLOC_NZ_USED
A given node’s local allocator isn’t in use but it claims to have bits in use in its bitmap.

Answering yes zeros this used field.

LALLOC_NZ_BM
A given node’s local allocator isn’t in use but it has a field which records the bitmap as starting at a non-
zero cluster offset.

Answering yes zeros the bm_off field.

LALLOC_BM_OVERRUN
Each local allocator contains a reference to the first cluster that its bitmap addresses. A given local alloca-
tor was found which references a starting cluster that is beyond the end of the volume.

Answering yes resets the given local allocator. No allocated data will be lost.

LALLOC_BM_SIZE
The given local allocator claims to cover more bits than are possible for the size in bytes of its bitmap.

Answering yes decreases the number of bits the allocator covers to reflect the size in bytes of the bitmap
and resets the allocator. No allocated data will be lost.

LALLOC_BM_STRADDLE
The given local allocator claims to cover a region of clusters which extends beyond the end of the volume.

Answering yes resets the given local allocator. No allocated data will be lost.

LALLOC_USED_OVERRUN
The given local allocator claims to have more bits in use than it has total bits in its bitmap.

Answering yes decreases the number of bits used so that it equals the total number of available bits.

LALLOC_CLEAR
A local allocator inode was found to have problems. This gives the operator a chance to just reset the local
allocator inode.

Answering yes clears the local allocator. No information is lost but the global bitmap allocator may need to
be updated to reflect clusters that were reserved for the local allocator but were free.

DEALLOC_COUNT
The given truncate log inode contains a count that is greater than the value that is possible given the size of
the inode.

Answering yes resets the count value to the possible maximum.

DEALLOC_USED
The given truncate log inode claims to have more records in use than it is possible to store in the inode.

Answering yes resets the record of the number used to the maximum value possible.

TRUNCATE_REC_START_RANGE
A truncate record was found which claims to start at a cluster that is beyond the number of clusters in the
volume.

Answering yes will clear the truncate record. This may result in previously freed space being marked as
allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.

TRUNCATE_REC_WRAP
Clusters are recorded as 32bit values. A truncate record was found which claims to have enough clusters to
cause this value to wrap. This could never be the case and is a sure sign of corruption.

Answering yes will clear the truncate record. This may result in previously freed space being marked as
allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.

TRUNCATE_REC_RANGE
A truncate record was found which claims to reference a region of clusters which partially extends beyond
the number of clusters in the volume.

Answering yes will clear the truncate record. This may result in previously freed space being marked as
allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.

INODE_GEN
Inodes are created with a generation number to match the generation number of the volume at the time of
creation. An inode was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the inode is from a previous file sys-
tem. The inode will be recorded as free.

INODE_GEN_FIX
Inodes are created with a generation number to match the generation number of the volume at the time of
creation. An inode was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the inode is incorrect and that the inode is valid. The
generation number in the inode is updated to match the generation number in the volume.

INODE_BLKNO
Inodes contain a field that must match the block that they reside in. An inode was found at a block that
doesn’t match the field in the inode.

Answering yes updates the field to match the inode’s position on disk.

ROOT_NOTDIR
The super block contains a reference to the inode that contains the root directory. This block was found to
contain an inode that isn’t a directory.

Answering yes clears this inode. The operator will be asked to recreate the root directory at a point in the
near future.

INODE_NZ_DTIME
Inodes contain a field describing the time at which they were deleted. This can not be set for an inode that
is still in use. An inode was found which is in use but which contains a non-zero dtime.

Answering yes implies that the inode is still valid and resets its dtime to zero.

LINK_FAST_DATA
The target name for a symbolic link is stored either as file contents for that inode or in the inode structure
itself on disk. Only small destination names are stored in the inode structure. When the i_blocks field of the
inode is zero, the name is stored in the inode itself. An inode was found that has both i_blocks set to zero
and file contents.

Answering yes clears the inode and so deletes the link.

LINK_NULLTERM
The targets of links on disk must be null terminated. A link was found whose target wasn’t null terminated.

Answering yes clears the inode and so deletes the link.

LINK_SIZE
The size of a link on disk must match the length of its target string. A link was found whose size does not.

Answering yes updates the link’s size to reflect the length of its target string.

LINK_BLOCKS
Links can not be sparse. There must be exactly as many blocks allocated as are needed to cover its size. A
link was found which doesn’t have enough blocks allocated to cover its size.

Answering yes clears the link’s inode thus deleting the link.

DIR_ZERO
Directories must at least contain a block that has the "." and ".." entries. A directory was found which
doesn’t contain any blocks.

Answering yes to this question clears the directory’s inode thus deleting the directory.

INODE_SIZE
Certain inodes record the size of the data they reference in an i_size field. This can be the number of bytes
in a file, directory, or symlink target which are stored in data mapped by extents of clusters. This error
occurs when the extent lists are walked and the amount of data found does not match what is stored in
i_size.

Answering yes to this question updates the inode’s i_size to match the amount of data referenced by the
extent lists. It is vitally important that i_size matches the extent lists and so answering yes is strongly
encouraged.

INODE_SPARSE_SIZE
Certain inodes record the size of the data they reference in an i_size field. This can be the number of bytes
in a file, directory, or symlink target which are stored in data mapped by extents of clusters. This error
occurs when a sparse inode was found that had data allocated past its i_size.

Answering yes to this question will update the inode’s i_size to cover all of its allocated storage. It is
vitally important that i_size matches the extent lists and so answering yes is strongly encouraged.

INODE_INLINE_SIZE
Inodes can only fit a certain amount of inline data. This inode has its data inline but claims an i_size larger
than will actually fit.

Answering yes to this question updates the inode’s i_size to the maximum available inline space.

INODE_CLUSTERS
Inodes contain a record of how many clusters are allocated to them. An inode was found whose recorded
number of clusters doesn’t match the number of blocks that were found associated with the inode.

Answering yes resets the inode’s number of clusters to reflect the number of blocks that were associated
with the file.

INODE_SPARSE_CLUSTERS
Inodes contain a record of how many clusters are allocated to them. A sparse inode was found whose
recorded number of clusters doesn’t match the number of blocks that were found associated with the inode.

Answering yes resets the inode’s number of clusters to reflect the number of blocks that were associated
with the file.

INODE_INLINE_CLUSTERS
An inline inode should not have clusters allocated. An inode with the inline-data flag set was found with
clusters allocated.

Answering yes resets the inode’s number of clusters to zero.

LALLOC_REPAIR
An active local allocator did not accurately reflect the set of clusters that are free and in use in its region.

Answering yes will update the local allocator bitmap. Each bit that doesn’t match the use of its cluster will
be inverted.

LALLOC_USED
A local allocator records the number of bits that are used in its bitmap. An allocator was found whose used
value doesn’t reflect the number of bits that are set in its bitmap.

Answering yes sets the used value to match the number of bits set in the allocator’s bitmap.

CLUSTER_ALLOC_BIT
A specific cluster’s use didn’t match the setting of its bit in the cluster allocator.

Answering yes will invert the bit in the allocator to match the use of the cluster -- either allocated and in use
or free.

REFCOUNT_FLAG_INVALID
A refcounted file can only exist on a volume with the refcount feature enabled. Fsck has found a file on a
non-refcount volume with the refcount flag set.

Answering yes removes this flag from the file.

REFCOUNT_LOC_INVALID
The refcount loc can only be valid if the file has the refcount flag set. Fsck has found that a file has a
refcount loc while it doesn’t have the refcount flag set.

Answering yes resets the refcount loc to zero for the file.

RB_BLKNO
Refcount blocks contain a record of the disk block where they are located. A refcount block was found at a
block that didn’t match its recorded location.

Answering yes will update the data structure in the refcount block to reflect its real location on disk.

RB_GEN
Refcount blocks are created with a generation number to match the generation number of the volume at the
time of creation. A refcount block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the refcount block is from a previous
file system. The refcount block will be removed and the file that uses it will lose the refcounted informa-
tion, but it may be regenerated later.

RB_GEN_FIX
Refcount blocks are created with a generation number to match the generation number of the volume at the
time of creation. A refcount block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the refcount block is incorrect and that the refcount
block is valid. The generation number in the block is updated to match the generation number in the vol-
ume.

RB_PARENT
Refcount blocks contain a record of the parent they belong to. A refcount block was found storing a wrong
parent location.

Answering yes will update the data structure in the refcount block to reflect its parent’s real location on
disk.

REFCOUNT_LIST_COUNT
The number of entries in a refcount list is bounded by the size of the block which contains it. A refcount
list was found which claims to have more entries than would fit in its container.

Answering yes updates the count field in the refcount list to match the container. Answering no to this ques-
tion may stop further fixes from being done because the count value can not be trusted.

REFCOUNT_LIST_USED
The number of free entries in a refcount list must be less than the total number of entries in the list. A list
was found which claims to have more free entries than possible entries.

Answering yes sets the number of free entries in the list equal to the total possible entries.

REFCOUNT_CLUSTER_RANGE
A refcount record was found which references a cluster which can not be referenced by a refcount. The ref-
erenced cluster is either very early in the volume, and thus reserved, or beyond the end of the volume.

Answering yes removes this refcount record from the tree.

REFCOUNT_CLUSTER_COLLISION
A refcount record was found which references a cluster which has a collision with the previous valid ref-
count record.

Answering yes removes this refcount record from the tree.

REFCOUNT_LIST_EMPTY
A refcount list was found which has no refcount record in it. It is normally caused by a corrupted refcount
record.

Answering yes removes this refcount block from the tree. It will be regenerated by the refcounted extent
record handling if all the other information is sane.

REFCOUNT_BLOCK_INVALID
A refcount block stores the refcount records for the physical clusters of a file. A file was found referring to
an invalid refcount block.

Answering yes removes this refcount block.

REFCOUNT_CLUSTERS
A refcount tree contains a record of how many clusters are allocated to it. A tree was found whose
recorded number of clusters doesn’t match the number of blocks that were found associated with it.

Answering yes resets the number of clusters to reflect the real number of clusters that were associated with
the tree.

REFCOUNT_ROOT_BLOCK_INVALID
The root refcount block is the root of the refcount records for a file. A file was found referring to an invalid
refcount block.

Answering yes removes this refcount block and clears the refcount flag from this file.

REFCOUNT_REC_REDUNDANT
Refcount records are used to store the refcount for physical clusters. A refcount record was found to have
no physical clusters corresponding to it.

Answering yes removes the refcount record.

REFCOUNT_COUNT_INVALID
Refcount records are used to store the refcount for physical clusters. A record was found which claims the
wrong refcount for some physical clusters.

Answering yes updates the corresponding refcount record.

REFCOUNT_COUNT
A refcount tree contains a record of how many files refer to it. A tree was found whose recorded number of
files doesn’t match the number of files actually referring to the tree.

Answering yes resets the number of files to reflect the real number of files that were associated with the
tree.

DUP_CLUSTERS_SYSFILE_CLONE
A system file inode claims clusters that are also claimed by another inode. ocfs2 does not allow this. Sys-
tem files may be cloned but may not be deleted. Allocation system files may not be cloned or deleted.

Answering yes will copy the data of this inode to newly allocated extents. This will break the claim on the
overcommitted clusters.

DUP_CLUSTERS_CLONE
An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will copy the data of this inode to newly allocated extents. This will break the claim on the
overcommitted clusters.

DUP_CLUSTERS_DELETE
An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will remove this inode, thus breaking its claim on the overcommitted clusters.

DUP_CLUSTERS_ADD_REFCOUNT
An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will try to add a refcount record for all these inodes, so that they will share the cluster.

DIRENT_DOTTY_DUP
There can be only one instance of both the "." and ".." entries in a directory. A directory entry was found
which duplicated one of these entries.

Answering yes will remove the duplicate directory entry.

DIRENT_NOT_DOTTY
The first and second directory entries in a directory must be "." and ".." respectively. One of these direc-
tory entries was found to not match these rules.

Answering yes will force the directory entry to be either "." or "..". This might consume otherwise valid
entries and cause some files to appear in lost+found.

DIRENT_DOT_INODE
The inode field of the "." directory entry must refer to the directory inode that contains the given directory
block. A "." entry was found which doesn’t do so.

Answering yes sets the directory entry’s inode reference to the parent directory that contains the entry.

DIRENT_DOT_EXCESS
A "." directory entry was found whose lengths exceeds the amount required for the single dot in the name.

Answering yes creates another empty directory entry in this excess space.

DIRENT_ZERO
A directory entry was found with a zero length name.

Answering yes clears the directory entry so its space can be reused.

DIRENT_NAME_CHARS
Directory entries can not contain either the NULL character (ASCII 0) or the forward slash (ASCII 47). A
directory entry was found which contains either.

Answering yes will change each instance of these forbidden characters into a period (ASCII 46).

DIRENT_INODE_RANGE
Each directory entry contains an inode field which the entry’s name corresponds to. An entry was found
which referenced an inode number that is invalid for the current volume.

Answering yes clears this entry so its space can be reused. If the entry once corresponded to a real inode
and was corrupted this inode may appear in lost+found.

DIRENT_INODE_FREE
Each directory entry contains an inode field which the entry’s name corresponds to. An entry was found
which referenced an inode number that isn’t in use.

Answering yes clears this directory entry.

DIRENT_TYPE
Each directory entry contains a field which describes the type of file that the entry refers to. An entry was
found whose type doesn’t match the inode it is referring to.

Answering yes resets the entry’s type to match the target inode.

DIR_PARENT_DUP
Each directory can only be pointed to by one directory entry in a parent directory. A directory entry was
found which was the second entry to point to a given directory inode.

Answering yes clears this entry which was the second to refer to a given directory. This reflects the policy
that hard links to directories are not allowed.

DIRENT_DUPLICATE
File names within a directory must be unique. A file name occurred in more than one directory entry in a
given directory.

Answering yes renames the duplicate entry to a name that doesn’t collide with recent entries and is unlikely
to collide with future entries in the directory.

DIRENT_LENGTH
There are very few directory entry lengths that are valid. The lengths must be greater than the minimum
required to record a single character directory entry, be rounded to 12 bytes, be within the amount of space
remaining in a directory block, and be properly rounded for the size of the name of the directory entry. An
entry was found which didn’t meet these criteria.

Answering yes will try to repair the directory entry. This runs a very good chance of invalidating all the
entries in the directory block. Orphaned inodes may appear in lost+found.

DIR_TRAILER_INODE
A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields
for when it is viewed as a directory entry. The inode field must be zero.

Answering yes will set the inode field to zero.

DIR_TRAILER_NAME_LEN
A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields
for when it is viewed as a directory entry. The name length field must be zero.

Answering yes will set the name length field to zero.

DIR_TRAILER_REC_LEN
A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields
for when it is viewed as a directory entry. The record length field must be equal to the size of the trailer.

Answering yes will set the record length field to the size of the trailer.

DIR_TRAILER_BLKNO
A directory block trailer is a fake directory entry at the end of the block. The self-referential block number
is incorrect.

Answering yes will set the block number to the correct block on disk.

DIR_TRAILER_PARENT_INODE
A directory block trailer is a fake directory entry at the end of the block. It has a pointer to the directory
inode it belongs to. This pointer is incorrect.

Answering yes will set the parent inode pointer to the inode referencing this directory block.

ROOT_DIR_MISSING
The super block contains a reference to the inode that serves as the root directory. This reference points to
an inode that isn’t in use.

Answering yes will create a new inode and update the super block to refer to this inode as the root direc-
tory.

LOSTFOUND_MISSING
The super block contains a reference to the inode that serves as the lost+found directory. This reference
points to an inode that isn’t in use.

Answering yes will create a new lost+found directory in the root directory.

DIR_NOT_CONNECTED
Every directory in the file system should be reachable by a directory entry in its parent directory. This is
verified by walking every directory in the system. A directory inode was found during this walk which
doesn’t have a parent directory entry.

Answering yes moves this directory entry into the lost+found directory and gives it a name based on its
inode number.

DIR_DOTDOT
A directory inode’s ".." directory entry must refer to the parent directory. A directory was found whose ".."
doesn’t refer to its parent.

Answering yes will read the directory block for the given directory and update its ".." entry to reflect its
parent.

INODE_NOT_CONNECTED
Almost all inodes in the system should be referenced by a directory entry. An inode was found which isn’t
referred to by any directory entry.

Answering yes moves this inode into the lost+found directory and gives it a name based on its inode num-
ber.

INODE_COUNT
Each inode records the number of directory entries that refer to it. An inode was found whose recorded
count doesn’t match the number of entries that refer to it.

Answering yes sets the inode’s count to match the number of referring directory entries.

INODE_ORPHANED
While files are being deleted they are placed in an internal directory. If the machine crashes while this is
taking place the files will be left in this directory. Fsck has found an inode in this directory and would like
to finish the job of truncating and removing it.

Answering yes removes the file data associated with the inode and frees the inode.

RECOVER_BACKUP_SUPERBLOCK
When fsck.ocfs2 successfully uses the specified backup superblock, it provides the user with this option to
overwrite the existing superblock with that backup.

Answering yes will refresh the superblock from the backup. Answering no will only disable the copying of
the backup superblock and will not affect the remaining fsck.ocfs2 processing.

ORPHAN_DIR_MISSING
While files are being deleted, they are placed in an internal directory named the orphan directory. If the
orphan directory does not exist, an OCFS2 volume cannot be mounted successfully. Fsck has found that the
orphan directory is missing and would like to create it for future use.

Answering yes creates the orphan directory in the system directory.

JOURNAL_FILE_INVALID
OCFS2 uses JBD for journalling and some journal files exist in the system directory. Fsck has found some
journal files that are invalid.

Answering yes to this question will regenerate the invalid journal files.

JOURNAL_UNKNOWN_FEATURE
Fsck has found some journal files with unknown features. Other journals on the filesystem have only
known features, so this is likely a corruption. If you think your filesystem may be newer than this version
of fsck.ocfs2, say N here and grab the latest version of fsck.ocfs2.

Answering yes resets the journal features to match other journals.

JOURNAL_MISSING_FEATURE
Fsck has found some journal files that have features not set on all journal files. All journals on the
filesystem should have the same set of features.

Answering yes will set all journals to the union of set features.
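
Conceptually, the repair computes the bitwise union of every journal's feature mask and writes that union
back to each journal. A minimal sketch, with assumed names:

    #include <stdint.h>

    /* Sketch: unify per-journal feature bitmaps. 'features' holds one
     * mask per journal; afterwards every journal carries the union. */
    static void unify_journal_features(uint32_t *features, int num_journals)
    {
            uint32_t all = 0;
            for (int i = 0; i < num_journals; i++)
                    all |= features[i];     /* union of set features */
            for (int i = 0; i < num_journals; i++)
                    features[i] = all;
    }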

JOURNAL_TOO_SMALL
Fsck has found some journal files are too small.

Answering yes extends these journals.

RECOVER_CLUSTER_INFO
The currently active cluster stack differs from the one the filesystem is configured for. Thus, fsck.ocfs2
cannot determine whether the filesystem is mounted on another node or not. The recommended solution is
to exit and run fsck.ocfs2 on this device from a node that has the appropriate active cluster stack. However,
you can proceed with the fsck if you are sure that the volume is not in use on any node.

Answering yes reconfigures the filesystem to use the current cluster stack. DANGER: YOU MUST BE
ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE CONTINU-
ING. OTHERWISE, YOU CAN CORRUPT THE FILESYSTEM AND LOSE DATA.

INLINE_DATA_FLAG_INVALID
An inline file can only exist on a volume with the inline-data feature enabled. Fsck has found a file on a
non-inline volume with the inline-data flag set.

Answering yes removes this flag from the file.

INLINE_DATA_COUNT_INVALID
For an inline file, there is a limit on id2.id_data.id_count. Fsck has found an inode whose count does not
match it.

Answering yes changes this value to the correct number.
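
The correct count is the number of bytes left in the inode block for inline data after the fixed inode fields.
A hedged sketch of the arithmetic; the stand-in structure and its 512-byte header size are assumptions, not
the real struct ocfs2_dinode.

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the on-disk inode; the 512-byte header is assumed
     * for illustration only. */
    struct dinode_sketch {
            uint8_t fixed_fields[512];
            uint8_t id_data[];          /* inline file data fills the rest */
    };

    static uint16_t expected_id_count(uint32_t block_size)
    {
            /* bytes available for inline data in the inode block */
            return block_size - offsetof(struct dinode_sketch, id_data);
    }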

XATTR_BLOCK_INVALID
Extended attributes are stored off an extended attribute block referenced by the inode. This inode refer-
ences an invalid extended attribute block.

Answering yes will remove this block.

XATTR_COUNT_INVALID
The count of extended attributes in an inode, block, or bucket does not match the number of entries found
by fsck.

Answering yes will change this to the correct count.

XATTR_ENTRY_INVALID
An extended attribute entry points to already used space.

Answering yes will remove this entry.

XATTR_NAME_OFFSET_INVALID
The name_offset field of an extended attribute entry is not correct. Without a correct name_offset field, the
entry cannot be used.

Answering yes will remove this entry.

XATTR_VALUE_INVALID
The value region of an extended attribute points to already used space.

Answering yes will remove this entry.

XATTR_LOCATION_INVALID
The xe_local field and xe_value_size field of an extended attribute entry do not match, so the entry cannot
be used.

Answering yes will remove this entry.

XATTR_HASH_INVALID
Extended attributes use a hash of their name for lookup purposes. The name_hash of this extended attribute
entry is not correct.

Answering yes will change this to the correct hash.
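
Because the hash is a deterministic function of the attribute name, fsck can recompute it from the name
alone. Below is a rotate-and-xor sketch in the spirit of the kernel's name hash; the shift constants are
assumptions, not necessarily the ones in fs/ocfs2/xattr.c.

    #include <stdint.h>

    /* Illustrative rotate-and-xor name hash; shift widths are assumed. */
    static uint32_t xattr_name_hash_sketch(const char *name, int len)
    {
            uint32_t hash = 0;
            for (int i = 0; i < len; i++)
                    hash = (hash << 5) ^ (hash >> 27) ^ (uint8_t)name[i];
            return hash;
    }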

XATTR_FREE_START_INVALID
Extended attributes use free_start to indicate the offset of the free space in an inode, block, or bucket. The
free_start field of this object is not correct.

Answering yes will change this to the correct offset.

XATTR_VALUE_LEN_INVALID
Extended attributes use name_value_len to store the total length of all entries’ names and values in an
inode, block, or bucket. The name_value_len field of this object is not correct.

Answering yes will change this to the correct value.

XATTR_BUCKET_COUNT_INVALID
The count of extended attribute buckets pointed to by an extent record does not match the number of
buckets found by fsck.

Answering yes will change this to the correct count.

QMAGIC_INVALID
The magic number in the header of a quota file does not match the expected value.

Answering yes will make fsck use the values in the quota file header anyway.

QTREE_BLK_INVALID
A block holding references to other blocks of quota data is corrupted.

Answering yes will make fsck use the references in the block anyway.

DQBLK_INVALID
A structure with quota limits was found in a corrupted block.

Answering yes will use the limit values found for the user / group.

DUP_DQBLK_INVALID
A structure with quota limits was found in a corrupted block, but fsck has already found quota limits for
this user / group.

Answering yes will use the new limit values for the user / group.

DUP_DQBLK_VALID
A structure with quota limits was found in a valid block, but fsck has already found quota limits for this
user / group.

Answering yes will use the new limit values for the user / group.

IV_DX_TREE
A directory index was found on an inode but that feature is not enabled on the file system.

Answering yes will truncate the invalid index.

DX_LOOKUP_FAILED
A directory entry is missing an entry in the directory index. The missing index entry will cause lookups on
this name to fail.

Answering yes will rebuild the directory index, restoring the missing entry.

NO_HOLES
A metadata structure contains a hole where none is allowed. Examples of such structures are directories,
refcount trees, dx_trees, etc.

Answering yes will remove the hole by updating the offset to the expected value.

EXTENT_OVERLAP
The extents of the file overlap, meaning two or more extents could supply the data for a particular offset
in the file.

Answering yes will serialize the extents.
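
Overlap can be detected by sorting the extent records by starting offset and verifying that each extent ends
at or before the start of the next. A minimal sketch with an assumed record layout:

    #include <stdint.h>

    /* Assumed record layout: an extent of 'clusters' clusters starting
     * at logical cluster offset 'cpos' within the file. */
    struct extent_sketch {
            uint32_t cpos;
            uint32_t clusters;
    };

    /* 'recs' must be sorted by cpos; returns 1 if two extents claim the
     * same file offset, which is what this check detects. */
    static int extents_overlap(const struct extent_sketch *recs, int n)
    {
            for (int i = 0; i + 1 < n; i++)
                    if (recs[i].cpos + recs[i].clusters > recs[i + 1].cpos)
                            return 1;
            return 0;
    }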

SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8)
o2info(1) tunefs.ocfs2(8)

AUTHORS
Oracle Corporation.

COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 20
