
AprilTag: A robust and flexible visual fiducial system

Edwin Olson
University of Michigan
ebolson@umich.edu
http://april.eecs.umich.edu

Abstract— While the use of naturally-occurring features is a central focus of machine perception, artificial features (fiducials) play an important role in creating controllable experiments, ground truthing, and in simplifying the development of systems where perception is not the central objective.
We describe a new visual fiducial system that uses a 2D bar
code style “tag”, allowing full 6 DOF localization of features
from a single image. Our system improves upon previous
systems, incorporating a fast and robust line detection system,
a stronger digital coding system, and greater robustness to
occlusion, warping, and lens distortion. While similar in concept
to the ARTag system, our method is fully open and the
algorithms are documented in detail.

I. INTRODUCTION

Fig. 1. Example detections. This paper describes a visual fiducial system based on 2D planar targets. The detector is robust to lighting variation and occlusions and produces accurate localizations of the tags.

Visual fiducials are artificial landmarks designed to be easy to recognize and distinguish from one another. Although related to other 2D barcode systems such as QR codes [1], they have significantly different goals and applications. With a QR code, a human is typically involved in aligning the camera with the tag and photographs it at fairly high resolution, obtaining hundreds of bytes, such as a web address. In contrast, a visual fiducial has a small information payload (perhaps 12 bits), but is designed to be automatically detected and localized even when it is at very low resolution, unevenly lit, oddly rotated, or tucked away in the corner of an otherwise cluttered image. Aiding their detection at long ranges, visual fiducials are comprised of many fewer data cells: the alignment markers of a QR tag comprise about 268 pixels (not including required headers or the payload), whereas the visual fiducials described in this paper range from about 49 to 100 pixels, including the payload.

Unlike 2D barcode systems in which the position of the barcode in the image is unimportant, visual fiducial systems provide the camera-relative position and orientation of a tag. Fiducial systems are also designed to detect multiple markers in a single image.

Visual fiducial systems are perhaps best known for their application to augmented reality, which spurred the development of several popular systems including ARToolkit [2] and ARTag [3]. Real-world objects can be augmented with visual fiducials, allowing virtually-generated imagery to be super-imposed. Similarly, visual fiducials can be used for basic motion capture [4].

Visual fiducial systems have been used to improve human/robot interaction, allowing humans to signal commands (such as “follow me” or “wait here”) by flashing an appropriate card to a robot [5]. Planar tags have also been used to generate user interfaces that overlay robots’ plans and task assignments onto a head-mounted display [6].

Performance evaluation and benchmarking of robot systems have become central issues for the research community; visual fiducials are particularly helpful in this domain. For example, fiducials can be used to generate ground-truth robot trajectories and close control loops [7]. Similarly, artificial features can make it possible to evaluate Simultaneous Localization and Mapping (SLAM) algorithms under controlled conditions [8]. Robotics applications have led to the development of additional tag detection systems [9], [10].

Designing robust fiducials while minimizing the size required is challenging both from a marker detection standpoint (which pixels in the image correspond to a tag?) and from an error-tolerant data coding standpoint (which tag is it?).

In this paper, we describe a new visual fiducial system that significantly improves performance over previous systems. The central contributions of this paper are:

• We describe a method for robustly detecting visual fiducials. We propose a graph-based image segmentation algorithm based on local gradients that allows lines to be precisely estimated. We also describe a quad extraction method that can handle significant occlusions.
• We demonstrate that our detection system provides significantly better localization accuracy than previous systems.
• We describe a new coding system that addresses problems unique to 2D barcoding systems: robustness to rotation, and robustness to false positives arising from natural imagery. As demonstrated by our experimental results, our coding system provides significant theoretical and real-world benefits over previous work.
• We specify and provide results on a set of benchmarks which will allow better comparisons of fiducial systems in the future.

In contrast to previous methods (including ARTag and Studierstube Tracker), our implementation is released under an Open Source license, and its algorithms and implementation are well documented. The closed nature of these other systems was a challenge for our experimental evaluation. For the comparisons in this paper, we have used the limited publicly-available information to enable as many objective comparisons as possible. On the other hand, ARToolkitPlus is open source and so we were able to make a more detailed comparison. In addition to code, we are also making our evaluation code available in order to make it easier for future authors to perform comparisons.

In the next section, we review related work. We describe our method in the following two sections: the tag detector in Section 3, and the coding system in Section 4. In Section 5, we provide an experimental evaluation of our methods, comparing them to previous algorithms.

II. PREVIOUS WORK

ARToolkit [11] was among the first tag tracking systems, and was targeted at augmented reality applications. Like the systems that would follow, its tags contained a square-shaped payload surrounded by a black border. It differed, however, in that its payload was not directly encoded in binary: instead, it used symbols such as the latin character ’A’. When decoding a tag, the payload of the tag (sampled at high resolution) was correlated against a database of known tags, with the best-correlating tag reported to the user. A major disadvantage of this approach is the computational cost associated with decoding tags, since each template required a separate, slow correlation operation. A second disadvantage is that it is difficult to generate templates that are approximately orthogonal to each other.

The tag detection scheme used by ARToolkit is based on a simple binarization of the input image based on a user-specified threshold. This scheme is very fast, but not robust to changes in illumination. In general, ARToolkit’s detections cannot handle even modest occlusions of the tag’s border.

ARTag [3] provided improved detection and coding schemes. Like our own approach, the detection mechanism was based on the image gradient, making it robust to changes in lighting. While the details of the detector algorithm are not public, ARTag’s detection mechanism is able to detect tags whose border is partially occluded. ARTag also provided the first coding system based on forward error correction, which made tags easier to generate, faster to correlate, and provided greater orthogonality between tags.

The performance of ARTag inspired several improvements to ARToolkit, which evolved into ARToolkitPlus [2], and finally Studierstube Tracker [12]. These versions introduced digitally-encoded payloads like those used in ARTag. Despite being later work, our experiments show that these encoding systems do not perform as well as that used by ARTag, which in turn is outperformed by our coding system.

In addition to monochrome tags, other coding systems have been developed. For example, color information has been used to increase the amount of information that can be encoded [13], [14]. Tags using retro-reflectors [15] have also been used. A particularly interesting approach is that used by Bokode [16], which exploits the bokeh effect to detect extremely small tags by intentionally defocusing the camera.

Besides two-dimensional barcodes, a number of other artificial landmarks have been developed. Overhead cameras have been used to track robots equipped with blinking LEDs [17]. In contrast, the NorthStar system puts the visual fiducials on the ceiling [18]. Two-dimensional planar systems, like the one described in this paper, have two main advantages over LED-based systems: the targets can be cheaply printed on a standard printer, and they provide 6DOF position estimation without the need for multiple LEDs.

III. DETECTOR

Fig. 2. Input image. This paper describes the processing of this sample image which contains two tags. This example is purposefully simple for explanatory reasons, though note that the tags are not rigidly planar. See Fig. 1 for a more challenging result.

Our system is composed of two major components: the tag detector and the coding system. In this section, we describe the detector, whose job is to estimate the position of possible tags in an image. Loosely speaking, the detector attempts to find four-sided regions (“quads”) that have a darker interior than their exterior. The tags themselves have black and white borders in order to facilitate this (see Fig. 2).

The detection process is comprised of several distinct phases, which are described in the following subsections and illustrated using the example shown in Fig. 2.

Note that the quad detector is designed to have a very low false negative rate, and consequently has a high false positive rate. We rely on the coding system (described in the next section) to reduce this false positive rate to useful levels.
Fig. 3. Early processing steps. The tag detection algorithm begins by computing the gradient at every pixel, computing their magnitudes (first) and directions (second). Using a graph-based method, pixels with similar gradient directions and magnitudes are clustered into components (third). Using weighted least squares, a line segment is then fit to the pixels in each component (fourth). The direction of each line segment is determined by the gradient direction, so that segments are dark on the left, light on the right. The directions of the lines are visualized by short perpendicular “notches” at their midpoints; note that these “notches” always point towards the lighter region.

A. Detecting line segments


Our approach begins by detecting lines in the image. Similar in basic approach to the ARTag detector, our method computes the gradient direction and magnitude at every pixel (see Fig. 3) and agglomeratively clusters the pixels into components with similar gradient directions and magnitudes.

The clustering algorithm is similar to the graph-based method of Felzenszwalb [19]: a graph is created in which each node represents a pixel. Edges are added between adjacent pixels with an edge weight equal to the pixels’ difference in gradient direction. These edges are then sorted and processed in terms of increasing edge weight: for each edge, we test whether the connected components that the pixels belong to should be joined together. Given a component n, we denote the range of gradient directions as D(n) and the range of magnitudes as M(n). Put another way, D(n) and M(n) are scalar values representing the difference between the maximum and minimum values of the gradient direction and magnitude respectively. In the case of D(), some care must be taken to handle 2π wrap-around. However, since useful edges will have a span of much less than π degrees, this is straightforward. Given two components n and m, we join them together if both of the conditions below are satisfied:

D(n ∪ m) ≤ min(D(n), D(m)) + KD/|n ∪ m|
M(n ∪ m) ≤ min(M(n), M(m)) + KM/|n ∪ m|    (1)

The conditions are adapted from [19] and can be intuitively understood: small values of D() and M() indicate components with little intra-component variation. Two clusters are joined together if their union is about as uniform as the clusters taken individually. A modest increase in intra-component variation is permitted via the KD and KM parameters; however, this allowance rapidly shrinks as the components become larger. During early iterations, the K parameters essentially allow each component to “learn” its intra-cluster variation. In our experiments, we used KD = 100 and KM = 1200, though the algorithm works well over a broad range of values.

For performance reasons, the edge weights are quantized and stored as fixed-point numbers. This allows the edges to be sorted using a linear-time counting sort [20]. The actual merging operation can be efficiently carried out by the union-find algorithm [20], with the upper and lower bounds of gradient direction and magnitude stored in a simple array indexed by each component’s representative member.
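To make the merge test concrete, the following minimal Java sketch shows one way the union-find step might maintain these bounds. The class and field names are illustrative rather than taken from our implementation, and the 2π wrap-around handling is omitted:

    // Illustrative sketch: union-find merge honoring Eqn. 1. Field names are
    // hypothetical; 2π wrap-around of gradient directions is omitted here.
    class GradientClusters {
        int[] parent, size;
        double[] dMin, dMax, mMin, mMax;   // bounds, indexed by representative
        static final double KD = 100, KM = 1200;

        int find(int a) {                  // path-compressing find
            while (parent[a] != a) { parent[a] = parent[parent[a]]; a = parent[a]; }
            return a;
        }

        // Process one edge (pixels a, b), merging their components if the
        // union's direction/magnitude ranges satisfy both conditions of Eqn. 1.
        void tryMerge(int a, int b) {
            int ra = find(a), rb = find(b);
            if (ra == rb) return;
            int n = size[ra] + size[rb];
            double dUnion = Math.max(dMax[ra], dMax[rb]) - Math.min(dMin[ra], dMin[rb]);
            double mUnion = Math.max(mMax[ra], mMax[rb]) - Math.min(mMin[ra], mMin[rb]);
            double dBest = Math.min(dMax[ra] - dMin[ra], dMax[rb] - dMin[rb]);
            double mBest = Math.min(mMax[ra] - mMin[ra], mMax[rb] - mMin[rb]);
            if (dUnion <= dBest + KD / n && mUnion <= mBest + KM / n) {
                parent[rb] = ra;           // union; ra becomes the representative
                size[ra] = n;
                dMin[ra] = Math.min(dMin[ra], dMin[rb]); dMax[ra] = Math.max(dMax[ra], dMax[rb]);
                mMin[ra] = Math.min(mMin[ra], mMin[rb]); mMax[ra] = Math.max(mMax[ra], mMax[rb]);
            }
        }
    }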
This gradient-based clustering method is sensitive to noise in the image: even modest amounts of noise will cause local gradient directions to vary, inhibiting the growth of the components. The solution to this problem is to low-pass filter the image [19], [21]. Unlike other problem domains where this filtering can blur useful information in the image, the edges of a tag are intrinsically large-scale features (particularly in comparison to the data field), and so this filtering does not cause information loss. We recommend a value of σ = 0.8.

Once the clustering operation is complete, line segments are fit to each connected component using a traditional least-squares procedure, weighting each point by its gradient magnitude (see Fig. 3). We adjust each line segment so that the dark side of the line is on its left, and the light side is on its right. In the next phase of processing, this allows us to enforce a winding rule around each quad.
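A minimal sketch of this weighted fit, assuming the pixel coordinates and gradient-magnitude weights of one component have been collected; the line direction is the dominant eigenvector of the weighted second-moment matrix, and the helper is illustrative:

    // Sketch: fit a line to one component by weighted least squares, with
    // weights given by the pixels' gradient magnitudes.
    static double[] fitLine(double[] x, double[] y, double[] w) {
        double W = 0, mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { W += w[i]; mx += w[i] * x[i]; my += w[i] * y[i]; }
        mx /= W; my /= W;                          // weighted centroid lies on the line
        double cxx = 0, cxy = 0, cyy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            cxx += w[i] * dx * dx; cxy += w[i] * dx * dy; cyy += w[i] * dy * dy;
        }
        // angle of the dominant eigenvector of [[cxx, cxy], [cxy, cyy]];
        // a production version would also orient the segment so that the
        // dark side falls on its left, using the mean gradient direction
        double theta = 0.5 * Math.atan2(2 * cxy, cxx - cyy);
        return new double[] { mx, my, theta };     // point on line + direction
    }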
The segmentation algorithm is the slowest phase in our detection scheme. As an option, this segmentation can be performed at half the image resolution with a 4x improvement in speed. The sub-sampling operation can be efficiently combined with the recommended low-pass filter. The consequence of this optimization is a modestly reduced detection range, since very small quads may no longer be detected.

B. Quad detection

Fig. 4. Quad detection and sampling. Four quads are detected in the image (which contains two tags). The third detected quad corresponds to three of the edges of the foreground tag plus the edge of the paper (see Fig. 2). A fourth quad is detected around one of the payload bits of the larger tag. These two extraneous detections are eventually discarded because their payload is invalid. The white dots correspond to samples around the tag’s border which are used to fit a linear model of the intensity of “white” pixels; a model is similarly fit for the black pixels. These two models are used to threshold the data payload bits, shown as yellow dots.

At this point, a set of directed line segments has been computed for an image. The next task is to find sequences of line segments that form a 4-sided shape, i.e., a quad. The challenge is to do this while being as robust as possible to occlusions and noise in the line segmentations.

Our approach is based on a recursive depth-first search with a depth of four: each level of the search tree adds an edge to the quad. At depth one, we consider all line segments. At depths two through four, we consider all of the line segments that begin “close enough” to where the previous line segment ended and which obey a counter-clockwise winding order. Robustness to occlusions and segmentation errors is handled by adjusting the “close enough” threshold: by making the threshold large, significant gaps around the edges can be handled. Our threshold for “close enough” is twice the length of the line plus five additional pixels. This is a large threshold which leads to a low false negative rate, but also results in a high false positive rate.

We populate a two-dimensional lookup table to accelerate queries for line segments that begin near a point in space. With this optimization, along with early rejection of candidate quads that do not obey the winding rule, or which use a segment more than once, the quad detection algorithm represents a small fraction of the total computational requirements.
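The following illustrative sketch captures the structure of this search. The Segment type and helpers are simplified stand-ins, the linear scan stands in for the lookup table, and deduplication of quads found from different starting edges is omitted:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the recursive depth-four search for quads over directed
    // line segments. Types and helpers are illustrative simplifications.
    class QuadSearch {
        static class Segment {
            double x0, y0, x1, y1;
            double length() { return Math.hypot(x1 - x0, y1 - y0); }
        }

        // "close enough": within twice the segment's length plus five pixels
        static boolean closeEnough(Segment s, double px, double py) {
            return Math.hypot(px - s.x1, py - s.y1) <= 2 * s.length() + 5;
        }

        // counter-clockwise winding: cross product of the two directions
        static boolean ccw(Segment a, Segment b) {
            return (a.x1 - a.x0) * (b.y1 - b.y0) - (a.y1 - a.y0) * (b.x1 - b.x0) > 0;
        }

        static void search(List<Segment> all, Segment[] path, int depth, List<Segment[]> quads) {
            if (depth == 4) {                      // candidate quad: must close the loop
                if (closeEnough(path[3], path[0].x0, path[0].y0)) quads.add(path.clone());
                return;
            }
            for (Segment next : all) {             // the 2D lookup table accelerates this scan
                boolean used = false;              // never use a segment twice
                for (int j = 0; j < depth; j++) used |= (path[j] == next);
                if (used) continue;
                if (!closeEnough(path[depth - 1], next.x0, next.y0)) continue;
                if (!ccw(path[depth - 1], next)) continue;
                path[depth] = next;
                search(all, path, depth + 1, quads);
            }
        }

        static List<Segment[]> findQuads(List<Segment> all) {
            List<Segment[]> quads = new ArrayList<>();
            for (Segment s : all) {                // depth one: all segments
                Segment[] path = new Segment[4];
                path[0] = s;
                search(all, path, 1, quads);
            }
            return quads;
        }
    }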
Once four lines have been found, a candidate quad detection is created. The corners of this quad are the intersections of the lines that comprise it. Because the lines are fit using data from many pixels, these corner estimates are accurate to a small fraction of a pixel.

C. Homography and extrinsics estimation

We compute the 3 × 3 homography matrix that projects 2D points in homogeneous coordinates from the tag’s coordinate system (in which [0 0 1]T is at the center of the tag and the tag extends one unit in the x̂ and ŷ directions) to the 2D image coordinate system. The homography is computed using the Direct Linear Transform (DLT) algorithm [22]. Note that since the homography projects points in homogeneous coordinates, it is defined only up to scale.

Computation of the tag’s position and orientation requires additional information: the camera’s focal length and the physical size of the tag. The 3 × 3 homography matrix (computed by the DLT) can be written as the product of the 3 × 4 camera projection matrix P (which we assume is known) and the 4 × 3 truncated extrinsics matrix E. Extrinsics matrices are typically 4 × 4, but every position on the tag is at z = 0 in the tag’s coordinate system. Thus, we can rewrite every tag coordinate as a 2D homogeneous point with z implicitly zero, and remove the third column of the extrinsics matrix, forming the truncated extrinsics matrix. We represent the rotation components of E as Rij and the translation components as Tk. We also represent the unknown scale factor as s:

\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} = sPE = s \begin{bmatrix} f_x & 0 & 0 & 0 \\ 0 & f_y & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R_{00} & R_{01} & T_x \\ R_{10} & R_{11} & T_y \\ R_{20} & R_{21} & T_z \\ 0 & 0 & 1 \end{bmatrix}    (2)

Note that we cannot directly solve for E because P is rank deficient. We can expand the right hand side of Eqn. 2, and write the expression for each hij as a set of simultaneous equations:

h00 = sR00 fx
h01 = sR01 fx
h02 = sTx fx
...    (3)

These are all easily solved for the elements of Rij and Tk except for the unknown scale factor s. However, since the columns of a rotation matrix must all be of unit magnitude, we can constrain the magnitude of s. We have two columns of the rotation matrix, so we compute s as the geometric average of their magnitudes. The sign of s can be recovered by requiring that the tag appear in front of the camera, i.e., that Tz < 0. The third column of the rotation matrix can be recovered by computing the cross product of the two known columns, since the columns of a rotation matrix must be orthonormal.

The DLT procedure and the normalization procedure above do not guarantee that the rotation matrix is strictly orthonormal. To correct this, we compute the polar decomposition of R, which yields a proper rotation matrix while minimizing the Frobenius matrix norm of the error [23].
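As an illustration, the recovery described above might be sketched in Java as follows. The helper is hypothetical (not an excerpt of our implementation), the homography is row-major, and the final polar-decomposition cleanup is omitted:

    // Sketch: recover the truncated extrinsics [R | T] from the homography H
    // and the known focal lengths, following Eqns. 2-3.
    static double[][] extrinsicsFromH(double[][] h, double fx, double fy) {
        double[][] m = new double[3][3];
        for (int j = 0; j < 3; j++) {
            m[0][j] = h[0][j] / fx;       // first row of H carried a factor of fx
            m[1][j] = h[1][j] / fy;       // second row carried a factor of fy
            m[2][j] = h[2][j];
        }
        // s: geometric average of the two rotation columns' magnitudes
        double n0 = Math.sqrt(m[0][0]*m[0][0] + m[1][0]*m[1][0] + m[2][0]*m[2][0]);
        double n1 = Math.sqrt(m[0][1]*m[0][1] + m[1][1]*m[1][1] + m[2][1]*m[2][1]);
        double s = Math.sqrt(n0 * n1);
        if (m[2][2] / s > 0) s = -s;      // choose the sign so that Tz < 0
        double[] r0 = { m[0][0]/s, m[1][0]/s, m[2][0]/s };
        double[] r1 = { m[0][1]/s, m[1][1]/s, m[2][1]/s };
        double[] t  = { m[0][2]/s, m[1][2]/s, m[2][2]/s };
        double[] r2 = {                    // third column: cross product r0 x r1
            r0[1]*r1[2] - r0[2]*r1[1],
            r0[2]*r1[0] - r0[0]*r1[2],
            r0[0]*r1[1] - r0[1]*r1[0] };
        // (a polar decomposition of [r0 r1 r2] would then snap R to a proper
        // rotation matrix, as described above)
        return new double[][] {
            { r0[0], r1[0], r2[0], t[0] },
            { r0[1], r1[1], r2[1], t[1] },
            { r0[2], r1[2], r2[2], t[2] } };
    }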
IV. PAYLOAD DECODING

The final task is to read the bits from the payload field. We do this by computing the tag-relative coordinates of each bit field, transforming them into image coordinates using the homography, and then thresholding the resulting pixels. In order to be robust to lighting (which can vary not only from tag to tag, but also within a tag), we use a spatially-varying threshold.

Specifically, we build a spatially-varying model of the intensity of “black” pixels, and a second model for the intensity of “white” pixels. We use the border of the tag, which contains known examples of both white and black pixels, to learn this model (see Fig. 4). We use the following intensity model:

I(x, y) = Ax + Bxy + Cy + D    (4)
This model has four parameters which are easily computed using least squares regression. We build two such models, one for black, the other for white. The threshold used when decoding data bits is then just the average of the predicted intensity values of the black and white models.
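A minimal sketch of this fit, accumulating the 4 × 4 normal equations over the border samples and solving them directly (illustrative helper; no pivoting, for brevity):

    // Sketch: least-squares fit of I(x,y) = Ax + Bxy + Cy + D (Eqn. 4) to
    // sampled border pixels, via the normal equations. Returns {A, B, C, D}.
    static double[] fitIntensityModel(double[] xs, double[] ys, double[] vals) {
        double[][] aug = new double[4][5];            // augmented [A^T A | A^T b]
        for (int i = 0; i < xs.length; i++) {
            double[] row = { xs[i], xs[i] * ys[i], ys[i], 1.0 };
            for (int r = 0; r < 4; r++) {
                for (int c = 0; c < 4; c++) aug[r][c] += row[r] * row[c];
                aug[r][4] += row[r] * vals[i];
            }
        }
        for (int p = 0; p < 4; p++)                   // Gauss-Jordan elimination
            for (int r = 0; r < 4; r++) {             // (no pivoting, for brevity)
                if (r == p) continue;
                double f = aug[r][p] / aug[p][p];
                for (int c = p; c <= 4; c++) aug[r][c] -= f * aug[p][c];
            }
        double[] abcd = new double[4];
        for (int r = 0; r < 4; r++) abcd[r] = aug[r][4] / aug[r][r];
        return abcd;
    }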
V. CODING SYSTEM

Once the data payload is decoded from a quad, it is the job of the coding system to determine whether it is valid or not. The goals of a coding system are to:

• Maximize the number of distinguishable codes
• Maximize the number of bit errors that can be detected or corrected
• Minimize the false positive/inter-tag confusion rate
• Minimize the total number of bits per tag (and thus the size of the tag)

These goals are often in conflict, and so a given code represents a trade-off. In this section, we describe a new coding system based on lexicodes that provides significant advantages over previous methods. Our procedure can generate lexicodes with a variety of properties, allowing the user to use a code that best fits their needs.

A. Methodology

We propose the use of a modified lexicode [24]. Classical lexicodes are parameterized by two quantities: the number of bits n in each codeword and the minimum Hamming distance between any two codewords d. Lexicodes can correct ⌊(d − 1)/2⌋ bit errors and detect d/2 bit errors. For convenience, we will denote a 36 bit encoding with a minimum Hamming distance of 10 (for example) as a 36h10 code.

Lexicodes derive their name from the heuristic used to generate valid codewords: candidate codewords are considered in lexicographic order (from smallest to largest), adding new codewords to the codebook when they are at least a distance d from every codeword previously added to the codebook. While very simple, this scheme is often very close to optimal [25].

In the case of visual fiducials, the coding scheme must be robust to rotation. In other words, it is critical that when a tag is rotated by 90, 180, or 270 degrees, it still have a Hamming distance of d from every other code. The standard lexicode generation algorithm does not guarantee this property. However, the standard generation algorithm can be trivially extended to support this: when testing a new candidate codeword, we can simply ensure that all four rotations have the required minimum Hamming distance. The fact that the lexicode algorithm can be easily extended to incorporate additional constraints is an advantage of our approach.
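An illustrative sketch of this rotation-aware generation loop for d × d-bit tags, with bit (row, col) packed at index row·d + col of a long; the additional check against the candidate’s own rotations (so that tag orientation remains decodable) is an assumption of this sketch rather than a quoted detail:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: lexicode generation with the rotation constraint.
    class LexicodeGenerator {
        final int d, minHamming;   // tag is d x d bits; codeword length n = d*d
        LexicodeGenerator(int d, int minHamming) { this.d = d; this.minHamming = minHamming; }

        long rotate90(long w) {                        // (row, col) -> (col, d-1-row)
            long r = 0;
            for (int row = 0; row < d; row++)
                for (int col = 0; col < d; col++)
                    r |= ((w >> (row * d + col)) & 1L) << (col * d + (d - 1 - row));
            return r;
        }

        boolean acceptable(long cand, List<Long> codebook) {
            long r = cand;
            for (int i = 0; i < 4; i++) {
                // require distance to the candidate's own rotations (so its
                // orientation is decodable) and to every existing codeword
                if (i > 0 && Long.bitCount(r ^ cand) < minHamming) return false;
                for (long c : codebook)
                    if (Long.bitCount(r ^ c) < minHamming) return false;
                r = rotate90(r);
            }
            return true;
        }

        List<Long> generate() {                        // offline; expensive for 6x6
            List<Long> codebook = new ArrayList<>();
            for (long cand = 0; cand < (1L << (d * d)); cand++)  // lexicographic order
                if (acceptable(cand, codebook)) codebook.add(cand);
            return codebook;
        }
    }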
Some codewords, despite satisfying the Hamming distance constraint, are poor choices. For example, a code word consisting of all zeros would result in a tag that looks like a single black square. Such simple geometric patterns commonly occur in natural scenes, resulting in false positives. The ARTag encoding system, for example, explicitly forbids two codes because they are too likely to occur by chance.

Rather than identify problematic tags manually, we further modify the lexicode generation algorithm by rejecting candidate codewords that result in simple geometric patterns. Our metric is based on the number of rectangles required to generate the tag’s 2D pattern. For example, a solid pattern requires just one rectangle, while a black-white-black stripe would require two rectangles (one large black rectangle with a smaller white rectangle drawn second). Our hypothesis, supported by experimental results later in this paper, is that tag patterns with high complexity (which require many rectangles to be reconstructed) occur less frequently in nature and thus lead to lower false positive rates.

Using this idea, we again modify the lexicode generation algorithm to reject candidate codewords that are too simple. We approximate the number of rectangles required to generate the tag’s pattern with a simple greedy approach that repeatedly considers all possible rectangles and adds the one that reduces the error the most. Since the tags are generally very small, this computation is not a bottleneck. Tags with a minimum complexity less than a threshold (typically 10 in our experiments) are rejected. The appropriateness and effectiveness of this heuristic are demonstrated in the results section.
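A sketch of this greedy covering estimate; the all-white initial canvas and the tie-breaking are simplifications of our own rather than specified details:

    // Sketch: estimate a tag's complexity as the number of axis-aligned
    // rectangles a greedy painter needs to reproduce its d x d pattern.
    static int complexity(boolean[][] target) {
        int d = target.length;
        boolean[][] canvas = new boolean[d][d];        // all false ("white") initially
        for (int count = 0; ; count++) {
            int bestGain = 0;
            int[] best = null;
            boolean bestColor = false;
            for (int r0 = 0; r0 < d; r0++) for (int r1 = r0; r1 < d; r1++)
            for (int c0 = 0; c0 < d; c0++) for (int c1 = c0; c1 < d; c1++)
                for (boolean color : new boolean[] { false, true }) {
                    int gain = 0;                      // net cells fixed by this paint
                    for (int r = r0; r <= r1; r++)
                        for (int c = c0; c <= c1; c++)
                            gain += (target[r][c] == color)
                                    ? (canvas[r][c] == color ? 0 : 1)   // becomes correct
                                    : (canvas[r][c] == color ? 0 : -1); // becomes wrong
                    if (gain > bestGain) {
                        bestGain = gain; best = new int[]{ r0, c0, r1, c1 }; bestColor = color;
                    }
                }
            if (best == null) return count;            // canvas matches the target
            for (int r = best[0]; r <= best[2]; r++)
                for (int c = best[1]; c <= best[3]; c++) canvas[r][c] = bestColor;
        }
    }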
Lastly, we have empirically observed lower false positive rates by making one more modification to the lexicode generation algorithm. Rather than test codewords in order (0, 1, 2, 3, ...), we consider (b, b + p, b + 2p, b + 3p, ...), where b is an arbitrary number, p is a large prime, and the lowest n bits are kept at each step. Intuitively, the tags generated by this method have greater entropy at every bit position; the lexicographic order, on the other hand, favors small-valued codes. The disadvantage of this method is that fewer distinguishable codes are created: the lexicographic ordering tends to pack codewords quite densely, whereas the more random order results in a less efficient packing of codewords.

To summarize, we use a lexicode system that can generate codes for any arbitrary tag size (e.g., 3x3, 4x4, 5x5, 6x6) and minimum Hamming distance. Our approach explicitly guarantees the minimum Hamming distance for all four rotations of each tag and eliminates tags which are of low geometric complexity. Computing the tags can be an expensive operation, but is done offline. Small tags (5x5) can be easily computed in seconds or minutes, but larger tags (6x6) can take several days of CPU time. Many useful code families are already computed and distributed with our software; most users will not need to generate their own code families.

B. Error correction analysis

Theoretical false positive rates can be easily estimated. Assume that a false quad is identified and that the bit pattern is random. The probability of a false positive is the fraction of codewords which are accepted as valid tags versus the total number of possible codewords, 2^n. More aggressive error correction increases this rate, since it increases the number of codewords that are accepted. This unavoidable increase in error rate is illustrated for the 36h10 and 36h15 codes below:

Bits corrected   36h10 FP (%)   36h15 FP (%)
0                0.000001       0.000000
1                0.000041       0.000002
2                0.000744       0.000029
3                0.008714       0.000341
4                0.074459       0.002912
5                0.495232       0.019370
6                N/A            0.104403
7                N/A            0.468827
Of course, the better performance of the 36h15 encoding comes at a price: there are only 27 distinguishable codewords, as opposed to 36h10’s 2221 distinguishable codewords.
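For reference, a hedged sketch of this estimate: it counts the words within t bit errors of each of the N codewords and assumes those acceptance regions do not overlap, so it may differ slightly from exact tabulated values:

    // Sketch: theoretical false positive rate for an n-bit code with N
    // codewords when up to t bit errors are corrected, assuming the
    // acceptance balls around codewords do not overlap.
    static double falsePositiveRate(int n, long numCodewords, int t) {
        double ball = 0, choose = 1;                   // choose = C(n, k)
        for (int k = 0; k <= t; k++) {
            ball += choose;                            // words at distance exactly k
            choose = choose * (n - k) / (k + 1);
        }
        return numCodewords * ball / Math.pow(2, n);   // fraction of random words accepted
    }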
Fig. 5. Hamming distances. Good codes have large Hamming distances between valid codewords. Shown above are the Hamming distances for several contemporary systems. Note that our coding scheme guarantees a minimum Hamming distance by construction, whereas other systems have some very similar codewords, which leads to higher inter-tag confusion rates.

Our coding scheme is significantly stronger than previous schemes, including that used by ARTag and both systems used by ARToolkitPlus: our coding system achieves a greater minimum Hamming distance between all pairs of codewords while encoding a larger number of distinguishable ids. This improvement in minimum Hamming distance is illustrated in Fig. 5 and in the table below:

Encoding Scheme        Length   Unique codes   Min. Hamming
ARToolkit+ (simple)    36       512            4
ARToolkit+ (BCH)       36       4096           2
ARTag                  36       2046           4
Proposed (36h9)        36       4164           9
Proposed (36h10)       36       2221           10
In order to decode a possibly-corrupted code word, the Hamming distance between the code word and each valid code word in the code book is computed. If the best match has a Hamming distance less than the user-specified threshold, a detection is reported. By specifying this threshold, the user is able to control the tradeoff between false positives and false negatives.

A disadvantage of our method is that this decoding process takes linear time in the size of the codebook, since every valid codeword must be considered. However, the coefficient is so small that the computational complexity is negligible in comparison to the other image processing steps.
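A sketch of that scan (rotation handling omitted; in practice the observed payload must also be tested under its four rotations):

    // Sketch: linear-time decoding by nearest codeword in Hamming distance.
    // Returns the codeword index, or -1 if no match within the threshold.
    static int decode(long observed, long[] codebook, int maxBitErrors) {
        int best = -1, bestDist = Integer.MAX_VALUE;
        for (int i = 0; i < codebook.length; i++) {
            int dist = Long.bitCount(observed ^ codebook[i]);
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return bestDist <= maxBitErrors ? best : -1;
    }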
For a given coding scheme, larger tags (i.e., those with 36 bits versus 25 bits) have dramatically better coding performance than smaller tags, although this comes at a price. All other things being equal, the range at which a given camera can read a 36 bit tag will be shorter than the range at which the same camera can read a 16 or 25 bit tag. However, the benefit in range for smaller tags is quite modest due to the 4 pixel overhead of the borders; only a 25% improvement in detection range can be expected by using 16 bit tags instead of 36 bit tags. Thus, it is only in the most range-sensitive applications where smaller tags are advantageous.

VI. EXPERIMENTAL RESULTS

A. Empirical Experiments

A key question we wish to answer is whether our analytical predictions regarding false positive rates hold in real-world imagery. To answer this question, we used a standard image corpus from the LabelMe dataset [26], which contains 180,829 images from a wide variety of indoor and outdoor environments. Since none of these images contain one of our tags, we can measure the false positive rate of our coding systems by using these images.

Fig. 6. Empirical false positives versus tag complexity. Our theoretical error rates assume that all codewords are equally likely to occur by chance in a real-world environment. Our hypothesis is that real-world environments are biased towards codes which have a low rectangle covering complexity, and that by selecting codewords that have high rectangle covering complexity, we can decrease false positive rates. This hypothesis is validated by the graph above, which shows empirical false positive rates from the LabelMe dataset (solid lines) for rectangle covering complexities from c=2 to c=10. At complexities of c=9 and c=10, the false positive rate drops below the rate predicted by the pessimistic model that real-world payloads are randomly distributed.

Evaluation of complexity heuristic: We first wish to evaluate our hypothesis that the false positive rate can be reduced by imposing our geometric complexity heuristic to reject candidate codewords. To do this, we generated ten variants of the 25h9 family with minimum complexities ranging from 1 to 10. In Fig. 6, the false positive rate is given for each of these complexities as a function of the maximum number of bit errors corrected. Also displayed is the theoretical false positive rate, which is based on the assumption that data payloads are randomly distributed.

Fig. 6 clearly demonstrates that our heuristic is effective in reducing the false positive rate. Interestingly, once the complexity exceeds 8, the performance is actually better than predicted by theory.

Comparison to other coding schemes: We next compare the false positive rates of our coding systems to those used by ARToolkitPlus and ARTag. Using the same real-world imagery dataset, we plot the empirical false positive rates for five codes in Fig. 7.
Fig. 7. Empirical false positives. The best performing methods, in terms of the rate of false positives on the LabelMe dataset, are 36h15 and ARToolkitPlus-Simple. These two coding families also have the fewest number of distinguishable codes, which gives them an advantage. The other three systems have approximately comparable numbers of distinguishable codes; ARToolkitPlus-BCH performs very poorly; ARTag does much better, and our proposed 36h10 encoding does even better.

ARToolkitPlus’s BCH coding scheme has the highest false positive rate, followed by ARTag. Our 36h10 encoding, generated with a minimum complexity of 10, performs better than both of these systems. This is a central result of this paper.

The plot shows data for two additional schemes: ARTP-Simple performs about the same as our 36h10 encoding, but because its tag family has one quarter as many distinguishable tags, its false positive rate is correspondingly lower. For comparison purposes, we also include the false positive rate for the 36h15 family with only 27 distinguishable codewords. As expected, it has an exceptionally low false positive rate.

B. Localization Accuracy

Fig. 8. Example synthetic image. We generated ray-traced images in order to create ground-truthed datasets for our evaluation. In this example, the tag is 10m from the camera, and its normal vector points 30.3 degrees away from the camera.

To evaluate the localization accuracy of the detector, we used a ray tracer to generate images with known ground truth (see Fig. 8 for an example). The true location and orientation of the tags was varied randomly and compared to the detected position. Images were generated at a resolution of 400 × 400 with a pinhole lens and focal length of 400 pixels.

The main factor in localization accuracy is the size of the target, which is affected by both the distance and orientation of the tag. To decouple these factors, we conducted two experiments. The first experiment measured the orientation accuracy of targets while fixing the distance. The critical parameter is the angle φ between the target’s normal vector and the vector to the camera. When φ is 0, the target is facing directly towards the camera; as φ approaches π/2, the target rotates out of view and we expect performance to decrease. We measure performance in terms of both the localization accuracy and detection rate.

Fig. 9. Orientation accuracy. Using our ray-tracing simulator (which provides ground truth), we evaluated the localization accuracy of two tag detectors. We fixed the range to the tag and varied the angle between the tag’s normal and the camera direction. For both systems, the performance in localization accuracy and in success rate worsens as the tag rotates away from the camera. However, the proposed system has dramatically lower localization error and is able to detect targets more reliably.

In Fig. 9, we see that our detector significantly outperforms the ARToolkitPlus detector: not only are the orientation and distance estimates more accurate, but it can detect tags over a larger range of φ.

The complementary experiment is to hold φ = 0 and to vary the distance. We expect that as distance increases, accuracy will decrease. In Fig. 10, we see that our detector works reliably to 50 m, while the ARToolkitPlus detector’s detection rate drops to under 50% at around 25 m. In addition, our detector provides significantly more accurate localization results.

Fig. 10. Distance accuracy. In order to evaluate the accuracy of distance estimation to the target, we again used ground-truthed simulation data. For this experiment, we fixed the target so that it faced the camera but varied the distance between the tag and the camera. Our proposed method significantly outperforms ARToolkitPlus, providing more accurate range estimates, and over twice the working detection range.

Naturally, real-world performance of the system will be lower than these synthetic experiments due to noise, lighting variation, and other non-idealities (such as lens distortion or tag non-planarity). Still, the real-world performance of our system has been very good.

While our methods are generally more computationally expensive than those used by ARToolkitPlus, our Java implementation runs at interactive rates (30 fps) on VGA resolution images (Intel Core2 CPU at 2.6GHz). Higher resolutions significantly impact runtime due to the graph-based clustering. We expect significant speedups by making use of SIMD optimizations and accelerated image processing libraries in our ongoing C port.

VII. CONCLUSION

We have described a visual fiducial system that significantly improves upon previous methods. We described a new approach for detecting edges using a graph-based clustering method along with a coding system that is demonstrably stronger than previous methods. We have also described a set of benchmarks that we hope will make it easier to evaluate other methods in the future. In contrast to other systems (with the notable exception of ARToolkit), our implementation is fully open. Our source code and benchmarking software are freely available:

http://april.eecs.umich.edu/
REFERENCES

[1] C.-H. Chu, D.-N. Yang, and M.-S. Chen, “Image stabilization for 2d barcode in handheld devices,” in MULTIMEDIA ’07: Proceedings of the 15th International Conference on Multimedia. New York, NY, USA: ACM, 2007, pp. 697–706.
[2] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Pose tracking from natural features on mobile phones,” in ISMAR ’08: Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. Washington, DC, USA: IEEE Computer Society, 2008, pp. 125–134.
[3] M. Fiala, “ARTag, a fiducial marker system using digital techniques,” in CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2. Washington, DC, USA: IEEE Computer Society, 2005, pp. 590–596.
[4] A. C. Sementille, L. E. Lourenço, J. R. F. Brega, and I. Rodello, “A motion capture system using passive markers,” in VRCAI ’04: Proceedings of the 2004 ACM SIGGRAPH International Conference on Virtual Reality Continuum and its Applications in Industry. New York, NY, USA: ACM, 2004, pp. 440–447.
[5] J. Sattar, P. Giguère, and G. Dudek, “Sensor-based behavior control for an autonomous underwater vehicle,” Int. J. Rob. Res., vol. 28, no. 6, pp. 701–713, 2009.
[6] M. Fiala, “A robot control and augmented reality interface for multiple robots,” in CRV ’09: Proceedings of the 2009 Canadian Conference on Computer and Robot Vision. Washington, DC, USA: IEEE Computer Society, 2009, pp. 31–36.
[7] ——, “Vision guided control of multiple robots,” Computer and Robot Vision, Canadian Conference, vol. 0, pp. 241–246, 2004.
[8] U. Frese, “Deutsches Zentrum für Luft- und Raumfahrt (DLR) dataset,” 2003.
[9] J. Sattar, E. Bourque, P. Giguere, and G. Dudek, “Fourier tags: Smoothly degradable fiducial markers for use in human-robot interaction,” Computer and Robot Vision, Canadian Conference, vol. 0, pp. 165–174, 2007.
[10] T. Lochmatter, P. Roduit, C. Cianci, N. Correll, J. Jacot, and A. Martinoli, “SwisTrack - A flexible open source tracking software for multi-agent systems,” in Proceedings of the IEEE/RSJ 2008 International Conference on Intelligent Robots and Systems (IROS 2008). IEEE, 2008, pp. 4004–4010. [Online]. Available: http://iros2008.inria.fr/
[11] H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for a video-based augmented reality conferencing system,” in IWAR ’99: Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality. Washington, DC, USA: IEEE Computer Society, 1999, p. 85.
[12] D. Wagner and D. Schmalstieg, “Making augmented reality practical on mobile phones, part 1,” IEEE Computer Graphics and Applications, vol. 29, pp. 12–15, 2009.
[13] S.-w. Lee, D.-c. Kim, D.-y. Kim, and T.-d. Han, “Tag detection algorithm for improving the instability problem of an augmented reality,” in ISMAR ’06: Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality. Washington, DC, USA: IEEE Computer Society, 2006, pp. 257–258.
[14] D. Parikh and G. Jancke, “Localization and segmentation of a 2d high capacity color barcode,” in WACV ’08: Proceedings of the 2008 IEEE Workshop on Applications of Computer Vision. Washington, DC, USA: IEEE Computer Society, 2008, pp. 1–6.
[15] P. C. Santos, A. Stork, A. Buaes, C. E. Pereira, and J. Jorge, “A real-time low-cost marker-based multiple camera tracking solution for virtual reality applications,” January 2009.
[16] A. Mohan, G. Woo, S. Hiura, Q. Smithwick, and R. Raskar, “Bokode: imperceptible visual tags for camera based interaction from a distance,” ACM Trans. Graph., vol. 28, pp. 98:1–98:8, July 2009. [Online]. Available: http://portal.acm.org/citation.cfm?id=1531326.1531404
[17] J. McLurkin, “Analysis and implementation of distributed algorithms for multi-robot systems,” Ph.D. thesis, Massachusetts Institute of Technology, 2008.
[18] Y. Yamamoto, P. Pirjanian, M. Munich, E. Dibernardo, L. Goncalves, J. Ostrowski, and N. Karlsson, “Optical sensing for robot perception and localization,” in 2005 IEEE Workshop on Advanced Robotics and its Social Impacts, 2005, pp. 14–17.
[19] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[20] R. L. Rivest and C. E. Leiserson, Introduction to Algorithms. New York, NY, USA: McGraw-Hill, Inc., 1990.
[21] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
[23] K. Shoemake and T. Duff, “Matrix animation and polar decomposition,” in Proceedings of the Conference on Graphics Interface ’92. Morgan Kaufmann Publishers Inc., 1992, pp. 258–264.
[24] A. Trachtenberg, A. T. M. S, E. Vardy, and C. L. Liu, “Computational methods in coding theory,” Tech. Rep., 1996.
[25] R. A. Brualdi and V. S. Pless, “Greedy codes,” J. Comb. Theory Ser. A, vol. 64, no. 1, pp. 10–30, 1993.
[26] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: A database and web-based tool for image annotation,” Tech. Rep. MIT-CSAIL-TR-2005-056, Massachusetts Institute of Technology, 2005.