• Hadoop allows setting arbitrary key-value pairs in the job configuration using Configuration
(or JobConf in the old MapReduce API).
• In a task, this data can be retrieved by calling the getConfiguration() method on the
Context.
• In the old API, you override the configure() method in the Mapper or Reducer and use the
JobConf object passed in to retrieve the data.
• Handling Serialization:
• Primitive types are often sufficient for encoding metadata. For more complex objects,
developers can either handle the serialization process themselves or use Hadoop's Stringifier class.
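For the Stringifier route, Hadoop provides DefaultStringifier, which serializes an object with Hadoop's serialization framework and Base64-encodes it into a configuration property. A minimal sketch, assuming a Writable payload and an illustrative property key:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DefaultStringifier;
import org.apache.hadoop.io.Text;

public class MetadataExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Serialize a Writable into the configuration (key name is illustrative).
        DefaultStringifier.store(conf, new Text("2014-01-01"), "myjob.start.date");

        // Later, typically inside a task, deserialize it back out.
        Text startDate =
            DefaultStringifier.load(conf, "myjob.start.date", Text.class);
        System.out.println(startDate);
    }
}
```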
• Memory Consideration:
• It's essential to be cautious with the volume of data stored in the job configuration as it can
significantly impact memory usage in the Hadoop daemons.
• All entries in the configuration are read into memory whenever the configuration is read by
the jobtracker, tasktracker, or child JVM, even if not all entries are utilized. Unused user properties
can waste resources and time.
The Distributed Cache in Hadoop is a mechanism to efficiently share auxiliary data or files required
for a MapReduce job across multiple nodes in a cluster. It's a service that copies necessary files to
each task node so they can be used when tasks are executed.
How it Works:
• File Distribution:
• Files are specified using command-line options like -files or -archives for distribution.
• These files can be on the local file system, HDFS, or other Hadoop-compatible file systems.
• File Copying:
• When a job is launched, Hadoop copies these specified files to the jobtracker's filesystem
(usually HDFS).
• Before a task is executed, the tasktracker copies the files from HDFS to a local cache on the
node where the task will run.
• Localized Files:
• For the task, these files appear as if they are local files, even though they originated from
HDFS.
• The reference count for each file in the cache is maintained. After the task completes, the
reference count decreases. When the count reaches zero, the file is removed from the cache to free
up space.
• If the cache exceeds a certain size (typically 10 GB by default), files may be deleted to
accommodate new files. This cache size is configurable.
• The distributed cache API contains methods to add files to the cache, set the classpath for
the MapReduce task, create symlinks, and access localized files or archives within the task code.
• API Methods:
• The API allows adding or setting files and archives in the cache and configuring classpath
using Java methods.
• It provides functions to retrieve localized files and archives in the task code.
• In the task code, localized files can be accessed directly or retrieved using specific methods
provided by the distributed cache API.
• Files are returned as local Path objects, which can be accessed using Hadoop's local
FileSystem instance or via Java's File API.
• Provides a mechanism to distribute datasets or files to task nodes in a Hadoop cluster for
efficient use by map or reduce tasks.
• Tools employing the GenericOptionsParser can specify files using the -files and -archives
options, each of which takes a comma-separated list of URIs.
• The -files option copies files to task nodes, while the -archives option copies archives, which
are unarchived on the task node.
• -libjars adds JAR files to the classpath of mapper and reducer tasks.
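A hypothetical invocation showing all three options together; the jar, driver class, file names, and input/output paths are illustrative:

```shell
# -files copies the metadata file to each task node, -archives copies the
# tarball and unarchives it there, and -libjars puts the JAR on the task
# classpath.
hadoop jar myjob.jar MyDriver \
  -files station-metadata.txt \
  -archives lookup-tables.tar.gz \
  -libjars mylib.jar \
  input/ncdc output
```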
• File Localization:
• Files are initially copied to the jobtracker's filesystem and then to a local disk on the task
node (cache) before task execution. This localization optimizes access for tasks.
• Each file's usage is tracked by a reference count. When tasks complete, the count
decrements; when it reaches zero, the file is deleted.
• By default, the cache size is limited to 10 GB and may be adjusted using the local.cache.size
configuration property. Files are deleted when the cache exceeds this size.
• Create Symbolic Links: the createSymlink() method creates symbolic links to the localized
files in the task's working directory.
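In the old API this looks roughly as follows; the HDFS path is illustrative, and the URI fragment after '#' names the symlink that will appear in the task's working directory:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class SymlinkSetup {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();

        // Ask the framework to create symlinks in each task's working directory.
        DistributedCache.createSymlink(conf);

        // The fragment "#stations" names the symlink (path is hypothetical).
        DistributedCache.addCacheFile(
            new URI("/shared/lookup/stations.txt#stations"), conf);
    }
}
```

Tasks can then open the file simply as "stations" relative to their working directory.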
• When using the Java API directly, files specified in add() or set() methods are not copied to
the shared filesystem. However, using GenericOptionsParser copies the file to the default shared
filesystem.
• getFileClassPaths() and getArchiveClassPaths() retrieve files and archives added to the task
classpath.
• Retrieved files are returned as local Path objects. To read files, a Hadoop local FileSystem
instance or java.io.File API is utilized.
• The setup() method exemplifies how to initialize and access a distributed cache file in a
MapReduce task.
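A sketch of such a setup() method, assuming a tab-separated lookup file shipped with "-files lookup.txt" (the file name, format, and reducer types are all illustrative). Because the file is localized into the task's working directory, it can be opened by name with the ordinary java.io.File API:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LookupReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file appears as a local file in the working directory.
        try (BufferedReader in = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        // Replace the raw key with its human-readable name from the side file.
        String name = lookup.getOrDefault(key.toString(), key.toString());
        context.write(new Text(name), new IntWritable(max));
    }
}
```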