
Utilizing Job Configuration:

• Setting Key-Value Pairs:

• Hadoop allows setting arbitrary key-value pairs in the job configuration using Configuration
(or JobConf in the old MapReduce API).

• This method is useful for passing small metadata to tasks.

• Retrieving Data in Tasks:

• In tasks, the data can be retrieved from the configuration using the getConfiguration()
method in the Context.

• In the older API, it involves overriding the configure() method in Mapper or Reducer and
using the JobConf object passed in to retrieve the data.

• Handling Serialization:

• Primitive types are often sufficient for encoding metadata. For more complex objects,
developers can either handle the serialization process themselves or use Hadoop's Stringifier class.

• DefaultStringifier can be used to serialize objects using Hadoop's serialization framework (see the sketch after this list).


However, this mechanism should not be used to transfer large amounts of data, since it can put pressure on memory in the Hadoop daemons, particularly in environments with many jobs running simultaneously.

• Memory Consideration:

• It's essential to be cautious with the volume of data stored in the job configuration as it can
significantly impact memory usage in the Hadoop daemons.

• All entries in the configuration are read into memory whenever the configuration is read by
the jobtracker, tasktracker, or child JVM, even if not all entries are utilized. Unused user properties
can waste resources and time.
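
The following is a minimal sketch of the pattern above, assuming the new (org.apache.hadoop.mapreduce) API: the driver stores a couple of simple values plus one Writable object in the job configuration, and a mapper reads them back in setup() via context.getConfiguration() and DefaultStringifier. The property names (side.data.*) and class names are placeholders for illustration, not part of any Hadoop API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DefaultStringifier;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataViaConfiguration {

    // Driver side: store small metadata in the job configuration.
    public static Job createJob() throws IOException {
        Configuration conf = new Configuration();

        // Simple values go in as plain key-value pairs (property names are hypothetical).
        conf.set("side.data.message", "hello from the driver");
        conf.setInt("side.data.threshold", 42);

        // A more complex (Writable) object can be stored with DefaultStringifier,
        // which encodes it using Hadoop's serialization framework.
        DefaultStringifier.store(conf, new Text("complex payload"), "side.data.payload");

        Job job = Job.getInstance(conf, "side data via configuration");
        job.setJarByClass(SideDataViaConfiguration.class);
        job.setMapperClass(SideDataMapper.class);
        return job;
    }

    // Task side: read the values back in setup() via context.getConfiguration().
    public static class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String message;
        private int threshold;
        private Text payload;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            message = conf.get("side.data.message");
            threshold = conf.getInt("side.data.threshold", 0);
            payload = DefaultStringifier.load(conf, "side.data.payload", Text.class);
            // The metadata is now available to every task without any extra I/O.
        }
    }
}

Because every task JVM loads the whole configuration, this approach only makes sense for small pieces of metadata; anything larger belongs in the distributed cache described next.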

The Distributed Cache in Hadoop is a mechanism to efficiently share auxiliary data or files required
for a MapReduce job across multiple nodes in a cluster. It's a service that copies necessary files to
each task node so they can be used when tasks are executed.

How it Works:

• File Distribution:

• Files are specified for distribution using command-line options such as -files or -archives (see the driver sketch after this list).

• These files can be on the local file system, HDFS, or other Hadoop-compatible file systems.

• File Copying:

• When a job is launched, Hadoop copies these specified files to the jobtracker's filesystem
(usually HDFS).

• Before a task is executed, the tasktracker copies the files from HDFS to a local cache on the
node where the task will run.

• Localized Files:

• For the task, these files appear as if they are local files, even though they originated from
HDFS.

• The reference count for each file in the cache is maintained. After the task completes, the
reference count decreases. When the count reaches zero, the file is removed from the cache to free
up space.

• Cache Size and Deletion:

• If the cache exceeds a certain size (10 GB by default), files are deleted to make room for new
files. The cache size is configurable.

• Cache Location and APIs:

• The distributed cache API contains methods to add files to the cache, set the classpath for
the MapReduce task, create symlinks, and access localized files or archives within the task code.

• Files are localized under a specific directory on the tasktrackers.
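
A hedged sketch of the command-line route: the driver below runs through ToolRunner, which applies GenericOptionsParser, so the standard -files, -archives, and -libjars options work without any extra code. The class name, job name, and the lookup.txt file in the example invocation are hypothetical; only the Tool/ToolRunner plumbing and the option names come from Hadoop itself.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver. Example invocation (paths are placeholders):
//   hadoop jar myjob.jar CacheFileDriver -files lookup.txt input output
public class CacheFileDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: CacheFileDriver <input> <output>");
            return -1;
        }
        // getConf() already contains whatever GenericOptionsParser picked up,
        // including the distributed cache entries created by -files.
        Job job = Job.getInstance(getConf(), "job using a distributed cache file");
        job.setJarByClass(CacheFileDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Mapper/reducer classes would be set here; their tasks can open lookup.txt
        // because the framework localizes it on each task node before execution.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CacheFileDriver(), args));
    }
}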

Using the API:

• API Methods:

• The API allows adding or setting files and archives in the cache and configuring the task
classpath using Java methods.

• It provides functions to retrieve localized files and archives in the task code.

• Handling Files in Task Code:

• In the task code, localized files can be accessed directly or retrieved using specific methods
provided by the distributed cache API.

• Files are returned as local Path objects, which can be accessed using Hadoop's local
FileSystem instance or via Java's File API (see the sketch below).
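
The sketch below shows that task-side access, under the assumptions that the new API is in use and that the first cache entry is a tab-separated lookup file. It uses getLocalCacheFiles(), the method named in these notes (newer Hadoop releases deprecate it in favour of getCacheFiles()); the class and field names are made up for illustration.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that loads a localized cache file in setup().
public class LookupReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // getLocalCacheFiles() returns the cache entries as paths on local disk.
        Path[] localFiles = context.getLocalCacheFiles();
        if (localFiles == null || localFiles.length == 0) {
            throw new IOException("No files found in the distributed cache");
        }
        // The Path is local, so plain java.io works; Hadoop's local FileSystem
        // (FileSystem.getLocal(context.getConfiguration())) would work equally well.
        File cached = new File(localFiles[0].toString());
        BufferedReader reader = new BufferedReader(new FileReader(cached));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");   // assumes tab-separated key/value
                if (fields.length == 2) {
                    lookup.put(fields[0], fields[1]);
                }
            }
        } finally {
            reader.close();
        }
    }
}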

Distributed Cache: Side Data Distribution

1. Purpose of Distributed Cache:

• Provides a mechanism to distribute datasets or files to task nodes in a Hadoop cluster for
efficient use by map or reduce tasks.

2. Usage via GenericOptionsParser:

• Tools employing the GenericOptionsParser can specify files using options like -files and
-archives in a comma-separated list of URIs.

3. Copying Files and Archives:

• The -files option copies files to task nodes, while the -archives option copies archives, which
are unarchived on the task node.

• -libjars adds JAR files to the classpath of mapper and reducer tasks.

4. File Localization:

• Files are initially copied to the jobtracker's filesystem and then to a local disk on the task
node (cache) before task execution. This localization optimizes access for tasks.

• Each file's usage is tracked by a reference count. When tasks complete, the count
decrements; when it reaches zero, the file is deleted.

5. Cache Size and Deletion:

• By default, the cache size is limited to 10 GB and may be adjusted using the local.cache.size
configuration property. Files are deleted when the cache exceeds this size.

6. API Methods for Managing Cache:

• Job API methods: addCacheFile(), addCacheArchive(), and setCacheFiles() add files or
archives to the distributed cache; addFileToClassPath() and addArchiveToClassPath() add files or
archives to the task's classpath.

• Create Symbolic Links: createSymlink() method creates symbolic links for localized files in the
task's working directory.

7. Java API vs. GenericOptionsParser:

• When using the Java API directly, the files referenced by the add or set methods are not
copied to the shared filesystem by the framework (they must already be accessible there), whereas
GenericOptionsParser copies local files to the default shared filesystem.

8. Accessing Cached Files:

• JobContext methods: getLocalCacheFiles() and getLocalCacheArchives() are used in map or
reduce tasks to access localized files and archives from the distributed cache.

• getFileClassPaths() and getArchiveClassPaths() retrieve files and archives added to the task
classpath.

9. Retrieving and Accessing Files:

• Retrieved files are returned as local Path objects. To read them, use a Hadoop local
FileSystem instance or the java.io.File API.

• A mapper's or reducer's setup() method is the natural place to open and read a distributed
cache file before any records are processed (see the sketch after this list).
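
The sketch below pulls these pieces together: a driver that uses the Job methods addCacheFile() and addFileToClassPath(), and a mapper whose setup() opens the localized file. The HDFS paths, the #stations.txt symlink fragment, and the class names are illustrative assumptions rather than values from these notes; on older MapReduce versions an explicit createSymlink() call may also be required for the symlink to appear.

import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheJobApi {

    // Driver side: the cached file must already be on the shared filesystem (e.g. HDFS)
    // when the Java API is used directly -- unlike -files, nothing is copied for us.
    public static Job createJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "job api distributed cache example");
        job.setJarByClass(DistributedCacheJobApi.class);

        // Cache a metadata file and add a helper jar to the task classpath (hypothetical paths).
        job.addCacheFile(URI.create("hdfs:///metadata/stations.txt#stations.txt"));
        job.addFileToClassPath(new Path("/libs/lookup-helpers.jar"));

        job.setMapperClass(CacheAwareMapper.class);
        return job;
    }

    public static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException {
            // The "#stations.txt" fragment above names the symlink created in the
            // task's working directory, so the file can be opened by that name.
            File stations = new File("stations.txt");
            // ... parse the file here and build an in-memory lookup structure ...
        }
    }
}

Doing the parsing once in setup() rather than in map() is the key design point: the localized file is read a single time per task, and every record processed afterwards can consult the in-memory structure.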
