Data Serialization in Big Data

What is Serialization?
Serialization is the process of converting a data object (a combination of code and
data represented within a region of data storage) into a series of bytes that saves the
state of the object in an easily transmittable form. In this serialized form, the data can
be delivered to another data store (such as an in-memory computing platform),
application, or some other destination.
In short, data serialization converts an object into a stream of bytes so that it can be
saved or transmitted more easily.
The reverse process—constructing a data structure or object from a series of bytes—
is deserialization. The deserialization process recreates the object, thus making the
data easier to read and modify as a native structure in a programming language.
Serialization and deserialization work together to transform data objects to, and
recreate them from, a portable format.
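The round trip described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Reading type, its fields, and the field order are hypothetical, and DataOutputStream is just one of several ways to turn fields into bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RoundTripDemo {
    // Hypothetical example type, not taken from the text above.
    static class Reading {
        final String sensorId;
        final long timestamp;
        final double value;

        Reading(String sensorId, long timestamp, double value) {
            this.sensorId = sensorId;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Serialization: write the object's state, field by field, into a byte array.
    static byte[] serialize(Reading r) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(r.sensorId);
            out.writeLong(r.timestamp);
            out.writeDouble(r.value);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialization: read the fields back in the same order and rebuild the object.
    static Reading deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String sensorId = in.readUTF();
            long timestamp = in.readLong();
            double value = in.readDouble();
            return new Reading(sensorId, timestamp, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reading original = new Reading("s-1", 1700000000000L, 21.5);
        byte[] bytes = serialize(original);
        Reading copy = deserialize(bytes);
        System.out.println(bytes.length + " bytes, restored sensorId=" + copy.sensorId);
    }
}
```

The byte array could equally be written to a file or sent over a socket; the reader only needs to know the field order the writer used.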
Serialization enables us to save the state of an object and recreate the object in a new
location. Serialization encompasses both the storage of the object and exchange of
data. Since objects are composed of several components, saving or delivering all the
parts typically requires significant coding effort, so serialization is a standard way to
capture the object into a sharable format. With serialization, we can transfer objects:
• Over the wire for messaging use cases
• From application to application via web services such as REST APIs
• Through firewalls (as JSON or XML strings)
• Across domains
• To other data stores
• To identify changes in data over time
• While honoring security and user-specific details across applications
Why Is Data Serialization Important for Distributed Systems?
In some distributed systems, data and its replicas are stored in different partitions on
multiple cluster members. If data is not present on the local member, the system will
retrieve that data from another member. This requires serialization for use cases such
as:
• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic
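To make the first use case concrete, here is a minimal sketch of a partitioned key/value map in which each cluster member stores only bytes. Everything in it is an assumption made for illustration: the Member class, the hash-based choice of owner, and the use of UTF-8 strings as the serialized form. Real distributed stores use their own partitioning and serialization schemes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedMapDemo {
    // Hypothetical stand-in for one cluster member: it only ever sees bytes.
    static class Member {
        final Map<ByteBuffer, byte[]> store = new HashMap<ByteBuffer, byte[]>();
    }

    final List<Member> members = new ArrayList<Member>();

    PartitionedMapDemo(int memberCount) {
        for (int i = 0; i < memberCount; i++) {
            members.add(new Member());
        }
    }

    // Serialize and deserialize strings as UTF-8 bytes.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromBytes(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Assumed partitioning rule: the key's hash decides which member owns the entry.
    Member ownerOf(String key) {
        return members.get(Math.floorMod(key.hashCode(), members.size()));
    }

    // Both key and value are serialized before they "leave" the caller.
    void put(String key, String value) {
        ownerOf(key).store.put(ByteBuffer.wrap(toBytes(key)), toBytes(value));
    }

    // The owner returns bytes; the caller deserializes them into a String.
    String get(String key) {
        byte[] raw = ownerOf(key).store.get(ByteBuffer.wrap(toBytes(key)));
        return raw == null ? null : fromBytes(raw);
    }

    public static void main(String[] args) {
        PartitionedMapDemo map = new PartitionedMapDemo(3);
        map.put("user:42", "Ada");
        System.out.println(map.get("user:42"));
    }
}
```

The byte arrays are wrapped in ByteBuffer for use as map keys because ByteBuffer compares by content, whereas a raw byte[] compares by identity.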
What Are Common Languages for Data Serialization?
Many popular programming languages and platforms either support serialization
natively or integrate with libraries that add it. Java, .NET, C++, Node.js, Python,
and Go are examples.
Data formats such as JSON and XML are often used as the format for storing
serialized data. Custom binary formats are also used, which tend to be more
space-efficient because they carry less markup/tagging in the serialization.
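The size difference between a text format and a custom binary format can be seen with a small comparison. The sketch below is illustrative only: the record fields (id, price, inStock) and the 13-byte binary layout are hypothetical, and the JSON is built by hand rather than with a JSON library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FormatSizeDemo {
    // Text encoding: the field names travel with every record.
    static byte[] toJson(int id, double price, boolean inStock) {
        String json = "{\"id\":" + id + ",\"price\":" + price + ",\"inStock\":" + inStock + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical custom binary layout: 4-byte int, 8-byte double, 1-byte flag.
    // No field names are stored; reader and writer must agree on the layout.
    static byte[] toBinary(int id, double price, boolean inStock) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 1);
        buf.putInt(id);
        buf.putDouble(price);
        buf.put((byte) (inStock ? 1 : 0));
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] json = toJson(123456, 19.99, true);
        byte[] binary = toBinary(123456, 19.99, true);
        System.out.println("JSON bytes:   " + json.length);
        System.out.println("Binary bytes: " + binary.length);
    }
}
```

The trade-off is that the JSON record describes itself and can be read by any JSON parser, while the binary record is smaller but meaningless without knowledge of its layout.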
What Is Data Serialization in Big Data?
Big data systems often include technologies/data that are described as “schemaless.”
This means that the managed data in these systems are not structured in a strict
format, as defined by a schema. Serialization provides several benefits in this type of
environment:
• Structure. By applying a schema or other structural criteria when data is
deserialized on read, we can avoid accepting data that is missing mandatory fields,
is incorrectly classified, or fails some other quality control requirement.
• Portability. Big data comes from a variety of systems and may be written in a
variety of languages. Serialization can provide the necessary uniformity to
transfer such data to other enterprise systems or applications.
• Versioning. Big data is constantly changing. Serialization allows us to apply
version numbers to objects for lifecycle management.
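The structure and versioning points can be sketched together: a reader that checks a version number and a mandatory field while deserializing. This is a minimal sketch under assumptions, not a description of any particular big data framework. The User type, the one-byte version header, and the rule that email was added in version 2 are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class VersionedRecordDemo {
    // Hypothetical record type used only for this illustration.
    static class User {
        final String name;   // mandatory in every version
        final String email;  // added in version 2; null when read from version 1 data

        User(String name, String email) {
            this.name = name;
            this.email = email;
        }
    }

    // Writes a one-byte version header followed by the fields that version defines.
    static byte[] write(int version, String name, String email) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(version);
            out.writeUTF(name);
            if (version >= 2) {
                out.writeUTF(email);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Applies the schema on read: unknown versions and empty names are rejected.
    static User read(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int version = in.readUnsignedByte();
            if (version != 1 && version != 2) {
                throw new IllegalArgumentException("unsupported version: " + version);
            }
            String name = in.readUTF();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("mandatory field 'name' is empty");
            }
            String email = (version == 2) ? in.readUTF() : null;
            return new User(name, email);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        User oldRecord = read(write(1, "Ada", null));
        User newRecord = read(write(2, "Ada", "ada@example.com"));
        System.out.println(oldRecord.name + " / " + oldRecord.email);
        System.out.println(newRecord.name + " / " + newRecord.email);
    }
}
```

Because the version travels with the data, a newer reader can still load records written before the email field existed.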
Serialization in Java
Java provides a mechanism called object serialization, in which an object can be
represented as a sequence of bytes that includes the object's data as well as
information about the object's type and the types of data stored in the object.
After a serialized object is written into a file, it can be read from the file and
deserialized. That is, the type information and bytes that represent the object and its
data can be used to recreate the object in memory.
The ObjectOutputStream and ObjectInputStream classes are used to serialize and
deserialize an object, respectively, in Java.
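A minimal sketch of this mechanism follows. The Employee class is a hypothetical example type, and the bytes are kept in memory rather than written to a file so that the example is self-contained; wrapping a FileOutputStream and FileInputStream instead would give the file-based flow described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ObjectStreamDemo {
    // Hypothetical example type. Implementing Serializable opts the class in.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int id;

        Employee(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // ObjectOutputStream writes the object's data together with its type information.
    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(obj);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // ObjectInputStream uses that type information to recreate the object in memory.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Employee original = new Employee("Ada", 7);
        Employee copy = (Employee) deserialize(serialize(original));
        System.out.println(copy.name + " " + copy.id);
    }
}
```

Note that readObject returns Object, so the caller casts the result to the expected type.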
Serialization in Hadoop
Generally, in distributed systems like Hadoop, serialization is used
for interprocess communication and persistent storage.
Interprocess Communication
To establish interprocess communication between the nodes connected in a
network, the RPC (remote procedure call) technique is used.
RPC uses internal serialization to convert the message into a binary format before
sending it to the remote node over the network. At the other end, the remote system
deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows −
Compact − To make the best use of network bandwidth, which is the scarcest
resource in a data center.
Fast − Since communication between the nodes is crucial in distributed systems,
the serialization and deserialization process should be quick and add little overhead.
Extensible − Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.
Interoperable − The message format should support nodes that are written in
different languages.
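As a rough illustration of these properties, the sketch below encodes an RPC request as a length-prefixed binary frame. It is not Hadoop's actual wire format: the Request type, the field order, and the version byte are assumptions made for the example. The frame is compact because it carries no field names, and the length prefix lets an older reader skip fields it does not know about.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RpcFrameDemo {
    // Hypothetical request type, not Hadoop's actual RPC message.
    static class Request {
        final int callId;
        final String method;
        final byte[] args;

        Request(int callId, String method, byte[] args) {
            this.callId = callId;
            this.method = method;
            this.args = args;
        }
    }

    // Assumed frame layout: [int bodyLength][byte version][int callId][UTF method][int argsLength][args]
    static byte[] encode(Request r) {
        try {
            ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
            DataOutputStream body = new DataOutputStream(bodyBytes);
            body.writeByte(1); // format version
            body.writeInt(r.callId);
            body.writeUTF(r.method);
            body.writeInt(r.args.length);
            body.write(r.args);

            ByteArrayOutputStream frameBytes = new ByteArrayOutputStream();
            DataOutputStream frame = new DataOutputStream(frameBytes);
            frame.writeInt(bodyBytes.size());
            frame.write(bodyBytes.toByteArray());
            return frameBytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Request decode(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            byte[] body = new byte[in.readInt()];
            in.readFully(body);

            DataInputStream fields = new DataInputStream(new ByteArrayInputStream(body));
            int version = fields.readUnsignedByte();
            if (version < 1) {
                throw new IllegalArgumentException("unsupported version: " + version);
            }
            int callId = fields.readInt();
            String method = fields.readUTF();
            byte[] args = new byte[fields.readInt()];
            fields.readFully(args);
            // Bytes left in the body belong to fields this reader does not know; they are ignored.
            return new Request(callId, method, args);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] frame = encode(new Request(1, "getBlock", new byte[] {10, 20, 30}));
        Request received = decode(frame);
        System.out.println(frame.length + " bytes, method=" + received.method);
    }
}
```

DataOutputStream writes multi-byte integers in big-endian order, so a peer written in another language can parse the same frame as long as it follows the same layout.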
Persistent Storage
Persistent storage is a digital storage facility that does not lose its data when the
power supply is lost. Files, folders, and databases are examples of persistent storage.
