
What Is Serialization?

Serialization is the process of converting a data object (a combination of code and
data represented within a region of data storage) into a series of bytes that saves the
state of the object in an easily transmittable form. In this serialized form, the data can
be delivered to another data store (such as an in-memory computing platform),
application, or some other destination.

Data serialization is the process of converting an object into a stream of bytes to more
easily save or transmit it.

The reverse process, constructing a data structure or object from a series of bytes,
is deserialization. The deserialization process recreates the object, thus making the
data easier to read and modify as a native structure in a programming language.

Serialization and deserialization work together: one transforms data objects into a
portable format, and the other recreates them from it.

Serialization enables us to save the state of an object and recreate the object in a new
location. Serialization encompasses both the storage of the object and exchange of
data. Since objects are composed of several components, saving or delivering all the
parts typically requires significant coding effort, so serialization is a standard way to
capture the object into a sharable format. With serialization, we can transfer objects:

• Over the wire for messaging use cases
• From application to application via web services such as REST APIs
• Through firewalls (as JSON or XML strings)
• Across domains
• To other data stores
• To identify changes in data over time
• While honoring security and user-specific details across applications

Why Is Data Serialization Important for Distributed Systems?

In some distributed systems, data and its replicas are stored in different partitions on
multiple cluster members. If data is not present on the local member, the system will
retrieve that data from another member. This requires serialization for use cases such
as:

• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic
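
To make the first use case concrete, the following is a minimal sketch of a partitioned map, not the API of any particular product. The names `PartitionedMapDemo`, `Member`, and `ownerOf` are hypothetical, and the hash-based partitioning rule is a simplifying assumption. Each member stores only serialized bytes, so a value has to be serialized on `put` and deserialized on `get`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedMapDemo {
    // Hypothetical stand-in for a cluster member: it only ever holds serialized bytes.
    static class Member {
        final Map<String, byte[]> store = new HashMap<>();
    }

    final List<Member> members = new ArrayList<>();

    PartitionedMapDemo(int memberCount) {
        for (int i = 0; i < memberCount; i++) {
            members.add(new Member());
        }
    }

    // Assumed partitioning rule: pick the owning member from the key's hash.
    Member ownerOf(String key) {
        return members.get(Math.floorMod(key.hashCode(), members.size()));
    }

    // Serialize the value, then hand the bytes to the member that owns the key.
    void put(String key, Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        ownerOf(key).store.put(key, buffer.toByteArray());
    }

    // Fetch the bytes from the owning member and rebuild the object locally.
    Object get(String key) throws IOException, ClassNotFoundException {
        byte[] bytes = ownerOf(key).store.get(key);
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PartitionedMapDemo map = new PartitionedMapDemo(3);
        map.put("user:1", "Ada");
        System.out.println(map.get("user:1")); // prints: Ada
    }
}
```

In a real cluster the byte array would travel over the network to the owning member. Here the members live in one process so that the sketch stays self-contained.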

What Are Common Languages for Data Serialization?

A number of popular programming languages and platforms either support serialization
natively or have libraries that add it. Java, .NET, C++, Node.js, Python, and Go, for
example, all either have native serialization support or integrate with libraries for
serialization.

Data formats such as JSON and XML are often used as the format for storing
serialized data. Custom binary formats are also used, which tend to be more
space-efficient because they carry less markup/tagging in the serialized output.
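
The size difference can be illustrated with a small sketch that encodes the same three fields twice: once as a hand-built JSON string and once as a fixed-layout binary record. The class `FormatSizeDemo`, the field names, and the sample values are assumptions made for this illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class FormatSizeDemo {
    // Text encoding: field names and punctuation travel with every record.
    static byte[] toJson(int id, double price, boolean inStock) {
        String json = "{\"id\":" + id + ",\"price\":" + price + ",\"inStock\":" + inStock + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Custom binary encoding: fixed field order, no field names or delimiters.
    static byte[] toBinary(int id, double price, boolean inStock) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(id);          // 4 bytes
            out.writeDouble(price);    // 8 bytes
            out.writeBoolean(inStock); // 1 byte
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("JSON bytes:   " + toJson(1001, 19.99, true).length);
        System.out.println("binary bytes: " + toBinary(1001, 19.99, true).length);
    }
}
```

The binary record is always 13 bytes for these three fields, while the JSON form grows with the field names and the printed length of each value. The trade-off is that the binary layout is only readable by code that already knows the field order.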

What Is Data Serialization in Big Data?

Big data systems often include technologies/data that are described as “schemaless.”
This means that the data managed in these systems is not structured in a strict
format defined by a schema. Serialization provides several benefits in this type of
environment:

• Structure. By applying a schema or set of criteria to a data structure through
serialization on read, we can avoid reading data that is missing mandatory fields,
is incorrectly classified, or fails some other quality control requirement.
• Portability. Big data comes from a variety of systems and may be written in a
variety of languages. Serialization can provide the necessary uniformity to
transfer such data to other enterprise systems or applications.
• Versioning. Big data is constantly changing. Serialization allows us to apply
version numbers to objects for lifecycle management.
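
The structure and versioning points above can be sketched together. In the example below, the record layout, the class `VersionedRecordDemo`, and the field names are all hypothetical. The writer stamps each record with a version number, and the reader enforces a small schema as it reads: unknown versions and records with an empty mandatory field are rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class VersionedRecordDemo {
    // Assumed layout: [int version][UTF name][UTF email, version 2 only]
    static byte[] write(int version, String name, String email) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(version);
            out.writeUTF(name);
            if (version >= 2) {
                out.writeUTF(email);
            }
        }
        return buffer.toByteArray();
    }

    // Applies the schema while reading; returns {name, email}.
    static String[] read(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int version = in.readInt();
            if (version < 1 || version > 2) {
                throw new IOException("unsupported record version: " + version);
            }
            String name = in.readUTF();
            if (name.isEmpty()) {
                throw new IOException("mandatory field 'name' is empty");
            }
            // Version 1 records predate the email field, so fall back to a default.
            String email = (version >= 2) ? in.readUTF() : "unknown";
            return new String[] { name, email };
        }
    }

    public static void main(String[] args) throws IOException {
        String[] v1 = read(write(1, "Ada", ""));
        String[] v2 = read(write(2, "Ada", "ada@example.com"));
        System.out.println(v1[0] + " / " + v1[1]); // prints: Ada / unknown
        System.out.println(v2[0] + " / " + v2[1]); // prints: Ada / ada@example.com
    }
}
```

Because the version number is the first thing in every record, a newer reader can still accept older records, which is one simple way to manage the lifecycle of stored data.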

Serialization in Java

Java provides a mechanism called object serialization, in which an object can be
represented as a sequence of bytes that includes the object's data as well as
information about the object's type and the types of data stored in the object.
After a serialized object is written to a file, it can be read from the file and
deserialized. That is, the type information and bytes that represent the object and its
data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and
deserialize an object, respectively, in Java.
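
As a minimal sketch of this mechanism, the example below serializes an object with ObjectOutputStream and rebuilds it with ObjectInputStream. The `Employee` class and its fields are hypothetical. An in-memory byte array stands in for the file so that the example is self-contained, and wrapping a FileOutputStream and FileInputStream instead would write to and read from disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example class; a class opts in by implementing Serializable.
class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final int id;

    Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }
}

class JavaSerializationDemo {
    // Serialize: ObjectOutputStream writes the object's type information and field data.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Deserialize: ObjectInputStream recreates the object in memory from those bytes.
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new Employee("Ada", 42));
        Employee copy = (Employee) deserialize(bytes);
        System.out.println(copy.name + " " + copy.id); // prints: Ada 42
    }
}
```

The deserialized `Employee` is a new object with the same state as the original, not a reference to it.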

Serialization in Hadoop

In distributed systems such as Hadoop, serialization is generally used for two
purposes: interprocess communication and persistent storage.

Interprocess Communication

To establish interprocess communication between the nodes connected in a
network, the remote procedure call (RPC) technique is used.

RPC uses serialization internally to convert a message into a binary format before
sending it to the remote node over the network. At the other end, the remote system
deserializes the binary stream into the original message.

The RPC serialization format is required to be as follows −

Compact − A compact format makes the best use of network bandwidth, which is the
scarcest resource in a data center.

Fast − Since communication between the nodes is crucial in distributed systems,
serialization and deserialization should be quick and add as little overhead as possible.

Extensible − Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.

Interoperable − The message format should support nodes that are written in
different languages.
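
The compact and fast properties can be illustrated with a sketch of a message that writes its own fields in a fixed binary layout. The `WireMessage` interface, the `BlockRequest` message, and its fields are hypothetical names for this example. The interface is a local stand-in, only loosely modeled on the write/readFields shape of Hadoop's Writable interface, and is not a Hadoop type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical interface for this sketch, not a Hadoop type.
interface WireMessage {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical RPC request: ask a remote node for part of a block.
class BlockRequest implements WireMessage {
    long blockId;
    int length;
    String client = "";

    public void write(DataOutput out) throws IOException {
        out.writeLong(blockId); // 8 bytes
        out.writeInt(length);   // 4 bytes
        out.writeUTF(client);   // 2-byte length prefix + encoded characters
    }

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        blockId = in.readLong();
        length = in.readInt();
        client = in.readUTF();
    }
}

class RpcWireDemo {
    // Sender side: turn the message into the bytes that would go over the network.
    static byte[] encode(WireMessage message) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            message.write(out);
        }
        return buffer.toByteArray();
    }

    // Receiver side: fill an empty message object from the received bytes.
    static void decode(byte[] bytes, WireMessage target) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            target.readFields(in);
        }
    }

    public static void main(String[] args) throws IOException {
        BlockRequest request = new BlockRequest();
        request.blockId = 7L;
        request.length = 4096;
        request.client = "node-a";

        byte[] wire = encode(request);
        BlockRequest received = new BlockRequest();
        decode(wire, received);
        System.out.println(wire.length + " bytes, block " + received.blockId); // prints: 20 bytes, block 7
    }
}
```

Because each field is written directly with no type metadata, the message stays small and cheap to encode. The sketch does not address extensibility or interoperability, which would need additions such as a version field and a layout documented for other languages.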

Persistent Storage

Persistent storage is digital storage that does not lose its data when the power
supply is lost. Files, folders, and databases are examples of persistent storage.
