
EXPERIMENT NO: 09

Aim: To prepare a case study on Distributed File Systems.

Theory:

What is a distributed file system?

A distributed file system is a file system whose data is spread across multiple computers
linked by a shared network. These computers are called nodes; each node stores a subset of the
distributed data. Because the data is distributed among the nodes, users can access it from
any location. By spreading data across multiple machines, a distributed file system also
provides reliability and a degree of fault tolerance.

File systems are essential to any business because they enable the storage and retrieval of
data. Distributed file systems go further, providing a distinct set of advantages that make
them a valuable part of an organization's infrastructure: by implementing one, organizations
can reduce costs while increasing storage capacity and performance.

A distributed file system operates by presenting a single virtual file system that spans multiple
computers. Each computer has its own local file system and holds a portion of the data. When a user
requests a file, the request is routed to a file server, which fetches the data from the computers that
hold it. Copies of the data are typically kept on several nodes, a practice known as replication, which
protects data integrity and availability.
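
To make this concrete, below is a toy, in-memory Python sketch (purely illustrative, not a real DFS; every name is invented for the example): writes place copies of a file on several nodes, and reads are routed to any node that holds a copy.

# Toy model of replication in a distributed file system.
# Each "node" is a dict standing in for one machine's local storage.
nodes = {"node-a": {}, "node-b": {}, "node-c": {}}
REPLICATION_FACTOR = 2  # how many copies of each file to keep

def write_file(name, data):
    # Store copies of the file on REPLICATION_FACTOR nodes.
    targets = list(nodes)[:REPLICATION_FACTOR]
    for node in targets:
        nodes[node][name] = data
    return targets

def read_file(name):
    # Route the read to any node that holds a copy.
    for store in nodes.values():
        if name in store:
            return store[name]
    raise FileNotFoundError(name)

written_to = write_file("report.txt", b"quarterly numbers")
print(written_to)               # ['node-a', 'node-b']
print(read_file("report.txt"))  # b'quarterly numbers'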

Types of Distributed File Systems

Distributed file systems are classified into three types: local area network (LAN), wide area
network (WAN), and cloud-based. Each type has its own set of advantages and disadvantages.

Local Area Network (LAN)

A local area network (LAN) is a computer network restricted to a specific geographical area.
Computers within that area can be linked, allowing resources such as files, printers, and
applications to be shared. In a distributed file system, a LAN connects the participating
computers, so users can access files regardless of which machine on the network physically
stores them. This is advantageous because users can work with the same shared resources
without transferring files back and forth.

The corporate environment is one of the best examples of a LAN in a distributed file system.
Companies frequently have multiple offices, and all of them need access to the same files.
With a LAN-based distributed file system, every user sees the same files regardless of
location, so everyone works from the same information and no vital detail is overlooked.

The home is another example of a LAN in a distributed file system. Many families have
multiple computers, and it's convenient for everyone to access files without having to copy
them around. Using a LAN, all home computers can access the same files, regardless of which
computer stores them.

Wide Area Network (WAN)

A wide area network (WAN) lets a distributed file system store and access data over a
large geographical area. The WAN acts as a bridge between the different nodes in the
distributed file system, allowing the same file to be accessed from any node in the network.
This is achieved by routing data across the WAN using a network protocol, allowing data to flow
between the various locations. In addition, the WAN supports data transmission by establishing
a secure connection between the nodes.

The WAN also serves as the distributed file system's backbone: it ensures that data is
transmitted securely between nodes while providing a reliable connection, so information
arrives on time and intact. Strictly speaking, Ethernet, Fibre Channel, and Token Ring are
local (LAN/SAN) technologies used within individual sites; WAN links such as leased lines or
VPN tunnels interconnect those sites. Each technology provides a different level of performance
and security, so selecting the right one for your application is critical.

Cloud-based

A cloud-based distributed file system is a type of distributed file system that uses the internet
to store and access data. Amazon S3, Microsoft Azure, and Google Cloud Storage are examples of
cloud storage platforms that back such systems, and these are just a few of the many options
available today.

A cloud-based distributed file system stores files across multiple computers in a distributed
network rather than on just one. This improves performance by spreading the load across many
machines and improves resilience, since no single computer holds the only copy of a file.

To use a cloud-based distributed file system, you must first upload your files to a remote server.
This server could be a private server, a secure file storage service, or a public cloud storage
platform. Once your files are uploaded, they are stored in a distributed network of computers,
allowing you to access them from any device.
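
As a concrete illustration of the upload step, here is a minimal sketch using Amazon S3 through the boto3 SDK. The bucket name and file paths are assumptions for the example, and valid AWS credentials are assumed to be configured in the environment.

import boto3

# Create an S3 client (credentials are read from the environment/AWS config).
s3 = boto3.client("s3")

# Upload a local file to a bucket (bucket and key names are placeholders).
s3.upload_file("report.txt", "my-example-bucket", "backups/report.txt")

# Later, download it again from any device with access to the bucket.
s3.download_file("my-example-bucket", "backups/report.txt", "report-copy.txt")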

Architecture of DFS
The architecture of a distributed file system (DFS) involves various components and layers that
facilitate communication and data management between end users, DFS servers, cloud services,
local area networks (LANs), and storage systems. Here's an overview of the architecture from
end user to storage layers:

● End User to DFS Server:
○ End users interact with the DFS through client software or applications installed
on their devices (e.g., computers, smartphones).
○ The client software communicates with the DFS server over the network using
protocols such as NFS (Network File System), SMB (Server Message Block), or
HTTP-based APIs; a hedged client sketch appears after this section's summary.
○ End users can perform file operations such as read, write, delete, and access
control through the DFS client software.
● DFS Server to Cloud and Local Area Network (LAN):
○ The DFS server manages file storage and metadata operations within the
distributed file system.
○ It communicates with cloud services (if the DFS is integrated with cloud storage
providers) for storage provisioning, data replication, backup, and retrieval.
○ Within the LAN, the DFS server interacts with metadata servers, data nodes, and
other components to handle file system operations, maintain data consistency,
and manage storage resources.
● Cloud to Cloud Storage:
○ If the DFS leverages cloud storage for data storage and replication,
communication occurs between the DFS server and cloud storage services (e.g.,
Amazon S3, Google Cloud Storage, Azure Blob Storage).
○ Cloud-to-cloud communication involves data transfer, synchronization, and
replication of file data and metadata across different cloud storage regions or
providers.
○ The DFS server orchestrates data movements and ensures data consistency and
availability across cloud storage environments.
● Local Area Network (LAN) to Local Storage:
○ Within the LAN, data nodes and storage devices store file data in local storage
systems (e.g., hard drives, SSDs, network-attached storage).
○ Data nodes manage data blocks, replicas, and storage policies (e.g., data
redundancy, caching) to optimize performance and fault tolerance.
○ Communication between data nodes, metadata servers, and DFS clients within
the LAN ensures efficient data access, retrieval, and management.

Overall, the architecture of a distributed file system encompasses end-user interactions, DFS
server operations, cloud integration, LAN connectivity, and local storage management,
providing a seamless and scalable platform for file storage and access in distributed computing
environments.
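
To make the end-user-to-DFS-server step concrete, here is a sketch of a client performing write, read, and delete operations over an HTTP-based API. The endpoints below are hypothetical, invented for illustration; real deployments expose NFS, SMB, or their own HTTP interfaces (such as WebHDFS) instead.

import requests

# Hypothetical REST endpoint of a DFS server; not any real product's API.
BASE = "http://dfs-server.example.com/api/v1/files"

# Write: the DFS server persists the data on its data nodes.
requests.put(f"{BASE}/docs/notes.txt", data=b"hello dfs").raise_for_status()

# Read: the server locates the blocks and streams them back.
resp = requests.get(f"{BASE}/docs/notes.txt")
resp.raise_for_status()
print(resp.content)

# Delete: metadata and data blocks are removed across the cluster.
requests.delete(f"{BASE}/docs/notes.txt").raise_for_status()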

Applications of Distributed File System

Some of the major applications of the distributed file system are shown below:

NFS
Network File System (NFS) is a file-sharing protocol that works in a client-server architecture:
it allows users to mount and access directories located on a remote system as if they were local.
NFS is one of several DFS standards for network-attached storage. It provides a file-locking
mechanism that lets many clients safely share the same files, and an NFS server can service
multiple application or compute threads concurrently.
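
Because a mounted NFS export looks like an ordinary directory, clients use normal file I/O. The sketch below (the mount point and file name are assumptions) also takes an advisory lock, in the spirit of the file-locking mechanism mentioned above; on Linux, flock() over NFS requires lock support on the server side.

import fcntl

# Assume the export is already mounted, e.g.:
#   mount -t nfs fileserver:/export /mnt/nfs
path = "/mnt/nfs/shared.log"  # placeholder path on the NFS mount

with open(path, "a") as f:
    # Take an exclusive advisory lock so concurrent clients don't interleave writes.
    fcntl.flock(f, fcntl.LOCK_EX)
    f.write("appended safely under a lock\n")
    fcntl.flock(f, fcntl.LOCK_UN)  # release before closing
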
Hadoop

Hadoop provides a free, open-source distributed file system (HDFS) used to store, process, and
analyse data that is very large in volume. It is designed to handle large data sets across
clusters of computers using simple programming models. Using Hadoop, you can scale from a
single server to thousands of machines, each offering local computation and storage.
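
As a hedged sketch, the third-party `hdfs` Python package (hdfscli) can talk to HDFS over its WebHDFS interface; the NameNode address, port, user, and paths below are assumptions for the example.

from hdfs import InsecureClient  # pip install hdfs

# Connect to the NameNode's WebHDFS endpoint (address/user are placeholders).
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a file; HDFS splits it into blocks and replicates them across DataNodes.
client.write("/data/example.txt", data=b"hello from hdfs", overwrite=True)

# Read it back from the cluster.
with client.read("/data/example.txt") as reader:
    print(reader.read())
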
SMB

Server Message Block (SMB) is a file-sharing protocol originally developed at IBM and later
extended by Microsoft. It allows you to read and write files on a remote server over the local
area network. With SMB, you can share files, directories, printers, and other resources on a
company's internal network.
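
For illustration, the `smbprotocol` Python package exposes a high-level `smbclient` module for talking to SMB shares; the server name, share, credentials, and file path below are placeholders.

import smbclient  # pip install smbprotocol

# Register credentials for the file server (all names are placeholders).
smbclient.register_session("fileserver.example.com", username="user", password="secret")

# Read a file from a share much as if it were local.
with smbclient.open_file(r"\\fileserver.example.com\share\report.txt", mode="r") as f:
    print(f.read())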

NetWare

NetWare is a network operating system developed by Novell. It uses the IPX network protocol to
run various services on personal computers, and its client software supports several operating
systems, including Microsoft Windows, DOS, IBM OS/2, and Unix.

Advantages of DFS

Uptime

The great advantage of DFS is high reliability and uptime, which results from data redundancy:
the same information is stored on several nodes. This increases access speed and provides high
fault tolerance. If something goes wrong with one node, the same information can easily be
retrieved from the others, and the risk of all nodes experiencing downtime simultaneously is
close to zero.
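
Continuing the toy model from earlier (purely illustrative, with invented names), the sketch below simulates one node going offline and shows a read still being served from a surviving replica.

# Two replicas of the same file; node-a is simulated as offline.
nodes = {
    "node-a": {"report.txt": b"quarterly numbers"},
    "node-b": {"report.txt": b"quarterly numbers"},
    "node-c": {},
}
down = {"node-a"}  # pretend this node is unreachable

def read_with_failover(name):
    for node, store in nodes.items():
        if node in down:
            continue  # skip unreachable nodes
        if name in store:
            return node, store[name]
    raise FileNotFoundError(name)

print(read_with_failover("report.txt"))  # ('node-b', b'quarterly numbers')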

Maintenance

Following on from the previous point, distributed systems simplify server maintenance: you can
keep all components up to date and perform system upgrades without disrupting work. Even if a
reboot is required to apply the changes, you can do it gradually, node by node, with your data
served from a different node in the meantime.

Collaboration and Ease of Access

With DFS, your staff can access corporate data simultaneously from different locations. Beyond
simple access, employees can upload, download, and share files, allowing several people to work
on the same file at once. This shortens document-processing time and boosts productivity even
when the whole team is working remotely.

Disadvantages of DFS

Security Threats

Since the data is stored on multiple physical machines and locations, more investment is
required to secure the storage itself, the file-transfer protocols, and the connections to the
storage, because a breach can occur at any of these stages. Otherwise, attackers exploiting
system vulnerabilities could leak sensitive data and expose it on the internet. It is therefore
necessary to restrict access, for example with an internal VPN connection, and to add further
security layers such as 2FA or OTP to ensure that the data remains inaccessible from outside
the company.

Reduced Speed

Depending on the nature of the data and the capacity of the physical systems, read speeds can
drop when your team works with large masses of data, because the many connections and
incoming and outgoing server requests consume a lot of CPU and RAM. This is especially true
of older-generation SMB servers, which remain popular for cost reasons. The way out is to use
reliable infrastructure, in terms of both network and hardware.

Management Complications

Managing a distributed system is a more complex process than managing a centralised one;
database connections, their administration, and issue resolution all illustrate this. When data
is loaded from different sources, it can be difficult to find a faulty node and conduct proper
system diagnostics. If you opt for distributed systems, a team of highly skilled engineers is
therefore a must, and even then, incident-handling times may be longer due to the nature of
the system.

Examples of Distributed File Systems

There are various types of distributed file systems. Apache Hadoop, GlusterFS, GFS, and Ceph
are famous examples.

● Apache Hadoop Distributed File System (HDFS) was developed specifically for
processing large amounts of data. It is an open-source platform that allows for the
storage and analysis of vast amounts of distributed data. It has applications in many
fields, including e-commerce, healthcare, and financial sectors.
● GlusterFS is an open-source distributed file system that is known for its scalability and
high performance. It is a clustered file system that uses commodity hardware to create a
single, large, high-performance storage pool.
● Google File System (GFS) is a distributed file system that is used by Google to store its
data. GFS is designed to scale to very large sizes and to handle high volumes of data
traffic. It is a highly reliable and fault-tolerant system.
● Ceph is an open-source distributed storage system that is based on the object storage
model. Object storage is a type of storage that treats data as a collection of objects,
rather than as files and directories. Ceph is a highly scalable and fault-tolerant system
that can be used to store petabytes of data.

Conclusion: We have successfully prepared a case study on distributed file systems.
