Unit-2

What is a Data Science Sandbox?

• The Data Science Sandbox is an environment specifically designed for data
science and analytics.
• It gives data scientists and analysts a protected, shared environment
where models can be built and experiments conducted without harm to
application databases.
• A data sandbox, in the context of big data, is a scalable and developmental
platform used to explore an organization's rich information sets through
interaction and collaboration. It allows a company to realize its actual
investment value in big data.
Data Science Sandbox
• A data sandbox includes massively parallel central processing units,
high-end memory, high-capacity storage and I/O capacity, and
typically separates data experimentation from production
database environments in data warehouses.
• The IBM Netezza 1000 is an example of a data sandbox platform
which is a stand-alone analytic data mart. An example of a logical
partition in an enterprise data warehouse, which also serves as a
data sandbox platform, is the IBM Smart Analytics System.
• A Hadoop cluster like IBM InfoSphere BigInsights Enterprise
Edition is also included in this category.
Example
Data Science Sandbox characteristics:
• Scalability: can grow and shrink to accommodate the volume of data and computation needed. Cloud
environments provide powerful scalability.
• Shareable Code and Models: a combination of a source code repository and shareable code snippets enables model
management. Sharing code via Python notebooks and Zeppelin notebooks is a best practice.
• Data Science Platforms and Languages: provide the data scientist with a base on which to develop solutions. Platforms
can provide a graphical interface, for example Alteryx, KNIME and RapidMiner. Programming languages provide detailed
control, for example Python, R, Scala, C++ and Julia. Julia is often cited as executing 10x to 30x faster than Python
and R, though the gain depends on the workload.
• Data Science Libraries: provide prebuilt solutions to data science challenges. Core Python libraries such as NumPy
and pandas are a given, and data science libraries such as scikit-learn, TensorFlow and Keras provide a solution
framework.
• Data Protection: provides security for at-risk data such as customers' personal information. Data protection
measures can include user authorization, user authentication, firewalls, data encryption and data obfuscation. In
many cases, sensitive data can be removed before loading into the Data Science Sandbox environment.
• Data Engineering: supplies the data environment: datastores and data pipelines. This function is performed by
Data Engineers.
• Data Wrangling: enables data to be "massaged" at a detailed level. It includes data cleansing, filtering and
organizing. This function is performed by Data Scientists.
• Flexible Data Access: data virtualization, data lake, data import
• On Demand: enables new projects and research efforts to start quickly without
• Play and Experimentation: enables creative juices to flow for the development of innovative solutions. Data
Scientists can quickly develop and test hypotheses through experimentation.
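
The Data Protection and Data Wrangling characteristics above can be illustrated with a small Python sketch. It is only one possible approach, not a prescribed method: the field names (email, name, region, amount), the demo key and the helper functions pseudonymize and prepare_for_sandbox are all hypothetical. The sketch obfuscates the customer identifier with a keyed hash, drops the name field entirely, and then cleanses, filters and organizes the rows before they would be loaded into a sandbox.

```python
import hashlib
import hmac

# Hypothetical demo key (assumption); a real key would come from a secrets store.
SECRET_KEY = b"sandbox-demo-key"


def pseudonymize(value: str) -> str:
    """Obfuscate an identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same token, so rows remain
    joinable without exposing the raw value.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_sandbox(rows):
    """Obfuscate sensitive fields, then cleanse, filter and organize rows."""
    prepared = []
    for row in rows:
        # Cleansing: skip rows whose amount is missing or not numeric.
        try:
            amount = float(row.get("amount", ""))
        except (TypeError, ValueError):
            continue
        # Filtering: keep only positive amounts.
        if amount <= 0:
            continue
        prepared.append({
            # Obfuscation: the email is normalized, then replaced by a token.
            "customer_key": pseudonymize(row["email"].strip().lower()),
            # Cleansing: normalize whitespace and case of the region code.
            "region": row["region"].strip().upper(),
            "amount": amount,
            # The "name" field is deliberately not copied (sensitive data removed).
        })
    # Organizing: sort by region, then by descending amount.
    prepared.sort(key=lambda r: (r["region"], -r["amount"]))
    return prepared


# Made-up sample rows, for illustration only.
raw_rows = [
    {"email": "Ann@example.com ", "name": "Ann", "region": " west", "amount": "120.50"},
    {"email": "bob@example.com", "name": "Bob", "region": "east", "amount": "n/a"},
    {"email": "ann@example.com", "name": "Ann", "region": "west", "amount": "30"},
    {"email": "cy@example.com", "name": "Cy", "region": "east", "amount": "-5"},
]

sandbox_rows = prepare_for_sandbox(raw_rows)
for row in sandbox_rows:
    print(row)
```

A keyed hash is used here rather than a plain hash so that tokens cannot be reproduced by someone who knows the email addresses but not the key. Other obfuscation techniques, such as masking or tokenization tables, would fit the same place in the pipeline.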
