• The Data Science Sandbox is an environment specifically designed for data science and analytics.
• It gives data scientists and analysts a protected, shared environment where models can be built and experiments conducted without harm to application databases.
• A data sandbox, in the context of big data, is a scalable development platform used to explore an organization's rich information sets through interaction and collaboration. It allows a company to realize the actual value of its investment in big data.

Data Science Sandbox
• A data sandbox includes massively parallel central processing units, high-end memory, high-capacity storage and I/O capacity, and typically separates data experimentation from production database environments in data warehouses.
• The IBM Netezza 1000 is an example of a data sandbox platform that is a stand-alone analytic data mart. The IBM Smart Analytics System is an example of a logical partition in an enterprise data warehouse that also serves as a data sandbox platform.
• A Hadoop cluster such as IBM InfoSphere BigInsights Enterprise Edition is also included in this category.

Example Data Science Sandbox characteristics:
• Scalability: can grow and shrink to accommodate the volume of data and computation needed. Cloud environments provide powerful scalability.
• Shareable Code and Models: a combination of a source code repository and shareable code snippets enables model management. Code sharing via Python notebooks and Zeppelin notebooks is a best practice.
• Data Science Platforms and Languages: provide the data scientist with a base on which to develop solutions. Platforms such as Alteryx, KNIME and RapidMiner can provide a graphical interface. Programming languages such as Python, R, Scala, C++ and Julia provide detailed control. Julia is often cited as executing 10x to 30x faster than Python and R, although the actual gap depends on the workload.
• Data Science Libraries: provide prebuilt solutions to data science challenges.
Python core libraries such as NumPy and pandas are a given, and data science libraries such as scikit-learn, TensorFlow and Keras provide a solution framework.
• Data Protection: provides security for at-risk data such as customer personal information. Data protection measures can include user authorization, user authentication, firewalls, data encryption and data obfuscation. In many cases, sensitive data can be removed before loading into the Data Science Sandbox environment.
• Data Engineering: supplies the data environment, meaning datastores and data pipelines. This function is performed by Data Engineers.
• Data Wrangling: enables data to be "massaged" at a detail level. It includes data cleansing, filtering and organizing. This function is performed by Data Scientists.
• Flexible Data Access: data virtualization, data lake, data import
• On Demand: enables new projects and research efforts to start quickly without
• Play and Experimentation: enables creative juices to flow for the development of innovative solutions. Data Scientists can quickly develop and test hypotheses through experimentation.
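
The Data Protection and Data Wrangling characteristics above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production design: the record layout, the field names (`customer_id`, `email`, `region`, `spend`), the salt value and the helper names (`obfuscate`, `protect`, `wrangle`, `load_sandbox`) are all assumptions made for the example. Note that a truncated salted hash is a simple form of obfuscation and may not be sufficient protection on its own.

```python
import hashlib

# Hypothetical records, standing in for an extract from a production system.
RAW_ROWS = [
    {"customer_id": "C001", "email": "ann@example.com", "region": " east ", "spend": "120.50"},
    {"customer_id": "C002", "email": "bob@example.com", "region": "West", "spend": ""},
    {"customer_id": "C003", "email": "cy@example.com", "region": "east", "spend": "80"},
]

SENSITIVE_DROP = {"email"}        # assumed fields to remove before loading
SENSITIVE_HASH = {"customer_id"}  # assumed fields to obfuscate but keep joinable


def obfuscate(value, salt):
    """Replace a value with a salted SHA-256 digest, truncated to 12 hex chars."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def protect(row, salt):
    """Data protection: drop or obfuscate sensitive fields in one record."""
    out = {}
    for key, value in row.items():
        if key in SENSITIVE_DROP:
            continue
        out[key] = obfuscate(value, salt) if key in SENSITIVE_HASH else value
    return out


def wrangle(rows):
    """Data wrangling: cleanse, filter and organize the records."""
    cleaned = []
    for row in rows:
        if not row["spend"]:  # filtering: skip rows with no spend value
            continue
        cleaned.append({
            **row,
            "region": row["region"].strip().lower(),  # cleansing: normalize text
            "spend": float(row["spend"]),             # cleansing: fix the type
        })
    # organizing: highest spend first
    return sorted(cleaned, key=lambda r: r["spend"], reverse=True)


def load_sandbox(rows, salt="demo-salt"):
    """Protect every record first, then wrangle the protected copies."""
    return wrangle([protect(row, salt) for row in rows])


if __name__ == "__main__":
    for record in load_sandbox(RAW_ROWS):
        print(record)
```

Protecting the data before wrangling means the raw identifiers never reach the sandbox, while the consistent hash still lets analysts group or join records that belong to the same customer.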