
Architecture Stories for Big and Tiny Data

We are Unnati Data Labs
@raghothams @nischalhp
3 Stories
Velocity | Volume | Variety
What are we solving? | Infrastructure | Architecture | Learnings
Data Driven FinTech
Small Data | Early Startup
FinTech | What are we solving?
Evaluate college students to determine their creditworthiness
Lack of credit history
Tiny data
Enrich data with alternate data sources
Statistical modelling to evaluate students initially.
As user activity increases, build machine learning models to predict creditworthiness.
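The initial statistical model described above can be sketched as a hand-weighted score over alternate-data features. This is purely illustrative: the feature names, weights, and score range below are invented, not the actual model used.

```python
# Illustrative only: a hand-weighted score over hypothetical
# alternate-data features (academic record, bill payments, app
# activity), standing in for the initial statistical model.

FEATURE_WEIGHTS = {
    "gpa": 0.40,                # academic performance, normalized to 0-1
    "bill_payment_rate": 0.35,  # on-time utility/phone payments, 0-1
    "activity_score": 0.25,     # app engagement, 0-1
}

def credit_score(features):
    """Weighted sum of normalized features, mapped onto a 300-850 scale."""
    raw = sum(FEATURE_WEIGHTS[name] * features.get(name, 0.0)
              for name in FEATURE_WEIGHTS)
    return int(300 + raw * 550)

def is_creditworthy(features, threshold=600):
    """Simple cutoff decision; threshold is an arbitrary example value."""
    return credit_score(features) >= threshold
```

As user activity data accumulates, the hand-picked weights would be replaced by coefficients learned from repayment outcomes.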
FinTech | Thought process for Infrastructure

Data velocity estimation for the next 6 months

Complexity of data science algorithms
No. of calls being serviced by the data science APIs

1 × AWS instance | 8 GB ram | 4 cores

FinTech | Architecture
FinTech | Learnings

Small data problems are tricky

Go after the low-hanging fruit first
Need clever techniques
Beware of data sanity with NoSQL
Embracing data science early helps the business grow taller, stronger & sharper
Campaign Management
Medium Size Data | Established Startup
Campaign Management | What are we solving?

Predict user behavior

Business has amassed data over 2-3 years
Educate team about data science & benefits
Ideate & prioritize problems that can be solved
RoI, pricing for new plugins
Campaign Management | Thought process for Infrastructure
200+ Million rows
Parallel Analytics data warehouse
Data pipelines, automated workflows
Distributed machine learning models
Prediction as a Service

Dedicated bare metal server

32 GB ram | 8 cores | 1 TB SSD
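"Prediction as a Service" can be sketched as a trained model object sitting behind a thin JSON endpoint. The model class, feature name, and decision rule below are stubs for illustration; in practice the model would be a serialized artifact from the distributed training step.

```python
import json

# Hypothetical sketch of "Prediction as a Service": a model object
# exposed behind a thin JSON endpoint. StubModel stands in for a real
# trained campaign-response model loaded from disk.

class StubModel:
    def predict(self, features):
        # Placeholder rule: invented for illustration only.
        return 1 if features.get("past_clicks", 0) > 2 else 0

MODEL = StubModel()

def handle_predict(request_body: str) -> str:
    """Parse a JSON feature payload, run the model, return a JSON reply."""
    features = json.loads(request_body)
    return json.dumps({"prediction": MODEL.predict(features)})
```

A web framework would route POST bodies into `handle_predict`; keeping the handler a pure function makes it easy to test without the server.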
Campaign Management | Architecture
Campaign Management | Learnings

PostgreSQL read replicas pause long-running queries

Understand PostgreSQL WALs (write-ahead logs)
Data pipelines break. Exception handling, notifications & logging are of utmost importance
We wired Luigi exceptions to Slack for notifications
Pandas transformations are slow for large datasets
PySpark to the rescue!
Use monitoring tools like Munin for profiling
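The Luigi-to-Slack wiring from the learnings above can be sketched as a failure event handler that formats a payload for an incoming webhook. Only the message formatting is shown here; the webhook URL and the HTTP POST are omitted, and the task name is a made-up example.

```python
import json

# Sketch of notifying Slack on pipeline failures. In the real setup
# this hangs off Luigi's failure hook, roughly:
#
#   @luigi.Task.event_handler(luigi.Event.FAILURE)
#   def on_failure(task, exception):
#       post_to_slack(format_failure(task.task_id, exception))
#
# where post_to_slack() POSTs the payload to a Slack incoming webhook.

def format_failure(task_id: str, exception: Exception) -> str:
    """Build a Slack-compatible JSON payload for a failed task."""
    return json.dumps({
        "text": f":rotating_light: Task `{task_id}` failed: {exception!r}"
    })
```

Keeping the formatter separate from the hook makes it trivial to unit-test the notification text without running a pipeline.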
Unstructured Healthcare
Medium - Big Size Data | Generic Data Science Platform
Healthcare | What are we solving?
Analytics on healthcare spend
Medical claims - many data providers - no standard
Data volume 500 M rows to start + high velocity
Robust data ingestion, data cleaning system
Data Security and HIPAA compliance
Data pipeline is the heart of the platform

Adding more servers is easy, writing more code is not
Healthcare | Thought process for Infrastructure
Flexible schema calling out for NoSQL
Massive ingestion & cleaning tasks
Denormalize + Wide format
100s of transformation & analytics tasks
Luigi to the rescue
Spark for transformation & analytics
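The "denormalize + wide format" step above can be illustrated with a toy pivot: long-format (claim_id, field, value) records, as they might arrive from different providers, are collapsed into one wide row per claim. The field names and values are hypothetical.

```python
# Toy illustration of denormalizing long-format claim records into
# wide rows, one per claim. In the real pipeline this runs as a Spark
# transformation over hundreds of millions of rows; plain Python is
# used here only to show the shape of the operation.

def to_wide(long_records):
    """Pivot (claim_id, field, value) triples into wide rows."""
    wide = {}
    for claim_id, field, value in long_records:
        wide.setdefault(claim_id, {"claim_id": claim_id})[field] = value
    return list(wide.values())

rows = to_wide([
    ("c1", "provider", "acme"),
    ("c1", "amount", 120.0),
    ("c2", "provider", "beta"),
])
```

The wide rows are redundant relative to a normalized schema, but as the learnings note, that redundancy is fine for analytics workloads.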

Database Instance
32 GB ram | 8 cores | 5 TB SSD
Application Instance
API Server 4 GB ram | 2 cores | 500 GB SSD
Healthcare | Architecture
Healthcare | Learnings
Authorizations for databases are very important
Aim to parallelize tasks for ingestion
Data redundancy is totally fine for data science
Polyglot of services - Use the right tools
Understand business expectations & landscape before jumping into architecture
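"Aim to parallelize tasks for ingestion" can be sketched with a thread pool running one ingestion task per data provider. `ingest_one` is a stand-in for fetching and cleaning one provider's claims feed; the provider names and row counts are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel per-provider ingestion. ingest_one stands in for
# the real fetch + clean work for one provider's feed; here it just
# returns a fake cleaned-row count derived from the provider name.

def ingest_one(provider: str) -> tuple:
    cleaned_rows = len(provider) * 10  # placeholder for real work
    return provider, cleaned_rows

def ingest_all(providers):
    """Run one ingestion task per provider concurrently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(ingest_one, providers))

counts = ingest_all(["acme", "beta", "gamma"])
```

Since each provider's feed is independent, the tasks have no shared state and parallelize cleanly; in the deck's setup the same fan-out would be expressed as independent Luigi tasks.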

Any questions?

tweet: @unnati_xyz