IndicThreads-Pune12-NoSQL Now and Path Ahead

NoSQL: Now and Path Ahead
Shubham Kumar Srivastava

MakeMyTrip
Who am I
Abstract
What and Why : NoSql Fundamentals
Use Case
Challenges Path Ahead
What is NoSql
A database that does not adhere to the traditional relational database management system (RDBMS) structure.
Why NoSql
Scalability and Performance
Cost
Data Modeling
Why NoSql : Motives and Drivers

Scalability and Performance
Horizontal scalability better than Vertical
Hardware getting cheaper and processing power increasing
Less operational complexity than RDBMS solutions.
Most solutions provide automatic sharding and similar features by default.
Why NoSql : Motives and Drivers contd..

Cost
Scaling an RDBMS to the level NoSQL offers comes at a hefty cost.
Commodity hardware, software versions, upgrades, maintenance.
This led organizations to look for alternatives and created the need for a cost-effective scale-out option.

Data Modeling
SQL has been for:
Concurrency, consistency, integrity
Summations, aggregations, groupings
Schema says: what can I answer?

Data Modeling
A plain key-value store is very powerful and fits most use cases for a NoSQL solution.
Hierarchical or graph-like data modelling and processing.
Values like maps of maps of maps.
Document databases, which can store arbitrarily complex objects.
Document based indexing data stores are a huge success.

At times software applications are not limited to these constraints. This led to data models like:
Key/Value Store: Redis, MemcacheDB, Voldemort etc.
Wide Column Store / Column Families: Cassandra, Hadoop (HBase), Hypertable, Cloudera etc.
Document Based Stores: Solr/Lucene, MongoDB, CouchDB, Terrastore etc.
Graph Data Store: Neo4j, GraphBase, FlockDB etc.

Schema says: what are the questions?
Data modeling is based on the set of queries.
Exploit de-normalization and duplication.
Use aggregates.
Manage joins with app + aggregation + de-normalization etc.
Some Fanda-mentals
CAP Theorem
At most two of the following three properties can be satisfied at the same time in a shared/distributed system.
Consistency
Availability
Tolerance to Network Partitions
CAP : Pictorially
Explanation
Use case:
Scaling Web Apps
Critical facts :
Network outages are common.
Customer shopping carts, email search, and social network queries can tolerate stale data.
How:
Compromise on consistency in order to remain available, rather than disrupt user service during outages.
Explanation
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state.
Brewer's CAP theorem says you have no choice if you want to scale up.
Explanation contd..
Sharp Contrast : High Speed Financial Application
Highly Transactional
Consistent
Automated
Cannot live with eventual consistency
ACID vs BASE
ACID
Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction cannot leave the database in an inconsistent state.
Isolated: Transactions cannot interfere with each other.
Durable: Completed transactions persist, even when servers restart etc.
Some Fanda-mentals cont..

BASE
Basically available
Soft-state
Eventual consistency
Consistent Hashing
A common way to load balance.
The machine chosen to cache object o will be:

hash(o) mod n, where n is the total number of machines
Consistent Hashing contd..

Adding a machine to the cache means hash(o) mod (n + 1).
Removing a machine from the cache means hash(o) mod (n - 1).
Result in either case: disaster. Almost every object maps to a different machine, so machines are swamped with redistribution.
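To make the scale of that redistribution concrete, the sketch below counts how many keys change machine when a cluster grows from 4 to 5 machines under hash(o) mod n. It assumes integer keys whose hash is the key itself; the names ModuloRemap and countMoved are illustrative and do not come from the original slides.

```java
class ModuloRemap {
    // Counts keys in [0, keyCount) whose machine changes when the cluster
    // size goes from oldN to newN under "hash(o) mod n" placement.
    // Assumption: the hash of an integer key is the key itself.
    static int countMoved(int keyCount, int oldN, int newN) {
        int moved = 0;
        for (int key = 0; key < keyCount; key++) {
            if (Math.floorMod(key, oldN) != Math.floorMod(key, newN)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countMoved(10_000, 4, 5);
        System.out.println(moved + " of 10000 keys move when going from 4 to 5 machines");
    }
}
```

Under these assumptions 8,000 of the 10,000 keys (80%) end up on a different machine, although only one machine was added.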
Commonly, a hash function (e.g. MD5) maps a value into a 128-bit key space, 0 to 2^128 - 1 (or even a 32-bit space, as shown next).

Both keys and machines are hashed with the same function.

Adding a Node

Removing a Node
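The ring idea from these slides can be sketched as follows. This is a minimal illustration and not the implementation of any particular NoSQL product: it assumes the first 32 bits of an MD5 digest as the ring position, a single position per machine (no virtual nodes), and hypothetical names (HashRing, position, nodeFor).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class HashRing {
    // Ring position -> machine name, kept sorted so we can walk clockwise.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Keys and machines are hashed with the same function:
    // the first 4 bytes of MD5, read as an unsigned 32-bit number.
    static long position(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFFL) << 24) | ((d[1] & 0xFFL) << 16)
                    | ((d[2] & 0xFFL) << 8) | (d[3] & 0xFFL);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: no two machines hash to the same position.
    void addNode(String node) {
        ring.put(position(node), node);
    }

    void removeNode(String node) {
        ring.remove(position(node));
    }

    // A key belongs to the first machine at or after its position,
    // wrapping around to the start of the ring if there is none.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no machines");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(position(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

When a machine is added, only the keys between its predecessor on the ring and the new machine move to it; when a machine is removed, only its own keys move to its successor. All other keys stay where they were.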
Use Case and NoSQL Solution

Problem:
Need to store bookings per day for all hotels. Queries are centered around cities and regions.
Hotel count: 1 million
Date range: now to the next 365 * 2 days
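A rough sizing of this problem, together with one possible query-driven row key, can be sketched as below. The key layout (city code plus date, with one column per hotel) is an assumption made for illustration; the slides do not state which layout was actually used. The names BookingSizing, hotelDayCells and rowKey are hypothetical.

```java
import java.time.LocalDate;

class BookingSizing {
    // One cell per hotel per day over the whole date range.
    static long hotelDayCells(long hotels, int days) {
        return hotels * days;
    }

    // Hypothetical row key: all hotels of a city on one date share a row,
    // so a "city + date" query reads a single row.
    static String rowKey(String cityCode, LocalDate date) {
        return cityCode + ":" + date; // LocalDate prints as ISO yyyy-MM-dd
    }

    public static void main(String[] args) {
        System.out.println(hotelDayCells(1_000_000L, 365 * 2));
        System.out.println(rowKey("DEL", LocalDate.of(2012, 12, 1)));
    }
}
```

With 1 million hotels and 730 days this comes to 730 million hotel-day cells, which is why the key design has to follow the queries (city and region) rather than a normalized schema.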
NoSQL: Path Ahead

ACID equivalence (Neo4j, CouchDB etc.)
Transaction support
Atomicity
MVCC
NoSQL: Path Ahead contd..

Possible Solution
Use a SQL database for creates, updates, etc.
Archive the data in NoSQL for query/analysis etc.

Enterprise Adoption and Challenges
NoSQL looks good largely for unstructured data.

SQL is the best choice for a broad range of traditional workloads.

Shout out loud
Hybrid
ACID + BASE
They are not alternatives but supplements.

Maturity
Support
Skillset and Administration/Operation
Analytics and BI support
Q&A
References
Nancy Lynch and Seth Gilbert, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59.
"Brewer's CAP Theorem", julianbrowne.com, retrieved 02-Mar-2010.
"Brewer's CAP theorem on distributed systems", royans.net.
"CAP Twelve Years Later: How the 'Rules' Have Changed"; on-line resource.
E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10; on-line resource.
D. Abadi, "Problems with CAP, and Yahoo's Little Known NoSQL System," DBMS Musings, blog, 23 Apr. 2010; on-line resource.
C. Hale, "You Can't Sacrifice Partition Tolerance," 7 Oct. 2010; on-line resource.
Facebook: Scaling Out; on-line resource.
Gemstone: The Hardest Problems in Data Management; on-line resource.
The Log-Structured Merge-Tree (research paper).
CodeProject: Consistent Hashing; on-line resource.
HighlyScalable: NoSQL Data Modeling Techniques; on-line resource.
eBay Tech Blog: Cassandra Data Modeling Best Practices; on-line resource.
John D. Cook: ACID vs BASE; on-line resource.
Merkle Trees.
The Phi Accrual Failure Detector (research paper).
Backup Slides
Better than the Original 1
Document Based DataStore

{
  _id : ObjectId("4e77bb3b8a3e000000004f7a"),
  when : Date("2011-09-19T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post. It could be very long.",
  tags : [ "business", "ramblings" ],
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
  comments : [
    { who : "jane", when : Date("2011-09-19T04:00:10.112Z"), comment : "I agree." },
    { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"), comment : "You must be joking. etc etc ..." }
  ]
}
User and Items
User and Items : Option 1
Cassandra CF
Cassandra SuperCF
Use Case 1
Ecommerce Site
Problem : Record User Preferences e.g : Location,IP,Currency selected, Source of Traffic, Multiple other dynamic values
Solution: In a CF (column family) based structure, keep it simple:
UserId_Key: Pref1_Name:Value1, Pref2_Name:Value2, ..., PrefN_Name:ValueN
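The row layout above (one row per user, one column per preference) can be modelled in memory as a map of sorted maps. The sketch below is only an illustration of the data structure, not Cassandra client code; the class PreferenceStore and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class PreferenceStore {
    // Row key (user id) -> columns sorted by name (preference name -> value).
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    // Adds or overwrites one preference column in the user's row.
    void save(String userId, String name, String value) {
        rows.computeIfAbsent(userId, k -> new TreeMap<>()).put(name, value);
    }

    // Returns only the requested columns that exist, in the order requested.
    Map<String, String> get(String userId, String... names) {
        Map<String, String> result = new LinkedHashMap<>();
        TreeMap<String, String> columns = rows.get(userId);
        if (columns == null) {
            return result;
        }
        for (String name : names) {
            String value = columns.get(name);
            if (value != null) {
                result.put(name, value);
            }
        }
        return result;
    }
}
```

Because each row carries its own set of columns, new dynamic preferences can be added per user without any schema change, which is the point of keeping the structure this simple.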
Use Case 1
RowKey: 1350136093705_6501082438199894
=> (column=1350136093764, value=-3242432#911167901131523, timestamp=1350136093766000)
=> (column=1350283322499, value=GOI#200701231712126570, timestamp=1350283322502001)
=> (column=1350283566051, value=GOI#200703221605283033, timestamp=1350283566054001)
=> (column=1350749595676, value=GOI#200805261514037199, timestamp=1350749595677001)
=> (column=1350785230322, value=BOM#200701251747233158, timestamp=1350785230324001)
------------------
RowKey: 1354499614310_10861558002828044
=> (column=1354499614368, value=TRV#201104071059204768, timestamp=1354499614370000, ttl=1728000)
------------------
RowKey: 1349760150553_6114662943774777
=> (column=1349760152066, value=BLR#200802111324575807, timestamp=1349760152068001)
------------------
RowKey: 1349805109805_6167423558533191
=> (column=1349805111833, value=TRV#312254274337517, timestamp=1349805111835001)
------------------
RowKey: 1354435656227_7908056941568359
=> (column=1354435656367, value=IDR#200701211254519381, timestamp=1354435656369000, ttl=1728000)
------------------
RowKey: 1347648097261_15570089270962881
=> (column=1347648097304, value=DEL#201101192008115545, timestamp=1347648097307000)
Use Case 1
Get
// Reads the named preference columns from one user's row (Hector client API).
private Map<String, String> getPrerences(Keyspace keySpace, String userId, String... prefernceNames)
        throws IOException, CharacterCodingException {
    SliceQuery<String, String, String> rsq = HFactory.createSliceQuery(keySpace,
            StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    rsq.setColumnFamily(USER_PREFERENCE);
    rsq.setKey(userId);
    rsq.setColumnNames(prefernceNames);
    QueryResult<ColumnSlice<String, String>> orows = rsq.execute();
    // LinkedHashMap keeps the columns in the order they are returned.
    Map<String, String> preferenceMap = new LinkedHashMap<String, String>();
    for (HColumn<String, String> column : orows.get().getColumns()) {
        preferenceMap.put(column.getName(), column.getValue());
    }
    return preferenceMap;
}
Use Case 1
Save
// Writes one preference column with a TTL so that it expires automatically.
Mutator<String> m = HFactory.createMutator(keySpace, StringSerializer.get());
HColumn<String, String> userPrefrences = HFactory.createColumn(colkey, colvalue,
        StringSerializer.get(), StringSerializer.get());
userPrefrences.setTtl(ttlUserPrefrences);
m.addInsertion(rowkey, USER_PREFERENCE, userPrefrences);
m.execute();
Use Case 2
Online Travel Site
Problem:
Need to know different metrics for a city's hotels, e.g.:

Hotels booked in the last X time
Hotels last viewed in Y time
Hotels left with Z inventory
Use Case 2
RowKey: 2d323436353731
=> (super_column=911167901297486,
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 23 hour(s) ago., timestamp=1354962852610000)
     (column=6c6173747669657765646d657373616762, value=Inventory#20, timestamp=1354962852610000)
     (column=6c6173747669657765646d657373616769, value=Bookings#8, timestamp=135496282610000))
------------------
RowKey: 58524f
=> (super_column=200903041759196196,
     (column=6c617374626f6f6b65646d657373616765, value=Booked#Last booked 1 day(s) ago., timestamp=1347781187842000)
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 2 hours ago., timestamp=1347707080147000))
=> (super_column=200903041848352230,
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 1 day(s) ago., timestamp=1347266107708000))
Use Case 2
// Reads all super columns (one per hotel) for a city row and turns each into a document field.
SuperSliceQuery<String, String, String, String> superQuery = HFactory.createSuperSliceQuery(getKeySpace(),
        StringSerializer.get(), StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
superQuery.setColumnFamily(SUPER_SOCIAL_MESSAGE).setKey(cityCode);
QueryResult<SuperSlice<String, String, String>> result = superQuery.execute();
List<HSuperColumn<String, String, String>> superColumns = result.get().getSuperColumns();
if (superColumns != null) {
    for (HSuperColumn<String, String, String> superColumn : superColumns) {
        Map<String, String> messages = new HashMap<String, String>();
        List<HColumn<String, String>> columns = superColumn.getColumns();
        if (columns != null) {
            for (HColumn<String, String> column : columns) {
                messages.put(column.getName(), column.getValue());
            }
        }
        /* The equivalent doc */
        document.addField(superColumn.getName(), messages);
        documents.add(document);
    }
}
Pig Script : MR
<document>
<pigscript start="-16" end="-43200" start1="-1441" end1="-10080" start2="0" end2="-15" start3="0" end3="-1440">
<comment>Delete All Messages</comment>
<query><![CDATA[rows0 = LOAD 'cassandra://LH/HotelMessage' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );]]></query>
<query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
<query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
<query><![CDATA[userhotel0 = FOREACH cols0 GENERATE key as key,com.mmt.solr.hotels.cassandra.ByteBufferToString($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>
<query><![CDATA[uriCounts0 = FOREACH userhotel0 GENERATE key as citycode,com.mmt.solr.hotels.cassandra.ToBag(TOTUPLE(name,null));]]></query>
<comment>Last Viewed start 15 minutes to 30 days ago</comment>
<query><![CDATA[rows = LOAD 'cassandra://LH/LastViewedHotels?slice_start=#start&slice_end=#end&limit=1024&reversed=true' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
<query><![CDATA[cols = FOREACH rows GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
<query><![CDATA[userhotel = FOREACH cols GENERATE key as key,com.mmt.solr.hotels.cassandra.LongToHours($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>
<query><![CDATA[userhotelByCity = FOREACH userhotel GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,'#',2)) as (citycode:chararray,hotelid:chararray);]]></query>
<query><![CDATA[groupByhotels = GROUP userhotelByCity BY hotelid;]]></query>
<query><![CDATA[uriCounts = FOREACH groupByhotels { D = LIMIT userhotelByCity 1; GENERATE flatten(D.citycode) as citycode,com.mmt.solr.hotels.cassandra.ToBag( TOTUPLE(group,com.mmt.solr.hotels.cassandra.StringAppend('VIEWED#Last viewed ',D.name,' ago.'))); };]]></query>
<comment>Last Booked 1 to 8 days ago</comment>
<query><![CDATA[rows1 = LOAD 'cassandra://LH/BookedHotels?slice_start=#startA&slice_end=#endA&limit=1024&reversed=true' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
<query><![CDATA[cols1 = FOREACH rows1 GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
<query><![CDATA[userhotel1 = FOREACH cols1 GENERATE key as key,com.mmt.solr.hotels.cassandra.LongToHours($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>
<query><![CDATA[userhotelByCity1 = FOREACH userhotel1 GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,'#',2)) as (citycode:chararray,hotelid:chararray);]]></query>
<query><![CDATA[groupByhotels1 = GROUP userhotelByCity1 BY hotelid;]]></query>
<query><![CDATA[uriCounts1 = FOREACH groupByhotels1 { D = LIMIT userhotelByCity1 1; GENERATE flatten(D.citycode) as citycode,com.mmt.solr.hotels.cassandra.ToBag( TOTUPLE(group,com.mmt.solr.hotels.cassandra.StringAppend('Booked#Last booked ',D.name,' ago.'))); };]]></query>
Criteria to Evaluate NoSQL Solutions
Internal partitioning
Automated flexible data distribution
Hot-swappable nodes
Replication style
Automated failover strategy
