
1

Scaling Security on 100s of Millions of Mobile Devices Using Kafka and Scylla
Mobile cybersecurity leader Lookout talks through their data ingestion journey
2

Speakers

Richard Ney, Principal Engineer
Over 30 years of experience, predominantly with event pipelines and data retrieval. Currently a platform architect and principal developer at Lookout Inc on the Ingestion Pipeline and Query Services team, working on the next scale of data ingestion.

Eyal Gutkind, VP of Solutions
A solution architect for Scylla. Prior to Scylla, Eyal held product management roles at Mirantis and DataStax, and spent 12 years with Mellanox Technologies in various engineering management and product marketing roles.

Jeff Bean, Partner Solutions Architect
An experienced software engineer and technical evangelist with many years in the open source software ecosystem. He leads the Confluent Verified Integrations Program, which helps partners build and verify integrations with Confluent Platform.
3

Agenda

● Confluent Platform
● ScyllaDB
● Lookout’s Journey
  ○ About Lookout
  ○ Current data ingestion design and issues
  ○ Scaling to 1 Million devices
  ○ Technology decisions and cost analysis
  ○ Testing results
● Takeaways
4

Jeff Bean
Partner Solutions Architect,
Confluent
[Diagram]
Without Confluent: a Universal Event Pipeline connecting Data Stores, Logs, 3rd Party Apps, and Custom Apps/Microservices.
With Confluent (STREAMS | CONNECT | CLIENTS): Contextual Event-Driven Applications, including Real-Time Inventory, Real-Time Fraud Detection, Real-Time Customer 360, Machine Learning Models, and Real-Time Data Transformation.
Confluent Platform
The Event Streaming Platform Built by the Original Creators of Apache Kafka®

● Support, Services, Training & Partners
● Operations and Security (Mission-critical Reliability): Security plugins | Role-Based Access Control | Control Center | Replicator | Auto Data Balancer | Operator
● Development & Stream Processing (Complete Event Streaming Platform): Clients | REST Proxy | Connectors | KSQL | MQTT Proxy | Schema Registry
● Apache Kafka: Connect | Continuous Commit Log | Streams
● Self-Managed Software (Datacenter, Public Cloud) or Fully-Managed Service (Confluent Cloud): Freedom of Choice


9

Eyal Gutkind
VP of Solutions,
Scylla
10

About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ New: Scylla Cloud, DBaaS
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
11

Scylla & Confluent


12

Scylla Benefits

● Lower Node Count: millions of OPS per node reduces the number of nodes your application requires
● Predictable, Low Latencies: consistent single-digit millisecond p99 latencies
● Less Complexity: smaller footprint, self-optimizing, works out-of-the-box
● Cassandra Compatibility: drop-in replacement, compatible with the full C* ecosystem (drivers, etc.)
13

Version          Apache Cassandra               Scylla Enterprise
Data Layer       600 nodes i3.2xlarge           60 nodes i3.2xlarge
Caching Layer    60 nodes Varnish m4.4xlarge    No caching
Annual Spend*    $3.7 million/yr                $328k/yr

*per publicly posted AWS hourly fees

+ Simplified deployments
+ Reduced infrastructure
+ Performance improvements
14

Scylla Alternator
+ DynamoDB-compatible API for Scylla
+ Scale-out system with complete observability
+ Deployment flexibility
  + On prem, Multi-Cloud, Hybrid
  + Open Source or Scylla Cloud
+ No vendor lock-in
+ Better performance and less expensive

scylladb.com/alternator
15

Richard Ney
Principal Engineer,
Lookout
16

What does Lookout do?

● Founded in 2004 when the original founders discovered a Bluetooth vulnerability in Nokia phones
● Demonstrated the need for mobile security at the 2005 Academy Awards by downloading information from celebrity phones 1.5 miles from the venue
● Provides security scanning for mobile devices for the Enterprise and Consumer markets
● Enterprise customers can apply corporate policies to devices registered in their enterprise
● To apply these policies, Lookout ingests data about device configuration and the applications installed on devices
17

Starting Point
18

What is the Common Device Service?

● Functions as a proxy for all mobile devices in the Lookout fleet
● Device telemetry is sent at various intervals for these categories:
  ○ Binary Artifacts
  ○ Software
  ○ Risky Configuration
  ○ Hardware
  ○ Personal Content Protection (safe browsing)
  ○ Client
  ○ Filesystem
  ○ Device Settings
  ○ Configuration
  ○ Device Permissions
19

Cheap to Fail but Expensive to Succeed!
20

The World Today


21

Benefits of Current Design

● Easy to set up and maintain
● Scaling is easy
● Cost effective
● Simple to handle the unexpected
22

Long-Term Issues with Current Design

● Some of the components are “single region” (EMR)
● As the system grows, the costs increase significantly (DynamoDB)
● Limits on partition key and sort key for DynamoDB; not designed for time-series data
23

Expensive as we Scale!
24

And Off We Go!


25

What is needed to scale to 1 Million Devices?

A highly scalable, fault-tolerant streaming framework that can process messages (for example, device telemetry messages), persist them into a scalable, fault-tolerant store, and support operational queries.

Key Requirements:
● Infrastructure should scale to support 100M devices
● Cost-effective ingestion, storage, and querying at this scale
● Low latency and high availability at scale (up/down)
● Failure handling (no loss of data)
● Ease of deployment and management
26

Why we considered Scylla

● A NoSQL database that implements almost all the features of Cassandra
● Written in C++14 instead of Java to increase performance
● Uses a shared-nothing approach and the Seastar framework to shard requests by core - http://seastar.io/
● Scylla’s close-to-the-hardware design significantly reduces the number of instances needed
● Scales out horizontally and is fault-tolerant like Apache Cassandra, but delivers 10X the throughput and consistent, low single-digit latencies
● Supports tunable job prioritization for extremely high read and write throughput (a problem Cassandra has not yet solved)
● Delivers very high throughput on instances with NVMe volumes (compared to EBS or non-NVMe volumes)
27

Why we considered Kafka and Confluent Platform

● Schema Registry
● Kafka Connect
● Confluent Control Center
● Ability to create new message flows using JSON
28

The approach
29

Final Environment Setup

● Kafka
  ○ 6 Kafka Brokers - r5.xlarge
  ○ 6 Zookeepers - m5.large
  ○ 3 Schema Registries - m5.large
  ○ 6 Kafka Connect Workers - c5.xlarge
  ○ 1 Control Center - m5.2xlarge
  ○ Split over 3 AZs
  ○ # partitions
    ■ Loaded Libraries - 120 partitions
    ■ Device Settings - 150 partitions
    ■ Other topics - 60 partitions
● ScyllaDB
  ○ 4 ScyllaDB instances - i3.4xlarge
  ○ Split over 2 AZs
● Load
  ○ 12 different device telemetry types emulated
  ○ Messages sent in Avro format
  ○ 14 instances generating load - c5d.4xlarge
30

Interesting Finding During This Journey

● The default partitioner (<murmur2 hash> mod <# partitions>) that ships with Kafka does not shard efficiently when the number of partitions grows (approximately 50% of the partitions were idle).
● We replaced it with a murmur3 hash fed through a consistent hashing algorithm (jump hash) to get an even distribution across all partitions, using Google’s Guava library. See “A Fast, Minimal Memory, Consistent Hash Algorithm” - https://arxiv.org/pdf/1406.2294.pdf
import java.util.List;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;
import com.google.common.hash.Hashing;

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        // No key: round-robin over the currently available partitions
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partitions are available; give a non-available partition.
            // Previously: return Utils.toPositive(nextValue) % numPartitions;
            return Hashing.consistentHash(Utils.toPositive(nextValue), numPartitions);
        }
    } else {
        // Keyed message: murmur3 hash of the key bytes, then jump consistent hash
        int hashInt = Hashing.murmur3_128().hashBytes(keyBytes).asInt();
        return Hashing.consistentHash(Utils.toPositive(hashInt), numPartitions);
    }
}
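The jump consistent hash from the cited paper is small enough to implement directly. A minimal, stdlib-only sketch (equivalent in spirit to Guava's `Hashing.consistentHash`, which the partitioner above uses; class and method names here are illustrative):

```java
public class JumpHash {
    // Jump consistent hash (Lamping & Veach, 2014): maps a 64-bit key to a
    // bucket in [0, numBuckets) with minimal key movement when buckets grow.
    public static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            // Linear congruential step drives a pseudo-random sequence from the key
            key = key * 2862933555777941757L + 1;
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    public static void main(String[] args) {
        // Same key always lands in the same bucket
        System.out.println(jumpConsistentHash(123456789L, 60) == jumpConsistentHash(123456789L, 60));
        // Growing from 60 to 61 partitions either keeps a key in place or moves it to the new one
        int before = jumpConsistentHash(42L, 60);
        int after = jumpConsistentHash(42L, 61);
        System.out.println(after == before || after == 60);
    }
}
```

The key property for partitioning: unlike `hash mod N`, growing the partition count only moves keys into the newly added partitions, and the distribution stays even at high partition counts.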
31

The Connector Template

{
  "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
  "errors.log.include.messages": "true",
  "connect.cassandra.key.space": "cds",
  "tasks.max": "10",
  "topics": "metron.client",
  "connect.cassandra.kcql": "UPSERT INTO client SELECT * from metron.client",
  "connect.cassandra.password": "cassandra",
  "connect.progress.enabled": "true",
  "connect.cassandra.username": "cassandra",
  "connect.cassandra.contact.points": "scylla-node1.staging.hollandaise.com,scylla-node2.staging.hollandaise.com,scylla-node3.staging.hollandaise.com",
  "connect.cassandra.port": "9042",
  "name": "client-flow",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://sr0.staging.hollandaise.com:8081"
}
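A template like this is typically registered with a Kafka Connect worker over its REST API (`PUT /connectors/{name}/config` creates or updates the connector). A sketch using Java 11's `HttpClient`; the worker URL and the tiny placeholder config are assumptions, not values from the deck:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DeployConnector {
    // Builds the PUT request that registers (or updates) a connector config.
    // The endpoint shape is standard Kafka Connect REST; workerUrl is caller-supplied.
    public static HttpRequest buildRequest(String workerUrl, String name, String configJson) {
        return HttpRequest.newBuilder()
                .uri(URI.create(workerUrl + "/connectors/" + name + "/config"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(configJson))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical worker host; in practice the body would be the full template above
        String config = "{\"topics\": \"metron.client\"}";
        HttpRequest req = buildRequest("http://connect-worker:8083", "client-flow", config);
        // To actually deploy (requires a live Connect worker):
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Using PUT on `/connectors/{name}/config` rather than POST on `/connectors` makes the deployment idempotent, which is convenient when re-running test setups.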
32

The Crash!

● Part of the testing included a destructive test
  ○ Loaded the Scylla cluster to 90%+ CPU usage
  ○ Executed a repair against the cluster
● The repair ran for over 8 hours
● One of the three nodes crashed
33

The Aftermath and Retest

● Worked closely with Scylla to identify the root cause of the node crash
● Identified a configuration issue that was introduced during a software upgrade
● CPU reservation parameters were lost, allowing all vCores to be allocated to DB operations, so none were reserved for maintenance tasks
● Corrected the CPU reservation configuration and repeated the crash test
● The repair still took over 8 hours to run, but it succeeded the second time
● Several other issues occurred during the proof-of-concept run; each time, we worked closely with Scylla to find the root cause and fix the issue
34

Test Results

● Message latency averaged in the milliseconds, unless the system was overtaxed.
● Repairs added load and were generally taxing on the system (CPU at 100%), but the cluster continued to function.
● Latency increased when Kafka Connect tasks failed (while repairs were running on ScyllaDB).
● The ScyllaDB cluster was running near capacity (CPU between 75-90%).
● Overall, the results were very positive.
35

The Takeaways


36

Costs

DynamoDB
  On Demand:     38,000,000 devices - $304,400.00/mo    100,000,000 devices - $801,052.63/mo
  Provisioned:   38,000,000 devices - $55,610.00/mo     100,000,000 devices - $146,342.11/mo

Scylla (+20% engineer cost for maintenance)
                 38,000,000 devices - $14,564.24/mo     100,000,000 devices - $61,198.94/mo

● This does not include:
  ○ Query load and associated costs
  ○ DynamoDB Streams and its equivalent on Scylla, and associated costs
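Dividing the table's figures gives per-device monthly cost, a quick sanity check on how each option scales (all numbers are taken from the table above):

```java
public class CostPerDevice {
    // Monthly cost divided by device count, from the figures in the table
    public static double perDevice(double monthlyCost, long devices) {
        return monthlyCost / devices;
    }

    public static void main(String[] args) {
        System.out.printf("DynamoDB on-demand @38M:  $%.6f/device/mo%n", perDevice(304_400.00, 38_000_000L));
        System.out.printf("DynamoDB on-demand @100M: $%.6f/device/mo%n", perDevice(801_052.63, 100_000_000L));
        System.out.printf("Scylla @38M:              $%.6f/device/mo%n", perDevice(14_564.24, 38_000_000L));
        System.out.printf("Scylla @100M:             $%.6f/device/mo%n", perDevice(61_198.94, 100_000_000L));
    }
}
```

DynamoDB on-demand works out to roughly $0.008 per device per month at both scales, i.e. the bill grows linearly with the fleet, while Scylla stays well under a tenth of a cent per device.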
37

Cost on each Platform


38

Q and A
39

Keep in touch

Scylla:      scylladb.com | info@scylladb.com | Scylla Summit 2019, Nov. 5-6, San Francisco, CA
Richard Ney: @rney_home | richard.ney@lookout.com
Confluent:   cnfl.io/contact | cnfl.io/blog | cnfl.io/download
40
