
1

Scaling Security on 100s of Millions of Mobile Devices Using Kafka and Scylla
Mobile cybersecurity leader Lookout talks through their data ingestion journey
2

Speakers

Richard Ney, Principal Engineer
Over 30 years of experience, predominantly with event pipelines and data retrieval. Currently a platform architect and principal developer at Lookout Inc on the Ingestion Pipeline and Query Services team, working on the next scale of data ingestion.

Eyal Gutkind, VP of Solutions
A solution architect for Scylla. Prior to Scylla, Eyal held product management roles at Mirantis and DataStax, and spent 12 years with Mellanox Technologies in various engineering management and product marketing roles.

Jeff Bean, Partner Solutions Architect
An experienced software engineer and technical evangelist with many years in the open source software ecosystem. He leads the Confluent Verified Integrations Program, which helps partners build and verify integrations with Confluent Platform.
3

Agenda

● Confluent Platform
● ScyllaDB
● Lookout’s Journey
  ○ About Lookout
  ○ Current data ingestion design and issues
  ○ Scaling to 1 Million devices
  ○ Technology decisions and cost analysis
  ○ Testing results
● Takeaways
4

Jeff Bean
Partner Solutions Architect,
Confluent
[Diagram]
Without Confluent: a Universal Event Pipeline connecting Data Stores, Logs, 3rd Party Apps, and Custom Apps/Microservices.
With Confluent (STREAMS | CONNECT | CLIENTS): Contextual Event-Driven Applications, including Real-Time Inventory, Real-Time Fraud Detection, Real-Time Customer 360, Machine Learning Models, and Real-Time Data Transformation.
Confluent Platform
The Event Streaming Platform Built by the Original Creators of Apache Kafka®

● Support, Services, Training & Partners
● Operations and Security (Mission-critical Reliability): Security plugins | Role-Based Access Control | Control Center | Replicator | Auto Data Balancer | Operator
● Development & Stream Processing (Complete Event Streaming Platform): Clients | REST Proxy | Connectors | KSQL | MQTT Proxy | Schema Registry
● Apache Kafka: Connect | Continuous Commit Log | Streams
● Self-Managed Software (Datacenter, Public Cloud) or Fully-Managed Service (Confluent Cloud): Freedom of Choice


9

Eyal Gutkind
VP of Solutions,
Scylla
10

About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ New: Scylla Cloud, DBaaS
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
11

Scylla & Confluent


12

Scylla Benefits

● Lower Node Count: millions of OPS per node reduces the number of nodes your application requires
● Predictable, Low Latencies: consistent single-digit millisecond p99 latencies
● Less Complexity: smaller footprint, self-optimizing, works out-of-the-box
● Cassandra Compatibility: drop-in replacement, compatible with the full C* ecosystem (drivers, etc.)
13

Version          Apache Cassandra               Scylla Enterprise
Data Layer       600 nodes i3.2xlarge           60 nodes i3.2xlarge
Caching Layer    60 nodes Varnish m4.4xlarge    No caching
Annual Spend*    $3.7 million/yr                $328k/yr

*per publicly posted AWS hourly fees

+ Simplified deployments
+ Reduced infrastructure
+ Performance improvements
14

Scylla Alternator
+ DynamoDB-compatible API for Scylla
+ Scale-out system with complete observability
+ Deployment flexibility
  + On prem, Multi-Cloud, Hybrid
  + Open Source or Scylla Cloud
+ No vendor lock-in
+ Better performance and less expensive

scylladb.com/alternator
15

Richard Ney
Principal Engineer,
Lookout
16

What does Lookout do?

● Founded in 2004 when the original founders discovered a Bluetooth vulnerability in Nokia phones
● Demonstrated the need for mobile security at the 2005 Academy Awards by downloading information from celebrity phones 1.5 miles from the venue
● Provides security scanning for mobile devices for the Enterprise and Consumer markets
● Enterprise customers can apply corporate policies to devices registered in their enterprise
● To apply these policies, Lookout ingests data about device configuration and the applications installed on devices
17

Starting Point
18

What is the Common Device Service?

● Functions as a proxy for all mobile devices in the Lookout fleet
● Device telemetry is sent at various intervals for these categories:
  ○ Binary Artifacts
  ○ Software
  ○ Risky Configuration
  ○ Hardware
  ○ Personal Content Protection (safe browsing)
  ○ Client
  ○ Filesystem
  ○ Device Settings
  ○ Configuration
  ○ Device Permissions
19

Cheap to Fail but Expensive to Succeed!
20

The World Today


21

Benefits of Current Design

● Easy to set up and maintain
● Scaling is easy
● Cost effective
● Simple to handle the unexpected
22

Long-Term Issues with Current Design

● Some of the components are “single region” (EMR)
● As the system grows, the costs increase significantly (DynamoDB)
● Limits on partition key and sort key for DynamoDB; not designed for time-series data
23

Expensive as we Scale!
24

And Off We Go!


25

What is needed to scale to 1 Million Devices?

A highly scalable, fault-tolerant streaming framework that can process messages (for example, device telemetry messages), persist them into a scalable, fault-tolerant store, and support operational queries.

Key Requirements:
● Infrastructure should scale to support 100M devices
● Cost-effective ingestion, storage, and querying at this scale
● Low latency and high availability at scale (up/down)
● Failure handling (no loss of data)
● Ease of deployment and management
26

Why we considered Scylla

● A NoSQL database that implements almost all the features of Cassandra
● Written in C++14 instead of Java to increase performance
● Uses a shared-nothing approach and the Seastar framework to shard requests by core - http://seastar.io/
● Scylla’s close-to-the-hardware design significantly reduces the number of instances needed
● Scales out horizontally and is fault-tolerant like Apache Cassandra, but delivers 10X the throughput and consistent, low single-digit latencies
● Supports tunable job prioritization for extremely high read and write throughput (a problem Cassandra has not yet solved)
● Delivers very high throughput on instances with NVMe volumes (compared to EBS or non-NVMe volumes)
27

Why we considered Kafka and Confluent Platform

● Schema Registry
● Kafka Connect
● Confluent Control Center
● Ability to create new message flows using JSON
28

The approach
29

Final Environment Setup

● Kafka
  ○ 6 Kafka Brokers - r5.xlarge
  ○ 6 Zookeepers - m5.large
  ○ 3 Schema Registries - m5.large
  ○ 6 Kafka Connect Workers - c5.xlarge
  ○ 1 Control Center - m5.2xlarge
  ○ Split over 3 AZs
  ○ # partitions
    ■ Loaded Libraries - 120 partitions
    ■ Device Settings - 150 partitions
    ■ Other topics - 60 partitions
● ScyllaDB
  ○ 4 ScyllaDB instances - i3.4xlarge
  ○ Split over 2 AZs
● Load
  ○ 12 different device telemetry types emulated
  ○ Messages sent in Avro format
  ○ 14 instances generating load - c5d.4xlarge
30

Interesting Finding During This Journey

● The default partitioner (<murmur2 hash> mod <# partitions>) that ships with Kafka does not shard efficiently when the number of partitions grows (approximately 50% of the partitions were idle).
● We replaced it with a murmur3 hash fed through a consistent hashing algorithm (jump hash) to get an even distribution across all partitions, using Google’s Guava library. See “A Fast, Minimal Memory, Consistent Hash Algorithm” - https://arxiv.org/pdf/1406.2294.pdf
import java.util.List;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;
import com.google.common.hash.Hashing;

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        // No key: round-robin over the currently available partitions
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partitions are available; give a non-available partition.
            // Previously: return Utils.toPositive(nextValue) % numPartitions;
            return Hashing.consistentHash(Utils.toPositive(nextValue), numPartitions);
        }
    } else {
        // Keyed message: murmur3 hash of the key bytes, then jump consistent hash
        int hashInt = Hashing.murmur3_128().hashBytes(keyBytes).asInt();
        return Hashing.consistentHash(Utils.toPositive(hashInt), numPartitions);
    }
}
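The jump consistent hash from the cited paper is small enough to implement directly. A minimal, stdlib-only sketch (equivalent in spirit to Guava's `Hashing.consistentHash`, which the partitioner above uses; class and method names here are illustrative):

```java
public class JumpHash {
    // Jump consistent hash (Lamping & Veach, 2014): maps a 64-bit key to a
    // bucket in [0, numBuckets) with minimal key movement when buckets grow.
    public static int jumpConsistentHash(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            // Linear congruential step drives a pseudo-random sequence from the key
            key = key * 2862933555777941757L + 1;
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    public static void main(String[] args) {
        // Same key always lands in the same bucket
        System.out.println(jumpConsistentHash(123456789L, 60) == jumpConsistentHash(123456789L, 60));
        // Growing from 60 to 61 partitions either keeps a key in place or moves it to the new one
        int before = jumpConsistentHash(42L, 60);
        int after = jumpConsistentHash(42L, 61);
        System.out.println(after == before || after == 60);
    }
}
```

The key property for partitioning: unlike `hash mod N`, growing the partition count only moves keys into the newly added partitions, and the distribution stays even at high partition counts.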
31

The Connector Template

{
  "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
  "errors.log.include.messages": "true",
  "connect.cassandra.key.space": "cds",
  "tasks.max": "10",
  "topics": "metron.client",
  "connect.cassandra.kcql": "UPSERT INTO client SELECT * from metron.client",
  "connect.cassandra.password": "cassandra",
  "connect.progress.enabled": "true",
  "connect.cassandra.username": "cassandra",
  "connect.cassandra.contact.points": "scylla-node1.staging.hollandaise.com,scylla-node2.staging.hollandaise.com,scylla-node3.staging.hollandaise.com",
  "connect.cassandra.port": "9042",
  "name": "client-flow",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://sr0.staging.hollandaise.com:8081"
}
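A template like this is typically registered with a Kafka Connect worker over its REST API (`PUT /connectors/{name}/config` creates or updates the connector). A sketch using Java 11's `HttpClient`; the worker URL and the tiny placeholder config are assumptions, not values from the deck:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DeployConnector {
    // Builds the PUT request that registers (or updates) a connector config.
    // The endpoint shape is standard Kafka Connect REST; workerUrl is caller-supplied.
    public static HttpRequest buildRequest(String workerUrl, String name, String configJson) {
        return HttpRequest.newBuilder()
                .uri(URI.create(workerUrl + "/connectors/" + name + "/config"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(configJson))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical worker host; in practice the body would be the full template above
        String config = "{\"topics\": \"metron.client\"}";
        HttpRequest req = buildRequest("http://connect-worker:8083", "client-flow", config);
        // To actually deploy (requires a live Connect worker):
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Using PUT on `/connectors/{name}/config` rather than POST on `/connectors` makes the deployment idempotent, which is convenient when re-running test setups.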
32

The Crash!

● Part of the testing included a destructive test
  ○ Loaded the Scylla cluster to 90%+ CPU usage
  ○ Executed a repair against the cluster
● The repair ran for over 8 hours
● One of the three nodes crashed
33

The Aftermath and Retest

● Worked closely with Scylla to identify the root cause of the node crash
● Identified a configuration issue that was introduced during a software upgrade
● CPU reservation parameters were lost, allowing all vCores to be allocated to DB operations, so none were reserved for maintenance tasks
● Corrected the CPU reservation configuration and repeated the crash test
● The repair still took over 8 hours to run, but it succeeded the second time
● Several other issues occurred during the proof-of-concept run; each time, we worked closely with Scylla to find the root cause and fix the issue
34

Test Results

● Message latency averaged in the milliseconds, unless the system was overtaxed.
● Repairs added load and were generally taxing on the system (CPU at 100%), but the cluster continued to function.
● Latency increased when Kafka Connect tasks failed (while repairs were running on ScyllaDB).
● The ScyllaDB cluster was running near capacity (CPU between 75-90%).
● Overall, the results were very positive.
35

The Takeaways


36

Costs

DynamoDB
  On Demand:     38,000,000 devices - $304,400.00/mo    100,000,000 devices - $801,052.63/mo
  Provisioned:   38,000,000 devices - $55,610.00/mo     100,000,000 devices - $146,342.11/mo

Scylla (+20% engineer cost for maintenance)
                 38,000,000 devices - $14,564.24/mo     100,000,000 devices - $61,198.94/mo

● This does not include:
  ○ Query load and associated costs
  ○ DynamoDB Streams and its equivalent on Scylla, and associated costs
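Dividing the table's figures gives per-device monthly cost, a quick sanity check on how each option scales (all numbers are taken from the table above):

```java
public class CostPerDevice {
    // Monthly cost divided by device count, from the figures in the table
    public static double perDevice(double monthlyCost, long devices) {
        return monthlyCost / devices;
    }

    public static void main(String[] args) {
        System.out.printf("DynamoDB on-demand @38M:  $%.6f/device/mo%n", perDevice(304_400.00, 38_000_000L));
        System.out.printf("DynamoDB on-demand @100M: $%.6f/device/mo%n", perDevice(801_052.63, 100_000_000L));
        System.out.printf("Scylla @38M:              $%.6f/device/mo%n", perDevice(14_564.24, 38_000_000L));
        System.out.printf("Scylla @100M:             $%.6f/device/mo%n", perDevice(61_198.94, 100_000_000L));
    }
}
```

DynamoDB on-demand works out to roughly $0.008 per device per month at both scales, i.e. the bill grows linearly with the fleet, while Scylla stays well under a tenth of a cent per device.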
37

Cost on each Platform


38

Q and A
39

Keep in touch

Scylla:      scylladb.com | info@scylladb.com | Scylla Summit 2019, Nov. 5-6, San Francisco, CA
Richard Ney: @rney_home | richard.ney@lookout.com
Confluent:   cnfl.io/contact | cnfl.io/blog | cnfl.io/download
40
