Building a large scale SaaS app

Open Source, Storage and Scalability
Dan Hanley, CTO
http://www.magus.co.uk
14 March, 2008

1

Agenda
• Who are Magus?
• What do we do?
• Who do we do it for?
• How do we do it?
• SOA
• Scalability
• Storage
• F/OSS

2

The Magus proposition
• Leading provider of innovative web-content engineering solutions to global corporations
• Specialise in managed applications that help clients build value from their online assets and from the wider web
• Three main applications:
  − ActiveStandards
  − RemoteSearch
  − CrucialInformation

• Delivering solutions since 1995

Our managed applications
Delivering Software-as-a-service (ASP model)
• ActiveStandards: designed to help companies stay on-brand, on-line by tracking and managing corporate web standards compliance, worldwide
• RemoteSearch: a multi-site search engine, providing integrated search frameworks for enterprise websites
• CrucialInformation: a premium current awareness service delivering high-quality, strategic intelligence from the web and syndicated services

4

ActiveStandards

5

RemoteSearch

6

CrucialInformation

7

Social Networking

8

Our clients

9

Technically - where we were
• 1 product
• Web design business
• All home grown
• No appservers
• No failover
• No common infrastructure
• Scalability worries
• No version control
• Unclear methodology
10

Technically – where we are now
• 3 main applications
• Bespoke capability
• Common infrastructure
• Platform of services
• Fault tolerant
• Scalable
• Defined process & methodology

11

Approach
• Do a lot with a little – 35 people, punching above our weight
• Don't reinvent the wheel
• Extract commonality – keep it DRY

12

The components of the stack

• Trawl
• Harvest
• Index
• Search
• Analysis
• Monitor

• Routing
• Store
• Quartz
• ClientEngine
• Profile
• LinkChecker

13

Logical architecture

REST (not SOAP)

14

Trawl
• Responsible for managing the gathering of data in its raw form into the Store.
• Currently have Trawlers for:
  − HTTP
  − FTP (several flavors)
  − RSS, Atom etc
  − SMTP
  − Google
  − Technorati
  − Moreover
  − FT (several flavors)
15

Trawler service
Pluggable architecture based on JMX MBean service

16

Harvest

• Responsible for extracting explicit data from Links and storing the fielded data in the database, and the non-fielded data in the Store.

17

Harvest service
Pluggable architecture based on JMX MBean service

18

Index

• Responsible for building, purging and maintaining indices.

19

Search

• Responsible for searching indices and delivering results.

20

Analysis

• Responsible for deriving scores for information implicit in the page
  − Sentiment
  − Readability
  − Language detection etc

21

Monitor

− Badly named, should be called “Classifier”
− Responsible for creating filings between Links and Categories.
− A Link can be a bookmark, news item, blog article etc.
− A Category can be a User's Bookmarks, a News Topic, an AST Guideline etc.

22

Classifier (monitor) service
Pluggable architecture based on JMX MBean service

23

LinkChecker

• Responsible for checking the life of links and removing them correctly from the system when they have expired.

24

Routing

• Manages the workflow of jobs through the stack
• Has the capability to dynamically load-balance workloads.
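
The slides do not say how Routing balances work, so the following is only a minimal sketch of one common approach: send each job to the node with the fewest pending jobs. The class name, method names and node names are all hypothetical and are not taken from the Magus stack.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical least-loaded router; an assumption, not Magus's actual Routing service. */
final class LeastLoadedRouterSketch {
    // Pending job count per node, kept in registration order so ties are predictable.
    private final Map<String, Integer> pending = new LinkedHashMap<>();

    /** Registers a node with no pending jobs (no-op if already registered). */
    void addNode(String node) {
        pending.putIfAbsent(node, 0);
    }

    /** Assigns one job to the node with the fewest pending jobs and returns its name. */
    String dispatch() {
        if (pending.isEmpty()) {
            throw new IllegalStateException("no nodes registered");
        }
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() < bestLoad) { // strict '<' means the first registered node wins ties
                best = e.getKey();
                bestLoad = e.getValue();
            }
        }
        pending.put(best, bestLoad + 1);
        return best;
    }

    /** Records that a node has finished one job. */
    void complete(String node) {
        pending.computeIfPresent(node, (k, v) -> Math.max(0, v - 1));
    }

    public static void main(String[] args) {
        LeastLoadedRouterSketch router = new LeastLoadedRouterSketch();
        router.addNode("harvest-1"); // hypothetical node names
        router.addNode("harvest-2");
        System.out.println(router.dispatch());
        System.out.println(router.dispatch());
        router.complete("harvest-2");
        System.out.println(router.dispatch());
    }
}
```

A real router would also need thread safety and some way to notice dead nodes; both are left out here to keep the idea visible.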

25

Content stores
• We needed a multiple-terabyte (currently 24 TB), distributed, fail-safe filesystem
• NFS was crumbling under load
• ZFS was vapourware
• Lustre was too complex
• We built our own!
• Magus Contentstores, responsible for holding both the raw and processed non-fielded content of links which have been trawled and harvested

26

Content stores - configuration
<mbean code="uk.co.magus.store.service.StoreService"
       name="magus.service.store:service=StoreServiceLocalCalls">
  <attribute name="JndiName">magus/services/StoreServiceLocalCalls</attribute>
  <attribute name="Config">
    <TryEachStripeStore>
      <List>
        <MirrorStore>
          <List>
            <RemoteStore>nas:1299;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
            <RemoteStore>m4:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
          </List>
        </MirrorStore>
        <MirrorStore>
          <List>
            <RemoteStore>nas:1199;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
            <RemoteStore>m5:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
          </List>
        </MirrorStore>
      </List>
    </TryEachStripeStore>
  </attribute>
  <depends>jboss:service=Naming</depends>
</mbean>
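
The configuration nests mirrored pairs of remote stores inside a stripe set. The slides do not define the read and write semantics, so this sketch assumes them: a mirror writes to every replica and reads from the first one that answers, and the stripe set picks a stripe by key hash for writes and tries each stripe in turn for reads. The `Store` interface and all class bodies here are hypothetical stand-ins, not the `uk.co.magus.store` classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of composing stores; the semantics below are assumptions. */
final class StoreCompositionSketch {

    /** Minimal key/value contract (assumed, not the real Store interface). */
    interface Store {
        void put(String key, byte[] value);
        Optional<byte[]> get(String key);
    }

    /** Stands in for a RemoteStore: an in-memory map that can be switched "down". */
    static final class InMemoryStore implements Store {
        private final Map<String, byte[]> data = new HashMap<>();
        private boolean up = true;

        void setUp(boolean up) { this.up = up; }

        public void put(String key, byte[] value) {
            if (!up) throw new IllegalStateException("node down");
            data.put(key, value);
        }

        public Optional<byte[]> get(String key) {
            if (!up) throw new IllegalStateException("node down");
            return Optional.ofNullable(data.get(key));
        }
    }

    /** Assumed mirror behaviour: write to every replica, read from the first that answers. */
    static final class MirrorStore implements Store {
        private final List<Store> replicas;

        MirrorStore(List<? extends Store> replicas) { this.replicas = new ArrayList<>(replicas); }

        public void put(String key, byte[] value) {
            int written = 0;
            for (Store s : replicas) {
                try { s.put(key, value); written++; } catch (RuntimeException e) { /* skip failed replica */ }
            }
            if (written == 0) throw new IllegalStateException("no replica accepted the write");
        }

        public Optional<byte[]> get(String key) {
            for (Store s : replicas) {
                try {
                    Optional<byte[]> v = s.get(key);
                    if (v.isPresent()) return v;
                } catch (RuntimeException e) { /* try the next replica */ }
            }
            return Optional.empty();
        }
    }

    /** Assumed stripe behaviour: hash the key to pick a stripe on write, try each stripe on read. */
    static final class TryEachStripeStore implements Store {
        private final List<Store> stripes;

        TryEachStripeStore(List<? extends Store> stripes) { this.stripes = new ArrayList<>(stripes); }

        public void put(String key, byte[] value) {
            stripes.get(Math.floorMod(key.hashCode(), stripes.size())).put(key, value);
        }

        public Optional<byte[]> get(String key) {
            for (Store s : stripes) {
                Optional<byte[]> v = s.get(key);
                if (v.isPresent()) return v;
            }
            return Optional.empty();
        }
    }

    /** Builds two mirrored stripes, fails one replica in each, and checks the value is still readable. */
    static boolean survivesSingleReplicaFailure() {
        InMemoryStore a1 = new InMemoryStore(), a2 = new InMemoryStore();
        InMemoryStore b1 = new InMemoryStore(), b2 = new InMemoryStore();
        Store store = new TryEachStripeStore(Arrays.asList(
                new MirrorStore(Arrays.asList(a1, a2)),
                new MirrorStore(Arrays.asList(b1, b2))));
        store.put("link-42", "raw page".getBytes(StandardCharsets.UTF_8));
        a1.setUp(false);
        b1.setUp(false);
        return store.get("link-42").isPresent();
    }

    public static void main(String[] args) {
        System.out.println("readable after replica failure: " + survivesSingleReplicaFailure());
    }
}
```

Because every layer implements the same interface, stripes, mirrors and remote stores can be nested in any order, which is what the XML nesting suggests.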

27

28

Store Interfaces

29

Store JMX Beans

30

Contentstore - engines
Can use many types of engine on a node
Currently supports:
  − MySQL
  − SleepyCat
  − Filesystem

These can be decorated to enhance functionality

Content Store Classes

32

Quartz

• Responsible for firing messages on time.
• The “heartbeat” of the stack.
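
The slides do not show how this service is implemented. As a rough illustration of a "heartbeat" that fires messages on a fixed schedule, here is a sketch using only the JDK's `ScheduledExecutorService`; the queue, the message text and the method name are assumptions made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical heartbeat sketch; not the stack's actual scheduling service. */
final class HeartbeatSketch {

    /**
     * Fires a message every periodMillis until at least 'count' have been sent
     * (or 5 seconds pass), then stops the timer and returns how many were queued.
     */
    static int fire(int count, long periodMillis) {
        BlockingQueue<String> outbox = new LinkedBlockingQueue<>();
        CountDownLatch fired = new CountDownLatch(count);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            outbox.add("trawl-due"); // hypothetical message; a real stack would post to a job queue
            fired.countDown();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            fired.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return outbox.size();
    }

    public static void main(String[] args) {
        System.out.println("messages fired: " + fire(3, 100));
    }
}
```

The returned count can be slightly higher than requested, because one more tick may fire before the timer is shut down.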

33

Client Engine

• Responsible for stack-based processing for Client Applications.
• Keeps “heavy lifting” out of the Web Tier.
• Coordinates Client Application requests across multiple stack services.

34

Management Application
• Manage taxonomy
• Manage rules
• Manage scheduling
• Focus on managing the business
• Leave service management to JMX or web consoles
• Swing

35

Management App


Profile
• An internal service used to collect metrics on system-wide performance

40

41

Infrastructure architecture

42

Methodology
• Agility – sprints
• Issue tracking – Jira
• Regular, scheduled deployments
• Consolidated build & version control

43

Deployment
[Diagram: deployment workflow. Components shown: Developer, Developer Local Box, Subversion (code repository), Bamboo, Dev Cluster, Release Note, Product Owner / Dev TL, Granite TL, Test Cluster, Jboss ON, Production. Labelled steps:]
1. Check out
2. Code / Local Test
3. Check In
4. Auto Check out
5. Build / Unit Tests / Metrics
6 & 12. Notify
7. Publish results
8. Deploy
9 & 11. FIT Tests
10. Deploy Dependencies
13. Prepare Release Note
14. Get Release Note
15. Reject Release
16. Get Application Artifacts
17. Manage Test & Production Environments
18. Deploy Applications, Stress Test
19. Deploy Applications

44

Throughput
• 11,000 sources in system
• ~16,000,000 pages rolling store
• ~200,000 new pages per day
• Average < 2 minutes from page detection to fully classified and indexed.
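
A little arithmetic puts these figures in context. Assuming pages arrive at a steady rate and the rolling store is at steady state (neither is stated on the slide), the numbers imply a window of roughly 80 days, about 2.3 new pages per second, and about 18 new pages per source per day. The sketch below simply reproduces that calculation; the class and method names are made up for the example.

```java
import java.util.Locale;

/** Back-of-envelope figures derived from the slide; assumes a steady arrival rate. */
final class ThroughputSketch {
    static final long SOURCES = 11_000;
    static final long ROLLING_STORE_PAGES = 16_000_000;
    static final long NEW_PAGES_PER_DAY = 200_000;
    static final double SECONDS_PER_DAY = 86_400.0;

    /** Days of content held if the store is full and pages arrive at the daily rate. */
    static long retentionDays() {
        return ROLLING_STORE_PAGES / NEW_PAGES_PER_DAY;
    }

    /** Average ingest rate over a 24-hour day. */
    static double pagesPerSecond() {
        return NEW_PAGES_PER_DAY / SECONDS_PER_DAY;
    }

    /** Average number of new pages each source contributes per day. */
    static double pagesPerSourcePerDay() {
        return (double) NEW_PAGES_PER_DAY / SOURCES;
    }

    public static void main(String[] args) {
        System.out.println("rolling window (days): " + retentionDays());
        System.out.println(String.format(Locale.ROOT, "pages per second: %.2f", pagesPerSecond()));
        System.out.println(String.format(Locale.ROOT, "pages per source per day: %.2f", pagesPerSourcePerDay()));
    }
}
```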

45

Cost comparisons
• Apples and oranges?

Proprietary
Product                   Per CPU       CPUs   Total
Oracle                    20,000.00     10     $200,000
Weblogic AS               10,000.00     38     $380,000
MS Windows Server         3,919.00      48     $188,112
Visual Team Studio        1,000.00      12     $12,000
ClearCase                 4,125.00      1      $4,125
Jira                      2,000.00      1      $2,000
Autonomy IDOL bundl       75,000.00     2      $150,000
IBM Intelligent Datami    132,000.00    1      $132,000
Verity K2                 50,000.00     2      $100,000
Total                                          $1,068,237 (£580,531.26, €849,629.77)

Licence Free
Product                   Per CPU       CPUs   Total
MySql                     $0.00         10     $0.00
Jboss AS                  $0.00         38     $0.00
Redhat/Apa                $0.00         48     $0.00
Eclipse                   $0.00         12     $0.00
Subversion                $0.00         1      $0.00
Trac                      $0.00         1      $0.00
Carrot2                   $0.00         12     $0.00
LingPipe                  $0.00         12     $0.00
Lucene                    $0.00         8      $0.00
UIMA                      $0.00         12     $0.00
Total                                          $0.00

46

Thank you

Questions?

47
