Bigdataposter PDF
[Graphic: data scientist skills. Horizontals: Domain Knowledge & Communication Skills; Distributed Computing & Big Data; Mathematics & Analytic Methodology; Computer Programming. Verticals: Statistical Analysis; Data Visualization; Natural Language Processing; Machine Learning; Data Mining]

Data scientists:
– They have mastered all foundational areas of data science (horizontals in the graphic)
  • Computer Programming
  • Mathematics & Analytic Methodology (Stats)
  • Big Data Technologies
  • Communications Skills
– They possess deep expertise in at least one specialty area (verticals in the graphic)
Popular data scientist tools:
– R, Python, Mahout, Pandas, Many Others…
– See http://oss4ds.com

[Graphic: data types]
– Raw Text, Images, Video, Audio
– Structured data about data, e.g. from/to
– Data including trends / activities in time
– (… points, tracks, shapes)
• And many others…

[Graphic: technology stack. Labels: Monitoring; Application; Operating Systems; Infrastructure (IaaS, PaaS); Computers, Networks. Products named: AWS, Azure, Rackspace, OpenStack, OpenShift, Linux, Red Hat, Cloudera, HDFS, Kafka, Zookeeper, Oozie, Puppet, JBoss, Xymon, etc.; vendor specific]
• Select key components of Hadoop Ecosystem:
  – HDFS (Storage), MapReduce (Distributed Processing), Accumulo (Secure data store, Indexing)

How do we extract knowledge from big data?
• By teaching computers to extract knowledge:
  – By agreeing upon and defining semantic concepts of knowledge in one or more “knowledge representations” (for example, a fixed ontology, auto-generated ontology, user-defined tags)
  – By building transforms to map semantic content from structured data into knowledge representations
  – By building classifiers to extract semantic content from unstructured data and to map that extracted content into knowledge representations
  – By building analytics to correlate / fuse / perform reasoning on extracted semantic content to generate even more semantic content, and to map that information into knowledge representations
• And by doing this on a large scale…
  – By dividing a problem into pieces and executing in parallel (e.g. MapReduce)
  – By building clever indexes of knowledge, so that you can search it quickly…
  – By using high performance computers (HPCs) or other fast electronics (e.g. FPGAs, ASICs, Optics)
  – A mixture of the above… (e.g. Netezza, YarcData, Next Generation Oracle)

What do we do with knowledge we extract?
• We can estimate and visualize parameters in data using statistics
  – We can describe data, explore correlations, discover patterns, predict outcomes, etc. through “observational studies”
  – We need to account/correct for Bias and Confounding, for example by introducing elements of chance!
  – We need to consider selection bias, measurement bias, analysis bias, error, confounding variables
• We can implement rules to trigger actions in response to discovered knowledge

How do we implement big data solutions?
Follow a process that considers:
– Team Skills
– Problem Definition
– Experimental Design
– Success/Evaluation Criteria
– Data, Curation & QA
– Solution Design
  • Infrastructure
  • Ingest & Storage
  • Analytics
  • Visualization
  • Security
  • Privacy & Ethics
  • Budget & Schedule
Not a linear process
– Try an agile approach…
[Graphic: process cycle. Labels: Problem Definition and Experimental Design; Collection, Storage and Retrieval; Data Curation and Quality Assurance; Analysis; Visualization; Evaluation]

What Questions should we ask about Databases?
• What type of data can we store?
  – Structured, unstructured, relational, graphs, entities…
  – Big Files (e.g. imagery)? Small files (e.g. text)?
• How is data ingested into the database?
  – Streaming? Batch?
• What does the database cost?
  – License costs? O&M costs? License restrictions?
• What hardware is required?
  – Commodity? Proprietary?
• How scalable is the database?
  – Gigabytes? Terabytes? Petabytes? Exabytes? Yottabytes?
• Is the database fault tolerant?
  – Does it need to be? What about COOP?
• Can we perform analytics using the database?
  – e.g. MapReduce?
• What is the latency for queries? Or for analytics?
  – e.g. milliseconds? days?
• Is it optimized for particular features?
  – Fast writes? Fast reads? Ease of use?
• How many users can the system support?
  – Scaling for data does not necessarily imply scaling for a large number of users
• Is the database secure?
  – Does it provide access controls? Has it been accredited? To what level?
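
The poster's suggestion of "dividing a problem into pieces and executing in parallel (e.g. MapReduce)" can be illustrated with a small word count. This is a minimal sketch of the map / shuffle / reduce pattern in plain Python, not Hadoop's API: the function names, the sample lines, and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one piece of the input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pieces):
    """Shuffle step: group the emitted values by key across all pieces."""
    groups = defaultdict(list)
    for pairs in mapped_pieces:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, pieces=2):
    # Divide the problem into pieces (every `pieces`-th line goes to one chunk).
    chunks = [lines[i::pieces] for i in range(pieces)]
    # Run the map step on each piece in parallel; the threads here are only a
    # stand-in for the worker nodes a real cluster would use.
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


# Hypothetical sample input, used only to exercise the sketch.
sample = ["big data big ideas", "data about data"]
counts = word_count(sample)
print(counts)
```

The same three steps apply when the pieces are file blocks spread over many machines; only the map and reduce functions are specific to the question being asked.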
What types of Visual techniques are there?
Type | Utility | Pros | Cons
Spreadsheets | Viewing tabular data | Simple/Common | Can’t see patterns
Common Charts | Viewing numeric data | See Patterns/Trends | Hard to pivot/explore
Graphs | Exploring networks | Powerful analysis | Complex / Intensive
Geospatial views | Viewing data in space | Intuitive maps | Graphics intensive
Temporal views | Viewing data in time | Find patterns/trends | Not all data temporal
Spatiotemporal | Both space & time | Powerful analysis | Uncommon, Intensive
3D Views | Viewing complex data | More immersive | Graphics Intensive
Spatiotemporal | Immersive visualization | Intuitive / Powerful | Specialized Hardware

How do we address privacy & ethics in big data?
Privacy. Be sure to comply with:
– The 4th Amendment to the Constitution
– Electronic Communications Privacy Act
– Foreign Intelligence Surveillance Act
– The Privacy Act
– Executive Order 12333
– USA PATRIOT Act
Ethics. Consider:
1. Respect for Persons / informed consent
2. Beneficence
3. Justice
4. Respect for Law and Public Interest

What questions... about predictive tools?
• How does this tool perform prediction?
  – answers should list algorithms used (refer to table at left for descriptions/pros/cons…)
• What types of data does the tool analyze?
  – e.g. structured, unstructured, hybrid (are these right for your mission?). Petabytes of data?
• Does the tool correct for, or enable one to correct for potential bias or confounding variables?
  – e.g. by introducing elements of chance or by counting everything
  – if not, you should be skeptical of the predictions a tool makes!
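
One way to read "introducing elements of chance" as a correction for bias and confounding is random assignment: if chance decides which units receive a treatment, a confounder such as age tends to be spread evenly across both groups. The sketch below assumes that reading. The unit data, the `randomize` helper, and the fixed seed are hypothetical and exist only to make the example repeatable.

```python
import random
from statistics import mean


def randomize(units, seed=0):
    """Randomly split units into two groups (treatment, control) of equal size."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical study units as (unit_id, age); age plays the role of a
# potential confounding variable.
units = [(unit_id, 20 + (unit_id * 7) % 40) for unit_id in range(200)]
treatment, control = randomize(units)

# Compare the confounder across groups; with random assignment the two
# means are expected to be close, though chance differences remain.
treatment_age = mean(age for _, age in treatment)
control_age = mean(age for _, age in control)
print(f"mean age, treatment: {treatment_age:.1f}, control: {control_age:.1f}")
```

Checking the balance of known confounders after assignment, as the last lines do, is a quick sanity test; randomization is what also protects against the confounders nobody thought to measure.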
What types of statistical algorithms are there?
Algorithm | Utility | Pros | Cons
Linear | Providing point estimates | High precision, easy | Not qualitative, high curation burden
Non-Linear | Processing complex systems | Supports more complex data, complex decisions | Limited inference, high supervision needed
Fuzzy Logic / Neural | Representing highly complex, qualitative systems | Complex inference, messy data | Lower precision, seed value bias
Probabilistic | Distribution, probability oriented | Complex dependencies, fuzzy decisions | Lower precision, no point estimates, seed value bias
Graph | Representation of data | Represent large sets, easy interaction | Limited inference, computationally challenging

How do we secure big data?
• By addressing the notorious nine:
  1. Data Breaches
  2. Data Loss
  3. Account Hijacking
  4. Insecure APIs
  5. Denial of Service
  6. Malicious Insiders
  7. Abuse and Nefarious Use
  8. Insufficient Due Diligence
  9. Shared Technology Issue
• By using big data to secure big data
  – Collect & analyze activity data, network data, audits, provenance, pedigree, lineage
• And by using:
  – Risk Management: ICD 503
  – Access controls, IDAM, biometrics, PKI, physical security, cell-level security, smart data, encryption
  – CND, Anti-Malware, anti-virus

Want to Learn More?
• George Mason University
  – GMU has both full semester graduate-level courses and a two-day certificate course in Big Data Practices
• Learn about Data Science and Big Data Tools at www.oss4ds.com
  – The instructors are part of a broader data science team that publishes open source resources to help you learn about Data Science and Big Data
• Explore integrated open source analytic platforms at www.lumify.io
  – Learn about open-source ingest, knowledge extraction, and link analysis from structured and unstructured big data
• Learn more about big data implementations at www.altamiracorp.com
A few Useful Contacts
*Please email Dan Law if you have comments or suggestions for improving this poster
Altamira: Dan Law, dan.law@altamiracorp.com
Cloudera: 866-843-7207
Koverse: info@koverse.com
Oracle: 800-633-0738
Amazon: http://aws.amazon.com/federal/
Centrifuge: info@centrifugesystems.com
Datastax: info@datastax.com
IBM: 800-333-6705
MarkLogic: info@marklogic.com
MongoDB: 866-237-8815
Splunk: 866-438-7758
Yarcdata: 925-264-4700
©2014 Altamira Technologies Corporation