Foundation of Data Science Unit 1 - 1

Uploaded by

pavithrpavithr19

FOUNDATIONS OF DATA SCIENCE

Syllabus

UNIT I: INTRODUCTION
Data Science: Benefits and uses - facets of data - Data Science Process: Overview - Defining research goals - Retrieving data - Data preparation - Exploratory Data analysis - build the model - presenting findings and building applications - Data Mining - Data Warehousing - Basic Statistical descriptions of Data

UNIT II: DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores

UNIT III: DESCRIBING RELATIONSHIPS
Correlation - Scatter plots - correlation coefficient for quantitative data - computational formula for correlation coefficient - Regression - regression line - least squares regression line - Standard error of estimate - interpretation of r² - multiple regression equations - regression towards the mean

UNIT IV: PYTHON LIBRARIES FOR DATA WRANGLING
Basics of Numpy arrays - aggregations - computations on arrays - comparisons, masks, boolean logic - fancy indexing - structured arrays - Data manipulation with Pandas - data indexing and selection - operating on data - missing data - Hierarchical indexing - combining datasets - aggregation and grouping - pivot tables

UNIT V: DATA VISUALIZATION
Importing Matplotlib - Line plots - Scatter plots - visualizing errors - density and contour plots - Histograms - legends - colors - subplots - text and annotation - customization - three dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn

Contents

Unit - 1 : Introduction
1.1 Data Science
  1.1.1 Data Science Components
  1.1.2 Benefits and uses of data science
1.2 Facets of Data
1.3 Data Science Process: Overview
1.4 Defining Research Goals
1.5 Retrieving Data
1.6 Data Preparation: Cleansing, Integrating, and Transforming Data
  1.6.1 Cleansing Data
  1.6.2 Combining Data from different Data Sources
  1.6.3 Transforming Data
1.7 Exploratory Data Analysis
1.8 Build the Model
  1.8.1 Model and Variable Selection
1.9 Presenting Findings and Building Applications on top of them
1.10 Data Mining and Data Warehousing
  1.10.1 Data warehousing
  1.10.2 Data Mining
Basic Statistical Descriptions of Data
Part A
Part B

Unit - 2 : Describing Data
2.1 Types of Data
  2.1.1 Qualitative or Categorical data
  2.1.2 Quantitative Data
2.2 Types of Variables
  2.2.1 Discrete and Continuous Variables
  2.2.2 Experimen...
  2.2.3 Confounding variable (Confounders or Confounding Factor)
2.3 Describing Data with Tables and Graphs
  2.3.1 Tables
  2.3.2 Graphs
2.4 Describing Data with Averages
  2.4.1 Mode
  2.4.2 Median
  2.4.3 Mean
  2.4.4 Suitable measures of central tendency
  2.4.5 Averages for Qualitative and Ranked Data
2.5 Describing Variability
  Importance of Variability
  Standard Deviation
  Sum of Squares
  Degrees of Freedom (DF)
  Interquartile Range (IQR)
  Measures of Variability for Qualitative and Ranked Data
2.6 Normal Distributions and Standard (z) Scores
  2.6.1 The Normal Curve
  2.6.2 z Scores
  2.6.3 Standard Normal Curve
  2.6.4 Solving Normal Curve Problems
  2.6.5 Finding Proportions
  2.6.6 Finding Scores
  2.6.7 z Scores for non-normal distribution
Part A
Part B & C

Unit - 3 : Describing Relationships
3.1 Correlation
3.2 Scatter Plots
3.3 Correlation Coefficient for Quantitative Data
3.4 Computational Formula for Correlation Coefficient
3.5 Regression
3.6 Regression Line
3.7 Least Squares Regression Line
3.8 Standard Error of Estimate (SEE)
3.9 Interpretation of r²
3.10 Multiple Regression Equations
3.11 Regression towards the Mean
Part A
Part B

Unit - 4 : Python Libraries for Data Wrangling
4.1 Introduction to Numpy Arrays
  4.1.1 Dynamic Data Types in Python
  4.1.2 Python Lists
  4.1.3 Fixed-Type Arrays in Python
  4.1.4 Creating NumPy Arrays
4.2 The Basics of NumPy Arrays
4.2 Aggregations
4.3 Computation on NumPy Arrays: Universal Functions
  4.3.1 Characteristics of ufunc
  4.3.2 Computations on arrays: Broadcasting
4.4 Comparisons, Masks, Boolean logic
  4.4.1 Boolean Arrays
  4.4.2 Boolean Arrays as Masks
4.5 Fancy Indexing
  4.5.1 Combined Indexing
  4.5.2 Modifying Values with Fancy Indexing
  4.5.3 Sorting Arrays
4.6 Structured Arrays
4.7 Data Manipulation with Pandas
  4.7.1 Pandas Objects
  4.7.2 Create a Pandas Series from a list
  4.7.3 Pandas DataFrame Object
  4.7.4 Pandas Index Object
4.8 Data Indexing and Selection
  4.8.1 Data Selection in Series
  4.8.2 Indexing & Description
  4.8.3 Data Selection in DataFrame
4.9 Operating on Data
  4.9.1 Ufuncs: Index Preservation
  4.9.2 Ufuncs: Index Alignment
  4.9.3 Fill_value
4.10 Missing Data
  4.10.1 Missing Data in Pandas
4.11 Hierarchical Indexing
  4.11.1 MultiIndexed Series
  4.11.2 MultiIndex Creation using Data frames
  4.11.3 Explicit MultiIndex constructors
  4.11.4 MultiIndex level names
  4.11.5 Sorted and unsorted indices
  4.11.6 Rearranging Multi-Indices
4.12 Combining Datasets
  4.12.1 Concatenating objects
  4.12.2 The append() method
  4.12.3 Combining Datasets: Merge and Join
4.13 Aggregation and Grouping
  4.13.1 Grouping in Pandas
  4.13.2 Transformation
  4.13.3 The apply() method
  4.13.4 Split key
4.14 Pivot Tables
  4.14.1 Multi-level pivot tables

Unit - 5 : Data Visualization
5.1 Importing Matplotlib
  5.1.1 Plotting from a script
  5.1.2 Plotting from an IPython shell
  5.1.3 Plotting from an IPython notebook
  5.1.4 Saving Figures to File
  5.1.5 Two Interfaces of Matplotlib
5.2 Line Plots
  5.2.1 Setting Line Colours and Styles
  5.2.2 Setting the Axes Limits of plots
  5.2.3 Labeling Plots
5.3 Scatter Plots
  5.3.1 Scatter Plots with [Link]
  5.3.2 Scatter Plots with [Link]
5.4 Visualizing Errors
  5.4.1 Matplotlib chart error bars in x values
  5.4.2 Matplotlib chart error bars in y values
  5.4.3 Continuous Errors
5.6 Density and Contour Plots
5.7 Histograms
  5.7.1 Two-Dimensional Histograms and Binnings
  5.7.2 Kernel density estimation
5.8 Legends
  5.8.1 Choosing Elements for the Legend
  5.8.2 Legend for Size of Points
5.9 Colors
  5.9.1 Choosing the colormap
5.10 Subplots
5.11 Text and Annotation
  5.11.1 Transforms and Text Position
  5.11.2 Arrows and Annotation
5.12 Customization
  5.12.1 Customizing Ticks
  5.12.2 Customizing Matplotlib
  5.12.3 Using style sheets
5.13 Three Dimensional Plotting
  5.13.1 Three-Dimensional Contour Plots
  5.13.2 Wireframes and Surface Plots
5.14 Geographic Data with Basemap
  5.14.1 Map Projections
  5.14.2 Drawing a Map Background
  5.14.3 Plotting Data on Maps
5.15 Visualization with Seaborn
  5.15.1 Exploring Seaborn Plots
  5.15.2 Histograms, KDE, and densities
  5.15.3 Pair plots
  Faceted histograms
  Factor plots
  Joint distributions
  Bar plots
Part B
Model Question Paper
Lab Program

INTRODUCTION

Data Science: Benefits and uses - facets of data - Data Science Process: Overview - Defining research goals - Retrieving data - Data Preparation - Exploratory Data analysis - build the model - presenting findings and building applications - Data Mining - Data Warehousing - Basic Statistical descriptions of Data

1.1 DATA SCIENCE

Data Science is the area of study that involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes. It is useful for discovering hidden patterns in voluminous raw data. The term Data Science has emerged because of the evolution of mathematical statistics, data analysis, and big data.

Data science is an extension of statistics that deals with the massive amounts of data produced in the real world. It adds methods from computer science to statistical computation in order to analyze voluminous data and extract knowledge from it.

Data science focuses on processing the huge volume of heterogeneous data known as big data. Big data refers to data sets that are large and complex in nature and difficult to process using traditional data-processing application software.

The characteristics of big data are referred to as the three Vs:

* Volume - How much data is there?
* Variety - How diverse are the different types of data?
* Velocity - At what speed is new data generated?

These are complemented by a fourth V:

* Veracity - How accurate is the data?

These four Vs of big data make capturing, cleaning, pre-processing, storing, searching, sharing, transferring, and visualizing the data a complex task. Specialized techniques are required to extract insights from this huge volume of data.
Data Science is an interdisciplinary field that allows knowledge to be extracted from structured or unstructured data. It makes it possible to translate a business problem into a research project and then translate it back into a practical solution. With the right tools, technologies, and algorithms, significant knowledge and insights can be captured from data and converted into distinct business advantages for better and faster decisions.

1.1.1 Data Science Components

[Fig. 1.1: Convergence of mathematics, computer science and domain expertise for data science]

The above figure depicts the convergence of mathematics, computer science and domain expertise for data science. The main components of Data Science are given in Fig. 1.2.

1. Statistics
Statistics is one of the most important components of data science. It is a way to collect and analyze data and to find meaningful insights from it.

2. Domain Expertise
Expert knowledge or skills in a particular area such as health care, automobiles, or the retail industry.

3. Data engineering
Data engineering involves acquiring, storing, retrieving, and transforming the data. It also includes adding metadata (data about data) to the data.

4. Visualization
Representing data in a visual context so that the data and the insights drawn from it are easy to understand, even when the amount of data is huge.

5. Advanced computing
Advanced computing involves designing, writing, debugging, and maintaining the source code of computer programs that perform the complex analysis.

6. Mathematics
Mathematics involves the study of quantity, structure, space, and change.

7. Machine learning
Machine learning trains a software model so that it can perform tasks as a human expert would.

[Fig. 1.2: Main components of Data Science]

1.1.2 Benefits and uses of data science

Data science and big data are used almost everywhere, in both commercial and non-commercial settings.

Commercial applications
* Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, order completion, and products.
* Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
Example: Google AdSense collects data from internet users in order to match relevant commercial messages to the person browsing the internet.

Human resource management
* Human resource professionals use people analytics and text mining to screen candidates, monitor the mood of employees, and study informal networks among coworkers.
* In American baseball, correlated signal analysis and applied statistics were used to hire the right players and pit them against the opponents where they would have the biggest advantage.

Financial applications
* Financial institutions use data science to predict stock markets, determine the risk of lending money, and learn how to attract new clients for their services.

Government sector
* Governmental organizations are also aware of data's value. Many of them share their data with the public, who can use it to gain insights or build data-driven applications.
* A data scientist in a government organization works on diverse projects such as detecting fraud and other criminal activity or optimizing project funding.
Example: The British Government Communications Headquarters used data science and big data to monitor millions of individuals. Such organizations collected 5 billion data records from widespread applications such as Google Maps, Angry Birds, email, and text messages, among many other data sources. They then applied data science techniques to distill information.
Non-governmental organizations
* NGOs use data science to raise money and defend their causes.
* The World Wildlife Fund (WWF), for instance, employed data scientists to increase the effectiveness of its fundraising efforts.
* DataKind is a group of data scientists that devotes its time to the benefit of mankind.

Education
* Universities use data science in their research and to enhance the study experience of their students.
* The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities to study how this type of learning can complement traditional classes.
* MOOCs are an invaluable asset for anyone who wants to become a data scientist or big data professional. A few online course portals: Coursera, Udacity, and edX.

A few application domains of data science
* Fraud and risk detection in banking
* Healthcare prediction and analysis
* Internet search and recommendations
* Market analysis
* Customer analysis
* Targeted advertising
* Website recommendations
* Advanced image recognition
* Speech recognition
* Airline route planning

1.2 FACETS OF DATA

Data science is focused on processing complex datasets and building predictive models from those data. It includes a wide array of activities, from the upstream processes of acquiring, cleaning, and integrating data to the downstream processes of analysis, modeling, and prediction. There are many facets of data science, including:

* Identifying the structure of data
* Cleaning, filtering, reorganizing, augmenting, and aggregating data
* Visualizing data
* Data analysis, statistics, and modeling
* Machine learning
* Assembling data processing pipelines to link these steps
* Leveraging high-end computational resources for large-scale problems

Different tools are used in different parts of this process. Interoperability among tools, based on common data structures and interfaces, is therefore an important element in enabling the construction of complex, multifaceted data analysis pipelines.

In data science and big data, many different types of data are handled, and each of them requires different tools and techniques.

Main categories of data
1. Structured
2. Unstructured
3. Natural language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming

Structured Data
* Structured data is data that depends on a data model.
* It resides in a fixed field within a record.
* It is easy to store structured data in tables within databases or Excel files (Fig. 1.3).
* SQL is the preferred way to manage and query data that resides in databases.
* Hierarchical data such as a family tree is also considered structured data, because it is stored in a particular structure.
Example: Examples of structured data include dates, names, addresses, credit card numbers, etc.

Order Date | Region  | Rep     | Item    | Units | Unit Cost | Total
9-1-14     | Central | Smith   |         | 2     | 125.00    | 250.00
6-17-15    | Central | Kivell  |         | 5     | 125.00    | 625.00
9-10-15    | Central | Gill    | Pencil  | 7     | 1.29      | 9.03
11-17-15   | Central | Jardine | Binder  | 11    | 4.99      | 54.89
10-31-15   | Central | Andrews | Pencil  | 14    | 1.29      | 18.06
2-26-14    | Central | Gill    | Pen     | 27    | 19.99     | 539.73
10-5-14    | Central | Morgan  | Binder  | 28    | 8.99      | 251.72
12-21-15   | Central | Andrews | Binder  | 28    | 4.99      | 139.72
2-9-14     | Central | Jardine | Pencil  | 36    | 4.99      | 179.64
8-7-15     | Central | Kivell  | Pen Set | 42    | 23.95     | 1,005.90
1-15-15    | Central | Gill    | Binder  | 46    | 8.99      | 413.54
1-23-14    | Central | Kivell  | Binder  | 50    | 19.99     | 999.50

Fig. 1.3: An Excel table is an example of structured data.

Unstructured Data
* Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying.
* Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed with conventional data tools and methods.
* It is best managed in non-relational (NoSQL) databases or data lakes, which preserve it in raw form.
* The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data makes up over 80% of all enterprise data, while 95% of businesses prioritize unstructured data management.

Example of unstructured data: email (Fig. 1.4)
* Although email contains structured elements such as the sender, title, and body text, it is a challenge to find, for example, the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person. The thousands of different languages and dialects in use further complicate this.

Subject: New team of UI engineers

An investment banking client of mine has had the go ahead to build a new team of UI engineers to work on various areas of a cutting-edge single-dealer trading platform.

They will be recruiting at all levels and paying between 40k & 85k (+ all the usual benefits of the banking world). I understand you may not be looking. I also understand you may be a contractor. Of the last 3 hires they brought into the team, two were contractors of 10 years who I honestly thought would never turn to what they considered "the dark side."

This is a genuine opportunity to work in an environment that's built up for best in industry and allows you to gain commercial experience with all the latest tools, tech, and processes.

There is more information below. I appreciate the spec is rather loose - They are not looking for specialists in Angular / Node / Backbone or any of the other buzz words in particular, rather an "engineer" who can wear many hats and is in touch with current tech & tinkers in their own time.

For more information and a confidential chat, please drop me a reply email. Appreciate you may not have an updated CV, but if you do that would be handy to have a look through if you don't mind sending.

Fig. 1.4: Email is an example of unstructured data and natural language data.

Natural language
* Natural language is a special type of unstructured data. It is challenging to process because it requires knowledge of specific data science techniques and of linguistics.
* The natural language processing (NLP) community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis.
* Models trained in one domain don't generalize well to other domains.
* Even state-of-the-art techniques aren't able to decipher the meaning of every piece of text.
* Humans struggle with natural language as well. It is ambiguous by nature, and the meaning of the same words can vary with the context.
* Many areas such as healthcare, finance, media, and human resources use NLP to make use of the data available in the form of text and speech.
* Many text and speech recognition applications are built using NLP, for example personal voice assistants like Siri, Cortana, Alexa, etc.
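The contrast between structured and unstructured data described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the four order rows are copied from Fig. 1.3, but the table name `orders`, the employee name, and the email texts are invented for this example and do not come from the source.

```python
import re
import sqlite3

# Structured data: four rows copied from the Fig. 1.3 order table
# (rep, item, units, unit cost). The table name "orders" is an assumption.
orders = [
    ("Gill", "Pencil", 7, 1.29),
    ("Andrews", "Pencil", 14, 1.29),
    ("Gill", "Pen", 27, 19.99),
    ("Morgan", "Binder", 28, 8.99),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (rep TEXT, item TEXT, units INTEGER, unit_cost REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)

# Every value sits in a fixed field, so one SQL query answers
# "how many units did each rep sell?"
units_by_rep = dict(
    conn.execute("SELECT rep, SUM(units) FROM orders GROUP BY rep").fetchall()
)
conn.close()

# Unstructured data: hypothetical email bodies, invented for illustration.
emails = [
    "I want to complain about John Smith, who ignored my request.",
    "Mr. Smith from your sales team was rude on the phone.",
    "Your rep Johnny S. never called me back. Terrible service.",
    "Thanks to John Smith for the quick delivery!",
]

# A naive exact-name search: it misses the second and third complaints
# (different ways of naming the person) and matches the last email,
# which is praise rather than a complaint.
naive_hits = [e for e in emails if re.search(r"\bJohn Smith\b", e)]

print(units_by_rep)
print(len(naive_hits), "emails mention the exact name")
```

The SQL query works because the data model fixes where each value lives. The email search has no such model to rely on: three of the four messages are complaints, yet the exact-name match finds only one of them plus one unrelated message, which is the difficulty the text attributes to unstructured and natural language data.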
