
Input = input() is a built-in function used to read data from the user

Set = an unordered collection of unique items


Output = print() is a function used to display data
Dictionary = an unordered collection of key-value pairs.
Syntax = the structure of statements in a computer language.
Statements = the instructions that a Python interpreter can execute.
Packages = a well-organized hierarchy of directories for easier access.
Modules = files containing Python statements and definitions.
Datatypes = classes, such as integers (used for numeric values), strings, lists, and tuples. (A literal is raw data given in a variable or constant.)

Functions = a group of related statements that performs a specific task. Python has two types of
functions: built-in functions and user-defined functions.

Loop = iterates over a sequence (list, tuple, string) or other iterable object. Iterating over a
sequence is called traversal.

Tuple = an ordered sequence of items, same as a list. The only difference is that tuples are
immutable: once created, they cannot be modified.

List = an ordered sequence of items. It is one of the most used datatypes in Python and is very
flexible. The items in a list do not all need to be of the same type.

Keywords = Keywords are the reserved words in Python. We cannot use a keyword as
a variable name, function name or any other identifier. They are used to define the syntax and
structure of the Python language.

Variable = A variable is a named location used to store data in memory. It is helpful to think of a
variable as a container that holds data that can be changed later in the program. A variable is an
instance (object) of one of the datatype classes above.

Operators = special symbols in Python that carry out arithmetic or logical computation. The value
that the operator operates on is called the operand. There are four types of operators (arithmetic
operators, assignment operators, comparison operators, and logical operators).

OOP = Python supports Object-Oriented style or technique of programming that encapsulates code
within objects
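
A minimal sketch tying several of these definitions together (the names are illustrative, not from any particular source):

def count_items(items):                  # a user-defined function
    counts = {}                          # a dictionary: key-value pairs
    for item in items:                   # a for loop traverses the sequence
        counts[item] = counts.get(item, 0) + 1
    return counts

fruits = ["apple", "pear", "apple"]      # a list: ordered and mutable
point = (3, 4)                           # a tuple: ordered but immutable
print(count_items(fruits))               # {'apple': 2, 'pear': 1}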

Interpretation (statements):

The *for* construct is an easy way to look at each element in a collection.

The *in* construct is an easy way to test whether an element appears in a collection.

The *try* section includes the code that might throw an exception.

The *except* section holds the code to run if there is an exception. If there is no exception, the
except section is skipped.
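
A minimal sketch of *try*/*except* (the messages and values are illustrative):

try:
    number = int(input("Enter a number: "))   # may raise ValueError
    print(10 / number)                        # may raise ZeroDivisionError
except ValueError:
    print("That was not a number.")
except ZeroDivisionError:
    print("Cannot divide by zero.")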
Statistics

• anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-designed study.

• population: A group we are interested in studying. "Population" often refers to a group of people, but the term is used for other subjects, too.

• cross-sectional study: A study that collects data about a population at a particular point in time.

• cycle: In a repeated cross-sectional study, each repetition of the study is called a cycle.

• longitudinal study: A study that follows a population over time, collecting data from the same group repeatedly.

• record: In a dataset, a collection of information about a single person or other subject.

• respondent: A person who responds to a survey.

• sample: The subset of a population used to collect data.

• representative: A sample is representative if every member of the population has the same chance of being in the sample.

• oversampling: The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.

• raw data: Values collected and recorded with little or no checking, calculation or interpretation.

• recode: A value that is generated by calculation and other logic applied to raw data.

• data cleaning: Processes that include validating data, identifying errors, translating between data
types and representations, etc.
1) What are the responsibilities of a data analyst?

The responsibilities of a data analyst include:

• Providing support for all data analysis and coordinating with customers and staff
• Resolving business-associated issues for clients and performing audits on data
• Analyzing results, interpreting data using statistical techniques, and providing ongoing reports
• Prioritizing business needs and working closely with management on information needs
• Identifying new processes or areas for improvement opportunities
• Analyzing, identifying, and interpreting trends or patterns in complex data sets
• Acquiring data from primary or secondary data sources and maintaining databases/data systems
• Filtering and "cleaning" data, and reviewing computer reports
• Determining performance indicators to locate and correct code problems
• Securing databases by developing an access system that determines each user's level of access

2.12 Glossary

• distribution: The values that appear in a sample and the frequency of each.
• Histogram: A mapping from values to frequencies, or a graph that shows this mapping, i.e., the
frequency of each value.
In plain Python (or use value_counts from pandas):
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1   # count occurrences of each value
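
The pandas equivalent mentioned above (a sketch, assuming t is a pandas Series):

import pandas as pd
t = pd.Series([1, 2, 2, 3, 5])
print(t.value_counts())   # frequency of each value, most frequent first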

Representing histograms with thinkstats2:

>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist.Freq(2)       # frequency of the value 2
>>> hist.Values()      # the values that appear in the histogram
>>> for val in sorted(hist.Values()):
...     print(val, hist.Freq(val))
Plotting histograms:
>>> import thinkplot
>>> thinkplot.Hist(hist)
>>> thinkplot.Show(xlabel='value', ylabel='frequency')
• frequency: The number of times a value appears in a sample.
• mode: The most frequent value in a sample, or one of the most frequent
values.
• normal distribution: An idealization of a bell-shaped distribution;
also known as a Gaussian distribution.
• uniform distribution: A distribution in which all values have the
same frequency.
• tail: The part of a distribution at the high and low extremes.
• central tendency: A characteristic of a sample or population; intuitively, it is an
average or typical value.
• outlier: A value far from the central tendency.
To inspect possible outliers at the low end of a distribution (e.g., pregnancy length in weeks):
for weeks, freq in hist.Smallest(10):
    print(weeks, freq)
• spread: A measure of how spread out the values in a distribution are.
• summary statistic: A statistic that quantifies some aspect of a distribution, like
central tendency or spread.
• mean: a summary statistic describing central tendency: the sum of the values divided by the
sample size n.
mean = live.prglngth.mean()
• variance: A summary statistic often used to quantify spread; it describes the variability or
dispersion of a distribution.
var = live.prglngth.var()
• standard deviation: The square root of variance, also used as a
measure of spread.
std = live.prglngth.std() (for pregnancy length, the standard deviation is 2.7 weeks, which means
we should expect deviations of 2-3 weeks to be common.)
• effect size: A summary statistic intended to quantify the size of an
effect like a difference between groups.
Cohen’s d:
import math

def CohenEffectSize(group1, group2):
    # difference in means, expressed in units of the pooled standard deviation
    diff = group1.mean() - group2.mean()
    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d
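
For example, applied to the pregnancy-length groups used in the TIPS section below (firsts and others are assumed to be the same DataFrames as there):

d = CohenEffectSize(firsts.prglngth, others.prglngth)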
• clinically significant: A result, like a difference between groups, that
is relevant in practice.
• asymmetric: a distribution is asymmetric if one tail extends farther than the other (e.g., farther to the left than to the right).

TIPS:
1. When you start working with a new dataset, I suggest you explore the variables you are planning
to use one at a time, with a histogram for each; subsets can be selected with boolean indexing, e.g. live[live.birthord == 1]:
first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
other_hist = thinkstats2.Hist(others.prglngth, label='other')
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency', xlim=[27, 46])

2. Reporting results:
How you report results also depends on your goals. If you are trying to demonstrate the importance
of an effect, you might choose summary statistics that emphasize differences. If you are trying to
reassure a patient, you might choose statistics that put the differences in context.

Some of the characteristics we might want to report are:


• central tendency: Do the values tend to cluster around a particular
point?
• modes: Is there more than one cluster?
• spread: How much variability is there in the values?
• tails: How quickly do the probabilities drop off as we move away from
the modes?
• outliers: Are there extreme values far from the modes?

Linear algebra = a sub-field of mathematics concerned with vectors, matrices, and linear
transforms.
Linear regression = an old method from statistics for describing the relationships between
variables. The objective of creating a linear regression model is to find the coefficient values (b)
that minimize the error in the prediction of the output variable y.
Regularization = a technique, applied while a model is being fit to data, that encourages the model
to keep its coefficient values small.
Principal component analysis = a method for automatically reducing the number of columns of a
dataset; such methods are called dimensionality reduction.
Dot Product = the sum of the products of corresponding elements of two vectors; one of the most
important operations in deep learning.
Scalar = a single number.
Eigenvectors = unit vectors, which means that their length or magnitude is equal to 1.0.
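
A quick sketch of the dot product (using numpy for illustration):

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))   # 1*4 + 2*5 + 3*6 = 32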

Day 7: Learn Statistics


Descriptive statistics refers to methods for summarizing, organizing, and visualizing the
information in a data set; inferential statistics is about drawing conclusions about a population.
Why statistics for machine learning? Because machine learning is an interdisciplinary field that
uses statistics, probability, and algorithms to learn from data and provide insights that can be
used to build intelligent applications.
In descriptive statistics we identify the type of each variable and the number of observations, and
identify the distribution of each variable and its outliers. We talk about measures of center
(indicate where on the number line the central part of the data is located), measures of variability
(quantify the amount of variation, spread, or dispersion present in the data), and measures of
position (indicate the relative position of a particular data value in the data distribution). There
are two types of descriptive statistics: uni-variate descriptive statistics (the plots typically used
to visualize uni-variate data are bar charts, histograms, pie charts, etc.) and bi-variate
descriptive statistics (bi-variate analysis involves the analysis of two variables for the purpose of
determining the empirical relationship between them; the plots typically used to visualize
bi-variate data are scatter plots and box plots).
Scatter Plots = The simplest way to visualize the relationship between two quantitative variables,
x and y. Scatter plots are sometimes called correlation plots because they show how two variables
are correlated.
Correlation = A correlation is a statistic intended to quantify the strength of the
relationship between two variables. The correlation coefficient r quantifies the strength
and direction of the linear relationship between two quantitative variables.
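
A quick way to compute r (a numpy sketch; x and y are illustrative arrays):

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(r)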
Box Plots = A box plot is also called a box and whisker plot and it’s used to picture the
distribution of values. When one variable is categorical and the other continuous, a box-
plot is commonly used.
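
A minimal sketch of both plot types with matplotlib (the data is illustrative):

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
plt.scatter(x, y)                              # scatter plot: two quantitative variables
plt.show()
plt.boxplot([[1, 2, 2, 3, 9], [4, 5, 5, 6]])   # box plot: one distribution per group
plt.show()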
A histogram is created by dividing values into bins and stacking the values that fall in the same
bin. The bins should be neither too wide nor too narrow.
Mean = the average of a distribution: the sum divided by the total number of values.
Mode = the data value that occurs with the greatest frequency.
Median = the middle data value.
Variance = how spread out the data in a variable are.
SD = the square root of the variance. The standard deviation (sd) of a set of numbers tells you
how much the individual numbers tend to differ from the mean.
A parameter is a measurement of a variable's distribution.
Gaussian/normal distribution = a function that shows the possible values for a variable and how
often they occur. The width of the curve is defined by the sd.
Probability and statistics are related areas of mathematics which concern themselves
with analyzing the relative frequency of events. But probability deals with predicting the
likelihood of future events, while statistics involves the analysis of the frequency of past
events.
Bayes’s theorem is a relationship between the conditional probabilities of two events
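In symbols (the standard statement, added here for reference): P(A|B) = P(B|A) * P(A) / P(B).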

1. Descriptive statistics (univariate and bivariate)


1.1 Descriptive statistics
1.2 Variance
When you start working with a new dataset, I suggest you explore the variables you are
planning to use one at a time, and a good way to start is with histograms.
1.3 Distributions and histogram
1.4 Representing PMFs and plotting
1.5 Outliers
1.6 Relative risk and Conditional probability
1.7 Reporting results
2. Data cleaning, when you import data like this, you often have to check for errors, deal
with special values, convert data into different formats, and perform calculations.
3. Validation: One way to validate data is to compute basic statistics and compare them
with published results.
4. Interpretation: To work with data effectively, you have to think on two levels at the
same time: the level of statistics and the level of context.

Probability mass functions


Another way to represent a distribution is a probability mass function (PMF), which maps
from each value to its probability. A probability is a frequency expressed as a fraction of
the sample size, n. To get from frequencies to probabilities, we divide through by n,
which is called normalization. The Pmf is normalized so total probability is 1.
Pmf and Hist objects are similar in many ways; in fact, they inherit many of their methods
from a common parent class. For example, the methods Values and Items work the
same way for both. The biggest difference is that a Hist maps from values to integer
counters; a Pmf maps from values to floating-point probabilities. By plotting the PMF
instead of the histogram, we can compare two categories without being misled by the
difference in sample size.
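
A minimal sketch of building a PMF by normalizing frequencies (plain Python, following the Hist code above):

t = [1, 2, 2, 3, 5]
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
n = len(t)
pmf = {x: freq / n for x, freq in hist.items()}   # probabilities sum to 1
print(pmf)   # {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}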

Summary statistics are concise, but dangerous, because they obscure the data. An
alternative is to look at the distribution of the data, which describes how often each value
appears. The most common representation of a distribution is a histogram, which is a
graph that shows the frequency or probability of each value.
Mode: The most common value in a distribution is called the mode. In Figure 2.1 there is
a clear mode at 39 weeks. In this case, the mode is the summary statistic that does the
best job of describing the typical value.
Shape: Around the mode, the distribution is asymmetric; it drops off quickly to the right
and more slowly to the left. From a medical point of view, this makes sense. Babies are
often born early, but seldom later than 42 weeks. Also, the right side of the distribution is
truncated because doctors often intervene after 42 weeks.
Outliers: Values far from the mode are called outliers. Some of these are just unusual
cases, like babies born at 30 weeks. But many of them are probably due to errors, either
in the reporting or recording of data.

A variable is a value that may change or differ between individuals in an experiment. The moon's
circumference will always have the same value, so it is called a constant.

Frequency is the count of occurrences of each value, and relative frequency is the frequency
expressed as a proportion. With frequency tables we have exact counts, so we can always create
the histogram, but not the other way around.

The range doesn't accurately represent the variability of the data, because outliers inflate it.

A good way to get rid of outliers is to cut off the tails (the lower 25% and the upper 25%), which leaves the interquartile range.

The mean is sensitive to outliers

SD is the square root of the average squared deviation. The number of standard deviations away
from the mean is a way to look for unusual values.

If I know the distribution, I can think critically about whether the mean, median, or mode best
describes the data set. For this it is good to read the distribution in proportions, because they are
easy to read and give you an idea of the shape. A smaller bin size allows us to get more information.

When there are different measurement units, it is good to standardize the distributions, because
standardization uses 0 as the reference point (the middle) and sd = 1, transforming a distribution
into a standard normal variate. A continuous distribution allows us to calculate the proportion
between any two values on the x axis.

The sampling distribution describes the form of the distribution of a data set: uniform, bimodal,
normal, or skewed. Kurtosis describes the tails: a negative value means light tails (little data in
the tails) and a positive value means heavy tails (more data in the tails).
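
A quick check with scipy (a sketch; the data is illustrative):

from scipy.stats import kurtosis
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(kurtosis(data))   # Fisher definition: negative = light tails, positive = heavy tails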

Normalization is a scaling technique in which values are shifted and rescaled so that they end up
ranging between 0 and 1. It is also known as Min-Max scaling.

Standardization is another scaling technique where the values are centered around the mean with a
unit standard deviation. This means that the mean of the attribute becomes zero and the resultant
distribution has a unit standard deviation.
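
A minimal sketch of both scalings in plain Python (the data is illustrative):

data = [2.0, 4.0, 6.0, 8.0]

# Min-Max scaling (normalization): values end up between 0 and 1
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# Standardization: shift to mean 0, rescale to unit standard deviation
n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
standardized = [(x - mean) / sd for x in data]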
