Functions = A function is a group of related statements that performs a specific task. Python has
two types of functions: built-in functions and user-defined functions.
Loop = A for loop iterates over a sequence (list, tuple, string) or other iterable object. Iterating
over a sequence is called traversal.
Tuple = is an ordered sequence of items same as a list. The only difference is that tuples are
immutable. Tuples once created cannot be modified.
List = is an ordered sequence of items. It is one of the most used datatype in Python and is very
flexible. All the items in a list do not need to be of the same type.
Keywords = Keywords are the reserved words in Python. We cannot use a keyword as
a variable name, function name or any other identifier. They are used to define the syntax and
structure of the Python language.
Variable = A variable is a named location used to store data in memory. It is helpful to think of
a variable as a container that holds data that can be changed later in the program. Since everything
in Python is an object, data types are classes and variables are instances (objects) of these classes.
Operators = Operators are special symbols in Python that carry out arithmetic or logical
computation. The value that the operator operates on is called the operand. Four commonly used
types are arithmetic, assignment, comparison and logical operators.
OOP = Python supports Object-Oriented style or technique of programming that encapsulates code
within objects
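To illustrate the List and Tuple entries above, here is a minimal sketch of the difference between the two. The variable names and values are made up for the example.

```python
# A list can hold mixed types and can be changed in place.
items_list = [1, "two", 3.0]
items_list[0] = 100

# A tuple holds the same kind of ordered items but cannot be modified.
items_tuple = (1, "two", 3.0)
try:
    items_tuple[0] = 100
    tuple_changed = True
except TypeError:
    # Assigning to a tuple element raises TypeError.
    tuple_changed = False

print(items_list)     # [100, 'two', 3.0]
print(tuple_changed)  # False
```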
Interpretation(statements):
The *except* section holds the code to run if there is an exception. If there is no exception, the
except section is skipped.
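A small sketch of this behavior, using a hypothetical user-defined function safe_divide (not part of the notes) that catches division by zero:

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        # Runs only when the try block raises ZeroDivisionError.
        return None
    # Reached only when no exception occurred; the except section was skipped.
    return result

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # None
```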
Statistics
• anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-
designed study.
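• population: A group we are interested in studying. "Population" often refers to a group of people,
but the term is used for other subjects, too.
• cycle: In a repeated cross-sectional study, each repetition of the study is called a cycle.
• record: In a dataset, a collection of information about a single person or other subject.
• raw data: Values collected and recorded with little or no checking, calculation or interpretation.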
• recode: A value that is generated by calculation and other logic applied to raw data.
• data cleaning: Processes that include validating data, identifying errors, translating between data
types and representations, etc.
1) What are the responsibilities of a data analyst?
• Provide support for all data analysis and coordinate with customers and staff
• Resolve business-related issues for clients and perform audits on data
• Analyze results and interpret data using statistical techniques, and provide ongoing reports
• Prioritize business needs and work closely with management on information needs
• Identify new processes or areas for improvement
• Analyze, identify and interpret trends or patterns in complex data sets
• Acquire data from primary or secondary data sources and maintain databases/data systems
• Filter and "clean" data, and review computer reports
• Determine performance indicators to locate and correct code problems
• Secure the database by developing an access system that determines each user's level of access
2.12 Glossary
• distribution: The values that appear in a sample and the frequency of each.
• histogram: A mapping from values to frequencies, or a graph that shows this mapping, that is,
the frequency of each value.
Code in Python (or use value_counts from pandas):
# t is any sequence of values; hist maps each value to its frequency
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
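As a sketch, the same value-to-frequency mapping can be built with collections.Counter from the standard library. Counter is swapped in here for the pandas value_counts mentioned above so that the example has no third-party dependency. The sample list t is made up.

```python
from collections import Counter

t = [1, 2, 2, 3, 5]  # hypothetical sample values

# Manual dictionary approach from the notes.
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

# Counter builds the same mapping in one call.
counts = Counter(t)

print(hist)                  # {1: 1, 2: 2, 3: 1, 5: 1}
print(dict(counts) == hist)  # True
```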
TIPS:
1. When you start working with a new dataset, I suggest you explore the variables you are planning
to use one at a time, and a good way to do that is to look at histograms. Subsets can be selected
with boolean indexing, for example live[live.birthord == 1].
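import thinkstats2
import thinkplot

# live is assumed to be the DataFrame of live births (it is not defined in these notes)
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]

first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
other_hist = thinkstats2.Hist(others.prglngth, label='other')
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency', xlim=[27, 46])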
2. Reporting results:
How you report results also depends on your goals. If you are trying to demonstrate the importance
of an effect, you might choose summary statistics that emphasize differences. If you are trying to
reassure a patient, you might choose statistics that put the differences in context.
Summary statistics are concise, but dangerous, because they obscure the data. An
alternative is to look at the distribution of the data, which describes how often each value
appears. The most common representation of a distribution is a histogram, which is a
graph that shows the frequency or probability of each value.
Mode: The most common value in a distribution is called the mode. In Figure 2.1 there is
a clear mode at 39 weeks. In this case, the mode is the summary statistic that does the
best job of describing the typical value.
Shape: Around the mode, the distribution is asymmetric; it drops off quickly to the right
and more slowly to the left. From a medical point of view, this makes sense. Babies are
often born early, but seldom later than 42 weeks. Also, the right side of the distribution is
truncated because doctors often intervene after 42 weeks.
Outliers: Values far from the mode are called outliers. Some of these are just unusual
cases, like babies born at 30 weeks. But many of them are probably due to errors, either
in the reporting or recording of data.
A variable is a value that may change or differ between individuals in an experiment. The moon's
circumference will always have the same value, so it is called a constant.
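Frequency is the count of occurrences of each value, and relative frequency is that count expressed
as a proportion of the total. A frequency table holds exact counts, so we can always build the
histogram from it. The reverse does not always work: a histogram that groups values into bins does
not let us recover the exact count for each value.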
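A minimal sketch of frequency versus relative frequency, using made-up data:

```python
from collections import Counter

values = [1, 2, 2, 3, 5]  # hypothetical data
freq = Counter(values)    # value -> count
n = len(values)

# Relative frequency: each count divided by the total number of values.
rel_freq = {value: count / n for value, count in freq.items()}

print(freq[2])      # 2
print(rel_freq[2])  # 0.4
```

The relative frequencies always sum to 1, which makes data sets of different sizes easier to compare.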
The range does not represent the variability of the data well, because outliers inflate it.
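A good way to reduce the effect of outliers is to cut off the tails (the lowest 25% and the highest
25%) and look at the spread of the middle 50% of the data, the interquartile range (IQR).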
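The sketch below compares the range with the interquartile range on made-up data containing one outlier. It assumes Python 3.8 or later for statistics.quantiles. Quartile conventions differ between tools, so the exact IQR value can vary slightly from one library to another.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data; 100 is an outlier

# The range uses only the two extreme values, so the outlier dominates it.
data_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(data_range)  # 99
print(iqr)
```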
SD (standard deviation) is the square root of the average squared deviation from the mean. The
number of standard deviations a value lies away from the mean is a way to spot unusual values.
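A short sketch of that definition on made-up data. It divides by the number of values (the population standard deviation, the same quantity statistics.pstdev computes), which is one reading of "average squared deviation"; a sample standard deviation would divide by n - 1 instead.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = sum(data) / len(data)
# Average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
sd = math.sqrt(variance)

# Number of standard deviations each value lies from the mean.
distances = [(x - mean) / sd for x in data]

print(mean, sd)       # 5.0 2.0
print(distances[-1])  # 2.0, so the value 9 is two SDs above the mean
```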
If I know the distribution, I can think critically about which of the mean, median and mode best
describes the data set. For this it is good to read the histogram in proportions (relative
frequencies), because they are easy to read and give a quick idea of the data. A smaller bin size
allows us to see more detail.
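When variables are measured in different units, it is good to standardize the distribution, because
the standardized distribution uses 0 as its reference point (the middle) and has sd = 1. Standardizing
turns each value into a standard score (z-score); for normally distributed data the result is a
standard normal variate.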
Continuous distribution allow us to calculate any proportion between two values on x axis.
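As an illustration, the proportion between two values is the difference of the cumulative distribution function (CDF) at those values. The sketch assumes a standard normal distribution and Python 3.8 or later for statistics.NormalDist. The helper name proportion_between is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_between(dist, low, high):
    """Proportion of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# Share of a standard normal distribution within one SD of the mean.
p = proportion_between(standard_normal, -1, 1)
print(round(p, 4))  # about 0.6827
```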
The shape of a distribution describes the form of a data set: uniform, bimodal, normal or skewed.
Kurtosis describes the tails of a distribution. A negative value (of excess kurtosis) means light
tails (little data in the tails), and a positive value means heavy tails (more data in the tails).
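A minimal sketch of excess kurtosis using the plain moment-based definition (fourth central moment divided by the squared variance, minus 3). Statistical libraries may apply bias corrections, so their results can differ slightly. The function name and both data sets are made up for illustration.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # evenly spread values, light tails
heavy = [5] * 8 + [-20, 30]             # mostly central values plus two extremes

print(excess_kurtosis(flat) < 0)   # True
print(excess_kurtosis(heavy) > 0)  # True
```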
Normalization is a scaling technique in which values are shifted and rescaled so that they end up
ranging between 0 and 1. It is also known as Min-Max scaling.
Standardization is another scaling technique where the values are centered around the mean with a
unit standard deviation. This means that the mean of the attribute becomes zero and the resultant
distribution has a unit standard deviation.
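The two scaling techniques above can be sketched side by side in plain Python. The values are made up, and the standardization uses the population standard deviation (dividing by n), which is an assumption of this example.

```python
import math

values = [10, 20, 30, 40, 50]  # hypothetical attribute values

# Normalization (min-max scaling): rescale into the range 0 to 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: subtract the mean and divide by the standard deviation.
mean = sum(values) / len(values)
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
standardized = [(v - mean) / sd for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized)
```

After standardization the values have mean 0 and standard deviation 1, while after normalization the minimum maps to 0 and the maximum to 1.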