Topic 4 Data Organisation, Analysis and Query

Data organisation is the process of storing data in a structured hierarchy which can be used for analysis and query. Data analysis is the process of analysing the properties of the data. Data query and information retrieval is an important part of both data warehouse and business intelligence systems [Manning et al., 2009]. In Topic 1, we learned how data are stored in a computer. In this topic, we will learn more about the mathematics behind "structured data organisation" in Section 4.1. In Topic 3, we discussed the visualisation of data. Visualisation gives us some hints about the nature of the data, but to verify the statistical properties of the data we need the statistical testing methods introduced in Section 4.3 and Section 4.4. Finally, we will explore data query with the Python pandas library (and a little bit of R and SQL) in Section 4.5.

§4.1 Data Organisation with Relational Database

Unstructured data are normally stored in the way they are obtained from mobile applications, webpages, etc. in some kind of "NoSQL" database (Section 1.3.6). The DataFrame structure of the Python pandas library is similar to an Excel table, which contains a lot of redundant information; therefore it is also considered "unstructured".

"Structured" data, or more precisely "relationally structured" data, are organised based on the normalised relation model. They are designed in such a way that insertions and updates of data will never cause inconsistency in the database system. A normalised database is a database with data broken down to satisfy certain mathematical properties. A relational database system can be rather complicated, as illustrated below (https://www.onlineclassnotes.com, Ullman and Widom [1997]).

[Figure: System Architecture]

The main reason for computer science (CS) students to learn discrete mathematics (UECM1303 Discrete Mathematics with Applications, which was unfortunately removed from the CS programme just because many of them failed this subject) is to apply some of the logic and number theory in programming as well as to understand the break-down of "relations" in SQL [O'Neil and O'Neil, 2000]. Logic, set algebra and the algebra of relations are important topics in discrete mathematics and they are very relevant to database theory [de Haan and Koppelaars, 2007]. This section is a summary of relational database design according to https://en.wikipedia.org/wiki/Relational_algebra and de Haan and Koppelaars [2007].

According to Codd [1970], given sets $S_1, \dots, S_n$, a subset $R \subseteq S_1 \times \dots \times S_n$ is called an ($n$-ary) relation of the sets $S_i$, and $S_i$ is called the $i$th domain of $R$. If one of the domains $S_i$ of $R$ can be used to uniquely identify elements in $R$, it is called a primary key. The relation model requires that each component of each relation be atomic, i.e. it must be of some elementary data type (e.g. Boolean, string, integer or floating point) [Ullman and Widom, 1997].

Definition 4.1.1. A table is a set of tuples of many (zero or more) true propositions/statements of the same kind. Formally, if T and H are sets, then T is a table over H if for every t ∈ T, t is a function over H. The set H is called the heading (or table schema) of T.

Example 4.1.2. The set of tuples

T1 = {{(partno, 1), (name, 'intel-cpu'), (instock, 19), (price, 600)},
      {(partno, 2), (name, 'amd-cpu'), (instock, 12), (price, 530)}}

is a table. The set H = {partno, name, instock, price} is the heading of the table T1.
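Definition 4.1.1 models a table as a set of tuples, each of which is a function over the heading H. Below is a minimal Python sketch (not part of the original notes) that represents the table of Example 4.1.2 as a set of frozensets of (attribute, value) pairs and checks the defining property.

    # Sketch of Definition 4.1.1 / Example 4.1.2: a table is a set of tuples,
    # and each tuple is a function over the heading H (every attribute appears once).
    H = {"partno", "name", "instock", "price"}   # heading (table schema)

    T1 = {
        frozenset({("partno", 1), ("name", "intel-cpu"), ("instock", 19), ("price", 600)}),
        frozenset({("partno", 2), ("name", "amd-cpu"),   ("instock", 12), ("price", 530)}),
    }

    def is_function_over(t, heading):
        """True iff tuple t assigns exactly one value to every attribute in heading."""
        attrs = [a for (a, _) in t]
        return set(attrs) == set(heading) and len(attrs) == len(set(attrs))

    def is_table(T, heading):
        return all(is_function_over(t, heading) for t in T)

    print(is_table(T1, H))   # True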
Definition 4.1.3. A database (state) is a set with tables as function values.

Example 4.1.4. Let

T1 = {{(empno, 10123), (name, 'Liew How Hui'), (job, 'lecturer'), (deptno, 02)},
      {(empno, 11321), (name, 'Chang Yun Fah'), (job, 'lecturer'), (deptno, 02)},
      {(empno, 12231), (name, 'James Ooi'), (job, 'lecturer'), (deptno, 01)}}

and

T2 = {{(deptno, 01), (dname, 'DIECS'), (loc, 'Zone2')},
      {(deptno, 02), (dname, 'DMAS'), (loc, 'Zone1')}}

be two tables. Then a simple database state consisting of T1 and T2 is a function from {Employee, Department} to {T1, T2}:

DBS1 = {(Employee, T1), (Department, T2)}.

A database skeleton (or database schema) collects all the "headings" in the database and its tables.

Example 4.1.5. The database skeleton of DBS1 is

{(Employee, {empno, name, job, deptno}), (Department, {deptno, dname, loc})}

According to de Haan and Koppelaars [2007, Chapter 5], we should choose the names in the database skeleton wisely because they constitute the vocabulary between us and the customer (the users of the database); they are the first stepping stone to understanding the meaning (semantics) of a database design.

Codd [1970] said that "the adoption of a relational model of data permits the development of a universal data sub-language based on an applied predicate calculus", and the operations of relational algebra fall into four broad classes [Ullman and Widom, 1997, Chapter 4]:

1. Set operations on relations: union, intersection and difference.

2. Operations that remove parts of a relation:

(a) Projection: it eliminates some columns, i.e. it is equivalent to selecting particular columns $a_1, \dots, a_n$ of a relation:

$\pi_{a_1,\dots,a_n}(R) = \{\{(a_1, v_1), \dots, (a_n, v_n)\} : \{(a_1, v_1), \dots, (a_n, v_n), (a_{n+1}, v_{n+1}), \dots, (a_m, v_m)\} \in R\}$

(b) Selection: it eliminates some rows (tuples). It is a unary operation

$\sigma_{\varphi}(R) = \{t \in R : \varphi(t)\}$

where $\varphi$ is a predicate that consists of atoms as allowed in the normal selection and the logical operators $\wedge$ (and), $\vee$ (or) and $\neg$ (negation). The selection selects all those tuples in $R$ for which $\varphi$ holds.

3. Operations that combine the tuples of two relations: the "Cartesian product" and various kinds of "join" operations (e.g. natural joins, $\theta$-joins):

(a) Natural join (or fibre product):

$R \bowtie S = \{r \cup s : r \in R \wedge s \in S \wedge \mathrm{Fun}(r \cup s)\}$

where Fun is a predicate that is true for a relation $t$ (in the mathematical sense) iff $t$ is a function.

(b) $\theta$-join or equijoin:

$R \bowtie_{\theta} S = \sigma_{\theta}(R \times S)$

where $\theta$ is of the form $a\,\theta\,v$, with $\theta$ a binary relational operator in the set $\{<, \le, =, \ne, >, \ge\}$, $a$ an attribute name and $v$ either an attribute name or a value constant. The result of this operation consists of all combinations of tuples in $R$ and $S$ that satisfy $\theta$. The result of the $\theta$-join is defined only if the headers of $S$ and $R$ are disjoint, that is, do not contain a common attribute.

(c) (Left) semijoin (or restriction):

$R \ltimes S = \{t : t \in R \wedge \exists s \in S\,(\mathrm{Fun}(t \cup s))\} = \pi_{a_1,\dots,a_n}(R \bowtie S)$

where $a_1, \dots, a_n$ are the attributes of $R$.

(d) Antijoin:

$R \triangleright S = \{t : t \in R \wedge \neg\exists s \in S\,(\mathrm{Fun}(t \cup s))\} = R \setminus (R \ltimes S)$

4. Operations that change the relation schema: "renaming" attributes.

(a) Rename:

$\rho_{b/a_1}(R) = \{\{(b, v_1), (a_2, v_2), \dots, (a_n, v_n)\} : \{(a_1, v_1), (a_2, v_2), \dots, (a_n, v_n)\} \in R\}$

The changes in the database state are characterised by state transitions [de Haan and Koppelaars, 2007, Chapter 8].
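To make the set-theoretic definitions above concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) of projection, selection and natural join over tables represented as sets of frozensets of (attribute, value) pairs, as in Example 4.1.2. The Employee/Department data loosely follow Example 4.1.4.

    # Sketch of projection, selection and natural join on "tables" in the sense
    # of Definition 4.1.1 (sets of frozensets of (attribute, value) pairs).

    def project(R, attrs):
        # pi_{attrs}(R): keep only the named attributes of every tuple
        return {frozenset((a, v) for (a, v) in t if a in attrs) for t in R}

    def select(R, phi):
        # sigma_phi(R): keep the tuples for which the predicate phi holds
        return {t for t in R if phi(dict(t))}

    def natural_join(R, S):
        # R |><| S: union of compatible tuples; Fun(r U s) holds when the union
        # still assigns a single value to every attribute
        result = set()
        for r in R:
            for s in S:
                u = dict(r)
                if all(u.get(a, v) == v for (a, v) in s):
                    u.update(dict(s))
                    result.add(frozenset(u.items()))
        return result

    Employee = {
        frozenset({("empno", 10123), ("name", "Liew How Hui"), ("deptno", 2)}),
        frozenset({("empno", 12231), ("name", "James Ooi"),    ("deptno", 1)}),
    }
    Department = {
        frozenset({("deptno", 1), ("dname", "DIECS")}),
        frozenset({("deptno", 2), ("dname", "DMAS")}),
    }

    print(project(Employee, {"name", "deptno"}))
    print(select(Employee, lambda t: t["deptno"] == 2))
    print(natural_join(Employee, Department))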
Data need to be organised based on the "good" database principles below in order to allow efficient query and storage of information:

1. Tables need to be normalised and there should be no duplicate data;

2. Focus on nouns, i.e. follow the "nouns = tables" principle;

3. A column referring to another table's ID must have a foreign key constraint;

4. Lists of things (e.g. topics of a subject) get their own table, i.e. do not design a subject table with columns topic 1, topic 2, etc. This kind of "spreadsheet" design is not acceptable in SQL database design;

5. Many-to-many relationships are modelled with a lookup table (with foreign keys);

6. Watch for equal values that aren't identical, e.g. a university programme consists of "subject choices" (different types of electives) instead of subjects.

§4.2 Mathematics of Information Retrieval

Not everything can be nicely structured as in Section 4.1. Information retrieval (IR) is concerned with finding information of an unstructured nature (usually text) that is relevant to users' needs [Manning et al., 2009]. According to Dominich [2008], the basics of IR technology are:

+ Identification of terms.
+ Power law.
+ Stoplisting.
+ Stemming.
+ Weighting schemes.
+ Term-document matrix.
+ Inverted file structure.
+ Typical architecture of a retrieval system.
+ Web characteristics. General architecture of a Web search engine.
+ Architecture of a Web metasearch engine.
+ Measures of retrieval effectiveness. Laboratory measurement of retrieval effectiveness (precision-recall graph).
+ Measurement of relevance effectiveness of Web search engines.

The theory behind IR is lattice theory [Dominich, 2008] and AI models (support vector machines, clustering, matrix decomposition and latent semantic indexing, link analysis, etc.) [Manning et al., 2008].
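The list above mentions the term-document matrix and the inverted file structure. As a small illustration (my own sketch, not taken from Dominich [2008]), the following Python builds a toy inverted index mapping each term to the set of documents containing it, which is the data structure behind Boolean keyword search.

    # Toy inverted index: term -> set of document ids containing the term.
    from collections import defaultdict

    docs = {
        1: "relational database design and query",
        2: "information retrieval and web search",
        3: "database query optimisation",
    }

    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)

    # Boolean AND query: documents containing both terms
    print(inverted["database"] & inverted["query"])   # {1, 3}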
§4.3 Hypothesis Testing

Descriptive statistics (Section 2.6.9) tries to find out how to describe the data using some kind of "static" statistical model. In order to describe how well the "static" statistical model characterises the data, hypothesis testing is required. The scipy.stats module provides the following tests and related helpers:

+ stats.kurtosistest(a[, axis, nan_policy]): Test whether a dataset has normal kurtosis.
+ stats.normaltest(a[, axis, nan_policy]): Test whether a sample differs from a normal distribution.
+ stats.skewtest(a[, axis, nan_policy]): Test whether the skew is different from the normal distribution.
+ sem(a[, axis, ddof, nan_policy]): Calculate the standard error of the mean (or standard error of measurement) of the values in the input array.
+ zmap(scores, compare[, axis, ddof]): Calculate the relative z-scores.
+ zscore(a[, axis, ddof]): Calculate the z-score of each value in the sample, relative to the sample mean and standard deviation.
+ iqr(x[, axis, rng, scale, nan_policy, ...]): Compute the interquartile range of the data along the specified axis.
+ sigmaclip(a[, low, high]): Iterative sigma-clipping of array elements.
+ trimboth(a, proportiontocut[, axis]): Slices off a proportion of items from both ends of an array.
+ trim1(a, proportiontocut[, tail, axis]): Slices off a proportion from ONE end of the passed array distribution.
+ f_oneway(*args): Performs a one-way ANOVA.
+ pearsonr(x, y): Calculate a Pearson correlation coefficient and the p-value for testing non-correlation.
+ spearmanr(a[, b, axis, nan_policy]): Calculate a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
+ pointbiserialr(x, y): Calculate a point biserial correlation coefficient and its p-value.
+ kendalltau(x, y[, initial_lexsort, nan_policy]): Calculate Kendall's tau, a correlation measure for ordinal data.
+ weightedtau(x, y[, rank, weigher, additive]): Compute a weighted version of Kendall's tau.
+ linregress(x[, y]): Calculate a linear least-squares regression for two sets of measurements.
+ theilslopes(y[, x, alpha]): Computes the Theil-Sen estimator for a set of points (x, y).
+ ttest_1samp(a, popmean[, axis, nan_policy]): Calculate the T-test for the mean of ONE group of scores.
+ ttest_ind(a, b[, axis, equal_var, nan_policy]): Calculate the T-test for the means of two independent samples of scores.
+ ttest_ind_from_stats(mean1, std1, nobs1, ...): T-test for means of two independent samples from descriptive statistics.
+ ttest_rel(a, b[, axis, nan_policy]): Calculate the T-test on TWO RELATED samples of scores, a and b.
+ kstest(rvs, cdf[, args, N, alternative, mode]): Perform the Kolmogorov-Smirnov test for goodness of fit.
+ chisquare(f_obs[, f_exp, ddof, axis]): Calculate a one-way chi-square test.
+ power_divergence(f_obs[, f_exp, ddof, axis, ...]): Cressie-Read power divergence statistic and goodness of fit test.
+ ks_2samp(data1, data2): Compute the Kolmogorov-Smirnov statistic on 2 samples.
+ tiecorrect(rankvals): Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
+ rankdata(a[, method]): Assign ranks to data, dealing with ties appropriately.
+ ranksums(x, y): Compute the Wilcoxon rank-sum statistic for two samples.
+ kruskal(*args, **kwargs): Compute the Kruskal-Wallis H-test for independent samples.
+ friedmanchisquare(*args): Compute the Friedman test for repeated measurements.
+ combine_pvalues(pvalues[, method, weights]): Methods for combining the p-values of independent tests bearing upon the same hypothesis.
+ jarque_bera(x): Perform the Jarque-Bera goodness of fit test on sample data.
+ ansari(x, y): Perform the Ansari-Bradley test for equal scale parameters.
+ bartlett(*args): Perform Bartlett's test for equal variances.
+ levene(*args, **kwds): Perform Levene test for equal variances.
+ shapiro(x): Perform the Shapiro-Wilk test for normality.
+ anderson(x[, dist]): Anderson-Darling test for data coming from a particular distribution.
+ anderson_ksamp(samples[, midrank]): The Anderson-Darling test for k-samples.
+ binom_test(x[, n, p, alternative]): Perform a test that the probability of success is p.
+ fligner(*args, **kwds): Perform Fligner-Killeen test for equality of variance.
+ median_test(*args, **kwds): Mood's median test.
+ mood(x, y[, axis]): Perform Mood's test for equal scale parameters.
+ boxcox(x[, lmbda, alpha]): Return a positive dataset transformed by a Box-Cox power transformation.
+ boxcox_normmax(x[, brack, method]): Compute the optimal Box-Cox transform parameter for input data.
+ boxcox_llf(lmb, data): The Box-Cox log-likelihood function.
+ entropy(pk[, qk, base]): Calculate the entropy of a distribution for given probability values.
+ wasserstein_distance(u_values, v_values[, ...]): Compute the first Wasserstein distance between two 1D distributions.
+ energy_distance(u_values, v_values[, ...]): Compute the energy distance between two 1D distributions.
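As a quick illustration of how these functions are used (my own example, using only functions from the list above and synthetic data), one can test two samples for normality and then compare their means with a T-test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(loc=0.0, scale=1.0, size=100)
    b = rng.normal(loc=0.5, scale=1.0, size=100)

    # Do the samples look normally distributed?
    print(stats.normaltest(a))      # statistic and p-value
    print(stats.shapiro(b))         # statistic and p-value

    # Compare the two independent means with a T-test
    stat, p = stats.ttest_ind(a, b, equal_var=True)
    print("t = %.3f, p = %.4f" % (stat, p))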
§4.4 Nonparametric Statistics Testing

If the data do not have the familiar Gaussian distribution, we may resort to nonparametric versions of the significance tests. These tests are distribution free, requiring only that the data be first transformed into rank data before the test can be performed.

A common question about two or more data samples is whether they are different; specifically, whether the difference between their central tendencies (e.g. mean or median) is statistically significant. Data samples that do not have a Gaussian distribution can be studied by using nonparametric statistical significance tests. The null hypothesis H0 of these tests is often the assumption that both samples were drawn from a population with the same distribution, and therefore the same population parameters, such as mean or median. If, after the significance test is carried out on two or more samples, the null hypothesis is rejected, there is evidence to suggest that the samples were drawn from different populations, and in turn that the difference between sample estimates of population parameters, such as means or medians, may be significant.

In general, each test calculates a test statistic that must be interpreted with some background in statistics and a deeper knowledge of the statistical test itself. Tests also return a p-value that can be used to interpret the result of the test. The p-value can be thought of as the probability of observing the two data samples given the base assumption (null hypothesis) that the two samples were drawn from a population with the same distribution.

The p-value can be interpreted in the context of a chosen significance level called alpha, α. A common value for α is 0.05. If the p-value is below the significance level, then the test says there is enough evidence to reject the null hypothesis and that the samples were likely drawn from populations with differing distributions, i.e.

+ p ≤ α: reject H0, different distributions;
+ p > α: fail to reject H0, same distribution.

A few nonparametric statistical significance tests are implemented in the scipy.stats module.

+ stats.mannwhitneyu(x, y[, use_continuity, alternative]): Compute the (Wilcoxon-)Mann-Whitney U test on samples x and y to determine whether two independent samples were drawn from a population with the same distribution.
+ stats.wilcoxon(x[, y, zero_method, correction]): Calculate the Wilcoxon signed-rank test or Wilcoxon T test to compare two samples that are paired, or related. The parametric equivalent to the Wilcoxon signed-rank test goes by names such as the Student's t-test, t-test for matched pairs, t-test for paired samples, or t-test for dependent samples [Corder and Foreman, 2009].

Example 4.4.1. Given two sets of data, df1 and df2, they can be compared using the Mann-Whitney U test with pandas and scipy as follows.

    from scipy.stats import mannwhitneyu

    stat, p = mannwhitneyu(df1, df2)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    alpha = 0.05
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')
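Example 4.4.1 covers two independent samples. For paired (related) samples, stats.wilcoxon described above can be used in the same way. The sketch below is my own illustration (the before/after data are invented, not from the notes), assuming two measurements of the same subjects:

    import numpy as np
    from scipy.stats import wilcoxon

    # Paired measurements on the same subjects, e.g. before/after a treatment
    before = np.array([12.1, 10.3, 11.8, 13.0, 9.7, 12.5, 11.1, 10.9])
    after  = np.array([11.2, 10.0, 11.5, 12.1, 9.9, 11.8, 10.4, 10.6])

    stat, p = wilcoxon(before, after)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    alpha = 0.05
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')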
§4.5 Data Query with Pandas

Data query is part of the "split, process, combine" process, i.e. breaking down a larger problem into smaller pieces, processing each piece independently and then putting the smaller pieces back together. Efficiency of query is very important, as pointed out in https://parallelthoughts.xyz/2019/05/a-tale-of-query-optimization/ and https://new.reddit.com/r/programming/comments/bqkx6/i_modified_an_sql_query_from_24_mins_down_to_2/; however, here we are concerned with the basic constructs of data query and the comparison with other systems (R and SQL).

§4.5.1 Sampling, Searching and Filtering

Searching within pandas data structures is made easy by the use of "in". For financial and business applications where the query may be very precise, searching is something that computers can handle well. However, a large business trying to sell products faces a different problem (a small business will have limited products to offer): if a customer is looking for product A, the system needs to "filter" for other products related to product A which were also purchased by other customers. This is called collaborative filtering [Segaran, 2007, Chapter 2]. In this section, we will limit ourselves to traditional searching and filtering based on predicate logic (UECM1084 Basic Mathematics).

Example 4.5.1. Determine the values of the following Python commands (using the data from Topic 1 Example 1.3.5):

+ "Kuala Lumpur" in s_popul.index
+ "New York" in s_popul.index
+ "Penang" in s_popul.index

+ series[series > 1200000]: filtering by using a boolean array.
+ (series > 1000000).value_counts(): perform categorical data counting.
+ Find duplicates: df[df.duplicated()]
+ df.nlargest(n, 'value'): select and order the top n entries.
+ df.nsmallest(n, 'value'): select and order the bottom n entries.

For example, to filter the DataFrame for schools that are of type "Charter":

    is_charter = sy1617['School_Type'] == 'Charter'
    sy1617[is_charter]

We can look for multiple values in a column, such as "Charter" and "Magnet" schools, using the isin method:

    charter_magnet = sy1617['School_Type'].isin(['Charter', 'Magnet'])
    sy1617[charter_magnet]

To filter for schools with student survey response rates of at least 80:

    gt80 = sy1617['School Survey Student Response Rate Pct'] >= 80
    sy1617[gt80]

We can combine multiple conditions with & and |:

    sy1617[is_charter & gt80]

§4.5.2 Pandas-R Dictionary

According to McKinney and PyData Development Team [2019], pandas' DataFrame imitates R's data frame and one can even use HDF5 files to transfer data from one environment to the other. Since R has a lot of powerful statistical analysis tools, we need to learn a little bit about R to transfer some of the functions to Python. Therefore, this section copies the tables from McKinney and PyData Development Team [2019, Section 3.5].

Table: Querying, Filtering, Sampling

    R & dplyr                                | pandas
    dim(df)                                  | df.shape
    head(df)                                 | df.head()
    slice(df, 1:10)                          | df.iloc[:9]
    filter(df, col1 == 1, col2 == 1)         | df.query('col1 == 1 & col2 == 1')
    df[df$col1 == 1 & df$col2 == 1, ]        | df[(df.col1 == 1) & (df.col2 == 1)]
    select(df, col1, col2)                   | df[['col1', 'col2']]
    select(df, col1:col3)                    | df.loc[:, 'col1':'col3']
    select(df, -(col1:col3))                 | df.drop(cols_to_drop, axis=1)
    distinct(select(df, col1))               | df[['col1']].drop_duplicates()
    distinct(select(df, col1, col2))         | df[['col1', 'col2']].drop_duplicates()
    sample_n(df, 10)                         | df.sample(n=10)
    sample_frac(df, 0.01)                    | df.sample(frac=0.01)

Table: Sorting, Renaming and Transforming

    R & dplyr                                | pandas
    arrange(df, col1, col2)                  | df.sort_values(['col1', 'col2'])
    arrange(df, desc(col1))                  | df.sort_values('col1', ascending=False)
    select(df, col_one = col1)               | df.rename(columns={'col1': 'col_one'})['col_one']
    rename(df, col_one = col1)               | df.rename(columns={'col1': 'col_one'})
    mutate(df, c = a - b)                    | df.assign(c=df.a - df.b)

Table: Grouping and Summarising

    R & dplyr                                | pandas
    summary(df)                              | df.describe()
    gdf <- group_by(df, col1)                | gdf = df.groupby('col1')
    summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'})
    summarise(gdf, total=sum(col1))          | df.groupby('col1').sum()
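The sketch below (my own toy DataFrame, not from the notes) tries a few of the dplyr-to-pandas translations from the tables above, so the correspondence can be checked interactively:

    import pandas as pd

    df = pd.DataFrame({'col1': [1, 1, 2, 2, 3],
                       'col2': [1, 0, 1, 1, 0],
                       'value': [10.0, 7.5, 3.2, 8.8, 5.1]})

    print(df.shape)                                   # dim(df)
    print(df.query('col1 == 1 & col2 == 1'))          # filter(df, col1 == 1, col2 == 1)
    print(df[['col1']].drop_duplicates())             # distinct(select(df, col1))
    print(df.sort_values('value', ascending=False))   # arrange(df, desc(value))
    print(df.groupby('col1').agg({'value': 'mean'}))  # summarise(group_by(df, col1), mean(value))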
§4.5.3 Pandas-SQL Dictionary

In SQL, selection is done using a comma-separated list of the columns you would like to select (or a * to select all columns):

    SELECT col1, col3, col5 FROM df WHERE col1 = 'VAL1' AND col2 = 'VAL2';

The pandas instructions to achieve the same effect are shown below.

    bidx = (df['col1'] == 'VAL1') & (df['col2'] == 'VAL2')
    df[bidx][['col1', 'col3', 'col5']]

Selection with sorting is shown below.

    SELECT * FROM df WHERE col = 'SOMETHING' ORDER BY feature DESC LIMIT 5;

This corresponds to the pandas instruction below.

    df[df['col'] == 'SOMETHING'].sort_values('feature', ascending=False).head(5)

To show only the records where col2 is a missing value (NULL), or where col1 is not missing:

    SELECT * FROM df WHERE col2 IS NULL;
    SELECT * FROM df WHERE col1 IS NOT NULL;

Their equivalent forms in pandas are "df[df['col2'].isna()]" and "df[df['col1'].notna()]".

In pandas, SQL's GROUP BY operations are performed using the similarly named groupby() method. This method typically refers to a process where we'd like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together. A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:

    SELECT sex, count(*) FROM tips GROUP BY sex;

It is equivalent to "tips.groupby('sex').size()". Note that pandas' count() has a different meaning than SQL's COUNT: it applies the function to each column, returning the number of not-null records within each.

Multiple functions can also be applied at once. For instance, say we'd like to see how the tip amount differs by day of the week; agg() allows us to pass a dictionary to our grouped DataFrame, indicating which functions to apply to specific columns.

    SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day;

The pandas way of expressing this is shown below.

    tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

SQL's JOINs can be performed with pandas' join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).

Consider the data below and assume we have two database tables of the same name and structure.

    df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
    df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': np.random.randn(4)})

Then we have the dictionary below.

Inner Join
+ SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
+ pd.merge(df1, df2, on='key')

Left Outer Join
+ SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
+ pd.merge(df1, df2, on='key', how='left')

Right Join
+ SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
+ pd.merge(df1, df2, on='key', how='right')

Full Join
+ SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
+ pd.merge(df1, df2, on='key', how='outer')

A more complicated example of a left join is shown below.

    CREATE TABLE divvy AS (
      SELECT * FROM trips
      LEFT JOIN stations ON trips.from_station_name = stations.name
    );

    divvy = pd.merge(trips, stations, how='left',
                     left_on='from_station_name', right_on='name')

Suppose we have the following data.

    df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                        'rank': range(1, 4)})
    df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                        'rank': [1, 4, 5]})

The dictionary for SQL UNION ALL is as follows.

+ SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2;
+ pd.concat([df1, df2])

Note that SQL's UNION is similar to UNION ALL but it removes duplicate rows. Therefore, the corresponding Python instruction needs to have .drop_duplicates().
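To make the last point concrete, here is a small runnable sketch (my own example, reusing the city/rank DataFrames above) contrasting UNION ALL and UNION in pandas:

    import pandas as pd

    df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                        'rank': range(1, 4)})
    df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                        'rank': [1, 4, 5]})

    # UNION ALL: keep duplicate rows
    union_all = pd.concat([df1, df2])
    print(union_all)

    # UNION: remove duplicate rows (the repeated ('Chicago', 1) row appears once)
    union = pd.concat([df1, df2]).drop_duplicates()
    print(union)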
