
What is data?

Defining what we mean by data seems like a pretty good place to start for a class called Data Analysis, and where else would we turn for that definition but Wikipedia: data are values of qualitative or quantitative variables, belonging to a set of items. This is a pretty good definition, and each part of the phrase tells us something important. Let's start at the end and work our way toward the beginning.

A set of items: this is the set of objects you're interested in knowing something about. In a statistics class, this is sometimes referred to as the population. The set of items you care about depends on the question you are asking. It might be a set of people with a particular disease, a set of cars produced by a specific manufacturer, a set of visits to a website, or a set of credit card transactions.

Corresponding to each item is a set of variables. Variables are measurements or characteristics of an item. They are called variables because the most interesting measurements or characteristics are those that vary from item to item, although knowing that a variable doesn't actually vary across items can also be informative. Variables are broken down into two types: quantitative and qualitative. Quantitative variables can be measured with an ordered set of numbers; in a clinical drug study, quantitative variables might be the height, weight, or blood pressure of patients. Qualitative variables are defined by a label; they might be the country of origin of a patient, the sex of that patient, or the treatment status of that patient.

An important distinction when it comes to data is whether it is raw or processed. Raw data comes from the original source without any modifications made by the data analyst. It is often hard to use for analysis because it is large or has problems that need to be fixed. Data analysis includes the process of pre-processing these data into a form that can be used by later analyses and statistical models. Raw data may only need to be processed once, but all of the steps should be recorded.

Processed data, on the other hand, is data that is ready for analysis. Processing can include things like merging data sets, subsetting a certain set of variables, transforming some of the data, or removing outliers. There may be standards for processing depending on the type of data you're using, so when possible, take advantage of those standards. Regardless of what processing you've used, all steps should be recorded so that future analysts can use your processed data and be comfortable knowing the steps you took to create it from the raw data.

A specific example can shed some light on the difference between raw and processed data. This is a picture of a HiSeq DNA sequencing machine, which is used to sequence DNA in clinical studies. DNA is a sequence of about 3 billion base pairs, or letters, that is unique to each person. We will consider the data from this machine to compare raw versus processed data. Don't worry if you don't understand every step of how the machine works; the goal is just to illustrate the difference between raw and processed data.

The machine works by breaking the long 3-billion-letter sequence into much shorter sequences and attaching them to a slide, as you can see here. DNA has four different base pairs: A, C, T, and G. Each base pair is labeled with a different color dye, and then the slide is scanned and a picture is taken at every position along the fragments. The images are then processed by a computer, and at every position on every fragment, each of the four letters A, C, G, and T gets an intensity measurement. A statistical model is then used to determine which letter appears at which position in each fragment. For example, at this position on this fragment, since C has the highest intensity, it may be the letter that appears in the sequence.
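The base-calling step just described can be sketched in a few lines of code. This is only a toy illustration, not the model a real sequencer uses: real base callers are far more sophisticated statistical models, and the intensity values here are made up.

```python
# Toy sketch of base calling: at each position along a fragment we have
# four dye intensities (one per base), and we call the base whose dye
# was brightest at that position.

def call_bases(intensities):
    """intensities: a list of dicts, one per position along the fragment,
    each mapping base -> measured intensity."""
    return "".join(max(pos, key=pos.get) for pos in intensities)

# Hypothetical intensity measurements for a three-letter fragment.
fragment_intensities = [
    {"A": 0.1, "C": 0.9, "G": 0.2, "T": 0.1},  # C is brightest -> call C
    {"A": 0.8, "C": 0.1, "G": 0.3, "T": 0.2},  # A is brightest -> call A
    {"A": 0.2, "C": 0.2, "G": 0.1, "T": 0.7},  # T is brightest -> call T
]
print(call_bases(fragment_intensities))  # -> CAT
```

A real model would also report how confident it is in each call, which is exactly the per-letter quality information mentioned below.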
The estimated fragments are then pieced together like a puzzle to get the sequence of letters in the person's genome. Now, here the raw data could be the image files, but those are huge, often terabytes of data. It could be the intensity files, which are also large and often unwieldy for analysis. Or it could be the short sequences of letters estimated for each fragment; these are, in fact, what most analysts use as the raw data when building human genomes. Regardless of what is considered the raw data, it's pretty clear that the way the images are processed and the way base pairs are estimated with a statistical model might have a pretty big impact on the genome produced when the short fragments are pieced together. So keeping these steps in mind is important for the analysis, and they should be recorded so that people who use the data downstream are able to understand which nuances of the processing steps could impact their analysis.

So, what do raw data look like? This is the raw data at the level of the short fragments produced by a sequencing machine. It includes the sequence of letters as well as some information about the quality of those estimated letters.

This is another example of some raw data. It comes from the Twitter API, or Application Programming Interface. APIs are interfaces that allow you to access the data being produced by companies like Twitter and Facebook. When you access the data, they come in a very structured format, but that format may or may not be easy to analyze directly in order to get information about the way users use these services.

Another example is an electronic medical record. Electronic medical records contain measurements of quantitative and qualitative variables. They may also contain free text typed by the doctor about allergies or medication history. These data often need to be processed before they can be analyzed with statistical models.

So what do processed data look like?
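Before looking at a real example, here is a minimal sketch, in plain Python with made-up data, of the processing steps named earlier: subsetting to the variables of interest, transforming one variable, and removing an outlier. Every variable name and value below is hypothetical.

```python
# Hypothetical raw records: one dict per item, one key per variable.
raw = [
    {"id": 1, "weight_lb": 150, "country": "US", "scanner_temp": 21.3},
    {"id": 2, "weight_lb": 180, "country": "UK", "scanner_temp": 21.1},
    {"id": 3, "weight_lb": 9999, "country": "US", "scanner_temp": 21.4},  # bad reading
]

processed = [
    {"id": r["id"],
     "weight_kg": round(r["weight_lb"] * 0.4536, 1),  # transform: pounds -> kilograms
     "country": r["country"]}                          # subset: drop scanner_temp
    for r in raw
    if r["weight_lb"] < 500                            # remove the outlier
]
print(processed)
```

Because every step is written down as code, a future analyst can re-create the processed data from the raw data, which is exactly the record-keeping discussed above.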

We're going to be talking a little about data processing coming up next week, but to give you a flavor of what we're going for: processed data, or tidy data, have the following properties. Each variable forms a column, so in each of these columns are the measurements for one specific variable. Each observation forms a row. In this case, this was a study of peer reviewers in an experiment we performed in 2011, so each row corresponds to a particular question solved by a particular reviewer, and the corresponding variables for that question lie in the columns; row 1 contains all the values for question 1. Each table or file stores data about one kind of observation. For example, in a clinical study, you wouldn't include in the same table information about patients as well as information about the hospitals those patients are treated in. The goal is to separate the data in such a way that it's easy to answer the questions you're trying to answer in your downstream analysis.

So how much data is out there? This is an infographic that describes how much information is being created each year. It suggests that about 1.8 zettabytes were created in 2011. You might dispute the exact value, but it gives you an idea of the order of magnitude of data being created each year. 1.8 zettabytes is equivalent to about 3 tweets per minute for every person in the United States, every minute, for an entire year. That's a lot of information, and it's why you hear so often about big data. Big data is usually defined as data sets that are so large they cannot be analyzed with a single computer.
Despite being different in this way, big data sets are similar to small ones in that the data are still being used to answer specific questions that people want to address, and typically the common statistical and machine learning algorithms can be applied to these data once it's possible to handle the data themselves.

An important thing to keep in mind when talking about big data is that it really depends on your perspective. This is a picture of an IBM 350 hard drive, which could store about 5 megabytes of data. Some of the data sets you'll analyze during this class on your laptop will be larger than 5 megabytes, so to the people in these pictures, they would be big data. The big data that we analyze today are big largely because our computers are not able to handle them.

So why is big data a big deal right now? Here's an example. In 1969, 296 individuals in Nebraska and Boston were given letters, with the goal of mailing them to a friend, who would then mail them to a friend, and so on, with the eventual goal of the letter ending up in the hands of a specific person in Boston. 64 such letter chains made it to the target person, and on average there were about 5.2 people between the person who mailed the original letter and the target individual. This number was rounded up and became the basis for the usual six degrees of separation that you hear about in the media. Recently, a similar study was performed on 30 billion conversations from 240 million people, a substantially larger data set. It estimated the average number of degrees of separation between people to be 6.6, which would round up to seven degrees of separation. The interesting thing here is that these data are now much easier to collect, and that has happened so rapidly that it might not be possible for our computers to keep up. This is why people talk a lot about big data and, in particular, struggle with how to handle the data, given that our computers have not grown as fast as the data have.

Regardless of whether the data are big or small, this is something to keep in mind. This is a quote by John Tukey, one of the most famous data analysts: "The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." So it's important to keep in mind that if you're trying to answer a specific question, which is the basis for most good data analysis, you may not have the data to answer that question. That's a hard conclusion to come to. The only thing I'd add to this quote is that no matter how big your data are, you still need the right data to answer your question. So taking a step back and thinking about whether your data will answer the question you're trying to answer is the important first step in doing a data analysis.