
# What is data?

For a class called Data Analysis, defining what we mean by data seems like a good place to start, and where else would we turn for that definition but Wikipedia: data are values of qualitative or quantitative variables, belonging to a set of items. This is a pretty good definition, and each part of the phrase tells us something important. Let's start at the end and work our way toward the beginning.

A set of items: this is the set of objects that you're interested in knowing something about. In a statistics class, this is sometimes referred to as the population. The set of items you care about depends on the question you are asking. It might be a set of people with a particular disease, a set of cars produced by a specific manufacturer, a set of visits to a website, or a set of credit card transactions.

Corresponding to each item is a set of variables. Variables are measurements or characteristics of an item. They are called variables because the most interesting measurements or characteristics are those that vary from item to item, although knowing that a variable doesn't actually vary across items can also be informative. Variables are broken down into two types: quantitative and qualitative. Quantitative variables are variables that can be measured with an ordered set of numbers; in a clinical drug study, quantitative variables might be the height, weight, or blood pressure of patients. Qualitative variables are variables that can be defined by a label; they might be the country of origin of a patient, the patient's sex, or the treatment status of that patient.
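To make the distinction concrete, here is a minimal sketch in Python (not part of the original lecture, with made-up values): each item in the set is a patient, quantitative variables are stored as numbers, and qualitative variables as labels.

```python
# A minimal sketch of the "set of items" and "variables" idea, with
# made-up values: each dictionary is one item (a patient), and each key
# is a variable measured on that item.
patients = [
    {"height_cm": 172, "weight_kg": 70, "systolic_bp": 120,   # quantitative
     "country": "US", "sex": "F", "treated": True},           # qualitative
    {"height_cm": 181, "weight_kg": 88, "systolic_bp": 135,
     "country": "DE", "sex": "M", "treated": False},
]

# Quantitative variables can be summarized with ordered numbers...
mean_height = sum(p["height_cm"] for p in patients) / len(patients)

# ...while qualitative variables are summarized by counting labels.
treated_count = sum(1 for p in patients if p["treated"])

print(mean_height, treated_count)
```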

An important distinction when it comes to data is whether they are raw or processed. Raw data come from the original source without any modifications made by the data analyst. They are often hard to use for analysis because they are large or have problems that need to be fixed. Data analysis includes the process of pre-processing these data into a form that can be used by later analyses and statistical models. Raw data may only need to be processed once, but all of the steps should be recorded. Processed data, on the other hand, are data that are ready for analysis. Processing can include things like merging the data, subsetting a certain set of variables, transforming some of the data, or removing outliers. There may be standards for processing depending on the type of data you're using, so when possible, take advantage of those standards. Regardless of what processing you've used, all steps should be recorded so that future analysts can use your processed data and be comfortable that they know the steps you took to create them from the raw data.

A specific example can shed some light on the difference between raw and processed data. This is a picture of a HiSeq DNA sequencing machine, which is used to sequence DNA in clinical studies. DNA is the sequence of about 3 billion base pairs, or letters, that is unique to each person. We will consider the data from this machine to compare raw versus processed data. Don't worry if you don't understand every step or how the machine works; the goal is just to illustrate the difference between raw and processed data. The machine works by breaking the long, 3-billion-letter sequence into much shorter fragments and attaching them to a slide, as you can see here. DNA is made up of four different letters, or bases: A, C, T, and G. Each letter is labeled with a different color dye, and then the slide is scanned and a picture is taken at every position along the fragments. The images are then processed by a computer, and at every position on every fragment, each of the four letters, A, C, G, and T, gets an intensity measurement. A statistical model is then used to determine which letter appears at which position in each fragment. For example, for this position on this fragment, since C has the highest intensity, it may be the letter that appears in the sequence. The estimated fragments are then pieced together like a puzzle to get the sequence of letters in the person's genome.
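As a rough sketch of what the processing steps named above (merging, subsetting, transforming, removing outliers) can look like in practice, here is a short Python/pandas example; the file names, column names, and outlier rule are hypothetical, not from the lecture. The point is that every step lives in a script, so the path from raw to processed data is recorded.

```python
# Hedged sketch of a processing pipeline with hypothetical file and
# column names: merge, subset, transform, remove outliers, and write out
# the processed data. Keeping it in a script records every step.
import pandas as pd

raw_measurements = pd.read_csv("raw_measurements.csv")   # hypothetical raw files
raw_demographics = pd.read_csv("raw_demographics.csv")

# 1. Merge the two raw tables on a shared identifier.
merged = raw_measurements.merge(raw_demographics, on="patient_id")

# 2. Subset to the variables needed for the analysis.
subset = merged[["patient_id", "height_cm", "weight_kg", "treated"]]

# 3. Transform a variable (convert height from centimeters to meters).
subset = subset.assign(height_m=subset["height_cm"] / 100)

# 4. Remove outliers, keeping the rule explicit (implausible heights).
processed = subset[subset["height_m"].between(0.5, 2.5)]

processed.to_csv("processed_patients.csv", index=False)
```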
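The base-calling step just described can also be sketched in a toy form. Real base callers fit careful statistical models to the dye intensities; the illustration below, with made-up numbers, simply reports the letter with the highest intensity at each position, which is the basic idea.

```python
# Toy base calling: at each position of a fragment we have one intensity
# per letter (made-up numbers); the simplest possible "model" calls the
# letter with the highest intensity at each position.
intensities = [
    {"A": 0.1, "C": 0.9, "G": 0.2, "T": 0.1},  # C brightest -> call C
    {"A": 0.8, "C": 0.1, "G": 0.1, "T": 0.2},  # A brightest -> call A
    {"A": 0.2, "C": 0.1, "G": 0.7, "T": 0.3},  # G brightest -> call G
]

called_fragment = "".join(max(pos, key=pos.get) for pos in intensities)
print(called_fragment)  # "CAG"
```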

Now, here the raw data could be the image files, but they're huge, often terabytes of data. They could be the intensity files, which are also large and often unwieldy for analysis, or they could be the short sequences of letters estimated for each fragment. These are, in fact, what most analysts use as the raw data when building human genomes. Regardless of what is considered the raw data, it's pretty clear that the way the images are processed and the way base pairs are estimated with a statistical model might have a pretty big impact on the genome produced when the short fragments are pieced together. So keeping these steps in mind is important for the analysis, and they should be recorded so that people who use the data downstream are able to understand which particular nuances of the processing steps could impact their analysis.

So, what do raw data look like? These are the raw data at the level of the short fragments produced by a sequencing machine. They include the sequence of letters as well as some information about the quality of those estimated letters. Here is another example of some raw data: it comes from the Twitter API, or Application Programming Interface. APIs are interfaces that allow you to access the data being produced by companies like Twitter and Facebook. When you access the data, they come in a very structured format, but that format may or may not be easy to analyze directly in order to learn how users use these services. Another example is an electronic medical record. Electronic medical records contain measurements of quantitative and qualitative variables, and they may also contain free text typed by the doctor about allergies or medication history. These data often need to be processed before they can be analyzed with statistical models. So what do processed data look like?
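Before moving on to processed data, here is a small sketch of what one of those raw sequencing reads can look like. The lecture doesn't show a specific file format; the example below assumes the common FASTQ convention (a read name, the called letters, and a quality string), and the read itself is made up.

```python
# One made-up sequencing read in FASTQ-style layout: read name, called
# letters, a separator line, and a quality string. The quality characters
# are decoded with the common Phred+33 convention (higher = more
# confident call).
record = """@read_0001
ACGTGCTA
+
IIIIHHGF
"""

name, sequence, _, quality = record.strip().split("\n")
phred_scores = [ord(ch) - 33 for ch in quality]

print(name, sequence, phred_scores)
```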