### Transcription of Transforming and Restructuring Data - Stat …

Transforming and Restructuring Data

Jamie DeCoster, Department of Psychology, University of Alabama, 348 Gordon Palmer Hall, Box 870348, Tuscaloosa, AL 35487-0348. Phone: (205) 348-4431. Fax: (205) 348-8648.

May 14, 2001

These notes were prepared with the support of a grant from the Dutch Science Foundation. I would like to thank Heather Claypool and Lynda Mae for comments made on earlier versions of these notes. If you wish to cite the contents of this document, the APA reference for them would be

DeCoster, J. (2001). Transforming and Restructuring Data. Retrieved <month, day, and year you downloaded this file> from

For future versions of these notes or help with data analysis visit

ALL RIGHTS TO THIS DOCUMENT ARE RESERVED.

#### Contents

1. Introduction
2. Transformations: Calculating New Values from Existing Variables
3. Normalizing Data
4. Working with Conditionals (if statements)
5. Working with Arrays and Loops
6. Restructuring Data: Changing the Unit of Analysis

#### Chapter 1. Introduction

##### Overview

Often the initial form of your data is not the form you want for analysis. There are many possible reasons for this. For example:

- A researcher might choose to have data entered in a format that is easy for typists (to reduce data-entry errors) but which differs from the form needed for analysis.
- An experiment may have been administered by a computer program that is forced to record the data on a trial-by-trial basis when the participant is the desired unit of analysis.
- The residuals of an ANOVA might be observed to have a severe skew. This is problematic because ANOVAs assume that the residuals have a normal distribution. Correcting this often involves transforming the response variable.
- A particular way of looking at the data may not become apparent until after analysis has already begun and the data have been loaded into the statistics program in a format incompatible with the new analysis.

These notes attempt to explain the circumstances under which you would manipulate your data and provide a number of tools and techniques to make manipulation easier and more efficient.
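To make the skew example concrete, here is a minimal Python sketch (Python is not one of the packages covered in these notes, and the response values are invented purely for illustration). It computes a simple moment-based skewness for a right-skewed variable before and after a log transformation. The skewness of the raw variable stands in for the skewness of ANOVA residuals, which is an assumption made to keep the example short.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical response variable with a long right tail (made-up numbers).
response = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

# The transformation: a new variable calculated from the old one.
log_response = [math.log(v) for v in response]

skew_raw = skewness(response)
skew_log = skewness(log_response)

print(f"skewness before transforming: {skew_raw:.2f}")
print(f"skewness after log transform: {skew_log:.2f}")
```

For these numbers the log-transformed variable is noticeably less skewed than the original. Whether a log (or any other) transformation is appropriate for a real response variable depends on the data.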

Three tools that are particularly important are conditional statements, loops, and arrays. Conditional statements, explained in chapter 4, allow you to apply categorical transformations. This includes both transformations of a categorical variable and the application of different transformations to a numeric variable based on a categorical distinction. Loops and arrays, explained in chapter 5, provide you with a means of performing large numbers of similar transformations using a relatively small section of written code.

The great majority of people performing statistical analysis do so using either SPSS or SAS. These notes will therefore always follow the introduction of a particular method of data manipulation with specific instructions on how to implement it in both of these software packages. In the main body of each chapter we will use pseudocode (generic programming statements not specifically applicable to either program).

##### Data and Data Sets

The information that you collect from an experiment, survey, or archival source is referred to as your data.

Most generally, data can be defined as a list of numerical and/or categorical values possessing meaningful relationships. For analysts to do anything with a group of data they must first translate it into a data set. A data set is a representation of data, defining a set of variables that are measured on a set of cases.

A variable is simply a feature of an object that can be categorized or measured by a number. A variable takes on different values to reflect the particular nature of the object being observed. The values that a variable takes will vary when measurements are made on different objects at different times. A data set will typically contain measurements on several different variables.

Each time that we record information about an object we create a case. Just as it contains several variables, a data set will typically contain multiple cases. The cases should all be derived from observations of the same type of object, with each case representing a different example of that type.

Cases are also sometimes referred to as observations. The object type that defines your cases is called your unit of analysis. Sometimes the unit of analysis in a data set will be very small and specific, such as the individual responses on a questionnaire. Sometimes it will be very large, such as companies or nations. When describing a data set you should always provide definitions for your variables and the unit of analysis. You typically would not list the specific cases, although you might describe their general characteristics.

Many different data sets can be constructed from the same data. Different data sets could contain different variables and possibly even different cases. For example, a researcher gives a survey to four different people (John, Vicki, James, and Heather), asking them how they feel about dogs, cats, and birds. The survey showed that John likes dogs, but is neutral towards cats and birds. Vicki dislikes dogs, but likes cats and birds.

James is neutral towards dogs, but dislikes cats and birds. Heather dislikes dogs, likes cats, and is neutral towards birds. From this data the researcher could construct the data set presented in the table Pet Data Set 1. When displaying a data set in tabular format we generally put each case in a separate row and each variable in a separate column. The entry in a given cell of the table represents the value of the variable in that column for the case in that row.

Table: Pet Data Set 1

| Case | Person | Pet | Rating |
|------|---------|------|--------|
| 1 | John | Dog | 1 |
| 2 | John | Cat | 0 |
| 3 | John | Bird | 0 |
| 4 | Vicki | Dog | -1 |
| 5 | Vicki | Cat | 1 |
| 6 | Vicki | Bird | 1 |
| 7 | James | Dog | 0 |
| 8 | James | Cat | -1 |
| 9 | James | Bird | -1 |
| 10 | Heather | Dog | -1 |
| 11 | Heather | Cat | 1 |
| 12 | Heather | Bird | 0 |

The unit of analysis for this data set is a person's evaluation of a pet. It has three variables: person, representing whose evaluation it is; pet, representing the animal being evaluated; and rating, coding whether the person has a positive (1), negative (-1), or neutral (0) evaluation.
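The step from raw survey answers to a data set with one case per evaluation can be sketched in Python as follows. This is only an illustration: the function and variable names are made up for this example, and the numeric coding (likes = 1, neutral = 0, dislikes = -1) is assumed from the description of the rating variable.

```python
# Assumed coding of the rating variable: positive, neutral, negative.
RATING_CODES = {"likes": 1, "neutral": 0, "dislikes": -1}

# The survey answers exactly as described in the text.
survey = {
    "John": {"Dog": "likes", "Cat": "neutral", "Bird": "neutral"},
    "Vicki": {"Dog": "dislikes", "Cat": "likes", "Bird": "likes"},
    "James": {"Dog": "neutral", "Cat": "dislikes", "Bird": "dislikes"},
    "Heather": {"Dog": "dislikes", "Cat": "likes", "Bird": "neutral"},
}


def build_evaluation_cases(answers):
    """Return one case per (person, pet) pair: the unit of analysis is an evaluation."""
    cases = []
    for person, pets in answers.items():
        for pet, feeling in pets.items():
            cases.append(
                {
                    "case": len(cases) + 1,  # running case number
                    "person": person,
                    "pet": pet,
                    "rating": RATING_CODES[feeling],
                }
            )
    return cases


pet_data_set_1 = build_evaluation_cases(survey)

for case in pet_data_set_1:
    print(case)
```

Each dictionary is one case (a row of the table) and each key is one variable (a column), so four people rating three pets yield twelve cases.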

While this is an accurate representation of the data, it might be easier to examine if the responses from the same person could be seen on the same line. The researcher might therefore restructure the data set as in the table Pet Data Set 2.

Table: Pet Data Set 2

| Case | Person | Dog | Cat | Bird |
|------|---------|-----|-----|------|
| 1 | John | 1 | 0 | 0 |
| 2 | Vicki | -1 | 1 | 1 |
| 3 | James | 0 | -1 | -1 |
| 4 | Heather | -1 | 1 | 0 |

The unit of analysis for this data set is an individual. This time there are four variables: person, indicating who is providing the evaluation; dog, representing the person's evaluation of dogs; cat, representing the person's evaluation of cats; and bird, representing the person's evaluation of birds.

Looking at the data this way it's pretty clear that some people appear to like pets in general more than others. The researcher might therefore decide that it would be useful to add a new variable to indicate the person's average pet rating. The data set that would result appears in the table Pet Data Set 3.

Table: Pet Data Set 3

| Case | Person | Dog | Cat | Bird | Average |
|------|---------|-----|-----|------|---------|
| 1 | John | 1 | 0 | 0 | .33 |
| 2 | Vicki | -1 | 1 | 1 | .33 |
| 3 | James | 0 | -1 | -1 | -.67 |
| 4 | Heather | -1 | 1 | 0 | 0 |

The unit of analysis for this data set is again the individual. It includes all of the variables found in data set 2 as well as a new variable, average, representing the mean rating of all three pets.

All three of these data sets are accurate representations of the original data, but they contain different variables and do not all share the same unit of analysis. The important thing when building your data set is to make sure that you maintain the relationships that were originally present in the data. The exact structure that your data sets should have depends on what sort of analyses you wish to perform. Analyses that are easy using one form of your data could be very difficult using another.

##### Data Manipulation

Data manipulation is the procedure of creating a new data set from an existing data set. In almost every study you will need to alter your initial data set in some way before you can begin analysis.
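As a minimal sketch of data manipulation in this sense, the Python code below derives the second and third pet data sets from the first. It is an illustration only: the function names are hypothetical, and the ratings assume the coding likes = 1, neutral = 0, dislikes = -1.

```python
# Pet data set 1: one case per (person, pet) evaluation.
evaluations = [
    ("John", "Dog", 1), ("John", "Cat", 0), ("John", "Bird", 0),
    ("Vicki", "Dog", -1), ("Vicki", "Cat", 1), ("Vicki", "Bird", 1),
    ("James", "Dog", 0), ("James", "Cat", -1), ("James", "Bird", -1),
    ("Heather", "Dog", -1), ("Heather", "Cat", 1), ("Heather", "Bird", 0),
]


def one_case_per_person(rows):
    """Collect each person's ratings into a single case with one variable per pet."""
    people = {}
    for person, pet, rating in rows:
        case = people.setdefault(person, {"person": person})
        case[pet.lower()] = rating
    return list(people.values())


def add_average(cases, pets=("dog", "cat", "bird")):
    """Return copies of the cases with a new variable holding the mean pet rating."""
    return [
        dict(case, average=sum(case[pet] for pet in pets) / len(pets))
        for case in cases
    ]


pet_data_set_2 = one_case_per_person(evaluations)  # unit of analysis: the individual
pet_data_set_3 = add_average(pet_data_set_2)       # same cases, one new variable

for case in pet_data_set_3:
    print(case["person"], case["dog"], case["cat"], case["bird"], round(case["average"], 2))
```

The first function changes what a case is (twelve evaluations become four individuals), while the second leaves the cases alone and only adds a variable.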

The different ways that you can change your data set can be grouped into two general categories.

1. Changes that involve calculating new variables as a function of one or more old variables in your data set are called transformations. The new data set will typically have all of the original variables, with the addition of one or more new variables. Sometimes a transformation will simply involve changing the values of an existing variable. After performing a transformation the cases of the new data set will be exactly the same as those of the old data set.
2. If you alter your data set in such a way that you end up changing the unit of analysis you are performing data restructuring. The new data set will typically use entirely new variables, with maybe a small number that are the same as in the original data set. Additionally, your new data set will be composed of entirely new cases. Restructuring a data set is typically a more difficult and involved procedure than simply transforming variables.
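The difference between the two categories can be checked directly on the pet evaluations from the earlier example. The Python sketch below is illustrative only: the variable names `liked`, `mean_rating`, and `n_raters` are invented for this example, and the ratings assume the coding likes = 1, neutral = 0, dislikes = -1.

```python
# One case per (person, pet) evaluation.
evaluations = [
    {"person": person, "pet": pet, "rating": rating}
    for person, pet, rating in [
        ("John", "Dog", 1), ("John", "Cat", 0), ("John", "Bird", 0),
        ("Vicki", "Dog", -1), ("Vicki", "Cat", 1), ("Vicki", "Bird", 1),
        ("James", "Dog", 0), ("James", "Cat", -1), ("James", "Bird", -1),
        ("Heather", "Dog", -1), ("Heather", "Cat", 1), ("Heather", "Bird", 0),
    ]
]

# Transformation: a new variable computed from an old one.
# The cases themselves are unchanged.
transformed = [dict(case, liked=case["rating"] > 0) for case in evaluations]

# Restructuring: the unit of analysis changes from an evaluation to a pet,
# so the new data set has new cases and mostly new variables.
ratings_by_pet = {}
for case in evaluations:
    ratings_by_pet.setdefault(case["pet"], []).append(case["rating"])

restructured = [
    {"pet": pet, "mean_rating": sum(ratings) / len(ratings), "n_raters": len(ratings)}
    for pet, ratings in ratings_by_pet.items()
]

print(len(evaluations), len(transformed), len(restructured))
for case in restructured:
    print(case)
```

The transformed data set still has twelve cases, one per evaluation, whereas the restructured data set has three cases, one per pet, and keeps only the pet variable from the original.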

The first thing you should always do when thinking about manipulating your data is to write down exactly what you want your final data set to look like. You should describe the unit of analysis for your cases, as well as define all of your variables. This step will make it much easier for you to determine what transformation and restructuring steps you will need to take.

##### Data Manipulation in SPSS

There are two basic ways that you can work with SPSS. Most users typically open up an SPSS data file in the data editor, and then select items from the menus to manipulate the data or to perform statistical analyses. This is referred to as interactive mode, because your relationship with the program is very much like a personal interaction, with the program providing a response each time you make a selection. If you request a transformation the data set is immediately updated. When you select an analysis the results immediately appear in the output window.