
Application of Student Generated Test Questions to Stimulate Deeper Learning

Keith W. DeGregory

This paper was completed and submitted in partial fulfillment of the Master Teacher Program, a 2-year faculty professional development program conducted by the Center for Teaching Excellence, United States Military Academy, West Point, NY, 2009.

Abstract

As educators, it is important to know whether the techniques we use in the classroom are effective in facilitating a student's learning. One might easily assess a student's ability to execute mechanical drills through routine quizzing, but how does one assess whether a student truly knows which concepts are important, or when and where to apply them? In this paper I focus on a specific classroom assessment technique known as student-generated test questions. This technique requires students to design their own quiz or test questions, complete with solutions, based on what they deem the most important concepts from a given block of material. This paper describes a small-scale experiment conducted with three sections of an undergraduate calculus course in which I set out to measure the effectiveness of student-generated test questions in improving student performance and depth of conceptual understanding. Although I was unable to categorically conclude that the technique is effective, I did come across some interesting results and findings. One significant finding is that a student's attitude toward this new way of studying may be a considerable factor in how well the technique works; however, further research and testing is required to confirm this, as the original experiment was not designed to gauge this factor. In general, most students appreciated this new approach to studying and felt they better understood the material when expected to design their own test questions.

Application of Student Generated Test Questions to Stimulate Deeper Learning

MAJ Keith W. DeGregory
Department of Mathematical Sciences

Background

Taking a derivative or integral by following a set procedure or applying a specific rule is not too difficult for the average student enrolled in a college calculus course. However, give this same student a calculus problem concealed in a real-world scenario filled with relevant facts as well as extraneous information, and it becomes a problem-solving exercise that requires an entire process. In such a situation the student is expected to read and understand the scenario, extract the mathematical question from the text, select the appropriate solution process, carry out the procedure correctly, interpret the results, and finally address the original problem from the scenario by answering the question. Solving problems like this is intimidating if one does not have a feel for where these problems come from and why they require calculus (or other classroom procedures) to solve. What if the student were to devise the problem himself? Suppose a student were to first choose a learning objective he wishes to understand better and then carefully design a question that assesses that learning objective, including developing a solution to his own question. I argue that the student who studies in this fashion would perform better on tests in the short term and, in the long term, would have a deeper understanding of the overarching concepts than a similar student who studies using more traditional techniques such as reviewing material or going over homework problems. The approach of having a student design his own questions is a classroom assessment technique (CAT) commonly referred to in the literature as student-generated test questions (Angelo & Cross, 1993). In practice, teachers who use this technique may assign it as homework in order to give students the opportunity to "evaluate the course topics, reflect on what they understand, and [identify] what are good test items" (NTLF, 2008).
Additionally, teachers could use the results to assess what the class deems important and how the class perceives the material based on the questions generated. Much of the research and literature on this particular CAT is subjective and generic. I came across only one paper that tried to measure the effectiveness of this CAT in the student's learning process. Therefore, I devised an experiment designed to do just that: put the student-generated test questions CAT to the test. My hypothesis: students who study by developing their own quiz questions will fare better when tested on the relevant lesson objectives by forming a deeper understanding of the more important concepts. For the purpose of this experiment, a deeper understanding is defined as long-term retention, which in the context of this experiment equates to performance on related questions on the term end exam (TEE). In the next section I describe the experiment in detail.

Experiment Design

I designed the experiment around the three sections of undergraduate mathematics I instructed during the 2008 fall term at West Point. The course was MA205, Integral Calculus and Introduction to Differential Equations, the third in a series of four core mathematics courses that every student at West Point must take. I divided the experiment into three phases, distributed over the entire semester, involving all three of the sections I taught. Since the course was already broken into three major blocks of material, I matched my experiment phases to these blocks. Block 1 was the longest of the three; therefore, I used the beginning of Block 1 to calibrate all students to my expectations for quizzes as well as the homework assignments. For the remainder of this paper I will refer to my three sections as X, Y, and Z hours (changed from the actual course designators). In the first block I selected the X hour section as the experimental group while the Y and Z hour sections were the control groups. During Block 2 I made Y hour the experimental group while the other two were control groups. Likewise, during Block 3 Z hour was the experimental group. The table below summarizes the group breakdown for the experiment.

Block     Experiment   Control
Block 1   X            Y, Z
Block 2   Y            X, Z
Block 3   Z            X, Y

Table 1: Breakdown of Experiment Grouping

It was important that the experiment be both transparent to the students and not favor any one section over another. To accomplish this I integrated the experiment into my instructor assessment plan; all instructor-level assessments (i.e., homework and quizzes) were dedicated to supporting the experiment. To this end there was no undue burden on the students' class preparation time as a result of the experiment. Throughout the semester each section had three homework assignments and eight quizzes. I created the first quiz myself and used it as a baseline for their performance on quizzes prior to the start of the experiment. It also provided the students with an expectation of the length and difficulty of quizzes for the course. Next I assigned all of the sections a team homework assignment in which the teams developed a quiz, complete with their own solution, covering pre-designated topics. The assignment asked the students to identify the most important learning objectives and develop questions designed to assess those learning objectives. The assignment was graded on creativity, appropriateness, linkage to learning objectives, and correctness of the solution. From these homework assignments I chose questions that were well thought out and covered the most important concepts, and incorporated them into the next in-class quiz. I also allowed students to use their returned homework as a reference while taking the quiz. This portion of the experiment served as a calibration since all three sections were assigned this homework and administered its corresponding quiz.

After calibration and capturing the students' baseline, I transitioned into Phase 1 of the experiment. In Phase 1, X hour served as the experimental group while the other two sections were the control groups. During this phase only X hour was assigned graded homework: two assignments spaced out over the remainder of Block 1. The purpose of these assignments was the same as the calibration homework except that they were individual rather than team assignments. From this set of homework assignments I designed the remaining two quizzes for the block. I then administered these quizzes to all three sections, including the control groups. The intent of this methodology was that the students in the experimental group would put deep thought into what they believed were the most important concepts and how best to develop a question to assess one's understanding of those concepts. In essence, the students were stepping into the shoes of the instructor and looking at the material as the tester, not the test taker. Through this thoughtful analysis, I hypothesized that the students in the experimental group would gain a greater appreciation for the important concepts, which would carry over into deeper, more permanent acquisition of these concepts. Meanwhile, I assumed the students in the control sections would most likely prepare for the upcoming quizzes using a more traditional approach to studying (i.e., rereading the text or practicing problems). Additionally, I always offered students in the control groups the option to voluntarily design a quiz as part of their studying technique. Phases 2 and 3 occurred over the final two major blocks of material, respectively. These phases were very similar to Phase 1 except that the experimental group shifted to Y hour for Phase 2 and then Z hour for Phase 3. Only the section serving as the experimental group was assigned the graded homework, such that each section had exactly two homework assignments and an equal workload. The only exception is that by Phase 3 I added an additional component to the homework that I believed would help the students identify whether their quiz was appropriate.
The added component was to develop a grading rubric for how they would assign a grade, administer their quiz to a classmate, grade the quiz according to their rubric, and then reflect on their quiz based on feedback from the classmate who took it. Something to be aware of is that each of my sections was homogeneous in that it was composed of students sectioned based on similar math abilities, as measured by their previous semester of undergraduate mathematics at West Point. The composition of each section is summarized in Table 2 to give a better appreciation for the students involved in the experiment. The table depicts each of the three sections with a very basic generalization based on their responses to a pre-experiment survey.
Section                                  X Hour                Y Hour               Z Hour
# students                               13                    17                   17
Enjoys math                              31%                   68%                  79%
Considers self mathematically inclined   15%                   59%                  79%
Average study hours for quizzes          1.4                   1.1                  1.0
Average grade (MA104)                    D                     B                    A
Predominant study technique              Reviewing solutions   Reworking problems   Reviewing solutions

Table 2: Composition of Sections

This table shows that I am basically working with three distinct sample populations: a math savvy section (Z Hour), a section with average mathematical abilities (Y Hour), and a section that struggles with mathematics (X Hour).


Data and Metrics

Throughout the semester I collected both quantitative and qualitative data using a variety of means, including test scores and surveys. The table below lists all of the techniques I used to gather data:

Quantitative                                 Qualitative
Quiz grades                                  Entry survey
Homework grades                              Calibration survey
WPR grades (experiment-specific questions)   Experiment survey
WPR grades (overall)                         End-of-course survey
TEE (experiment-specific questions)          Personal interviews

Table 3: Data Points

I will apply two main techniques to measure the quantitative results. One technique will be to measure the differential from the averages of all sections; the other will be to measure grades as compared to the baseline for that particular group. I will use the qualitative results in two ways as well. First I will quantify responses, when possible, to illustrate overarching attitudes, and then I will use actual student comments and feedback to reinforce these attitudes.

Analysis & Discussion

I calculated the baseline for each hour by averaging a section's performance on the calibration quizzes, the questions from the TEE specifically linked to the calibration quizzes, and a course-wide derivatives exam. These baseline percentages, found in the table below, represent how I would expect each hour to perform on a quiz of standard difficulty. Since I am working with high, medium, and low math sections, I will assume that the average of these averages represents a fair measurement of standard difficulty: 82.7%, which was only 0.6% less than the overall course-ending average.
Section            X Hour   Y Hour   Z Hour   Overall
Baseline Average   70.3%    84.2%    92.0%    82.7%

Table 4: Baseline Averages

During Phase 1 (Block 1) of the experiment I found that X hour's section average of 72.1% for the two quizzes administered during Block 1 was 2.6% greater as a percentage of their baseline average, as calculated using Equation 1 below.

Differential = (Section Average - Baseline Average) / Baseline Average x 100%   (Eq. 1)
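As a minimal sketch, the Eq. 1 differential described above can be computed as follows (the function name is my own; the figures come from Tables 4 and 5):

```python
def baseline_differential(section_avg, baseline):
    """Return a section's score differential as a percentage of its baseline (Eq. 1)."""
    return (section_avg - baseline) / baseline * 100

# X hour during Block 1: quiz average of 72.1% against a 70.3% baseline
print(f"{baseline_differential(72.1, 70.3):+.1f}%")  # prints "+2.6%"
```

Dividing by the baseline, rather than reporting a raw difference, lets sections of very different abilities be compared on the same scale.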

Similar calculations for the performance of the control groups, Y and Z hours, on these same quizzes show that Y hour scored 0.1% below their baseline and Z hour scored exactly at their baseline. Since the quizzes came directly after the homework assignment, this metric does not necessarily measure the deeper understanding I mention in the hypothesis. Therefore, I conducted a similar analysis comparing performance on the questions contained in that block's WPR, as well as the TEE, that focus specifically on the concepts assessed by the quizzes. The findings are consolidated in the table below.
Block 1               X Hour (Experiment)   Y Hour (Control)   Z Hour (Control)
Quizzes               +2.6%                 -0.1%              +0.0%
WPR Focus Questions   +10.8%                +8.4%              +3.6%
TEE Focus Questions   +14.4%                +1.5%              +0.6%
Overall Averages      +9.2%                 +4.4%              +2.1%

Table 5: Phase 1 Results

Note that for each metric the higher percentage experienced by the experimental group supports the hypothesis: on every measurement, from short-term recall to long-term retention, the experimental group outperformed the control groups. The fact that performance increases over time suggests that the students may have internalized the concepts, resulting in a deeper understanding and higher performance on the term end exam. I replicated this process for the next two blocks of instruction, where the experimental and control groups change, so that I can generalize my conclusions. The results are found in the two following tables.
Block 2               X Hour (Control)   Y Hour (Experiment)   Z Hour (Control)
Quizzes               -7.2%              +3.8%                 +1.5%
WPR Focus Questions   +14.6%             +5.1%                 +5.3%
TEE Focus Questions   +3.2%              +0.5%                 +1.4%
Overall Averages      +3.4%              +2.5%                 +1.6%

Table 6: Phase 2 Results

For Block 2, where Y hour was the experimental group, their performance on the block quizzes again supports the hypothesis, but the control groups end up performing better on both the WPR and TEE focus questions, which counters the hypothesis. The overall average for these metrics shows that the experimental group's performance (+2.5%) is greater than Z hour's (+1.6%) but less than X hour's (+3.4%). This does not necessarily support the hypothesis; however, there may be other factors biasing the experiment at this point. For example, X hour may indirectly view the material, and consequently how they study, differently, having already been exposed to the question generation technique during Block 1.
Block 3               X Hour (Control)   Y Hour (Control)   Z Hour (Experiment)
Quizzes               +6.1%              -2.5%              -0.5%
WPR Focus Questions   +8.0%              -10.0%             -2.5%
TEE Focus Questions   +10.5%             +6.2%              +2.8%
Overall Averages      +8.5%              -2.1%              +0.0%

Table 7: Phase 3 Results


The results for Block 3 are tabulated in Table 7. None of the metrics for the experimental group (Z hour) support the hypothesis for this block. At this point, I concluded that the results were inconclusive for the experiment as designed. However, upon closer examination of my qualitative data, I discovered that the students' attitudes toward the experiment may play a role in individual performance and hence in the overall section averages. For example, in the entry survey 92% of students in the low section believed that trying a new, innovative way of studying could improve their grades, compared to 71% and 76% for Y and Z hours, respectively. In light of this, they may have been more open to taking the exercise seriously in an attempt to improve their grades. Additionally, there was greater room for improvement in the lower sections than in the higher section, which was already performing at the A level. Figure 1 shows the results from the exit survey when students were asked whether they felt their performance on tests was higher during the block in which they designed the test questions.

Figure 1: Percent by section that felt their performance was higher during the experimental block

The graph in Figure 1, along with other responses from the surveys, suggests that individual attitudes toward the experiment may have played a factor in the results. For example, X and Y hours, in general, felt they performed better in their respective experimental blocks. Looking back at Tables 5 and 6, respectively, we see that these hours indeed outperformed the control hours for those blocks. Likewise, Z hour felt mostly neutral, and they actually experienced a negative differential, ending up between the two control hours (see Table 7). Is it coincidence that X and Y hours both had lower mathematical abilities while Z had the highest?
One explanation is that the mindset of an A student, and how he studies, could be entirely different from that of, say, a B or D student. Finally, in the exit survey nearly 80% of the free responses had nothing but positive comments about the student-generated test questions CAT and actually wanted to see more of it. Typical comments included: "I believe it helped me do better on quizzes because I had to understand the concepts to accurately create a quiz." For the 20% who felt it was not worthwhile, it is possible their negative outlook on the technique biased their performance and results. Since the exit survey was conducted anonymously through our grade management software, I was unable to correlate attitude with individual performance during the experiment. One other consensus was that most students would like to see this type of assignment in the future, although they admitted they would most likely not study this way if not required to. In 1987, Reeves-Kazelskis and Kazelskis also set out to measure the effectiveness of this particular CAT. They concluded that the technique was effective only for students with "higher levels of prior knowledge related to [the subject] upon entry into the course," and that it was detrimental to students who had lower levels of prior knowledge (Reeves-Kazelskis & Kazelskis, 1987). These results are almost the opposite of what I found: the lower sections improved more across the board, whereas the higher section performed lower in the short run and showed no improvement in the long run.

This discrepancy could also be attributed to external factors, such as the subject material, as Reeves-Kazelskis and Kazelskis conducted their experiment in the foreign-language domain.

Recommendations & Conclusions

I believe there were two main flaws in my experiment. First, the sample size was not large enough. Second, there were other factors and variables involved that were not controlled and may have biased the results. Furthermore, I believe attitude played a large role in the individual results, but because I did not anticipate this from the beginning, I had not designed the experiment to capture these attitudes. Further testing is required on a larger sample size, with more controls and a refined hypothesis, to truly measure the effectiveness of this CAT. After careful analysis of the data and survey responses, I find the results of the experiment inconclusive. My findings do not fully support my original hypothesis: students who study by developing their own quiz questions will fare better when tested on the relevant lesson objectives by forming a deeper understanding of the more important concepts. Although there are some positive indicators in the results, especially with the lower-ability sections, and positive feedback from most of the students, the experiment did not categorically support the hypothesis. Furthermore, my results are at odds with the only other available research I came across that attempts to measure the effectiveness of this CAT. In the end, however, the majority of the students appreciated the introduction to a new way of studying and in general felt it contributed to improved performance and deeper understanding. My conclusion is that the student-generated test question classroom assessment technique is a good tool to use in the classroom, but, like most things in life, you get out of it what you put in. The student has to buy into the exercise and take the assignment seriously if he expects the technique to help him learn and understand the material better.

Bibliography

Angelo, T. A., & Cross, P. K. (1993). Classroom Assessment Techniques (2nd ed.). San Francisco: Jossey-Bass.

National Teaching & Learning Forum (NTLF). Classroom Assessment Techniques. Accessed April 4, 2008.

Reeves-Kazelskis, C., & Kazelskis, R. (1987). The Effects of Student-Generated Questions on Test Performance. Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Mobile, AL.

Silva, F. (1995). Student-generated test questions. IN: Indiana University Teaching Resources Center.
