The International Journal of Educational and Psychological Assessment, January 2012, Vol. 9(2), 104

Assessing Higher Education Teachers through Peer Assistance and Review

Carlo Magno
De La Salle University, Manila


Abstract
The present study advances the practice of assessing teacher performance by constructing a rubric systematically anchored on an amalgamated professional practice and learner-centered framework (see Magno & Sembrano, 2009). The validity and reliability of the rubric were determined using both classical test theory and item response theory, and implications are drawn for a new way of looking at the function of teacher performance assessment results in higher education institutions. The rubric used by fellow teachers is called the Peer Assistance and Review Form (PARF). The items reflect learner-centered practices within four domains anchored on Danielson's Components of Professional Practice: planning and preparation, classroom environment, instruction, and professional responsibility. The rubric was pilot tested with 183 higher education faculty. The participants were observed in their classes by two raters. Concordance of the two raters was established across the four domains (ρ = .47, p < .01). High internal consistency among the items was obtained using Cronbach's alpha. The four-factor structure of the domains was established in a measurement model using Confirmatory Factor Analysis. The Polytomous Rasch Model (Rating Scale Analysis) showed the appropriateness of the step calibration of the four-point scale and the fit of the items in the rubric.

Keywords: peer assistance and review, teacher assessment, Danielson's Components of Professional Practice, learner-centered

Introduction

In order to have a valid assessment of teachers' performance, different stakeholders should contribute to the assessment process. External raters, including students, are often used to assess teachers' performance (Allison-Jones & Hirt, 2004; Centra, 1998; Heckert, Latier, Ringwald, & Silvey, 2006; Howard, Helms, & Lawrence, 1997; Tang, 1997; Marsh & Bailey, 1993; Pike, 1998; Scriven, 1994; Stringer & Irwing, 1998; Young & Shaw, 1999), as are peer ratings (Goldstein, 2004; Graves, Sulewski, Dye, Deveans, Agras, & Person, 2009; Kell & Annetts, 2009; Magno & Sembrano, 2009; Reid, 2008) and accreditation bodies (Gosling, 2002; Ross, Hogaboam-Gray, McDougall, & Bruce, 2002). Another aspect of assessing teachers' performance is self-assessment, which includes teaching portfolios (Graves, Sulewski, Dye, Deveans, Agras, & Person, 2009; Stolle, Goerss, & Watkins, 2005; Wray, 2008) and self-evaluation checklists (Bruce & Ross, 2008). Self-reflection on one's teaching is another popular method (Bruce & Ross, 2008; Graves, Sulewski, Dye, Deveans, Agras, & Person, 2009). The students and teachers in higher education are considered direct stakeholders in the account of teachers' performance because they are the primary witnesses and recipients of the teaching and learning process. While the use of students as raters of their faculty in higher education is relatively common, there is less support for the use of peers as raters. Peer assistance and review, peer review, or teacher peer coaching occurs when a teacher is observed by another teacher for a specific purpose (Goldstein,
2012 Time Taylor Academic Journals ISSN 2094-0734


2004; Kerchner & Koppich, 1993; Bruce & Ross, 2008). Peer evaluations in teaching are described as "involving teachers in the summative [also formative] evaluation of other teachers" (Goldstein, 2004, p. 397). Graves, Sulewski, Dye, Deveans, Agras, and Person (2009, p. 186) further noted that "evaluating one's peers allows the assessment of one's teaching by another person who has similar experience and goals." A more explicit definition of peer evaluation was provided by Bruce and Ross (2008, p. 350), who described it as "a structured approach for building a community in which pairs of teachers of similar experience and competence observe each other teach, establish improvement goals, develop strategies to implement goals, observe one another during the revised teaching, and provide specific feedback." The purposes of rating teachers, such as hiring, clinical supervision, and modeling, are best facilitated using peer evaluations. Teachers' performance from peer reviews should be conceptualized with the aim of helping teachers improve their teaching rather than solely pointing out their mistakes (Oakland & Hambleton, 2006; Stiggins, 2008). It is a constructive process in which the peer aims to assist a less experienced teacher in improving instruction, with a focus on student-teacher interaction. Blackmore (2005) reiterated this constructive idea of peer review, arguing that assessing teachers should bring about changes and improvement in the practice of teaching. Goldstein (2003) indicates that there is a need for extensive research in the area of peer assessment of teacher performance, especially with regard to implementation issues. The present study constructed an instrument that serves the purpose of peer assistance and review for higher education faculty. The instrument is administered by faculty peers who provide feedback to fellow higher education faculty.
Teachers' Views of Peer Review

Peer review of teachers' performance is defined and described with several intentions, but the teachers who are constantly observed form their own views. These views are described in studies as thoughts and perceptions created by teachers as part of the process. Views have also been quantitatively assessed using attitude scales reflecting components such as general attitudes and domain-specific attitudes (Wen, Tsai, & Chang, 2006). Teachers' views about assessment by fellow teachers were shown in the study of Kell and Annetts (2009), who invited teaching staff to verbalize their perceptions about the Peer Review of Teaching (PRT) and clarify issues. The teaching staff were asked to provide their personal reflections about the PRT. They found that giving the teaching staff ownership of the PRT made them more autonomous and flexible in the process. In terms of the rationale and purpose of the PRT, the staff saw it as formative and useful for personal and professional development, while newer staff viewed it as summative and audit-like. Ethical concerns about the PRT included lack of time and the potential for bias in the review, which made some staff reluctant to participate. The affective issues
were complaints about the pulling of rank and undercurrents of power gains. On the other hand, the study by Atwood, Taylor, and Hutchings (2000) on a peer review teaching program for science professors identified the barriers to peer review practice. The barriers include (1) fear, (2) uncertainty about what should be reviewed, and (3) uncertainty about how the process is reviewed. A more positive approach to studying peer reviews was taken by Carter (2008), who presented useful ways for peer reviewers to enrich the peer review process. The tips are meant to make the review as pleasant as possible: (1) understand alternative views of teaching and learning; (2) prepare seriously for the pre-visit interview; (3) listen carefully to what the students say in the pre-visit interview; (4) watch the students, not just the teacher. The views of Carter (2008) provide alternative ways of implementing the peer review process that focus more on its constructive aspect. Milanowski (2006) explained that peer review can become more constructive when peers discuss performance problems and suggestions (without the responsibility for making an administrative evaluation, evaluators are able to provide more assistance toward improving performance). The review is constructive when its function is split into administrative and developmental roles. In the split-role arrangement, developmental evaluation and feedback are provided by a peer mentor while administrative evaluation is provided by managers and peer evaluators; in the combined-role arrangement, developmental evaluation, feedback, and administrative evaluation are all provided by a peer. The views of the teachers in the study showed that ratees in the split-role group were slightly less open to discussions of problems or weaknesses than those in the combined-role group.
The results of the interview showed that a larger proportion of those in the split-role group reported being comfortable discussing their problems or weaknesses than those in the combined-role group; however, the difference was small. The study by Keig (2000) determined faculty perceptions of several factors that might detract from and/or enhance their likelihood of taking part in formative peer review of teaching. It also determined faculty perceptions of how peer assessment might benefit the faculty, colleagues, students, and the institution. Keig found that the more willing faculty were to participate in peer review, the less likely they were to act as detractors. This indicates that faculty who engage in peer reviews have good intentions toward their fellow faculty.

Effects of Peer Reviews of Teaching

Different studies have shown that when peer reviews are intended as a positive and constructive approach, they can be beneficial for their intended outcomes (Bruce & Ross, 2009; Reid, 2008; Bernstein & Bass, 2000; Blackmore, 2005; Yon, Burnap, & Kohut, 2002; Kumrow & Dahlem, 2002). For example, an anonymous writer (2006) reported that when peer assistance and review was implemented statewide in Canada, it reinforced the value of teaching as a highly skilled vocation, helped teachers become more reflective about their teaching, and increased student learning as reflected in higher SAT scores. Bruce and Ross (2008) found that peer reviews increased teachers' efficacy. Moreover, Reid (2008) found
that teachers and peers saw opportunities for developing relationships. The implementation by Kumrow and Dahlem (2002) reported that the number and quality of classroom observations increased substantially.

The Present Study

Empirical studies of peer assistance and review are still not as rich as those on teachers' performance from students' perspectives. The majority of the literature on peer assistance consists of articles or reviews explaining the process and how it can be implemented. The few completed studies report improvement in practice (Bruce & Ross, 2008; Kumrow & Dahlem, 2002), highlight teaching practices (Bernstein & Bass, 2000), develop a framework for teaching (Blackmore, 2005), and examine the autonomy of the teacher (Yon, Burnap, & Kohut, 2002). These benefits necessitate the proper implementation of peer assistance and review in a higher education setting. The present study constructed a rubric called the Peer Assistance and Review Form (PARF) that is applicable in Philippine higher education institutions and purports to yield the same benefits mentioned in the reviews. The rubric's validity and reliability were established using concordance among raters, convergence, item fit through the polytomous Rasch model (rating scale analysis), and Confirmatory Factor Analysis. The items in the rubric are anchored on the learner-centered principles and Danielson's Components of Professional Practice. The learner-centered principles are perspectives that emphasize the teacher's ability to facilitate the learners in their learning, the learning in the programs, and other processes that involve the learner (Magno & Sembrano, 2007; McCombs, 1997). On the other hand, Danielson's Components of Professional Practice identify aspects of teachers' responsibilities that have been documented through empirical studies and theoretical research as promoting improved student learning (Danielson, 1996).
The framework is divided into 22 components clustered into four domains of teaching (planning and preparation, classroom environment, instruction, and professional responsibility). The theoretical combination of the learner-centered principles and the components of professional practice in one framework was discussed in the study by Magno and Sembrano (2009, p. 168). The amalgamation was further described as a combination of aspects of the teaching and learning process. Moreover, this amalgamation is representative of the teaching and learning process assessed in higher education.

Method

Participants

The participants in the study were 183 randomly selected teachers in a higher education institution in Manila, Philippines. These teachers had finished master's or doctoral degrees, and some were still completing them. They taught in five different major areas: multidisciplinary studies; management and information technology; hotel, restaurant, and institutional management; design and
arts; and deaf studies. A proportion of faculty was randomly selected from each school to serve as ratees.

Instrument

The criteria used in the Peer Assistance and Review Form (PARF) were based on the four domains and underlying components of Danielson's Components of Professional Practice. The descriptions for each criterion and the four gradations of responses are also framed within the learner-centered principles. The gradations of responses for each criterion were established based on the descriptions of each domain and its components. The descriptions were confirmed and reviewed by higher education teachers and administrators through a focus group discussion (FGD). The faculty members invited as reviewers arrived at a consensus on the rating categories according to their suitability for ideal teaching and facilitation of learning in higher education. The FGD was facilitated by allowing the participants to determine whether the provided descriptors in the rubric were applicable to them, relevant to their teaching, phrased appropriately, consistent in meaning for different users, and likely to have a wide variety of uses. The revised rubric was distributed to all teaching faculty, who judged whether the items were relevant to their teaching. A copy of the revised version of the PARF was given to experts in the field of performance assessment, specifically teacher performance assessment. The reviewers were given one week to accomplish the task. The definitions of the components and the purpose of the PARF were also provided to guide the reviewers in judging whether the criteria were relevant. After receiving the forms with comments, the instrument was revised once again. The pretested instrument was composed of 88 items across the four domains: planning and preparation (25 items), classroom environment (21 items), instruction (22 items), and professional responsibility (20 items).
Each item is rated using an analytic rubric on a four-point scale (4 = exemplary, 3 = successful, 2 = limited, 1 = poor).

Procedure

Before the actual observations commenced, the selected faculty who served as raters were oriented on the process of conducting the peer assistance and review and on how to accomplish the forms. The orientation was meant to train the faculty on the purpose, importance, and specific processes involved in peer assistance and review, and it was conducted before the start of the term. After the training, each ratee was informed of the schedule on which they would be observed and rated. Each ratee was provided with a copy of the PARF in advance to prepare them for the actual observation. The faculty members serving as ratees were informed that the purpose of the observation was simply to test the instrument; it would have no impact on administrative evaluation or salary. The observation took place within class periods across the whole term. The raters visited and communicated with the ratee several times to complete
evidence for the scale. These visits and meetings were conducted outside of the classroom. The ratee was requested to provide a syllabus and other pertinent documents during the period of observation for the raters' reference. A detailed implementation guide for the observation was provided to the ratees and raters. During the period of observation, the ratee was requested to refrain from giving exams, writing activities, group work, reporting, and similar activities that would consume the entire period. This was to ensure that there would be some teaching samples to be observed and rated. In the observation period, there were two raters for each faculty member: a primary and a secondary rater. This procedure was done to establish the concordance of the ratings. If there was no common time for the raters, the observations could take place in different periods. Each rater observed the same teacher in the same class. The data from the pretest were encoded and analyzed for reliability and validity. Acceptable items were determined using the polytomous Rasch model (rating scale analysis) by assessing item fit (Andrich, 1978). The approach is a probabilistic measurement model for sequential integer scores such as a Likert scale. The WINSTEPS software was used to generate the results of the polytomous Rasch model. The PARF criteria with inadequate fit were revised.

Results

The data from N = 183 teachers were used to analyze the reliability and validity of the PARF. Each ratee was rated by a primary and a secondary rater. Missing values in the data were treated using mean replacement, and descriptive statistics including means, standard deviations, skewness, and kurtosis were obtained. Reliability was estimated using Cronbach's alpha. Convergent validity of the rating scale was established by correlating the factor scores within each rater and between the two raters. The polytomous Rasch model was used to investigate the step calibration of the scale and the fit of the items.
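As a concrete illustration of the reliability computations just listed, the following is a minimal sketch, not the study's actual code: mean replacement of missing values and Cronbach's alpha over a ratee-by-item score matrix (function names and toy data are illustrative assumptions).

```python
import numpy as np

def mean_replace(scores):
    """Replace missing values (NaN) with the corresponding item's mean,
    as described for the PARF pretest data."""
    scores = np.array(scores, dtype=float)
    item_means = np.nanmean(scores, axis=0)          # mean per item (column)
    rows, cols = np.where(np.isnan(scores))
    scores[rows, cols] = item_means[cols]
    return scores

def cronbach_alpha(scores):
    """Cronbach's alpha for a ratee-by-item score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_var_sum = scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return k / (k - 1) * (1 - item_var_sum / total_var)
```

With perfectly consistent items the coefficient reaches 1.0; values near the .98 reported below indicate that items rise and fall together across ratees.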
The factor structure of the theoretical model was tested using Confirmatory Factor Analysis (CFA). Parceled solutions have been shown to produce less bias in estimates of structural parameters under normal distributions than solutions based on individual items (Bandalos, 2002). The means for the ratings given by the primary and secondary raters (M = 3.40 and M = 3.41) are high given the highest possible score of 4.0. The means provided by both raters are almost the same, indicating that the ratings were very consistent. The distributions of scores tend to be negatively skewed with pronounced peaks. This is consistent with the high means: the majority of the ratings were between 3 and 4, and very few were 1.00. The overall internal consistency of the scores, using Cronbach's alpha, for both primary and secondary raters is .98, which indicates high reliability. For primary raters alone the internal consistency is .98, and for secondary raters alone it is .97, which also indicates high reliability. When the ratings of the primary and secondary raters were tested for concordance, the correlation was significant (ρ = .47, p < .001). This shows that consistent ratings were obtained by the primary and secondary raters at a
moderate level. This significant concordance also indicates the reliability of the scale across two external raters, implying a shared understanding of the items and observations for the same teacher being rated. The means, standard deviations, and internal consistencies were broken down by the domains of the instrument, and the results were still consistent across the primary and secondary raters. The mean ratings were still high (M = 3.38 to M = 3.45). This shows that even across domains, the ratings between the primary and secondary raters for one teacher were consistent. In the same way, Cronbach's alpha for each domain showed very high internal consistency. The convergence of the domains was tested within the same rater and between the primary and secondary raters.

Table 1

Convergence of the Domains for Primary and Secondary Raters


                                          Secondary Rater
Primary Rater                     1        2        3        4        M     SD   Cronbach's alpha
1. Planning and preparation       --     .76**a   .83**a   .64**a   3.45   0.36       .94
2. Classroom environment        .85**b     --     .82**a   .66**a   3.38   0.37       .93
3. Instruction                  .88**b   .87**b     --     .67**a   3.38   0.37       .94
4. Professional responsibility  .76**b   .73**b   .79**b     --     3.40   0.38       .93
M                               3.45     3.38     3.39     3.39
SD                              0.32     0.33     0.32     0.33
Cronbach's alpha                 .93      .92      .92      .91

Note. Values marked a are correlations among the secondary raters; values marked b are correlations for the primary raters. **p < .01.

The four domains of the scale attained convergence, showing significant and strong correlations as rated by the primary rater (r values range from .73 to .88). The strongest correlations occur between instruction and the first two domains (planning and preparation and classroom environment). Professional responsibility did not correlate as strongly with the other three domains as planning and preparation, classroom environment, and instruction did with one another. When the four domains were tested for convergence as rated by the secondary raters, the same pattern was observed as for the primary raters. Significant and strong correlations were attained for the four domains (r values range from .67 to .82). There is also a stronger correlation among the first three domains (planning and preparation, classroom environment, and instruction) than between professional responsibility and the rest of the domains. The consistent patterns of convergence between the primary and secondary raters reflect their consistent ratings of the same teachers. These results are consistent with the similar levels of mean ratings.
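The within-rater convergence described above amounts to a Pearson correlation matrix over domain scores. A minimal sketch, assuming a ratee-by-domain score matrix (function name and toy data are illustrative, not study data):

```python
import numpy as np

def domain_correlations(domain_scores):
    """Pearson correlation matrix among domain scores.

    domain_scores: ratees x domains array, e.g. the four PARF domain
    means for each observed teacher (toy values here)."""
    return np.corrcoef(np.asarray(domain_scores, dtype=float), rowvar=False)

# Example: two toy domains that rise together across three ratees
toy = [[3.1, 3.0], [3.5, 3.4], [3.9, 3.8]]
r = domain_correlations(toy)
```

The off-diagonal entries of `r` correspond to the convergence coefficients reported for each rater in Table 1.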


Table 2

Convergence of the Domains between Primary and Secondary Raters


                                              Secondary Rater
Primary Rater                   Planning and    Classroom     Instruction   Professional
                                Preparation     Environment                 Responsibility
Planning and Preparation           .41**           .41**         .45**          .33**
Classroom Environment              .38**           .45**         .44**          .29**
Instruction                        .37**           .39**         .46**          .30**
Professional Responsibility        .39**           .35**         .43**          .41**

**p < .01

The convergence of the domains was also tested between the primary and secondary raters. All correlations were significant at a lower margin of error (α = .01), but the strength of the relationships was moderate. The overall concordance between the primary and secondary raters is ρ = .47, which is also consistent with the correlations across the four domains. This indicates that it is more difficult to attain strong correlations externally (across raters) than internally (within the same rater). To investigate the functioning of the items in the scale, the polytomous Rasch model was used. The scale categories (4-point scale) were first analyzed to determine the thresholds. Higher scale categories must reflect higher measures and lower categories lower measures, producing a monotonic increase in threshold values. The average step calibrations are -5.11, -2.73, .63, and 3.84 for the primary rater and -4.59, -2.31, 1.44, and 5.17 for the secondary rater. All average step calibrations increase monotonically, indicating that the 4-point scale for each factor attained scale ordering, with a high probability of observing each scale category.
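The category structure check can be illustrated with the Rating Scale Model (Andrich, 1978), under which category probabilities follow from a person measure, an item measure, and shared step calibrations. The sketch below uses an illustrative parameterization, not WINSTEPS output; the paper reports four average values per rater, and whether these are read as thresholds or as average category measures, the monotonicity check is the same.

```python
import numpy as np

def category_probs(theta, delta, taus):
    """Rating Scale Model category probabilities for person measure
    theta, item measure delta, and step calibrations taus (one tau per
    step between adjacent categories)."""
    taus = np.asarray(taus, dtype=float)
    # Log-odds numerators for categories 0..m (category 0 pinned at 0).
    logits = np.concatenate(([0.0], np.cumsum(theta - delta - taus)))
    expo = np.exp(logits - logits.max())  # stabilized softmax
    return expo / expo.sum()

def steps_increase_monotonically(taus):
    """True when step calibrations are strictly increasing, the ordering
    the text requires of a well-functioning scale."""
    return bool(np.all(np.diff(np.asarray(taus, dtype=float)) > 0))
```

Applied to the reported primary-rater values, `steps_increase_monotonically([-5.11, -2.73, 0.63, 3.84])` confirms the ordering claimed in the text.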


Item fit mean square (MNSQ) statistics from WINSTEPS were computed to determine whether the items under each domain have a unidimensional structure. MNSQ
INFIT values between 0.8 and 1.2 are acceptable. High item MNSQ values indicate a lack of construct homogeneity with the other items in a scale, whereas low values indicate redundancy with other items (Linacre & Wright, 1998). Two Rasch analyses were conducted separately for the ratings provided by the primary and secondary raters. For the primary rater, four items lacked construct homogeneity, which means they do not measure the same construct as the other items. These items concern service to the school, participation in college-wide activities, enhancement of content-knowledge pedagogy, and service to the profession. On the other hand, six items were redundant with other items. These items concern instructional materials, lesson and unit structure, quality of feedback, timeliness of feedback, lesson adjustment, and student progress in learning. For the secondary rater, eight items lacked construct homogeneity. These items concern student interaction, importance of content, student pride in work, quality of questions, engagement of families and student services, service to the school, participation in college-wide activities, and service to the profession. On the other hand, three items were redundant with other items: quality of feedback, timeliness of feedback, and lesson adjustment. A Confirmatory Factor Analysis (CFA) was conducted to examine the factor structure of Danielson's Components of Professional Practice as a four-factor scale. The first model tested a four-factor structure in which the indicators or manifest variables were the actual items (the ratings of the primary and secondary raters for each item were averaged). There were 25 items for planning and preparation, 21 items for classroom environment, 22 items for instruction, and 20 items for professional responsibility.
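The infit screening rule described above reduces to a simple three-way classification. A sketch, with the 0.8-1.2 thresholds taken from the text and the function name an illustrative assumption:

```python
def flag_items_by_infit(infit_mnsq, lo=0.8, hi=1.2):
    """Classify items by infit mean-square: values above hi suggest a
    lack of construct homogeneity (misfit); values below lo suggest
    redundancy with other items; values in between are acceptable."""
    flags = {"misfit": [], "redundant": [], "acceptable": []}
    for idx, value in enumerate(infit_mnsq):
        if value > hi:
            flags["misfit"].append(idx)
        elif value < lo:
            flags["redundant"].append(idx)
        else:
            flags["acceptable"].append(idx)
    return flags
```

Items flagged as misfitting are candidates for removal or revision, while redundant items add little information beyond their neighbors.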
The results of the measurement model showed that the four factors are significantly related and that all 88 indicators had significant paths to their respective factors. However, the data did not fit this model, χ2 = 8829.23, df = 3734, PGI = .57, Bentler-Bonett Normed Fit Index = .46, Bentler-Bonett Non-Normed Fit Index = .56. A second measurement model was constructed retaining the four factors but with fewer parameter estimates. This was done by creating three parcels as indicators for each factor, where the parcels were created by combining item scores for both the primary and secondary raters. With fewer indicators per factor, the df in the second analysis was reduced to 132, which yielded greater statistical power and better model fit. The results of the second analysis showed that all four factors are significantly correlated and that each parcel loading is also significant. The fit of the model improved compared with the more heavily parameterized measurement model, χ2 = 262.47, df = 132, PGI = .86, Bentler-Bonett Normed Fit Index = .89, Bentler-Bonett Non-Normed Fit Index = .87. The results of the CFA showed that the four-factor structure of Danielson's Components of Professional Practice is adequate and can be used.
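Item parceling as used in the second model can be sketched as follows. The round-robin assignment of items to parcels is one common scheme and an assumption here; the paper does not specify how items were allocated to its three parcels.

```python
import numpy as np

def make_parcels(item_scores, n_parcels=3):
    """Average items into parcels to serve as CFA indicators, reducing
    the number of estimated parameters (cf. Bandalos, 2002).

    item_scores: ratees x items array for one factor. Items are dealt
    round-robin into n_parcels parcels and each parcel score is the
    mean of its assigned items."""
    item_scores = np.asarray(item_scores, dtype=float)
    parcels = [item_scores[:, j::n_parcels].mean(axis=1)
               for j in range(n_parcels)]
    return np.column_stack(parcels)
```

For a factor with 25 items and three parcels, each factor contributes three indicators instead of 25, which is what drove the df down to 132 in the second model.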


Discussion

A rating scale anchored on Danielson's Components of Professional Practice and learner-centered principles was constructed to rate teachers' performance. The analysis involved statistics to determine the internal consistency, convergence, item fit, and factor structure of the scale. These analyses yielded favorable results regarding the validity and reliability of the scale. For the scale's internal consistency, the obtained Cronbach's alpha was high for the ratings provided by the primary and secondary raters, both for the whole scale and for each factor. Internal consistency of the items was achieved in a similar fashion for the two raters. This indicates that the scale is measuring the same
overall construct. When the internal consistency of the items was computed for each domain, high Cronbach's alphas were also obtained. Even with the reduced number of items within each factor, high internal consistency was still achieved. When the primary and secondary raters were tested for concordance on the same observations, a significant coefficient was obtained (ρ = .47). There is consistency of ratings across two separate raters. This consistency reflects a common understanding of the items' meaning and of the teacher being observed. This is a good indicator for future use of the instrument, considering that actual implementation involves two or even multiple raters. These raters need to agree in their ratings of the same teacher to achieve a more consistent picture of the teacher's performance. This concordance is facilitated by the items, for which each rater had a common understanding and frame of assessment of the teacher being observed. When the concordance analysis was conducted for each domain, significant relationships occurred for the four factors across the two raters. The two raters not only have a consistent understanding and reference of observation for the whole scale; the same consistency carries over to each factor. The scale also showed convergence across the domains for each rater. The results show significant correlations among all four factors for both the primary and secondary raters, with the same pattern of correlations for each. The pattern showed that planning and preparation, classroom environment, and instruction were highly correlated. Even though all four factors were significantly correlated, the correlation coefficients for professional responsibility with the other factors were not as high as the coefficients among the first three. The same pattern of correlations holds for both the primary and secondary raters.
This shows that professional responsibility is not seen as highly linked to teaching compared with the first three domains (planning and preparation, classroom environment, and instruction). The raters and teachers do not seem to consider professional responsibility to be strongly integrated with classroom performance or translated into the actual teaching process to the degree that the first three domains are. The item analysis using the polytomous Rasch model showed that the items on student interaction, importance of content, student pride in work, quality of questions, engagement of families and student services, service to the school, participation in college-wide activities, enhancement of content-knowledge pedagogy, and service to the profession misfit compared with the other items. These items did not seem applicable to the majority of the teachers. There was agreement between the primary and secondary raters on this misfit, especially for three items (participation in college-wide activities, enhancement of content-knowledge pedagogy, and service to the profession). This was consistent with the convergence of the domains. Given these three items, the raters and teachers tend to view participation in college activities, attending seminars, and publication (items of professional responsibility) as weakly integrated with their teaching performance or their role in improving one's teaching. The item analysis using the polytomous Rasch model also showed that the items on instructional materials, lesson and unit structure, quality of feedback, timeliness of feedback, lesson adjustment, and student progress in learning
are redundant with the other items (they overfit, being rated in much the same way as the other items). The primary and secondary raters agreed on the redundancy of quality of feedback and lesson adjustment. These items were carefully reviewed again by the faculty, who agreed to remove them from the pool of items.

The adequacy of the four-factor structure was supported in the study. This shows that the four factors (planning and preparation, classroom environment, instruction, and professional responsibility) can serve as essential components in assessing teacher performance in higher education, and that the scale effectively measures four distinct domains. Previous studies using Danielson's components of professional practice were widely applied to teachers in elementary and high school; the present study showed the appropriateness of the domains for higher education institutions as well.

The results of the present study point to three perspectives on assessing teacher performance: (1) the need to inculcate professional responsibility, such as research and continuing education programs, among higher education faculty; (2) the advantage of an instrument with multiple raters; and (3) expectations that need to be set for higher education institutions in the Philippines.

Professional responsibility is an important part of higher education faculty's work requirements. However, the study found that service to the profession (such as research and publications), participation in school activities, and enhancement of pedagogy were less integrated with instruction among teachers in higher education. This scenario is typical in most higher education institutions, where the teacher's work is concentrated on teaching and professional responsibility is underrated. Once hired in a higher education institution, a teacher is defined by the teaching load given, and much of the expectation is placed on teaching.
The teacher's entire semester is devoted to teaching, and no time is provided for professional responsibility such as engaging in research, pursuing publication opportunities, and attending professional development activities. In other countries, by contrast, universities and colleges balance both teaching and research (Calma, 2009; Magno, 2009a). Colleges and universities in the Philippines give faculty limited opportunities and resources to conduct research and establish their own research laboratories and facilities. This is reflected in the very few Philippine universities entering, and the low standing of those that do, in the world university rankings by the Times Higher Education (Magno, 2009a). For other professional and pedagogical enhancements, the selection is very limited, and the funds provided for a faculty member to attend conferences within and outside the country are minimal. The same scenario holds for teachers in grade school and high school: much of the reward is for teaching and not for professional responsibilities such as research, publications, and involvement in professional organizations.

The strength of the Peer Assistance and Review Form developed in the study rests on the consistency obtained through multiple raters and the scale calibration procedure. The raters were consistent in their interpretations, ratings, and calibration of the scales. The calibration of the scale from lowest to highest in terms of degree is one aspect of scale fidelity that most test developers neglect to report (Magno, 2009b). This procedure can be estimated accurately using a Polytomous Rasch Model. A new perspective for rating scales is not only to establish its internal
consistencies and factorial structure but also to determine and report its scale calibration. The category structure allows scale developers to decide on the appropriateness of the scale length and the type of scale used. Another advantage that contributed to the results is the refined description of the scale, framed in an analytical rubric format (Reddy & Andrade, 2010). The raters can easily and precisely distinguish among the skills presented in each global criterion, which ensures the appropriate gradation of the scale.

The last perspective is the need to look further at the standards and competencies for higher education teachers. This issue is addressed in the study by testing specific competencies required of higher education teachers. These standards of competencies need to be set to ensure that students benefit through instruction (Berdrow & Evers, 2009). Colleges and universities need to adhere to teaching and learning frameworks that will serve and carry out their mission and vision well. Very few universities in the Philippines adhere to specific teaching and learning thrusts, which has led to poor educational standards (Magno, 2009a). In the Philippine setting, the competencies of teachers in basic education are specified; a similar move for higher education is not impossible, given the rich tradition of literature on higher education. The present study attempted to frame these competencies using an amalgamated framework of learner-centered practices and Danielson's components of professional practice (see Magno & Sembrano, 2009). This study pioneers the setting of specific teaching and learning frameworks for faculty in the Philippines. Rigorous assessment of teacher performance needs to be advocated in Philippine higher education institutions to ensure the accountability of graduates and the quality of faculty.
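The category-structure check described above, deciding whether a four-point scale's categories function properly, can be illustrated with the Andrich rating scale model. This is a minimal sketch with hypothetical step calibrations, not the PARF's actual estimates: when the step calibrations advance in order, each category is the most probable response somewhere along the trait continuum, which is the evidence of appropriate scale length and gradation.

```python
import numpy as np

def rsm_category_probs(theta, delta, taus):
    """Andrich rating scale model: probability of each category 0..m for a
    person at theta rating an item at delta, with step calibrations taus."""
    steps = np.concatenate(([0.0], np.cumsum(theta - delta - np.asarray(taus))))
    p = np.exp(steps - steps.max())   # subtract max for numerical stability
    return p / p.sum()

# Hypothetical, ordered step calibrations for a four-point scale (categories 0..3).
taus = np.array([-1.4, 0.1, 1.3])
thetas = np.linspace(-4, 4, 161)
probs = np.array([rsm_category_probs(t, 0.0, taus) for t in thetas])

# A well-functioning scale: every category is modal (most probable) somewhere
# along the continuum, which holds when the step calibrations are ordered.
modal = sorted(set(int(k) for k in probs.argmax(axis=1)))
ordered = bool(np.all(np.diff(taus) > 0))
print(f"modal categories: {modal}, steps ordered: {ordered}")
```

If the step calibrations were disordered (e.g., the second threshold below the first), some category would never be modal, signaling that the scale has too many points or poorly graded descriptors.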
Assessing teacher performance also needs to follow a developmental process in which results are used to help teachers reach specified expectations (Bernstein & Bass, 2005; Blackmore, 2005; Bruce & Ross, 2008; Kumrow & Dahlem, 2002; Reid, 2008; Yon, Burnap, & Kohut, 2002). This is carried out by having a good instrument to facilitate these benefits. The use of assessment instruments for rating teachers should coincide with practices that also help teachers improve their teaching. Having established a valid and reliable scale for teacher performance means that a proper and appropriate assessment tool can be used. Rigorous assessment of teacher performance is known to occur in basic education (grade school and high school) in the Philippine setting. There is very limited advocacy for sustaining teacher performance assessment and measures in Philippine higher education institutions because of the complexity of the work structure (involvement in research and professional development). However, the present study pushes these frontiers by providing an instrument with evidence of its appropriateness, making proper assessment practices possible among higher education faculty.
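The Rasch item screening discussed earlier, which flagged some items as misfitting and others as redundant, is typically based on mean-square fit statistics. The sketch below is a simplified illustration with simulated data and hypothetical step calibrations (not the study's estimates): it computes the outfit mean-square for responses that follow the rating scale model (outfit near 1) versus responses unrelated to the measured trait (outfit well above 1, the kind of misfit that flags an item as inapplicable).

```python
import numpy as np

def rsm_probs(theta, delta, taus):
    """Andrich rating scale model category probabilities (categories 0..m)."""
    steps = np.concatenate(([0.0], np.cumsum(theta - delta - np.asarray(taus))))
    p = np.exp(steps - steps.max())
    return p / p.sum()

def outfit_msq(responses, thetas, delta, taus):
    """Outfit mean-square for one item: mean squared standardized residual.
    Values near 1 are model-consistent; values well above 1 flag misfit,
    and values well below 1 flag redundancy (overfit)."""
    cats = np.arange(len(taus) + 1)
    z2 = []
    for x, th in zip(responses, thetas):
        p = rsm_probs(th, delta, taus)
        e = (cats * p).sum()              # model-expected rating
        v = ((cats - e) ** 2 * p).sum()   # model variance of the rating
        z2.append((x - e) ** 2 / v)
    return float(np.mean(z2))

rng = np.random.default_rng(1)
thetas = rng.normal(size=500)          # hypothetical teacher measures
taus = [-1.0, 0.0, 1.0]                # hypothetical step calibrations, 4-point scale
model_resp = np.array([rng.choice(4, p=rsm_probs(t, 0.0, taus)) for t in thetas])
noise_resp = rng.integers(0, 4, size=500)   # ratings unrelated to the trait
fit_ok = outfit_msq(model_resp, thetas, 0.0, taus)
fit_bad = outfit_msq(noise_resp, thetas, 0.0, taus)
print(f"model-consistent outfit={fit_ok:.2f}, noise outfit={fit_bad:.2f}")
```

Winsteps-style analyses (Linacre & Wright, 1998) report these same infit/outfit diagnostics, which is how items such as service to the profession (misfit) and quality of feedback (redundancy) would be identified for review.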

References

Allison-Jones, L. L., & Hirt, J. B. (2004). Comparing the teaching effectiveness of part-time and full-time clinical nurse faculty. Nursing Education Perspectives, 25, 238-242.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Anonymous. (2006). Standards-based teacher evaluations. Gifted Child Today, 29, 8-9.
Atwood, C. H., Taylor, J. W., & Hutchings, P. A. (2000). Why are chemists and other scientists afraid of the peer review of teaching? Journal of Chemical Education, 77, 239-244.
Bandalos, D. L. (2002). The effects of item parceling on goodness-of-fit and parameter estimate bias in structural equation modeling. Structural Equation Modeling, 9, 78-102.
Berdrow, I., & Evers, F. T. (2009). Bases of competence: An instrument for self and institutional assessment. Assessment and Evaluation in Higher Education, 35, 419-434.
Bernstein, D., & Bass, R. (2005). The scholarship of teaching and learning. Academe, 91, 37-44.
Blackmore, J. A. (2005). A critical evaluation of peer review via teaching observation within higher education. The International Journal of Educational Management, 19, 215-320.
Bruce, C. D., & Ross, J. A. (2008). A model for increasing reform implementation and teacher efficacy: Teacher peer coaching in grades 3 and 6 mathematics. Canadian Journal of Education, 31, 346-370.
Calma, A. (2009). The context of research training in the Philippines: Some key areas and their implications. The Asia-Pacific Education Researcher, 18, 167-184.
Carter, V. K. (2008). Five steps to become a better peer reviewer. College Teaching, 56, 85-90.
Centra, J. A. (1998). The development of the student instructional report II. Princeton, NJ: Educational Testing Service.
Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development.
Goldstein, J. (2003). Making sense of distributed leadership: The case of peer assistance and review. Educational Evaluation and Policy Analysis, 25, 397-421.
Goldstein, J. (2004). Making sense of distributed leadership: The case of peer assistance and review. Educational Evaluation and Policy Analysis, 26, 173-197.
Gosling, D. (2002). Models of peer observation of teaching. LTSN Generic Centre.
Graves, G., Sulewski, C. A., Dye, H. A., Deveans, T. M., Agras, N. M., & Pearson, J. M. (2009). How are you doing? Assessing effectiveness in teaching mathematics. Primus, 19, 174-193.

Heckert, T. M., Latier, A., Ringwald, A., & Silvey, B. (2006). Relation of course, instructor, and student characteristics to dimensions. College Student Journal, 40, 1-11.
Howard, F. J., Helms, M. M., & Lawrence, E. P. (1997). Development and assessment of effective teaching: An integrative model for implementation in schools of business administration. Quality Assurance in Education, 5, 159-161.
Keig, L. (2000). Formative peer review of teaching: Attitudes of faculty at liberal arts colleges toward colleague assessment. Journal of Personnel Evaluation in Education, 14, 67-87.
Kell, C., & Annetts, S. (2009). Peer review of teaching: Embedded practice or policy-holding complacency? Innovations in Education and Teaching International, 46, 61-70.
Kerchner, C. T., & Koppich, J. E. (1993). A union of professionals: Labor relations and education reform. New York: Teachers College Press.
Kumrow, D., & Dahlem, B. (2002). Is peer review an effective approach for evaluating teachers? The Clearing House, 75, 236-240.
Linacre, J. M., & Wright, B. D. (1998). A user's guide to Winsteps, Bigsteps, and Ministeps: Rasch-model computer programs. Chicago: MESA Press.
Louis, K. S., & Marks, H. M. (1998). Does professional community affect the classroom? Teachers' work and student experience in restructuring schools. American Journal of Education, 106, 532-575.
Magno, C. (2009a). A metaevaluation study on the assessment of teacher performance in an assessment center in the Philippines. The International Journal of Educational and Psychological Assessment, 3, 75-93.
Magno, C. (2009b). Demonstrating the difference between classical test theory and item response theory using derived test data. The International Journal of Educational and Psychological Assessment, 1, 1-11.
Magno, C., & Sembrano, J. (2007). The role of teacher efficacy and characteristics on teaching effectiveness, performance, and use of learner-centered practices. The Asia-Pacific Education Researcher, 16, 73-91.
Magno, C., & Sembrano, J. (2009). Integrating learner-centeredness and teaching performance in a theoretical model. International Journal of Teaching and Learning in Higher Education, 21(2), 158-170.
Marsh, H. W., & Bailey, M. (1993). Multidimensional students' evaluations of teaching effectiveness. The Journal of Higher Education, 64, 1-18.
McCombs, B. L. (1997). Self-assessment and reflection: Tools for promoting teacher changes toward learner-centered practices. NASSP Bulletin, 81, 1-14.
McLymont, E. F., & da Costa, J. L. (1998, April). Cognitive coaching: The vehicle for professional development and teacher collaboration. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Oakland, T., & Hambleton, R. K. (2006). International perspectives on academic assessment. New York: Springer.

Pike, C. K. (1998). A validation study of an instrument designed to measure teaching effectiveness. Journal of Social Work Education, 34, 261-272.
Reddy, Y. M., & Andrade, H. (2010). A review of rubric use in higher education. Assessment and Evaluation in Higher Education, 35, 435-448.
Reid, E. S. (2008). Mentoring peer mentors: Mentor education and support in the composition program. Composition Studies, 36, 1-31.
Ross, J. A., McDougall, D., & Hogaboam-Gray, A. (2002). Research on reform in mathematics education, 1993-2000. Alberta Journal of Educational Research, 48, 122-138.
Scriven, M. (1994). Duties of the teacher. Journal of Personnel Evaluation in Education, 8, 151-184.
Stiggins, R. (2008). Assessment for learning, the achievement gap, and truly effective schools. Portland, OR: ETS Assessment Training Institute.
Stolle, C., Goerss, B., & Watkins, M. (2005). Implementing portfolios in a teacher education program. Issues in Teacher Education, 14, 25-34.
Stringer, M., & Irwing, P. (1998). Students' evaluations of teaching effectiveness: A structural modelling approach. British Journal of Educational Psychology, 68, 409-511.
Tang, L. T. (1997). Teaching evaluation at a public institution of higher education: Factors related to the overall teaching effectiveness. Public Personnel Management, 26, 379-380.
Wen, M. L., Tsai, C., & Chang, C. (2006). Attitudes towards peer assessment: A comparison of the perspectives of pre-service and in-service teachers. Innovations in Education and Teaching International, 43, 83-93.
Wray, S. (2008). Swimming upstream: Shifting the purpose of an existing teaching portfolio requirement. Professional Educator, 32, 1-17.
Yon, M., Burnap, C., & Kohut, G. (2002). Evidence of effective teaching: Perceptions of peer reviewers. College Teaching, 50, 104-111.
Young, S., & Shaw, D. G. (1999). Profiles of effective college and university teachers. The Journal of Higher Education, 70, 670-687.

About the Author

Dr. Carlo Magno is presently a faculty member of the Counseling and Educational Psychology Department at De La Salle University, Manila. Most of his research focuses on the development of different forms of teacher assessment protocols. He is involved with several projects on the assessment of teacher competencies in the Philippines. Further correspondence can be addressed to him at the College of Education, De La Salle University, 2401 Taft Ave., Manila, Philippines; e-mail: crlmgn@yahoo.com.

2012 Time Taylor Academic Journals ISSN 2094-0734