Reliability Issues in Performance Analysis
Peter O’Donoghue
Centre for Performance Analysis, School of Sport, University of Wales Institute
Cardiff, Cyncoed Campus, Cardiff, Wales, CF23 6XD, UK.
Abstract
1.0. Introduction.
Most performance analysis methods do not involve fully automated data capture
techniques. Therefore, human error during data gathering can limit the reliability of
the methods used. Even where fully automated data gathering is possible, there may
be algorithmic, optical or other limitations in data gathering techniques that limit the
reliability of data collected. Therefore, reliability evaluation is essential so that the
information produced by performance analysis can be interpreted with a full
understanding of the measurement error involved. This is not just the case in scientific
research, but also where the information is being used in coaching, media or judging
contexts to make important decisions.
Hughes et al. (2001) surveyed published performance analysis research, finding many
investigations where the reliability of data collection was not reported and other
investigations where the reliability test made inappropriate use of statistical
techniques. Many performance analysts find reliability testing a difficult area,
particularly the use of statistical analysis techniques in order to evaluate reliability.
Guidance on reliability statistics has been limited to percentage error (Hughes et al.,
2004) and, more recently, confidence intervals (Cooper, 2006). There are related areas
of data collection in performance analysis that also have an impact on reliability,
including the use of operational definitions and the validity of performance
indicators. At the recent World Congress of Performance Analysis of Sport 7 in
Szombathely, Hungary, there was a workshop on reliability issues in performance
analysis. At this workshop, the author of the current paper raised seven
questions arising from concerns over accepted reliability evaluation practice in
performance analysis. The purpose of the current paper is to provide a record of these
questions and to propose a framework for reliability assessment during system
development and subsequent operation. In raising these questions, examples are used
based on previously published results, data from previous investigations and some
original data. The seven questions are:
It is essential for system operators and the eventual consumers of the information
generated by performance analysis to have a shared understanding of the variables
used. Therefore, these variables should be defined with a level of precision that makes
their meaning unambiguous. To ensure the variables are also valid, Hughes and
Bartlett (2002) recommend the use of “performance indicators”. Performance
indicators are not just variables relevant to sports performance but are also valid
measures associated with the level of performance. Performance indicators should also
have the metric property of possessing a means of interpretation (O’Donoghue, 2006).
The operational definitions used in performance analysis do not have the detail seen
in legal contracts. When it is essential to understand the requirements for
software-intensive systems, rigorous formal specification techniques are used, based on set
theory, mapping functions and temporal logic (O’Donoghue and Murphy, 1996). It
may not be wise for operational definitions in performance analysis systems to
achieve such levels of unambiguous detail. Performance analysis is often a real-time
observational task, leaving insufficient time to consider the level of detail found in a
legal contract, which is inspected carefully over a much longer time scale before
determining whether it has been conformed to or breached.
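To illustrate the contrast in rigour, a possession change might be specified set-theoretically along the following lines. This is a hypothetical fragment for illustration only; the names Team, poss and change are not taken from O’Donoghue and Murphy (1996).

```latex
% Hypothetical set-theoretic sketch of a "possession change" event.
% Team, poss and change are illustrative names, not from the cited work.
\[
  \mathit{Team} = \{A, B\}, \qquad \mathit{poss} : [0, T] \rightarrow \mathit{Team}
\]
\[
  \mathit{change}(t) \;\Leftrightarrow\;
  \exists\, \epsilon > 0 \;.\; \forall\, s \in (t - \epsilon, t) \;.\;
  \mathit{poss}(s) \neq \mathit{poss}(t)
\]
```

Even this toy fragment forces decisions, such as how possession is defined at every instant of the match, that a real-time observational task cannot afford to resolve at this level of precision.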
Consider the hand notation system used to analyse rugby union developed by
McCorry et al. (1996) and also used by Hunter and O’Donoghue (2001). This system
involved tallying named events such as “possession gained” during “set play” or
“loose play” as well as gaining territory by going “over”, “through” or “around” the
opposition. These events were not defined in sufficient detail to satisfy many
performance analysts and so one of the objectives of Armitage’s (2006) study of the
2003 Rugby World Cup was to provide precise operational definitions for these terms
prior to testing system reliability and using the system to investigate the differences
between winning and losing teams in the knockout stages. The operational definitions
came to eight double-sided pages of Armitage’s (2006) thesis, with terms introduced
within the operational definitions, such as “gain line”, also being defined. Armitage
and his supervisor (the author of the current paper) participated in an inter-operator
reliability study. First, the operational definitions were read, discussed and
explained before the final of the 2003 Rugby World Cup was analysed independently
by the two observers. Table 1 shows the frequency of different methods of gaining
territory employed by the two teams as recorded by the two observers and Table 2
shows the possession changes recorded. Irrespective of whether the disagreements
illustrated in Tables 1 and 2 are considered as percentage errors or absolute
differences in frequencies, they reveal serious limits in reliability despite the efforts
made prior to the inter-operator agreement study commencing. The two observers
discussed the reasons why they disagreed so much and what each of them was
counting for each type of event. It was concluded that it would have been useful to
discuss the operational definitions while viewing example video sequences, so that the
definitions would be considered in terms of the analysis task, an observational task.
This would have helped both observers learn which observed behaviours would be
counted for each class of recorded event, and for what reasons. The experience shows
that agreeing on the wording of operational definitions is not sufficient: observers
must also understand the definitions fully.
Table 1. Results for methods of gaining territory during Armitage’s (2006) inter-operator reliability study.

Method of            England                   Australia
gaining territory    Operator 1   Operator 2   Operator 1   Operator 2
Around               23           9            11           4
Over                 22           23           16           28
Through              11           34           3            13
Total                56           66           30           45
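For illustration, the disagreements in Table 1 can be expressed as percentage errors. The sketch below (Python) uses 100 × |V1 − V2| / mean(V1, V2), one common variant of the percentage error calculation; this particular formula is an assumption rather than necessarily the exact calculation of Hughes et al. (2004).

```python
# Percentage error between the two operators' event frequencies (Table 1).
# The formula 100 * |v1 - v2| / mean(v1, v2) is one common variant and is
# an assumption, not necessarily the calculation of Hughes et al. (2004).
table1 = {
    # method: (England op1, England op2, Australia op1, Australia op2)
    "Around":  (23, 9, 11, 4),
    "Over":    (22, 23, 16, 28),
    "Through": (11, 34, 3, 13),
}

def pct_error(v1: int, v2: int) -> float:
    """Absolute difference expressed as a percentage of the mean."""
    mean = (v1 + v2) / 2
    return 100 * abs(v1 - v2) / mean if mean else 0.0

for method, (e1, e2, a1, a2) in table1.items():
    print(f"{method:8s} England {pct_error(e1, e2):5.1f}%  "
          f"Australia {pct_error(a1, a2):5.1f}%")
```

For “Around”, this gives errors of 87.5% (England) and 93.3% (Australia), underlining how serious the disagreements were.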
There are some studies where the nature of what is being observed cannot be
described precisely or practically in words. Consider an investigation of defensive
styles used by international netball teams (Williams and O’Donoghue, 2006) that
characterised the defensive style used during opposition possessions as man-to-man,
zonal, part man-to-man / part zonal or other. Very brief operational definitions were
provided for these, as follows:
• Zone defence – where all players concerned in the area of play analysed are
marking the space.
• Man-to-man defence – where all players concerned in the area of play
analysed are marking a player.
• Part man-to-man / part zone defence – where some players concerned in the
area of play analysed are marking the space and some players concerned are
marking a player.
• Other – where the defence could not be classified into one of the man-to-man
or zonal strategies.
Despite the vagueness of these definitions, 100% agreement was achieved when two
independent operators used the system to analyse a single quarter of netball. One
reason for the level of agreement obtained was that both observers were experienced
netball players and coaches who were very familiar with the terms used and the types
of defensive play that counted for each. This shows that an understanding of the
behaviours being counted is more important than agreement on the wording of
operational definitions. Defending in netball is undertaken by a team
of 7 players in an attempt to regain possession of the ball. There are complex
movement patterns dictated by the defensive style adopted and the possession play of
the opposition with the ball. Describing all of the possible behaviours that
characterise each defensive style or what distinguishes the defensive styles would
have been impossible. The contrasting experiences of Armitage (2006) and Williams
and O’Donoghue (2006) show that agreement on operational definition wording is not
sufficient for reliable observation and that good knowledge of the behaviours being
observed is essential.
acceptable reliability. A p value above 0.05 means there is no significant difference
between the observations, but if the p value is less than 0.50 then the observations are
more different than similar. Intra-operator agreement does still have a role in system
development during piloting as the experience of using the system can identify
necessary changes that can be made before undertaking an inter-operator agreement
study.
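One plausible way of producing p values such as those plotted below is to treat the two observations’ frequency profiles as the rows of a contingency table and apply a chi square test. The sketch below (Python with SciPy) takes this 2 × k contingency approach; this is an assumption, as the exact test behind the plotted values is not specified here.

```python
# Sketch: chi-square comparison of two observations of one performance.
# The two frequency profiles form the rows of a 2 x k contingency table.
from scipy.stats import chi2_contingency

obs1 = [23, 22, 11]   # illustrative event frequencies (Table 1, England, op. 1)
obs2 = [9, 23, 34]    # the second observation (Table 1, England, op. 2)

chi2, p, dof, expected = chi2_contingency([obs1, obs2])
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```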
[Figure: p values (from chi square tests) plotted against observation (T1 to T5); the plotted values are 0.52, 0.76, 0.83, 0.95 and 0.96.]
There are statistical methods used to establish reliability that may not be appropriate
for the particular systems under consideration. Consider a time motion analysis
system used to record the percentage of time a soccer player spends in a stationary
position, walking, jogging, running, on-the-ball activity and any other high intensity
activity. Using chi square or a correlation coefficient to compare the percentage of
time distributed between these different movement types between two different
observations of the same performance may not give an indication of reliability. Most
players spend 10% to 15% of the match in a stationary position, 40% to 55% of the
match walking, 15% to 25% of the match jogging, about 5% of the match running,
less than 5% of the match performing on the ball activity and the remaining time
performing other high intensity activity. Therefore, there will be high correlations
and high p values (>0.90) even when looking at completely different players. For a
reliability statistic to have construct validity, it must produce a different range of
values when trained observers reliably analyse the same performance than when
two totally different performances are compared. When O’Donoghue et al.
(2005) developed the POWER system for time-motion analysis, they found that
values of Pearson’s r, percentage error and chi square produced when analysing
different subjects could indicate agreement as strong as, or stronger than, when the
same subject was being analysed by
an experienced observer. However, the kappa statistic achieved construct validity
with values of under 0.2 (very poor to poor strength of agreement) being obtained
when analysing completely different performances and values of over 0.6 (good to
very good strength of agreement) being obtained when the same performance was
analysed by an experienced user. Kappa might not be the best reliability statistic for
all systems and performance indicators. However, it is recommended that efforts are
made to use statistical procedures for reliability that do exhibit construct validity.
This can be tested by applying them to pairs of observations where the same
performance is being observed and pairs of observations from completely different
performances. Any reliability statistic showing the “known group difference”
between these two different types of pair of observation with no overlap in values can
be deemed to have the property of construct validity as a reliability statistic.
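A simple way of checking this property is to compute the candidate statistic for both types of observation pair, as in the simulation sketch below (Python). The movement category proportions and the 5% disagreement rate are illustrative assumptions and are not data from O’Donoghue et al. (2005).

```python
# Sketch: testing the construct validity of kappa as a reliability statistic.
import random
from collections import Counter

CATS = ["stationary", "walk", "jog", "run", "on_ball", "high_int"]
PROBS = [0.12, 0.48, 0.20, 0.05, 0.04, 0.11]  # illustrative soccer profile

def performance(n=5400):
    """Simulate one performance as one movement category per second."""
    return random.choices(CATS, weights=PROBS, k=n)

def noisy_copy(seq, error=0.05):
    """A second observation of the same performance with ~5% disagreement."""
    return [random.choice(CATS) if random.random() < error else c for c in seq]

def kappa(a, b):
    """Cohen's kappa for two equal-length category sequences."""
    n = len(a)
    p0 = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] * cb[c] for c in CATS) / (n * n)  # chance agreement
    return (p0 - pe) / (1 - pe)

same = performance()
print("same performance, two observers:", round(kappa(same, noisy_copy(same)), 3))
print("two different performances:", round(kappa(performance(), performance()), 3))
```

The first value should be high and the second close to zero; a statistic for which the two ranges of values overlap fails the known group difference test described above.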
This question is concerned with the values of reliability statistics that indicate an
acceptable level of reliability for performance analysis systems. Kappa has been
interpreted on the basis of the threshold values specified by Altman (1991).
Hughes et al. (2005) undertook a laboratory-based study to compare performance of ten
6s sprints with different recoveries. There were three conditions: performing 6s sprints
every 25s, 40s and 55s (meaning recoveries of 19s, 34s and 49s respectively). When
6s sprints were performed every 25s, power output during the 10 sprints was
significantly lower than when the sprints were performed every 40s or every 55s.
Furthermore, heart rate response and oxygen consumption were significantly higher
when 6s sprints were performed every 25s than when the sprints were performed
every 40s or every 55s. This is evidence that a different mix of energy systems is
utilised when 6s sprints are performed with 19s recovery than when they are
performed with recoveries of 34s or 49s. It is, therefore, inappropriate to interpret a
kappa value of 0.603 as a good strength of agreement when one observer records 18 x
6s bursts with 44s recoveries and the other observer records 36 x 6s bursts with 19s
recoveries. In reality, during time-motion analysis the observers will have slight
differences in the points at which corresponding high intensity bursts are deemed to
have commenced and ceased. This will reduce the kappa values, but it is still serious
that there is a possibility of interpreting observations indicating different energy
system utilisation as having a good strength of agreement. It is therefore
recommended that when considering reliability statistics and how they are interpreted
for time-motion analysis, the point at which energy systems would be considered to
be different should be identified. The kappa value obtained at this point would then
be the threshold for acceptable reliability. Similarly, if considering patterns of play in
a team game, the point at which the frequency profile of events changes so much as to
represent a different pattern of play should be identified.
[Figure: observation timelines. Observer 1 records repeated 6s work bursts separated by 19s recoveries; Observer 2 records 6s work bursts separated by 44s recoveries. Key: work, rest.]
Table 3. Summary of observation time (s) where observers agreed and disagreed.

                          Observer 1
                          Work    Rest    Total
Observer 2    Work        108     0       108
              Rest        108     684     792
              Total       216     684     900
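The kappa value of 0.603 quoted above follows directly from Table 3; a minimal sketch of the calculation (Python):

```python
# Cohen's kappa from the Table 3 agreement matrix (seconds of the 900 s
# observation classified as work or rest by each observer).
agreement = [[108,   0],    # Observer 2: work (vs Observer 1: work, rest)
             [108, 684]]    # Observer 2: rest (vs Observer 1: work, rest)

n = sum(map(sum, agreement))                                      # 900 s
p0 = (agreement[0][0] + agreement[1][1]) / n                      # 0.88
obs2_totals = [sum(row) for row in agreement]                     # [108, 792]
obs1_totals = [sum(row[j] for row in agreement) for j in (0, 1)]  # [216, 684]
pe = sum(a * b for a, b in zip(obs1_totals, obs2_totals)) / n**2  # 0.6976
print(round((p0 - pe) / (1 - pe), 3))                             # 0.603
```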
6.0. How does limited reliability affect the results of academic investigations of
sports performance?
Table 4 shows that adding these errors to the 277 values increases the variability
within each playing position as well as the overall sample of 277 players. Table 4
also shows a similar pattern if 20 or 10 players for each positional group are selected
at random. The increase in variability decreases the H value of the Kruskal-Wallis H
test and, in so doing, increases the p value reported. Therefore, the effect of
measurement error due to limited reliability appears to be to increase the p value
making it less likely that a significant difference will be found. Therefore, limited
reliability is more likely to cause a Type II Error (concluding no significant difference
when there is a difference in reality) than a Type I Error (concluding that there is a
significant difference when there is no difference in reality).
Table 4. The effect of introducing measurement error to a between-samples study with different sample sizes.

              277 Players                    60 Players                     30 Players
Position      N    Correct     With error   N   Correct      With error    N   Correct     With error
Defenders     59   9.75±2.49   9.80±2.66    20  10.16±2.52   10.11±2.82    10  9.07±1.84   9.25±1.74
Midfielders   115  10.97±2.28  10.90±2.57   20  11.38±2.13   11.18±2.65    10  10.11±1.54  10.77±2.37
Forwards      103  9.89±2.01   9.81±2.29    20  10.83±2.10   11.19±2.08    10  10.21±1.74  10.00±2.23
All           277  10.31±2.30  10.26±2.54   60  10.79±2.27   10.83±2.54    30  9.80±1.73   10.01±2.15
H (2 df)           18.4        14.7             4.7          3.5               1.9         1.8
p                  <0.001      0.001            0.096        0.174             0.392       0.401
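The pattern in Table 4 can be reproduced in simulation, as in the sketch below (Python with NumPy and SciPy). The groups are drawn with means and standard deviations approximating the “277 Players” column of Table 4; the normally distributed measurement error with a standard deviation of 1.0 is an illustrative assumption.

```python
# Sketch: measurement error inflates within-group variability and the
# Kruskal-Wallis p value, raising the risk of a Type II error.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
defenders   = rng.normal(9.75, 2.49, 59)
midfielders = rng.normal(10.97, 2.28, 115)
forwards    = rng.normal(9.89, 2.01, 103)

def add_error(group, sd=1.0):
    """Add illustrative normally distributed measurement error."""
    return group + rng.normal(0.0, sd, group.size)

print("correct data: p =", kruskal(defenders, midfielders, forwards).pvalue)
print("with error:   p =", kruskal(add_error(defenders),
                                   add_error(midfielders),
                                   add_error(forwards)).pvalue)
```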
limited reliability due to operator error risks third parties making poor decisions about
training and preparation based on the findings. Therefore, the level of reliability
should be evaluated and reported so that the reader has an understanding of the quality
of data gathering and is able to take account of this when interpreting the findings of
the study. Reliability is not only important when performance analysis systems are
being used in academic research but also in coaching, media and scoring contexts.
O’Donoghue and Longville (2004) describe an approach to improving the reliability
of the system used by the Welsh netball squad. Important decisions affecting training
and preparation for international competition are made based on information provided
by such systems. Therefore, correctness of the information is very important in
coaching contexts.
scored by the computerised scoring system used in amateur boxing. Therefore, there
is a risk of the wrong decision being produced where the number of recorded scoring
punches for the two boxers gives the opposite result to what would be obtained if all
actual scoring punches were used. There have been attempts at improving the
operating environment of the judges to allow the system to be used (Mullan and
O’Donoghue, 1999 and 2001) but these have failed to prove more reliable than the
current approach used by IABA (International Amateur Boxing Association).
Systems have been developed to allow the activity of judges to be scrutinised (Scott,
2006) and there are metrics to allow biased, lazy and incompetent judges to be
identified (Brown and O’Donoghue, 1999). The efforts made to try to improve the
scoring system used in amateur boxing may be considered more important than
reliability evaluations for academic research. This is because when athletes have
prepared diligently to compete in Olympic, World, Commonwealth or continental
international boxing tournaments, their performances should be scored as accurately
as possible to ensure that their success is deserved.
The media takes a number of different forms: printed media, television and radio, as
well as an increasing number of internet sites reporting sports news. This paper uses
a case from the printed media to illustrate the limited reliability of information in the
media. The newspaper business has a goal of selling newspapers and must therefore
provide readers with the sports coverage they are interested in. The author bought
the following British newspapers on Sunday 10th
September 2006, the day after a full programme of domestic soccer and rugby
matches as well as the women’s singles final at the US Open: The Independent on
Sunday, The Mail on Sunday, The News of the World, The Sunday Express, The
Sunday Mirror, The Sunday Telegraph and The Sunday Times.
Of these, only The Sunday Times and The Sunday Telegraph went beyond qualitative
analysis of performances when reporting on soccer matches. For 3 of the FA Premier
League soccer matches, The Sunday Telegraph showed each team’s %time in
possession, and frequency of offsides, shots on target, shots off target, corners, fouls
conceded, yellow cards and red cards. The Sunday Times reported the same statistics
for each FA Premier League match played the previous day with the number of
blocked shots as an additional indicator. Table 5 shows that there was limited
agreement between the two newspapers for some of the common performance
indicators used in the three matches for which both provided quantitative match facts.
There was no quantitative analysis of process indicators during the reporting of any
other sport in any of the Sunday newspapers surveyed on that day.
Table 5. Performance indicators reported in The Sunday Times and The Sunday Telegraph on 10th September 2006 for 3 FA Premier League soccer matches.

Newspaper             Team       % Possession  Offsides  Shots on target  Shots off target  Corners  Fouls conceded
The Sunday Telegraph  Everton    54            3         6                2                 2        22
                      Liverpool  46            4         5                16                11       13
The Sunday Times      Everton    46            3         4                2                 2        23
                      Liverpool  52            4         4                14                11       14
Table 6. Number of points where players went to the net as measured by 2 different methods.

                             Author's Observation         US Open Internet Site
Match / Set                  Player 1  Player 2  Neither  Player 1  Player 2  Neither
Federer v Davydenko Set 1    2         2         35       3         4         32
Federer v Davydenko Set 2    6         3         65       10        7         67
Federer v Davydenko Set 3    6         0         47       7         2         44
Roddick v Youzhny Set 1      18        2         59       21        3         55
Roddick v Youzhny Set 2      5         3         25       7         4         22
Roddick v Youzhny Set 3      13        7         61       16        10        55
Roddick v Youzhny Set 4      13        3         44       15        5         40
Federer v Roddick Set 1      5         7         35       7         9         31
Federer v Roddick Set 2      13        13        29       16        13        26
Federer v Roddick Set 3      8         18        57       11        19        53
Federer v Roddick Set 4      4         7         29       4         8         28
Table 7. Summary of categorical data for duration of high intensity bursts for objectivity.

                                      Trainee Observer
                          No bursts   <10s    ≥10s    Total
Experienced   No bursts   2           1       0       3
observer      <10s        1           23      1       25
              ≥10s        0           0       0       0
              Total       3           24      1       28
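Applying the same kappa calculation to the three categories of Table 7 gives a value of approximately 0.52, a moderate strength of agreement on Altman’s (1991) scale. A minimal sketch (Python), assuming the matrix is read with the experienced observer’s categories as rows:

```python
# Cohen's kappa for the 3 x 3 Table 7 matrix (experienced observer's
# categories as rows, trainee observer's categories as columns).
table7 = [[2,  1, 0],   # no bursts
          [1, 23, 1],   # < 10 s
          [0,  0, 0]]   # >= 10 s

n = sum(map(sum, table7))                               # 28 observations
p0 = sum(table7[i][i] for i in range(3)) / n            # observed agreement
rows = [sum(r) for r in table7]                         # experienced totals
cols = [sum(r[j] for r in table7) for j in range(3)]    # trainee totals
pe = sum(a * b for a, b in zip(rows, cols)) / n**2      # chance agreement
print(round((p0 - pe) / (1 - pe), 3))                   # ~0.52
```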
9.0. Recommendations.
Considering the exercises undertaken to try to answer the seven questions discussed
in this paper, the following steps are recommended during the development and
operation of a performance analysis system. Steps 1 to 4 are relevant during system
development and steps 5 to 7 should be applied by each new operator using the
system.
10.0. Acknowledgements.
The author would like to thank the delegates of the World Congress of Performance
Analysis of Sport 7 in Szombathely, Hungary, who participated in the reliability
workshop, especially Mike Hughes, Nic James, Martin Lames, Tim McGarry and
Albin Tenga for their comments on the ideas presented in this paper.
11.0. References.
Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman
& Hall.
Armitage, P. (2006). Analysis of the knockout stages of the 2003 Rugby World Cup,
B.Sc. Dissertation, School of Sport, University of Wales Institute Cardiff,
Cyncoed Campus, Cardiff, UK.
Brown, D. and O’Donoghue, P.G. (1999). Specification and evaluation metrics for
identifying biased, lazy and incompetent judges in amateur boxing, Book of
Abstracts, Exercise and Sports Science Association of Ireland, Limerick, 19th
November 1999.
Coalter, A., Ingram, B., McCrorry, P. MBE, O'Donoghue, P.G. and Scott, M. (1998).
A Comparison of Alternative Operation schemes for the Computerised
Scoring System for Amateur Boxing. Journal of Sports Sciences, 16, 16-17.
Cooper, S.M. (2006). Statistical methods for resolving issues relevant to test and
measurement reliability and validity in variables related to sport performance
and physical fitness, Ph.D. Thesis, School of Sport, University of Wales
Institute Cardiff, Cyncoed Campus, Cardiff, UK.
Hughes, M., Cooper, S.M. and Nevill, A. (2004). Analysis of notation data: reliability.
In Notational analysis of sport, 2nd Edition (Edited by M. Hughes and I.M.
Franks). London: Routledge, 189-204.
Hughes, M.G., Rose, G. and Amaral, I. (2005). The influence of recovery duration on
blood lactate accumulation in repeated sprint activity. Journal of Sports
Sciences, 23, 130-131.
Hunter, P. and O’Donoghue, P.G. (2001). A match analysis of the 1999 Rugby Union
World Cup. In Proceedings of the World Congress of Performance
Analysis, Sports Science and Computers (PASS.COM) (Edited by M.
Hughes and I.M. Franks). Cardiff: CPA Press, UWIC, 85-90.
McCorry, M., Saunders, E.D., O'Donoghue, P.G. and Murphy, M.H. (1996). A match
analysis of the knockout stages of the 3rd Rugby Union World Cup. In
Proceedings of the World Congress of Notational Analysis of Sport III
(Edited by M. Hughes). Cardiff: CPA Press, UWIC, 230-239.
O’Donoghue, P.G. (2002). Performance models of ladies’ and men’s singles tennis at
the Australian Open. International Journal of Performance Analysis of
Sport (e), 2(1), 73-84.
O’Donoghue, P.G. (2005). An algorithm to use the kappa statistic to establish
reliability of computerised time-motion analysis systems, Book of Abstracts,
5th International Symposium of Computer Science in Sport, Hvar, Croatia,
25th-28th May, p. 49.
O’Donoghue, P.G. and Longville, J. (2004). Reliability testing and the use of
statistics in performance analysis support: a case study from an international
netball tournament. In Performance Analysis of Sport 6 (Edited by P.G.
O’Donoghue and M.D. Hughes). Cardiff: CPA Press, UWIC, pp. 1-7.
O'Donoghue, P.G. and Murphy, M.H. (1996). Object Modelling and Formal
O'Donoghue, P.G., Hughes, M.G., Rudkin, S., Bloomfield, J., Cairns, G., Powell, S.
(2005). Work rate analysis using the POWER (Periods of Work Efforts and
Recoveries) System. International Journal of Performance Analysis of
Sport (e), 5(1), 5-21.
Scott, M. (2006). A system for scrutinising the activity of judges operating the
computerised scoring system in amateur boxing, M.Sc. Thesis, Faculty of
Engineering, University of Ulster, Jordanstown, UK.
Sockett, W. (2006). Observational analysis of type and intensity of children’s play
during primary school break times, B.Sc. Dissertation, School of Sport,
University of Wales Institute Cardiff, Cyncoed Campus, Cardiff, UK.