
International Journal of Performance Analysis in Sport

ISSN: 2474-8668 (Print) 1474-8185 (Online) Journal homepage: http://www.tandfonline.com/loi/rpan20

Reliability Issues in Performance Analysis

Peter O’Donoghue

To cite this article: Peter O’Donoghue (2007) Reliability Issues in Performance Analysis, International Journal of Performance Analysis in Sport, 7:1, 35-48, DOI: 10.1080/24748668.2007.11868386

To link to this article: https://doi.org/10.1080/24748668.2007.11868386

Published online: 03 Apr 2017.



Reliability Issues in Performance Analysis

Peter O’Donoghue
Centre for Performance Analysis, School of Sport, University of Wales Institute
Cardiff, Cyncoed Campus, Cardiff, Wales, CF23 6XD, UK.

Abstract

There are beliefs that have come to be accepted by many in performance analysis. This paper challenges some of these beliefs.
The presence of precise operational definitions does not guarantee
good reliability nor does their absence guarantee poor reliability.
Intra-operator reliability studies cannot show a system to be objective.
Some reliability statistics give values considered to reflect good reliability even when the two observations are not of the same match! The value of a reliability statistic considered to be acceptable needs to be justified. Limited reliability can introduce variability into data that reduces the chance of finding a significant difference.
Reliability is at least as important when performance analysis is used
in coaching and judging contexts as when it is used for academic
research. There is a trade-off between reliability and the precision of
measurement.

Keywords: Reliability, objectivity, measurement error.

1.0. Introduction.

Most performance analysis methods do not involve fully automated data capture
techniques. Therefore, human error during data gathering can limit the reliability of
the methods used. Even where fully automated data gathering is possible, there may
be algorithmic, optical or other limitations in data gathering techniques that limit the
reliability of data collected. Therefore, reliability evaluation is essential so that the
information produced by performance analysis can be interpreted with a full
understanding of measurement error involved. This is not just the case in scientific
research, but also where the information is being used in coaching, media or judging
contexts to make important decisions.

Hughes et al. (2001) surveyed published performance analysis research, finding many investigations where the reliability of data collection was not reported and other investigations where the reliability test made inappropriate use of statistical techniques. Many performance analysts find reliability testing a difficult area, particularly the use of statistical analysis techniques to evaluate reliability.
Guidance on reliability statistics has been limited to percentage error (Hughes et al., 2004) and, more recently, confidence intervals (Cooper, 2006). There are also related areas, such as the use of operational definitions and the validity of performance indicators, that have an impact on the reliability of data collection in performance analysis. At the recent World Congress of Performance Analysis of Sport 7 in Szombathely, Hungary, there was a workshop on reliability issues in performance analysis. At this workshop, the author of the current paper raised seven questions arising from concerns over accepted reliability evaluation practice in performance analysis. The purpose of the current paper is to provide a record of these questions and to propose a framework for reliability assessment during system development and subsequent operation. In raising these questions, examples are used based on previously published results, data used in previous investigations and some original data. The seven questions are:

• Do precise operational definitions guarantee good reliability?


• Do intra-operator reliability studies measure improving reliability or
increasing awareness of the match observed during the reliability study?
• Do statistical measures used to evaluate reliability have construct validity?
• What is a good strength of agreement?
• How does limited reliability impact on the results of academic investigations
of sports performance?
• Is reliability of observation and data collection as important in coaching, media and scoring applications of performance analysis as it is in scientific research?
• Is there a precision versus reliability trade-off?

2.0. Do precise operational definitions guarantee good reliability?

It is essential for system operators and the eventual consumers of the information
generated by performance analysis to have a shared understanding of the variables
used. Therefore, these variables should be defined with a level of precision making
their meaning unambiguous. To ensure the variables are also valid, Hughes and
Bartlett (2002) recommend the use of “performance indicators”. Performance
indicators are not just variables relevant to sports performance but are also valid
measures associated with level of performance. Performance indicators should also
have the metric property of having a means of interpretation (O’Donoghue, 2006).

The operational definitions used in performance analysis do not have the detail seen
in legal contracts. When it is essential to understand the requirements for software
intensive systems, rigorous formal specification techniques are used based on set
theory, mapping functions and temporal logic (O’Donoghue and Murphy, 1996). It
may not be wise for operational definitions in performance analysis systems to aim for such levels of unambiguous detail. Performance analysis is often a real-time observational task, leaving insufficient time to consider the level of detail found in a legal contract, which is inspected carefully over a much longer time scale before determining whether it has been conformed to or breached.

Consider the hand notation system used to analyse rugby union developed by
McCorry et al. (1996) and also used by Hunter and O’Donoghue (2001). This system
involved tallying named events such as “possession gained” during “set play” or
“loose play” as well as gaining territory by going “over”, “through” or “around” the
opposition. These events were not defined in sufficient detail to satisfy many
performance analysts and so one of the objectives of Armitage’s (2006) study of the
2003 Rugby World Cup was to provide precise operational definitions for these terms
prior to testing system reliability and using the system to investigate the differences
between winning and losing teams in the knockout stages. The operational definitions
came to eight double-sided pages of Armitage’s (2006) thesis, with terms introduced within the operational definitions, such as “gain line”, also being defined. Armitage
and his supervisor (the author of the current paper) participated in an inter-operator
reliability study. Firstly the operational definitions were read, discussed and
explained before the final of the 2003 Rugby World Cup was analysed independently
by the two observers. Table 1 shows the frequency of different methods of gaining
territory employed by the two teams as recorded by the two observers and Table 2
shows the possession changes recorded. Irrespective of whether the disagreements
illustrated in Tables 1 and 2 are considered as percentage errors or absolute
differences in frequencies, they reveal serious limits in reliability despite the efforts
made prior to the inter-operator agreement study commencing. The two observers
discussed the reasons why they disagreed so much and what each of them was
counting for each type of event. It was concluded that it would have been useful to discuss the operational definitions while viewing example video sequences, so that the definitions would be considered in terms of the observational analysis task. This would have facilitated learning by both observers of the types of observed behaviours that would be counted for each class of recorded event
and for what reasons. The experience shows that agreeing on the wording of
operational definitions is not sufficient and it is necessary for observers to understand
the operational definitions fully.

Table 1. Results for methods of gaining territory during Armitage’s (2006) inter-operator reliability study.
Method of             England                   Australia
gaining territory     Operator 1   Operator 2   Operator 1   Operator 2
Around                    23            9           11            4
Over                      22           23           16           28
Through                   11           34            3           13
Total                     56           66           30           45

Table 2. Results for possession changes during Armitage’s (2006) inter-operator reliability study.
Situation where             England                   Australia
possession was gained       Operator 1   Operator 2   Operator 1   Operator 2
Set Play                        11           12           13           14
Loose Play                      31           37           32           38
Total                           42           49           45           52
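
As an illustration of the percentage error calculation referred to above, the Table 1 frequencies can be processed as follows. This is a minimal sketch in Python; several forms of percentage error appear in the literature, and the form used here simply expresses the absolute difference between the two operators as a percentage of their mean value.

    # Percentage error for the Table 1 frequencies (methods of gaining territory).
    # One common form: absolute difference as a percentage of the operators' mean value.
    territory = {                      # (operator 1, operator 2) frequencies
        "England - Around": (23, 9),
        "England - Over": (22, 23),
        "England - Through": (11, 34),
        "Australia - Around": (11, 4),
        "Australia - Over": (16, 28),
        "Australia - Through": (3, 13),
    }

    for event, (op1, op2) in territory.items():
        pct_error = abs(op1 - op2) / ((op1 + op2) / 2) * 100
        print(f"{event}: {pct_error:.0f}% error")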

There are some studies where the nature of what is being observed cannot be
described precisely or practically in words. Consider an investigation of defensive
styles used by international netball teams (Williams and O’Donoghue, 2006) that characterised the defensive style used against the opposition as man-to-man, zonal, part man-to-man / part zonal or other. Very brief operational definitions were provided for these as follows:

• Zone defence – Where all players concerned in the area of play analysed are
marking the space.
• Man-to-man defence – Where all players concerned in the area of play analysed are marking a player.
• Part man-to-man / part zone defence – Where some players concerned in the
area of play analysed are marking the space and some players concerned are
marking a player.
• Other – Where the defence could not be classified into one of the man-to-man or zonal strategies.

Despite the vagueness of these definitions, 100% agreement was achieved when two independent operators used the system to analyse a single quarter of netball. One reason for the level of agreement obtained was that both operators were experienced netball players and coaches who were very familiar with the terms used and the types of defensive play that counted for each. This shows that an understanding of the behaviours being counted is more important than agreement on
the wording of operational definitions. Defending in netball is undertaken by a team
of 7 players in an attempt to regain possession of the ball. There are complex
movement patterns dictated by the defensive style adopted and the possession play of
the opposition with the ball. Describing all of the possible behaviours that characterise each defensive style, or what distinguishes the defensive styles, would
have been impossible. The contrasting experiences of Armitage (2006) and Williams
and O’Donoghue (2006) show that agreement on operational definition wording is not
sufficient for reliable observation and that good knowledge of the behaviours being
observed is essential.

3.0. Do intra-operator reliability studies measure improving reliability or increasing awareness of the match observed during the reliability study?

Intra-operator reliability studies involve an operator analysing a performance on 2 or more occasions using the given performance analysis system. Where a good level of
intra-operator agreement is achieved, it merely shows that the operator can use the
system consistently. The operator’s understanding of the events being counted may
be different to other potential operators and, therefore, intra-operator agreement does
not establish the objectivity of a system. For the system to be objective, the result of
using the system must be independent of individual operators. Multiple analyses of
performances are done to monitor improving reliability of observation as a user trains
to use the system. However, much of the improving agreement seen will be due to
familiarisation with the particular performance rather than improvement in reliability
of observation. Given that intra-operator agreement alone cannot establish the
objectivity of a system, it is proposed that improvements in reliability during end-user
training should be monitored during inter-operator agreement studies with different
performances being analysed at successive stages. An example of such an approach
was the inter-operator reliability study undertaken to test Sockett’s (2006) system for
analysing play type and intensity of primary school children in the playground. For
this study, an intra-operator reliability study could not be undertaken because the Ethics Committee of UWIC’s School of Sport would not allow the subjects to be filmed. Therefore, two independent observers used the system on 5 different occasions, analysing one child on each occasion. Figure 1 shows the progress in inter-
operator agreement over the 5 occasions where the system was tested. The chi square
test of independence was used to compare the frequency profile of play types
observed by the two observers. Because testing reliability is about testing similarity of observation rather than difference, p values of 0.90 or above are required for acceptable reliability. A p value above 0.05 means only that there is no significant difference between the observations, but if the p value is less than 0.50 then the observations are more different than similar. Intra-operator agreement does still have a role in system
development during piloting as the experience of using the system can identify
necessary changes that can be made before undertaking an inter-operator agreement
study.
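
The chi square comparison that produced the p values in Figure 1 can be illustrated as follows. This is a minimal sketch using Python’s scipy library; the play-type frequencies are hypothetical rather than Sockett’s (2006) data, and the sketch simply shows how a p value would be obtained and judged against the 0.90 threshold argued for above.

    # Chi square test of independence comparing two observers' play-type profiles.
    # The frequencies are hypothetical and are not taken from Sockett (2006).
    from scipy.stats import chi2_contingency

    observer_1 = [12, 25, 18, 9]   # counts for four play-type categories
    observer_2 = [14, 23, 17, 10]

    chi2, p, dof, expected = chi2_contingency([observer_1, observer_2])
    print(f"chi square = {chi2:.2f}, p = {p:.3f}")

    # Following the argument above, similarity (not mere absence of a significant
    # difference) is required, so p >= 0.90 is taken as acceptable reliability.
    print("acceptable" if p >= 0.90 else "not acceptable")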

[Figure 1: bar chart of p values from the chi square test for observations T1 to T5, rising from 0.52 (T1) through 0.76, 0.83 and 0.95 to 0.96 (T5).]

Figure 1. Results of an inter-operator agreement study (Sockett, 2006).

4.0. Do statistical measures used to evaluate reliability have construct validity?

There are statistical methods used to establish reliability that may not be appropriate
for the particular systems under consideration. Consider a time motion analysis
system used to record the percentage of time a soccer player spends in a stationary
position, walking, jogging, running, on-the-ball activity and any other high intensity
activity. Using chi square or a correlation coefficient to compare the distribution of time across these different movement types between two observations of the same performance may not give an indication of reliability. Most
players spend 10% to 15% of the match in a stationary position, 40% to 55% of the
match walking, 15% to 25% of the match jogging, about 5% of the match running,
less than 5% of the match performing on the ball activity and the remaining time
performing other high intensity activity. Therefore, there will be high correlations
and high p values (>0.90) even when looking at completely different players. For a
reliability statistic to have construct validity, it must produce different ranges of
values when reliably analysing the same performance by trained observers than when
two totally different performances are being compared. When O’Donoghue et al.
(2005) developed the POWER system for time-motion analysis, they found that
values of Pearson’s r, percentage error and chi square produced when analysing
different subjects could be greater than when the same subject was being analysed by
an experienced observer. However, the kappa statistic achieved construct validity
with values of under 0.2 (very poor to poor strength of agreement) being obtained
when analysing completely different performances and values of over 0.6 (good to very good strength of agreement) being obtained when the same performance was
analysed by an experienced user. Kappa might not be the best reliability statistic for
all systems and performance indicators. However, it is recommended that efforts are
made to use statistical procedures for reliability that do exhibit construct validity.
This can be tested by applying them to pairs of observations where the same
performance is being observed and pairs of observations from completely different
performances. Any reliability statistic showing the “known group difference”
between these two different types of pair of observation with no overlap in values can
be deemed to have the property of construct validity as a reliability statistic.
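
The construct validity check described above can be sketched as follows. The coded sequences are hypothetical second-by-second movement classes, not the POWER system data; the point is only that the same statistic should separate same-performance pairs of observations from pairs taken from different performances.

    # Construct validity check for a reliability statistic (kappa used here):
    # apply it to two observations of the same performance and to observations
    # of two different performances. The coded sequences are hypothetical.
    from collections import Counter

    def kappa(seq_a, seq_b):
        """Cohen's kappa for two equal-length sequences of category labels."""
        n = len(seq_a)
        p0 = sum(a == b for a, b in zip(seq_a, seq_b)) / n
        count_a, count_b = Counter(seq_a), Counter(seq_b)
        pe = sum((count_a[c] / n) * (count_b[c] / n) for c in set(seq_a) | set(seq_b))
        return (p0 - pe) / (1 - pe)

    same_a = list("WWWRRRRRWWRRRRRRWWWR")    # W = work, R = rest
    same_b = list("WWWRRRRRWWRRRRRWWWWR")    # second observation of the same performance
    different = list("RRWWWWRRRRWWWWWRRRRW") # observation of a different performance

    print("same performance:       kappa =", round(kappa(same_a, same_b), 2))
    print("different performances: kappa =", round(kappa(same_a, different), 2))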

5.0. What is a good strength of agreement?

This question is concerned with the values of reliability statistics that indicate an
acceptable level of reliability for performance analysis systems. Kappa has been
interpreted on the basis of the threshold values specified by Altman (1991). For example, 0.4 to 0.6 is interpreted as moderate strength of agreement, 0.6 to 0.8 as good strength of agreement and values of over 0.8 as very good strength of agreement. However, these values were used for an example where two radiographers were making diagnoses when analysing xeromammograms. Both radiographers agreed on the number of xeromammograms there actually were to be diagnosed. In performance analysis, there are occasions where one operator
records a discrete event while the other operator doesn’t. Furthermore, there is an
algorithm used to determine the kappa statistic for the proportion of continuous
observation time where two observers agree on movement class adjusted for the
expected proportion of observation time where they would agree by chance
(O’Donoghue, 2005). The kappa thresholds specified by Altman (1991) cannot be
assumed to be suitable when assessing the reliability of performance analysis systems.
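
For reference, the descriptive bands quoted above can be expressed as a simple lookup. This is only a restatement of Altman’s (1991) thresholds; the 0.2 to 0.4 (“fair”) band comes from Altman’s table rather than from the text above.

    # Altman's (1991) descriptive strength-of-agreement bands for kappa.
    def strength_of_agreement(kappa: float) -> str:
        if kappa < 0.2:
            return "poor"
        if kappa < 0.4:
            return "fair"       # from Altman's table; not quoted in the text above
        if kappa < 0.6:
            return "moderate"
        if kappa < 0.8:
            return "good"
        return "very good"

    print(strength_of_agreement(0.603))   # "good" -- see the worked example later in this section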

Consider two observations of a player involved in 15 minutes of intermittent high intensity activity. One observer records 18 x 6s bursts of high intensity activity with equal recoveries of 44s following them. The other observer records 36 bursts of high intensity activity of 6s with 19s recovery periods following them. There are 18 x 6s periods where both observers agree that high intensity activity is being performed as well as 36 x 19s periods where both observers agree that low intensity recovery activity is being performed. However, there are 18 x 6s periods where the first observer has recorded low intensity recovery while the second observer has recorded high intensity activity. Figure 2 shows the first 150s of the two observers’ recordings.
Table 3 summarises the activity recorded by the two observers during the 15 minute
observation. The proportion of observation time where the two observers agree on the
activity being recorded is 0.880 and the proportion of observation time where they
could be expected to agree by chance is 0.698 giving a kappa value of 0.603 which
would be interpreted as a good strength of agreement (Altman, 1991).

Hughes et al. (2005) undertook a laboratory-based study to compare performance of ten 6s sprints with different recoveries. There were three conditions: performing 6s sprints every 25s, 40s and 55s (meaning recoveries of 19s, 34s and 49s respectively). When
6s sprints were performed every 25s, power output during the 10 sprints was
significantly lower than when the sprints were performed every 40s or every 55s.
Furthermore, heart rate response and oxygen consumption were significantly higher when 6s sprints were performed every 25s than when the sprints were performed
every 40s or every 55s. This is evidence that a different mix of energy systems is
utilised when 6s sprints are performed with 19s recovery than when they are
performed with recoveries of 34s or 49s. It is, therefore, inappropriate to interpret a
kappa value of 0.603 as a good strength of agreement when one observer records 18 x
6s bursts with 44s recoveries and the other observer records 36 x 6s bursts with 19s
recoveries. In reality during time-motion analysis, the observers will have slight
differences in the points at which corresponding high intensity bursts are deemed to
have commenced and ceased. This will reduce the kappa values, but it is still serious
that there is a possibility of interpreting observations indicating different energy
system utilisation as having a good strength of agreement. It is therefore
recommended that when considering reliability statistics and how they are interpreted
for time-motion analysis, the point at which energy systems would be considered to
be different should be identified. The kappa value obtained at this point would then
be the threshold for acceptable reliability. Similarly if considering patterns of play in
a team game, the point at which the frequency profile of events changes so much as to indicate a different style of play should be identified. The value of whatever reliability statistic is being used (not always kappa) at this point can be used as the
threshold for acceptable reliability.

[Figure 2: timeline of the first 150s of the two recordings. Observer 1 alternates 6s work periods with 19s rest periods; Observer 2 alternates 6s work periods with 44s rest periods.]

Figure 2. Example of 2 time-motion analysis observations.

Table 3. Summary of observation time (s) where observers agreed and disagreed.
                            Observer 1
                        Work    Rest    Total
Observer 2    Work       108       0      108
              Rest       108     684      792
              Total      216     684      900
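
The kappa value of 0.603 quoted above can be reproduced directly from Table 3; the short Python check below uses only the observation times in the table.

    # Worked check of the kappa value quoted in the text, using Table 3 (seconds).
    agree_work, agree_rest = 108, 684        # time where the observers agree
    obs1_work, obs1_rest = 216, 684          # Observer 1 column totals
    obs2_work, obs2_rest = 108, 792          # Observer 2 row totals
    total = 900

    p0 = (agree_work + agree_rest) / total                              # 0.880
    pe = (obs1_work * obs2_work + obs1_rest * obs2_rest) / total ** 2   # 0.698
    print(round(p0, 3), round(pe, 3), round((p0 - pe) / (1 - pe), 3))   # kappa = 0.603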

6.0. How does limited reliability affect the results of academic investigations of
sports performance?

Consider the percentage of high intensity activity performed by 277 FA Premier League soccer players observed by the author. Table 4 shows that a Kruskal Wallis H
test revealed a significant difference between the 3 broad outfield positional roles for
this variable. A 60-subject reliability study undertaken by O’Donoghue and Parker (2001) revealed 95% limits of agreement for this variable of 0.44±2.44% (this means there was a standard deviation of 1.25% for the 60 inter-operator errors). Let us assume that the 277 values are correct and then add random error using the Microsoft Excel function shown in equation (1). This generates normally distributed random errors with a standard deviation of 1.25%.

Error = 1.25 * NORMSINV(RAND()) (1)


Table 4 shows that adding these errors to the 277 values increases the variability within each playing position as well as within the overall sample of 277 players. Table 4 also shows a similar pattern if 20 or 10 players for each positional group are selected at random. The increase in variability decreases the H value of the Kruskal Wallis H test and, in so doing, increases the p value reported. Therefore, the effect of measurement error due to limited reliability appears to be to increase the p value, making it less likely that a significant difference will be found. Limited reliability is, therefore, more likely to cause a Type II Error (concluding no significant difference when there is a difference in reality) than a Type I Error (concluding that there is a significant difference when there is no difference in reality).
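
The exercise described above can be reproduced approximately with the sketch below, which uses numpy and scipy in place of the Excel formula in equation (1). The positional values are synthetic placeholders generated from the Table 4 means and standard deviations, not the author’s 277-player data set.

    # Sketch of the error-injection exercise: generate synthetic %high-intensity
    # values for three positional groups, add normally distributed inter-operator
    # error (SD = 1.25%) and compare Kruskal Wallis results with and without error.
    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(0)
    defenders = rng.normal(9.75, 2.49, 59)      # synthetic values based on Table 4
    midfielders = rng.normal(10.97, 2.28, 115)
    forwards = rng.normal(9.89, 2.01, 103)

    def with_error(values, sd=1.25):
        # Add normally distributed random measurement error to each value.
        return values + rng.normal(0.0, sd, len(values))

    h0, p0 = kruskal(defenders, midfielders, forwards)
    h1, p1 = kruskal(with_error(defenders), with_error(midfielders), with_error(forwards))
    print(f"without error: H = {h0:.1f}, p = {p0:.4f}")
    print(f"with error:    H = {h1:.1f}, p = {p1:.4f}")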

Table 4. The effect of introducing measurement error to a between-samples study with different sample sizes (mean±SD of % high intensity activity).
                 277 Players                          60 Players                           30 Players
Position      N     Correct       With Error      N     Correct       With Error      N     Correct       With Error
Defenders    59     9.75±2.49     9.80±2.66      20     10.16±2.52    10.11±2.82     10     9.07±1.84     9.25±1.74
Midfielders 115     10.97±2.28    10.90±2.57     20     11.38±2.13    11.18±2.65     10     10.11±1.54    10.77±2.37
Forwards    103     9.89±2.01     9.81±2.29      20     10.83±2.10    11.19±2.08     10     10.21±1.74    10.00±2.23
All         277     10.31±2.30    10.26±2.54     60     10.79±2.27    10.83±2.54     30     9.80±1.73     10.01±2.15
H (2 df)            18.4          14.7                  4.7           3.5                   1.9           1.8
p                   <0.001        0.001                 0.096         0.174                 0.392         0.401

7.0. Is reliability of observation and data collection as important in coaching, media and scoring applications of performance analysis as it is in scientific research?

Where performance analysis systems use human operators, there should be a reliability study conducted every time the system is operated by new users. This is
essential for original research studies to be published in the International Journal of
Performance Analysis of Sport. Indeed, one of the seven types of scientific
dishonesty identified by Shore (1991) is the use of faulty data gathering procedures.
Reporting sports performance data as though it is 100% accurate where there may be limited reliability due to operator error risks third parties making poor decisions about
training and preparation based on the findings. Therefore, the level of reliability
should be evaluated and reported so that the reader has an understanding of the quality
of data gathering and is able to take account of this when interpreting the findings of
the study. Reliability is not only important when performance analysis systems are
being used in academic research but also in coaching, media and scoring contexts.
O’Donoghue and Longville (2004) describe an approach to improving the reliability
of the system used by the Welsh netball squad. Important decisions affecting training
and preparation for international competition are made based on information provided
by such systems. Therefore, correctness of the information is very important in
coaching contexts.

The computerised scoring system used in amateur boxing is an example of performance analysis, as each judge must analyse the performance of the two boxers
in order to contribute to the decision as to the bout outcome. Coalter et al. (1998)
have demonstrated that fewer than one third of actual scoring punches are scored by the computerised scoring system used in amateur boxing. Therefore, there
is a risk of the wrong decision being produced where the number of recorded scoring
punches for the two boxers gives the opposite result to what would be obtained if all
actual scoring punches were used. There have been attempts at improving the
operating environment of the judges to allow the system to be used (Mullan and
O’Donoghue, 1999 and 2001) but these have failed to prove more reliable than the
current approach used by IABA (International Amateur Boxing Association).
Systems have been developed to allow the activity of judges to be scrutinised (Scott,
1996) and there are metrics to allow biased, lazy and incompetent judges to be
identified (Brown and O’Donoghue, 1999). The efforts made to try and improve the
scoring system used in amateur boxing may be considered more important than
reliability evaluations for academic research. This is because when athletes have
prepared diligently to compete in Olympic, World, Commonwealth or continental
international boxing tournaments, their performances should be scored as accurately
as possible to ensure that their success is deserved.

The media takes a number of different forms: printed media, television and radio, as well as an increasing number of internet sites reporting sports news. This paper uses an example from the printed media to illustrate the limited reliability of information in the media. The newspaper business has a goal of selling newspapers and must therefore provide its readers with the sports coverage they are
interested in. The author bought the following British newspapers on Sunday 10th
September 2006, the day after a full programme of domestic soccer and rugby
matches as well as the women’s singles final at the US Open: The Independent on
Sunday, The Mail on Sunday, The News of the World, The Sunday Express, The
Sunday Mirror, The Sunday Telegraph and The Sunday Times.

Of these, only The Sunday Times and The Sunday Telegraph went beyond qualitative
analysis of performances when reporting on soccer matches. For 3 of the FA Premier
League soccer matches, The Sunday Telegraph showed each team’s %time in
possession, and frequency of offsides, shots on target, shots off target, corners, fouls
conceded, yellow cards and red cards. The Sunday Times reported the same statistics
for each FA Premier League match played the previous day with the number of
blocked shots as an additional indicator. Table 5 shows that there was limited agreement between the two newspapers for some of the common performance indicators for the three matches for which both provided quantitative match facts.
There was no quantitative analysis of process indicators during the reporting of any
other sport in any of the Sunday newspapers surveyed on that day.

Table 5. Performance indicators reported in The Sunday Times and The Sunday Telegraph on 10th September 2006 for 3 FA Premier League soccer matches.
Newspaper              Team         % Possession   Offsides   Shots on target   Shots off target   Corners   Fouls conceded
The Sunday Telegraph   Everton            54            3              6                 2              2            22
                       Liverpool          46            4              5                16             11            13
The Sunday Times       Everton            46            3              4                 2              2            23
                       Liverpool          52            4              4                14             11            14
The Sunday Telegraph   Portsmouth         45            4              7                 2              6            14
                       Wigan              55            4              5                 6              9             5
The Sunday Times       Portsmouth         47            4              4                 3              6            15
                       Wigan              53            4              4                 5             10             9
The Sunday Telegraph   Chelsea            53            3              6                13              6            12
                       Charlton           47            2              3                 1              3             7
The Sunday Times       Chelsea            54            3              8                14              6            12
                       Charlton           46            1              2                 3              3             9

The question of importance of reliability of performance indicators reported in the media is still open. It depends on what the information is going to be used for.
Coaches in high level sport would tend to use their own analysts rather than relying
on media reports. There are research investigations that have used performance
indicators provided on internet media (O’Donoghue, 2002). Such studies need to
check the reliability of the match statistics used by undertaking a separate reliability
study. Table 6 shows that the number of points deemed to be net points in US Open
tennis matches differs between the author’s observation and the values reported on the
US Open official internet site (US Open, 2006). There is a systematic bias with 1.95
more net points being reported on average for each player in each set on the internet
site than was recorded by the author. The author’s definition of a player approaching
the net was that the player would have to cross the service line before the last shot of
the rally was played. The fact that the internet site never reported a lower value than
the author indicates that there is a methodological difference with additional points
being counted as net points by the method used by the US Open internet site. When
using internet data where such a systematic bias exists, either the systematic bias
needs to be removed or the definition used on the internet site needs to be understood.

Table 6. Number of points where players went to the net as measured by 2 different
methods.
Match / Set Author’s Observation US Open Internet Site
Player 1 Player 2 Neither Player 1 Player 2 Neither
Federer v Davydenko Set 1 2 2 35 3 4 32
Federer v Davydenko Set 2 6 3 65 10 7 67
Federer v Davydenko Set 3 6 0 47 7 2 44
Roddick v Youzhny Set 1 18 2 59 21 3 55
Roddick v Youzhny Set 2 5 3 25 7 4 22
Roddick v Youzhny Set 3 13 7 61 16 10 55
Roddick v Youzhny Set 4 13 3 44 15 5 40
Federer v Roddick Set 1 5 7 35 7 9 31
Federer v Roddick Set 2 13 13 29 16 13 26
Federer v Roddick Set 3 8 18 57 11 19 53
Federer v Roddick Set 4 4 7 29 4 8 28
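
The systematic bias of 1.95 net points per player per set quoted above can be checked directly from Table 6; the lists below simply transcribe the Player 1 and Player 2 columns set by set.

    # Mean bias between the US Open internet site and the author's observation,
    # per player per set, using the values in Table 6.
    author = [2, 2, 6, 3, 6, 0, 18, 2, 5, 3, 13, 7, 13, 3, 5, 7, 13, 13, 8, 18, 4, 7]
    site   = [3, 4, 10, 7, 7, 2, 21, 3, 7, 4, 16, 10, 15, 5, 7, 9, 16, 13, 11, 19, 4, 8]

    differences = [s - a for s, a in zip(site, author)]
    print(f"mean bias = {sum(differences) / len(differences):.2f} net points")   # 1.95
    print(f"site never lower than author: {min(differences) >= 0}")              # True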

8.0. Is there a precision versus reliability trade-off?

There are occasions where an attempt is made to measure a performance indicator precisely but with poor reliability, and where a categorical version of the performance indicator may be much more reliable. Consider the reliability study undertaken by
McLaughlin and O’Donoghue (2001) into the reliability of the CAPTAIN system for
investigating work-rate of primary school children in the playground. A total of 28
children were observed during the inter-operator reliability study. The mean duration of high intensity bursts recorded by McLaughlin and O’Donoghue was 2.36±1.21s and 3.12±2.93s respectively. The duration of high intensity bursts was heteroscedastic (r = 0.853) and the 95% ratio limits of agreement were 0.89 ×/÷ 3.26, showing poor reliability. However, it is clear from the data that for most subjects, the
two observers agreed that the activity was intermittent. If the average duration of high intensity bursts is categorised into no bursts performed, average burst lasting less than 10s, and average burst lasting 10s or greater, as shown in Table 7, the kappa result of 0.52 indicates a
moderate strength of agreement. This shows how a performance indicator, which
may be unreliable when expressed to 0.01s precision, can be more reliable when
expressed less precisely or even categorised into sub-ranges. Such categorical data
may still be important if inferences about energy systems and physiological demands
can be made based on the modal subject performing bursts of mean duration under
10s. It would be unfortunate if one could not even say that playground activity is
intermittent in nature when a categorical version of the variable indicates that it is.
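
For readers unfamiliar with ratio limits of agreement of the ×/÷ form used above, the sketch below shows one way they can be computed from paired mean burst durations, by taking the anti-log of the mean and 1.96 standard deviations of the log-transformed ratios. The paired values are hypothetical and are not the CAPTAIN study data.

    # 95% ratio limits of agreement for paired burst durations (hypothetical data).
    import math

    observer_1 = [2.1, 1.8, 3.0, 2.4, 5.6, 1.2, 2.9, 3.8]   # mean burst duration (s)
    observer_2 = [2.5, 1.5, 4.2, 2.0, 8.1, 1.0, 3.5, 6.0]

    log_ratios = [math.log(b / a) for a, b in zip(observer_1, observer_2)]
    mean_lr = sum(log_ratios) / len(log_ratios)
    sd_lr = math.sqrt(sum((x - mean_lr) ** 2 for x in log_ratios) / (len(log_ratios) - 1))

    bias_ratio = math.exp(mean_lr)            # systematic bias expressed as a ratio
    agreement_ratio = math.exp(1.96 * sd_lr)  # random error component
    print(f"95% ratio limits of agreement: {bias_ratio:.2f} ×/÷ {agreement_ratio:.2f}")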

Table 7. Summary of categorical data for duration of high intensity bursts for objectivity.
                                       Trainee Observer
                           No bursts   <10s   ≥10s   Total
Experienced   No bursts        2         1      0       3
observer      <10s             1        23      1      25
              ≥10s             0         0      0       0
              Total            3        24      1      28

9.0. Recommendations.

Considering the exercises undertaken to try to answer the seven questions discussed in this paper, the following steps are recommended during the development and
operation of a performance analysis system. Steps 1 to 4 are relevant during system
development and steps 5 to 7 should be applied by each new operator using the
system.

1. Identify the performance indicators of interest and define these as precisely as possible. Where the performance indicator represents a complex pattern that
is difficult to define in words, example video sequences for each value of the
indicator may be required to train operators.
2. Identify the values of performance indicators for different types of
performances (tactically, technically or in terms of energy systems).
3. Select a reliability statistic that will have construct validity.
4. Determine what value of this reliability statistic represents an acceptable level of reliability (based on step 2 of these recommendations).


5. Train the operators using intra-operator reliability studies.
6. Undertake an inter-operator reliability study. This stage could also take the form of comparing performance indicators produced using the system with those published in media or internet sources.
7. If the level of reliability achieved by the operators is poor, consider using a
less precise categorical version of the performance indicator.

10.0. Acknowledgements.

The author would like to thank the delegates of the World Congress of Performance Analysis of Sport 7 in Szombathely, Hungary, who participated in the reliability workshop, especially Mike Hughes, Nic James, Martin Lames, Tim McGarry and Albin Tenga, for their comments on the ideas presented in this paper.

11.0. References.

Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman
& Hall.

Armitage, P. (2006). Analysis of the knockout stages of the 2003 rugby world cup,
B.Sc Dissertation, School of Sport, University of Wales Institute Cardiff,
Cyncoed Campus, Cardiff, UK.

Brown, D. and O’Donoghue, P.G. (1999). Specification and evaluation metrics for
identifying biased, lazy and incompetent judges in amateur boxing, Book of
Abstracts, Exercise and Sports Science Association of Ireland, Limerick, 19th
November 1999.

Coalter, A., Ingram, B., McCrorry, P. MBE, O'Donoghue, P.G. and Scott, M. (1998).
A Comparison of Alternative Operation schemes for the Computerised
Scoring System for Amateur Boxing. Journal of Sports Sciences, 16, 16-17.

Cooper, S.M. (2006). Statistical methods for resolving issues relevant to test and
measurement reliability and validity in variables related to sport performance
and physical fitness, Ph.D Thesis, School of Sport, University of Wales
Institute Cardiff, Cyncoed Campus, Cardiff, UK.

Hughes, M. and Bartlett, R. (2002). The use of performance indicators in performance analysis. Journal of Sports Sciences, 20, 735-737.

Hughes, M., Evans, S. and Wells, J. (2001). Establishing normative profiles in performance analysis. International Journal of Performance Analysis of Sport (e), 1, 4-27.

Hughes, M., Cooper, S.M. and Nevill, A. (2004). Analysis of notation data: reliability.
In Notational analysis of sport, 2nd Edition (Edited by M. Hughes and I.M.
Franks). London: Routledge, 189-204.

Hughes, M.G., Rose, G. and Amaral, I. (2005). The influence of recovery duration on
blood lactate accumulation in repeated sprint activity. Journal of Sports
Sciences, 23, 130-131.

Hunter, P. and O’Donoghue, P.G. (2001). A match analysis of the 1999 Rugby Union
World Cup. In Proceedings of the World Congress of Performance
Analysis, Sports Science and Computers (PASS.COM) (Edited by M.
Hughes and I.M. Franks). Cardiff: CPA Press, UWIC, 85-90.

McCorry, M., Saunders, E.D., O'Donoghue, P.G. and Murphy, M.H. (1996). A match
analysis of the knockout stages of the 3rd Rugby Union World Cup. In
Proceedings of the World Congress of Notational Analysis of Sport III
(Edited by M. Hughes). Cardiff: CPA Press, UWIC, 230-239.

McLaughlin, E. and O’Donoghue, P.G. (2001). The reliability of time-motion analysis using the CAPTAIN system. In Proceedings of the World Congress of Performance Analysis, Sports Science and Computers (PASS.COM) (Edited by M. Hughes and I.M. Franks). Cardiff: CPA Press, UWIC, 63-68.

Mullan, A. and O’Donoghue, P.G. (1999). An alternative computerised scoring system for amateur boxing, Book of Abstracts, Exercise and Sports Science Association of Ireland, Limerick, 19th November 1999.

Mullan, A. and O’Donoghue, P.G. (2001). An alternative computerised scoring system for amateur boxing. In Proceedings of the World Congress of Performance Analysis, Sports Science and Computers (PASS.COM) (Edited by M. Hughes and I.M. Franks). Cardiff: CPA Press, UWIC, 359-364.

O’Donoghue, P.G. (2002). Performance models of ladies’ and men’s singles tennis at
the Australian Open. International Journal of Performance Analysis of
Sport (e), 2(1), 73-84.

O’Donoghue, P.G. (2005). An Algorithm to use the Kappa Statistic to establish Reliability of Computerised Time-Motion Analysis Systems, Book of Abstracts, 5th International Symposium of Computer Science in Sport, Hvar, Croatia, 25th-28th May, pp. 49.

O’Donoghue, P.G. (2006). Performance indicators for possession and shooting in international netball. In Performance Analysis of Sport 7 (Edited by H. Dancs, M. Hughes and P.G. O’Donoghue). Cardiff: CPA UWIC Press, 459-467.

O’Donoghue, P.G. and Longville, J. (2004). Reliability testing and the use of
Statistics in performance analysis support: a case study from an international
netball tournament. In Performance Analysis of Sport 6 (Edited by P.G.
O’Donoghue and M.D. Hughes). Cardiff: CPA Press, UWIC, pp. 1-7.

O'Donoghue, P.G. and Murphy, M.H. (1996). Object Modelling and Formal Specification during Real-time System Development. Journal of Network and Computer Applications, 19, 335-352.

O’Donoghue, P.G. and Parker, D. (2001). Time-motion analysis of FA Premier League soccer competition. In Proceedings of the World Congress of Performance Analysis, Sports Science and Computers (PASS.COM) (Edited by M. Hughes and I.M. Franks). Cardiff: CPA Press, UWIC, 263-266.

O'Donoghue, P.G., Hughes, M.G., Rudkin, S., Bloomfield, J., Cairns, G., Powell, S.
(2005). Work rate analysis using the POWER (Periods of Work Efforts and
Recoveries) System. International Journal of Performance Analysis of
Sport (e), 5(1), 5-21.

Scott, M. (2006). A system for scrutinising the activity of judges operating the
computerised scoring system in amateur boxing, M.Sc thesis, Faculty of
Engineering, University of Ulster, Jordanstown, UK.

Shore, E.G. (1991). Analysis of a multi-institutional series of completed cases, Scientific Integrity Symposium, Harvard Medical School, Boston, USA, February 1991. Cited in Thomas, J.R. and Nelson, J.K. (1996). Research Methods in Physical Activity, 3rd Edition. Champaign, Il: Human Kinetics.

Sockett, W. (2006). Observational analysis of type and intensity of children’s play during
primary school break times, B.Sc Dissertation, School of Sport, University of
Wales Institute Cardiff, Cyncoed Campus, Cardiff, UK.

US Open (2006). www.usopen.org, accessed 12/09/06.

Williams, L. and O’Donoghue, P.G. (2006). Defensive strategies used by international netball teams. In Performance Analysis of Sport 7 (Edited by H. Dancs, M. Hughes and P.G. O’Donoghue). Cardiff: CPA UWIC Press, 474-479.
