
Statistical Hydrology, Fall 2012

Carlos Serrano Moreno

Principal Component Analysis of Precipitation in Spain

One of the common techniques used to deal with large data sets is Principal Component Analysis (PCA). This statistical method is frequently used in the geophysical sciences to explain the correlations in a large set of variables by means of a smaller number of independent components. In order to get familiar with the PCA technique, this project uses data registered by 19 weather stations in Spain to find the relationship among the registered variables that provides a good estimator of rainfall. The PCA results showed that the best way to predict precipitation as a function of the other registered variables was a regional approach, so one expression was fitted to every single location. Most of the research centres that work on climate change modelling do not provide estimates of precipitation, but they do offer predictions of other variables, such as temperature, atmospheric pressure and geopotential height, that are easier to predict. Taking into consideration the variables that are available for future scenarios, a multiple regression approach was used to predict precipitation at every station.

Keywords: Principal Component Analysis, multiple regression, precipitation estimation, climate change scenario, Mediterranean areas.




1 Introduction

Flood forecasting is one of the most important challenges in the hydrological sciences today. Issuing alerts with adequate lead time before a flood event mitigates its impact and brings enormous social benefits. Mediterranean areas are especially vulnerable to flash floods because of the steep slopes and the large amount of runoff that drains over the impermeable surfaces of the catchments. To gain enough lead time to mitigate the effects of these hazardous events, researchers use products such as Numerical Weather Predictions (NWPs) or rainfall observations provided by weather radars. However, even though these products can provide rainfall estimates at fine resolution, the large uncertainty embedded in them means that the estimates have to be pre-processed and corrected before being used as input for hydrologic models.

To correct these estimates, researchers try to use all the information available, which means that it is also important to work with the data registered directly by weather stations. Whereas tools such as NWPs or weather radars only provide rainfall estimates over the whole study area, weather stations provide a direct measurement of the variables at a location. Given the large number of weather stations available in Mediterranean areas, it also becomes important to learn how to deal with large data sets obtained at different positions. It is very important to distinguish the stations that provide relevant data and to be able to reject the stations that provide redundant information.

Another typical situation in which being able to prioritize the data is important also appears when dealing with weather stations. Especially when working over a large domain (national or continental scale), someone looking for long-term rainfall predictions finds agencies or organizations that provide estimates of variables that can be related to rainfall, such as temperature or pressure. Normally, however, a direct estimate of rainfall is unavailable. In this case too, the study of the data available from the local stations can help to find an accurate relation between rainfall and temperature, pressure or any other variable.

One of the common techniques used to deal with large data sets is Principal Component Analysis (PCA). This statistical method is frequently used in the geophysical sciences to explain the correlations in a large set of variables by means of a smaller number of independent components.

In order to get familiar with the PCA technique, this project uses data registered by 19 weather stations in Spain to find the relationship among the registered variables that provides a good estimator of rainfall.

1.1 Available data and objectives:

For this project, and thanks to the Spanish Meteorological Agency (AEMET), monthly data registered at 19 different weather stations located in different provinces are available. At most locations the records run from January 1920 until August 2012.

Some of the variables registered at the weather stations are: Month, Year, Av. Temperature, Max. Temp., Min. Temp., Total Precipitation, Max. Daily precipitation, Rainy days, Snowy days, Hail days, Atmospheric Pressure and average insolation. Because of the large number of variables, PCA is used to identify which of them are closely related to rainfall and to look for a way of predicting the monthly rainfall from a combination of the variables listed here.

PCA makes it possible not only to identify which variables carry the most weight, but also to obtain new variables that are linearly independent of each other. The new variables that explain the highest % of the variance are then chosen to predict the monthly rainfall. PCA therefore reduces the complexity of the problem, because a smaller number of variables is involved in the estimation of the rainfall.

1.2 Procedure:

The PCA is performed for every individual weather station as well as for the whole sample. The first goal is to identify whether the vectors of the PCA basis are the same (or involve the same variables in the same way) for each station. According to the results, it is then interesting to find out whether the same PCA basis can be used to understand the problem all over Spain or whether, on the other hand, there are climatic regions inside the country that follow different patterns (some variables will be strongly correlated with rainfall in some areas but not in others). If a common relationship between the variables is observed at all the stations, it will be used to characterize the precipitation over Spain. However, because of the large climatic differences between the regions of the country, each meteorological variable is expected to play a different role in each climatic area.
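As a rough illustration of the first objective in Section 1.1 (finding which variables are closely related to rainfall), the sketch below computes Pearson correlations between each variable and the total precipitation and ranks the variables by the strength of that relation. The variable names and the monthly values are made up for the example and are not AEMET data; the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlation between every pair of named columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def rank_by_rainfall_correlation(columns, rainfall_name):
    """Variable names sorted by decreasing |correlation| with the rainfall column."""
    rainfall = columns[rainfall_name]
    others = [name for name in columns if name != rainfall_name]
    return sorted(others, key=lambda name: abs(pearson(columns[name], rainfall)), reverse=True)

# Made-up monthly values for one hypothetical station (NOT data from the study).
toy_station = {
    "av_temp": [9.0, 10.5, 13.0, 15.5, 19.0, 23.0],
    "pressure": [1018.0, 1016.0, 1015.0, 1013.0, 1014.0, 1016.0],
    "total_precip": [40.0, 35.0, 45.0, 50.0, 30.0, 20.0],
}

corr = correlation_matrix(toy_station)
ranking = rank_by_rainfall_correlation(toy_station, "total_precip")
for name in ranking:
    print(name, round(corr[name]["total_precip"], 2))
```

With real records, the same ranking could be repeated station by station to see whether the variables most related to rainfall change from one climatic region to another.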

2 Method of analysis

2.1 Principal Component Analysis (PCA)

As stated before, the statistical technique chosen to analyze the available data set is Principal Component Analysis (PCA), which is often described as the most common form of factor analysis. This technique produces new variables (also known as dimensions or principal components) that are linear combinations of the original variables registered by each weather station. By definition, these new variables are uncorrelated with one another, and they capture as much of the original variance in the data as possible. The first principal component gives the direction of greatest variability (variance) in the data; the second principal component is the next orthogonal (uncorrelated) direction of greatest variability. The same procedure is followed to find all the principal components. The number of principal components that can be obtained is equal to the number of variables in the data set (in this study there are 10 variables: apart from the atmospheric variables registered, the month to which the data correspond is also included).

In other words, PCA is not only a way of transforming a set of correlated variables into a set of uncorrelated ones, but also a method for reducing the number of parameters of the problem. The possibility of reducing the number of dimensions is one of the most widely used properties of PCA. It is not easy to decide how many principal components should be used in the analysis. Depending on the needs of the research, it may be acceptable to lose some information in order to work with a smaller number of variables. If the initial set of variables is highly correlated, it is possible to work with only the first few principal components, because a small number of them explains almost the same variability as the full data set.

Understanding the physical meaning of the principal components can be very interesting, but it is not always possible. For this reason, when the problem cannot be reduced to a small number of variables it can be better to work with the initial set of variables: even though the problem is more complex, the result is easier to understand if all the variables have a clear physical meaning.

2.2 Application of PCA to the data set:

In this study PCA is applied to the data registered at 19 different meteorological stations located in the main cities of Spain. In contrast to the usual situations in which PCA is used, the variables studied here are not highly correlated with each other, as can be seen in Table 1. It may therefore seem that applying PCA in this situation is the wrong decision. However, another interesting feature of PCA is that it also reveals insights and hidden relations between the variables that cannot be seen through a simple analysis of the registered values.

Using the R software and the RCommander package, the result of the PCA for the data registered in Barcelona is shown in the following figure and table:

Figure 1: Scree plot showing the % of explained variance.

Table 1: Covariance matrix of the variables registered at the weather station in Barcelona. (Row and column labels as extracted: Atmos. Pressure, Av. Temp, Frost Height, Insolation %, Max. Daily Rainfall, Month, Rainy Days, Total Precip., Wind dir, Wind vel; the cell values could not be recovered in order.)
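The construction described in Section 2.1 can be sketched as follows. This is a minimal, standard-library illustration and not the procedure actually run in R for the study: it standardizes the variables, builds their correlation matrix, and extracts the principal components one at a time by power iteration with deflation (a simple technique chosen here only because it needs no external library). The function names and the toy data are assumptions made for the example.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(p)]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]

def leading_eigenpair(a, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix (power iteration)."""
    p = len(a)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(a[i][j] * v[j] for j in range(p)) for i in range(p))
    return eigenvalue, v

def pca(rows):
    """List of (eigenvalue, loading vector) pairs, from largest to smallest variance."""
    a = correlation_matrix(standardize(rows))
    p = len(a)
    components = []
    for _ in range(p):
        lam, v = leading_eigenpair(a)
        components.append((lam, v))
        # Deflation: remove the direction just found so the next one is orthogonal to it.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return components

# Made-up observations (columns: av_temp, pressure, total_precip); NOT data from the study.
toy_rows = [
    [9.0, 1018.0, 40.0],
    [10.5, 1016.0, 35.0],
    [13.0, 1015.0, 45.0],
    [15.5, 1013.0, 50.0],
    [19.0, 1014.0, 30.0],
    [23.0, 1016.0, 20.0],
]

components = pca(toy_rows)
for k, (lam, loadings) in enumerate(components, start=1):
    print(k, round(lam, 3), [round(x, 3) for x in loadings])
```

Because the variables are standardized, the eigenvalues add up to the number of variables, which is why each eigenvalue divided by that number gives the share of variance explained by its component.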

As Figure 1 shows, there are 10 principal components that together represent the whole variance contained in the data set from Barcelona. If the main objective of the research were to reduce the complexity of the problem, it would be possible to use the first 6 principal components and reproduce around 86 % of the variability, as can be seen in Table 2.

Component                   1      2      3      4      5      6      7      8      9      10
Eigenvalue                  2.99   1.83   1.11   0.96   0.92   0.77   0.67   0.35   0.29   0.11
% of variance               29.90  18.35  11.10  9.61   9.20   7.67   6.73   3.52   2.90   1.02
Cumulative % of variance    29.90  48.25  59.35  68.96  78.16  85.83  92.56  96.08  98.98  100

Table 2: Variance explained by every principal component.

It is also interesting to analyze the contribution of each registered variable to the principal components obtained. Figure 2 shows the role that each variable plays in the first 2 principal components (these two components alone represent almost 50 % of the variance of the data set).
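The choice of how many components to keep can be checked with a few lines of arithmetic. The sketch below takes the eigenvalues listed in Table 2 for Barcelona, converts them to percentages of variance, and returns the smallest number of components needed to reach a target cumulative percentage. The function names are hypothetical, and because the eigenvalues are rounded to two decimals the percentages differ slightly from those in the table.

```python
# Eigenvalues of the 10 principal components as listed in Table 2 (Barcelona).
eigenvalues = [2.99, 1.83, 1.11, 0.96, 0.92, 0.77, 0.67, 0.35, 0.29, 0.11]

def explained_variance(eigs):
    """Return (% of variance, cumulative % of variance) for each component."""
    total = sum(eigs)
    shares = [100.0 * e / total for e in eigs]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative

def components_needed(eigs, target_percent):
    """Smallest number of leading components whose cumulative share reaches the target."""
    _, cumulative = explained_variance(eigs)
    for k, value in enumerate(cumulative, start=1):
        if value >= target_percent:
            return k
    return len(eigs)

shares, cumulative = explained_variance(eigenvalues)
for k, (share, cum) in enumerate(zip(shares, cumulative), start=1):
    print(f"PC{k}: {share:.1f} % (cumulative {cum:.1f} %)")

print(components_needed(eigenvalues, 85.0))
```

With these eigenvalues, six components are needed to pass 85 % of the variance, which agrees with the statement that the first 6 components reproduce around 86 % of the variability.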

Figure 2: Variables factor map for the first 2 principal components for the data registered in Barcelona.

It is difficult to draw any conclusion from Figure 2, so it is not possible, or at least not trivial, to guess the physical meaning that the principal components would have in this case.

Nevertheless, it is important to keep in mind that the main objective of this analysis is to find the variables that have the strongest relation with precipitation, so as to be able to estimate it for future scenarios.

The same PCA is performed for the remaining 18 stations and similar results are obtained. It is important to point out that for all the stations the principal components explain practically the same % of variance, and the loadings of the registered variables are also practically the same. Another example can be seen in Figure 3 for the variables registered in Guadalajara.

Figure 2: Variables factor map for the first 2 principal components for the data registered in Guadalajara.

From a purely visual analysis of these variables factor maps, one may be tempted to think that the data set does not depend on the station where it was registered, so that all the meteorological variables play a similar role around the country.

2.3 Discussion of the results obtained after performing the PCA:

After performing the PCA for each meteorological station, a decision had to be taken between finding a single relation to extrapolate the precipitation for the whole country and finding an individual relation for each station. The first idea was to find a single relationship, but after analyzing together the contribution of each variable to the first components, some interesting points were discovered.

Figure 3: Contribution of each registered variable to the first principal component for each station.

As can be seen in Figure 3, there is a main trend that most of the stations follow. However, some of the variables show singularities that may play a key role when trying to estimate the precipitation. Looking at the average temperature, it is possible to see 3 different behaviors: in most of the cities the coefficient is negative; it gets close to zero for the stations located in Barcelona, Girona, Tarragona and Castellón (4 cities along the Mediterranean coast, in the north-east of Spain); and it becomes positive in Zaragoza and Palencia (Zaragoza is a city 300 km to the west of Barcelona, and Palencia is another city, close to the border with Portugal, whose climate could be defined as continental). Figure 4 shows a climatic map of Spain where the position of these cities is shown. As can be seen, the cities distributed along the Mediterranean coast, and also Zaragoza and Palencia, have particular climatic conditions that make them different from the rest of the cities used in this study. It seems, then, that the singularities of each city can also be observed by taking a detailed look at the PCA results.

More differences appear when taking a detailed look at the other variables. The cities mentioned before show coefficients that differ from those of the majority of the cities for most of the variables. However, there are some variables, such as wind velocity, for which cities like Cadiz can also be differentiated from the main trend. It is interesting to note that Cadiz is very close to the Strait of Gibraltar and is one of the most important places in Europe for surfing.

Something similar happens with Jaen if one focuses on the insolation %. Jaen is one of the Spanish provinces where most of the olive oil is produced, and one should take into account that the quality of the olives is closely related to the insolation they receive. Even though these conclusions are only qualitative and seem hard to prove, it is very interesting to see how PCA points out insights that would be very difficult to see from a simple observation of the registered values.

Figure 4: Climate in Spain and position of the cities where the coefficient of the average temperature in the first principal component differs from the one shown by most of the stations.

As a conclusion of the PCA for every station, it seems logical that the estimation of the precipitation will perform better if one works at a regional scale rather than trying to deal with the problem for the whole country.
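The comparison of first-component coefficients across stations described in Section 2.3 can be sketched as below. Two assumptions are made that are not stated in the text: since the overall sign of a principal component is arbitrary, each station's loading vector is first aligned with a reference station before the signs are compared, and a tolerance of 0.1 is used to decide when a coefficient counts as close to zero. The station names and loading values are hypothetical and are not results from the study.

```python
def align_sign(loadings, reference):
    """Flip a loading vector if it points opposite to the reference vector."""
    dot = sum(a * b for a, b in zip(loadings, reference))
    return [-x for x in loadings] if dot < 0 else list(loadings)

def classify(value, tolerance=0.1):
    """Label a coefficient as negative, near zero or positive."""
    if value > tolerance:
        return "positive"
    if value < -tolerance:
        return "negative"
    return "near zero"

def group_stations(first_pc, variable_index, reference_station, tolerance=0.1):
    """Group stations by the sign of one variable's coefficient in the first component."""
    reference = first_pc[reference_station]
    groups = {"negative": [], "near zero": [], "positive": []}
    for station, loadings in first_pc.items():
        aligned = align_sign(loadings, reference)
        groups[classify(aligned[variable_index], tolerance)].append(station)
    return groups

# Hypothetical first-component loadings (order: av_temp, pressure, wind_vel).
# These numbers are illustrative only and are NOT values from the study.
toy_loadings = {
    "station_a": [-0.45, 0.30, 0.20],
    "station_b": [0.44, -0.31, -0.18],  # same pattern as station_a with the sign flipped
    "station_c": [0.02, 0.35, 0.25],
    "station_d": [0.30, 0.40, 0.30],
}

groups = group_stations(toy_loadings, variable_index=0, reference_station="station_a")
print(groups)
```

Grouping the stations in this way is one possible starting point for the regional approach, since stations that fall in the same group for several variables are candidates for sharing one regression expression.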

3 Analysis and Results

Thanks to the PCA it was possible to decide that the best way to predict the precipitation as a function of the other registered variables was to use a regional approach. One expression was then fitted to every single location.

Most of the research centres that work on climate change modelling are not able to provide reliable estimates of rainfall, because rainfall is a very random phenomenon. However, these institutions provide estimates of other variables, such as temperature, atmospheric pressure and geopotential height, that are easier to predict.

Taking into consideration the variables that are available for future scenarios, the following multiple regression approach was used to predict precipitation at every station:

[regression expression not recoverable from the source]

Even though the approach suggested here to estimate the precipitation is very simple (according to the correlation matrix shown in the previous section) and is not able to offer a good prediction of precipitation, an interesting result can be seen in Figure 5.

Figure 5: Correlation obtained after fitting the multivariate regression model suggested to estimate precipitation.
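Since the regression expression itself is not available here, the sketch below only illustrates the general technique named in the text: a multiple linear regression fitted by least squares at one station, followed by the correlation between observed and predicted precipitation (the quantity reported in Figure 5). The linear form with two predictors, the predictor names and the synthetic data are all assumptions made for the example, not the expression or data used in the study.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple_regression(predictors, target):
    """Least-squares fit of target = b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, row):
    """Evaluate the fitted linear expression for one set of predictor values."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Synthetic station: (temperature in deg C, pressure anomaly in hPa); NOT data from the study.
predictors = [(10.0, 5.0), (12.0, 3.0), (15.0, -2.0), (18.0, 0.0), (22.0, 4.0), (25.0, -1.0)]
# Synthetic "observed" precipitation generated from a known linear rule, so the fit is exact.
observed = [60.0 - 1.5 * t + 2.0 * p for t, p in predictors]

coefficients = fit_multiple_regression(predictors, observed)
predicted = [predict(coefficients, row) for row in predictors]
correlation = pearson(observed, predicted)
print([round(c, 3) for c in coefficients], round(correlation, 3))
```

In the regional approach described above, a fit of this kind would be repeated independently for every station, and the resulting correlations compared between climatic areas.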

It is interesting to see that the model is not able to predict precipitation at all at the stations closest to the Mediterranean. The areas affected by the Atlantic and continental climates give a better result, even though the model suggested here is not valid there either.

Nevertheless, the result is very logical. The rainfall events that take place in the Atlantic areas are easier to predict because they originate mainly from oceanic storms, which are large-scale phenomena that always develop under certain atmospheric pressure and wind conditions. The Mediterranean climate, however, is known for convective rainfall events that happen at a smaller scale and are much more difficult to predict. It is clear, then, that at least in these areas the problem should be approached using different variables.

4 Conclusions

As described in the previous sections, PCA has proved to be a useful technique for studying the problem posed here. In this case PCA was not used to reduce the dimension of the problem, because the variables of the data set showed a very low correlation with each other. However, PCA was useful for discovering relations between the variables and the location that could not have been observed directly. It is very important to analyze the output of the PCA from different points of view. Thanks to the output of the PCA it was possible to identify the most suitable way of dealing with the problem: using a simple approach to estimate precipitation at each station instead of doing it for the whole country.

The linear model used to estimate the precipitation does not give a good result, but it can be helpful as a first approach for further research. Even though the linear regression model performs very poorly, another interesting and consistent result is that precipitation in the Mediterranean areas is harder to predict, while in the Atlantic climate areas the atmospheric variables are more closely correlated.

5 References

Reference paper:

Kahya E., S. Kalayci, and T.C. Piechota, 2008: Streamflow Regionalization: Case Study of Turkey. Journal of Hydrologic Engineering, Vol 13, No. 4, pp. 205-214.

Other papers:

Basalirwa C.P.K., J.O. Odiyo, R.J. Mngodo, and E.J. Mpeta, 1999: The Climatological Regions of Tanzania Based on the Rainfall Characteristics. International Journal of Climatology, 19: 69-80.

Lê S., J. Josse, and F. Husson, 2008: FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, Volume 25, Issue 1.

Stathis D. and D. Myronidis, 1999: Principal Component Analysis of Precipitation in Thessaly Region (Central Greece). Global NEST Journal, Vol 11, No 4, pp 467-476.