Frontiers of Computational Journalism

Columbia Journalism School
Week 1: Basics
September 4, 2013

Lecture 1: Basics
• Computer Science and Journalism
• Representing Data
• Interpreting High Dimensional Data

Computational Journalism: Definitions
“Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking.”
- Cohen, Hamilton, Turner, Computational Journalism

Computational Journalism: Definitions
“Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.”
- Cohen, Hamilton, Turner, Computational Journalism

Cohen et al. model
[Diagram: Data, Reporting, User, Computer Science]

CS for presentation / interaction
[Diagram: Data, Reporting, User, with two CS labels]

Filter many stories for user
[Diagram: several Data and Reporting blocks with CS labels, plus Filtering and User]

Examples of filters
• What an editor puts on the front page
• Google News
• Reddit’s comment system
• Twitter
• Facebook news feed
• Techmeme
• …

Memetracker by Leskovec, Backstrom, Kleinberg

Kony 2012 early network, by Gilad Lotan / Socialflow

Track effects
[Diagram: several Data and Reporting blocks with CS labels, plus Filtering, User, and Effects]

Computer Science in Journalism
• Reporting
• Presentation
• Filtering
• Tracking

Computational Journalism: Definitions
“the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government.”
- Jonathan Stray, A Computational Journalism Reading List

Course Structure
• Information retrieval: TF-IDF, search engines
• Text analysis: clustering and topic modeling
• Information filtering systems
• Social network analysis
• Knowledge representation
• Drawing conclusions from data
• Information security
• Tracking flow and effects

[Diagram labels: Information Retrieval, Data Science, Natural Language Processing, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions, Sociology, Graph Theory, Artificial Intelligence, Statistics, Cognitive Science]

Administration
Assignment after each class
Four assignments require programming, but your writing counts for more than your code!

Course blog
http://jmsc.hku.hk/courses/jmsc6041spring2013/

Final project
to be completed Feb-April

Lecture 1: Basics
• Computer Science and Journalism
• Representing Data
• Interpreting High Dimensional Data

Definition of data
a collection of similar pieces of information

structured data

unstructured data

Vector representation of objects
Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Each x_i is a numerical or categorical feature.
N = number of features, or “dimension”.

Examples of features
• number of claws
• latitude
• color ∈ {red, yellow, blue}
• number of break-ins
• 1 for “bought X”, 0 for “did not buy X”
• time, duration, etc.
• number of times word Y appears in document
• votes cast
• …
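A minimal sketch of how a few of these features might be packed into one vector. The record fields, the `to_vector` helper, and the choice of one-hot encoding for the categorical color are illustrative assumptions, not something the slides prescribe.

```python
# Hypothetical example: turn one raw record into a flat numeric feature vector.
COLORS = ["red", "yellow", "blue"]  # the categorical values listed above


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def to_vector(record):
    """Map a dict of raw features to a list of numbers (assumed field names)."""
    return (
        [float(record["claws"]), float(record["latitude"])]
        + one_hot(record["color"], COLORS)
        + [1.0 if record["bought_x"] else 0.0]
    )


record = {"claws": 4, "latitude": 40.8, "color": "yellow", "bought_x": False}
x = to_vector(record)
print(x)       # the feature vector
print(len(x))  # N, the number of features
```

One-hot encoding is only one way to make a categorical feature numeric; it avoids implying an order among the colors.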

“Feature selection”
Technical meaning in machine learning etc.: which variables matter?
We’re journalists, so we’re interested in an earlier process: how to describe the world in numbers?

Choosing Features
Journalism: How do we represent the world numerically?

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Machine learning: Which variables carry the most information?

\[
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k \le N
\]
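The machine-learning half of this slide picks k of the N coordinates through an index map f. As a rough sketch, the code below uses per-column variance as a stand-in for "information"; that criterion, the toy data, and the function names are assumptions made for illustration, and real feature selection methods use more careful measures.

```python
from statistics import pvariance


def top_k_by_variance(rows, k):
    """Return the indices f(1), ..., f(k) of the k highest-variance columns."""
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])  # keep the original column order


def select(row, indices):
    """Build the reduced vector [x_f(1), ..., x_f(k)] from a full vector."""
    return [row[j] for j in indices]


# Made-up data: column 0 never changes, so it cannot distinguish the rows.
rows = [
    [1.0, 5.0, 0.0],
    [1.0, 9.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 7.0, 1.0],
]
f = top_k_by_variance(rows, k=2)
reduced = [select(r, f) for r in rows]
print(f)
print(reduced)
```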

Different types of “quantitative”
• Numeric
– continuous
– countable
– bounded?
– units of measurement?

• Categorical
– finite, e.g. {on, off}
– infinite, e.g. {red, yellow, blue, ... chartreuse, …}
– ordered?
– equivalence classes or other structure?

Different types of scales
Temperature
Continuous scale, fixed zero point, physical units, comparative, uniform

Likert Scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale
It’s not linear, so is 2·X1 twice as good?
(X1 + c) - (X2 + c) ≠ X1 - X2 in meaning, even though the numbers are equal: the same numeric gap can stand for different real differences at different points on the scale.
Lots of things don’t make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
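A small illustration of why the mean is fragile on such a scale while rank statistics are not. The two response groups and the alternative coding are invented for this sketch; the only assumption is that any order-preserving relabeling of the five answers is equally legitimate.

```python
from statistics import mean, median

# Two hypothetical groups of Likert responses coded 1..5.
group_a = [1, 1, 5, 5, 5]
group_b = [3, 3, 4, 4, 4]

# Another coding of the same five answers that preserves their order.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def compare(stat, a, b):
    """Return 'A' if stat(a) > stat(b), 'B' if it is smaller, '=' if tied."""
    sa, sb = stat(a), stat(b)
    return "A" if sa > sb else "B" if sa < sb else "="


a2 = [recode[v] for v in group_a]
b2 = [recode[v] for v in group_b]

mean_before = compare(mean, group_a, group_b)
mean_after = compare(mean, a2, b2)
median_before = compare(median, group_a, group_b)
median_after = compare(median, a2, b2)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The group that "wins" on the mean changes with the coding, while the comparison of medians stays the same, which is the sense in which rank order statistics are robust.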

Other issues with “quantitative”
• Where did the data come from?
– physical measurement
– computer logging
– human recording

• What are the sources of error?
– measurement error
– missing data
– ambiguity in human classification
– process errors
– intentional bias / deception

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Even with all these caveats, the vector representation is incredibly flexible and powerful.

Examples of vector representations
Obvious
– movies watched / items purchased
– legislative voting history for a politician
– crime locations

Less obvious, but standard
– document vector space model
– psychological survey results

Tricky research problem: disparate field types
– corporate filing document
– WikiLeaks SIGACT

What can we do with vectors?
Predict one variable based on others
– this is called “regression”
– supervised machine learning

Group similar items together
– this is classification (classes known in advance) or clustering (classes found from the data)
– we may or may not know pre-existing classes
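To make "predict one variable based on others" concrete, here is a minimal sketch of ordinary least squares with a single predictor. The data are made up to lie exactly on a line, and `fit_line` is a hypothetical helper, not a library call.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 4.0 + intercept  # prediction for an unseen x = 4
print(slope, intercept, predicted)
```

This is "supervised" in the sense that the known ys are used to fit the model before it is asked about a new x.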

 

Lecture 1: Basics
• Computer Science and Journalism
• Representing Data
• Interpreting High Dimensional Data

Interpreting High Dimensional Data

UK House of Lords voting record, 2000-2012.
N = 1043 votes by M = 1630 lords
2 = aye, 4 = nay, -9 = didn't vote

Vote vectors
Let v(i,j) = vote of MP i on issue j. Then we can look at all votes for a particular MP:

\[
\mathrm{mp}_i = \begin{bmatrix} v(i,0) & v(i,1) & \cdots & v(i,N-1) \end{bmatrix}
\]

Now we have M = 1630 vectors, each of dimension N = 1043. What could we learn from this? What is their structure?
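A sketch of building such vote vectors from the raw codes given above (2 = aye, 4 = nay, -9 = didn't vote). Recoding them to +1, -1, and 0 is an assumption made here so that a dot product measures agreement; the tiny table is invented and stands in for the real 1630 by 1043 data.

```python
# Raw codes as described in the data set: 2 = aye, 4 = nay, -9 = didn't vote.
# The numeric recoding below is an illustrative assumption.
RECODE = {2: 1.0, 4: -1.0, -9: 0.0}


def vote_vector(raw_votes):
    """Turn one member's raw vote codes into a numeric vector."""
    return [RECODE[v] for v in raw_votes]


def agreement(u, v):
    """Dot product: +1 per shared vote, -1 per opposed vote, 0 if either abstained."""
    return sum(a * b for a, b in zip(u, v))


# Made-up table: 3 members (rows) by 4 votes (columns).
raw = [
    [2, 4, -9, 2],
    [2, 4, 4, 2],
    [4, 2, 2, -9],
]
members = [vote_vector(row) for row in raw]

print(agreement(members[0], members[1]))
print(agreement(members[0], members[2]))
```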

Visualizing High Dimensional Data

We can visualize 3 dimensions at a time. What do we do with 1043?

Looking at all MPs for votes 100, 200, 300

Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.
We have to go from

\[
x \in \mathbb{R}^N
\]

to much lower dimensional points

\[
y \in \mathbb{R}^K, \quad K \ll N
\]

Probably K=2 or K=3.

This is called "projection"

Projection from 3 to 2 dimensions

Think of this as rotating to align the "screen" with coordinate axes, then simply throwing out values of higher dimensions.

Projection from 3 to 2 dimensions

Direction of projection matters!
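The "rotate, then throw out a coordinate" idea can be sketched in a few lines. The rotation axis, the two sample points, and the function names are choices made for this illustration only.

```python
import math


def rotate_y(p, angle):
    """Rotate point p = (x, y, z) about the y axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project(p, angle):
    """Rotate to align with the 'screen', then discard the third coordinate."""
    x, y, _ = rotate_y(p, angle)
    return (x, y)


# Two points that differ along different axes.
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]

front = [project(p, 0.0) for p in points]          # looking along the z axis
side = [project(p, math.pi / 2) for p in points]   # looking along the x axis

print(front)
print(side)
```

From the front, the second point lands on the origin; from the side, the first one does (up to floating-point rounding). Each viewing direction hides a different part of the structure.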

Which direction should we look from?
Intuition: find a direction that "spreads out" points.

House of Lords PCA analysis

Principal Components Analysis finds the directions of maximum variance. Here, we're plotting the two dimensions of greatest variance.
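A rough, dependency-free sketch of finding the two directions of greatest variance. It uses power iteration with deflation on the covariance matrix, which is one simple way to compute principal components and not necessarily how the plot in the lecture was produced; practical work would normally rely on a linear algebra library. The six sample points are invented.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """d x d covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def top_direction(cov, iters=200, seed=0):
    """Power iteration: approximate the unit eigenvector with the largest eigenvalue."""
    d = len(cov)
    rng = random.Random(seed)  # random start, so it is unlikely to miss a direction
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def pca(rows, k=2):
    """Return k principal directions and each row's coordinates along them."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    components = []
    for _ in range(k):
        w = top_direction(cov)
        variance = sum(w[i] * cov[i][j] * w[j] for i in range(d) for j in range(d))
        components.append(w)
        # Deflation: remove the variance already explained by w.
        cov = [[cov[i][j] - variance * w[i] * w[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[j] * w[j] for j in range(d)) for w in components] for r in x]
    return components, scores


# Made-up 3-D points that vary mostly along the x = y diagonal.
rows = [
    [3.0, 3.0, 0.0], [-3.0, -3.0, 0.0],
    [1.0, -1.0, 0.0], [-1.0, 1.0, 0.0],
    [0.0, 0.0, 0.5], [0.0, 0.0, -0.5],
]
components, scores = pca(rows, k=2)
print(components)
print(scores)  # 2-D coordinates that could be plotted
```

The sign of each direction is arbitrary, so a plot may come out mirrored without changing its meaning.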

Interpretation requires context

Conservatives and Liberal Democrats really do vote together, mostly. Cross-benchers and bishops in the middle, Labour opposite.
