ISTE-612 Knowledge Processing Technologies
Week 1

Agenda
•  Course overview
•  Syllabus:
–  Instructor
–  Basics
–  Course materials
–  Grading
–  Schedule
–  Dishonesty policy
•  Module 1


Self Introduction
•  Day Job: Data Scientist at Booz Allen
Hamilton
–  Consult on data science problems

•  General research interests:
–  Biologically Inspired Computing, Machine
Learning, and Data Mining

•  Academic Background:
–  RIT: Biotechnology and Bioinformatics
–  George Mason: Computer Science

Who are you?
•  Name, program/year, where from
•  Research/focus area or specialty of interest
•  Courses taken, skills/experience related to this
course
•  Why do you want to take this course?
•  What do you want to get from the course?
•  What would make you like/hate this course?
•  Anything else


Course Overview
•  Data has become largely unstructured
–  This course focuses on unstructured data (ISTE-610 covers structured data)

•  Overall goal: access and process large-scale
unstructured data
–  Build systems to process, model, and store
unstructured data for convenient and accurate
retrieval of information (Module 1)
–  Develop algorithms to extract high-level knowledge
from data (Module 2)


Basics
•  Class Time and Location: M 6:30pm-9:15pm, 70-2650  
•  Instructor: Paul Yacci
–  Office: 70-2634
–  Email: pmyics@rit.edu (Please include “612” in the
subject.)
–  Office hours
•  Monday: 5:30pm – 6:30pm
•  MyCourses Conference
•  *Some classes will meet online*


Basics
•  Lecture covering the topics for the week
•  Remainder of the time to work on labs,
discuss projects, and meet one-on-one

Course Materials
•  Lecture
–  Module 1 (8 parts)
–  Module 2 (3 parts)
–  Module 3 (2 parts)

•  Lab (5)
•  Project (four check points)
•  Exam (2)


Grading
Component       Weight
Lab             20%
Midterm exam    25%
Final exam      30%
Project         25%


Academic Integrity

•  Academic Dishonesty = Automatic ‘F’

WHY DO WE NEED INFORMATION
RETRIEVAL?

http://www.blackcloudanalytics.com/news/who-needs-big-data/

The Human Face of Big Data, Against All Odds Productions, California
Humanfaceofbigdata.com

 

ISTE-612
Knowledge Processing Technologies
Module 1-1

WHAT  IS  INFORMATION  
RETRIEVAL?  

Information Retrieval
•  Information Retrieval (IR) is finding material
(usually documents) of an unstructured
nature (usually text) that satisfies an
information need from within large collections
(usually stored on computers).

•  Other examples you can think of?

Sec. 1.1

Basic assumptions of Information Retrieval
•  Collection: A set of documents or items
•  Goal: Retrieve documents with information
that is relevant to the user’s information
need and helps the user complete a task


HOW  DO  WE  KNOW  IF  THE  
RESULTS  ARE  ANY  GOOD?  


How good are the retrieved docs?
•  Precision: Fraction of retrieved docs that are
relevant to the user’s information need
•  Recall: Fraction of relevant docs in the collection
that are retrieved
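
The two definitions above can be turned into a small calculation. This is a minimal sketch, assuming retrieved and relevant documents are given as collections of IDs; the function name `precision_recall` and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 5 docs relevant, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Returning 0.0 for an empty set is a convention picked for this sketch; other conventions are possible.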
 


Model  and  Represent  
Unstructured  Text  


Unstructured data in 1620
•  Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
•  One could grep all of Shakespeare’s plays for
Brutus and Caesar, then strip out lines containing
Calpurnia.
•  Why is this not a good solution?
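
To make the question concrete, here is a rough sketch of a grep-style linear scan, applied per play rather than per line. The function names, the crude tokenizer, and the three stand-in texts are all hypothetical. One thing the sketch makes visible is that every query re-reads the entire collection, so the cost grows with the total amount of text.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z']+", text.lower()))


def linear_scan(plays, must_have, must_not_have):
    """Scan every play in full for every query."""
    matches = []
    for title, text in plays.items():
        words = tokens(text)
        has_all = all(term in words for term in must_have)
        has_banned = any(term in words for term in must_not_have)
        if has_all and not has_banned:
            matches.append(title)
    return matches


# Hypothetical stand-in texts, not the real plays.
plays = {
    "Play A": "Brutus spoke and Caesar listened.",
    "Play B": "Caesar and Brutus met Calpurnia.",
    "Play C": "Caesar walked alone.",
}
print(linear_scan(plays, ["brutus", "caesar"], ["calpurnia"]))  # ['Play A']
```

Note that the query asks about whole plays, while grep works line by line, so stripping lines that mention Calpurnia does not by itself exclude plays that mention her elsewhere.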


Term-document incidence
matrices


            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
mercy           1           0        1         1        1         1
worser          1           0        1         1        1         0

Query: Brutus AND Caesar BUT NOT Calpurnia

Entry is 1 if the play contains the word, 0 otherwise


Incidence vectors
•  So we have a 0/1 vector for each term.
•  To answer the query: take the vectors for Brutus,
Caesar and Calpurnia (complemented) → bitwise AND.
–  110100 AND 110111 AND 101111 = 100100
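
The bitwise AND above can be reproduced in a few lines. This is a minimal sketch that encodes each incidence vector as a Python integer; the variable names are illustrative, and the bit strings are taken from the incidence matrix for the six plays.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term (leftmost bit = first play in the list).
incidence = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

n = len(plays)
mask = (1 << n) - 1  # limits the complement to n bits

brutus = int(incidence["Brutus"], 2)
caesar = int(incidence["Caesar"], 2)
not_calpurnia = ~int(incidence["Calpurnia"], 2) & mask  # 101111

result = brutus & caesar & not_calpurnia
bits = format(result, f"0{n}b")
answers = [play for play, bit in zip(plays, bits) if bit == "1"]

print(bits)     # 100100
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```

The mask is needed because Python integers are unbounded, so `~x` alone would yield a negative number rather than a 6-bit complement.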



Answers  to  query  
•  Antony and Cleopatra,  Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

•  Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.



Bigger collections
•  Consider N = 1 million documents, each with
about 1000 words.
•  Avg 6 bytes/word including spaces/punctuation
–  6 GB of data in the documents.

•  Say there are M = 500K distinct terms among
these.
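
The 6 GB figure follows directly from the numbers on the slide. A quick check, with constant names chosen here for illustration and 1 GB taken as 10^9 bytes (an assumption of this sketch):

```python
N_DOCS = 1_000_000      # N = 1 million documents
WORDS_PER_DOC = 1_000   # about 1000 words each
BYTES_PER_WORD = 6      # average, including spaces/punctuation

total_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD
total_gb = total_bytes / 10**9  # assumes 1 GB = 10^9 bytes

print(total_bytes)  # 6000000000
print(total_gb)     # 6.0
```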



Can't build the matrix
•  A 500K x 1M matrix has half a trillion 0’s and 1’s.

•  But it has no more than one billion 1’s. Why?
–  The matrix is extremely sparse.

•  What is a better representation?
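
The counts on this slide can be checked with simple arithmetic. One way to see the bound on the number of 1's: a document of about 1000 words can contain at most 1000 distinct terms. The second half of the sketch shows one possible direction for the closing question, offered as an assumption rather than the course's answer: record only the positions of the 1's. The names and the tiny three-term example (taken from the Shakespeare matrix, with plays numbered 0 to 5) are illustrative.

```python
M_TERMS = 500_000
N_DOCS = 1_000_000
WORDS_PER_DOC = 1_000

cells = M_TERMS * N_DOCS           # total 0/1 entries in the matrix
max_ones = N_DOCS * WORDS_PER_DOC  # at most 1000 distinct terms per document
density = max_ones / cells         # upper bound on the fraction of 1's

print(cells)     # 500000000000
print(max_ones)  # 1000000000
print(density)   # 0.002

# Store only the 1 entries: for each term, the documents that contain it.
ones_only = {
    "Brutus": [0, 1, 3],
    "Caesar": [0, 1, 3, 4, 5],
    "Calpurnia": [1],
}
stored = sum(len(docs) for docs in ones_only.values())
print(stored)  # 9 entries instead of 3 x 6 = 18 cells
```

At most 0.2% of the cells can be 1, so a representation that skips the 0's stores far less data.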
