11 - Big Data and Digital Privacy
Digital Privacy
Big data
• An information set too large or too loosely structured for traditional analysis
tools like database management systems (DBMSs)
• Term in use from early 1990s when data was comparatively small
• Potentially useful unstructured/less structured data requiring special handling
• Human expertise
• Computer processing power
Data analytics
• Area of research seeking ways to extract info from big data and analyze it
• Multidisciplinary field
• Computer science
• Information science
• Mathematics
• Statistics
• Can be used to
• Describe features of a data set
• Group subjects of a data set
• Compare groups in a data set
• Make predictions based on trends found in a data set
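The four uses listed above can be sketched on a toy data set. This is only a minimal illustration using Python's standard library. The records, the 5-hour threshold, and the names are all made up for the example, and the prediction step uses a simple least-squares line, which is just one of many possible techniques.

```python
from statistics import mean

# Hypothetical toy data: (hours studied per week, final mark) per student.
records = [(2, 55), (4, 62), (6, 70), (8, 79), (10, 85)]

# Describe: summarize a feature of the data set.
average_mark = mean(mark for _, mark in records)

# Group: split subjects using an assumed threshold of 5 hours.
low = [mark for hours, mark in records if hours < 5]
high = [mark for hours, mark in records if hours >= 5]

# Compare: difference between the two group averages.
gap = mean(high) - mean(low)

# Predict: fit a least-squares line, mark = slope * hours + intercept.
xs = [hours for hours, _ in records]
ys = [mark for _, mark in records]
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar


def predict(hours):
    """Estimate a mark from hours studied, following the fitted trend."""
    return slope * hours + intercept
```

Calling predict(12) extrapolates the trend to a student who is not in the data set. Real data analytics works on far larger and messier data than this.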
Data analytics
• Parts of data analytics overlap with artificial intelligence (AI)
• Example: Data mining uses AI techniques to spot similarities among group
members in big data and make predictions from them
• More of this in our AI unit
Web crawler
• Bot (automated process) originating with a search engine company
• Browses the Web to add what it finds to search engine's index of pages
• Builds a queue of web pages to visit, starting with an initial list of URLs
• Visits each page in its queue, in turn
• Stores some or all of each page in the search engine's index of pages
• Adds URLs of any links it finds to its queue
[Figure: How a web crawler works. Queue of URLs to visit → stop if queue empty, otherwise retrieve and remove a URL from the queue → get page from Web → store page info in search engine index → get hyperlinks from page → add URLs from hyperlinks to queue]
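The crawl loop described above can be sketched in a few lines of Python. To keep the sketch runnable without network access, the "Web" here is a hypothetical in-memory dictionary (FAKE_WEB), and the names crawl and max_pages are illustrative assumptions. A real crawler would fetch pages over HTTP and parse the HTML to find links.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example/": ("home page", ["a.example/news", "b.example/"]),
    "a.example/news": ("news page", ["a.example/"]),
    "b.example/": ("another site", ["c.example/missing"]),
}


def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)   # queue of URLs to visit
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    index = {}                 # stands in for the search engine's index
    while queue and len(index) < max_pages:  # stop if the queue is empty
        url = queue.popleft()  # retrieve and remove a URL from the queue
        page = FAKE_WEB.get(url)  # "get page from Web"
        if page is None:
            continue           # page could not be retrieved, so skip it
        text, links = page
        index[url] = text      # store page info in the index
        for link in links:     # add URLs from hyperlinks to the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Calling crawl(["a.example/"]) visits the three reachable pages in queue order. The seen set is an addition not shown in the figure. Without it, pages that link to each other would be queued forever.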
Web crawler
• Crawling is a continuous process
• Frequency of visits by crawlers to specific pages may be determined by
• How often changes in the page are detected
• How important the page is deemed to be
• Portion of crawled pages that gets stored in indexes varies by search engine
• Example: Googlebot (Google's crawler) indexes entire pages
• Indexing techniques will vary by web object type
• Example: Images handled differently from text
Web crawler
• Not all pages are crawled. Reasons include
• The page is on a private network
• No pages in the index link to the page
• The page never ends up in a crawler's search queue
• The page is temporary (constructed on the server as part of a transaction)
• Access to the page is controlled
• By user authentication (user ID and password)
• By a humanity test (example: CAPTCHA)
Ranking pages
• Search engines use proprietary algorithms to rank pages according to importance
• Ranking used in determining order of links shown in a search
• Most famous of these algorithms is Google Search's PageRank
• Performed analyses of data produced by Googlebot crawler
• Assigned numerical rank to web objects in its index
• A page's rank increased
• If another page in the index linked to the page
• If the linking page itself had a high PageRank
• No longer used but much studied
• Search engine companies always look for ways to improve their ranking systems
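The two rules for how a page's rank increased can be illustrated with a simplified, textbook-style version of the PageRank calculation. This is a sketch under stated assumptions, not Google's production system. The damping value of 0.85, the fixed number of iterations, the handling of pages with no links, and the four-page graph are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages; links maps each page to the list of pages it links to.

    Every linked-to page must also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal ranks
    for _ in range(iterations):
        # Every page gets a small base amount of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links shares its rank with all.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this graph, C ends up ranked highest because three pages link to it, and D lowest because nothing links to it. A outranks B even though each has a single incoming link, because A's link comes from the highly ranked C.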
Search advertising
• Search engine companies make money through advertising
• 70% or more of Google parent Alphabet Inc.'s annual revenue comes from
advertising
• Largest share of that is from Google Search
• Ads appearing with search results selected by
• Keywords in user's query
• Location of the user
• May be determined from user settings
• User's interests
• Determined from search history
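As a rough illustration of how the three signals above might be combined, the sketch below scores each candidate ad against a query, a location, and a set of interests. Everything in it is hypothetical: the ad records, the field names, and the weights of 3, 2, and 1 are assumptions for the example, and real ad selection systems are far more elaborate.

```python
def score_ad(ad, query, user_location, user_interests):
    """Score one ad; the weights 3, 2, and 1 are arbitrary assumptions."""
    query_words = set(query.lower().split())
    score = 3 * len(query_words & ad["keywords"])    # keywords in the query
    if ad["location"] == user_location:              # location of the user
        score += 2
    score += len(user_interests & ad["topics"])      # user's interests
    return score


def pick_ad(ads, query, user_location, user_interests):
    """Return the ad with the highest score."""
    return max(
        ads,
        key=lambda ad: score_ad(ad, query, user_location, user_interests),
    )


# Hypothetical ads and user.
ads = [
    {"name": "Realtor", "keywords": {"house", "homes"},
     "location": "Springfield", "topics": {"real estate"}},
    {"name": "Car dealer", "keywords": {"car", "truck"},
     "location": "Springfield", "topics": {"autos"}},
]
best = pick_ad(ads, "house for sale", "Springfield",
               {"real estate", "gardening"})
```

Here the realtor's ad wins because it matches the query keyword "house" and one of the interests inferred from the user's history, while the car dealer's ad matches only on location.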
[Figure: social network profile page with fields such as Places Lived, Email, Contact and Basic Info, Life Events, Social Link, Gender, and Birth date]
Locational privacy
• Advocates feel that where a person goes is that person's business alone
• Hard to achieve
• Location tracking done using
• Video surveillance
• Facial recognition software
• Phones
• Tracking using GPS-based geolocation services
• Tracking by changing cell phone tower connections
• In-store purchases, stops at bank machines, etc.
• Tracking by debit or credit card use
• Toll roads with automatic sensors
Malware
• Malicious software
• Programs or parts of programs
• Intended to harm computers or networks
• Associated costs very high
• Hundreds to thousands for individual users
• Many thousands or millions for corporations
Malware
• Viruses
• Attached to host program files
• Executing host program runs the viral code first
• Does something harmful to the system
• Attaches copies of itself to other program files
• Hands control back to host program
• Spread in pre-network days by floppy disk
• Problematic at one time for Microsoft Office users
• Viruses carried in files containing Visual Basic for Applications programs
• Spread by emailed attachments (Word or Excel files)
• Rare now
Malware
• Trojan horses (Trojans)
• Entire program is the malware (unlike a virus)
• Users tricked into thinking they're installing useful programs
• They usually do not replicate themselves
• Worms
• Entire program is the malware
• Replicates across networks using exploits
• Exploit: Software that takes advantage of a known network security
weakness
Malware
• Ransomware
• Arrives on a system as a Trojan or a worm
• Encrypts data files on the system
• Presents a ransom message
• Instructs the user to send money by untraceable means
• On payment, user usually receives instructions for restoring files
Phishing
• Users tempted to give personal data to someone posing as a legitimate company
• Example: Email message appears to be from a legitimate corporation, but it isn't
• Message says the user's account needs verifying
• Message contains a link to a convincing, but fake, login page
• User "logs in" thus divulging login credentials
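One defensive check against the fake login page described above is to compare the host name in the link with the company's real domain. The sketch below is a minimal illustration, not a complete defence. The function name link_matches_company is hypothetical, and the domains used with it are made-up examples.

```python
from urllib.parse import urlparse


def link_matches_company(url, company_domain):
    """True if the link's host is the company's domain or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    company_domain = company_domain.lower()
    return host == company_domain or host.endswith("." + company_domain)
```

For example, a link to https://login.example.com/verify passes the check for example.com, while https://example.com.account-check.test/login does not, even though the company's name appears at the start of its host name.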
Data breaches
• Unauthorized infiltration of servers
• Security concerns for servers' owners
• Security concerns for users with accounts on those servers
Data breaches
• Example: January 2021 Microsoft Exchange Server (MES) breach
• MES used by corporations for email and calendaring services
• About 250,000 organizations' servers compromised
• Organizations included businesses and governments
• Result of zero-day attacks
• Used exploits of software weaknesses not yet known to the vendor or the network admins
• Known effects of the breach include
• Users' email stolen
• Ransomware attacks on servers
Data breaches
• Data breaches responsible for many cases of
identity theft
• Example: Collection #1 (2019)
• Dark Web collection of 87 GB of user IDs,
passwords in 12,000 files
• Some were from known data breaches
• 140 million email addresses and 10
million passwords were not from known breaches
Summary
• We looked at
• The size of the internet and the concept of big data
• The science of data analytics
• The importance of web search engines and how they work
• Targeted advertising as done by Google
• The sale of our personal data by top social networking sites
• Our willingness to divulge personal data on social networking sites
• Privacy legislation in Canada (and how it can be circumvented)
• Locational privacy
• Malware and phishing
• Data breaches