
Big Data and Digital Privacy

CISC 181 DIGITAL SOCIETIES



How big is the internet?


• May be impossible to answer
• The internet is not centralized; not hierarchical; no single measurement point
• Cisco Systems measures internet traffic to estimate its "size"
• Doubling roughly every three years
• Annual traffic now measurable in zettabytes
• 1 zettabyte = 1,000,000,000 terabytes
• Storage capacity harder to judge
• Much hidden behind modems, corporate gateways
• Much data on connected systems but not intended for the internet
• By any measure, very large and still growing
[Figure: Image generated by opte.org]
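
The unit conversion and the doubling rate on this slide can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the starting figure of 1 zettabyte per year is a placeholder chosen for the example, not a measured value.

```python
# Decimal (SI) units: 1 terabyte = 10**12 bytes, 1 zettabyte = 10**21 bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21


def terabytes_per_zettabyte() -> int:
    """Number of terabytes in one zettabyte."""
    return ZETTABYTE // TERABYTE


def projected_traffic(start_zb: float, years: float,
                      doubling_period: float = 3.0) -> float:
    """Traffic after `years`, assuming it doubles every `doubling_period` years."""
    return start_zb * 2 ** (years / doubling_period)


print(f"{terabytes_per_zettabyte():,}")  # 1,000,000,000
print(projected_traffic(1.0, 6))         # 4.0 (two doublings in six years)
```

Doubling every three years means traffic grows about eightfold over nine years, which is why any "size" figure goes out of date quickly.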

Big data
• An information set too large or too loosely structured for traditional analysis
tools such as database management systems (DBMSs)
• Term in use from early 1990s when data was comparatively small
• Potentially useful unstructured/less structured data requiring special handling
• Human expertise
• Computer processing power

Data analytics
• Area of research seeking ways to extract info from big data and analyze it
• Multidisciplinary field
• Computer science
• Information science
• Mathematics
• Statistics
• Can be used to
• Describe features of a data set
• Group subjects of a data set
• Compare groups in a data set
• Make predictions based on trends found in a data set
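
The four uses listed above (describe, group, compare, predict) can be illustrated on a toy data set. This is a minimal sketch: the records are invented for illustration, the names are hypothetical, and real data analytics works on far larger and messier data with more careful statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group label, year, daily hours online).
records = [
    ("student", 2019, 4.0), ("student", 2020, 5.0), ("student", 2021, 6.0),
    ("retiree", 2019, 1.0), ("retiree", 2020, 2.0), ("retiree", 2021, 3.0),
]

# Describe: one summary feature of the whole data set.
overall_mean = mean(hours for _, _, hours in records)

# Group: collect the subjects by label.
groups = defaultdict(list)
for label, year, hours in records:
    groups[label].append((year, hours))

# Compare: the same feature, computed per group.
group_means = {label: mean(h for _, h in rows) for label, rows in groups.items()}


def predict_hours(rows, target_year):
    """Predict: extend a least-squares straight line through (year, hours)."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    return y_bar + slope * (target_year - x_bar)


print(overall_mean)                              # 3.5
print(group_means)                               # {'student': 5.0, 'retiree': 2.0}
print(predict_hours(groups["student"], 2022))    # 7.0
```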

Data analytics
• Parts of data analytics overlap with artificial intelligence (AI)
• Example: Data mining uses AI techniques to spot similarities among group
members in big data and make predictions from them
• More on this in our AI unit

Searching the Web


• Web search engines
• Quickly became essential tools for extracting useful info from the Web
• Size of the Web demands fully automated searching
• Client's view:
• User types in a query, clicks or taps on a submit button
• Search engine responds quickly with list of hyperlinks, descriptive text or
pictures
• Search engine's servers draw on very large databases storing info on many,
many web pages
• Links presented to a user must be in order of likely usefulness

Web crawler
• Bot (automated process) originating with a search engine company
• Browses the Web to add what it finds to search engine's index of pages
• Builds a queue of web pages to visit, starting with an initial list of URLs
• Visits each page in its queue, in turn
• Stores some or all of each page in the search engine's index of pages
• Adds URLs of any links it finds to its queue
[Figure: How a web crawler works. Queue of URLs to visit; stop if queue empty, otherwise
retrieve and remove a URL from the queue; get page from Web; store page info in search
engine index; get hyperlinks from page; add URLs from hyperlinks to queue]
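
The crawl loop on this slide can be sketched in Python. This is a minimal sketch, not any search engine's actual crawler: FAKE_WEB is a hypothetical in-memory stand-in for the Web (no network access is made), crawl is an illustrative name, and the seen set is an added assumption so that no URL is queued twice.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, hyperlinks on the page).
FAKE_WEB = {
    "a.example/": ("Home page", ["a.example/news", "b.example/"]),
    "a.example/news": ("News page", ["a.example/"]),
    "b.example/": ("Partner page", ["a.example/news", "c.example/missing"]),
}


def crawl(seed_urls, web=FAKE_WEB):
    """Visit pages breadth-first from the seed URLs and return an index."""
    queue = deque(seed_urls)   # queue of URLs to visit
    seen = set(seed_urls)      # URLs already queued (assumption: avoid repeats)
    index = {}                 # the "search engine index": URL -> stored page info

    while queue:                       # stop if the queue is empty
        url = queue.popleft()          # retrieve and remove a URL from the queue
        page = web.get(url)            # get the page
        if page is None:
            continue                   # page cannot be fetched, so skip it
        text, links = page
        index[url] = text              # store page info in the index
        for link in links:             # get hyperlinks from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)     # add new URLs to the queue
    return index


index = crawl(["a.example/"])
print(list(index))  # ['a.example/', 'a.example/news', 'b.example/']
```

A page that no visited page links to never enters the queue, so it never reaches the index.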

Web crawler
• Crawling is a continuous process
• Frequency of visits by crawlers to specific pages may be determined by
• How often changes in the page are detected
• How important the page is deemed to be
• Portion of crawled pages that gets stored in indexes varies by search engine
• Example: Googlebot (Google's crawler) indexes entire pages
• Indexing techniques will vary by web object type
• Example: Images handled differently from text

Web crawler
• Not all pages are crawled. Reasons include
• The page is on a private network
• No pages in the index link to the page
• The page never ends up in a crawler's search queue
• The page is temporary (constructed on the server as part of a transaction)
• Access to the page is controlled
• By user authentication (user ID and password)
• By a humanity test (example: CAPTCHA)

[Figure: reCAPTCHA "I'm not a robot" checkbox. One of Google's less challenging
CAPTCHAs (Source: Google)]

Indexed vs unindexed Web


• The Surface (or Indexed) Web
• That portion of the Web indexed by search engines
• The Deep Web
• That portion of the Web not indexed by search engines
• Won't appear in a search engine
• Believed to be much larger than the Surface Web
• The Dark Web
• That portion of the Deep Web reachable only with special software (example: Tor)
• Often used for criminal activity
• Has a growing number of its own search engines

Ranking pages
• Search engines use proprietary algorithms to rank pages according to importance
• Ranking used in determining order of links shown in a search
• Most famous of these algorithms is Google Search's PageRank
• Performed analyses of data produced by Googlebot crawler
• Assigned numerical rank to web objects in its index
• A page's rank increased
• If another page in the index linked to the page
• If the linking page itself had a high PageRank
• No longer used in its original form but much studied
• Search engine companies always look for ways to improve their ranking systems
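
The two rules above (rank rises with each inbound link, and rises more when the linking page ranks highly) can be sketched with the textbook iterative form of PageRank. This is a simplified sketch, not Google's production system: the link graph LINKS is hypothetical, and the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal ranks

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B, and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ranks highest because three pages link to it. A has only one inbound link, but it comes from the highly ranked C, so A still ranks well above B and D, which have no inbound links at all.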

A Google data centre


The cost of running the world's most successful search engine and email services is high.
This data centre in Iowa is one of over twenty that Google maintains worldwide.

Search advertising
• Search engine companies make money through advertising
• 70% or more of Google parent Alphabet Inc.'s annual revenue comes from
advertising
• Largest share of that is from Google Search
• Ads appearing with search results selected by
• Keywords in user's query
• Location of the user
• May be determined from user settings
• User's interests
• Determined from search history

Search advertising
[Figure: Google Search results showing a local map (Map data ©2021 Google) and three
real estate agency listings: Alfred Wong, Real Estate Agent; Century 21 Champ Realty;
RE/MAX Service First Realty Inc., Brokerage]
Google appears to have figured out that I live in Kingston. The query that produced
these ads was "real estate".

Social networks, apps, and user data


• Apps sharing users' personal data with 3rd parties in 2021 (Apple, via pcloud.com)
1. Instagram (79% of personal data collected)
2. Facebook (57% of personal data collected)
3. LinkedIn (50% of personal data collected)
4. Uber Eats (50% of personal data collected)
5. Trainline (43% of personal data collected)
6. YouTube (43% of personal data collected)
7. YouTube Music (43% of personal data collected)
8. Deliveroo (36% of personal data collected)
9. Duolingo (36% of personal data collected)
10. eBay (36% of personal data collected)

Social networks, apps, and user data


[Figure: Facebook profile editing screens with prompts to add a mobile phone, address,
email, website, social links, language, religious views, political views, who you're
interested in, gender, birth date, and birth year]
Some of the personal information Facebook encourages its users to share in their "Profiles"



Social networks, apps, and user data


• Information shared in social network posts divulges personal information
• Example: User's location, whether expressly posted in "Profile" or not
• From context of posts; places the user posts about
• From where "Friends" live
• "Friends" lists typically extend far beyond actual friends and family members
• May include fake users
• Privacy policies on social networks change
• Users may be unaware of such changes
• Users may not understand ramifications of changes

Digital privacy in Canada


• Privacy Act (1983)
• Limits collection, use, and dissemination of personal information
• Applies to the federal government and its agencies
• Personal Information Protection and Electronic Documents Act (PIPEDA) (2000)
• Applies to private-sector organizations engaged in commercial activity, except in
provinces with similar legislation in place

Digital privacy in Canada


• PIPEDA gives Canadians the right to
• Know why their personal info is being used or collected
• Expect that any use or disclosure of their personal data aligns with their consent
• Know who is responsible for protecting their personal data
• Expect that an organization will keep their personal data private and secure
• Expect that an organization will keep their personal data correct and current
• Gain access to their personal information and ask for corrections to it
• Complain if they feel their privacy rights have not been respected

Digital privacy in Canada


• Under PIPEDA, organizations must
• Obtain consent to collect, use, or disclose personal data
• Not refuse service to a person refusing to divulge data not essential to a
transaction
• Only collect data legally and fairly
• Make policies on personal data collection and use available and
understandable
• The Digital Charter Implementation Act (2020) (not enacted at this writing)
• Proposed replacement for PIPEDA
• Would strengthen enforcement

Digital privacy in Canada


• Canadian privacy protection laws end at the border
• US privacy laws less robust than Canada's
• Largest internet corporations have servers in the US
• Many used by Canadian companies
• US National Security Agency known to have collected private information
secretly and illegally

Locational privacy
• Advocates feel that where a person goes is that person's business alone
• Hard to achieve
• Location tracking done using
• Video surveillance
• Facial recognition software
• Phones
• Tracking using GPS-based geolocation services
• Tracking by changing cell phone tower connections
• In-store purchases, stops at bank machines, etc.
• Tracking by debit or credit card use
• Toll roads with automatic sensors

Malware
• Malicious software
• Programs or parts of programs
• Intended to harm computers or networks
• Associated costs very high
• Hundreds to thousands of dollars for individual users
• Many thousands or millions of dollars for corporations

Malware
• Viruses
• Attached to host program files
• Executing host program runs the viral code first
• Does something harmful to the system
• Attaches copies of itself to other program files
• Hands control back to host program
• Spread in pre-network days by floppy disk
• Problematic at one time for Microsoft Office users
• Viruses carried in files containing Visual Basic for Applications programs
• Spread by emailed attachments (Word or Excel files)
• Rare now

Malware
• Trojan horses (Trojans)
• Entire program is the malware (unlike a virus)
• Users tricked into thinking they're installing useful programs
• They usually do not replicate themselves
• Worms
• Entire program is the malware
• Replicates across networks using exploits
• Exploit: Software that takes advantage of a known network security
weakness

Malware
• Ransomware
• Arrives on a system as a Trojan or a worm
• Encrypts data files on the system
• Presents a ransom message
• Instructs the user to send money by untraceable means
• On payment, user usually receives instructions for restoring files

Phishing
• Users tempted to give personal data to someone posing as a legitimate company
• Example: Email message appears to be from a legitimate corporation, but it isn't
• Message says the user's account needs verifying
• Message contains a link to a convincing, but fake, login page
• User "logs in" thus divulging login credentials
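
One way to spot the fake login link in the example above is to compare the host in the link with the domain the company really uses. The sketch below assumes the reader already knows the genuine domain; link_matches_domain is a hypothetical helper, and bank.example and the other addresses are made-up placeholders. Passing this check does not prove a link is safe.

```python
from urllib.parse import urlsplit


def link_matches_domain(url: str, expected_domain: str) -> bool:
    """True if the link's host is the expected domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    expected = expected_domain.lower()
    return host == expected or host.endswith("." + expected)


# A genuine link: the host is a subdomain of the expected domain.
print(link_matches_domain("https://login.bank.example/verify", "bank.example"))  # True

# The real name appears at the front, but the host is a different domain.
print(link_matches_domain("https://bank.example.account-verify.test/login",
                          "bank.example"))  # False

# The real name appears before an "@", so it is a user name, not the host.
print(link_matches_domain("https://bank.example@phish.test/login",
                          "bank.example"))  # False
```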

Data breaches
• Unauthorized infiltration of servers
• Security concerns for servers' owners
• Security concerns for users with accounts on those servers

[Figure: News headline "Confidential information exposed in recent data breach: Bombardier"
(The Canadian Press, posted February 23, 2021)]

Data breaches
• Example: January 2021 Microsoft Exchange Server (MES) breach
• MES used by corporations for email and calendaring services
• About 250,000 organizations' servers compromised
• Organizations included businesses and governments
• Result of zero-day attacks
• Used exploits of network weaknesses unknown to the network admins
• Known effects of the breach include
• Users' email stolen
• Ransomware attacks on servers

Data breaches
• Data breaches responsible for many cases of
identity theft
• Example: Collection #1 (2019)
• Dark Web collection of 87 GB of user IDs,
passwords in 12,000 files
• Some were from known data breaches
• 140 million email addresses and 10
million passwords were not

[Figure: Troy Hunt, the security researcher who discovered Collection #1 on the Dark Web
(Photo by Troy Hunt)]

Summary
• We looked at
• The size of the internet and the concept of big data
• The science of data analytics
• The importance of web search engines and how they work
• Targeted advertising as done by Google
• The sharing of our personal data with third parties by top apps and social networking sites
• Our willingness to divulge personal data on social networking sites
• Privacy legislation in Canada (and how it can be circumvented)
• Locational privacy
• Malware and phishing
• Data breaches
