Live Classroom 1

The document outlines the course MET CS688 on Web Analytics and Mining, taught by Michael Joner, including prerequisites, grading structure, and essential course details. It emphasizes the importance of early assignment completion, participation in discussions, and the necessity of knowledge in R or Python. The course covers topics such as machine learning, web mining, and data scraping, with a focus on practical coding skills and legal considerations in web scraping.


MET CS688 OL

WEB ANALYTICS AND MINING


LIVE CLASSROOM 1
About Me

• Michael Joner
– E-mail: mjoner@bu.edu
– Phone: 513-328-9115
• Text messages preferred
• I don’t answer unknown numbers
– Response time: always less than 24 hours,
usually less than 6 hours
About Course
• Prerequisite:
– MET CS 544 (Foundations of Analytics) OR
– MET CS 555 (Data Analysis and Visualization) OR
– Instructor Approval
• Course Grading:
– 15% discussions
– 50% assignments
– 35% final
• No textbook required
• A knowledge of R or Python is essential (any issues here?)
– NOTE: We do NOT teach most concepts in both languages. This course tends to lean toward R implementations.
• Grading scale:
– 100–93.00 A
– 92.99–90.00 A−
– 89.99–87.00 B+
– 86.99–83.00 B
– 82.99–80.00 B−
– 79.99–77.00 C+
– 76.99–73.00 C
– 72.99–70.00 C−
– 69.99–60.00 D
– Below 60.00 F
600-Level Class! Large Time Investment Required!
• If you fall behind it can be very hard to catch up
• Some assignments are bigger than others
• Start the assignment early in the week
• Full points on discussions require posts throughout the week

• Most students in this class spend at least 20 hours per week
– Listening to lecture, understanding what the different code examples do, writing their own code, testing & running the code, and explaining what the code does and what the output means.
• Many of you will get A’s, but many of you will not
Assignments and Graded Discussions
• Assignment and Discussion deadlines
– Due every Tuesday morning, 6am Eastern
– Discussions in “Group Discussion” board (not “Class Discussion”)
• Need to submit late?
– Starting with 2nd late assignment and discussion, 5% penalty per day late
– No assignments accepted after Friday morning, 6am Eastern
– Contact facilitator if you believe you have an emergency requiring more than 3
days of extra time

• An example of a good assignment will be posted on Friday or Saturday
Discussions
• Need to participate throughout the discussion period.
• Implication: Even if your posts are amazing, you will NOT get above a 90 if you wait until Saturday, Sunday, or Monday to do your original post and comments on others’ posts!
Class Discussion Board
• You can ask questions
– I will check the board
– All facilitators will check the board
– It is probably the fastest way to get answers
– Do NOT post solutions here
– If you have code that you think is close to correct, do not post it online
• E-mail your facilitator instead

• Examples of good assignments will be posted
• Some weeks, there will be downloads to help with assignments
• These live class slide decks will be posted
Not getting an answer to an e-mail?
• If you log in to an @yahoo.com, @gmail.com, or other web mail service, and have set it to make your e-mail look like it is coming from @bu.edu, your e-mail may get caught by spam filters!
• If you are not getting an answer to an e-mail, try logging in to your BU web mail portal (not to your personal Yahoo, Gmail, etc.)
• Or send from your personal Yahoo, Gmail, etc. with your BU address turned off
Facilitator Live Office
• The facilitators will have a weekly office hour
• These are held on Saturday mornings at 10:00 Boston time
• This same Zoom room that you are in now

• They will not give you the answers to homework
• They will help clarify questions
• They will show you some tips, give you some ideas, and often share things from a different point of view
Topics Covered
• Machine learning algorithm fundamentals, including types of algorithms, the
importance of reviewing your data, and evaluating algorithm performance
• Web mining, which studies how web crawlers and scrapers are used to process and
index the content of web sites, how search works, and how results are ranked
• Text mining, which covers the analysis of text including content extraction,
clustering, sentiment analysis, etc.
• Graph/Network algorithms, to assess relationships between connected objects

• Most of the material in this class is taught in both R and Python
– Some material is not available in both languages, is poorly implemented in one of them, or we simply don’t have enough time. In these cases we will present one language or the other as the situation allows.
Some Keys to the Course
• You will be:
– Writing your own code
– Accessing data you did not create
– Running some code you did not create
• This can all be time consuming, but you WILL be doing this in your future job
Schedule
• Live classrooms are Tuesdays and Thursdays at 8:30PM Eastern
– Approx 1-1.5 hours per session
– Different content on Tuesday vs. Thursday
– Classrooms will be recorded
• Homework and discussions are due Tuesdays at 6AM Eastern
• Lecture reading materials are available on Blackboard for all 6
modules
There Will Probably Be Changes!!!
• Some things might not work as demonstrated in the course materials

• Course materials have just been revised and typos/errors happen


• Packages change
• Websites that host data can go down
• You may have to do research if some code doesn’t work!!
• You will have to deal with this if you get a job that requires these skills
Module 1 Objectives
• Explain the foundations of data mining
• Describe web data collection techniques such as
scraping/crawling
• Prepare a web site scraper in both R and Python
• Understand regulation of web data collection and legal
limitations
Working with Data
• Huge amount of digital data
– Estimated that in 2025, a person connected to the Internet will have at
least one data interaction every 18 seconds
– Estimated that by 2025 there will be 175ZB (175 billion TB) of data
• Source: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
• Analytics = discovery of patterns in data
• Mining = extracting useful knowledge from data
• Goal: learn something from data, but we can’t look at it all
Data, Information, Knowledge, and Wisdom

[Figure: the data-to-wisdom continuum, from Health Care Informatics (Englebardt and Nelson 2002)]

Data Mining
• The process of extracting knowledge (and wisdom) from data
• The process of applying machine learning techniques

• Computers assist us in recognizing patterns in data
Machine Learning
• A science of programming computers so they learn from data
– This definition is from Géron (reference in the online Module)
• A technique or method used to extract knowledge from data
• Three main purposes:
– Descriptive
– Predictive
– Mimic human behavior = Artificial Intelligence
Overview of Web Data
• HTML = Hyper-Text Markup Language
• In HTML we use tags (wrappers) around text to enable linking
and formatting

<b> Example </b>

Here, <b> and </b> are the tags. In a web browser, the text between them will be boldfaced: Example
Web Crawlers
• A web crawler fetches, analyses
and files information from web
servers.
• Web crawlers (also referred to as
spiders) can copy all the indexed
pages they visit for quicker
processing by a search engine.
[Figure: map of linked web pages] This map is called a “network” or a “graph”. You will learn more about these in Module 6!
Web Crawlers
• The basic operational steps of a hypertext crawler are
– Begin with one or more URLs that constitute a seed set
– Fetch the web page from the seed set
– Parse the fetched web page to extract the text and the links
• Extracted text is fed to a text indexer
• Extracted links (URLs) are added to URLs whose corresponding pages have
yet to be fetched by the crawler
– The visited URLs are deleted from the seed set

• Multi-threaded design to fetch & process a large number of web pages quickly.
– Fetching a billion pages in a month requires several hundred pages per second; at 100 pages/second you would only fetch about 260 million pages per month.
– Even a billion pages is a small fraction of the static Web at present, so massively distributed parallel computing is typically used.
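The steps above can be sketched in Python. This is a minimal sketch of the seed-set/frontier loop only: the “web” here is an in-memory dictionary of hypothetical pages standing in for real HTTP fetches and HTML parsing, and there is no multi-threading, politeness, or indexing machinery.

```python
from collections import deque

# A tiny in-memory "web": each URL maps to (page text, outgoing links).
# These URLs and pages are hypothetical stand-ins for real fetched HTML.
FAKE_WEB = {
    "http://a.example/": ("Page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Page B", ["http://c.example/"]),
    "http://c.example/": ("Page C", ["http://a.example/"]),
}

def crawl(seed_urls):
    """Basic crawler loop: fetch, parse, index the text, enqueue new links."""
    frontier = deque(seed_urls)   # URLs whose pages have yet to be fetched
    visited = set()               # URLs already fetched (removed from the frontier)
    index = {}                    # URL -> extracted text (would feed a text indexer)
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        # "Fetch" and "parse" collapse into one dictionary lookup here.
        text, links = FAKE_WEB.get(url, ("", []))
        index[url] = text
        for link in links:        # extracted links join the frontier if unseen
            if link not in visited:
                frontier.append(link)
    return index

print(sorted(crawl(["http://a.example/"])))
# → ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

Starting from the single seed URL, the crawler discovers and indexes all three pages, and the visited set keeps the cycle (C links back to A) from causing an infinite loop.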
Scraping Data from Web Sites
• Crawlers are useful for indexing the web and the entire
contents of the web pages

• But sometimes you are interested in the specific contents of just one or a small number of specific web pages

• When you know what site you want to look at and what
content you want to get from it, you want to “scrape” from
those web pages
Scraping: By API
• Some web sites have built Application Programming Interfaces (APIs)
• How they work:
– You send a specific request (usually by HTTP) to the server
– The server finds or calculates the data you are asking for
– The server sends a response to your request
Scraping: By API
• We will briefly demo the use of the arXiv API (in R, library is called
aRxiv) to search the arXiv database of research papers (arxiv.org)

• Later in the course, we will also briefly show how to access the Twitter API using R’s rtweet library.
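The request/response cycle can also be driven from Python. The sketch below builds a query URL for the public arXiv API endpoint (http://export.arxiv.org/api/query) and then parses a canned, hypothetical Atom response instead of making a live HTTP call; the titles shown are made up, and in R the aRxiv library wraps these mechanics for you.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Step 1: build the HTTP request you would send to the server.
params = {"search_query": "all:web mining", "start": 0, "max_results": 5}
url = "http://export.arxiv.org/api/query?" + urlencode(params)

# Steps 2-3 happen on the server, which answers with an Atom XML feed.
# A canned (hypothetical) response is parsed here in place of a live call;
# urllib.request.urlopen(url).read() would fetch the real thing.
sample_response = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>An Example Paper on Web Mining</title></entry>
  <entry><title>Another Example Paper</title></entry>
</feed>"""

ns = {"atom": "http://www.w3.org/2005/Atom"}
titles = [e.findtext("atom:title", namespaces=ns)
          for e in ET.fromstring(sample_response).findall("atom:entry", ns)]
print(titles)
# → ['An Example Paper on Web Mining', 'Another Example Paper']
```

The point is the shape of the exchange: parameters encoded into a request URL, and a structured (XML) response you parse into the fields you care about.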
Scraping: By Looking for Info on Web Pages
• The rvest library is a great tool for reading from web pages
• In Python, the BeautifulSoup module does a similar thing

• To scrape, you need to understand some HTML
• Some people like a Chrome plugin called “Selector Gadget”
Scraping: By Looking for Info on Web Pages
• Example: Coolidge Corner Theater
https://coolidge.org/showtimes

• Page Source Code: [screenshot of the page’s HTML source]


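As a dependency-free sketch of the idea, the snippet below pulls film titles out of hypothetical showtimes-style markup using Python’s built-in html.parser; the class names and structure are invented for illustration and will not match the real coolidge.org page, which you would need to inspect first. The BeautifulSoup module mentioned above makes this kind of extraction much shorter.

```python
from html.parser import HTMLParser

# Hypothetical markup loosely imitating a showtimes page; the real
# coolidge.org HTML differs, so inspect the actual page source first.
PAGE = """
<div class="film"><h2 class="film-title">Casablanca</h2><span class="time">7:00</span></div>
<div class="film"><h2 class="film-title">Metropolis</h2><span class="time">9:30</span></div>
"""

class TitleScraper(HTMLParser):
    """Collect the text inside every element whose class is 'film-title'."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if ("class", "film-title") in attrs:
            self.in_title = True
    def handle_endtag(self, tag):
        self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)
# → ['Casablanca', 'Metropolis']
```

With BeautifulSoup the same extraction is roughly [h.get_text() for h in soup.select(".film-title")], and rvest offers the equivalent in R; either way, the work is identifying which tags and classes mark the content you want.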
Possible Legal Issues in Web Scraping
• Patents : inventors’ right to be compensated when others use
the invention
– Possible situation: you might think of a more efficient way to scrape
the web. But if someone else invented it first and patented it, you
can’t use that method.
– Solution: use well-known, open-source methods that are publicly available and well documented, to assure freedom to practice
Possible Legal Issues in Web Scraping
• Copyright : a creator’s ownership of their own art, words,
music, designs, code, etc.
– Possible situation: I created this slide. I spent time doing it. It is mine.
Not yours. Don’t take pieces of it without asking me first.
– Web site situation: Depending on what is on a web site and how it is arranged, you may not have the right to publish information derived from it.
– Note:
• Copyright only applies to things that are created.
• Facts (like prices, names of places, names of people) are not protected by
copyright. Anyone can re-use and share that information.
Possible Legal Issues in Web Scraping
• Trespass : using someone’s property without permission
– Possible situation:
• Some “other site” has spent money on gathering the information, providing
servers and internet connections
• You create a tool that scrapes information from the “other site” to share on
your site, maybe in the hope of taking business away from the “other site”
• You are using “other site’s” property to your advantage and potentially to
“other site’s” harm without “other site’s” permission (for example, Terms of
Service on the “other site” may prohibit scraping)
– Solution: You must either compensate “other site” (buy their data) or
find some other way to get it on your own.
Optional – time permitting: Functions in R

• Multiple inputs: list them after the function name, separated with commas
• Multiple outputs: an R function returns a single object, so combine the outputs into a data frame or list first, then return that