Live Classroom 1

The document outlines the course MET CS688 on Web Analytics and Mining, taught by Michael Joner, including prerequisites, grading structure, and essential course details. It emphasizes the importance of early assignment completion, participation in discussions, and the necessity of knowledge in R or Python. The course covers topics such as machine learning, web mining, and data scraping, with a focus on practical coding skills and legal considerations in web scraping.


MET CS688 OL

WEB ANALYTICS AND MINING


LIVE CLASSROOM 1
About Me

• Michael Joner
– E-mail: mjoner@bu.edu
– Phone: 513-328-9115
• Text messages preferred
• I don’t answer unknown numbers
– Response time: always less than 24 hours,
usually less than 6 hours
About Course
• Prerequisite:
– MET CS 544 (Foundations of Analytics) OR
– MET CS 555 (Data Analysis and Visualization) OR
– Instructor Approval
• Course Grading:
– 15% discussions
– 50% assignments
– 35% final
• No textbook required
• A knowledge of R or Python is essential (any issues here?)
– NOTE: We do NOT teach most concepts in both languages. This course tends to lean toward R implementations.
• Grading scale:
– 100–93.00 A
– 92.99–90.00 A−
– 89.99–87.00 B+
– 86.99–83.00 B
– 82.99–80.00 B−
– 79.99–77.00 C+
– 76.99–73.00 C
– 72.99–70.00 C−
– 69.99–60.00 D
– Below 60.00 F
600-Level Class! Large Time Investment Required!
• If you fall behind it can be very hard to catch up
• Some assignments are bigger than others
• Start the assignment early in the week
• Full points on discussions require posts throughout the week

• Most students in this class spend at least 20 hours per week
– Listening to lecture, understanding what the different code examples do, writing their own code, testing & running the code, and explaining what the code does and what the output means.
• Many of you will get A’s, but many of you will not
Assignments and Graded Discussions
• Assignment and Discussion deadlines
– Due every Tuesday morning, 6am Eastern
– Discussions in “Group Discussion” board (not “Class Discussion”)
• Need to submit late?
– Starting with 2nd late assignment and discussion, 5% penalty per day late
– No assignments accepted after Friday morning, 6am Eastern
– Contact facilitator if you believe you have an emergency requiring more than 3
days of extra time

• An example of a good assignment will be posted on Friday or Saturday
Discussions
• Need to participate throughout the discussion period.
• Implication: Even if your posts are amazing, you will NOT get above a 90 if you wait until Saturday, Sunday, or Monday to do your original post and comments on others’ posts!
Class Discussion Board
• You can ask questions
– I will check the board
– All facilitators will check the board
– It is probably the fastest way to get answers
– Do NOT post solutions here
– If you have code that you think is close to correct, do not post it online
• E-mail your facilitator instead

• Examples of good assignments will be posted
• Some weeks, there will be downloads to help with assignments
• These live class slide decks will be posted
Not getting an answer to an e-mail?
• If you log in to an @yahoo.com, @gmail.com, or other web mail service, and have set it to make your e-mail look like it is coming from @bu.edu, your e-mail may get caught by spam filters!
• If you are not getting an answer to an e-mail, try logging in to your BU web mail portal (not to your personal Yahoo, Gmail, etc.)
• Or send from your personal Yahoo, Gmail, etc. with your BU address turned off
Facilitator Live Office
• The facilitators will have a weekly office hour
• These are held on Saturday mornings at 10:00 Boston time
• This same Zoom room that you are in now

• They will not give you the answers to homework
• They will help clarify questions
• They will show you some tips, give you some ideas, and often share things from a different point of view
Topics Covered
• Machine learning algorithm fundamentals, including types of algorithms, the
importance of reviewing your data, and evaluating algorithm performance
• Web mining, which studies how web crawlers and scrapers are used to process and
index the content of web sites, how search works, and how results are ranked
• Text mining, which covers the analysis of text including content extraction,
clustering, sentiment analysis, etc.
• Graph/Network algorithms, to assess relationships between connected objects

• Most of the material in this class is taught in both R and Python
– Some material is not available in both languages, is poorly implemented in one of them, or we simply don’t have enough time. In these cases we will present one language or the other as the situation allows.
Some Keys to the Course
• You will be:
– Writing your own code
– Accessing data you did not create
– Running some code you did not create
• This can all be time consuming, but you WILL be doing this in your future job
Schedule
• Live classrooms are Tuesdays and Thursdays at 8:30PM Eastern
– Approx 1-1.5 hours per session
– Different content on Tuesday vs. Thursday
– Classrooms will be recorded
• Homework and discussions are due Tuesdays at 6AM Eastern
• Lecture reading materials are available on Blackboard for all 6
modules
There Will Probably Be Changes!!!
• Some things might not work as demonstrated in the course materials

• Course materials have just been revised and typos/errors happen


• Packages change
• Websites that host data can go down
• You may have to do research if some code doesn’t work!!
• You will have to deal with this if you get a job that requires these skills
Module 1 Objectives
• Explain the foundations of data mining
• Describe web data collection techniques such as
scraping/crawling
• Prepare a web site scraper in both R and Python
• Understand regulation of web data collection and legal
limitations
Working with Data
• Huge amount of digital data
– Estimated that in 2025, a person connected to the Internet will have at
least one data interaction every 18 seconds
– Estimated that by 2025 there will be 175ZB (175 billion TB) of data
• Source: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
• Analytics = discovery of patterns in data
• Mining = extracting useful knowledge from data
• Goal: learn something from data, but we can’t look at it all
Data, Information, Knowledge, and Wisdom

[Figure: the data-to-wisdom continuum, from Health Care Informatics (Englebardt and Nelson 2002)]

Data Mining
• The process of extracting knowledge (and wisdom) from data
• The process of applying machine learning techniques

• Computers assist us in recognizing patterns in data
Machine Learning
• A science of programming computers so they learn from data
– This definition is from Géron (reference in the online Module)
• A technique or method used to extract knowledge from data
• Three main purposes:
– Descriptive
– Predictive
– Mimic human behavior = Artificial Intelligence
Overview of Web Data
• HTML = Hyper-Text Markup Language
• In HTML we use tags (wrappers) around text to enable linking
and formatting

<b> Example </b>

Here, <b> and </b> are the tags. In a web browser, the text between them will be boldfaced: Example
Web Crawlers
• A web crawler fetches, analyses
and files information from web
servers.
• Web crawlers (also referred to as
spiders) can copy all the indexed
pages they visit for quicker
processing by a search engine.
[Figure: map of linked web pages] This map is called a “network” or a “graph”. You will learn more about these in Module 6!
Web Crawlers
• The basic operational steps of a hypertext crawler are
– Begin with one or more URLs that constitute a seed set
– Fetch the web page from the seed set
– Parse the fetched web page to extract the text and the links
• Extracted text is fed to a text indexer
• Extracted links (URLs) are added to URLs whose corresponding pages have
yet to be fetched by the crawler
– The visited URLs are deleted from the seed set

• Multi-threaded design to fetch & process a large number of web pages quickly.
– Fetching a billion pages in a month requires several hundred pages per second; at 100 pages/second you would only fetch about 260 million pages per month.
– Even a billion pages is a small fraction of the static Web at present, so massively distributed parallel computing is typically used.
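The steps above can be sketched in Python. This is a minimal sketch of the seed-set/frontier loop only: the “web” here is an in-memory dictionary of hypothetical pages standing in for real HTTP fetches and HTML parsing, and there is no multi-threading, politeness, or indexing machinery.

```python
from collections import deque

# A tiny in-memory "web": each URL maps to (page text, outgoing links).
# These URLs and pages are hypothetical stand-ins for real fetched HTML.
FAKE_WEB = {
    "http://a.example/": ("Page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Page B", ["http://c.example/"]),
    "http://c.example/": ("Page C", ["http://a.example/"]),
}

def crawl(seed_urls):
    """Basic crawler loop: fetch, parse, index the text, enqueue new links."""
    frontier = deque(seed_urls)   # URLs whose pages have yet to be fetched
    visited = set()               # URLs already fetched (removed from the frontier)
    index = {}                    # URL -> extracted text (would feed a text indexer)
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        # "Fetch" and "parse" collapse into one dictionary lookup here.
        text, links = FAKE_WEB.get(url, ("", []))
        index[url] = text
        for link in links:        # extracted links join the frontier if unseen
            if link not in visited:
                frontier.append(link)
    return index

print(sorted(crawl(["http://a.example/"])))
# → ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

Starting from the single seed URL, the crawler discovers and indexes all three pages, and the visited set keeps the cycle (C links back to A) from causing an infinite loop.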
Scraping Data from Web Sites
• Crawlers are useful for indexing the web and the entire
contents of the web pages

• But sometimes you are interested in the specific contents of just one or a small number of specific web pages

• When you know what site you want to look at and what
content you want to get from it, you want to “scrape” from
those web pages
Scraping: By API
• Some web sites have built Application Programming Interfaces (APIs)
• How they work:
– You send a specific request (usually by HTTP) to the server
– The server finds or calculates the data you are asking for
– The server sends a response to your request
Scraping: By API
• We will briefly demo the use of the arXiv API (in R, library is called
aRxiv) to search the arXiv database of research papers (arxiv.org)

• Later in the course, we will also briefly show how to access the Twitter API using R’s rtweet library.
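The request/response cycle can also be driven from Python. The sketch below builds a query URL for the public arXiv API endpoint (http://export.arxiv.org/api/query) and then parses a canned, hypothetical Atom response instead of making a live HTTP call; the titles shown are made up, and in R the aRxiv library wraps these mechanics for you.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Step 1: build the HTTP request you would send to the server.
params = {"search_query": "all:web mining", "start": 0, "max_results": 5}
url = "http://export.arxiv.org/api/query?" + urlencode(params)

# Steps 2-3 happen on the server, which answers with an Atom XML feed.
# A canned (hypothetical) response is parsed here in place of a live call;
# urllib.request.urlopen(url).read() would fetch the real thing.
sample_response = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>An Example Paper on Web Mining</title></entry>
  <entry><title>Another Example Paper</title></entry>
</feed>"""

ns = {"atom": "http://www.w3.org/2005/Atom"}
titles = [e.findtext("atom:title", namespaces=ns)
          for e in ET.fromstring(sample_response).findall("atom:entry", ns)]
print(titles)
# → ['An Example Paper on Web Mining', 'Another Example Paper']
```

The point is the shape of the exchange: parameters encoded into a request URL, and a structured (XML) response you parse into the fields you care about.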
Scraping: By Looking for Info on Web Pages
• The rvest library is a great tool for reading from web pages
• In Python, the BeautifulSoup module does a similar thing

• To scrape, you need to understand some HTML
• Some people like a Chrome plugin called “Selector Gadget”
Scraping: By Looking for Info on Web Pages
• Example: Coolidge Corner Theater
https://coolidge.org/showtimes

• Page Source Code: [screenshot of the page’s HTML source]


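As a dependency-free sketch of the idea, the snippet below pulls film titles out of hypothetical showtimes-style markup using Python’s built-in html.parser; the class names and structure are invented for illustration and will not match the real coolidge.org page, which you would need to inspect first. The BeautifulSoup module mentioned above makes this kind of extraction much shorter.

```python
from html.parser import HTMLParser

# Hypothetical markup loosely imitating a showtimes page; the real
# coolidge.org HTML differs, so inspect the actual page source first.
PAGE = """
<div class="film"><h2 class="film-title">Casablanca</h2><span class="time">7:00</span></div>
<div class="film"><h2 class="film-title">Metropolis</h2><span class="time">9:30</span></div>
"""

class TitleScraper(HTMLParser):
    """Collect the text inside every element whose class is 'film-title'."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if ("class", "film-title") in attrs:
            self.in_title = True
    def handle_endtag(self, tag):
        self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)
# → ['Casablanca', 'Metropolis']
```

With BeautifulSoup the same extraction is roughly [h.get_text() for h in soup.select(".film-title")], and rvest offers the equivalent in R; either way, the work is identifying which tags and classes mark the content you want.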
Possible Legal Issues in Web Scraping
• Patents : inventors’ right to be compensated when others use
the invention
– Possible situation: you might think of a more efficient way to scrape
the web. But if someone else invented it first and patented it, you
can’t use that method.
– Solution: use well-known, open-source methods that are publicly available and well documented, to assure freedom to practice
Possible Legal Issues in Web Scraping
• Copyright : a creator’s ownership of their own art, words,
music, designs, code, etc.
– Possible situation: I created this slide. I spent time doing it. It is mine.
Not yours. Don’t take pieces of it without asking me first.
– Web site situation: Depending on what is on a web site and how it is arranged, you may not have the right to publish information derived from it.
– Note:
• Copyright only applies to things that are created.
• Facts (like prices, names of places, names of people) are not protected by
copyright. Anyone can re-use and share that information.
Possible Legal Issues in Web Scraping
• Trespass : using someone’s property without permission
– Possible situation:
• Some “other site” has spent money on gathering the information, providing
servers and internet connections
• You create a tool that scrapes information from the “other site” to share on
your site, maybe in the hope of taking business away from the “other site”
• You are using “other site’s” property to your advantage and potentially to
“other site’s” harm without “other site’s” permission (for example, Terms of
Service on the “other site” may prohibit scraping)
– Solution: You must either compensate “other site” (buy their data) or
find some other way to get it on your own.
Optional – time permitting: Functions in R

• Multiple inputs: list them after the function name, separated with commas
• Multiple outputs: an R function returns a single object, so combine the outputs into a data frame or list first, then return that