
DATA-INTENSIVE COMPUTING

CSE487/587
BINA RAMAMURTHY (BINA@BUFFALO.EDU)
ABOUT THE COURSE

1. What will I learn in the course?
2. What is this course about?
3. What are some of the industries hiring in the data-related area?
4. What are my responsibilities as a student in this course?
5. How can I get the best out of this course?
6. How should I assess the success in this course? What are the metrics?

Let me handle these questions one by one. Please think of other questions you may have about the course, its format, etc.
WHAT WILL I LEARN IN THE COURSE?

• You’ll learn about the basic data analytics process, from defining a problem to cleaning
and processing data for downstream analytics (and application of algorithms) and
knowledge extraction. (Figure adapted from the Doing Data Science book.)

• You’ll learn about big data infrastructures and algorithms. Managing large-scale
data requires special structures and algorithms. (Figure from Lin and Dyer’s book.)

• You’ll learn about newer data challenges and methods to address them, for example
streaming data and how to process and analyze it in a timely manner. (Examples of
data streams are social media data and multi-modal enterprise data.)
(Figure: data stream)
WHAT IS THIS COURSE ABOUT?
• This course is about
• Foundational concepts in data-intensive computing (structured data in tables)
• Identifying a problem, acquiring data, understanding data and extracting features, analyzing, and visualizing and presenting the
outcomes of the analysis; some useful tools: RStudio and Python with its libraries
• From small data (Excel tables and structured data) to unstructured big data (mainly text)
• What are some of the issues in handling big data?
• Data structures and algorithms for big data (for example, Hadoop and MapReduce)
• Big data to streaming data
• What is streaming data? (Social media as well as enterprise data)
• How is it characterized?
• What are some of the methods for managing streaming data? (For example, Spark Streaming)
• Different project idea for each team
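
To make the MapReduce idea mentioned above a little more concrete, here is a minimal sketch of word counting in the map, shuffle, and reduce style. It is plain Python running in a single process and does not use Hadoop; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are made up for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not course data.
    docs = ["big data needs big ideas", "data streams are data too"]
    print(word_count(docs))
```

In a real cluster the map and reduce calls would run in parallel on different machines; the point of the sketch is only that each phase sees a small, independent piece of the data.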
WHAT ARE SOME OF THE INDUSTRIES HIRING IN THE
DATA-RELATED AREA?

Data-related jobs from around Buffalo

Note the 100+


WHAT ARE MY RESPONSIBILITIES AS A STUDENT IN THIS COURSE?

1. Attend lectures: this course is a synchronous, real-time, remote course. Attendance is required.
2. Read the books and other material referenced. These are a valuable resource for all the work assigned in the course.
3. You will work on three projects (resulting in three “products”), in groups of at most 2 people:
   1. Foundational data analytics with structured data from real sources (Pew, data.gov, etc.): Python, DS book
   2. Data-intensive analytics of big data (unstructured) using a Hadoop and MapReduce-like setup: Java or Python; a VM will be
   provided
   3. Streaming data analytics using appropriate data structures and algorithms (DAG): Spark: Java, Scala, or Python; a VM can be
   used.

4. Complete timed multiple-choice quizzes on the topics covered in the course.


5. Take part in class activities that require active participation.
6. Attend office hours if you have any questions or concerns; don’t wait till the end of the semester.
HOW CAN I GET THE BEST OUT OF THIS COURSE?

• Be eager to learn. Here is an opportunity to learn about an emerging technology that is in high demand.
• Focus on opportunities for learning, not on methods for cheating to get a better grade.
• Work hard. Learn new skills and knowledge. There is no substitute for hard work. Copying and
cheating on your quizzes and projects is not going to get you anywhere.
• Be attentive in class. Just because the course is remote does not mean you can start Zoom and walk away. I
may require you to have your video on.
• Work on the projects yourself even though teams are allowed.
• No sharing of data or code among teams is allowed.
HOW SHOULD I ASSESS THE SUCCESS IN THIS COURSE?
WHAT ARE THE METRICS?

• What would you say? Not by the letter grade but by the
• New concepts you learn about data-intensive computing (and data science)
• New skills you develop to solve data-related problems and improve your understanding
• New knowledge you gain about data applications, cool libraries in Python, the beauty of MapReduce (MR), and
streaming data

• Don’t be afraid to learn new languages and new ways of programming (for example,
Python programming). Don’t say “I don’t know Python programming and I don’t want to
learn.” That is not a good idea.
LET’S GET A CONCEPT, SKILL AND KNOWLEDGE FOR
TODAY (AND LEARN A NEW ONE EVERY DAY)

• Concept: structured → unstructured → streaming data


• Skill: Understand the nature of these different types of data: tables → large-scale data
store → real-time streams
• Knowledge: Find some generators and sources for these types of data
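
As a rough illustration of how these three kinds of data differ in practice, here is a small sketch using only the Python standard library. The sample table, the helper names (read_table, top_words, sliding_average), and the window size are assumptions made up for this example, not course material.

```python
import csv
import io
from collections import Counter, deque

# Structured: rows with named columns, parsed here from a tiny made-up CSV.
TABLE_TEXT = "city,temp_f\nBuffalo,30\nAlbany,34\n"


def read_table(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Unstructured: free text with no schema; structure is imposed by counting words.
def top_words(text, n=2):
    """Return the n most frequent words with their counts."""
    return Counter(text.lower().split()).most_common(n)


# Streaming: values arrive one at a time, so only a small sliding window is kept.
def sliding_average(stream, window=3):
    """Yield the average of the most recent `window` values after each arrival."""
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


if __name__ == "__main__":
    print(read_table(TABLE_TEXT))
    print(top_words("data about data is still data"))
    print(list(sliding_average([1, 2, 3, 4], window=2)))
```

The table can be queried by column name, the text has to be given structure before it can be analyzed, and the stream is processed without ever holding all of it in memory.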
TO DO (BEFORE NEXT CLASS)
• Form teams of at most 2 people
• Sign up with one of the TAs.
There are three TAs (actually 2 1/2 TAs) for the course.
Let’s introduce the TAs:
Jiayi Xian (jxian@buffalo.edu) – 30 teams
Ping Yu (pingyu@buffalo.edu) – 30 teams
Chen Yuan (chenyu@buffalo.edu) – half a TA – 10 teams
You will work with your TA on grades and any other questions you may have about the projects and quizzes.
They will provide office-hours Zoom links and Google sign-up sheets for you to associate with one of the TAs.
• Find a good data source for your project 1. Here are some sources:
• Pew Research (https://www.pewresearch.org/download-datasets/)
• data.gov (218,319 datasets found)
• Amazon and Google data sets
• Form a problem statement for your project 1: “I am going to analyze this data to find out this, to address this
problem and provide a data-driven solution.”
