
Summary Notes GEI

The document covers various aspects of data collection, analysis, and interpretation, emphasizing the roles of ethnographers and data scientists. It discusses the importance of understanding cultural context, the limitations and biases in data, and the significance of different data types and structures. Additionally, it highlights the challenges of machine learning and natural language processing in relation to data and classification.


GEI1002 Summary Notes

Lecture #1
- Ethnographers immerse themselves in communities, observe behaviors, conduct interviews, and analyze cultural practices.
- Data scientists use computers to discern trends that might be inaccessible to the naked eye.
- Challenges in the age of AI: (1) bias in training data leading to biased AI decisions; (2) risk of overreliance on quantitative data, missing qualitative factors in understanding human behavior and preferences.
- Data (data scientist) → a collection of measurements, attributes, or observations that can be analyzed to derive insights or make decisions.
- Data (humanities) → a representation of human experiences, behaviors, and cultural artifacts; contextual and interpretive in nature.
- Types of data: quantitative, categorical (grouped), ordinal (ranking), and continuous vs. discrete.
- Each variable (feature) forms a column, and each observation forms a row.
- You cannot combine different types of data in the same column, and each row holds only one observation.
- Dataset schema: the structural blueprint of a dataset. It specifies the variables, observations, and constraints that govern the structure and content of the dataset.
- Schematic bias: a systematic distortion or limitation in data collection and representation that arises when complex realities are forced to fit predetermined categories or structures. It manifests in two primary ways:
  o In the initial design of data collection, where decisions about what to include and exclude shape the potential insights and limitations of the dataset.
  o In the processing of data, where information is trimmed, transformed, or transfigured to match the expectations of the database schema or analytical tools.
- Limitations of data: data is always an imperfect approximation of reality, dependent on available technology and on conscious choices made by the people designing the dataset schema and collecting the data → PROXY.
- Good proxy, e.g., using food center stall turnover rates as a proxy for hawker industry health.
- Poor proxy, e.g., using the number of void deck events as a proxy for community engagement, or using the number of people with gym memberships to assess health standards in a community.
- Cultural context in data collection: decisions are informed by the worldviews, histories, values, and priorities of those collecting data.
- When analyzing a dataset with a decision diagram, we also need to note who created the data and why.
- Be aware of unintentional errors when analyzing a dataset.
- Biography of a dataset → origin, context, evolution, challenges, unexpected turns of events.
- Close reading a dataset: (1) always look at the raw data; (2) understand as many rows as possible, not just the first few; (3) look for patterns, outliers, and inconsistencies; (4) note how missing data is represented.
- EDA: summary statistics (central tendency: mean, median, mode; dispersion: standard deviation, range) and visualizations (histograms, box plots, and scatter plots).
- Data science: systematic analysis, computational tools, EDA, and visualization.
- Ethnography: paying attention to context, emphasizing history, focusing on definitions, and appreciating nuance.

Lecture #2
- What is considered data? Data is context-dependent; data in one context might be information or knowledge in another.
- Every category is a preconception; types of data can be manually assigned or computationally derived.
- Characteristics of a dataset (Manovich, 2020):
  o A phenomenon is represented as a set of objects (data points, measurements, samples, records) and their features (attributes, variables). The features may include already available metadata and measurements of objects' characteristics that we generate using algorithms (feature extraction).
  o The objects and their features form a dataset.
  o The number of objects in a dataset must be finite.
  o Features are encoded using data types: whole and fractional numbers, categories, spatial coordinates, spatial shapes and trajectories, dates, times, text tags, or free text.
  o Each feature can use only one data type.
  o The number of features in the dataset must be finite.
- Different objectives across the disciplines: prediction (science), control (engineering), explanation (social science), interpretation (humanities), automation (machine learning).
- Big data → volume, variety, velocity, veracity, and value; outliers and errors become less important.
- No data in the humanities is fully objective; it requires active effort to collect or produce, depending on the technology, opinions, and perspectives of the people creating a system.
- Problems of data → standardization, incompleteness, and inaccuracy.
Lecture #3
- Variance: (1/N) Σ(xᵢ − mean)²; the standard deviation is the square root of the variance.
- KDE (Kernel Density Estimation) shows a smoothed representation of the data distribution, in comparison to histograms.
- Pair plots → scatterplots and KDEs of multiple variables for the same samples.
- William Playfair → Exports and Imports and general trade in England.
- The "Infoviz" approach → a visualization aesthetic common in the news, developed by Nigel Holmes.
- Sources of bias in data visualization: axis cropping, axis scaling, multiple dimensions, unnormalized values, spurious correlation.
- Problems of pie charts: 3D visualizations; without labels stating the percentages, there would be little to no chance of accurately guessing them. People tend to underestimate the size of acute angles and overestimate the size of obtuse ones.
- Each visualization tells a different story.
- John Snow → map of cholera outbreaks and wells.
- List of caveats: order your data; to cut or not to cut; spaghetti charts; pie charts.
- Florence Nightingale → coxcomb (polar area graph).
- Charles Joseph Minard → Napoleon's march to Moscow.

Lecture #5
- NLP applications: search engines, social media recommender systems, flagging spam, identifying posts that interest users.
- NLP in the humanities and social sciences: description rather than prediction.
- LEMMA: verbs such as "say" may appear in different forms in a text, such as "saying" or "said"; to count all instances of a verb we can use its LEMMA (the root form of a word).
- NLP for more complex tasks: providing a summary of a document, finding the main themes or claims in a text, classifying sentences, paragraphs, or whole texts.
- Classifying literary texts: literary genre classification, theme identification, period classification, cultural context classification, writing style analysis, tone/mood classification, and poetic structure analysis.
- Classifying news articles: factual statement vs. opinion, news event vs. background information, direct quotation vs. paraphrase, attribution vs. non-attribution, temporal classification, eyewitness account vs. expert statement, cause vs. effect.
- Challenges of text classification: technical vs. interpretive.

Lecture #6
- Objective: to understand how "ground truth" is always interpretive in the sociocultural world.
- A model (ML) is only as good as the data it was trained on. This includes the training labels.
- Data → Training → Model
- No dataset is value-neutral; every dataset used to train ML systems contains a worldview.
- Where do labels come from? From the human labelers who assign them (ground truth).
- Inter-rater reliability → there are many ways to calculate the agreement between human labelers, e.g., using Fleiss' Kappa (k = number of labelers).
- Evaluating the results of a model with "accuracy" and a "confusion matrix".
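Fleiss' Kappa, the inter-rater agreement metric mentioned above, is computed from a table of counts in which each row is a labeled item, each column is a category, and each entry is how many labelers chose that category for that item. A minimal pure-Python sketch of the standard formula (the example counts are invented):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of labelers who assigned item i to category j.
    Every item must be rated by the same number of labelers."""
    N = len(counts)       # number of items
    n = sum(counts[0])    # labelers per item
    k = len(counts[0])    # number of categories
    # Observed agreement: average fraction of labeler pairs that agree per item.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # Chance agreement from the overall category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Three labelers, two categories, two items, maximal disagreement:
print(round(fleiss_kappa([[2, 1], [1, 2]]), 3))  # -0.333
```

A kappa near 1 means the labelers agree far more than chance would predict; values at or below 0 mean agreement is no better than chance, which is why contested "ground truth" labels show up directly in this number.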
Lecture #7
- A network consists of nodes (things that are connected) and edges (connections between those things).
- Directed networks: edges flow in one direction. Undirected networks: edges are reciprocal.
- Edge definition in this course → edges must be explicit, uniform connections between things. A family tree is not a network in this sense, since its edges have different meanings assigned to them (married, siblings, blood-related, etc.).
- Edges can be weighted by assigning a numerical value between two nodes. This applies to both directed and undirected networks.
- Node definition in this course → nodes must all be the same kind of thing.
- What should the edges be? Co-presence and interactions.
- A network with more edges connecting the nodes would have higher betweenness but lower normalized betweenness than a network with fewer edges connecting the nodes to one another.
- The largest shortest path is the diameter (2 in the lecture example) and the smallest shortest path is the radius (1 in the example).
- The maximum eccentricity is the graph diameter, and the minimum eccentricity is the graph radius.
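The shortest-path metrics from Lecture #7 above (eccentricity, diameter, radius) can be computed with a breadth-first search. This sketch uses an invented three-node path network, which reproduces a diameter of 2 and a radius of 1:

```python
from collections import deque

def eccentricities(adj):
    """adj maps each node to its neighbours (undirected, unweighted).
    A node's eccentricity is its longest shortest path to any other node."""
    ecc = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # plain BFS from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ecc[start] = max(dist.values())
    return ecc

# Path network: A - B - C
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
ecc = eccentricities(adj)
print(max(ecc.values()))  # diameter: 2
print(min(ecc.values()))  # radius: 1
```

The middle node B has eccentricity 1 (it reaches everything in one hop), while the endpoints have eccentricity 2, matching the diameter/radius definitions in the notes.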
Lecture #9
- Vector data is defined by x and y coordinates (points, lines, and polygons); used for calculating distance, area, etc.
- Raster data is matrices or grids (elevation, temperature, soil pH; continuous types).
- Descriptive statistics for points, lines, and polygons: measures of central tendency, dispersion, proximity analysis, points within the radius of another element, clustering, and spatial autocorrelation.
- Maps as networks → places are nodes; edges can be communication lines, similarity on a quantitative measurement, or movement from one place to another.
- Thick mapping: collecting, aggregating, and visualizing ever more layers of geographic or place-specific data. Thickness connotes a kind of cultural analysis trained on the political, economic, linguistic, social, and other realities in which human beings act and create.
- Cartogram → a map in which the geometry of a place is distorted to represent a variable.
- Dunham → works of a choreographer (dimensional visualization of time and space).

Lecture #11
- Humanities approaches → attention to context, history, definitions, and small differences.
- Why does this matter? → Learning the history and context of how data is collected, processed, and classified will make you a more informed person in a data-driven world.

Weekly Quiz Summary

1. What does the term "data provenance" refer to?
The origin and history of the data.

2. What is the primary purpose of "close reading" in data analysis?
To identify patterns, outliers, and inconsistencies in the raw data.

3. Why is it important to understand both ethnographic and data science perspectives?
To develop a more comprehensive and nuanced approach to working with data.

4. What does the lecture suggest about the objectivity of data?
Both numerical and categorical data can be subjective or objective.

5. When is something considered data?
Something is considered data depending on how one uses the information/knowledge.

6. If you want to compare the standard deviations of two variables, which is the best option?
A bar chart with error bars.

7. When reading a traceback, where can you find the name of the most recent error?
At the end.

8. Why is the violin plot for budget in the example cut?
Budget cannot have negative values.

9. In the context of text analysis, what are "collocates"?
Words that occur in close proximity to a particular target word within the text.

10. What are the objectives of using Voyant Tools and spaCy to analyze a text?
To understand the structure, patterns, and grammatical roles of words in a text.

11. What is a LEMMA?
The root form of a word.

12. Why does spaCy count the sentence "Welcome back." as three tokens?
Because spaCy counts punctuation as separate tokens.

13. What does the line nlp = spacy.load("en_core_web_lg") do in the script?
It loads a model from spaCy into our NLP project.

14. What would the script line matcher.add("LIGHT_NOUN", [[{"LEMMA": "say", "POS": "VERB"}]]) do?
It adds a pattern to match verb forms of the word "say".

15. What does the lecture suggest could be the result of being able to automatically classify large volumes of text such as novels, news articles, or other types of text?
It would unleash new research potential.

16. What are the main challenges in text classification?
Technical and interpretive challenges.

17. In the context of machine learning, what is the training set?
A set of predefined examples with known inputs and outputs.

18. What does the term Fleiss' Kappa mean in this course?
It is a metric used to measure agreement between human labellers.

19. Why will we never reach a perfect system of text classification?
Because of contextual differences, ambiguity, and disagreements among humans.

20. Regarding the compare_raters() function, what must each Excel file in the selected folder contain?
The sentences and labels given by an individual labeler.

21. Consider an undirected, unweighted network with 10 nodes and 35 edges. What is its density?
N = 10 nodes. Potential edges = (10 × 9)/2 = 45. Density = 35/45 ≈ 0.78.

22. Which statement about thick mapping is true?
It is an approach that considers maps as cultural objects that make specific claims.

23. What does the IbsenStage map show?
How the performance of Ibsen's works spread around the world.

24. What is the Republic of Letters?
A metaphor for the correspondence networks of the 17th and 18th centuries in Europe.
