
Summary Notes GEI

The document covers various aspects of data collection, analysis, and interpretation, emphasizing the roles of ethnographers and data scientists. It discusses the importance of understanding cultural context, the limitations and biases in data, and the significance of different data types and structures. Additionally, it highlights the challenges of machine learning and natural language processing in relation to data and classification.


GEI1002 Summary Notes

Lecture #1
- Ethnographers immerse themselves in communities, observe behaviors, conduct interviews, and analyze cultural practices.
- Data scientists use computers to discern trends that might be inaccessible to the naked eye.
- Challenges in the age of AI: (1) bias in training data leading to biased AI decisions; (2) risk of overreliance on quantitative data, missing qualitative factors in understanding human behavior and preferences.
- Data (data scientist) → a collection of measurements, attributes, or observations that can be analyzed to derive insights or make decisions.
- Data (humanities) → a representation of human experiences, behaviors, and cultural artifacts; contextual and interpretive in nature.
- Types of data: quantitative, categorical (grouped), ordinal (ranking), and continuous vs. discrete.
- Each variable (feature) forms a column, and each observation forms a row.
- You cannot combine different types of data in the same column, and each row holds only one observation.
- Dataset schema: the structural blueprint of a dataset. It specifies the variables, observations, and constraints that govern the structure and content of the dataset.
- Schematic bias: a systematic distortion or limitation in data collection and representation that arises when complex realities are forced to fit predetermined categories or structures. It manifests in two primary ways:
  o In the initial design of data collection, where decisions about what to include and exclude shape the potential insights and limitations of the dataset.
  o In the processing of data, where information is trimmed, transformed, or transfigured to match the expectations of the database schema or analytical tools.
- Limitations of data: data is always an imperfect approximation of reality, dependent on available technology and on conscious choices made by the people designing the dataset schema and collecting the data → PROXY.
- Good proxy, e.g., using food center stall turnover rates as a proxy for hawker industry health.
- Poor proxy, e.g., using the number of void deck events as a proxy for community engagement, or using the number of people with gym memberships to assess health standards in a community.
- Cultural context in data collection: decisions are informed by the worldviews, histories, values, and priorities of those collecting data.
- When analyzing a dataset with a decision diagram, we also need to note who created the data and why.
- Be aware of unintentional errors when analyzing a dataset.
- Biography of a dataset → origin, context, evolution, challenges, unexpected turns of events.
- Close reading a dataset: (1) always look at the raw data; (2) understand as many rows as possible, not just the first few; (3) look for patterns, outliers, and inconsistencies; (4) note how missing data is represented.
- EDA: summary statistics (central tendency: mean, median, mode; dispersion: standard deviation, range) and visualizations (histograms, box plots, and scatter plots).
- Data science: systematic analysis, computational tools, EDA, and visualization.
- Ethnography: paying attention to context, emphasizing history, focusing on definitions, and appreciating nuance.

Lecture #2
- What is considered data? Data is context-dependent; data in one context might be information or knowledge in another.
- Every category is a preconception; types of data can be manually assigned or computationally derived.
- Characteristics of a dataset (Manovich, 2020):
  o A phenomenon is represented as a set of objects (data points, measurements, samples, records) and their features (attributes, variables). The features may include already available metadata and measurements of objects' characteristics that we generate using algorithms (feature extraction).
  o The objects and their features form a dataset.
  o The number of objects in a dataset must be finite.
  o Features are encoded using data types: whole and fractional numbers, categories, spatial coordinates, spatial shapes and trajectories, dates, times, text tags, or free text.
  o Each feature can use only one data type.
  o The number of features in the dataset must be finite.
- Different objectives across the disciplines: prediction (science), control (engineering), explanation (social science), interpretation (humanities), automation (machine learning).
- Big data → volume, variety, velocity, veracity, and value; outliers and errors become less important.
- No data in the humanities is fully objective; it requires active effort to collect or produce, depending on the technology, opinions, and perspectives of the people creating a system.
- Problems of data → standardization, incompleteness, and inaccuracy.
Lecture #3
- Variance: (1/N) Σ(xᵢ − mean)²; the standard deviation is the square root of the variance.
- KDE (Kernel Density Estimation) shows a smoothed representation of the data distribution, in comparison to histograms.
- Pair plots → scatterplots and KDEs of multiple variables for the same samples.
- William Playfair → Exports and Imports and general trade in England.
- The "Infoviz" approach → a visualization aesthetic common in the news, developed by Nigel Holmes.
- Sources of bias in data visualization: axis cropping, axis scaling, multiple dimensions, unnormalized values, spurious correlation.
- Problems of pie charts: 3D visualizations; without labels stating the percentages, there would be little to no chance of accurately guessing them. People tend to underestimate the size of acute angles and overestimate the size of obtuse ones.
- Each visualization tells a different story.
- John Snow → map of cholera outbreaks and wells.
- List of caveats: order your data; to cut or not to cut; spaghetti charts; pie charts.
- Florence Nightingale → coxcomb (polar area graph).
- Charles Joseph Minard → Napoleon's march to Moscow.

Lecture #5
- NLP applications: search engines, social media recommender systems, flagging spam, identifying posts that interest users.
- NLP in the humanities and social sciences: description rather than prediction.
- LEMMA: verbs such as "say" may appear in different forms in a text, such as "saying" or "said"; to count all instances of a verb we can use its LEMMA (the root form of a word).
- NLP for more complex tasks: providing a summary of a document, finding the main themes or claims in a text, classifying sentences, paragraphs, or whole texts.
- Classifying literary texts: literary genre classification, theme identification, period classification, cultural context classification, writing style analysis, tone/mood classification, and poetic structure analysis.
- Classifying news articles: factual statement vs. opinion, news event vs. background information, direct quotation vs. paraphrase, attribution vs. non-attribution, temporal classification, eyewitness account vs. expert statement, cause vs. effect.
- Challenges of text classification: technical vs. interpretive.

Lecture #6
- Objective: to understand how "ground truth" is always interpretive in the sociocultural world.
- A model (ML) is only as good as the data it was trained on. This includes the training labels.
- Data → Training → Model
- No dataset is value-neutral; every dataset used to train ML systems contains a worldview.
- Where do labels come from? From the human labelers who assign them (ground truth).
- Inter-rater reliability → there are many ways to calculate the agreement between human labelers, e.g., using Fleiss' Kappa (k = number of labelers).
- Evaluating the results of a model with "accuracy" and a "confusion matrix".
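Fleiss' Kappa, the inter-rater agreement metric mentioned above, is computed from a table of counts in which each row is a labeled item, each column is a category, and each entry is how many labelers chose that category for that item. A minimal pure-Python sketch of the standard formula (the example counts are invented):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of labelers who assigned item i to category j.
    Every item must be rated by the same number of labelers."""
    N = len(counts)       # number of items
    n = sum(counts[0])    # labelers per item
    k = len(counts[0])    # number of categories
    # Observed agreement: average fraction of labeler pairs that agree per item.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # Chance agreement from the overall category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Three labelers, two categories, two items, maximal disagreement:
print(round(fleiss_kappa([[2, 1], [1, 2]]), 3))  # -0.333
```

A kappa near 1 means the labelers agree far more than chance would predict; values at or below 0 mean agreement is no better than chance, which is why contested "ground truth" labels show up directly in this number.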
Lecture #7
- A network consists of nodes (things that are connected) and edges (connections between those things).
- Directed networks: edges flow in one direction. Undirected networks: edges are reciprocal.
- Edge definition in this course → edges must be explicit, uniform connections between things. A family tree is not a network in this sense, since its edges have different meanings assigned to them (married, siblings, blood-related, etc.).
- Edges can be weighted by assigning a numerical value between two nodes. This applies to both directed and undirected networks.
- Node definition in this course → nodes must all be the same kind of thing.
- What should the edges be? Co-presence and interactions.
- A network with more edges connecting the nodes would have higher betweenness but lower normalized betweenness than a network with fewer edges connecting the nodes to one another.
- The largest shortest path is the diameter (2 in the lecture example) and the smallest shortest path is the radius (1 in the example).
- The maximum eccentricity is the graph diameter, and the minimum eccentricity is the graph radius.
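The shortest-path metrics from Lecture #7 above (eccentricity, diameter, radius) can be computed with a breadth-first search. This sketch uses an invented three-node path network, which reproduces a diameter of 2 and a radius of 1:

```python
from collections import deque

def eccentricities(adj):
    """adj maps each node to its neighbours (undirected, unweighted).
    A node's eccentricity is its longest shortest path to any other node."""
    ecc = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # plain BFS from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ecc[start] = max(dist.values())
    return ecc

# Path network: A - B - C
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
ecc = eccentricities(adj)
print(max(ecc.values()))  # diameter: 2
print(min(ecc.values()))  # radius: 1
```

The middle node B has eccentricity 1 (it reaches everything in one hop), while the endpoints have eccentricity 2, matching the diameter/radius definitions in the notes.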
Lecture #9
- Vector data is defined by x and y coordinates (points, lines, and polygons); used for calculating distance, area, etc.
- Raster data is matrices or grids (elevation, temperature, soil pH; continuous types).
- Descriptive statistics for points, lines, and polygons: measures of central tendency, dispersion, proximity analysis, points within the radius of another element, clustering, and spatial autocorrelation.
- Maps as networks → places are nodes; edges can be communication lines, similarity on a quantitative measurement, or movement from one place to another.
- Thick mapping: collecting, aggregating, and visualizing ever more layers of geographic or place-specific data. Thickness connotes a kind of cultural analysis trained on the political, economic, linguistic, social, and other realities in which human beings act and create.
- Cartogram → a map in which the geometry of a place is distorted to represent a variable.
- Dunham → works of a choreographer (dimensional visualization of time and space).

Lecture #11
- Humanities approaches → attention to context, history, definitions, and small differences.
- Why does this matter? → Learning the history and context of how data is collected, processed, and classified will make you a more informed person in a data-driven world.

Weekly Quiz Summary

1. What does the term "data provenance" refer to?
The origin and history of the data.

2. What is the primary purpose of "close reading" in data analysis?
To identify patterns, outliers, and inconsistencies in the raw data.

3. Why is it important to understand both ethnographic and data science perspectives?
To develop a more comprehensive and nuanced approach to working with data.

4. What does the lecture suggest about the objectivity of data?
Both numerical and categorical data can be subjective or objective.

5. When is something considered data?
Something is considered data depending on how one uses the information/knowledge.

6. If you want to compare the standard deviations of two variables, which is the best option?
A bar chart with error bars.

7. When reading a traceback, where can you find the name of the most recent error?
At the end.

8. Why is the violin plot for budget in the example cut?
Budget cannot have negative values.

9. In the context of text analysis, what are "collocates"?
Words that occur in close proximity to a particular target word within the text.

10. What are the objectives of using Voyant Tools and spaCy to analyze a text?
To understand the structure, patterns, and grammatical roles of words in a text.

11. What is a LEMMA?
The root form of a word.

12. Why does spaCy count the sentence "Welcome back." as three tokens?
Because spaCy counts punctuation as separate tokens.

13. What does the line nlp = spacy.load("en_core_web_lg") do in the script?
It loads a model from spaCy into our NLP project.

14. What would the script line matcher.add("LIGHT_NOUN", [[{"LEMMA": "say", "POS": "VERB"}]]) do?
It adds a pattern to match verb forms of the word "say".

15. What does the lecture suggest could be the result of being able to automatically classify large volumes of text such as novels, news articles, or other types of text?
It would unleash new research potential.

16. What are the main challenges in text classification?
Technical and interpretive challenges.

17. In the context of machine learning, what is the training set?
A set of predefined examples with known inputs and outputs.

18. What does the term Fleiss' Kappa mean in this course?
It is a metric used to measure agreement between human labellers.

19. Why will we never reach a perfect system of text classification?
Because of contextual differences, ambiguity, and disagreements among humans.

20. Regarding the compare_raters() function, what must each Excel file in the selected folder contain?
The sentences and labels given by an individual labeler.

21. Consider an undirected, unweighted network with 10 nodes and 35 edges. What is its density?
N = 10 nodes. Potential edges = (10 × 9)/2 = 45. Density = 35/45 ≈ 0.78.

22. Which statement about thick mapping is true?
It is an approach that considers maps as cultural objects that make specific claims.

23. What does the IbsenStage map show?
How the performance of Ibsen's works spread around the world.

24. What is the Republic of Letters?
A metaphor for the correspondence networks of the 17th and 18th centuries in Europe.
