
Data Collection: Image and Text


CC19 – Data Mining
Agenda
• Image Collection
  • Defining Image Collection
  • Collecting Image Data
  • Managing Image Data
• Text Collection
  • Defining Text Collection
  • Collecting Text Data
  • Managing Text Data
Image Collection
Data Collection: Image and Text
Defining Image Collection
• Images are a visual or mental representation of something
or someone.
• In data science, these images have a class or type that
they belong to.
• Different types of images are used to conduct analysis
and train modern systems.
Defining Image Collection
• Image collection refers to the practice of utilizing digital
images for research and data science.
• Using images for data science is common now but is a
recent practice in the information age.
• Image collection requires its own steps and tools due to
the nature and characteristics of images.
Collecting Image Data
• Collecting image data can be challenging, since images
come from a variety of sources:
• Forums
• Data Vendors
• Digital Creation
• Image Sharing Platforms
• Image Scanning
• Photography
• Open-Source Datasets
• Search Engines
• Social Media
Collecting Image Data
• When collecting images, we need to ensure that we know
what type of images we want to collect.
• Ideally, we should be able to answer these questions
about our image data:
• What type of images will be gathered?
• In what format will the images appear?
• What is the expected volume of images?
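The three questions above can be checked against a folder of images that has already been gathered. The following is a minimal sketch, not a definitive tool: the function name `summarize_images` and the `IMAGE_EXTENSIONS` set are assumptions made for illustration, and file extensions are only a rough proxy for the real format.

```python
from collections import Counter
from pathlib import Path

# Assumed set of extensions to treat as images; adjust for your project.
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff",
    ".bmp", ".webp", ".heif", ".heic",
}


def summarize_images(folder):
    """Count image files by extension and total their size in bytes."""
    counts = Counter()
    total_bytes = 0
    for path in Path(folder).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in IMAGE_EXTENSIONS:
            counts[ext] += 1
            total_bytes += path.stat().st_size
    return {
        "count": sum(counts.values()),   # expected volume
        "by_format": dict(counts),       # formats encountered
        "total_bytes": total_bytes,      # storage needed
    }
```

Calling `summarize_images("path/to/images")` on a hypothetical folder returns a dictionary with the image count, a per-extension breakdown, and the total size.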
Collecting Image Data: Characteristics
• Images organized around specific themes or categories
can provide insights into a variety of topics.
• Understanding the characteristics of our image data is
important, because it helps us glean patterns in that data.
Collecting Image Data: Characteristics
• Image data has more characteristics than just the image
itself:
Characteristic   Digital Image Details
Data             The viewable image, or the analytic content
                 derived from the image
Metadata         Camera make and model, image timestamp,
                 aperture, shutter speed, caption, author,
                 legal information, copyright information, etc.
Paradata         Image source(s), image file type, image edit
                 history, image sharing history
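One way to keep the three characteristics from the table together is a small record type per image. This is only a sketch: the class name `ImageRecord`, its field names, and the example values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ImageRecord:
    """One collected image, split into the three characteristics."""
    data_path: str                                # Data: where the viewable image is stored
    metadata: dict = field(default_factory=dict)  # Metadata: camera, timestamp, caption, ...
    paradata: dict = field(default_factory=dict)  # Paradata: source, file type, edit history, ...


# Hypothetical example values, not taken from a real dataset.
record = ImageRecord(
    data_path="images/example.jpg",
    metadata={"caption": "Example caption", "author": "Unknown"},
    paradata={"source": "photography", "file_type": "JPEG", "edit_history": []},
)
print(asdict(record))
```

Keeping paradata separate from metadata makes it easier to record where an image came from without touching the information embedded in the file itself.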
Collecting Image Data: Image Format
• In addition to knowing the characteristics of image data, it
is important to know how it is formatted.
• The format used for images can affect how they will be
analyzed when fed to an algorithm or program.
• It is also important to consider the tools that will be used to
analyze the images and what formats they support.
Collecting Image Data: Image Format
• These are the most common types of image formats:
• JPEG (or JPG) – Joint Photographic Experts Group
• GIF – Graphics Interchange Format
• PNG – Portable Network Graphics
• HEIF – High Efficiency Image File Format
• TIFF – Tagged Image File Format
• BMP – Windows Bitmap
• WebP
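File extensions can be wrong, so a collection script may want to check the format from the file's first bytes. The sketch below recognizes the formats listed above by their well-known signatures. The function name `sniff_image_format` is hypothetical, and the `HEIF_BRANDS` set is a non-exhaustive assumption.

```python
# Non-exhaustive set of HEIF brand codes (an assumption for this sketch).
HEIF_BRANDS = {b"heic", b"heix", b"mif1", b"msf1"}


def sniff_image_format(header: bytes) -> str:
    """Guess an image format from the first bytes of a file."""
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    if header.startswith(b"BM"):
        return "BMP"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "WebP"
    # HEIF files start with a size field, then "ftyp", then a brand code.
    if header[4:8] == b"ftyp" and header[8:12] in HEIF_BRANDS:
        return "HEIF"
    return "unknown"
```

In practice you would read the first 12 or more bytes of each file in binary mode and pass them to the function.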
Collecting Image Data: Image Format
• You need to consider the volume and type of images you
are collecting when deciding on the best format.
• Ideally, you would choose the highest quality available for
all images, but the tradeoff is larger file sizes.
• Cornell University, a leading university in data science,
recommends TIFF for images.
Collecting Image Data: Volume
• Your storage is limited, so you should take into
consideration how many images to collect.
• While large storage devices have become cheaper in
recent years, image files have also become larger.
• The expectations of big data analytics also call for larger
and larger image datasets.
Collecting Image Data: Volume
• The quickest solution to large images is lossy compression,
that is, converting each image to a lossy format (e.g. JPEG).
• An alternative is storing all the images in a lossless
compressed archive format (e.g. .zip, .7z, or .gz).
• This retains the full quality of the images while reducing
the overall file size of the dataset.
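The lossless option can be sketched with Python's standard `zipfile` module. The helper names `archive_images` and `is_lossless` are hypothetical. The sketch stores each file under its bare file name, so it assumes that no two images share a name.

```python
import zipfile
from pathlib import Path


def archive_images(image_paths, archive_path):
    """Store images in a ZIP archive using lossless DEFLATE compression."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in image_paths:
            path = Path(path)
            # Assumes file names are unique; identical names would collide.
            zf.write(path, arcname=path.name)


def is_lossless(image_path, archive_path):
    """Check that the archived copy is byte-for-byte identical to the original."""
    image_path = Path(image_path)
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(image_path.name) == image_path.read_bytes()
```

How much space this saves depends on the source format: uncompressed formats such as BMP usually shrink noticeably, while files that are already compressed tend to shrink very little.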
Managing Image Data
• Storing and organizing your images is important, to ensure
that you can easily find and sort images as needed.
• This includes labeling, storing images in specific folders,
renaming the images themselves, or editing metadata.
• The effectiveness of these techniques decreases as the
dataset becomes larger.
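For a small dataset, labeling, foldering, and renaming can be combined in one step. The sketch below assumes a hypothetical `labels` mapping from file name to class label and copies each image into a folder named after its label.

```python
import shutil
from pathlib import Path


def organize_by_label(labels, source_dir, target_dir):
    """Copy each image into target_dir/<label>/, prefixing the name with its label."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    for filename, label in sorted(labels.items()):
        destination = target_dir / label / f"{label}_{filename}"
        destination.parent.mkdir(parents=True, exist_ok=True)
        # copy2 keeps file timestamps, and the original file stays untouched.
        shutil.copy2(source_dir / filename, destination)
        copied.append(destination)
    return copied
```

Copying rather than moving keeps the original files in place, at the cost of extra storage.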
Managing Image Data
• Using simple folders and labels is sufficient for small image
sets (up to roughly 10,000 images).
• For larger image sets, using a dedicated image
management tool is recommended (e.g. ImageKit).
Text Collection
Data Collection: Image and Text
Defining Text Collection
• Text refers to printed books, documents, and media that
cover different ideas and content.
• In data science, text content is often analyzed, for
example to determine the sentiments of users.
• The text content can come from different sources
depending on your needs and goals.
Defining Text Collection
• Text collection is a type of data collection that deals
specifically with unstructured text data.
• It typically makes use of natural language processing
(NLP) techniques to extract insights from the text.
• Text collection is a multidisciplinary field, involving different
techniques such as text retrieval and text analysis.
Collecting Text Data
• Text data comes from a variety of sources:
• Books
• Classic Literature
• Corpora
• Data Vendors
• Dictionaries
• Forums
• Interview Transcripts
• Magazines
• Open-Source Datasets
• Short Stories
• Surveys
• Web Articles
Collecting Text Data
• Text collection has two key phases in its process:
• Text Refining
• Knowledge Distillation
Collecting Text Data: Text Refining
• Text refining transforms free-form text into a chosen
intermediate form (IF).
• This IF can be in a semi-structured form such as
conceptual graphs.
• It can also be in a structured form such as relational data.
Collecting Text Data: Text Refining
• The purpose of turning text data into an IF is to make it
easier to process and organize the text.
• An IF will typically have labels for the individual text, or
tags which describe the topic or idea.
• It might also take note of keywords for sentiment analysis.
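A structured IF with tags and keywords could look like the record produced below. This is a minimal sketch under stated assumptions: the function name `refine`, the record layout, and the tiny `STOPWORDS` and `SENTIMENT_WORDS` lists are all hypothetical, and a real project would use a proper tokenizer and lexicon.

```python
import re
from collections import Counter

# Tiny illustrative word lists; placeholders, not a real lexicon.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "but", "to", "of"}
SENTIMENT_WORDS = {"great", "good", "love", "bad", "poor", "slow"}


def refine(doc_id, text, tags=()):
    """Turn free-form text into a structured record (one possible IF)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    keywords = Counter(t for t in tokens if t not in STOPWORDS)
    return {
        "id": doc_id,
        "tags": list(tags),                  # topic or idea labels
        "keywords": dict(keywords),          # keyword -> frequency
        "sentiment_terms": sorted(set(tokens) & SENTIMENT_WORDS),
    }
```

For example, `refine(1, "The camera is great but the battery is poor.", tags=["review"])` yields a record whose `sentiment_terms` are `["great", "poor"]`.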
Collecting Text Data: Text Refining
• Mining a document-based IF deduces patterns and
relationships across documents.
• Examples of this are clustering/visualization and
categorization.
Collecting Text Data: Text Refining
• Mining a concept-based IF deduces patterns and
relationships across objects and concepts.
• Examples of this are predictions and associative
discovery.
Collecting Text Data: Knowledge Distillation
• Knowledge distillation deduces patterns or knowledge
from the IF.
• This is where you will utilize analysis tools or machine
learning models to comb through your text data.
• The tools used for knowledge distillation depend on the
goals of your data mining process.
Collecting Text Data: Knowledge Distillation
• These are examples of methods used for text data
analysis:
• Text Data Categorization
• Text Data Extraction
• Text Data Identification
• Text Data Parsing
• Text Data Translation
• Text Data Visualization
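As an illustration of the first method in the list, text data categorization, the sketch below assigns a document to the category whose keywords overlap most with the document's own keywords. The categories in `CATEGORY_KEYWORDS` and the function name `categorize` are hypothetical. Real systems typically use trained models rather than hand-written keyword sets.

```python
# Hypothetical categories and their indicator keywords.
CATEGORY_KEYWORDS = {
    "hardware": {"camera", "battery", "screen"},
    "service": {"delivery", "support", "refund"},
}


def categorize(keywords):
    """Pick the category whose keyword set overlaps most with the document's keywords."""
    doc_words = set(keywords)
    scores = {name: len(words & doc_words) for name, words in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # With no overlap at all, do not force the document into a category.
    return best if scores[best] > 0 else "uncategorized"
```

The function accepts any iterable of keywords, including a keyword-to-frequency dictionary, since only the keys are used.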
Managing Text Data
• Typically, we store both the “raw” data itself and the
analyzed/“new” data together.
• This allows us to validate our text refining process and
ensure that our text is being interpreted correctly.
• We also keep the “raw” data so that we can utilize
multiple data mining techniques on the same data.
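One simple way to keep raw and analyzed text together is to store both in the same record, one JSON object per line. This is a sketch only: the field names `raw` and `refined` and the helper names `to_jsonl` and `from_jsonl` are assumptions made for illustration.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines, one document per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical example: the raw text sits next to its refined form.
records = [
    {"id": 1, "raw": "The camera is great!", "refined": {"keywords": ["camera", "great"]}},
]
restored = from_jsonl(to_jsonl(records))
```

Because the raw text travels with each record, the refining step can be re-run or replaced later without going back to the original sources.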
Managing Text Data
• Processing text data is iterative in nature, which means
that we will better understand our data as we analyze it.
• This process can involve various specialized techniques
such as feature selection and feature extraction.
• Due to this, we usually end up with a corpus at the end
of our data mining process.
Managing Text Data
• A corpus acts as a collection of the linguistic patterns
that we have analyzed from our text data.
• This is what allows us to analyze new data and make
predictions.
• When we rebuild a text mining tool, we typically also
rebuild the corpus.
