CC19 – Data Mining

Agenda
• Image Collection
  • Defining Image Collection
  • Collecting Image Data
  • Managing Image Data
• Text Collection
  • Defining Text Collection
  • Collecting Text Data
  • Managing Text Data

Image Collection
Data Collection: Image and Text

Defining Image Collection
• An image is a visual or mental representation of something or someone.
• In data science, each image belongs to a class or type.
• Different types of images are used to conduct analysis and train modern systems.

Defining Image Collection
• Image collection refers to the practice of utilizing digital images for research and data science.
• Using images for data science is common now, but it is a recent practice in the information age.
• Image collection requires its own steps and tools due to the nature and characteristics of images.

Collecting Image Data
• Collecting image data can be challenging, since images come from a variety of sources:
  • Forums
  • Data Vendors
  • Digital Creation
  • Image Sharing Platforms
  • Image Scanning
  • Photography
  • Open-Source Datasets
  • Search Engines
  • Social Media

Collecting Image Data
• When collecting images, we need to know what type of images we want to collect.
• Ideally, we should be able to answer these questions about our image data:
  • What type of images will be gathered?
  • In what format will the images appear?
  • What is the expected volume of images?

Collecting Image Data: Characteristics
• Images organized around specific themes or categories can provide insights into a variety of topics.
• Understanding the characteristics of images is important for gleaning patterns in our data.
• This is why we need to know the characteristics of our image data.
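
The format and volume questions above can be answered for images that are already on disk with a short script. The following is a minimal sketch, not a definitive tool: the function name `summarize_image_folder` and the set of file extensions are assumptions made for illustration, and the script identifies formats by file extension only.

```python
from collections import Counter
from pathlib import Path

# Extensions treated as images here. This set is an assumption; adjust it
# to the formats you actually expect to collect.
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff",
    ".bmp", ".webp", ".heif", ".heic",
}


def summarize_image_folder(folder):
    """Count image files per extension and total their size in bytes."""
    counts = Counter()
    total_bytes = 0
    # rglob("*") walks the folder and all of its subfolders.
    for path in Path(folder).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in IMAGE_EXTENSIONS:
            counts[ext] += 1
            total_bytes += path.stat().st_size
    return {
        "count_by_format": dict(counts),
        "total_images": sum(counts.values()),
        "total_bytes": total_bytes,
    }
```

Calling `summarize_image_folder("my_images")` (a hypothetical folder name) returns the number of images per format, the total image count, and the total size in bytes, which gives a first estimate of the storage a collection needs.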
Collecting Image Data: Characteristics
• Image data has more characteristics than just the image itself:

Characteristic   Digital Image Details
Data             The viewable image, or the analytic content derived from the image
Metadata         Camera make and model, image timestamp, aperture, shutter speed, caption, author, legal information, copyright information, etc.
Paradata         Image source(s), image file type, image edit history, image sharing history

Collecting Image Data: Image Format
• In addition to knowing the characteristics of image data, it is important to know how it is formatted.
• The format used for images can affect how they will be analyzed when fed to an algorithm or program.
• It is also important to consider the tools that will be used to analyze the images and what formats they support.

Collecting Image Data: Image Format
• These are the most common image formats:
  • JPEG (or JPG) – Joint Photographic Experts Group
  • GIF – Graphics Interchange Format
  • PNG – Portable Network Graphics
  • HEIF – High Efficiency Image File Format
  • TIFF – Tagged Image File Format
  • BMP – Windows Bitmap
  • WebP

Collecting Image Data: Image Format
• You need to consider the volume and type of images you are collecting when deciding on the best format.
• Ideally, you would choose the highest quality available for all images, but the tradeoff is larger file sizes.
• Cornell University, a leading university in data science, recommends TIFF for images.

Collecting Image Data: Volume
• Your storage is limited, so you should take into consideration how many images to collect.
• Although large storage devices have become relatively cheaper in recent years, images have also become larger.
• The expectations of big data analytics have also called for larger and larger image datasets.

Collecting Image Data: Volume
• The quickest solution to large images is lossy compression: converting the images to a lossy format (e.g. JPEG).
• An alternative is storing all the images in a losslessly compressed archive format (e.g. .zip, .7z, or .gz).
• This retains the quality of the images while reducing the overall file size of the dataset.

Managing Image Data
• Storing and organizing your images is important, so that you can easily find and sort images as needed.
• This includes labeling, storing images in specific folders, renaming the images themselves, or editing metadata.
• The effectiveness of these techniques decreases as the dataset becomes larger.

Managing Image Data
• Using simple folders and labels is sufficient for small image sets (~10,000 images).
• As you make use of larger image sets, using a specialized image management tool is recommended (e.g. ImageKit).

Text Collection
Data Collection: Image and Text

Defining Text Collection
• Text refers to printed books, documents, and media that cover different ideas and content.
• In data science, text content is typically analyzed to determine the sentiments of users.
• The text content can come from different sources depending on your needs and goals.

Defining Text Collection
• Text collection is a type of data collection that deals specifically with unstructured text data.
• It typically makes use of natural language processing (NLP) techniques to extract insights from the text.
• Text collection is a multidisciplinary field, involving different techniques such as text retrieval and text analysis.

Collecting Text Data
• Text data comes from a variety of sources:
  • Books
  • Classic Literature
  • Corpora
  • Data Vendors
  • Dictionaries
  • Forums
  • Interview Transcripts
  • Magazines
  • Open-Source Datasets
  • Short Stories
  • Surveys
  • Web Articles

Collecting Text Data
• Text collection has two key phases in its process:
  • Text Refining
  • Knowledge Distillation

Collecting Text Data: Text Refining
• Text refining transforms free-form text into a chosen intermediate form (IF).
• This IF can be in a semi-structured form, such as conceptual graphs.
• It can also be in a structured form, such as relational data.

Collecting Text Data: Text Refining
• The purpose of turning text data into an IF is to make it easier to process and organize the text.
• An IF will typically have labels for each individual text, or tags which describe its topic or idea.
• It might also take note of keywords for sentiment analysis.

Collecting Text Data: Text Refining
• Mining a document-based IF deduces patterns and relationships across documents.
• Examples of this are clustering/visualization and categorization.

Collecting Text Data: Text Refining
• Mining a concept-based IF deduces patterns and relationships across objects and concepts.
• Examples of this are predictions and associative discovery.

Collecting Text Data: Knowledge Distillation
• Knowledge distillation deduces patterns or knowledge from the IF.
• This is where you will utilize analysis tools or machine learning models to comb through your text data.
• The tools used for knowledge distillation depend on the goals of your data mining process.

Collecting Text Data: Knowledge Distillation
• These are examples of methods used for text data analysis:
  • Text Data Categorization
  • Text Data Extraction
  • Text Data Identification
  • Text Data Parsing
  • Text Data Translation
  • Text Data Visualization

Managing Text Data
• Typically, we store both the “raw” data itself and the analyzed (“new”) data together.
• This allows us to validate our text refining process and ensure that our text is being interpreted correctly.
• We also keep the “raw” data so that we can apply multiple data mining techniques to the same data.

Managing Text Data
• Processing text data is iterative in nature, which means that we understand our data better as we analyze it.
• This process can involve various specialized techniques such as feature selection and feature extraction.
• As a result, we usually end up with a corpus at the end of our data mining process.

Managing Text Data
• A corpus acts as a collection of the linguistic patterns that we have analyzed from our text data.
• This is what allows us to analyze new data and make predictions.
• When we rebuild a text mining tool, we typically also rebuild the corpus.
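
The Managing Text Data points above (keeping the raw text next to its analyzed form, and ending up with a corpus of observed patterns) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record layout and the function names `refine` and `build_term_counts` are hypothetical, and plain term counts stand in for the richer linguistic patterns a real corpus would hold.

```python
import re
from collections import Counter


def refine(doc_id, raw_text, label=None):
    """Turn free-form text into a structured record, keeping the raw text."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", raw_text.lower())
    return {"id": doc_id, "raw": raw_text, "tokens": tokens, "label": label}


def build_term_counts(records):
    """Aggregate token counts across all refined records."""
    counts = Counter()
    for record in records:
        counts.update(record["tokens"])
    return counts


records = [
    refine(1, "Great camera, great price."),
    refine(2, "The price is too high."),
]
term_counts = build_term_counts(records)
print(term_counts["price"])  # 2
```

Because each record keeps its `raw` field, the tokenization can be checked against the original text, and the same raw data can be refined again with a different technique later.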