Unstructured data - Wikipedia https://en.wikipedia.org/wiki/Unstructured_data
Unstructured data
Unstructured data (or unstructured information) is information that either does not
have a pre-defined data model or is not organized in a pre-defined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts as
well. This results in irregularities and ambiguities that make it difficult to understand using
traditional programs as compared to data stored in fielded form in databases or annotated
(semantically tagged) in documents.
In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an
organization, some estimates run as high as 80%."[1] The source of this figure is unclear, but it
is nonetheless accepted by some.[2] Other sources have reported similar or higher
percentages of unstructured data.[3][4][5]
In 2012, IDC and Dell EMC projected that data would grow to 40 zettabytes by 2020, a
50-fold increase from the beginning of 2010.[6] More recently, IDC and Seagate predicted that the
global datasphere would grow to 163 zettabytes by 2025,[7] and that the majority of that would be
unstructured. Computerworld magazine states that unstructured information might
account for more than 70–80% of all data in organizations.[1]
Contents
Background
Issues with terminology
Dealing with unstructured data
Approaches in natural language processing
Approaches in medicine and biomedical research
The use of "unstructured" in data privacy regulations
See also
Notes
References
External links
Background
The earliest research into business intelligence focused on unstructured textual data, rather
than numerical data.[8] As early as 1958, computer science researchers such as H.P. Luhn were
particularly concerned with the extraction and classification of unstructured text.[8] However,
only since the turn of the century has the technology caught up with the research interest. In
2004, the SAS Institute developed the SAS Text Miner, which uses Singular Value
Decomposition (SVD) to reduce a high-dimensional textual space into fewer dimensions for
significantly more efficient machine analysis.[9] The mathematical and technological advances
sparked by machine textual analysis prompted a number of businesses to research applications,
leading to the development of fields like sentiment analysis, voice of the customer mining, and
call center optimization.[10] The emergence of Big Data in the late 2000s led to a heightened
interest in the applications of unstructured data analytics in contemporary fields such as
predictive analytics and root cause analysis.[11]
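The dimensionality reduction mentioned above can be illustrated with a minimal sketch. It does not reproduce SAS Text Miner. It assumes a tiny, hypothetical term-document count matrix and substitutes plain power iteration (a simpler technique than a full SVD) to approximate only the largest singular value and its vectors. The function name top_singular_triplet and the sample counts are illustrative assumptions.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value of `matrix` and its left/right
    singular vectors by power iteration on A^T A.

    Assumes a non-zero matrix with non-negative entries, so that the uniform
    starting vector is not orthogonal to the dominant direction.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # normalized starting guess
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, then renormalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


# Hypothetical counts: rows are terms, columns are documents.
term_document = [
    [2, 1, 0],  # "text"
    [1, 2, 0],  # "mining"
    [0, 0, 1],  # "audio"
]
sigma, term_weights, doc_weights = top_singular_triplet(term_document)

# Each document is now described by one coordinate instead of three term counts.
doc_coordinates = [sigma * w for w in doc_weights]
```

In this toy matrix the first two documents share vocabulary and receive the same coordinate, while the third, which shares no terms with them, ends up near zero. A real system would keep several singular vectors rather than one.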
Issues with terminology

The term is imprecise for several reasons:
1. Structure, while not formally defined, can still be implied.

2. Data with some form of structure may still be characterized as unstructured if its structure is
not helpful for the processing task at hand.
3. Unstructured information might have some structure (semi-structured) or even be highly
structured but in ways that are unanticipated or unannounced.
Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide
different methods to find patterns in, or otherwise interpret, this information. Common
techniques for structuring text usually involve manual tagging with metadata or part-of-speech
tagging for further text mining-based structuring. The Unstructured Information Management
Architecture (UIMA) standard provided a common framework for processing this information
to extract meaning and create structured data about the information.[12]
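As a minimal sketch of tagging text with metadata, the following assumes that simple regular-expression rules stand in for the tagging step. It is not UIMA and not part-of-speech tagging. The function name tag_text, the two patterns, and the sample sentence are hypothetical.

```python
import re

# Illustrative patterns only: ISO-style dates and percentages.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def tag_text(text):
    """Return a structured record of the dates and percentages found in `text`."""
    return {
        "dates": ISO_DATE.findall(text),
        "percentages": [float(p) for p in PERCENT.findall(text)],
        "text": text,  # keep the original free text alongside the extracted fields
    }


# Made-up sample sentence for illustration.
record = tag_text(
    "Press release of 2012-12-11: less than 1% of data is analyzed "
    "and less than 20% is protected."
)
```

The resulting record holds the free text together with fielded values that a conventional program can filter or sort on. Rules like these are brittle, which is one reason statistical NLP methods are used in practice.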
Software that creates machine-processable structure can utilize the linguistic, auditory, and
visual structure that exists in all forms of human communication.[13] Algorithms can infer this
inherent structure from text, for instance, by examining word morphology, sentence syntax, and
other small- and large-scale patterns. Unstructured information can then be enriched and
tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate
search and discovery. Examples of "unstructured data" may include books, journals, documents,
metadata, health records, audio, video, analog data, images, files, and unstructured text such
as the body of an e-mail message, Web page, or word-processor document. While the main content
being conveyed does not have a defined structure, it generally comes packaged in objects (e.g.,
files or documents) that themselves have structure. Such objects are thus a mix of structured and
unstructured data, but collectively this is still referred to as "unstructured data".[14] For
example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering.
It does not capture the meaning or function of tagged elements in ways that support automated
processing of the information content of the page. XHTML tagging does allow machine
processing of elements, although it typically does not capture or convey the semantic meaning
of tagged terms.
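The point about HTML can be sketched with Python's standard html.parser module. This is an illustrative example under stated assumptions: the class name TextExtractor, the helper html_to_text, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page's text content as one whitespace-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the tags describe layout, not meaning.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Invoice</h1><p>Total due: <b>40</b> EUR</p></body></html>"
)
text = html_to_text(page)
```

The mark-up says that "40" is bold, not that it is an amount due. After extraction, what remains is free text that still needs interpretation.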
Since unstructured data commonly occurs in electronic documents, the use of a content or
document management system that can categorize entire documents is often preferred over
data transfer and manipulation from within the documents. Document management thus
provides the means to impose structure on document collections.
Search engines have become popular tools for indexing and searching through such data,
especially text.
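One data structure commonly used for this kind of text search is an inverted index, which maps each term to the documents containing it. The following is a minimal sketch, not a description of any particular search engine. The function names, the tokenizer, and the sample documents are illustrative assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical document collection.
docs = {
    1: "Unstructured data in e-mail bodies",
    2: "Structured data in database tables",
    3: "E-mail archives and audio",
}
index = build_index(docs)
hits = search(index, "e-mail data")
```

Building the index once lets each query be answered by set intersection instead of rescanning every document.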
Approaches in natural language processing
Specific computational workflows have been developed to impose structure upon the
unstructured data contained within text documents. These workflows are generally designed to
handle sets of thousands or even millions of documents, far more than manual approaches to
annotation could handle. Several of these approaches are based upon the concept of online
analytical processing, or OLAP, and may be supported by data models such as text cubes.[15]
Once document metadata is available through a data model, generating summaries of subsets of
documents (i.e., cells within a text cube) may be performed with phrase-based approaches.[16]
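The idea of a text cube can be sketched as follows: documents are grouped into cells by their metadata dimensions, and each cell is then summarized. This is a simplified illustration, not the measures of the cited Text Cube or CaseOLAP work. Single-term frequency stands in for phrase ranking, and the function names and sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_text_cube(documents, dimensions):
    """Group term counts by the tuple of metadata values named in `dimensions`."""
    cube = defaultdict(Counter)
    for doc in documents:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].update(doc["text"].lower().split())
    return cube


def summarize_cell(cube, cell, k=3):
    """Return the k most frequent terms in one cell (a stand-in for phrase ranking)."""
    return [term for term, _ in cube.get(cell, Counter()).most_common(k)]


# Hypothetical documents with two metadata dimensions.
documents = [
    {"year": 2017, "topic": "cardiology", "text": "heart failure heart protein"},
    {"year": 2017, "topic": "cardiology", "text": "heart protein"},
    {"year": 2018, "topic": "oncology", "text": "tumor growth tumor"},
]
cube = build_text_cube(documents, ("year", "topic"))
summary = summarize_cell(cube, (2017, "cardiology"), k=2)
```

Choosing fewer dimensions, for example only ("topic",), yields coarser cells, which corresponds to a roll-up in OLAP terms.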
Approaches in medicine and biomedical research
Biomedical research is one major source of unstructured data, as researchers often
publish their findings in scholarly journals. Although it is challenging to derive structural
elements from the language in these documents (e.g., due to the complicated technical vocabulary
contained within and the domain knowledge required to fully contextualize observations), the
results of these activities may yield links between technical and medical studies[17] and clues
regarding new disease therapies.[18] Recent efforts to impose structure upon biomedical
documents include self-organizing map approaches for identifying topics among documents,[19]
general-purpose unsupervised algorithms,[20] and an application of the CaseOLAP workflow[16]
to determine associations between protein names and cardiovascular disease topics in the
literature.[21] CaseOLAP defines phrase-category relationships in an accurate (identifies
relationships), consistent (highly reproducible), and efficient manner. This platform offers
enhanced accessibility and empowers the biomedical community with phrase-mining tools for
widespread biomedical research applications.[21]
The use of "unstructured" in data privacy regulations

In Sweden (EU), before 2018, some data privacy regulations did not apply if the data in question
was confirmed as "unstructured".[22] The term "unstructured data" has rarely been used in the
EU since the GDPR came into force in 2018. The GDPR neither mentions nor defines "unstructured
data". It does use the word "structured", without defining it, as follows:
▪ Parts of GDPR Recital 15, "The protection of natural persons should apply to the processing
of personal data ... if ... contained in a filing system."
▪ GDPR Article 4, "‘filing system’ means any structured set of personal data which are
accessible according to specific criteria ..."
GDPR case law on what defines a "filing system": "the specific criterion and the specific form in
which the set of personal data collected by each of the members who engage in preaching is
actually structured is irrelevant, so long as that set of data makes it possible for the data relating
to a specific person who has been contacted to be easily retrieved, which is however for the
referring court to ascertain in the light of all the circumstances of the case in the main
proceedings." (CJEU, Jehovan todistajat v. Tietosuojavaltuutettu, paragraph 61
(https://curia.europa.eu/juris/document/document.jsf?docid=203822&doclang=EN%7CJehovan)).
If personal data can be easily retrieved, then it forms a filing system and is in scope for the GDPR,
regardless of being "structured" or "unstructured". Most electronic systems today, subject to
access and applied software, can allow for easy retrieval of data.
See also
▪ Clustering
▪ Pattern recognition
▪ List of text mining software
▪ Semi-structured data
▪ Structured data
Notes
1. ^ Today's Challenge in Government: What to do with Unstructured Information and Why
Doing Nothing Isn't An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov
2010
References
1. Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals" (https://web.archive.org/web/20110724175845/http://ikt.hia.no/perep/eip_ind.pdf) (PDF). Merrill Lynch. Archived from the original (http://ikt.hia.no/perep/eip_ind.pdf) (PDF) on 24 July 2011.
2. Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule" (http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule). Breakthrough Analysis - Bridgepoints. Clarabridge.
3. Gandomi, Amir; Haider, Murtaza (April 2015). "Beyond the hype: Big data concepts, methods, and analytics" (https://doi.org/10.1016%2Fj.ijinfomgt.2014.10.007). International Journal of Information Management. 35 (2): 137–144. doi:10.1016/j.ijinfomgt.2014.10.007 (https://doi.org/10.1016%2Fj.ijinfomgt.2014.10.007). ISSN 0268-4012 (https://www.worldcat.org/issn/0268-4012).
4. "The biggest data challenges that you might not even know you have - Watson" (https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/). Watson. 2016-05-25. Retrieved 2018-10-02.
5. "Structured vs. Unstructured Data" (https://www.datamation.com/big-data/structured-vs-unstructured-data.html). www.datamation.com. Retrieved 2018-10-02.
6. "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected" (http://www.emc.com/about/news/press/2012/20121211-01.htm). www.emc.com. EMC Corporation. December 2012.
7. "Trends | Seagate US" (https://www.seagate.com/our-story/data-age-2025/). Seagate.com. Retrieved 2018-10-01.
8. Grimes, Seth. "A Brief History of Text Analytics" (http://www.b-eye-network.com/view/6311). B Eye Network. Retrieved June 24, 2016.
9. Albright, Russ. "Taming Text with the SVD" (https://web.archive.org/web/20160930182157/http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf) (PDF). SAS. Archived from the original (http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf) (PDF) on 2016-09-30. Retrieved June 24, 2016.
10. Desai, Manish (2009-08-09). "Applications of Text Analytics" (http://mybusinessanalytics.blogspot.com/2009/08/applications-of-text-analytics.html). My Business Analytics @ Blogspot. Retrieved June 24, 2016.
11. Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining" (https://support.sas.com/resources/papers/proceedings14/1288-2014.pdf) (PDF). SAS. Retrieved June 24, 2016.
12. Holzinger, Andreas; Stocker, Christof; Ofner, Bernhard; Prohaska, Gottfried; Brabenetz, Alberto; Hofmann-Wellenhof, Rainer (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery – Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field" (https://semanticscholar.org/paper/6a81bb782a68c72ec26e79463cd2aec1d0cd917c). In Holzinger, Andreas; Pasi, Gabriella (eds.). Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24. doi:10.1007/978-3-642-39146-0_2 (https://doi.org/10.1007%2F978-3-642-39146-0_2). ISBN 978-3-642-39146-0. S2CID 39461100 (https://api.semanticscholar.org/CorpusID:39461100).
13. "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?" (http://www.intelligententerprise.com/showArticle.jhtml?articleID=59301538). InformationWeek. March 1, 2005.
14. Malone, Robert (April 5, 2007). "Structuring Unstructured Data" (https://www.forbes.com/2007/04/04/teradata-solution-software-biz-logistics-cx_rm_0405data.html). Forbes.
15. Lin, Cindy Xide; Ding, Bolin; Han, Jiawei; Zhu, Feida; Zhao, Bo (December 2008). Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. 2008 Eighth IEEE International Conference on Data Mining. IEEE. CiteSeerX 10.1.1.215.3177 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.215.3177). doi:10.1109/icdm.2008.135 (https://doi.org/10.1109%2Ficdm.2008.135). ISBN 9780769535029. S2CID 1522480 (https://api.semanticscholar.org/CorpusID:1522480).
16. Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (http://sites.computer.org/debull/A16sept/p74.pdf) (PDF).
17. Collier, Nigel; Nazarenko, Adeline; Baud, Robert; Ruch, Patrick (June 2006). "Recent advances in natural language processing for biomedical applications". International Journal of Medical Informatics. 75 (6): 413–417. doi:10.1016/j.ijmedinf.2005.06.008 (https://doi.org/10.1016%2Fj.ijmedinf.2005.06.008). ISSN 1386-5056 (https://www.worldcat.org/issn/1386-5056). PMID 16139564 (https://pubmed.ncbi.nlm.nih.gov/16139564).
18. Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S. (January 2016). "Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4719073). Briefings in Bioinformatics. 17 (1): 33–42. doi:10.1093/bib/bbv087 (https://doi.org/10.1093%2Fbib%2Fbbv087). ISSN 1477-4054 (https://www.worldcat.org/issn/1477-4054). PMC 4719073 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4719073). PMID 26420781 (https://pubmed.ncbi.nlm.nih.gov/26420781).
19. Skupin, André; Biberstine, Joseph R.; Börner, Katy (2013). "Visualizing the topical structure of the medical sciences: a self-organizing map approach" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3595294). PLOS ONE. 8 (3): e58779. Bibcode:2013PLoSO...858779S (https://ui.adsabs.harvard.edu/abs/2013PLoSO...858779S). doi:10.1371/journal.pone.0058779 (https://doi.org/10.1371%2Fjournal.pone.0058779). ISSN 1932-6203 (https://www.worldcat.org/issn/1932-6203). PMC 3595294 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3595294). PMID 23554924 (https://pubmed.ncbi.nlm.nih.gov/23554924).
20. Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna (2015-04-01). "Unsupervised discovery of information structure in biomedical documents" (https://doi.org/10.1093%2Fbioinformatics%2Fbtu758). Bioinformatics. 31 (7): 1084–1092. doi:10.1093/bioinformatics/btu758 (https://doi.org/10.1093%2Fbioinformatics%2Fbtu758). ISSN 1367-4811 (https://www.worldcat.org/issn/1367-4811). PMID 25411329 (https://pubmed.ncbi.nlm.nih.gov/25411329).
21. Liem, David A.; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard; Caufield, John H.; Wang, Wei; Ping, Peipei; Han, Jiawei (Oct 1, 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6230912). American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018 (https://doi.org/10.1152%2Fajpheart.00175.2018). ISSN 1522-1539 (https://www.worldcat.org/issn/1522-1539). PMC 6230912 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6230912). PMID 29775406 (https://pubmed.ncbi.nlm.nih.gov/29775406).
22. "Swedish data privacy regulations discontinue separation of "unstructured" and "structured"" (https://sverigeskommunikatorer.se/kunskap/nyheter/gdpr-del-3--missbruksregeln-upphor-vad-innebar-det-for-kommunikatoren/#:~:text=Vad%20inneb%C3%A4r%20Missbruksregeln%3F,men%20%C3%A4ven%20publicering%20av%20bilder).
External links
▪ Matching Unstructured Data and Structured Data (http://www.tdan.com/view-articles/5009)
▪ a brief description for Structured Data (https://dynomapper.com/blog/21-sitemaps-and-seo/433-what-is-structured-data-for-seo)
▪ Unstructured Data Definition, Examples, Benefits & Challenges (https://securiti.ai/unstructured-data-101-definition-examples-benefits-challenges/)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Unstructured_data&oldid=1117044522"
This page was last edited on 19 October 2022, at 17:01 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License 3.0; additional terms may apply. By
using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the
Wikimedia Foundation, Inc., a non-profit organization.