List of Datasets For Machine-Learning Research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of
machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively,
the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are
usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality
datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4][5]
Many organizations, including governments, publish and share their datasets. The datasets are classified, based on their licenses, as Open data and Non-Open data.
The datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be
searched, deposited and accessed through interfaces such as Open API. The datasets are organized into various types and subtypes.
List of sorting used for datasets

Type | Subtypes
Specific category | Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality
Scope | Supranational Union, National, Subnational, Municipality, Urban, Rural
Language | Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali
Type | Tabular, Graph, Text, Image, Sound, Video
Usage | Training, validating, and testing
File-Formats | CSV, JSON, XML, KML, GeoJSON, Shapefile, GML
Licenses | Creative-Commons, GPL, Other Non-Open data licenses
Last-Updated | Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year
File-Size | Minimum, Maximum, Range
Status (https://docs.openml.org/#dataset-status) | Verified, In-Preparation, Deactivated (or Deprecated)
Number of records | 100s, 1000s, 10000s, 100000s, Millions
Number of variables | Less than 10, 10s, 100s, 1000s, 10000s
Services | Individual, Aggregation
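As a rough illustration of how the sorting criteria listed above might be used to narrow a catalogue, the following Python sketch filters a handful of dataset records by facet values. The `DatasetRecord` class, its field names, and the sample entries are all hypothetical and invented for this example; real portals expose their own metadata schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalogue entry; fields mirror some of the facets above."""
    name: str
    category: str     # e.g. "Health", "Transport"
    scope: str        # e.g. "National", "Municipality"
    data_type: str    # e.g. "Tabular", "Image"
    file_format: str  # e.g. "CSV", "GeoJSON"
    status: str       # e.g. "Verified", "In-Preparation", "Deactivated"
    n_records: int


def filter_records(records, **facets):
    """Return the records whose attributes equal every requested facet value."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in facets.items())
    ]


# Invented sample entries, for illustration only.
CATALOGUE = [
    DatasetRecord("city-bus-stops", "Transport", "Municipality", "Tabular", "CSV", "Verified", 4200),
    DatasetRecord("clinic-locations", "Health", "National", "Tabular", "GeoJSON", "Verified", 950),
    DatasetRecord("road-sign-photos", "Transport", "National", "Image", "JSON", "In-Preparation", 18000),
]

verified_transport = filter_records(CATALOGUE, category="Transport", status="Verified")
print([r.name for r in verified_transport])
```

Exact-match filtering keeps the sketch short; range facets such as file size or number of records would need a comparison predicate instead.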
Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many
government organizations and academic institutions.
List of open data portals

Portal-Name | License | List of Installations of the Portal | Typical Usages
Comprehensive Knowledge Archive Network (CKAN) | AGPL | https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN (https://getdkan.org/) | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse | Apache | https://dataverse.org/installations https://dataverse.org/metrics | Data Management Solution for Research Institutes
DSpace | BSD | https://registry.lyrasis.org/ | Data Management Solution for Research Institutes
OpenML (https://www.openml.org/) | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data Management Solution to share datasets, algorithms, and experiments results through APIs.
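Portals of this kind are typically queried over HTTP. As a minimal sketch, assuming a CKAN installation that exposes the standard Action API, the following Python code builds a `package_search` request URL and reads dataset names out of a decoded response. The portal address `https://ckan.example.org/` is a placeholder, and the sample response is hand-written so that the sketch runs without network access.

```python
from urllib.parse import urlencode


def ckan_search_url(portal_base, query, rows=10, start=0):
    """Build a CKAN Action API `package_search` URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_names(response):
    """Extract dataset names from a decoded `package_search` JSON response."""
    if not response.get("success"):
        return []
    return [pkg["name"] for pkg in response["result"]["results"]]


# Placeholder portal address; substitute a real installation from the table.
url = ckan_search_url("https://ckan.example.org/", "air quality", rows=5)
print(url)

# Hand-written stand-in for a decoded response (not real portal data).
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [{"name": "air-quality-2020"}, {"name": "air-sensors"}],
    },
}
print(dataset_names(sample_response))
```

Other portal software in the table (DKAN, Dataverse, DSpace, OpenML) has its own API conventions, so this URL layout should not be assumed to carry over.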
List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes pertaining to many machine learning applications.
Academic Torrents https://academictorrents.com
Amazon Datasets https://registry.opendata.aws/
Awesome Public Datasets Collection https://github.com/awesomedata/awesome-public-datasets

data.world https://data.world/datasets/machine-learning
Datahub – Core Datasets https://datahub.io/docs/core-data

DataONE https://www.dataone.org/
DataPortals https://dataportals.org/
Datasetlist.com https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation https://index.okfn.org/ Archived (https://web.archive.org/web/20200525213547/https://index.okfn.org/) 25 May 2020 at the Wayback Machine
Google Dataset Search https://datasetsearch.research.google.com/
Hugging Face https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange https://developer.ibm.com/exchanges/data/

Jupyter – Tutorial Data https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle https://www.kaggle.com/datasets
Machine learning datasets https://macgence.com/data-sets-and-cataloges/

Major Smart Cities with Open Data https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets https://msropendata.com/datasets
Open Data Inception https://opendatainception.io/

Opendatasoft https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR https://v2.sherpa.ac.uk/opendoar/
OpenML https://www.openml.org/search?type=data
Papers with Code https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs https://github.com/public-apis/public-apis

Registry of Open Access Repositories http://roar.eprints.org/
REgistry of REsearch Data REpositories https://www.re3data.org/
UCI Machine Learning Repository http://mlr.cs.umass.edu/ml/

Speech Dataset https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery https://visualdata.io/discovery
List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the subsequent sections.
Image data
These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.
Facial recognition
In computer vision, face images have been used extensively to develop facial recognition systems, face detection methods, and many other projects that use images of faces.
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Aff-Wild | 298 videos of 200 individuals, ~1,250,000 manually annotated images: annotated in terms of dimensional affect (valence-arousal); in-the-wild setting; color database; various resolutions (average = 640x360) | the detected faces, facial landmarks and valence-arousal annotations | ~1,250,000 manually annotated images | video (visual + audio modalities) | affect recognition (valence-arousal estimation) | 2017 | CVPR[6] IJCV[7] | D.Kollias et al.
Aff-Wild2 | 558 videos of 458 individuals, ~2,800,000 manually annotated images: annotated in terms of i) categorical affect (7 basic expressions: neutral, happiness, sadness, surprise, fear, disgust, anger); ii) dimensional affect (valence-arousal); iii) action units (AUs 1,2,4,6,12,15,20,25); in-the-wild setting; color database; various resolutions (average = 1030x630) | the detected faces, detected and aligned faces and annotations | ~2,800,000 manually annotated images | video (visual + audio modalities) | affect recognition (valence-arousal estimation, basic expression classification, action unit detection) | 2019 | BMVC[8] FG[9] | D.Kollias et al.
FERET (facial recognition technology) | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [10][11] | United States Department of Defense
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | [12][13] | S.R. Livingstone and F.A. Russo
SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [14][15] | M. Grgic et al.
Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [16][17] | J. Yang et al.
Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [18][19] | T. Kanade et al.
JAFFE Facial Expression Database | 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models. | Images are cropped to the facial region. Includes semantic ratings data on emotion labels. | 213 | Images, text | Facial expression cognition | 1998 | [20][21] | Lyons, Kamachi, Gyoba
FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [22][23] | H. Ng et al.
BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [24][25] | BioID
Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [26][27] | R. Bhatt.
Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [28][29] | A Savran et al.
UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [30][31] | University of York
CASIA 3D Face Database | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [32][33] | Institute of Automation, Chinese Academy of Sciences
CASIA NIR | Expressions: Anger Disgust Fear Happiness Sadness Surprise | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | [34] | Zhao, G. et al.
BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [35] | Binghamton University
Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [36][37] | National Institute of Standards and Technology
Gavabdb | Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [38][39] | King Juan Carlos University
3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [40][41] | Royal Military Academy (Belgium)
SoF | 112 persons (66 males and 46 females) wear glasses under different illumination conditions. | A set of synthetic filters (blur, occlusions, noise, and posterization) with different level of difficulty. | 42,592 (2,662 original image × 16 synthetic image) | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | [42][43] | Afifi, M. et al.
IMDb-WIKI | IMDb and Wikipedia face images with gender and age labels. | None | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | [44] | R. Rothe, R. Timofte, L. V. Gool
Action recognition
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
TV Human Interaction Dataset | Videos from 20 different TV shows for prediction social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | video clips | Action prediction | 2013 | [45] | Patron-Perez, A. et al.
Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [46] | Ofli, F. et al.
THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [47][48] | Y. Jiang et al.
MEXAction2 | Video dataset for action localization and spotting | Actions classified and labeled. | 1000 | Video | Action detection | 2014 | [49] | Stoian et al.
Object detection and recognition
Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference
Visual Images and their Image [50] R.

108,000 images, text 2016
Genome description captioning al.
Berkeley 3-D 849 images taken Object bounding boxes 849 labeled images, text Object 2014 [51][52] A.
Object in 75 different and labeling. recognition al.
Dataset scenes. About 50
different object
classes are
labeled.
500 natural images,

Berkeley explicitly separated
Contour
Segmentation into disjoint train,
Each image segmented detection and Un
Data Set and validation and test [53]
by five different subjects 500 Segmented images hierarchical 2011 Ca
Benchmarks subsets +
on average. image Be
500 benchmarking
segmentation
(BSDS500) code. Based on
BSDS300.
Microsoft
complex everyday Object highlighting,
Common
scenes of common labeling, and Object [54][55][56]
Objects in 2,500,000 Labeled images, text 2015 T.
objects in their classification into 91 recognition
Context
natural context. object types.
(COCO)
Very large scene Object

Places and objects are
SUN and object recognition, [57][58]
labeled. Objects are 131,067 Images, text 2014 J.
Database recognition scene
segmented.
database. recognition
Labeled object
image database,
Labeled objects, Object
used in the
bounding boxes, recognition, [59][60][61]
ImageNet ImageNet Large 14,197,122 Images, text 2009 (2014) J.
descriptive words, SIFT scene
Scale Visual
features recognition
Recognition
Challenge
A Large set of
images listed as
having CC BY 2.0 2017
license with image- Classification,
Image-level labels, [62]
Open Images level labels and 9,178,275 Images, text Object
bounding boxes
Bounding boxes
recognition (V7 : 2022)
spanning
thousands of
classes.
TV News
Channel TV commercials Audio and video features
Clustering, [63][64]
Commercial and news extracted from still 129,685 Text 2015 P.
classification
Detection broadcasts. images.
Dataset
The instances were

drawn randomly
from a database of
Statlog
7 outdoor images
(Image Many features [65] Un
and hand- 2310 Text Classification 1990
Segmentation) calculated. Ma
segmented to
Dataset
create a
classification for
every pixel.
Classification,
Detailed object outlines [66][67]
Caltech 101 Pictures of objects. 9146 Images object 2003 F.
marked.
recognition.
Large dataset of Classification,

Images categorized and [68][69]
Caltech-256 images for object 30,607 Images, Text object 2007 G.
hand-sorted.
classification. detection
10 billion pairs of alt-text

Classification,
Image-Text Pair and image sources in [70]
COYO-700M 746,972,269 Images, Text Image- 2022
Dataset HTML documents in
Language
CommonCrawl
SIFT features of Classification,
SIFT10M Extensive SIFT feature [71]
Caltech-256 11,164,866 Text object 2016 X.
Dataset extraction.
dataset. detection
MI
Classification, Sc
Annotated pictures [72]
LabelMe Objects outlined. 187,240 Images, text object 2005 Art
of scenes.
detection Int
La
Stereo video
sequences
recorded in street Classification,
Cityscapes Pixel-level segmentation [73] Da
scenes, with pixel- 25,000 Images, text object 2016
Dataset and labeling al.
level annotations. detection
Metadata also
included.
Large number of
Classification,
PASCAL VOC images for Labeling, bounding box [74][75] M.
500,000 Images, text object 2010
Dataset classification included et
detection
tasks.
Many small, low-

Classes labelled,
CIFAR-10 resolution, images [60][76] A.
training set splits 60,000 Images Classification 2009
Dataset of 10 classes of et
created.
objects.
Like CIFAR-10,
Classes labelled,
CIFAR-100 above, but 100 [60][76] A.
training set splits 60,000 Images Classification 2009
Dataset classes of objects et
created.
are given.
A unified Lu
contribution of Ell
CIFAR-10 and Classes labelled, Cro
CINIC-10 [77]
Imagenet with 10 training, validation, test 270,000 Images Classification 2018 An
Dataset
classes, and 3 set splits created. An
splits. Larger than Am
CIFAR-10. Sto
A MNIST-like Classes labelled,
Fashion- [78]
fashion product training set splits 60,000 Images Classification 2017 Za
MNIST
database created.
Some publicly
available fonts and
extracted glyphs
from them to make Classes labelled,
[79] Ya
notMNIST a dataset similar to training set splits 500,000 Images Classification 2011
Bu
MNIST. There are created.
10 classes, with
letters A-J taken
from different fonts.
Images from
vehicles of traffic
German signs on German
Traffic Sign roads. These signs
Detection comply with UN Signs manually labeled 900 Images Classification 2013 [80][81] S
Benchmark standards and
Dataset therefore are the
same as in other
countries.
Autonomous
vehicles driving
through a mid-size
KITTI Vision Classification,
city captured Many benchmarks >100 GB of [82][83][84]
Benchmark Images, text object 2012 AG
images of various extracted from data. data
Dataset detection
areas using
cameras and laser
scanners.
Classes labelled,
Linnaeus 5 Images of 5 [85] Ch
training set splits 8000 Images Classification 2017
dataset classes of objects. Ka
created.
Multi-modal dataset
for obstacle
detection in
agriculture
Classification,
including stereo
object
camera, thermal Classes labelled >400 GB of Images and 3D point [86]
FieldSAFE detection, 2017 M.
camera, web geographically. data clouds
object
camera, 360-
localization
degree camera,
lidar, radar, and
precise
localization.
11,076 hand
images (1600 x
1200 pixels) of 190
Gender
subjects, of varying
11,076 hand Images and (.mat, .txt, and recognition [87]
11K Hands ages between 18 – None 2017 M
images .csv) label files and biometric
75 years old, for
identification
gender recognition
and biometric
identification.
Specifically
designed for
Continuous/Lifelong
Learning and images (.png or .pkl)
Classes labelled,
Object Recognition,
training set splits 164,866 Classification,
is a collection of and (.pkl, .txt, .tsv) [88] V.
CORe50 created based on a 3- RBG-D Object 2017
more than 500 an
way, multi-runs images label files recognition
videos (30fps) of
benchmark.
50 domestic
objects belonging
to 10 different
categories.
OpenLORIS-Object | Lifelong/Continual Robotic Vision dataset (OpenLORIS-Object) collected by real robots mounted with multiple high-resolution sensors, includes a collection of 121 object instances (1st version of dataset, 40 categories daily necessities objects under 20 scenes). The dataset has rigorously considered 4 environment factors under different scenes, including illumination, occlusion, object pixel size and clutter, and defines the difficulty levels of each factor explicitly. | Classes labelled, training/validation/testing set splits created by benchmark scripts. | 1,106,424 RGB-D images | images (.png and .pkl) and (.pkl) label files | Classification, Lifelong object recognition, Robotic Vision | 2019 | [89] | Q.
This multispectral More than

data set includes 20 videos.
terahertz, thermal, The
3D lookup tables are Experiments Ale
THz and visual, near duration of
provided that allow you with hidden [90][91] Mo
thermal video infrared, and three- each video AP2J 2019
to project images onto object Olg
data set dimensional videos is about 85
3D point clouds. detection Su
of objects hidden seconds
under people's (about 345
clothes. frames).
Labeled part
contains
15560
samples
with
Daimler pedestrians
It is a dataset of Object
Monocular and 6744
pedestrians in Pedestrians are box- recognition [92][93][94]
Pedestrian samples Images 2006 Da
urban wise labeled. and
Detection without.
environments. classification
dataset Test set
contains
21790
images
without
labels.
The Cambridge-
Ga
driving Labeled Object
The dataset is labeled Bro
Video Database over 700 recognition [95][96][97]
CamVid with semantic labels for Images 2008 Sh
(CamVid) is a images and
32 semantic classes. Fa
collection of classification
Ro
videos.
Oli
Ma
RailSem19 is a Object
Mu
dataset for recognition
The dataset is labeled Ma
understanding and [98][99]
RailSem19 semanticly and box- 8500 Images 2019 Ze
scenes for vision classification,
wise. Da
systems on scene
Ste
railways. recognition
Sa
Cs
Ke
Bu
BOREAS is a
J.
multi-season
Yu
autonomous driving
An
dataset. It includes
Object Ha
data from includes
recognition Sh
a Velodyne Alpha-
The data is annotated by 350 km of Images, Lidar and Radar and [100][101] Jin
BOREAS Prime (128-beam) 2023
3D bounding boxes. driving data data classification, We
lidar, a FLIR
scene Ts
Blackfly S camera,
recognition La
a Navtech CIR304-
Y.K
H radar, and an
An
Applanix POS LV
Sc
GNSS-INS.
Tim
Ba
5000
images for
The labeling includes training and Ka
Bosch Small
It is a dataset of bounding boxes of traffic a video Traffic light [102][103] Be
Traffic Lights Images 2017
traffic lights. lights together with their sequence of recognition No
Dataset
state (active light). 8334 Bo
frames for
evaluation
Je
Nic
The labeling includes Ré
It is a dataset of bounding boxes of Railway Ra
more than [104][105]
FRSign French railway railway signals together Images signal 2020 Ch
100000
signals. with their state (active recognition Gr
light). Ro
Po
Ha
The labeling includes
Ph
It is a dataset of bounding boxes of Railway
[106][107] Fa
GERALD German railway railway signals together 5000 Images signal 2023
Ch
signals. with their state (active recognition
Sc
light).
Multi-cue Multi-cue onboard The dataset is labeled 1092 image Images Object 2009 [108] Ch
pedestrian pedestrian box-wise. pairs with recognition Wo
detection dataset is 1776 boxes and Wa
a dataset for for classification Sc
pedestrians
detection of
pedestrians.
Tu
RAWPED is a Bu
Object
dataset for Be
The dataset is labeled recognition [109][110]
RAWPED detection of 26000 Images 2020 Bu
box-wise. and
pedestrians in the Cu
classification
context of railways. Gu
Alp
OSDaR23 is a
DZ
multi-sensory Object
Sc
dataset for The dataset is labeled 16874 Images, Lidar, Radar and recognition [111][112]
OSDaR23 2023 De
detection of objects box-wise. frames Infrared and
an
in the context of classification
Fu
railways.
Arg
Argoverse is a Object
Ca
multi-sensory recognition
Me
dataset for The dataset is annotated 320 hours Data from 7 cameras and and [113][114]
Argoverse 2022 Un
detection of objects box-wise. of recording LiDAR classification,
Ge
in the context of object
Ins
roads. tracking
Te
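Several entries above (for example BSDS500 and CINIC-10) are distributed with disjoint training, validation and test subsets. When a dataset comes without predefined splits, they can be produced along the following lines. This is a minimal sketch rather than any dataset's official procedure: the 80/10/10 proportions, the fixed seed, and the `img_0000`-style identifiers are assumptions chosen for illustration.

```python
import random


def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle items and cut them into disjoint train/validation/test lists."""
    if train_pct + val_pct > 100:
        raise ValueError("train_pct + val_pct must not exceed 100")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )


# Hypothetical identifiers standing in for 500 image files.
image_ids = [f"img_{i:04d}" for i in range(500)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting by identifier after a single shuffle guarantees that no item appears in more than one subset. Datasets with several images per subject or per scene usually need the split to be made per subject or scene instead, to avoid leakage between subsets.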
Handwriting and character recognition
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference
Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [115]
Letter Dataset | Upper-case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [116][117]
CASIA-HWDB | Offline handwritten Chinese character database. 3755 classes in the GB 2312 character set. | Gray-scaled images with background pixels labeled as 255. | 1,172,907 | Images, Text | Handwriting recognition, classification | 2009 | [118]
CASIA-OLHWDB | Online handwritten Chinese character database, collected using Anoto pen on paper. 3755 classes in the GB 2312 character set. | Provides the sequences of coordinates of strokes. | 1,174,364 | Images, Text | Handwriting recognition, classification | 2009 | [119][118]
Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [120][121]
Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [122]
EMNIST dataset | Handwritten characters from 3600 contributors | Derived from NIST Special Database 19. Converted to 28x28 pixel images, matching the MNIST dataset.[123] | 800,000 | Images | character recognition, classification, handwriting recognition | 2016 | EMNIST dataset[124] Documentation[125]
UJI Pen Characters Dataset | Isolated handwritten characters | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [126][127]
Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [128]
Omniglot dataset | 1623 different handwritten characters from 50 different alphabets. | Hand-labeled. | 38,300 | Images, text, strokes | Classification, one-shot learning | 2015 | [129][130]
MNIST database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1994 | [131][132]
Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [133]
Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [134][135]
Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [136]
HASYv2 | Handwritten mathematical symbols | All symbols are centered and of size 32px x 32px. | 168233 | Images, text | Classification | 2017 | [137]
Noisy Handwritten Bangla Dataset | Includes Handwritten Numeral Dataset (10 classes) and Basic Character Dataset (50 classes), each dataset has three types of noise: white gaussian, motion blur, and reduced contrast. | All images are centered and of size 32x32. | Numeral Dataset: 23330, Character Dataset: 76000 | Images, text | Handwriting recognition, classification | 2017 | [138][139]
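The Noisy Handwritten Bangla entry above mentions white Gaussian noise and reduced contrast as degradations. To make those two terms concrete, the following Python sketch applies them to a synthetic 32x32 grayscale image. It is an illustration under stated assumptions, not the procedure used by the dataset's authors: the noise level `sigma=20.0`, the contrast `factor=0.5`, and the test image are invented here, and motion blur is left out for brevity.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to the 0..255 range."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def reduce_contrast(image, factor):
    """Pull every pixel towards the mean intensity; factor=1 changes nothing."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[round(mean + factor * (p - mean)) for p in row] for row in image]


# Synthetic 32x32 grayscale image: a bright square on a dark background.
SIZE = 32
clean = [
    [255 if 8 <= x < 24 and 8 <= y < 24 else 0 for x in range(SIZE)]
    for y in range(SIZE)
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
noisy = add_gaussian_noise(clean, sigma=20.0, rng=rng)
low_contrast = reduce_contrast(clean, factor=0.5)

flat = [p for row in low_contrast for p in row]
print(min(flat), max(flat))
```

Images are represented as nested lists to keep the sketch dependency-free; an array library would be the usual choice for real image data.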
Aerial images
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
iSAID: Instance Segmentation in Aerial Images Dataset | Precise instance-level annotation carried out by professional annotators, cross-checked and validated by expert annotators complying with well-defined guidelines. | | 655,451 (15 classes) | Images, jpg, json | Aerial Classification, Object Detection, Instance Segmentation | 2019 | [140][141] | Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, Xiang Bai
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial Classification, object detection | 2013 | [142][143] | J. Yuan et al.
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [144][145] | M. Butenuth et al.
Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [146][147] | B. Johnson
MASATI dataset | Maritime scenes of optical aerial images from the visible spectrum. It contains color images in dynamic marine environments, each image may contain one or multiple targets in different weather and illumination conditions. | Object bounding boxes and labeling. | 7389 | Images | Classification, aerial object detection | 2018 | [148][149] | A.-J. Gallego et al.
Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [150][151] | B. Johnson
Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [152][153] | F. Tanner et al.
SpaceNet | SpaceNet is a corpus of commercial satellite imagery and labeled training data. | GeoTiff and GeoJSON files containing building footprints. | >17533 | Images | Classification, Object Identification | 2017 | [154][155][156] | DigitalGlobe, Inc.
UC Merced Land Use Dataset | These images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the US. | This is a 21 class land use image dataset meant for research purposes. There are 100 images for each class. | 2,100 | Image chips of 256x256, 30 cm (1 foot) GSD | Land cover classification | 2010 | [157] | Yi Yang and Shawn Newsam
SAT-4 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-4 has four broad land cover classes, includes barren land, trees, grassland and a class that consists of all land cover classes other than the above three. | 500,000 | Images | Classification | 2015 | [158][159] | S. Basu et al.
SAT-6 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-6 has six broad land cover classes, includes barren land, trees, grassland, roads, buildings and water bodies. | 405,000 | Images | Classification | 2015 | [158][159] | S. Basu et al.
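The figures given for the UC Merced Land Use Dataset above can be related with a little arithmetic. Ground sample distance (GSD) is the ground distance covered by one pixel, so a 256x256 chip at 30 cm GSD spans 256 × 30 cm per side, and 21 classes of 100 images each give the stated 2,100 instances. The short Python sketch below works this out; the function name `chip_footprint` is made up for this example.

```python
def chip_footprint(pixels_per_side, gsd_cm):
    """Return (side length in metres, area in square metres) of a square chip."""
    side_m = pixels_per_side * gsd_cm / 100  # centimetres to metres
    return side_m, side_m ** 2


# Values taken from the UC Merced Land Use Dataset row.
side_m, area_m2 = chip_footprint(pixels_per_side=256, gsd_cm=30)
total_images = 21 * 100  # 21 classes, 100 images per class

print(f"chip side: {side_m} m, area: {area_m2:.2f} m^2, images: {total_images}")
```

Each chip therefore covers a ground patch a little under 77 m on a side, which gives a sense of the scale of the land-use features the classifier has to recognise.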
Underwater images
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
SUIM Dataset | The images have been rigorously collected during oceanic explorations and human-robot collaborative experiments, and annotated by human participants. | Images with pixel annotations for eight object categories: fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and sea-floor. | 1,635 | Images | Segmentation | 2020 | [160] | Md Jahidul Islam et al.
LIACI Dataset | Images have been collected during underwater ship inspections and annotated by human domain experts. | Images with pixel annotations for ten object categories: defects, corrosion, paint peel, marine growth, sea chest gratings, overboard valves, propeller, anodes, bilge keel and ship hull. | 1,893 | Images | Segmentation | 2022 | [161] | Waszak et al.
Other images
Created
(updated)
A. Ebadi, P.
A novel benchmark gas Image, [162][163]
NRC-GAMMA None 28,883 Classification 2021 Paul, S. Auer, &
meter image dataset Label
S. Tremblay
The Images of scanned None 4908 TIFF/pdf Source device 2020 [164] C. Ben Rabah
SUPATLANTIQUE official and Wikipedia identification, et al.
dataset documents forgery detection,
Classification,..
Raw data (in HDF5

Density functional 60744 test
Labelled images of raw format) and output labels
theory quantum and 501473 Labeled [165] K. Mills & I.
input to a simulation of from density functional Regression 2019
simulations of training images Tamblyn
graphene theory quantum
graphene files
simulation
Quantum simulations
Labelled images of raw Raw data (in HDF5 K. Mills, M.A.
of an electron in a two 1.3 million Labeled [166]
input to a simulation of format) and output labels Regression 2017 Spanner, & I.
dimensional potential images images
2d Quantum mechanics from quantum simulation Tamblyn
well
Activity paths and

Labeled
Videos and images of directions, labels, fine-
MPII Cooking 881,755 video, [167][168] M. Rohrbach et
various cooking grained motion labeling, Classification 2012
Activities Dataset frames images, al.
activities. activity class, still image
text
extraction and labeling.
FAMOS Dataset | 5,000 unique microstructures, all samples have been acquired 3 times with two different cameras. | Original PNG files, sorted per camera and then per acquisition. MATLAB datafiles with one 16384 times 5000 matrix per camera per acquisition. | 30,000 | Images and .mat files | Authentication | 2012 | [169] | S. Voloshynovskiy, et al.
PharmaPack Dataset | 1,000 unique classes with 54 images per class. | Class labeling, many local descriptors, like SIFT and aKaZE, and local feature aggregators, like Fisher Vector (FV). | 54,000 | Images and .mat files | Fine-grain classification | 2017 | [170] | O. Taran and S. Rezaeifar, et al.
Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [171][172] | A. Khosla et al.
StanfordExtra Dataset | 2D keypoints and segmentations for the Stanford Dogs Dataset. | 2D keypoints and segmentations provided. | 12,035 | Labelled images | 3D reconstruction/pose estimation | 2020 | [173] | B. Biggs et al.
The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [172][174] | O. Parkhi et al.
Corel Image Features Data Set | Database of images with features extracted. | Many features including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [175][176] | M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various different videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [177] | T. Deneke et al.
Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [178] | Microsoft Research
Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [179][180] | C. Wah et al.
YouTube-8M | Large and diverse labeled video dataset | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [181][182] | S. Abu-El-Haija et al.
YFCC100M | Large and diverse labeled image and video dataset | Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, Image, Text | Video and Image classification | 2016 | [183][184] | B. Thomee et al.
Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [185] | Y. Baveye et al.
Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [186] | Y. Baveye et al.
MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [187] | Y. Baveye et al.
Leeds Sports Pose | Articulated human pose annotations in 2000 natural sports images from Flickr. | Rough crop around single person of interest with 14 joint labels | 2000 | Images plus .mat file labels | Human pose estimation | 2010 | [188] | S. Johnson and M. Everingham
Leeds Sports Pose Extended Training | Articulated human pose annotations in 10,000 natural sports images from Flickr. | 14 joint labels via crowdsourcing | 10000 | Images plus .mat file labels | Human pose estimation | 2011 | [189] | S. Johnson and M. Everingham
MCQ Dataset | 6 different real multiple choice-based exams (735 answer sheets and 33,540 answer boxes) to evaluate computer vision techniques and systems developed for multiple choice test assessment systems. | None | 735 answer sheets and 33,540 answer boxes | Images and .mat file labels | Development of multiple choice test assessment systems | 2017 | [190][191] | Afifi, M. et al.
Surveillance Videos | Real surveillance videos cover a large surveillance time (7 days with 24 hours each). | None | 19 surveillance videos (7 days with 24 hours each). | Videos | Data compression | 2016 | [192] | Taj-Eddin, I. A. T. F. et al.
LILA BC | Labeled Information Library of Alexandria: Biology and Conservation. Labeled images that support machine learning research around ecology and environmental science. | None | ~10M images | Images | Classification | 2019 | [193] | LILA working group
Can We See Photosynthesis? | 32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions. | None | 32 videos | Videos | Liveness detection of plants | 2017 | [194] | Taj-Eddin, I. A. T. F. et al.
Mathematical Mathematics Memes | Collection of 10,000 memes on mathematics. | None | ~10,000 | Images | Visual storytelling, object detection. | 2021 | [195] | Mathematical Mathematics Memes
Flickr-Faces-HQ Dataset | Collection of images containing a face each, crawled from Flickr | Pruned with "various automatic filters", cropped and aligned to faces, and had images of statues, paintings, or photos of photos removed via crowdsourcing | 70,000 | Images | Face Generation | 2019 | [196] | Karras et al.
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Amazon reviews | US product reviews from Amazon.com. | None. | 233.1 million | Text | Classification, sentiment analysis | 2015 (2018) | [197][198] | McAuley et al.
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [199][200] | K. Ganesan et al.
MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [201] | GroupLens Research
Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [202][203] | Yahoo!
Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [204][205] | M. Bohanec
YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [206][207] | Google
Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [208] | Q. Nguyen
Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [209][210] | W. Loh et al.
Vietnamese Students’ Feedback Corpus (UIT-VSFC) | Students’ Feedback. | Comments | 16,000 | Text | Classification | 1997 | [211] | Nguyen et al.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC) | Users’ Facebook Comments. | Comments | 6,927 | Text | Classification | 1997 | [212] | Nguyen et al.
Vietnamese Open-domain Complaint Detection dataset (ViOCD) | Customer product reviews | Comments | 5,485 | Text | Classification | 2021 | [213] | Nguyen et al.
ViHOS: Hate Speech Spans Detection for Vietnamese | Social Media Texts | Comments | Containing 26k spans on 11k comments | Text | Span Detection | 2021 | [214] | Hoang et al.
News articles
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [215] | Dermouche, M. et al.
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [216] | Reuters
The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [217] | Reuters
Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [218] | T. Rose et al.
Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [219] | M. Alhagri
RE3D (Relationship and Entity Extraction Evaluation Dataset) | Entity and Relation marked data from various news and government sources. Sponsored by Dstl | Filtered, categorisation using Baleen types | not known | JSON | Classification, Entity and Relation recognition | 2017 | [220] | Dstl
Examiner Spam Clickbait Catalogue | Clickbait, spam, crowd-sourced headlines from 2010 to 2015 | Publish date and headlines | 3,089,781 | CSV | Clustering, Events, Sentiment | 2016 | [221] | R. Kulkarni
ABC Australia News Corpus | Entire news corpus of ABC Australia from 2003 to 2019 | Publish date and headlines | 1,186,018 | CSV | Clustering, Events, Sentiment | 2020 | [222] | R. Kulkarni
Worldwide News – Aggregate of 20K Feeds | One week snapshot of all online headlines in 20+ languages | Publish time, URL and headlines | 1,398,431 | CSV | Clustering, Events, Language Detection | 2018 | [223] | R. Kulkarni
Reuters News Wire Headline | 11 Years of timestamped events published on the news-wire | Publish time, Headline Text | 16,121,310 | CSV | NLP, Computational Linguistics, Events | 2018 | [224] | R. Kulkarni
The Irish Times Ireland News Corpus | 24 Years of Ireland News from 1996 to 2019 | Publish time, Headline Category and Text | 1,484,340 | CSV | NLP, Computational Linguistics, Events | 2020 | [225] | R. Kulkarni
News Headlines Dataset for Sarcasm Detection | High quality dataset with Sarcastic and Non-sarcastic news headlines. | Clean, normalized text | 26,709 | JSON | NLP, Classification, Linguistics | 2018 | [226] | Rishabh Misra
Messages
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [227][228] | Klimt, B. and Y. Yang
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. | 2,412 Ham, 481 Spam | Text | Classification | 2000 | [229][230] | Androutsopoulos, J. et al.
SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5,574 | Text | Classification | 2011 | [231][232] | T. Almeida et al.
Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [233] | T. Mitchell et al.
Spambase Dataset | Spam emails. | Many text features extracted. | 4,601 | Text | Spam detection, classification | 1999 | [234] | M. Hopkins et al.
Twitter and tweets
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
MovieTweetings | Movie rating dataset based on public and well-structured tweets | | ~710,000 | Text | Classification, regression | 2018 | [235] | S. Dooms
Twitter100k | Pairs of images and tweets | | 100,000 | Text and Images | Cross-media retrieval | 2017 | [236][237] | Y. Hu, et al.
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [238][239] | A. Go et al.
ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [240][241] | R. Zafarani et al.
SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [242][243] | J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [244][245] | N. Abdulla
Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, Classification | 2013 | [246][247] | F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled. | tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, Classification | 2015 | [248][249] | Xu et al.
Geoparse Twitter benchmark dataset | This dataset contains tweets during different news events in different countries. Manually labeled location mentions. | location annotations added to JSON metadata | 6,386 | Tweets, JSON | Classification, Information Extraction | 2014 | [250][251] | S.E. Middleton et al.
Dutch Social media collection | This dataset contains COVID-19 tweets made by Dutch speakers or users from Netherlands. The data has been machine labeled | classified for sentiment, tweet text & user description translated to English. Industry mention are extracted | 271,342 | JSONL | Sentiment, multi-label classification, machine translation | 2020 | [252][253][254] | Aaaksh Gupta, CoronaWhy
Dialogues
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NPS Chat Corpus | Posts from age-specific online chat rooms. | Hand privacy masked, tagged for part of speech and dialogue-act. | ~ 500,000 | XML | NLP, programming, linguistics | 2007 | [255] | Forsyth, E., Lin, J., & Martell, C.
Twitter Triple Corpus | A-B-A triples extracted from Twitter. | | 4,232 | Text | NLP | 2016 | [256] | Sordini, A. et al.
UseNet Corpus | UseNet forum postings. | Anonymized e-mails and URLs. Omitted documents with lengths under 500 words or over 500,000 words, or that were less than 90% English. | 7 billion | Text | | 2011 | [257] | Shaoul, C., & Westbury C.
NUS SMS Corpus | SMS messages collected between two users, with timing analysis. | | ~ 10,000 | XML | NLP | 2011 | [258] | KAN, M
Reddit All Comments Corpus | All Reddit comments (as of 2015). | | ~ 1.7 billion | JSON | NLP, research | 2015 | [259] | Stuck_In_the_Matrix
Ubuntu Dialogue Corpus | Dialogues extracted from Ubuntu chat stream on IRC. | | 930 thousand dialogues, 7.1 million utterances | CSV | Dialogue Systems Research | 2015 | [260] | Lowe, R. et al.
Dialog State Tracking Challenge | The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. | Transcription of spoken dialogs with labelling | DSTC2 contains ~3.2k calls; DSTC3 contains ~2.3k calls | Json | Dialogue state tracking | 2014 | [261] | Henderson, Matthew and Thomson, Blaise and Williams, Jason D
Legal
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
FreeLaw | Filtered data from Court Listener, part of the FreeLaw project. | Cleaned and normalized text | 4,940,710 | Json | NLP, linguistics | 2020 | [262] | T. Hoppe
Pile of Law | Corpus of legal and administrative data | Cleaned, normalized, and privatized | ~50,000,000 | Json | NLP, linguistics, sentiment | 2022 | [263][264] | L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project | All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States. | Cleaned and normalized text | ~10,000 | Json | NLP, linguistics | 2022 | [265] | A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.
Other text
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Web of Science Dataset | Hierarchical Datasets for Text Classification | None. | 46,985 | Text | Classification, Categorization | 2017 | [266][267] | K. Kowsari et al.
Legal Case Reports | Federal Court of Australia cases from 2006 to 2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [268][269] | F. Galgani et al.
Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [270][271] | J. Schler et al.
Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [272][273] | A. Traud et al.
Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [274][275] | M. Richardson et al.
The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [276][277] | M. Marcus et al.
DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [278] | Reuters
Google Books N-grams | N-grams from a very large corpus of books | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [279][280] | Google
Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [281][282] | K. Luyckx et al.
PushShift | Archives of social media websites, including Reddit, Twitter, and Hackernews. | Text extracted and normalized from WARCs | ~100,000,000 posts | Json | NLP, sentiment, linguistics | 2022 | [283][284] | J. Baumgartner
CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [285][286] | P. Ciarelli et al.
Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [287][288] | D. Kotzias
BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [289][290] | K. Buza
Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [291] | S. Bowman et al.
DSL Corpus Collection (DSLCC) | A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. | None | 294,000 phrases | Text | Discriminating between similar languages | 2017 | [292] | Tan, Liling et al.
Urban Dictionary Dataset | Corpus of words, votes and definitions | User names anonymised | 2,580,925 | CSV | NLP, Machine comprehension | 2016 May | [293] | Anonymous
T-REx | Wikipedia abstracts aligned with Wikidata entities | Alignment of Wikidata triples with Wikipedia abstracts | 11M aligned triples | JSON and NIF [2] (https://hadyelsahar.github.io/t-rex/) | NLP, Relation Extraction | 2018 | [294] | H. Elsahar et al.
General Language Understanding Evaluation (GLUE) | Benchmark of nine tasks | Various | ~1M sentences and sentence pairs | | NLU | 2018 | [295][296][297] | Wang et al.
Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK)) | Dataset of legal contracts with rich expert annotations | | ~13,000 labels | CSV and PDF | Natural language processing, QnA | 2021 | | The Atticus Project (https://www.atticusprojectai.org/cuad)
Vietnamese Image Captioning Dataset (UIT-ViIC) | Vietnamese Image Captioning Dataset | | 19,250 captions for 3,850 images | CSV and PDF | Natural language processing, Computer vision | 2020 | [298] | Lam et al.
Vietnamese Names annotated with Genders (UIT-ViNames) | Vietnamese Names annotated with Genders | | 26,850 Vietnamese full names annotated with genders | CSV | Natural language processing | 2020 | [299] | To et al.
Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD) | Vietnamese Constructive and Toxic Speech Detection Dataset | | 10,000 Vietnamese users' comments on online newspapers on 10 domains | CSV | Natural Language Processing | 2021 | [300] | Nguyen et al.
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
Speech
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Zero Resource Speech Challenge 2015 | Spontaneous speech (English), Read speech (Xitsonga). | None, raw WAV files. | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | WAV (audio only) | Unsupervised discovery of speech features/subword units/word units | 2015 | [301][302] | Versteegh et al.
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale. | 1,040 | Text | Classification, regression | 2013 | [303][304] | B. E. Sakar et al.
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [305][306] | M. Bedda et al.
ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [307][308] | R. Cole et al.
Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [309][310] | M. Kudo et al.
Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [311][312] | A. Tsanas et al.
TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification. | 1986 | [313][314] | J. Garofolo et al.
Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. | 2016 | [315] | N. Halabi
Common Voice | A public domain database of crowdsourced data across a wide range of dialects. | Validation by other users. | English: 1,118 hours | MP3 with corresponding text files | Speech recognition | 2017 June (2019 December) | [316] | Mozilla
LJSpeech | A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks. | Quality check, normalized transcription alongside the original. | 13,100 | CSV, WAV | Speech synthesis | 2017 | [317] | Keith Ito, Linda Johnson
Music
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Geographic Origin of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographic classification, clustering | 2014 | [318][319] | F. Zhou et al.
Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [320][321] | T. Bertin-Mahieux et al.
MUSDB18 | Multi-track popular music recordings | Raw audio | 150 | MP4, WAV | Source Separation | 2017 | [322] | Z. Rafii et al.
Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [323] | M. Defferrard et al.
Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [324][325] | D. Radicioni et al.
Other sounds
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | [326][327] | J. Salamon et al.
AudioSet | 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. | 128-d PCA'd VGG-ish features every 1 second. | 2,084,320 | Text (CSV) and TensorFlow Record files | Classification | 2017 | [328] | J. Gemmeke et al., Google
Bird Audio Detection challenge | Audio from environmental monitoring stations, plus crowdsourced recordings | | 17,000+ | | Classification | 2016 (2018) | [329][330] | Queen Mary University and IEEE Signal Processing Society
WSJ0 Hipster Ambient Mixtures | Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area | Noise clips matched to WSJ0 clips | 28,000 | Sound (WAV) | Audio source separation | 2019 | [331] | Wichern, G., et al., Whisper and MERL
Clotho | 4,981 audio samples of 15 to 30 seconds long, each audio sample having five different captions of eight to 20 words long. | | 24,905 | Sound (WAV) and text (CSV) | Automated audio captioning | 2020 | [332][333] | K. Drossos, S. Lipping, and T. Virtanen
Signal data
Datasets containing electric signal information requiring some sort of signal processing for further analysis.
Electrical
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [334][335] | Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [336][337] | M. Kachuee et al.
Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [338][339] | A. Vergara
Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [340][341] | K. Ullrich
UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [342][343] | D. Rambla et al.
Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [344][345] | M. Bator
Motion-tracking
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [346][347] | Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [348][349] | R. Madeo et al.
Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [350][351] | T. Theodoridis
Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [352][353] | B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [354][355] | J. Reyes-Ortiz et al.
Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [356][357] | M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [358][359] | W. Ugulino et al.
sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [360][361] | C. Sapsanis et al.
REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [361][362] | O. Banos et al.
Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [363][364] | A. Stisen et al.
Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [365][366] | D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [367] | A. Reiss
OPPORTUNITY Activity Recognition Dataset | Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [368][369] | D. Roggen et al.
Real World Activity Recognition Dataset | Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [370] | T. Sztyler et al.
Toronto Rehab Stroke Pose Dataset | 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. | None. | 10 healthy persons and 9 stroke survivors (3500–6000 frames per person) | CSV | Classification | 2017 | [371][372][373] | E. Dolatabadi et al.
Corpus of Social Touch (CoST) | 7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm. | Touch gestures performed are segmented and labeled. | 7805 gesture captures | CSV | Classification | 2016 | [374][375] | M. Jung et al.
Other signals
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given | 178 | Text | Classification, regression | 1991 | [376][377] | M. Forina et al.
Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None | 9568 | Text | Regression | 2014 | [378][379] | P. Tufekci et al.
Physical data
Datasets from physical systems.
High-energy physics
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [380][381][382] | D. Whiteson
HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [381][382][383] | D. Whiteson
Systems
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [384][385] | R. Lopez
Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [386] | L. Seabra et al.
Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [387][388] | Y. Reich et al.
Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [389][390] | J. Schimmer et al.
Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [391] | Carnegie Mellon University
Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [392][393] | A. Xifara et al.
Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [394] | R. Lopez
Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [395][396] | D. Draper et al.
Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [397] | NASA
Astronomy
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [398][399] | M. Burl
MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [399][400] | R. Bock
Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [401] | G. Bradshaw
CAMELS Multifield Dataset | 2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range in the value of the cosmological and astrophysical parameters | Each map and grid has 6 cosmological and astrophysical parameters associated with it | 405,000 2D maps and 405,000 3D grids | 2D maps and 3D grids | Regression | 2021 | [402] | Francisco Villaescusa-Navarro et al.
Earth science
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [403] | E. Venzke et al.
Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [404][405] | M. Sikora et al.
CAMELS-US | Catchment hydrology dataset with hydrometeorological timeseries and various attributes | see Reference | 671 | CSV, Text, Shapefile | Regression | 2017 | [406][407] | N. Addor et al. / A. Newman et al.
CAMELS-Chile | Catchment hydrology dataset with hydrometeorological timeseries and various attributes | see Reference | 516 | CSV, Text, Shapefile | Regression | 2018 | [408] | C. Alvarez-Garreton et al.
CAMELS-Brazil | Catchment hydrology dataset with hydrometeorological attributes | see Reference | 897 | CSV, Text | Regression | 2020 | [409] | V. Chagas et al.
CAMELS-GB | Catchment hydrology dataset with hydrometeorological attributes | see Reference | 671 | CSV, Text | Regression | 2020 | [410] | G. Coxon et al.
CAMELS-Australia | Catchment hydrology dataset with hydrometeorological attributes | see Reference | 222 | CSV, Text | Regression | 2021 | [411] | K. Fowler et al.
LamaH-CE | Catchment hydrology dataset with hydrometeorological attributes | see Reference | 859 | CSV, Text | Regression | 2021 | [412] | C. Klingler et al.
Other physical
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [413][414] | I. Yeh
Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [415][416] | I. Yeh
Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [417] | Arris Pharmaceutical Corp.
Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [418] | Semeion Research Center
Biological data
Datasets from biological systems.
Human
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Age Dataset | A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain. | A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project. | 1,223,009 | Text | Regression, Classification | 2022 | Paper[419] Dataset[420] | Amoradnejad et al.
Synthetic Fundus Dataset[421] | Photorealistic retinal images and vessel segmentations. Public domain. | 2500 images with 1500*1152 pixels useful for segmentation and classification of veins and arteries on a single background. | 2500 | Images | Classification, Segmentation | 2020 | [422] | C. Valenti et al.
EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [423] | H. Begleiter
P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [424][425] | U. Hoffman et al.
Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient with some missing values. | 303 | Text | Classification | 1988 | [426][427] | A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [428][429] | W. Wolberg et al.
National Survey on Drug Use and Health | Large scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [430] | United States Department of Health and Human Services
Lung Cancer Dataset | Lung cancer dataset without attribute definitions | 56 features are given for each case | 32 | Text | Classification | 1992 | [431][432] | Z. Hong et al.
Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [433][434] | H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [435][436] | J. Clore et al.
Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [437][438] | B. Antal et al.
Diabetic Retinopathy Messidor Dataset | Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR) | Features retinopathy grade and risk of macular edema | 1200 | Images, Text | Classification, Segmentation | 2008 | [439][440] | Messidor Project
Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [441][442] | Bupa Medical Research Ltd.
Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [443][444] | R. Quinlan
Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | | | | | [445][446] | A. Tanrikulu et al.
Parkinson's Vision-Based Pose Estimation Dataset | 2D human pose estimates of Parkinson's patients performing a variety of tasks. | Camera shake has been removed from trajectories. | 134 | Text | Classification, regression | 2017 | [447][448][449] | M. Li et al.
KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [450] | M. Naeem et al.
Modified Human Sperm Morphology Analysis Dataset (MHSMA) | Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail. | Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created. | 1,540 | .npy files | Classification | 2019 | [451][452] | S. Javadi and S.A. Mirroshandel
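The EEG Database entry above quotes a 256 Hz sampling rate, a 3.9 ms epoch, 64 electrodes, and a 1-second recording. As a minimal sketch of how those figures relate, the following Python snippet redoes the arithmetic; the variable names are illustrative and are not part of the dataset.

```python
# Figures quoted in the EEG Database row; the names are illustrative only.
SAMPLING_RATE_HZ = 256
N_ELECTRODES = 64
TRIAL_SECONDS = 1

# Time between consecutive samples, in milliseconds.
sample_interval_ms = 1000 / SAMPLING_RATE_HZ
# Samples recorded per electrode during one trial.
samples_per_channel = SAMPLING_RATE_HZ * TRIAL_SECONDS
# Total measurements in one trial across all electrodes.
values_per_trial = samples_per_channel * N_ELECTRODES

print(f"{sample_interval_ms:.1f} ms between samples")
print(f"{samples_per_channel} samples per electrode, {values_per_trial} values per trial")
```

Under these figures, one trial is a 64 x 256 array of measurements, which is the shape a loader would presumably need to expect.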
Animal
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Abalone Dataset | Physical measurements of Abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [453] | Marine Research Laboratories – Taroona
Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [454] | R. Forsyth
Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [455] | E. Armengol et al.
Farm animals data | PLF data inventory (cows, pigs; location, acceleration, etc.). | Labeled datasets. | List is constantly updated | Text | Classification | 2020 | [456] | V. Bloch
Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [432] | G. Towell et al.
Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, Clustering | 2015 | [457][458] | C. Higuera et al.
Fungi
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
UCI Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [459] | J. Schlimmer
Secondary Mushroom Dataset | Mushroom attributes and classification | Simulated data from larger and more realistic primary mushroom entries. Fully reproducible. | 61069 | Text | Classification | 2020 | [460][461] | D. Wagner et al.
Plant
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [462][463] | P. Cortez et al.
Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [464][465] | R. Fisher
Plant Species Leaves Dataset | Sixteen samples of leaf each of one-hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [466][467] | J. Cope et al.
Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [468] | R. Michalski et al.
Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [469][470] | Charytanowicz et al.
Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [471][472] | J. Blackard et al.
Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine set of rules that governs the network. | None. | 300 | Text | Causal-discovery | 2008 | [473] | J. Jenkens et al.
Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [474][475] | T. Munisami et al.
Oxford Flower Dataset | 17 category dataset of flowers. | Train/test splits, labeled images, | 1360 | Images, text | Classification | 2006 | [174][476] | M-E Nilsback et al.
Plant Seedlings Dataset | 12 category dataset of plant seedlings. | Labelled images, segmented images, | 5544 | Images | Classification, detection | 2017 | [477] | Giselsson et al.
Fruits 360 dataset | Database with images of 120 fruits and vegetables. | 100x100 pixels, White background. | 82213 | Images (jpg) | Classification | 2017–2019 | [478][479] | Mihai Oltean, Horea Muresan
Microbe
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Ecoli Dataset | Protein localization sites. | Various features of the protein localizations sites are given. | 336 | Text | Classification | 1996 | [480][481] | K. Nakai et al.
MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [482][483] | P. Mahe et al.
Yeast Dataset | Predictions of Cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [484][485] | K. Nakai et al.
Drug discovery
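Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | [486] | A. Mayr et al.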
Anomaly data
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | None | 50+ files | Comma separated values | Anomaly detection | 2016 (continually updated) | [487] | Numenta
Skoltech Anomaly Benchmark (SKAB) | Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. | There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems | 30+ files (v0.9) | Comma separated values | Anomaly detection | 2020 (continually updated) | [488][489] | Iurii D. Katser and Vyacheslav O. Kozitsin
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study | Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature. | treated for missing values, numerical attributes only, different percentages of anomalies, labels | 1000+ files | ARFF | Anomaly detection | 2016 (possibly updated with new datasets and/or results) | [490] | Campos et al.
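The anomaly benchmarks above are described as ordered, timestamped, single-valued series stored as comma-separated values. The sketch below illustrates how such a file could be read and screened for point anomalies. It uses a generic modified z-score rule (median and median absolute deviation), which is an assumption made for illustration and is not the scoring method of NAB, SKAB, or Campos et al. The column names `timestamp` and `value` and the sample rows are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical sample in a "timestamp,value" layout; not taken from any benchmark.
SAMPLE_CSV = """timestamp,value
2020-01-01 00:00:00,10.1
2020-01-01 00:05:00,10.3
2020-01-01 00:10:00,9.9
2020-01-01 00:15:00,10.0
2020-01-01 00:20:00,25.7
2020-01-01 00:25:00,10.2
2020-01-01 00:30:00,9.8
"""


def flag_outliers(rows, threshold=3.5):
    """Return timestamps whose modified z-score exceeds the threshold."""
    values = [float(r["value"]) for r in rows]
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # No spread, so nothing can be ranked as unusual.
    return [
        r["timestamp"]
        for r, v in zip(rows, values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
outliers = flag_outliers(rows)
print(outliers)
```

A median-based rule is used here because a single large spike inflates the mean and standard deviation of a short series, which can hide the spike from a plain z-score test.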
Question answering data

This section includes datasets that deal with structured data.
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator
DBpedia Neural Question Answering (DBNQA) Dataset | A large collection of Question to SPARQL specially designed for Open Domain Neural Question Answering over DBpedia Knowledgebase. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | [491][492] | Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset (UIT-ViQuAD) | A large collection of Vietnamese questions for evaluating MRC models. | This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. | 23,074 | Question-answer pairs | Question Answering | 2020 | [493] | Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC) | A collection of Vietnamese multiple-choice questions for evaluating MRC models. | This corpus includes 2,783 Vietnamese multiple-choice questions. | 2,783 | Question-answer pairs | Question Answering/Machine Reading Comprehension | 2020 | [494] | Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question Rewriting | An end-to-end open-domain question answering. | This dataset includes 14,000 conversations with 81,000 question-answer pairs. | | Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source. Further details are provided in the project's GitHub repository (https://github.com/apple/ml-qrecc) and respective Hugging Face dataset card (https://huggingface.co/datasets/svakulenk0/qrecc). | Question Answering | 2021 | [495] | Anantha and Vakulenko et al.
UnifiedQA | Question-answer data | Processed dataset | | | Question Answering | 2020 | [496] | Khashabi et al.
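The conversational question-rewriting entry above lists its fields as Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, and Conversation_source. As a rough sketch of how one such record might be represented, the snippet below maps those field names onto a Python dataclass. The field types, the class name `RewriteTurn`, and the sample record are assumptions made for illustration; the GitHub repository and dataset card cited in the table remain the authoritative description.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RewriteTurn:
    # Field names come from the table row; the types are assumptions.
    Context: List[str]
    Question: str
    Rewrite: str
    Answer: str
    Answer_URL: str
    Conversation_no: int
    Turn_no: int
    Conversation_source: str


# Invented record for illustration only; not copied from the dataset.
SAMPLE = """
{"Context": ["What is a dataset?", "A collection of data."],
 "Question": "Who publishes them?",
 "Rewrite": "Who publishes datasets?",
 "Answer": "Many organizations, including governments.",
 "Answer_URL": "https://example.org/datasets",
 "Conversation_no": 1,
 "Turn_no": 2,
 "Conversation_source": "example"}
"""

turn = RewriteTurn(**json.loads(SAMPLE))
print(turn.Turn_no, turn.Rewrite)
```

The Rewrite field is the point of the task: it restates a context-dependent question ("them") as a self-contained one that can be answered without the earlier turns.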
Dialog or instruction prompted data

This section includes datasets that ...
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Taskmaster | "The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."[497] | Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports). Taskmaster-3: 23,757 movie ticketing dialogs. | | Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id. Taskmaster-3: conversation id, utterances, vertical, scenario, instructions. For further details check the project's GitHub repository (https://github.com/google-research-datasets/Taskmaster) or the Hugging Face dataset cards (taskmaster-1 (https://huggingface.co/datasets/taskmaster1), taskmaster-2 (https://huggingface.co/datasets/taskmaster2), taskmaster-3 (https://huggingface.co/datasets/taskmaster3)). | Dialog/Instruction prompted | 2019 | [498] | Byrne and Krishnamoorthi et al.
DrRepair | A labeled dataset for program repair. | Pre-processed data | | Check format details in the project's worksheet (https://worksheets.codalab.org/worksheets/0x01838644724a433c932bef4cb5c42fbd). | Dialog/Instruction prompted | 2020 | [499] | Michihiro et al.
Natural Instructions v2 | Large dataset that covers a wider range of reasoning abilities | Input/Output and task definition | | Each task consists of input/output, and a task definition. Additionally, each task contains a task definition. Further information is provided in the GitHub repository (https://github.com/allenai/natural-instructions) of the project and the Hugging Face data card (https://huggingface.co/datasets/Muennighoff/natural-instructions). | | 2022 | [500] | Wang et al.
LAMBADA | "LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."[501] | | | Information about this dataset's format is available in the HuggingFace dataset card (https://huggingface.co/datasets/lambada) and the project's website (https://zenodo.org/record/2630551#.Y7uPquzMKNi). The dataset can be downloaded here (https://zenodo.org/record/2630551/files/lambada-dataset.tar.gz), and the rejected data here (https://zenodo.org/record/2630551/files/rejected-data1.tar.gz). | | 2016 | [502] | Paperno et al.
FLAN | A re-preprocessed version of the FLAN dataset with updates since the original FLAN dataset was released is available in Hugging Face (https://huggingface.co/datasets/Muennighoff/flan): (1) test data (https://huggingface.co/datasets/Muennighoff/flan/tree/main/test), (2) train data (https://huggingface.co/datasets/Muennighoff/flan/tree/main/train), (3) validation data (https://huggingface.co/datasets/Muennighoff/flan/tree/main/validation) | The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan. Another FLAN GitHub repo (https://github.com/Muennighoff/FLAN) was created as well. This is the one associated with the dataset card in Hugging Face. | | | | 2021 | [503] | Wei et al.
Cybersecurity
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
MITRE ATTACK | The ATT&CK is a globally-accessible knowledge base of adversary tactics and techniques. | | | Data can be downloaded from these two GitHub repositories: version 2.1 (https://github.com/mitre-attack/attack-stix-data/archive/refs/heads/master.zip) and version 2.0 (https://github.com/mitre/cti/archive/refs/heads/master.zip) | | | [504] | M A
CAPEC | Common Attack Pattern Enumeration and Classification | | | from CAPEC's website (https://capec.mitre.org/data/archive/capec_latest.zip): Mechanisms of Attack (https://capec.mitre.org/data/csv/1000.csv.zip), Domains of Attack (https://capec.mitre.org/data/csv/3000.csv.zip) | | | [505] | C
CVE | CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. | | | from: Allitems (https://cve.mitre.org/data/downloads/allitems.csv) | | | [506] | C

CWE | Common Weakness Enumeration data. | | | from: Software Development (https://cwe.mitre.org/data/csv/699.csv.zip), Hardware Design (https://cwe.mitre.org/1194.csv.zip), Research Concepts (https://cwe.mitre.org/data/csv/1000.csv.zip) | | | [507] | C
MalwareTextDB | Annotated database of malware texts. | | | The GitHub repository of the project (https://github.com/statnlp-research/statnlp-datasets/tree/master/dataset) contains the data to download. | | | [508] | K
USENIX Security Collection of This data is not 1995 (https://www.usenix.or [509] U

Symposium security pre-processed. g/legacy/publications/librar S
proceedings proceedings y/proceedings/security95/in S
from USENIX dex.html), 1996 (https://ww
Security w.usenix.org/legacy/publica
Symposium – tions/library/proceedings/se
technical c96/), 1997 (https://www.us
sessions enix.org/legacy/publication
from 1995 to s/library/proceedings/ana9
2022. 7/technical.html), 1998 (htt
ps://www.usenix.org/legac
y/publications/library/proce
edings/sec98/technical.htm
l), 1999 (https://www.useni
x.org/legacy/events/sec99/
technical.html), 2000 (http
s://www.usenix.org/legacy/
events/sec2000/tech.html),
2001 (https://www.usenix.or
g/legacy/events/sec2001/te
ch.html), 2002 (https://ww
w.usenix.org/legacy/publica
tions/library/proceedings/se
c02/tech.html), 2003 (http
publications/library/proceed
ings/sec03/tech.html),
2004 (https://www.usenix.or
g/legacy/events/sec04/tec
h/), 2005 (https://www.usen
ix.org/legacy/events/sec05/
tech/), 2006 (https://www.u
senix.org/legacy/events/se
c06/tech/), 2007 (https://w
ww.usenix.org/legacy/event
s/sec07/tech/), 2008 (http
events/sec08/tech/#wed),
2009 (https://www.use
nix.org/legacy/events/
sec09/tech/), 2010 (htt
ps://www.usenix.org/le
gacy/events/sec10/tec
h/) 2011 (https://static.
usenix.org/event/sec1
1/tech/), 2012 (https://
www.usenix.org/confe
rence/usenixsecurity1
2/technical-sessions),
nix.org/conference/us
enixsecurity13/technic
al-sessions), 2014 (htt
ps://www.usenix.org/c
onference/usenixsecu
rity14/technical-sessio
ns), 2015 (https://www.
usenix.org/conferenc
e/usenixsecurity15/tec
hnical-sessions), 2016
(https://www.usenix.or
g/conference/usenixse
curity16/technical-ses
sions), 2017 (https://w
ww.usenix.org/confere
nce/usenixsecurity17/t
echnical-sessions),
nix.org/conference/us
enixsecurity18/technic
al-sessions), 2019 (htt
ps://www.usenix.org/c
onference/usenixsecu
rity19/technical-sessio
ns), 2020 (https://www.
usenix.org/conferenc
e/usenixsecurity20/tec
hnical-sessions), 2021
(https://www.usenix.or
g/conference/usenixse
curity21/technical-ses
sions), 2022 (https://w
ww.usenix.org/confere
nce/usenixsecurity22/t
echnical-sessions).
APTNotes | Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data. | This data is not pre-processed. | | The GitHub repository (https://github.com/aptnotes/data) of the project contains a file with links to the data stored in box. Data files can also be downloaded here (https://github.com/ameza13/APTNotesData/). | | | [510] | A
arXiv Cryptography and Security papers | Collection of articles about cybersecurity | This data is not pre-processed. | | All articles available here (https://github.com/ameza13/Cryptography-and-Security). | | | [511] | a
Small
collection of
security
Security eBooks eBooks, and This data is not [512][513][514][515][516][517][518][519][520][521][522][523]
for free security pre-processed.
presentations
publicly
available.
Repository of
worldwide
National Cyber
strategy This data is not [524]
Security strategy
documents pre-processed.
repository
about
cybersecurity.
Y
Data about
C
cybersecurity Tokenization,
Cyber Security Y
strategies meaningless- [525]
Natural Language W
from more frequent words
Processing Y
than 75 removal.
X
countries.
X
Sample of
APT reports,
All data is available in this
malware, Raw and
APT Reports GitHub (https://github.com/ [526]
technology, tokenize data b
collection blackorbird/APT_REPORT)
and available.
repository.
intelligence
collection
Data available in the

project's website (https://sit
es.google.com/site/offense
valsharedtask/olid).
Offensive
Language [527] Z
Identification Data is also available a
Dataset (OLID) here (https://github.co
m/ameza13/OLIDdata
set).
Threat reports (https://www.

ncsc.gov.uk/section/keep-u
p-to-date/threat-reports),
reports and advisory (http
s://www.ncsc.gov.uk/sectio
n/keep-up-to-date/reports-a
dvisories), news (https://w
ww.ncsc.gov.uk/section/ke
ep-up-to-date/ncsc-news),
Cyber reports blog-posts (https://www.ncs
from the National This data is not c.gov.uk/section/keep-up-to [528]
Cyber Security pre-processed. -date/ncsc-blog), speeches
Centre (https://www.ncsc.gov.uk/s
ection/keep-up-to-date/all-s
peeches).
Alternate list of reports

(https://github.com/bee
3202/cybersecurity-re
ports-ncsc).
APT reports by This data is not [529]

Kaspersky pre-processed.
Newsletters (https://thecyb
erwire.com/newsletters),
This data is not podcasts (https://thecyber [530]
The cyberwire
pre-processed. wire.com/podcasts), and
stories (https://thecyberwir
e.com/stories).
News (https://www.databre
aches.net/news/), list of
news from Aug 2022 to Feb
Databreaches This data is not [531]
2023 (https://github.com/be
news pre-processed.
e3202/cybersecurity-data-s
ources/blob/main/DATABR
EACHES.md)
News (https://cybernews.c
om/news/), curated list of
This data is not news (https://github.com/b [532]
Cybernews
pre-processed. ee3202/cybersecurity-data-
sources/blob/main/CYBER
NEWS.md)
News (https://www.hipaajou
This data is not [533]
Hipaajournal rnal.com/category/hipaa-co
pre-processed.
mpliance-news/)
This data is not News (https://www.bleeping [534]

Bleepingcomputer
pre-processed. computer.com/)
Cybercrime news (https://th

Therecord erecord.media/news/cyberc
pre-processed.
rime/)
Hacking news (https://www.
Hackread hackread.com/hacking-new
pre-processed.
s/)
APT reports (https://secure

list.com/category/apt-report
s/), archive (https://secureli
st.com/category/archive/),
DDOS reports (https://secu
relist.com/category/ddos-re
ports/), incidents (https://se
curelist.com/category/incid
ents/), Kaspersky security
bulletin (https://securelist.c
om/category/kaspersky-se
curity-bulletin/), industrial
Securelist threats (https://securelist.c
pre-processed.
om/category/industrial-thre
ats/), malware-reports (http
s://securelist.com/categor
y/malware-reports/),
opinions (https://securelist.
com/category/opinions/),
publications (https://securel
ist.com/category/publicatio
ns/), research (https://secu
relist.com/category/researc
h/), and SAS (https://secur
elist.com/category/sas/).
The Stucco Project's website with data

project information (https://stucco.
collects data github.io/data/)Reviewed
Stucco project not typically source with links to data
pre-processed
integrated sources (https://github.co
into security m/bee3202/cybersecurity-d
systems. ata-sources)
Website with Technical information (http
technical s://www.farsightsecurity.co
information, m/technical/), research (htt
Farsightsecurity reports, and ps://www.farsightsecurity.c
pre-processed
more about om/research/), reports (http
security s://www.farsightsecurity.co
topics. m/reports/).
Website with Papers per category (http

academic s://www.schneier.com/acad
Schneier papers about emic/), papers archive by
pre-processed
security date (https://www.schneier.
topics. com/academic/archive/).
Trendmicro | Website with research, news, and perspectives about security topics. | This data is not pre-processed | | Reviewed list of Trendmicro research, news, and perspectives (https://github.com/bee3202/cybersecurity-data-sources/blob/main/TRENDMICRO.md). | | | [541] |
data breaches (https://theh
ackernews.com/search/lab
el/data%20breach),
cyberattacks (https://theha
ckernews.com/search/labe
News about
This data is not l/Cyber%20Attack), [542]
The Hacker News cybersecurity
pre-processed vulnerabilities (https://theha
topics.
ckernews.com/search/labe
l/Vulnerability), malware
news (https://thehackernew
s.com/search/label/Malwar
e).
curated list of news (http

Security s://github.com/bee3202/cy
Krebsonsecurity news and bersecurity-data-sources/bl
pre-processed
investigation ob/main/krebsonsecurity.m
d)
Matrix of
Mitre Defend Defend json files [544]
artifacts
Mitre Atlas Mitre Atlas is This data is not [545]
a knowledge pre-processed
base of
adversary
tactics,
techniques,
and case
studies for
machine
learning (ML)
systems
based on
real-world
observations.
MITRE
Engage is a
framework for
planning and
discussing
adversary
engagement
operations
Mitre Engage that
pre-processed
empowers
you to
engage your
adversaries
and achieve
your
cybersecurity
goals.
Hacking Tutorials
pre-processed
Climate and sustainability

Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
TCFD reports | Database of company reports that include TCFD-related disclosures. | This data is not pre-processed | | Direct link to reports (https://www.tcfdhub.org/reports); curated list of reports (https://github.com/bee3202/cybersecurity-data-sources/blob/main/TCFDreports.md) | | | [548] | TCFD Knowledge Hub
Corporate Social Responsibility Reports | A listing of responsibility reports on the internet. | This data is not pre-processed | | Curated list of reports (https://github.com/bee3202/cybersecurity-data-sources/blob/main/RESPONSABILITYREPORTS.md) | | | [549] | ResponsibilityReports
The Intergovernmental Panel on Climate Change (IPCC) | A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response options | This data is not pre-processed | | Reports (https://www.ipcc.ch/reports/); curated list of reports (https://github.com/bee3202/cybersecurity-data-sources/blob/main/IPCC.md) | | | [550] | IPCC
Alliance for Curated list of blog posts (ht

Research on This data is not pre- tps://github.com/bee3202/c [551] ARCS
Corporate processed ybersecurity-data-sources/bl
Sustainability ob/main/arcs.md)
Guides (https://www.accoun
tingforsustainability.org/cont
ent/a4s/corporate/en/knowle
dge-hub.html?tab1=guides),
case studies (https://www.a
ccountingforsustainability.or
g/content/a4s/corporate/en/
ESG corpus: knowledge-hub.html?tab1=c
Knowledge Hub of This data is not pre- ase-studies), blogs (https:// [552] Mehra et al.
the Accounting for processed www.accountingforsustainab
Sustainability ility.org/content/a4s/corporat
e/en/knowledge-hub.html?ta
b1=blogs), and reports &
surveys (https://www.accou
ntingforsustainability.org/co
ntent/a4s/corporate/en/know
ledge-hub.html?tab1=report
s).
CLIMATE-FEVER | A dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet. | Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.[553] | | Dataset HF card (https://huggingface.co/datasets/climate_fever), and project's GitHub repository (https://github.com/tdiggelm/climate-fever-dataset). | | | [554] | Diggelmann et al.
Climate News dataset | A dataset for NLP and climate change media researchers | The dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database) | | Climate news DB (http://www.climate-news-db.com/), Project's GitHub repository (https://github.com/ADGEfficiency/climate-news-db) | | | [555] | ADGEfficiency
Climatext | Climatext is a dataset for sentence-based climate change topic detection. | | | HF dataset (https://huggingface.co/datasets/mwong/climatetext-evidence-related-evaluation/tree/main/data) | | | [556] | University of Zurich
GreenBiz | Collection of articles and news about climate and sustainability | This data is not pre-processed | | Curated list of climate articles (https://github.com/bee3202/cybersecurity-data-sources/blob/main/climate-tech.md); curated list of sustainability articles (https://github.com/bee3202/cybersecurity-data-sources/blob/main/sustainability-strategy.md) | | | [557] |
Top research pre-prints in climate and sustainability | List of pre-prints from researchers in the Reuters Hot List | This data is not pre-processed | | Curated list of pre-prints (https://github.com/bee3202/climate/blob/main/preprints_app_dimentions_ai.md) | | | [558] | Maurice Tamman
Curated list of corporate
This data is not pre- sustainability blogs (https:// [559]
ARCS
processed github.com/bee3202/climat
e/blob/main/arcs.md)
Website with
articles about This data is not pre- [560]
GreenBiz GreenBiz
climate and processed
sustainability
Curated list of articles (http

This data is not pre- s://github.com/bee3202/clim [561]
CSRWIRE CSRWIRE
processed ate/blob/main/csrwire_all.m
d)
Articles about
climate (https://ww
w.cdp.net/en/clima
te), water (https:// This data is not pre- [562]
CDP CDP
www.cdp.net/en/wa processed
ter), and forests (ht
tps://www.cdp.net/
en/forests)
Code data
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GitHub repositories | | This data is not pre-processed | Curated list of repositories from GitHub: 61 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.61.md) 62 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.62.md) 63 (http […] ources/blob/main/git_others.69.md) 70 (http […] ources/blob/main/git_others.71.md), 72 (http […] ources/blob/main/git_others.72.md), 73 (http […] ources/blob/main/git_others_main.md) | | | | | |
| IBM Public GitHub repositories | | This data is not pre-processed | Curated list of repositories (https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODEPUBLIC.md) from GitHub | | | | | |
| RedHat Public GitHub repositories | | This data is not pre-processed | […] main/CODERHPUBLIC.md) from GitHub | | | | | |
| StackExchange Public Archive.org files | | This data is not pre-processed | Curated list of files (https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODESEPPUBLIC.md) from Archive.org (https://archive.org/) | | | | | |
| Gitlab Public repositories | | This data is not pre-processed | Curated list of repositories from Gitlab: 1 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC.md) 2 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC2.md) | | | | | |
| Ansible Collections public repositories | | This data is not pre-processed | […] m/bee3202/cybersecurity-data-sources/blob/main/code/CODEANSIBLEPUBLIC.0.md) from GitHub. | | | | | |
| CodeParrot GitHub Code Dataset | | This data is not pre-processed | Curated list of repositories from Hugging Face: 1 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/GITHUBCODE_RAWPUBLIC.md) 2 (https://github.co […] main/code/GITHUBCODE_CLEANPUBLIC.md) 3 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUP.md) 4 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUPVALID.md) 5 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLIC.md) 6 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLICVALID.md) 7 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREPUBLIC.md) 8 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREVALIDPUBLIC.md) 9 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANTRAINPUBLIC.md) 10 (https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANVALIDPUBLIC.md) | | | | | |
| OKD | The Community Distribution of Kubernetes that powers Red Hat OpenShift | This data is not pre-processed | List of GitHub repositories of the project (https://github.com/orgs/okd-project/repositories) | | | | | |
| OpenShift | The developer and operations friendly Kubernetes distro | | List of GitHub repositories of the project (https://github.com/bee3202/open-shift-repos/blob/main/pages_openshift.md) | | | | | |
| Kubernetes | | This data is not pre-processed | List of GitHub repositories of the project (https://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes.md) | | | | | |
| Red Hat Developer | GitHub home of the Red Hat Developer program | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_developer.md) | | | | | |
| Red Hat Workshops | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_workshops.md) | | | | | |
| Kubernetes SIGs | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes_sigs.md) | | | | | |
| Konveyor | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_konveyor.md) | | | | | |
| RedHat Marketplace | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_marketplace.md) | | | | | |
| Redhat blog | | […] pre-processed | | | | | | |
| Kubernetes io | | […] pre-processed | | | | | | |
| Docs Openshift | | […] pre-processed | | | | | | |
| cncf io | | […] pre-processed | | | | | | |
| Kubernetes presentations | List of publicly available Kubernetes presentations | This data is not pre-processed | data link (https://github.com/bee3202/kubernetes_presentations/archive/refs/heads/main.zip) | | | | | |
| Red Hat Open Innovation Labs | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_open_innovation_labs.md) | | | | | |
| Red Hat Demos | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_RedHatDemos.md) | | | | | |
| Red Hat OpenShift Online | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_openshift-online.md) | | | | | |
| Software Collections | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_software_collections.md) | | | | | |
| Red Hat Insights | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_insights.md) | | | | | |
| Red Hat Government | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_government.md) | | | | | |
| Red Hat Consulting | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_consulting.md) | | | | | |
| Red Hat Communities of Practice | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_communities_of_practice.md) | | | | | |
| Red Hat Partner Tech | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_partner_tech.md) | | | | | |
| Red Hat Documentation | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_documentation.md) | | | | | |
| IBM | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_IBM.md) | | | | | |
| IBM Cloud | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_IBM_cloud.md) | | | | | |
| Build Lab Team | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_build_lab_team.md) | | | | | |
| Terraform IBM Modules | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_terraform-ibm-modules.md) | | | | | |
| Cloud Schematics | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_Cloud-Schematics.md) | | | | | |
| OCP Power Demos | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_ocp-power-demos.md) | | | | | |
| IBM App Modernization | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_IBMAppModernization.md) | | | | | |
| Kubernetes OperatorHub | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_k8s-operatorhub.md) | | | | | |
| Cloud Native Computing Foundation (CNCF) | | This data is not pre-processed | […] b/main/pages_cncf.md) | | | | | |
| Operator Framework | | […] pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/pages_operator-framework.md) | | | | | |
| GitHub repositories referenced in artifacthub.io | | This data is not pre-processed | List of GitHub repositories in artifacthub.io (https://github.com/bee3202/artifacthub_packages/blob/main/artifacthub_git_repos.md) | | | | | |
| Red Hat Communities of Practice | | This data is not pre-processed | […] b/main/pages_redhat_cop.md) | | | | | |
| Red Hat partner | | This data is not pre-processed | […] s://github.com/redhat-partner-tech?tab=repositories) | | | | | |
| IBM Repositories | | This data is not pre-processed | List of GitHub repositories for the project (https://github.com/bee3202/open-shift-repos/blob/main/p_ibm_repositories.md) | | | | | |
| Build Lab Team | | This data is not pre-processed | List of GitHub repositories for the project (https://github.com/orgs/ibm-build-lab/repositories) | | | | | |
| Operator Framework | | This data is not pre-processed | […] ps://github.com/bee3202/open-shift-repos/blob/main/operator_framework.md) | | | | | |
| GitHub repositories | | This data is not pre-processed | […] ps://github.com/bee3202/open-shift-repos/blob/main/individual_git_repos.md) | | | | | |
| Red Hat | | This data is not pre-processed | List of GitHub repositories of the project (https://www.redhat.com/en) | | | | | |
| Kubernetes Patterns | | This data is not pre-processed | […] s://github.com/bee3202/open-shift-repos/blob/main/kubernetes-patterns.md) | | | | | |
| Kubernetes Deployment & Security Patterns | | This data is not pre-processed | […] s://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf) | | | | | |
| Kubernetes for Full-Stack Developers | | This data is not pre-processed | […] b/main/kubernetes-fs-developers.md) | | | | | |
| Load Balancer Cloudwatch Metrics | | This data is not pre-processed | GitHub repository of the project (https://github.com/bee3202/open-shift-repos/blob/main/load-balancer-cloudwatch-metrics.md) | | | | | |
| Dynatrace | | This data is not pre-processed | [3] (https://www.dynatrace.com/support/help/how-to-use-dynatrace/metrics/built-in-metrics) | | | | | |
| AIOps Challenge 2020 Data | | This data is not pre-processed | GitHub repository of the project (https://github.com/NetManAIOps/AIOps-Challenge-2020-Data) | | | | | |
| Loghub | | This data is not pre-processed | List of repositories (https://github.com/logpai/loghub) | | | | | |
| HTML Pages | | This data is not pre-processed | List of HTML pages (https://github.com/bee3202/open-shift-repos/blob/main/html_pages.md) | | | | | |
| OpenShift ebooks | | […] pre-processed | | | | | | |
| Kubernetes ebooks | | This data is not pre-processed | Kubernetes Patterns (https://www.redhat.com/rhdc/managed-files/cm-oreilly-kubernetes-patterns-ebook-f19824-201910-en_1.pdf), Kubernetes Deployment (https://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf), Kubernetes for Full-Stack Developers (https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf) | | | | | |
| Kubernetes for Full-Stack Developers | | This data is not pre-processed | Kubernetes for Full-Stack Developers (https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf) | | | | | |
| List of public and licensed Github repositories | | This data is not pre-processed | List of repositories (https://github.com/bee3202/code_dataset/tree/main/licensed_batch_A) | | | | | |
Multivariate data
Financial
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma separated values | Classification, regression, time series | 2014 | [569][570] | M. Brown et al. |
| Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [571][572] | R. Quinlan |
| eBay auction data | Auction data from various eBay.com objects over various length auctions | Contains all bids, bidderID, bid times, and opening prices. | ~ 550 | Text | Regression, classification | 2012 | [573][574] | G. Shmueli et al. |
| Statlog (German Credit Data) | Binary credit classification into "good" or "bad" with many features | Various financial features of each person are given. | 690 | Text | Classification | 1994 | [575] | H. Hofmann |
| Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed to the bank is also given. | 45,211 | Text | Classification | 2012 | [576][577] | S. Moro et al. |
| Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [578][579] | O. Akbilgic |
| Default of Credit Card Clients | Credit default data for Taiwanese creditors. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [580][581] | I. Yeh |
Weather
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [582] | P. Collard |
| El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [583] | Pacific Marine Environmental Laboratory |
| Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [584] | D. Lucas |
| Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [585] | Mauna Loa Observatory |
| Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [444][586] | Johns Hopkins University |
| Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [587][588] | K. Zhang et al. |
Census
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [589] | United States Census Bureau |
| Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [590][591] | United States Census Bureau |
| IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None | 256,932 | Text | Classification, regression | 1999 | [592] | IPUMS |
| US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [593] | United States Census Bureau |
Transit
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [594][595] | H. Fanaee-T |
| New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [596] | New York City Taxi and Limousine Commission |
| Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [597][598] | M. Ferreira et al. |
| METR-LA | Speed from loop detectors in the highway of Los Angeles County. | Average speed in 5-minute timesteps. | 7,094,304 from 207 sensors and 34,272 timesteps | Comma separated values | Regression, Forecasting | 2014 | [599] | Jagadish et al. |
| PeMS | Speed, flow, occupancy and other metrics from loop detectors and other sensors in the freeway of the State of California, U.S.A. | Metric usually aggregated via Average into 5-minute timesteps. | 39,000 individual detectors, each containing years of timeseries | Comma separated values | Regression, Forecasting, Nowcasting, Interpolation | (updated realtime) | [600] | California Department of Transportation |
Internet
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks | None. | 3.5B | Text | Clustering, classification | 2013 | [601] | V. Granville |
| Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [602][603] | N. Kushmerick |
| Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [604] | D. Cook |
| URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [605][606] | J. Ma |
| Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | | | | | [607] | R. Mustafa et al. |
| Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [608] | D. Chen |
| Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [609][610] | Freebase |
| Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [611][612] | C. Masterharm et al. |
| The Pile | Assembling several large datasets of diverse and unstructured texts | Various (removing HTML and JavaScript from websites, removing duplicated sentences) | 825 GiB English text | JSON Lines[613][614] | Natural Language Processing, Text Prediction | 2021 | [615][613] | Gao et al. |
| OpenWebText | An open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. | Extracted non-HTML content, deduplicated, and tokenized. | 8,013,769 Documents, 38GB | Text | Natural Language Processing, Text Prediction | 2019 | [616][617] | A. Gokaslan, V. Cohen |
Games
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Poker Hand Dataset | 5 card hands from a standard 52 card deck. | Attributes of each hand are given, including the Poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [618] | R. Cattral |
| Connect-4 Dataset | Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [619] | J. Tromp |
| Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [620][621] | M. Bain et al. |
| Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [622] | R. Holte |
| Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [623] | D. Aha |
Other multivariate
| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [624] | D. Harrison et al. |
| The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [625] | Getty Center |
| Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [626][627] | Chu et al. |
| British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [628] | British Oceanographic Data Centre |
| Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [629] | J. Schlimmer |
| Entree Chicago Recommendation Dataset | Record of user interactions with Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [630] | R. Burke |
| Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [631][632] | P. van der Putten |
| Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [633][634] | V. Rajkovic et al. |
| University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [635] | S. Sounders et al. |
| Blood Transfusion Service Center Dataset | Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [636][637] | I. Yeh |
| Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [638][639] | University of Mainz |
| Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [640][641] | Nomao Labs |
| Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [642] | G. Wiederhold |
| Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | [643][644] | J. Kuzilek et al. |
| Mobile phone records | Telecommunications activity and interactions | Aggregation per geographical grid cells and every 15 minutes. | large | Text | Classification, clustering, regression | 2015 | [645] | G. Barlacchi et al. |
Curated repositories of datasets

Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating and standardizing dataset formats so that they are easier to use for machine learning research.
OpenML:[646] Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
PMLB:[647] A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
Metatext NLP (https://metatext.io/datasets): A community-maintained web repository containing nearly 1,000 benchmark datasets and counting. It covers many tasks, from classification to QA, and various languages, including English, Portuguese, and Arabic.
Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.[648][649]
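As a rough illustration of why a standardized format makes datasets easier to use, the sketch below parses a small tab-separated table whose label sits in a column named target and splits it into features and labels. The sample values, the target column name, and the helper load_standardized are assumptions made for this example only; they are not the API or file layout of any repository listed above.

```python
import csv
import io

# Made-up sample in a hypothetical "standardized" layout: tab-separated,
# numeric features, label stored in a column named "target".
SAMPLE_TSV = (
    "age\tincome\ttarget\n"
    "39\t77516\t0\n"
    "50\t83311\t1\n"
    "38\t215646\t0\n"
)


def load_standardized(text, target_column="target"):
    """Split a delimited table into feature rows and a list of targets."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    features, targets = [], []
    for row in reader:
        # Remove the label first so only feature columns remain in the row.
        targets.append(float(row.pop(target_column)))
        features.append([float(value) for value in row.values()])
    return features, targets


X, y = load_standardized(SAMPLE_TSV)
print(len(X), "instances,", len(X[0]), "features")
```

When every dataset follows the same layout, one loader of this kind can be reused across datasets, which is the convenience such repositories aim for. In practice the client libraries published by each repository are the better choice.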
See also
Comparison of deep learning software
List of manual image annotation tools
List of biological databases
References
1. Wissner-Gross, A. "Datasets Over Algorithms" (https://edge.org/response-detail/26587). Edge.com. Retrieved 8 January 2016.
2. Weiss, G. M.; Provost, F. (1 September 2003). "Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction" (https://www.jair.org/index.php/jair/article/download/10346/24739). Journal of Artificial Intelligence Research. AI Access Foundation. 19: 315–354. doi:10.1613/jair.1199 (https://doi.org/10.1613%2Fjair.1199). ISSN 1076-9757 (https://www.worldcat.org/issn/1076-9757). S2CID 2344521 (https://api.semanticscholar.org/CorpusID:2344521).
3. Turney, Peter (2000). "Types of cost in inductive concept learning". arXiv:cs/0212034 (https://arxiv.org/abs/cs/0212034).
4. Abney, Steven (17 September 2007). Semisupervised Learning for Computational Linguistics (https://books.google.com/books?id=VCd67cGB_rAC&pg=PP1). CRC Press. ISBN 978-1-4200-1080-0.
5. Žliobaitė, Indrė; Bifet, Albert; Pfahringer, Bernhard; Holmes, Geoff (2011). "Active Learning with Evolving Streaming Data". Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 597–612. doi:10.1007/978-3-642-23808-6_39 (https://doi.org/10.1007%2F978-3-642-23808-6_39). ISBN 978-3-642-23807-9. ISSN 0302-9743 (https://www.worldcat.org/issn/0302-9743).
6. Zafeiriou, S.; Kollias, D.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; 17. Nguyen, Duy; et al. (2006). "Real-time face detection and lip feature
Kotsia, I. (2017). "Aff-Wild: Valence and Arousal in-the-wild extraction using field-programmable gate arrays". IEEE
Challenge" (http://openaccess.thecvf.com/content_cvpr_2017_work Transactions on Systems, Man, and Cybernetics – Part B:
shops/w33/papers/Zafeiriou_Aff-Wild_Valence_and_CVPR_2017_ Cybernetics. 36 (4): 902–912. CiteSeerX 10.1.1.156.9848 (https://cit
paper.pdf) (PDF). Computer Vision and Pattern Recognition eseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.9848).
Workshops (CVPRW), 2017: 1980–1987. doi:10.1109/tsmcb.2005.862728 (https://doi.org/10.1109%2Ftsmcb.
doi:10.1109/CVPRW.2017.248 (https://doi.org/10.1109%2FCVPR 2005.862728). PMID 16903373 (https://pubmed.ncbi.nlm.nih.gov/16
W.2017.248). ISBN 978-1-5386-0733-6. S2CID 3107614 (https://ap 903373). S2CID 7334355 (https://api.semanticscholar.org/CorpusI
i.semanticscholar.org/CorpusID:3107614). D:7334355).
7. Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; 18. Kanade, Takeo, Jeffrey F. Cohn, and Yingli Tian. "Comprehensive
Schuller, B.; Kotsia, I.; Zafeiriou, S. (2019). "Deep Affect Prediction database for facial expression analysis (http://www.ri.cmu.edu/pub_
in-the-wild: Aff-Wild Database and Challenge, Deep Architectures, files/pub2/kanade_takeo_2000_1/kanade_takeo_2000_1.pdf)."
and Beyond" (https://rdcu.be/bmGm2). International Journal of Automatic Face and Gesture Recognition, 2000. Proceedings.
Computer Vision. 127 (6–7): 907–929. doi:10.1007/s11263-019- Fourth IEEE International Conference on. IEEE, 2000.
01158-4 (https://doi.org/10.1007%2Fs11263-019-01158-4). 19. Zeng, Zhihong; et al. (2009). "A survey of affect recognition
S2CID 13679040 (https://api.semanticscholar.org/CorpusID:136790 methods: Audio, visual, and spontaneous expressions". IEEE
40). Transactions on Pattern Analysis and Machine Intelligence. 31 (1):
8. Kollias, D.; Zafeiriou, S. (2019). "Expression, affect, action unit 39–58. CiteSeerX 10.1.1.144.217 (https://citeseerx.ist.psu.edu/view
recognition: Aff-wild2, multi-task learning and arcface" (https://bmvc doc/summary?doi=10.1.1.144.217). doi:10.1109/tpami.2008.52 (http
2019.org/wp-content/uploads/papers/0399-paper.pdf) (PDF). British s://doi.org/10.1109%2Ftpami.2008.52). PMID 19029545 (https://pub
Machine Vision Conference (BMVC), 2019. arXiv:1910.04855 (http med.ncbi.nlm.nih.gov/19029545).
s://arxiv.org/abs/1910.04855). 20. Lyons, Michael; Kamachi, Miyuki; Gyoba, Jiro (1998). "Facial
9. Kollias, D.; Schulc, A.; Hajiyev, E.; Zafeiriou, S. (2020). "Analysing expression images". The Japanese Female Facial Expression
affective behavior in the first abaw 2020 competition" (https://www.c (JAFFE) Database. doi:10.5281/zenodo.3451524 (https://doi.org/1
omputer.org/csdl/proceedings-article/fg/2020/307900a794/1kecIYu9 0.5281%2Fzenodo.3451524).
wL6). IEEE International Conference on Automatic Face and 21. Lyons, Michael; Akamatsu, Shigeru; Kamachi, Miyuki; Gyoba, Jiro
Gesture Recognition (FG), 2020: 637–643. arXiv:2001.11409 (http "Coding facial expressions with Gabor wavelets (https://zenodo.org/
s://arxiv.org/abs/2001.11409). doi:10.1109/FG47880.2020.00126 (ht record/3430156)." Automatic Face and Gesture Recognition, 1998.
tps://doi.org/10.1109%2FFG47880.2020.00126). ISBN 978-1-7281- Proceedings. Third IEEE International Conference on. IEEE, 1998.
3079-8. S2CID 210966051 (https://api.semanticscholar.org/CorpusI 22. Ng, Hong-Wei, and Stefan Winkler. "A data-driven approach to
D:210966051).
cleaning large face datasets (http://vintage.winklerbros.net/Publicati
10. Phillips, P. Jonathon; et al. (1998). "The FERET database and ons/icip2014a.pdf)." Image Processing (ICIP), 2014 IEEE
evaluation procedure for face-recognition algorithms". Image and International Conference on. IEEE, 2014.
Vision Computing. 16 (5): 295–306. doi:10.1016/s0262- 23. RoyChowdhury, Aruni; Lin, Tsung-Yu; Maji, Subhransu; Learned-
8856(97)00070-x (https://doi.org/10.1016%2Fs0262-8856%2897% Miller, Erik (2015). "One-to-many face recognition with bilinear
2900070-x). CNNs". arXiv:1506.01342 (https://arxiv.org/abs/1506.01342) [cs.CV
11. Wiskott, Laurenz; et al. (1997). "Face recognition by elastic bunch (https://arxiv.org/archive/cs.CV)].
graph matching". IEEE Transactions on Pattern Analysis and 24. Jesorsky, Oliver, Klaus J. Kirchberg, and Robert W. Frischholz.
Machine Intelligence. 19 (7): 775–779. CiteSeerX 10.1.1.44.2321 (h
"Robust face detection using the hausdorff distance." Audio-and
ttps://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2321). video-based biometric person authentication. Springer Berlin
doi:10.1109/34.598235 (https://doi.org/10.1109%2F34.598235). Heidelberg, 2001.
S2CID 30523165 (https://api.semanticscholar.org/CorpusID:305231
65). 25. Huang, Gary B., et al. Labeled faces in the wild: A database for
studying face recognition in unconstrained environments (https://ha
12. Livingstone, Steven R.; Russo, Frank A. (2018). "The Ryerson l.inria.fr/docs/00/32/19/23/PDF/Huang_long_eccv2008-lfw.pdf). Vol.
Audio-Visual Database of Emotional Speech and Song 1. No. 2. Technical Report 07-49, University of Massachusetts,
(RAVDESS): A dynamic, multimodal set of facial and vocal Amherst, 2007.
expressions in North American English" (https://www.ncbi.nlm.nih.g
ov/pmc/articles/PMC5955500). PLOS ONE. 13 (5): e0196391. 26. Bhatt, Rajen B., et al. "Efficient skin region segmentation using low
Bibcode:2018PLoSO..1396391L (https://ui.adsabs.harvard.edu/abs/ complexity fuzzy decision tree model (http://citeseerx.ist.psu.edu/vie
2018PLoSO..1396391L). doi:10.1371/journal.pone.0196391 (http wdoc/download?doi=10.1.1.708.9158&rep=rep1&type=pdf)." India
s://doi.org/10.1371%2Fjournal.pone.0196391). PMC 5955500 (http Conference (INDICON), 2009 Annual IEEE. IEEE, 2009.
s://www.ncbi.nlm.nih.gov/pmc/articles/PMC5955500). 27. Lingala, Mounika; et al. (2014). "Fuzzy logic color detection: Blue
PMID 29768426 (https://pubmed.ncbi.nlm.nih.gov/29768426). areas in melanoma dermoscopy images" (https://www.ncbi.nlm.nih.
gov/pmc/articles/PMC4287461). Computerized Medical Imaging and Graphics. 38 (5): 403–410. doi:10.1016/j.compmedimag.2014.03.007 (https://doi.org/10.1016%2Fj.compmedimag.2014.03.007). PMC 4287461 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287461). PMID 24786720 (https://pubmed.ncbi.nlm.nih.gov/24786720).
13. Livingstone, Steven R.; Russo, Frank A. (2018). "Emotion". The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). doi:10.5281/zenodo.1188976 (https://doi.org/10.5281%2Fzenodo.1188976).
14. Grgic, Mislav; Delac, Kresimir; Grgic, Sonja (2011). "SCface – surveillance cameras face database". Multimedia Tools and Applications. 51 (3): 863–879. doi:10.1007/s11042-009-0417-2 (https://doi.org/10.1007%2Fs11042-009-0417-2). S2CID 207218990 (https://api.semanticscholar.org/CorpusID:207218990).
15. Wallace, Roy, et al. "Inter-session variability modelling and joint factor analysis for face authentication (https://repository.ubn.ru.nl/bitstream/handle/2066/94489/94489.pdf)." Biometrics (IJCB), 2011 International Joint Conference on. IEEE, 2011.
16. Georghiades, A. "Yale face database". Center For Computational Vision And Control At Yale University, http://CVC.yale.edu/Projects/Yalefaces/Yalefa. 2: 1997.
28. Maes, Chris, et al. "Feature detection on 3D face surfaces for pose normalisation and recognition (https://lirias.kuleuven.be/retrieve/135678)." Biometrics: Theory Applications and Systems (BTAS), 2010 Fourth IEEE International Conference on. IEEE, 2010.
29. Savran, Arman, et al. "Bosphorus database for 3D face analysis (https://web.archive.org/web/20190222192331/http://pdfs.semanticscholar.org/4254/fbba3846008f50671edc9cf70b99d7304543.pdf)." Biometrics and Identity Management. Springer Berlin Heidelberg, 2008. 47–56.
30. Heseltine, Thomas, Nick Pears, and Jim Austin. "Three-dimensional face recognition: An eigensurface approach (http://eprints.whiterose.ac.uk/1526/01/austinj4.pdf)." Image Processing, 2004. ICIP'04. 2004 International Conference on. Vol. 2. IEEE, 2004.
31. Ge, Yun; et al. (2011). "3D Novel Face Sample Modeling for Face Recognition". Journal of Multimedia. 6 (5): 467–475. CiteSeerX 10.1.1.461.9710 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.461.9710). doi:10.4304/jmm.6.5.467-475 (https://doi.org/10.4304%2Fjmm.6.5.467-475).
32. Wang, Yueming; Liu, Jianzhuang; Tang, Xiaoou (2010). "Robust 3D face recognition by local shape difference boosting". IEEE Transactions on Pattern Analysis and Machine Intelligence. 32 (10): 1858–1870. CiteSeerX 10.1.1.471.2424 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.471.2424). doi:10.1109/tpami.2009.200 (https://doi.org/10.1109%2Ftpami.2009.200). PMID 20724762 (https://pubmed.ncbi.nlm.nih.gov/20724762). S2CID 15263913 (https://api.semanticscholar.org/CorpusID:15263913).
33. Zhong, Cheng, Zhenan Sun, and Tieniu Tan. "Robust 3D face recognition using learned visual codebook (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.580.8534&rep=rep1&type=pdf)." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
34. Zhao, G.; Huang, X.; Taini, M.; Li, S. Z.; Pietikäinen, M. (2011). "Facial expression recognition from near-infrared videos" (http://www.academia.edu/download/42229488/Image_and_Vision_Computing20160206-29020-1auzaon.pdf) (PDF). Image and Vision Computing. 29 (9): 607–619. doi:10.1016/j.imavis.2011.07.002 (https://doi.org/10.1016%2Fj.imavis.2011.07.002).
35. Soyel, Hamit, and Hasan Demirel. "Facial expression recognition using 3D facial feature distances (https://pdfs.semanticscholar.org/cf81/4b618fcbc9a556cdce225e74a8806867ba84.pdf)." Image Analysis and Recognition. Springer Berlin Heidelberg, 2007. 831–838.
36. Bowyer, Kevin W.; Chang, Kyong; Flynn, Patrick (2006). "A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition". Computer Vision and Image Understanding. 101 (1): 1–15. CiteSeerX 10.1.1.134.8784 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.134.8784). doi:10.1016/j.cviu.2005.05.005 (https://doi.org/10.1016%2Fj.cviu.2005.05.005).
37. Tan, Xiaoyang; Triggs, Bill (2010). "Enhanced local texture feature sets for face recognition under difficult lighting conditions". IEEE Transactions on Image Processing. 19 (6): 1635–1650. Bibcode:2010ITIP...19.1635T (https://ui.adsabs.harvard.edu/abs/2010ITIP...19.1635T). CiteSeerX 10.1.1.105.3355 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.3355). doi:10.1109/tip.2010.2042645 (https://doi.org/10.1109%2Ftip.2010.2042645). PMID 20172829 (https://pubmed.ncbi.nlm.nih.gov/20172829). S2CID 4943234 (https://api.semanticscholar.org/CorpusID:4943234).
38. Mousavi, Mir Hashem, Karim Faez, and Amin Asghari. "Three dimensional face recognition using SVM classifier (https://ieeexplore.ieee.org/abstract/document/4529822/)." Computer and Information Science, 2008. ICIS 08. Seventh IEEE/ACIS International Conference on. IEEE, 2008.
39. Amberg, Brian, Reinhard Knothe, and Thomas Vetter. "Expression invariant 3D face recognition with a morphable model (https://gravis.dmi.unibas.ch/publications/2008/FG08_Amberg.pdf)." Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on. IEEE, 2008.
40. İrfanoğlu, M. O., Berk Gökberk, and Lale Akarun. "3D shape-based face recognition using automatically registered facial surfaces (https://www.researchgate.net/profile/Berk_Gokberk/publication/4090704_3D_Shape-based_face_recognition_using_automatically_registered_facial_surfaces/links/0fcfd50ee9450e057a000000.pdf)." Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. Vol. 4. IEEE, 2004.
41. Beumier, Charles; Acheroy, Marc (2001). "Face verification from 3D and grey level clues". Pattern Recognition Letters. 22 (12): 1321–1329. Bibcode:2001PaReL..22.1321B (https://ui.adsabs.harvard.edu/abs/2001PaReL..22.1321B). doi:10.1016/s0167-8655(01)00077-0 (https://doi.org/10.1016%2Fs0167-8655%2801%2900077-0).
42. Afifi, Mahmoud; Abdelhamed, Abdelrahman (13 June 2017). "AFIF4: Deep Gender Classification based on AdaBoost-based Fusion of Isolated Facial Features and Foggy Faces". arXiv:1706.04277 (https://arxiv.org/abs/1706.04277) [cs.CV (https://arxiv.org/archive/cs.CV)].
43. "SoF dataset" (https://sites.google.com/view/sof-dataset). sites.google.com. Retrieved 18 November 2017.
44. "IMDb-WIKI" (https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/). data.vision.ee.ethz.ch. Retrieved 13 March 2018.
45. Patron-Perez, A.; Marszalek, M.; Reid, I.; Zisserman, A. (2012). "Structured learning of human interactions in TV shows". IEEE Transactions on Pattern Analysis and Machine Intelligence. 34 (12): 2441–2453. doi:10.1109/tpami.2012.24 (https://doi.org/10.1109%2Ftpami.2012.24). PMID 23079467 (https://pubmed.ncbi.nlm.nih.gov/23079467). S2CID 6060568 (https://api.semanticscholar.org/CorpusID:6060568).
46. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., & Bajcsy, R. (January 2013). Berkeley MHAD: A comprehensive multimodal human action database (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.432.5113&rep=rep1&type=pdf). In Applications of Computer Vision (WACV), 2013 IEEE Workshop on (pp. 53–60). IEEE.
47. Jiang, Y. G., et al. "THUMOS challenge: Action recognition with a large number of classes." ICCV Workshop on Action Recognition with a Large Number of Classes, http://crcv.ucf.edu/ICCV13-Action-Workshop. 2013.
48. Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos (https://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf)." Advances in Neural Information Processing Systems. 2014.
49. Stoian, Andrei; Ferecatu, Marin; Benois-Pineau, Jenny; Crucianu, Michel (2016). "Fast Action Localization in Large-Scale Video Archives". IEEE Transactions on Circuits and Systems for Video Technology. 26 (10): 1917–1930. doi:10.1109/TCSVT.2015.2475835 (https://doi.org/10.1109%2FTCSVT.2015.2475835). S2CID 31537462 (https://api.semanticscholar.org/CorpusID:31537462).
50. Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; Li, Li-Jia; Shamma, David A; Bernstein, Michael S; Fei-Fei, Li (2017). "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations". International Journal of Computer Vision. 123: 32–73. arXiv:1602.07332 (https://arxiv.org/abs/1602.07332). doi:10.1007/s11263-016-0981-7 (https://doi.org/10.1007%2Fs11263-016-0981-7). S2CID 4492210 (https://api.semanticscholar.org/CorpusID:4492210).
51. Karayev, S., et al. "A category-level 3-D object dataset: putting the Kinect to work (http://alliejanoch.com/iccvw2011.pdf)." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2011.
52. Tighe, Joseph, and Svetlana Lazebnik. "Superparsing: scalable nonparametric image parsing with superpixels (http://152.2.128.56/~jtighe/Papers/ECCV10/eccv10-jtighe.pdf) Archived (https://web.archive.org/web/20190806022752/http://152.2.128.56/~jtighe/Papers/ECCV10/eccv10-jtighe.pdf) 6 August 2019 at the Wayback Machine." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 352–365.
53. Arbelaez, P.; Maire, M; Fowlkes, C; Malik, J (May 2011). "Contour Detection and Hierarchical Image Segmentation" (http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/papers/amfm_pami2010.pdf) (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 33 (5): 898–916. doi:10.1109/tpami.2010.161 (https://doi.org/10.1109%2Ftpami.2010.161). PMID 20733228 (https://pubmed.ncbi.nlm.nih.gov/20733228). S2CID 206764694 (https://api.semanticscholar.org/CorpusID:206764694). Retrieved 27 February 2016.
54. Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Bourdev, Lubomir; Girshick, Ross; Hays, James; Perona, Pietro; Ramanan, Deva; Lawrence Zitnick, C.; Dollár, Piotr (2014). "Microsoft COCO: Common Objects in Context". arXiv:1405.0312 (https://arxiv.org/abs/1405.0312) [cs.CV (https://arxiv.org/archive/cs.CV)].
55. Russakovsky, Olga; et al. (2015). "Imagenet large scale visual recognition challenge". International Journal of Computer Vision. 115 (3): 211–252. arXiv:1409.0575 (https://arxiv.org/abs/1409.0575). doi:10.1007/s11263-015-0816-y (https://doi.org/10.1007%2Fs11263-015-0816-y). hdl:1721.1/104944 (https://hdl.handle.net/1721.1%2F104944). S2CID 2930547 (https://api.semanticscholar.org/CorpusID:2930547).
56. "COCO – Common Objects in Context" (https://cocodataset.org/). cocodataset.org.
57. Xiao, Jianxiong, et al. "Sun database: Large-scale scene recognition from abbey to zoo." Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. IEEE, 2010.
58. Donahue, Jeff; Jia, Yangqing; Vinyals, Oriol; Hoffman, Judy; Zhang, Ning; Tzeng, Eric; Darrell, Trevor (2013). "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". arXiv:1310.1531 (https://arxiv.org/abs/1310.1531) [cs.CV (https://arxiv.org/archive/cs.CV)].
59. Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database (https://www.researchgate.net/profile/Li_Jia_Li/publication/221361415_ImageNet_a_Large-Scale_Hierarchical_Image_Database/links/00b495388120dbc339000000/ImageNet-a-Large-Scale-Hierarchical-Image-Database.pdf)." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
60. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)." Advances in neural information processing systems. 2012.
61. Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; et al. (11 April 2015). "ImageNet Large Scale Visual Recognition Challenge". International Journal of Computer Vision. 115 (3): 211–252. arXiv:1409.0575 (https://arxiv.org/abs/1409.0575). doi:10.1007/s11263-015-0816-y (https://doi.org/10.1007%2Fs11263-015-0816-y). hdl:1721.1/104944 (https://hdl.handle.net/1721.1%2F104944). S2CID 2930547 (https://api.semanticscholar.org/CorpusID:2930547).
62. Ivan Krasin, Tom Duerig, Neil Alldrin, Andreas Veit, Sami Abu-El-Haija, Serge Belongie, David Cai, Zheyun Feng, Vittorio Ferrari, Victor Gomes, Abhinav Gupta, Dhyanesh Narayanan, Chen Sun, Gal Chechik, Kevin Murphy. "OpenImages: A public dataset for large-scale multi-label and multi-class image classification, 2017. Available from https://github.com/openimages."
63. Vyas, Apoorv, et al. "Commercial Block Detection in Broadcast News Videos (https://dl.acm.org/citation.cfm?id=2683546)." Proceedings of the 2014 Indian Conference on Computer Vision Graphics and Image Processing. ACM, 2014.
64. Hauptmann, Alexander G., and Michael J. Witbrock. "Story segmentation and detection of commercials in broadcast news video (https://pdfs.semanticscholar.org/5c21/6db7892fa3f515d816f84893bfab1137f0b2.pdf)." Research and Technology Advances in Digital Libraries, 1998. ADL 98. Proceedings. IEEE International Forum on. IEEE, 1998.
65. Tung, Anthony KH, Xin Xu, and Beng Chin Ooi. "Curler: finding and visualizing nonlinear correlation clusters (https://www.researchgate.net/profile/Anthony_Tung/publication/221214229_CURLER_Finding_and_Visualizing_Nonlinear_Correlated_Clusters/links/55b8691a08aed621de05cd92.pdf)." Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005.
66. Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition? (https://ieeexplore.ieee.org/abstract/document/5459469/)." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
67. Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories (https://hal.inria.fr/inria-00548585/document)." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
68. Griffin, G., A. Holub, and P. Perona. Caltech-256 object category dataset. California Inst. Technol., Tech. Rep. 7694, 2007. Available: http://authors.library.caltech.edu/7694, 2007.
69. Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information retrieval. Vol. 463. New York: ACM press, 1999.
70. COYO-700M: Image-Text Pair Dataset (https://github.com/kakaobrain/coyo-dataset), Kakao Brain, 3 November 2022, retrieved 3 November 2022
71. Fu, Xiping, et al. "NOKMeans: Non-Orthogonal K-means Hashing (https://pdfs.semanticscholar.org/9da2/abae3072fd9fcff0e13b8f00fc21f22d0085.pdf)." Computer Vision—ACCV 2014. Springer International Publishing, 2014. 162–177.
72. Heitz, Geremy; et al. (2009). "Shape-based object localization for descriptive classification". International Journal of Computer Vision. 84 (1): 40–62. CiteSeerX 10.1.1.142.280 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.280). doi:10.1007/s11263-009-0228-y (https://doi.org/10.1007%2Fs11263-009-0228-y). S2CID 646320 (https://api.semanticscholar.org/CorpusID:646320).
73. M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes Dataset (https://www.cityscapes-dataset.com/wordpress/wp-content/papercite-data/pdf/cordts2015cvprw.pdf)." In CVPR Workshop on The Future of Datasets in Vision, 2015.
74. Everingham, Mark; et al. (2010). "The pascal visual object classes (voc) challenge" (https://www.research.ed.ac.uk/portal/en/publications/the-pascal-visual-object-classes-voc-challenge(88a29de3-6220-442b-ab2d-284210cf72d6).html). International Journal of Computer Vision. 88 (2): 303–338. doi:10.1007/s11263-009-0275-4 (https://doi.org/10.1007%2Fs11263-009-0275-4). hdl:20.500.11820/88a29de3-6220-442b-ab2d-284210cf72d6 (https://hdl.handle.net/20.500.11820%2F88a29de3-6220-442b-ab2d-284210cf72d6). S2CID 4246903 (https://api.semanticscholar.org/CorpusID:4246903).
75. Felzenszwalb, Pedro F.; et al. (2010). "Object detection with discriminatively trained part-based models". IEEE Transactions on Pattern Analysis and Machine Intelligence. 32 (9): 1627–1645. CiteSeerX 10.1.1.153.2745 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.2745). doi:10.1109/tpami.2009.167 (https://doi.org/10.1109%2Ftpami.2009.167). PMID 20634557 (https://pubmed.ncbi.nlm.nih.gov/20634557). S2CID 3198903 (https://api.semanticscholar.org/CorpusID:3198903).
76. Gong, Yunchao, and Svetlana Lazebnik. "Iterative quantization: A procrustean approach to learning binary codes." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
77. "CINIC-10 dataset" (http://www.bayeswatch.com/2018/10/09/CINIC/). Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos J. Storkey (2018) CINIC-10 is not ImageNet or CIFAR-10. 9 October 2018. Retrieved 13 November 2018.
78. fashion-mnist: A MNIST-like fashion product database. Benchmark (https://github.com/zalandoresearch/fashion-mnist), Zalando Research, 7 October 2017, retrieved 7 October 2017
79. "notMNIST dataset" (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html). Machine Learning, etc. 8 September 2011. Retrieved 13 October 2017.
80. Houben, Sebastian, et al. "Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark (https://www.researchgate.net/profile/Sebastian_Houben/publication/242346625_Detection_of_Traffic_Signs_in_Real-World_Images_The_German_Traffic_Sign_Detection_Benchmark/links/0046352a03ec384e97000000/Detection-of-Traffic-Signs-in-Real-World-Images-The-German-Traffic-Sign-Detection-Benchmark.pdf)." Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013.
81. Mathias, Mayeul, et al. "Traffic sign recognition—How far are we from the solution? (http://www.varcity.eu/paper/ijcnn2013_mathias_trafficsign.pdf)." Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013.
82. Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? the kitti vision benchmark suite (https://www.cvlibs.net/publications/Geiger2012CVPR.pdf)." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
83. Sturm, Jürgen, et al. "A benchmark for the evaluation of RGB-D SLAM systems (http://jsturm.de/publications/data/sturm12iros.pdf)." Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012.
84. The KITTI Vision Benchmark Suite (https://www.youtube.com/watch?v=KXpZ6B1YB_k) on YouTube
85. Chaladze, G., Kalatozishvili, L. (2017). Linnaeus 5 dataset. Chaladze.com. Retrieved 13 November 2017, from http://chaladze.com/l5/
86. Kragh, Mikkel F.; et al. (2017). "FieldSAFE – Dataset for Obstacle Detection in Agriculture" (https://vision.eng.au.dk/fieldsafe). Sensors. 17 (11): 2579. arXiv:1709.03526 (https://arxiv.org/abs/1709.03526). Bibcode:2017Senso..17.2579K (https://ui.adsabs.harvard.edu/abs/2017Senso..17.2579K). doi:10.3390/s17112579 (https://doi.org/10.3390%2Fs17112579). PMC 5713196 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5713196). PMID 29120383 (https://pubmed.ncbi.nlm.nih.gov/29120383).
87. Afifi, Mahmoud (12 November 2017). "Gender recognition and biometric identification using a large dataset of hand images". arXiv:1711.04322 (https://arxiv.org/abs/1711.04322) [cs.CV (https://arxiv.org/archive/cs.CV)].
88. Lomonaco, Vincenzo; Maltoni, Davide (18 October 2017). "CORe50: a New Dataset and Benchmark for Continuous Object Recognition". arXiv:1705.03550 (https://arxiv.org/abs/1705.03550) [cs.CV (https://arxiv.org/archive/cs.CV)].
89. She, Qi; Feng, Fan; Hao, Xinyue; Yang, Qihan; Lan, Chuanlin; Lomonaco, Vincenzo; Shi, Xuesong; Wang, Zhengwei; Guo, Yao; Zhang, Yimin; Qiao, Fei; Chan, Rosa H.M. (15 November 2019). "OpenLORIS-Object: A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning". arXiv:1911.06487v2 (https://arxiv.org/abs/1911.06487v2) [cs.CV (https://arxiv.org/archive/cs.CV)].
90. Morozov, Alexei; Sushkova, Olga (13 June 2019). "THz and thermal video data set" (http://www.fullvision.ru/monitoring/description_eng.php). Development of the multi-agent logic programming approach to a human behaviour analysis in a multi-channel video surveillance. Moscow: IRE RAS. Retrieved 19 July 2019.
91. Morozov, Alexei; Sushkova, Olga; Kershner, Ivan; Polupanov, Alexander (9 July 2019). "Development of a method of terahertz intelligent video surveillance based on the semantic fusion of terahertz and 3D video images" (http://ceur-ws.org/Vol-2391/paper19.pdf) (PDF). CEUR. 2391: paper19. Retrieved 19 July 2019.
92. "Papers with Code - Daimler Monocular Pedestrian Detection Dataset" (https://paperswithcode.com/dataset/daimler-monocular-pedestrian-detection). paperswithcode.com. Retrieved 5 May 2023.
93. Enzweiler, Markus; Gavrila, Dariu M. (December 2009). "Monocular Pedestrian Detection: Survey and Experiments" (https://ieeexplore.ieee.org/document/4657363). IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (12): 2179–2195. doi:10.1109/TPAMI.2008.260 (https://doi.org/10.1109%2FTPAMI.2008.260). ISSN 1939-3539 (https://www.worldcat.org/issn/1939-3539). PMID 19834140 (https://pubmed.ncbi.nlm.nih.gov/19834140). S2CID 1192198 (https://api.semanticscholar.org/CorpusID:1192198).
94. Yin, Guojun; Liu, Bin; Zhu, Huihui; Gong, Tao; Yu, Nenghai (28 July 2020). "A Large Scale Urban Surveillance Video Dataset for Multiple-Object Tracking and Behavior Analysis". arXiv:1904.11784 (https://arxiv.org/abs/1904.11784) [cs.CV (https://arxiv.org/archive/cs.CV)].
95. "Object Recognition in Video Dataset" (https://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/). mi.eng.cam.ac.uk. Retrieved 5 May 2023.
96. Brostow, Gabriel J.; Shotton, Jamie; Fauqueur, Julien; Cipolla, Roberto (2008). "Segmentation and Recognition Using Structure from Motion Point Clouds" (https://link.springer.com/chapter/10.1007/978-3-540-88682-2_5). Computer Vision – ECCV 2008. Lecture Notes in Computer Science. Springer. 5302: 44–57. doi:10.1007/978-3-540-88682-2_5 (https://doi.org/10.1007%2F978-3-540-88682-2_5). ISBN 978-3-540-88681-5.
97. Brostow, Gabriel J.; Fauqueur, Julien; Cipolla, Roberto (15 January 2009). "Semantic object classes in video: A high-definition ground truth database" (https://www.sciencedirect.com/science/article/abs/pii/S0167865508001220). Pattern Recognition Letters. 30 (2): 88–97. Bibcode:2009PaReL..30...88B (https://ui.adsabs.harvard.edu/abs/2009PaReL..30...88B). doi:10.1016/j.patrec.2008.04.005 (https://doi.org/10.1016%2Fj.patrec.2008.04.005). ISSN 0167-8655 (https://www.worldcat.org/issn/0167-8655).
98. "WildDash 2 Benchmark" (https://wilddash.cc/railsem19). wilddash.cc. Retrieved 5 May 2023.
99. Zendel, Oliver; Murschitz, Markus; Zeilinger, Marcel; Steininger, Daniel; Abbasi, Sara; Beleznai, Csaba (June 2019). "RailSem19: A Dataset for Semantic Rail Scene Understanding" (https://ieeexplore.ieee.org/document/9025646). 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW): 1221–1229. doi:10.1109/CVPRW.2019.00161 (https://doi.org/10.1109%2FCVPRW.2019.00161). ISBN 978-1-7281-2506-0. S2CID 198166233 (https://api.semanticscholar.org/CorpusID:198166233).
100. "The Boreas Dataset" (https://www.boreas.utias.utoronto.ca/#/). www.boreas.utias.utoronto.ca. Retrieved 5 May 2023.
101. Burnett, Keenan; Yoon, David J.; Wu, Yuchen; Li, Andrew Zou; Zhang, Haowei; Lu, Shichen; Qian, Jingxing; Tseng, Wei-Kang; Lambert, Andrew; Leung, Keith Y. K.; Schoellig, Angela P.; Barfoot, Timothy D. (26 January 2023). "Boreas: A Multi-Season Autonomous Driving Dataset". arXiv:2203.10168 (https://arxiv.org/abs/2203.10168) [cs.RO (https://arxiv.org/archive/cs.RO)].
102. "Bosch Small Traffic Lights Dataset" (https://hci.iwr.uni-heidelberg.de/content/bosch-small-traffic-lights-dataset). hci.iwr.uni-heidelberg.de. 1 March 2017. Retrieved 5 May 2023.
103. Behrendt, Karsten; Novak, Libor; Botros, Rami (May 2017). "A deep learning approach to traffic lights: Detection, tracking, and classification" (https://ieeexplore.ieee.org/document/7989163). 2017 IEEE International Conference on Robotics and Automation (ICRA): 1370–1377. doi:10.1109/ICRA.2017.7989163 (https://doi.org/10.1109%2FICRA.2017.7989163). ISBN 978-1-5090-4633-1. S2CID 6257133 (https://api.semanticscholar.org/CorpusID:6257133).
104. "FRSign Dataset" (https://frsign.irt-systemx.fr/). frsign.irt-systemx.fr. Retrieved 5 May 2023.
105. Harb, Jeanine; Rébéna, Nicolas; Chosidow, Raphaël; Roblin, Grégoire; Potarusov, Roman; Hajri, Hatem (5 February 2020). "FRSign: A Large-Scale Traffic Light Dataset for Autonomous Trains". arXiv:2002.05665 (https://arxiv.org/abs/2002.05665) [cs.CY (https://arxiv.org/archive/cs.CY)].
106. "ifs-rwth-aachen/GERALD" (https://github.com/ifs-rwth-aachen/GERALD). Chair and Institute for Rail Vehicles and Transport Systems. 30 April 2023. Retrieved 5 May 2023.
107. Leibner, Philipp; Hampel, Fabian; Schindler, Christian (3 April 2023). "GERALD: A novel dataset for the detection of German mainline railway signals" (https://journals.sagepub.com/doi/abs/10.1177/09544097231166472). Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit: 095440972311664. doi:10.1177/09544097231166472 (https://doi.org/10.1177%2F09544097231166472). ISSN 0954-4097 (https://www.worldcat.org/issn/0954-4097). S2CID 257939937 (https://api.semanticscholar.org/CorpusID:257939937).
108. Wojek, Christian; Walk, Stefan; Schiele, Bernt (June 2009). "Multi-cue onboard pedestrian detection" (https://ieeexplore.ieee.org/document/5206638). 2009 IEEE Conference on Computer Vision and Pattern Recognition: 794–801. doi:10.1109/CVPR.2009.5206638 (https://doi.org/10.1109%2FCVPR.2009.5206638). ISBN 978-1-4244-3992-8. S2CID 18000078 (https://api.semanticscholar.org/CorpusID:18000078).
109. Toprak, Tuğçe; Aydın, Burak; Belenlioğlu, Burak; Güzeliş, Cüneyt; Selver, M. Alper (5 April 2020). "Railway Pedestrian Dataset (RAWPED)" (https://zenodo.org/record/3741742). doi:10.1109/TVT.2020.2983825 (https://doi.org/10.1109%2FTVT.2020.2983825). S2CID 216510283 (https://api.semanticscholar.org/CorpusID:216510283). Retrieved 5 May 2023.
110. Toprak, Tugce; Belenlioglu, Burak; Aydın, Burak; Guzelis, Cuneyt; Selver, M. Alper (May 2020). "Conditional Weighted Ensemble of Transferred Models for Camera Based Onboard Pedestrian Detection in Railway Driver Support Systems" (https://ieeexplore.ieee.org/document/9050835). IEEE Transactions on Vehicular Technology. 69 (5): 5041–5054. doi:10.1109/TVT.2020.2983825 (https://doi.org/10.1109%2FTVT.2020.2983825). ISSN 1939-9359 (https://www.worldcat.org/issn/1939-9359). S2CID 216510283 (https://api.semanticscholar.org/CorpusID:216510283).
111. Tilly, Roman; Neumaier, Philipp; Schwalbe, Karsten; Klasek, Pavel; Tagiew, Rustam; Denzler, Patrick; Klockau, Tobias; Boekhoff, Martin; Köppel, Martin (2023). "Open Sensor Data for Rail 2023" (in German). doi:10.57806/9mv146r0 (https://doi.org/10.57806%2F9mv146r0).
112. Tagiew, Rustam; Köppel, Martin; Schwalbe, Karsten; Denzler, Patrick; Neumaier, Philipp; Klockau, Tobias; Boekhoff, Martin; Klasek, Pavel; Tilly, Roman (4 May 2023). "OSDaR23: Open Sensor Data for Rail 2023". arXiv:2305.03001 (https://arxiv.org/abs/2305.03001) [cs.CV (https://arxiv.org/archive/cs.CV)].
113. "Home" (https://www.argoverse.org/). Argoverse. Retrieved 5 May 2023.
114. Chang, Ming-Fang; Lambert, John; Sangkloy, Patsorn; Singh, Jagjeet; Bak, Slawomir; Hartnett, Andrew; Wang, De; Carr, Peter; Lucey, Simon; Ramanan, Deva; Hays, James (6 November 2019). "Argoverse: 3D Tracking and Forecasting with Rich Maps". arXiv:1911.02620 (https://arxiv.org/abs/1911.02620) [cs.CV (https://arxiv.org/archive/cs.CV)].
115. Botta, M., A. Giordana, and L. Saitta. "Learning fuzzy concept definitions (https://pdfs.semanticscholar.org/9f0e/1349d1422f1b455b8ccc26ebf7b114b8db20.pdf)." Fuzzy Systems, 1993., Second IEEE International Conference on. IEEE, 1993.
116. Frey, Peter W.; Slate, David J. (1991). "Letter recognition using Holland-style adaptive classifiers" (https://doi.org/10.1007%2Fbf00114162). Machine Learning. 6 (2): 161–182. doi:10.1007/bf00114162 (https://doi.org/10.1007%2Fbf00114162).
117. Peltonen, Jaakko; Klami, Arto; Kaski, Samuel (2004). "Improved learning of Riemannian metrics for exploratory analysis". Neural Networks. 17 (8): 1087–1100. CiteSeerX 10.1.1.59.4865 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.4865). doi:10.1016/j.neunet.2004.06.008 (https://doi.org/10.1016%2Fj.neunet.2004.06.008). PMID 15555853 (https://pubmed.ncbi.nlm.nih.gov/15555853).
118. Liu, Cheng-Lin; Yin, Fei; Wang, Da-Han; Wang, Qiu-Feng (January 2013). "Online and offline handwritten Chinese character recognition: Benchmarking on new databases". Pattern Recognition. 46 (1): 155–162. Bibcode:2013PatRe..46..155L (https://ui.adsabs.harvard.edu/abs/2013PatRe..46..155L). doi:10.1016/j.patcog.2012.06.021 (https://doi.org/10.1016%2Fj.patcog.2012.06.021).
119. Wang, D.; Liu, C.; Yu, J.; Zhou, X. (2009). "CASIA-OLHWDB1: A Database of Online Handwritten Chinese Characters". 2009 10th International Conference on Document Analysis and Recognition: 1206–1210. doi:10.1109/ICDAR.2009.163 (https://doi.org/10.1109%2FICDAR.2009.163). ISBN 978-1-4244-4500-4. S2CID 5705532 (https://api.semanticscholar.org/CorpusID:5705532).
120. Williams, Ben H., Marc Toussaint, and Amos J. Storkey. Extracting motion primitives from natural handwriting data (https://www.era.lib.ed.ac.uk/bitstream/handle/1842/3221/BH%20Williams%20PhD%20thesis%2009.pdf?sequence=1). Springer Berlin Heidelberg, 2006.
121. Meier, Franziska, et al. "Movement segmentation using a primitive library (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.395.8598&rep=rep1&type=pdf)." Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011.
122. T. E. de Campos, B. R. Babu and M. Varma. Character recognition in natural images (http://personal.ee.surrey.ac.uk/Personal/T.Decampos/papers/decampos_etal_visapp2009.pdf). In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009
123. Cohen, Gregory; Afshar, Saeed; Tapson, Jonathan; André van Schaik (2017). "EMNIST: An extension of MNIST to handwritten letters". arXiv:1702.05373v1 (https://arxiv.org/abs/1702.05373v1) [cs.CV (https://arxiv.org/archive/cs.CV)].
124. "The EMNIST Dataset" (https://www.nist.gov/itl/products-and-services/emnist-dataset). NIST. 4 April 2017.
125. Cohen, Gregory; Afshar, Saeed; Tapson, Jonathan; André van Schaik (2017). "EMNIST: An extension of MNIST to handwritten letters". arXiv:1702.05373 (https://arxiv.org/abs/1702.05373) [cs.CV (https://arxiv.org/archive/cs.CV)].
126. Llorens, David, et al. "The UJIpenchars Database: a Pen-Based Database of Isolated Handwritten Characters (https://web.archive.org/web/20190806015012/https://pdfs.semanticscholar.org/24cf/ef15094c59322560377bbf8e4185245c654f.pdf)." LREC. 2008.
127. Calderara, Simone; Prati, Andrea; Cucchiara, Rita (2011). "Mixtures of von mises distributions for people trajectory shape analysis". IEEE Transactions on Circuits and Systems for Video Technology. 21 (4): 457–471. doi:10.1109/tcsvt.2011.2125550 (https://doi.org/10.1109%2Ftcsvt.2011.2125550). S2CID 1427766 (https://api.semanticscholar.org/CorpusID:1427766).
128. Guyon, Isabelle, et al. "Result analysis of the nips 2003 feature selection challenge (http://papers.nips.cc/paper/2728-result-analysis-of-the-nips-2003-feature-selection-challenge.pdf)." Advances in neural information processing systems. 2004.
129. Lake, B. M.; Salakhutdinov, R.; Tenenbaum, J. B. (11 December 2015). "Human-level concept learning through probabilistic program induction" (https://doi.org/10.1126%2Fscience.aab3050). Science. 350 (6266): 1332–1338. Bibcode:2015Sci...350.1332L (https://ui.adsabs.harvard.edu/abs/2015Sci...350.1332L). doi:10.1126/science.aab3050 (https://doi.org/10.1126%2Fscience.aab3050). ISSN 0036-8075 (https://www.worldcat.org/issn/0036-8075). PMID 26659050 (https://pubmed.ncbi.nlm.nih.gov/26659050).
132. Kussul, Ernst; Baidyk, Tatiana (2004). "Improved method of handwritten digit recognition tested on MNIST database". Image and Vision Computing. 22 (12): 971–981. doi:10.1016/j.imavis.2004.03.008 (https://doi.org/10.1016%2Fj.imavis.2004.03.008).
133. Xu, Lei; Krzyżak, Adam; Suen, Ching Y. (1992). "Methods of combining multiple classifiers and their applications to handwriting recognition". IEEE Transactions on Systems, Man and Cybernetics. 22 (3): 418–435. doi:10.1109/21.155943 (https://doi.org/10.1109%2F21.155943). hdl:10338.dmlcz/135217 (https://hdl.handle.net/10338.dmlcz%2F135217).
134. Alimoglu, Fevzi, et al. "Combining multiple classifiers for pen-based handwritten digit recognition (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.6299)." (1996).
135. Tang, E. Ke; et al. (2005). "Linear dimensionality reduction using relevance weighted LDA". Pattern Recognition. 38 (4): 485–493. Bibcode:2005PatRe..38..485T (https://ui.adsabs.harvard.edu/abs/2005PatRe..38..485T). doi:10.1016/j.patcog.2004.09.005 (https://doi.org/10.1016%2Fj.patcog.2004.09.005). S2CID 10580110 (https://api.semanticscholar.org/CorpusID:10580110).
136. Hong, Yi, et al. "Learning a mixture of sparse distance metrics for classification and dimensionality reduction (https://pages.ucsd.edu/~ztu/publication/iccv11_sparsemetric.pdf)." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
137. Thoma, Martin (2017). "The HASYv2 dataset". arXiv:1701.08380 (https://arxiv.org/abs/1701.08380) [cs.CV (https://arxiv.org/archive/cs.CV)].
138. Karki, Manohar; Liu, Qun; DiBiano, Robert; Basu, Saikat; Mukhopadhyay, Supratik (20 June 2018). "Pixel-level Reconstruction and Classification for Noisy Handwritten Bangla Characters". arXiv:1806.08037 (https://arxiv.org/abs/1806.08037) [cs.CV (https://arxiv.org/archive/cs.CV)].
139. Liu, Qun; Collier, Edward; Mukhopadhyay, Supratik (2019), "PCGAN-CHAR: Progressively Trained Classifier Generative Adversarial Networks for Classification of Noisy Handwritten Bangla Characters", Digital Libraries at the Crossroads of Digital Information for the Future, Springer International Publishing, pp. 3–15, arXiv:1908.08987 (https://arxiv.org/abs/1908.08987), doi:10.1007/978-3-030-34058-2_1 (https://doi.org/10.1007%2F978-3-030-34058-2_1), ISBN 978-3-030-34057-5, S2CID 201665955 (https://api.semanticscholar.org/CorpusID:201665955)
140. "iSAID" (https://captain-whu.github.io/iSAID/index.html). captain-whu.github.io. Retrieved 30 November 2021.
141. Zamir, Syed & Arora, Aditya & Gupta, Akshita & Khan, Salman & Sun, Guolei & Khan, Fahad & Zhu, Fan & Shao, Ling & Xia, Gui-Song & Bai, Xiang. (2019). iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. website (https://captain-whu.github.io/iSAID/index.html)
142. Yuan, Jiangye; Gleason, Shaun S.; Cheriyadat, Anil M. (2013). "Systematic benchmarking of aerial image segmentation". IEEE Geoscience and Remote Sensing Letters. 10 (6): 1527–1531. Bibcode:2013IGRSL..10.1527Y (https://ui.adsabs.harvard.edu/abs/2013IGRSL..10.1527Y). doi:10.1109/lgrs.2013.2261453 (https://doi.org/10.1109%2Flgrs.2013.2261453). S2CID 629629 (https://api.semanticscholar.org/CorpusID:629629).
143. Vatsavai, Ranga Raju. "Object based image classification: state of the art and computational challenges (https://dl.acm.org/citation.cfm?id=2534927)." Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. ACM, 2013.
144. Butenuth, Matthias, et al. "Integrating pedestrian simulation, tracking and event detection for crowd analysis (http://www.hartmann-alberts.de/dirk/pub/proceedings2011e.pdf)." Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011.
145. Fradi, Hajer, and Jean-Luc Dugelay. "Low level crowd analysis
130. Lake, Brenden (9 November 2019), Omniglot data set for one-shot using frame-wise normalized feature for people counting (http://ww
learning (https://github.com/brendenlake/omniglot), retrieved w.eurecom.fr/fr/publication/3841/download/mm-publi-3841.pdf)."
10 November 2019 Information Forensics and Security (WIFS), 2012 IEEE International
131. LeCun, Yann; et al. (1998). "Gradient-based learning applied to Workshop on. IEEE, 2012.
document recognition". Proceedings of the IEEE. 86 (11): 2278– 146. Johnson, Brian Alan, Ryutaro Tateishi, and Nguyen Thanh Hoan.
2324. CiteSeerX 10.1.1.32.9552 (https://citeseerx.ist.psu.edu/viewd "A hybrid pansharpening approach and multiscale object-based
oc/summary?doi=10.1.1.32.9552). doi:10.1109/5.726791 (https://do image analysis for mapping diseased pine and oak trees (http://cite
i.org/10.1109%2F5.726791). S2CID 14542261 (https://api.semantic seerx.ist.psu.edu/viewdoc/download?doi=10.1.1.826.9200&rep=rep
scholar.org/CorpusID:14542261). 1&type=pdf)." International journal of remote sensing34.20 (2013):
6969–6982.
147. Mohd Pozi, Muhammad Syafiq; Sulaiman, Md Nasir; Mustapha, 161. Waszak et al. "Semantic Segmentation in Underwater Ship
Norwati; Perumal, Thinagaran (2015). "A new classification model Inspections: Benchmark and Data Set (https://ieeexplore.ieee.org/d
for a class imbalanced data set using genetic programming and ocument/9998080)." IEEE Journal of Oceanic Engineering. IEEE,
support vector machines: Case study for wilt disease classification" 2022.
(https://www.tandfonline.com/doi/abs/10.1080/2150704X.2015.106 162. Ebadi, Ashkan; Paul, Patrick; Auer, Sofia; Tremblay, Stéphane (12
2159). Remote Sensing Letters. 6 (7): 568–577. November 2021). "NRC-GAMMA: Introducing a Novel Large Gas
doi:10.1080/2150704X.2015.1062159 (https://doi.org/10.1080%2F2 Meter Image Dataset". arXiv:2111.06827 (https://arxiv.org/abs/2111.
150704X.2015.1062159). S2CID 58788630 (https://api.semanticsc 06827) [cs.CV (https://arxiv.org/archive/cs.CV)].
holar.org/CorpusID:58788630).
163. Canada, Government of Canada National Research Council
148. Gallego, A.-J.; Pertusa, A.; Gil, P. "Automatic Ship Classification (2021). "The gas meter image dataset (NRC-GAMMA) - NRC
from Optical Aerial Images with Convolutional Neural Networks (htt Digital Repository" (https://nrc-digital-repository.canada.ca/eng/vie
ps://www.mdpi.com/2072-4292/10/4/511)." Remote Sensing. 2018; w/object/?id=ba1fc493-e65f-4c0a-ab31-ecbcdf00bfa4). nrc-digital-
10(4):511. repository.canada.ca. doi:10.4224/3c8s-z290 (https://doi.org/10.422
149. Gallego, A.-J.; Pertusa, A.; Gil, P. "MAritime SATellite Imagery 4%2F3c8s-z290). Retrieved 2 December 2021.
dataset". Available: https://www.iuii.ua.es/datasets/masati/, 2018. 164. Rabah, Chaima Ben; Coatrieux, Gouenou; Abdelfattah, Riadh
150. Johnson, Brian; Tateishi, Ryutaro; Xie, Zhixiao (2012). "Using (October 2020). "The Supatlantique Scanned Documents Database
geographically weighted variables for image classification". for Digital Image Forensics Purposes" (https://dx.doi.org/10.1109/ici
Remote Sensing Letters. 3 (6): 491–499. p40778.2020.9190665). 2020 IEEE International Conference on
doi:10.1080/01431161.2011.629637 (https://doi.org/10.1080%2F01 Image Processing (ICIP). IEEE: 2096–2100.
431161.2011.629637). S2CID 122543681 (https://api.semanticscho doi:10.1109/icip40778.2020.9190665 (https://doi.org/10.1109%2Fic
lar.org/CorpusID:122543681). ip40778.2020.9190665). ISBN 978-1-7281-6395-6.
151. Chatterjee, Sankhadeep, et al. "Forest Type Classification: A Hybrid S2CID 224881147 (https://api.semanticscholar.org/CorpusID:22488
NN-GA Model Based Approach (https://www.researchgate.net/profil 1147).
e/Sankhadeep_Chatterjee/publication/282605325_Forest_Type_Cl 165. Mills, Kyle; Tamblyn, Isaac (16 May 2018), Big graphene dataset,
assification_A_Hybrid_NN-GA_Model_Based_Approach/links/574 National Research Council of Canada,
93cb308ae5c51e29e6f1b/Forest-Type-Classification-A-Hybrid-NN- doi:10.4224/c8sc04578j.data (https://doi.org/10.4224%2Fc8sc0457
GA-Model-Based-Approach.pdf)." Information Systems Design and 8j.data)
Intelligent Applications. Springer India, 2016. 227–236. 166. Mills, Kyle; Spanner, Michael; Tamblyn, Isaac (16 May 2018).
152. Diegert, Carl. "A combinatorial method for tracing objects using "Quantum simulation". Quantum simulations of an electron in a two
semantics of their shape (https://www.osti.gov/servlets/purl/127883 dimensional potential well. National Research Council of Canada.
7)." Applied Imagery Pattern Recognition Workshop (AIPR), 2010 doi:10.4224/PhysRevA.96.042113.data (https://doi.org/10.4224%2F
IEEE 39th. IEEE, 2010. PhysRevA.96.042113.data).
153. Razakarivony, Sebastien, and Frédéric Jurie. "Small target 167. Rohrbach, M.; Amin, S.; Andriluka, M.; Schiele, B. (2012). "A
detection combining foreground and background manifolds (https:// database for fine grained activity detection of cooking activities".
hal.archives-ouvertes.fr/hal-00943444/file/13_mva-detection.pdf)." 2012 IEEE Conference on Computer Vision and Pattern
IAPR International Conference on Machine Vision Applications. Recognition. IEEE. pp. 1194–1201. doi:10.1109/cvpr.2012.6247801
2013. (https://doi.org/10.1109%2Fcvpr.2012.6247801). ISBN 978-1-4673-
154. "SpaceNet" (http://explore.digitalglobe.com/spacenet). 1228-8.
explore.digitalglobe.com. Retrieved 13 March 2018. 168. Kuehne, Hilde, Ali Arslan, and Thomas Serre. "The language of
155. Etten, Adam Van (5 January 2017). "Getting Started With SpaceNet actions: Recovering the syntax and semantics of goal-directed
Data" (https://medium.com/the-downlinq/getting-started-with-spacen human activities (https://www.cv-foundation.org/openaccess/content
et-data-827fd2ec9f53). The DownLinQ. Retrieved 13 March 2018. _cvpr_2014/papers/Kuehne_The_Language_of_2014_CVPR_pap
er.pdf)."Proceedings of the IEEE Conference on Computer Vision
156. Vakalopoulou, M.; Bus, N.; Karantzalosa, K.; Paragios, N. (July
2017). Integrating edge/boundary priors with classification scores and Pattern Recognition. 2014.
for building detection in very high resolution data. 2017 IEEE 169. Sviatoslav, Voloshynovskiy, et al. "Towards Reproducible results in
International Geoscience and Remote Sensing Symposium authentication based on physical non-cloneable functions: The
(IGARSS). pp. 3309–3312. doi:10.1109/IGARSS.2017.8127705 (htt Forensic Authentication Microstructure Optical Set (FAMOS). (http://
ps://doi.org/10.1109%2FIGARSS.2017.8127705). ISBN 978-1- vision.unige.ch/publications/postscript/2012/2012.WIFS.database.p
5090-4951-6. S2CID 8297433 (https://api.semanticscholar.org/Corp df)"Proc. Proceedings of IEEE International Workshop on
usID:8297433). Information Forensics and Security. 2012.
157. Yang, Yi; Newsam, Shawn (2010). Bag-of-visual-words and spatial 170. Olga, Taran and Shideh, Rezaeifar, et al. "PharmaPack: mobile
extensions for land-use classification. Proceedings of the 18th fine-grained recognition of pharma packages (https://archive-ouvert
SIGSPATIAL International Conference on Advances in Geographic e.unige.ch/unige:97444/ATTACHMENT01)."Proc. European Signal
Information Systems – GIS '10. New York, New York, USA: ACM Processing Conference (EUSIPCO). 2017.
Press. doi:10.1145/1869790.1869829 (https://doi.org/10.1145%2F1 171. Khosla, Aditya, et al. "Novel dataset for fine-grained image
869790.1869829). ISBN 9781450304283. S2CID 993769 (https://a categorization: Stanford dogs (https://people.csail.mit.edu/khosla/pa
pi.semanticscholar.org/CorpusID:993769). pers/fgvc2011.pdf)."Proc. CVPR Workshop on Fine-Grained Visual
158. Basu, Saikat; Ganguly, Sangram; Mukhopadhyay, Supratik; Categorization (FGVC). 2011.
DiBiano, Robert; Karki, Manohar; Nemani, Ramakrishna (3 172. Parkhi, Omkar M., et al. "Cats and dogs (http://www.robots.ox.ac.uk:
November 2015). DeepSat: a learning framework for satellite 5000/~vgg/publications/2012/parkhi12a/parkhi12a.pdf)."Computer
imagery. ACM. p. 37. doi:10.1145/2820783.2820816 (https://doi.org/ Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
10.1145%2F2820783.2820816). ISBN 9781450339674. on. IEEE, 2012.
S2CID 4387134 (https://api.semanticscholar.org/CorpusID:438713 173. Biggs, Benjamin; Boyne, Oliver; Charles, James; Fitzgibbon,
4). Andrew; Cipolla, Roberto (2020). Computer Vision – ECCV 2020.
159. Liu, Qun; Basu, Saikat; Ganguly, Sangram; Mukhopadhyay, Lecture Notes in Computer Science. Vol. 12356. arXiv:2007.11110
Supratik; DiBiano, Robert; Karki, Manohar; Nemani, Ramakrishna (https://arxiv.org/abs/2007.11110). doi:10.1007/978-3-030-58621-8
(21 November 2019). "DeepSat V2: feature augmented (https://doi.org/10.1007%2F978-3-030-58621-8). ISBN 978-3-030-
convolutional neural nets for satellite image classification". Remote 58620-1. S2CID 227173931 (https://api.semanticscholar.org/Corpu
Sensing Letters. 11 (2): 156–165. arXiv:1911.07747 (https://arxiv.or sID:227173931).
g/abs/1911.07747). doi:10.1080/2150704x.2019.1693071 (https://d 174. Razavian, Ali, et al. "CNN features off-the-shelf: an astounding
oi.org/10.1080%2F2150704x.2019.1693071). ISSN 2150-704X (htt baseline for recognition (https://www.cv-foundation.org/openaccess/
ps://www.worldcat.org/issn/2150-704X). S2CID 208138097 (https:// content_cvpr_workshops_2014/W15/papers/Razavian_CNN_Feat
api.semanticscholar.org/CorpusID:208138097). ures_Off-the-Shelf_2014_CVPR_paper.pdf)." Proceedings of the
160. Md Jahidul Islam, et al. "Semantic Segmentation of Underwater IEEE Conference on Computer Vision and Pattern Recognition
Imagery: Dataset and Benchmark (https://ieeexplore.ieee.org/abstra Workshops. 2014.
ct/document/9340821)." 2020 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS). IEEE, 2020.
175. Ortega, Michael; et al. (1998). "Supporting ranked boolean similarity 192. Taj-Eddin, I. A. T. F.; Afifi, M.; Korashy, M.; Hamdy, D.; Nasser, M.;
queries in MARS". IEEE Transactions on Knowledge and Data Derbaz, S. (July 2016). A new compression technique for
Engineering. 10 (6): 905–925. CiteSeerX 10.1.1.36.6079 (https://cit surveillance videos: Evaluation using new dataset. 2016 Sixth
eseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.6079). International Conference on Digital Information and Communication
doi:10.1109/69.738357 (https://doi.org/10.1109%2F69.738357). Technology and Its Applications (DICTAP). pp. 159–164.
176. He, Xuming, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. doi:10.1109/DICTAP.2016.7544020 (https://doi.org/10.1109%2FDI
"Multiscale conditional random fields for image labeling (ftp://www-v CTAP.2016.7544020). ISBN 978-1-4673-9609-7. S2CID 8698850
host.cs.toronto.edu/public_html/public_html/dist/zemel/Papers/cvpr (https://api.semanticscholar.org/CorpusID:8698850).
04.pdf)." Computer vision and pattern recognition, 2004. CVPR 193. Tabak, Michael A.; Norouzzadeh, Mohammad S.; Wolfson, David
2004. Proceedings of the 2004 IEEE computer society conference W.; Sweeney, Steven J.; Vercauteren, Kurt C.; Snow, Nathan P.;
on. Vol. 2. IEEE, 2004. Halseth, Joseph M.; Di Salvo, Paul A.; Lewis, Jesse S.; White,
177. Deneke, Tewodros, et al. "Video transcoding time prediction for Michael D.; Teton, Ben; Beasley, James C.; Schlichting, Peter E.;
proactive load balancing (https://ieeexplore.ieee.org/abstract/docum Boughton, Raoul K.; Wight, Bethany; Newkirk, Eric S.; Ivan, Jacob
ent/6890256/)." Multimedia and Expo (ICME), 2014 IEEE S.; Odell, Eric A.; Brook, Ryan K.; Lukacs, Paul M.; Moeller, Anna
International Conference on. IEEE, 2014. K.; Mandeville, Elizabeth G.; Clune, Jeff; Miller, Ryan S.;
Photopoulou, Theoni (2018). "Machine learning to classify animal
178. Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh,
species in camera trap images: Applications in ecology" (https://doi.
Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick,
Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, org/10.1111%2F2041-210X.13120). Methods in Ecology and
Evolution. 10 (4): 585–590. doi:10.1111/2041-210X.13120 (https://d
Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell
(13 April 2016). "Visual Storytelling". arXiv:1604.03968 (https://arxi oi.org/10.1111%2F2041-210X.13120). ISSN 2041-210X (https://ww
v.org/abs/1604.03968) [cs.CL (https://arxiv.org/archive/cs.CL)]. w.worldcat.org/issn/2041-210X).
194. Taj-Eddin, Islam A. T. F.; Afifi, Mahmoud; Korashy, Mostafa; Ahmed,
179. Wah, Catherine, et al. "The caltech-ucsd birds-200-2011 dataset (htt
ps://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf)." Ali H.; Ng, Yoke Cheng; Hernandez, Evelyng; Abdel-Latif, Salma M.
(November 2017). "Can we see photosynthesis? Magnifying the
(2011).
tiny color changes of plant green leaves using Eulerian video
180. Duan, Kun, et al. "Discovering localized attributes for fine-grained magnification". Journal of Electronic Imaging. 26 (6): 060501.
recognition (http://vision.soic.indiana.edu/papers/attributes2012cvp arXiv:1706.03867 (https://arxiv.org/abs/1706.03867).
r.pdf)." Computer Vision and Pattern Recognition (CVPR), 2012 Bibcode:2017JEI....26f0501T (https://ui.adsabs.harvard.edu/abs/20
IEEE Conference on. IEEE, 2012. 17JEI....26f0501T). doi:10.1117/1.jei.26.6.060501 (https://doi.org/1
181. "YouTube-8M Dataset" (https://research.google.com/youtube8m/). 0.1117%2F1.jei.26.6.060501). ISSN 1017-9909 (https://www.worldc
research.google.com. Retrieved 1 October 2016. at.org/issn/1017-9909). S2CID 12367169 (https://api.semanticschol
182. Abu-El-Haija, Sami; Kothari, Nisarg; Lee, Joonseok; Natsev, Paul; ar.org/CorpusID:12367169).
Toderici, George; Varadarajan, Balakrishnan; Vijayanarasimhan, 195. "Mathematical Mathematics Memes" (https://www.kaggle.com/abdel
Sudheendra (27 September 2016). "YouTube-8M: A Large-Scale ghanibelgaid/mathematical-mathematics-memes).
Video Classification Benchmark". arXiv:1609.08675 (https://arxiv.or 196. Karras, Tero; Laine, Samuli; Aila, Timo (June 2019). "A Style-Based
g/abs/1609.08675) [cs.CV (https://arxiv.org/archive/cs.CV)]. Generator Architecture for Generative Adversarial Networks" (http
183. "YFCC100M Dataset" (http://mmcommons.org). mmcommons.org. s://dx.doi.org/10.1109/cvpr.2019.00453). 2019 IEEE/CVF
Yahoo-ICSI-LLNL. Retrieved 1 June 2017. Conference on Computer Vision and Pattern Recognition (CVPR).
184. Bart Thomee; David A Shamma; Gerald Friedland; Benjamin IEEE: 4396–4405. arXiv:1812.04948 (https://arxiv.org/abs/1812.049
Elizalde; Karl Ni; Douglas Poland; Damian Borth; Li-Jia Li (25 April 48). doi:10.1109/cvpr.2019.00453 (https://doi.org/10.1109%2Fcvpr.2
2016). "Yfcc100m: The new data in multimedia research". 019.00453). ISBN 978-1-7281-3293-8. S2CID 54482423 (https://ap
Communications of the ACM. 59 (2): 64–73. arXiv:1503.01817 (http i.semanticscholar.org/CorpusID:54482423).
s://arxiv.org/abs/1503.01817). doi:10.1145/2812802 (https://doi.org/ 197. McAuley, Julian; Targett, Christopher; Shi, Qinfeng; Anton van den
10.1145%2F2812802). S2CID 207230134 (https://api.semanticsch Hengel (2015). "Image-based Recommendations on Styles and
olar.org/CorpusID:207230134). Substitutes". arXiv:1506.04757 (https://arxiv.org/abs/1506.04757)
185. Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, "LIRIS- [cs.CV (https://arxiv.org/archive/cs.CV)].
ACCEDE: A Video Database for Affective Content Analysis (https:// 198. "Amazon review data" (https://nijianmo.github.io/amazon/index.htm
hal.archives-ouvertes.fr/hal-01375518/document)," in IEEE l). nijianmo.github.io. Retrieved 8 October 2021.
Transactions on Affective Computing, 2015. 199. Ganesan, Kavita; Zhai, Chengxiang (2012). "Opinion-based entity
186. Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, "Deep ranking". Information Retrieval. 15 (2): 116–150.
Learning vs. Kernel Methods: Performance for Emotion Prediction doi:10.1007/s10791-011-9174-8 (https://doi.org/10.1007%2Fs1079
in Videos (https://hal.archives-ouvertes.fr/hal-01193144/documen 1-011-9174-8). hdl:2142/15252 (https://hdl.handle.net/2142%2F152
t)," in 2015 Humaine Association Conference on Affective 52). S2CID 16258727 (https://api.semanticscholar.org/CorpusID:16
Computing and Intelligent Interaction (ACII), 2015. 258727).
187. M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. 200. Lv, Yuanhua, Dimitrios Lymberopoulos, and Qiang Wu. "An
Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen, "The exploration of ranking heuristics in mobile local search (http://citese
mediaeval 2015 affective impact of movies task (https://www.resear erx.ist.psu.edu/viewdoc/download?doi=10.1.1.599.1442&rep=rep1
chgate.net/profile/Hanli_Wang2/publication/309704559_The_Medi &type=pdf)." Proceedings of the 35th international ACM SIGIR
aEval_2015_Affective_Impact_of_Movies_Task/links/581dada308a conference on Research and development in information retrieval.
e12715af33bc8/The-MediaEval-2015-Affective-Impact-of-Movies-T ACM, 2012.
ask.pdf)," in MediaEval 2015 Workshop, 2015. 201. Harper, F. Maxwell; Konstan, Joseph A. (2015). "The MovieLens
188. S. Johnson and M. Everingham, "Clustered Pose and Nonlinear Datasets: History and Context". ACM Transactions on Interactive
Appearance Models for Human Pose Estimation (http://sam.johnso Intelligent Systems. 5 (4): 19. doi:10.1145/2827872 (https://doi.org/1
n.io/research/publications/johnson10bmvc.pdf)", in Proceedings of 0.1145%2F2827872). S2CID 16619709 (https://api.semanticschola
the 21st British Machine Vision Conference (BMVC2010) r.org/CorpusID:16619709).
189. S. Johnson and M. Everingham, "Learning Effective Human Pose 202. Koenigstein, Noam, Gideon Dror, and Yehuda Koren. "Yahoo!
Estimation from Inaccurate Annotation (http://sam.johnson.io/resear music recommendations: modeling music ratings with temporal
ch/publications/johnson11cvpr.pdf)", In Proceedings of IEEE dynamics and item taxonomy (https://www.researchgate.net/profile/
Conference on Computer Vision and Pattern Recognition Noam_Koenigstein/publication/221141054_Yahoo_music_recomm
(CVPR2011) endations_Modeling_music_ratings_with_temporal_dynamics_and
190. Afifi, Mahmoud; Hussain, Khaled F. (2 November 2017). "The _item_taxonomy/links/5404184a0cf2c48563b03c68/Yahoo-music-r
Achievement of Higher Flexibility in Multiple Choice-based Tests ecommendations-Modeling-music-ratings-with-temporal-dynamics-
Using Image Classification Techniques". arXiv:1711.00972 (https:// and-item-taxonomy.pdf)." Proceedings of the fifth ACM conference
arxiv.org/abs/1711.00972) [cs.CV (https://arxiv.org/archive/cs.CV)]. on Recommender systems. ACM, 2011.
191. "MCQ Dataset" (https://sites.google.com/view/mcq-dataset/mcqe-da
taset). sites.google.com. Retrieved 18 November 2017.
203. McFee, Brian, et al. "The million song dataset challenge (https://bm 217. Amini, Massih R.; Usunier, Nicolas; Goutte, Cyril (2009). "Learning
cfee.github.io/papers/msdchallenge.pdf)." Proceedings of the 21st from Multiple Partially Observed Views – an Application to
international conference companion on World Wide Web. ACM, Multilingual Text Categorization" (http://papers.nips.cc/paper/3690-l
2012. earning-from-multiple-partially-observed-views-an-application-to-m
204. Bohanec, Marko, and Vladislav Rajkovic. "Knowledge acquisition ultilingual-text-categorization). Advances in Neural Information
and explanation for multi-attribute decision making (https://www.res Processing Systems. 22: 28–36.
earchgate.net/profile/Marko_Bohanec/publication/246614940_KNO 218. Liu, Ming; et al. (2015). "VRCA: a clustering algorithm for massive
WLEDGE_ACQUISITION_AND_EXPLANATION_FOR_MULTI-AT amount of texts" (https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/
TRIBUTE_DECISION_MAKING/links/02e7e532152f452d8700000 paper/download/10903/10990). Proceedings of the 24th
0.pdf)." 8th Intl Workshop on Expert Systems and their Applications. International Conference on Artificial Intelligence. AAAI Press.
1988. 219. Al-Harbi, S; Almuhareb, A; Al-Thubaity, A; Khorsheed, M. S.; Al-
205. Tan, Peter J., and David L. Dowe. "MML inference of decision Rajeh, A (2008). "Automatic Arabic Text Classification".
graphs with multi-way joins (http://www.csse.monash.edu.au/~dld/P Proceedings of the 9th International Conference on the Statistical
ublications/2002/Tan+Dowe2002_MMLDecisionGraphs.ps)." Analysis of Textual Data, Lyon, France.
Australian Joint Conference on Artificial Intelligence. 2002. 220. "Relationship and Entity Extraction Evaluation Dataset: Dstl/re3d"
206. "Quantifying comedy on YouTube: why the number of o's in your (https://github.com/dstl/re3d). GitHub. 17 December 2018.
LOL matter" (https://metatext.io/datasets). Metatext NLP Database. 221. "The Examiner – SpamClickBait Catalogue" (https://www.kaggle.co
Retrieved 26 October 2020. m/therohk/examine-the-examiner).
207. Kim, Byung Joo (2012). "A Classifier for Big Data" (https://link.sprin 222. "A Million News Headlines" (https://www.kaggle.com/therohk/millio
ger.com/chapter/10.1007/978-3-642-32692-9_63). Convergence n-headlines).
and Hybrid Information Technology. Communications in Computer
223. "One Week of Global News Feeds" (https://www.kaggle.com/theroh
and Information Science. Vol. 310. pp. 505–512. doi:10.1007/978-3-
k/global-news-week).
642-32692-9_63 (https://doi.org/10.1007%2F978-3-642-32692-9_6
3). ISBN 978-3-642-32691-2. 224. Kulkarni, Rohit (2018), Reuters News-Wire Archive, Harvard
Dataverse, doi:10.7910/DVN/XDB74W (https://doi.org/10.7910%2F
208. Pérezgonzález, Jose D.; Gilbey, Andrew (2011). "Predicting Skytrax DVN%2FXDB74W)
airport rankings from customer reviews" (https://www.ingentaconnec
t.com/content/hsp/cam/2011/00000005/00000004/art00007). 225. "IrishTimes – the Waxy-Wany News" (https://www.kaggle.com/thero
Journal of Airport Management. 5 (4): 335–339. hk/ireland-historical-news).
209. Loh, Wei-Yin, and Yu-Shan Shih. "Split selection methods for 226. "News Headlines Dataset For Sarcasm Detection" (https://kaggle.c
classification trees (http://www3.stat.sinica.edu.tw/statistica/oldpdf/A om/rmisra/news-headlines-dataset-for-sarcasm-detection).
7n41.pdf)." Statistica sinica(1997): 815–840. kaggle.com. Retrieved 27 April 2019.
210. Lim, Tjen-Sien; Loh, Wei-Yin; Shih, Yu-Shan (2000). "A comparison 227. Klimt, Bryan, and Yiming Yang. "Introducing the Enron Corpus (http
of prediction accuracy, complexity, and training time of thirty-three s://bklimt.com/papers/2004_klimt_ceas.pdf)." CEAS. 2004.
old and new classification algorithms". Machine Learning. 40 (3): 228. Kossinets, Gueorgi; Kleinberg, Jon; Watts, Duncan (2008). "The
203–228. doi:10.1023/a:1007608224229 (https://doi.org/10.1023%2 Structure of Information Pathways in a Social Communication
Fa%3A1007608224229). S2CID 17030953 (https://api.semanticsch Network". arXiv:0806.3201 (https://arxiv.org/abs/0806.3201)
olar.org/CorpusID:17030953). [physics.soc-ph (https://arxiv.org/archive/physics.soc-ph)].
211. Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. 229. Androutsopoulos, Ion; Koutsias, John; Chandrinos, Konstantinos V.;
Truong, Ngan Luu-Thuy Nguyen. "UIT-VSFC: Vietnamese Paliouras, George; Spyropoulos, Constantine D. (2000). "An
Students’ Feedback Corpus for Sentiment Analysis (https://ieeexplo evaluation of Naive Bayesian anti-spam filtering". In Potamias, G.;
re.ieee.org/document/8573337) Moustakis, V.; van Someren, M. (eds.). Proceedings of the
212. Ho, Vong Anh; Nguyen, Duong Huynh-Cong; Nguyen, Danh Workshop on Machine Learning in the New Information Age. 11th
Hoang; Pham, Linh Thi-Van; Nguyen, Duc-Vu; Nguyen, Kiet Van; European Conference on Machine Learning, Barcelona, Spain.
Nguyen, Ngan Luu-Thuy (2020). "Emotion Recognition for Vol. 11. pp. 9–17. arXiv:cs/0006013 (https://arxiv.org/abs/cs/000601
Vietnamese Social Media Text" (https://link.springer.com/chapter/1 3). Bibcode:2000cs........6013A (https://ui.adsabs.harvard.edu/abs/2
0.1007/978-981-15-6168-9_27). Computational Linguistics. 000cs........6013A).
Communications in Computer and Information Science. Vol. 1215. 230. Bratko, Andrej; et al. (2006). "Spam filtering using statistical data
pp. 319–333. arXiv:1911.09339 (https://arxiv.org/abs/1911.09339). compression models" (http://www.jmlr.org/papers/volume7/bratko06
doi:10.1007/978-981-15-6168-9_27 (https://doi.org/10.1007%2F978 a/bratko06a.pdf) (PDF). The Journal of Machine Learning
-981-15-6168-9_27). ISBN 978-981-15-6167-2. S2CID 208202333 Research. 7: 2673–2698.
(https://api.semanticscholar.org/CorpusID:208202333). 231. Almeida, Tiago A., José María G. Hidalgo, and Akebo Yamakami.
213. Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh "Contributions to the study of SMS spam filtering: new collection
Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen (24 April 2021). and results (http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
"Vietnamese Open-domain Complaint Detection in E-Commerce doceng11.pdf)."Proceedings of the 11th ACM symposium on
Websites". arXiv:2104.11969 (https://arxiv.org/abs/2104.11969) Document engineering. ACM, 2011.
[cs.CL (https://arxiv.org/archive/cs.CL)]. 232. Delany; Jane, Sarah; Buckley, Mark; Greene, Derek (2012). "SMS
214. Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van spam filtering: methods and data" (https://arrow.dit.ie/cgi/viewconten
Nguyen, Ngan Luu-Thuy Nguyen (26 January 2023). "ViHOS: Hate t.cgi?article=1022&context=scschcomart). Expert Systems with
Speech Spans Detection for Vietnamese". arXiv:2301.10186 (http Applications. 39 (10): 9899–9908. doi:10.1016/j.eswa.2012.02.053
s://arxiv.org/abs/2301.10186) [cs.CL (https://arxiv.org/archive/cs.C (https://doi.org/10.1016%2Fj.eswa.2012.02.053). S2CID 15546924
L)]. (https://api.semanticscholar.org/CorpusID:15546924).
215. Dermouche, Mohamed; Velcin, Julien; Khouas, Leila; Loudcher, 233. Joachims, Thorsten. A Probabilistic Analysis of the Rocchio
Sabine (2014). "A Joint Model for Topic-Sentiment Evolution over Algorithm with TFIDF for Text Categorization (https://apps.dtic.mil/dt
Time". 2014 IEEE International Conference on Data Mining. IEEE. ic/tr/fulltext/u2/a307731.pdf). No. CMU-CS-96-118. Carnegie-
pp. 773–778. doi:10.1109/icdm.2014.82 (https://doi.org/10.1109%2 mellon univ pittsburgh pa dept of computer science, 1996.
Ficdm.2014.82). ISBN 978-1-4799-4302-9. 234. Dimitrakakis, Christos, and Samy Bengio. Online Policy Adaptation
216. Rose, Tony; Stevenson, Mark; Whitehead, Miles (2002). "The for Ensemble Algorithms (https://infoscience.epfl.ch/record/82788/fil
Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's es/rr02-28.pdf). No. EPFL-REPORT-82788. IDIAP, 2002.
Language Resources" (https://web.archive.org/web/201908060150 235. Dooms, S. et al. "Movietweetings: a movie rating dataset collected
15/https://pdfs.semanticscholar.org/3e4b/dc7f8904c58f8fce1993892 from twitter, 2013. Available from
99ec1ed8e1226.pdf) (PDF). LREC. 2. S2CID 9239414 (https://api.s https://github.com/sidooms/MovieTweetings."
emanticscholar.org/CorpusID:9239414). Archived from the original 236. RoyChowdhury, Aruni; Lin, Tsung-Yu; Maji, Subhransu; Learned-
(https://pdfs.semanticscholar.org/3e4b/dc7f8904c58f8fce199389299 Miller, Erik (2017). "Twitter100k: A Real-world Dataset for Weakly
ec1ed8e1226.pdf) (PDF) on 6 August 2019. Supervised Cross-Media Retrieval". arXiv:1703.06618 (https://arxiv.
org/abs/1703.06618) [cs.CV (https://arxiv.org/archive/cs.CV)].
237. "huyt16/Twitter100k" (https://github.com/huyt16/Twitter100k).
GitHub. Retrieved 26 March 2018.
238. Go, Alec; Bhayani, Richa; Huang, Lei (2009). "Twitter sentiment 256. Sordoni, Alessandro; Galley, Michel; Auli, Michael; Brockett, Chris;
classification using distant supervision". CS224N Project Report, Ji, Yangfeng; Mitchell, Margaret; Nie, Jian-Yun; Gao, Jianfeng;
Stanford. 1: 12. Dolan, Bill (2015). "A Neural Network Approach to Context-
239. Chikersal, Prerna, Soujanya Poria, and Erik Cambria. "SeNTU: Sensitive Generation of Conversational Responses".
sentiment analysis of tweets by combining a rule-based classifier arXiv:1506.06714 (https://arxiv.org/abs/1506.06714) [cs.CL (https://a
with supervised learning (https://www.aclweb.org/anthology/S15-21 rxiv.org/archive/cs.CL)].
08)." Proceedings of the International Workshop on Semantic 257. Shaoul, C. & Westbury C. (2013) A reduced redundancy USENET
Evaluation, SemEval. 2015. corpus (2005–2011) Edmonton, AB: University of Alberta
240. Zafarani, Reza, and Huan Liu. "Social computing data repository at (downloaded from
ASU." School of Computing, Informatics and Decision Systems http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.d
Engineering, Arizona State University (2009). 258. KAN, M. (2011, January). NUS Short Message Service (SMS)
241. Bisgin, Halil, Nitin Agarwal, and Xiaowei Xu. "Investigating Corpus. Retrieved from
homophily in online social networks (http://www.academia.edu/dow http://www.comp.nus.edu.sg/entrepreneurship/innovation/osr/corpus/
nload/3746109/4191a533.pdf)." Web Intelligence and Intelligent Archived (https://web.archive.org/web/20180629055042/http://www.
Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International comp.nus.edu.sg/entrepreneurship/innovation/osr/corpus/) 29 June
Conference on. Vol. 1. IEEE, 2010. 2018 at the Wayback Machine
242. McAuley, Julian J.; Leskovec, Jure. "Learning to Discover Social 259. Stuck_In_the_Matrix. (2015, July 3). I have every publicly available
Circles in Ego Networks". NIPS. 2012: 2012. Reddit comment for research. ~ 1.7 billion comments @ 250 GB
compressed. Any interest in this? [Original post]. Message posted to
243. Šubelj, Lovro; Fiala, Dalibor; Bajec, Marko (2014). "Network-based
statistical comparison of citation topology of bibliographic https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_pu
databases" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC417829 260. Lowe, Ryan; Pow, Nissan; Serban, Iulian; Pineau, Joelle (2015).
2). Scientific Reports. 4 (6496): 6496. arXiv:1502.05061 (https://arxi "The Ubuntu Dialogue Corpus: A Large Dataset for Research in
v.org/abs/1502.05061). Bibcode:2014NatSR...4E6496S (https://ui.a Unstructured Multi-Turn Dialogue Systems". arXiv:1506.08909 (http
dsabs.harvard.edu/abs/2014NatSR...4E6496S). s://arxiv.org/abs/1506.08909) [cs.CL (https://arxiv.org/archive/cs.C
doi:10.1038/srep06496 (https://doi.org/10.1038%2Fsrep06496). L)].
PMC 4178292 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4178 261. Jason Williams, Antoine Raux, Matthew Henderson, "The Dialog State Tracking Challenge Series: A Review (https://ww
292). PMID 25263231 (https://pubmed.ncbi.nlm.nih.gov/25263231). w.microsoft.com/en-us/research/publication/the-dialog-state-trackin
244. Abdulla, N., et al. "Arabic sentiment analysis: Corpus-based and g-challenge-series-a-review/)", Dialogue & Discourse, April 2016.
lexicon-based." Proceedings of the IEEE conference on Applied 262. Hoppe, Travis (16 December 2021), The-Pile-FreeLaw (https://githu
Electrical Engineering and Computing Technologies (AEECT). b.com/thoppe/The-Pile-FreeLaw), retrieved 11 January 2023
2013. 263. Zheng, Lucia; Guha, Neel; Anderson, Brandon R.; Henderson,
245. Abooraig, Raddad, et al. "On the automatic categorization of Arabic Peter; Ho, Daniel E. (21 June 2021). "When does pretraining help?"
articles based on their political orientation (https://www.researchgat (https://dx.doi.org/10.1145/3462757.3466088). Proceedings of the
e.net/profile/Shadi_Alzubi/publication/324487844_Automatic_categ Eighteenth International Conference on Artificial Intelligence and
orization_of_Arabic_articles_based_on_their_political_orientation/li Law. New York, NY, USA: ACM: 159–168.
nks/5c1201c9299bf139c7549e1a/Automatic-categorization-of-Arabi doi:10.1145/3462757.3466088 (https://doi.org/10.1145%2F346275
c-articles-based-on-their-political-orientation.pdf)." Third 7.3466088). ISBN 9781450385268. S2CID 233296302 (https://api.s
International Conference on Informatics Engineering and emanticscholar.org/CorpusID:233296302).
Information Science (ICIEIS2014). 2014. 264. "pile-of-law/pile-of-law · Datasets at Hugging Face" (https://huggingf
246. Kawala, François, et al. "Prédictions d'activité dans les réseaux ace.co/datasets/pile-of-law/pile-of-law). huggingface.co. 4 July
sociaux en ligne (https://hal.archives-ouvertes.fr/hal-00881395/docu 2022. Retrieved 11 January 2023.
ment)." 4ième conférence sur les modèles et l'analyse des réseaux: 265. "About | Caselaw Access Project" (https://case.law/about/).
Approches mathématiques et informatiques. 2013. case.law. Retrieved 11 January 2023.
247. Sabharwal, Ashish; Samulowitz, Horst; Tesauro, Gerald (2015). 266. K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S.
"Selecting Near-Optimal Learners via Incremental Data Allocation". Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for
arXiv:1601.00024 (https://arxiv.org/abs/1601.00024) [cs.LG (https://a Text Classification", 2017 16th IEEE International Conference on
rxiv.org/archive/cs.LG)]. Machine Learning and Applications (ICMLA), pp. 364–371. doi:
248. Xu et al. "SemEval-2015 Task 1: Paraphrase and Semantic 10.1109/ICMLA.2017.0-134 (https://doi.org/10.1109/ICMLA.2017.0-
Similarity in Twitter (PIT) (https://www.aclweb.org/anthology/S15-20 134)
01)" Proceedings of the 9th International Workshop on Semantic 267. K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S.
Evaluation. 2015. Gerber and L. E. Barnes, "Web of Science Dataset",
249. Xu et al. "Extracting Lexically Divergent Paraphrases from Twitter (h doi:10.17632/9rw3vkcfy4.6 (https://doi.org/10.17632%2F9rw3vkcfy
ttps://transacl.org/ojs/index.php/tacl/article/viewFile/498/64)" 4.6)
Transactions of the Association for Computational Linguistics (TACL). 2014. 268. Galgani, Filippo, Paul Compton, and Achim Hoffmann. "Combining
250. Middleton, Stuart E; Middleton, Lee; Modafferi, Stefano (2014). different summarization techniques for legal text (https://www.aclwe
"Real-Time Crisis Mapping of Natural Disasters Using Social b.org/anthology/W12-0515)." Proceedings of the Workshop on
Media" (https://eprints.soton.ac.uk/370581/1/ieee-is2014.pdf) Innovative Hybrid Approaches to the Processing of Textual Data.
(PDF). IEEE Intelligent Systems. 29 (2): 9–17. Association for Computational Linguistics, 2012.
doi:10.1109/MIS.2013.126 (https://doi.org/10.1109%2FMIS.2013.12 269. Nagwani, N. K. (2015). "Summarizing large text collection using
6). S2CID 15139204 (https://api.semanticscholar.org/CorpusID:151 topic modeling and clustering based on MapReduce framework" (ht
39204). tps://doi.org/10.1186%2Fs40537-015-0020-5). Journal of Big Data.
251. "geoparsepy" (https://pypi.org/project/geoparsepy). 2016. Python 2 (1): 1–18. doi:10.1186/s40537-015-0020-5 (https://doi.org/10.118
PyPI library 6%2Fs40537-015-0020-5).
252. Gupta, Aakash (5 December 2020). "Dutch social media collection" 270. Schler, Jonathan; et al. (2006). "Effects of Age and Gender on
(http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/ Blogging" (https://www.aaai.org/Papers/Symposia/Spring/2006/SS-
MTPTL7). doi:10.5072/FK2/MTPTL7 (https://doi.org/10.5072%2FF 06-03/SS06-03-039.pdf) (PDF). AAAI Spring Symposium:
K2%2FMTPTL7). Computational Approaches to Analyzing Weblogs. 6.
271. Anand, Pranav, et al. "Believe Me-We Can Do This! Annotating
253. "Streamlit" (https://huggingface.co/datasets/viewer/?dataset=dutch_ Persuasive Acts in Blog Text." Computational Models of Natural
social). huggingface.co. Retrieved 18 December 2020. Argument. 2011.
254. "Dutch Social media collection" (https://kaggle.com/skylord/dutch-t 272. Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. "Social
weets). kaggle.com. Retrieved 18 December 2020. structure of Facebook networks." Physica A: Statistical Mechanics
255. Forsyth, E., Lin, J., & Martell, C. (2008, June 25). The NPS Chat and its Applications 391.16 (2012): 4165–4180.
Corpus. Retrieved from http://faculty.nps.edu/cmartell/NPSChat.htm
273. Richard, Emile; Savalle, Pierre-Andre; Vayatis, Nicolas (2012). 292. "DSL Corpus Collection" (http://ttg.uni-saarland.de/resources/DSLC
"Estimation of Simultaneously Sparse and Low Rank Matrices". C/). ttg.uni-saarland.de. Retrieved 22 September 2017.
arXiv:1206.6474 (https://arxiv.org/abs/1206.6474) [cs.DS (https://arx 293. "Urban Dictionary Words and Definitions" (https://www.kaggle.com/t
iv.org/archive/cs.DS)]. herohk/urban-dictionary-words-dataset).
274. Richardson, Matthew; Burges, Christopher JC; Renshaw, Erin 294. H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F.
(2013). "MCTest: A Challenge Dataset for the Open-Domain Laforest, E. Simperl, "T-REx: A Large Scale Alignment of Natural
Machine Comprehension of Text" (https://www.aclweb.org/antholog Language with Knowledge Base Triples (https://www.aclweb.org/an
y/D13-1020). EMNLP. 1. thology/L18-1544)", Proceedings of the Eleventh International
275. Weston, Jason; Bordes, Antoine; Chopra, Sumit; Rush, Alexander Conference on Language Resources and Evaluation (LREC-2018).
M.; Bart van Merriënboer; Joulin, Armand; Mikolov, Tomas (2015). 295. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy,
"Towards AI-Complete Question Answering: A Set of Prerequisite Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task
Toy Tasks". arXiv:1502.05698 (https://arxiv.org/abs/1502.05698) Benchmark and Analysis Platform for Natural Language
[cs.AI (https://arxiv.org/archive/cs.AI)]. Understanding". arXiv:1804.07461 (https://arxiv.org/abs/1804.0746
276. Marcus, Mitchell P.; Ann Marcinkiewicz, Mary; Santorini, Beatrice 1) [cs.CL (https://arxiv.org/archive/cs.CL)].
(1993). "Building a large annotated corpus of English: The Penn 296. "Computers Are Learning to Read—But They're Still Not So Smart"
Treebank" (http://repository.upenn.edu/cgi/viewcontent.cgi?article=1 (https://www.wired.com/story/computers-are-learning-to-read-but-th
246&context=cis_reports). Computational Linguistics. 19 (2): 313– eyre-still-not-so-smart/). Wired. Retrieved 29 December 2019.
330. 297. "GLUE Benchmark" (https://gluebenchmark.com/).
277. Collins, Michael (2003). "Head-driven statistical models for natural gluebenchmark.com. Retrieved 25 February 2019.
language parsing" (https://doi.org/10.1162%2F0891201033227533 298. Quan, Hoang Lam; Quang, Duy Le; Van Kiet, Nguyen; Ngan, Luu-
56). Computational Linguistics. 29 (4): 589–637. Thuy Nguyen. "UIT-ViIC: A Dataset for the First Evaluation on
doi:10.1162/089120103322753356 (https://doi.org/10.1162%2F089
Vietnamese Image Captioning" (https://www.springerprofessional.d
120103322753356).
e/uit-viic-a-dataset-for-the-first-evaluation-on-vietnamese-image-/18
278. Guyon, Isabelle, et al., eds. Feature extraction: foundations and 612672).
applications (https://books.google.com/books?id=FOTzBwAAQBAJ
299. To, Quoc Huy; Nguyen, Van Kiet; Nguyen, Luu Thuy Ngan; Nguyen,
&q=DEXTER). Vol. 207. Springer, 2008. Gia Tuan Anh (2020). "Gender Prediction Based on Vietnamese
279. Lin, Yuri, et al. "Syntactic annotations for the google books ngram Names with Machine Learning Techniques". Proceedings of the 4th
corpus (https://www.aclweb.org/anthology/P/P12/P12-3029.pdf)." International Conference on Natural Language Processing and
Proceedings of the ACL 2012 system demonstrations. Association Information Retrieval. pp. 55–60. arXiv:2010.10852 (https://arxiv.or
for Computational Linguistics, 2012. g/abs/2010.10852). doi:10.1145/3443279.3443309 (https://doi.org/1
280. Krishnamoorthy, Niveda; et al. (2013). "Generating Natural- 0.1145%2F3443279.3443309). ISBN 9781450377607.
Language Video Descriptions Using Text-Mined Knowledge" (http S2CID 224814110 (https://api.semanticscholar.org/CorpusID:22481
s://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/download/645 4110).
4/7204). AAAI. 1. 300. Nguyen, Luan Thanh; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy
281. Luyckx, Kim, and Walter Daelemans. "Personae: a Corpus for (18 March 2021). "Constructive and Toxic Speech Detection for
Author and Personality Prediction from Text (http://www.academia.e Open-Domain Social Media Comments in Vietnamese". Advances
du/download/30766398/759.pdf)." LREC. 2008. and Trends in Artificial Intelligence. Artificial Intelligence Practices.
282. Solorio, Thamar, Ragib Hasan, and Mainul Mizan. "A case study of Lecture Notes in Computer Science. Vol. 12798. pp. 572–583.
sockpuppet detection in wikipedia (https://www.aclweb.org/antholog arXiv:2103.10069 (https://arxiv.org/abs/2103.10069).
y/W13-1107)." Workshop on Language Analysis in Social Media doi:10.1007/978-3-030-79457-6_49 (https://doi.org/10.1007%2F978
(LASM) at NAACL HLT. 2013. -3-030-79457-6_49). ISBN 978-3-030-79456-9. S2CID 232269671
(https://api.semanticscholar.org/CorpusID:232269671).
283. "Pushshift Files" (https://files.pushshift.io/). files.pushshift.io.
Retrieved 12 January 2023. 301. M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A.
Jansen, and E. Dupoux (2015). "The Zero Resource Speech
284. Baumgartner, Jason; Zannettou, Savvas; Keegan, Brian; Squire,
Challenge 2015," in INTERSPEECH-2015.
Megan; Blackburn, Jeremy (23 January 2020). "The Pushshift
Reddit Dataset". arXiv:2001.08435 (https://arxiv.org/abs/2001.0843 302. M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, (2016). "The
5) [cs.SI (https://arxiv.org/archive/cs.SI)]. Zero Resource Speech Challenge 2015: Proposed Approaches
285. Ciarelli, Patrick Marques, and Elias Oliveira. "Agglomeration and and Results (https://core.ac.uk/download/pdf/82574050.pdf)," in
elimination of terms for dimensionality reduction (https://ieeexplore.i SLTU-2016.
eee.org/abstract/document/5364970/)." Intelligent Systems Design 303. Sakar, Betul Erdogdu; et al. (2013). "Collection and analysis of a
and Applications, 2009. ISDA'09. Ninth International Conference Parkinson speech dataset with multiple types of sound recordings".
on. IEEE, 2009. IEEE Journal of Biomedical and Health Informatics. 17 (4): 828–
286. Zhou, Mingyuan, Oscar Hernan Madrid Padilla, and James G. 834. doi:10.1109/jbhi.2013.2245674 (https://doi.org/10.1109%2Fjbh
Scott. "Priors for random count matrices derived from a family of i.2013.2245674). PMID 25055311 (https://pubmed.ncbi.nlm.nih.gov/
25055311). S2CID 15491516 (https://api.semanticscholar.org/Corp
negative binomial processes." Journal of the American Statistical
usID:15491516).
Association just-accepted (2015): 00–00.
304. Zhao, Shunan, et al. "Automatic detection of expressed emotion in
287. Kotzias, Dimitrios, et al. "From group to individual labels using deep
Parkinson's disease (https://www.researchgate.net/profile/Steven_L
features (http://datalab.ics.uci.edu/papers/kdd2015_dimitris.pdf)."
Proceedings of the 21th ACM SIGKDD International Conference on ivingstone2/publication/267623907_Automatic_detection_of_expre
Knowledge Discovery and Data Mining. ACM, 2015. ssed_emotion_in_Parkinson%27s_Disease/links/5453af1d0cf26d5
090a54cfe/Automatic-detection-of-expressed-emotion-in-Parkinson
288. Ning, Yue; Muthiah, Sathappan; Rangwala, Huzefa; Ramakrishnan, s-Disease.pdf)." Acoustics, Speech and Signal Processing
Naren (2016). "Modeling Precursors for Event Forecasting via (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
Nested Multi-Instance Learning". arXiv:1602.08033 (https://arxiv.or
g/abs/1602.08033) [cs.SI (https://arxiv.org/archive/cs.SI)]. 305. Used in: Hammami, Nacereddine, and Mouldi Bedda. "Improved
tree model for Arabic speech recognition." Computer Science and
289. Buza, Krisztian. "Feedback prediction for blogs (http://www.cs.bme. Information Technology (ICCSIT), 2010 3rd IEEE International
hu/~buza/pdfs/gfkl2012_blogs.pdf)." Data analysis, machine Conference on. Vol. 5. IEEE, 2010.
learning and knowledge discovery. Springer International
306. Maaten, Laurens. "Learning discriminative fisher kernels (https://lvd
Publishing, 2014. 145–152.
maaten.github.io/publications/papers/ICML_2011.pdf)."
290. Soysal, Ömer M (2015). "Association rule mining with mostly Proceedings of the 28th International Conference on Machine
associated sequential patterns". Expert Systems with Applications. Learning (ICML-11). 2011.
42 (5): 2582–2592. doi:10.1016/j.eswa.2014.10.049 (https://doi.org/
307. Cole, Ronald, and Mark Fanty. "Spoken letter recognition (https://w
10.1016%2Fj.eswa.2014.10.049).
ww.aclweb.org/anthology/H90-1075)." Proc. Third DARPA Speech
291. Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, and Natural Language Workshop. 1990.
Christopher D. (2015). "A large annotated corpus for learning
natural language inference". arXiv:1508.05326 (https://arxiv.org/ab
s/1508.05326) [cs.CL (https://arxiv.org/archive/cs.CL)].
308. Chapelle, Olivier; Sindhwani, Vikas; Keerthi, Sathiya S. (2008). 324. Esposito, Roberto; Radicioni, Daniele P. (2009). "Carpediem:
"Optimization techniques for semi-supervised support vector Optimizing the viterbi algorithm and applications to supervised
machines" (http://www.jmlr.org/papers/volume9/chapelle08a/chapel sequential learning" (http://www.jmlr.org/papers/volume10/esposito
le08a.pdf) (PDF). The Journal of Machine Learning Research. 9: 09a/esposito09a.pdf) (PDF). The Journal of Machine Learning
203–233. Research. 10: 1851–1880.
309. Kudo, Mineichi; Toyama, Jun; Shimbo, Masaru (1999). 325. Sourati, Jamshid; et al. (2016). "Classification Active Learning
"Multidimensional curve classification using passing-through Based on Mutual Information" (https://doi.org/10.3390%2Fe180200
regions". Pattern Recognition Letters. 20 (11): 1103–1111. 51). Entropy. 18 (2): 51. Bibcode:2016Entrp..18...51S (https://ui.ads
Bibcode:1999PaReL..20.1103K (https://ui.adsabs.harvard.edu/abs/ abs.harvard.edu/abs/2016Entrp..18...51S). doi:10.3390/e18020051
1999PaReL..20.1103K). CiteSeerX 10.1.1.46.2515 (https://citeseer (https://doi.org/10.3390%2Fe18020051).
x.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.2515). 326. Salamon, Justin; Jacoby, Christopher; Bello, Juan Pablo. "A dataset
doi:10.1016/s0167-8655(99)00077-x (https://doi.org/10.1016%2Fs0 and taxonomy for urban sound research (https://www.researchgate.
167-8655%2899%2900077-x). net/profile/Justin_Salamon/publication/267269056_A_Dataset_and
310. Jaeger, Herbert; et al. (2007). "Optimization and applications of _Taxonomy_for_Urban_Sound_Research/links/544936af0cf2f6388
echo state networks with leaky-integrator neurons". Neural 0810a84/A-Dataset-and-Taxonomy-for-Urban-Sound-Research.pd
Networks. 20 (3): 335–352. doi:10.1016/j.neunet.2007.04.016 (http f)." Proceedings of the ACM International Conference on
s://doi.org/10.1016%2Fj.neunet.2007.04.016). PMID 17517495 (http Multimedia. ACM, 2014.
s://pubmed.ncbi.nlm.nih.gov/17517495). 327. Lagrange, Mathieu; Lafay, Grégoire; Rossignol, Mathias; Benetos,
311. Tsanas, Athanasios; et al. (2010). "Accurate telemonitoring of Emmanouil; Roebel, Axel (2015). "An evaluation framework for
Parkinson's disease progression by noninvasive speech tests" (htt event detection using a morphological model of acoustic scenes".
p://precedings.nature.com/documents/3920/version/1). IEEE arXiv:1502.00141 (https://arxiv.org/abs/1502.00141) [stat.ML (https://
Transactions on Biomedical Engineering (Submitted manuscript). arxiv.org/archive/stat.ML)].
57 (4): 884–893. doi:10.1109/tbme.2009.2036000 (https://doi.org/1 328. Gemmeke, Jort F., et al. "Audio Set: An ontology and human-
0.1109%2Ftbme.2009.2036000). PMID 19932995 (https://pubmed. labeled dataset for audio events." IEEE International Conference on
ncbi.nlm.nih.gov/19932995). S2CID 7382779 (https://api.semantics Acoustics, Speech, and Signal Processing (ICASSP). 2017.
cholar.org/CorpusID:7382779).
329. "Watch out, birders: Artificial intelligence has learned to spot birds
312. Clifford, Gari D.; Clifton, David (2012). "Wireless technology in from their songs" (https://www.science.org/content/article/watch-out-
disease management and medicine". Annual Review of Medicine. birders-artificial-intelligence-has-learned-spot-birds-their-songs).
63: 479–492. doi:10.1146/annurev-med-051210-114650 (https://doi. Science | AAAS. 18 July 2018. Retrieved 22 July 2018.
org/10.1146%2Fannurev-med-051210-114650). PMID 22053737 (h
330. "Bird Audio Detection challenge" (http://machine-listening.eecs.qmu
ttps://pubmed.ncbi.nlm.nih.gov/22053737). l.ac.uk/bird-audio-detection-challenge/). Machine Listening Lab at
313. Zue, Victor; Seneff, Stephanie; Glass, James (1990). "Speech Queen Mary University. 3 May 2016. Retrieved 22 July 2018.
database development at MIT: TIMIT and beyond". Speech 331. Wichern, Gordon; Antognini, Joe; Flynn, Michael; Licheng Richard
Communication. 9 (4): 351–356. doi:10.1016/0167-6393(90)90010- Zhu; McQuinn, Emmett; Crow, Dwight; Manilow, Ethan; Jonathan Le
7 (https://doi.org/10.1016%2F0167-6393%2890%2990010-7).
Roux (2019). "WHAM!: Extending Speech Separation to Noisy
314. Kapadia, Sadik, Valtcho Valtchev, and S. J. Young. "MMI training for Environments". arXiv:1907.01160 (https://arxiv.org/abs/1907.01160)
continuous phoneme recognition on the TIMIT database." [cs.SD (https://arxiv.org/archive/cs.SD)].
Acoustics, Speech, and Signal Processing, 1993. ICASSP-93.,
332. Drossos, K., Lipping, S., and Virtanen, T. "Clotho: An Audio
1993 IEEE International Conference on. Vol. 2. IEEE, 1993. Captioning Dataset" IEEE International Conference on Acoustics,
315. Halabi, Nawar (2016). Modern Standard Arabic Phonetics for Speech, and Signal Processing (ICASSP). 2020.
Speech Synthesis (http://en.arabicspeechcorpus.com/Nawar%20H 333. Drossos, K., Lipping, S., and Virtanen, T. (2019). Clotho dataset
alabi%20PhD%20Thesis%20Revised.pdf) (PDF) (PhD Thesis). (Version 1.0) [Data set]. Zenodo.
University of Southampton, School of Electronics and Computer
http://doi.org/10.5281/zenodo.3490684 (https://doi.org/10.5281/zeno
Science. do.3490684)
316. Ardila, Rosana; Branson, Megan; Davis, Kelly; Henretty, Michael;
334. The CAIDA UCSD Dataset on the Witty Worm – 19–24 March 2004,
Kohler, Michael; Meyer, Josh; Morais, Reuben; Saunders, Lindsay;
http://www.caida.org/data/passive/witty_worm_dataset.xml
Tyers, Francis M.; Weber, Gregor (13 December 2019). "Common
Voice: A Massively-Multilingual Speech Corpus". 335. Chen, Zesheng, and Chuanyi Ji. "Optimal worm-scanning method
arXiv:1912.06670v2 (https://arxiv.org/abs/1912.06670v2) [cs.CL (htt using vulnerable-host distributions (https://web.archive.org/web/201
ps://arxiv.org/archive/cs.CL)]. 90806022753/https://pdfs.semanticscholar.org/672e/7be9499fef9a7
ff6b131b650a4de7614aae8.pdf)." International Journal of Security
317. "The LJ Speech Dataset" (https://keithito.com/LJ-Speech-Dataset).
and Networks 2.1–2 (2007): 71–80.
keithito.com. Retrieved 13 April 2022.
336. Kachuee, Mohamad, et al. "Cuff-less high-accuracy calibration-free
318. Zhou, Fang, Q. Claire, and Ross D. King. "Predicting the
blood pressure estimation using pulse transit time (http://download.
geographical origin of music (https://ieeexplore.ieee.org/abstract/do
xuebalib.com/533elteIDEwk.pdf)." Circuits and Systems (ISCAS),
cument/7023456/)." Data Mining (ICDM), 2014 IEEE International 2015 IEEE International Symposium on. IEEE, 2015.
Conference on. IEEE, 2014.
337. PhysioBank, PhysioToolkit. "PhysioNet: components of a new
319. Saccenti, Edoardo; Camacho, José (2015). "On the use of the research resource for complex physiologic signals." Circulation.
observation‐wise k‐fold operation in PCA cross‐validation". Journal
v101 i23. e215-e220.
of Chemometrics. 29 (8): 467–478. doi:10.1002/cem.2726 (https://d
oi.org/10.1002%2Fcem.2726). hdl:10481/55302 (https://hdl.handle. 338. Vergara, Alexander; et al. (2012). "Chemical gas sensor drift
net/10481%2F55302). S2CID 62248957 (https://api.semanticschola compensation using classifier ensembles". Sensors and Actuators
r.org/CorpusID:62248957). B: Chemical. 166: 320–329. doi:10.1016/j.snb.2012.01.074 (https://
doi.org/10.1016%2Fj.snb.2012.01.074).
320. Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR
2011: Proceedings of the 12th International Society for Music 339. Korotcenkov, G.; Cho, B. K. (2014). "Engineering approaches to
Information Retrieval Conference, 24–28 October 2011, Miami, improvement of conductometric gas sensor parameters. Part 2:
Florida. University of Miami, 2011. Decrease of dissipated (consumable) power and improvement
stability and reliability". Sensors and Actuators B: Chemical. 198:
321. Henaff, Mikael; et al. (2011). "Unsupervised learning of sparse
316–341. doi:10.1016/j.snb.2014.03.069 (https://doi.org/10.1016%2
features for scalable audio classification" (https://archives.ismir.net/i Fj.snb.2014.03.069).
smir2011/paper/000128.pdf) (PDF). ISMIR. 11.
340. Quinlan, John R (1992). "Learning with continuous classes" (https://
322. Rafii, Zafar (2017). "Music". MUSDB18 – a corpus for music
sci2s.ugr.es/keel/pdf/algorithm/congreso/1992-Quinlan-AI.pdf)
separation. doi:10.5281/zenodo.1117372 (https://doi.org/10.5281% (PDF). 5th Australian Joint Conference on Artificial Intelligence. 92.
2Fzenodo.1117372).
341. Merz, Christopher J.; Pazzani, Michael J. (1999). "A principal
323. Defferrard, Michaël; Benzi, Kirell; Vandergheynst, Pierre; Bresson, components approach to combining regression estimates" (https://d
Xavier (6 December 2016). "FMA: A Dataset For Music Analysis". oi.org/10.1023%2Fa%3A1007507221352). Machine Learning. 36
arXiv:1612.01840 (https://arxiv.org/abs/1612.01840) [cs.SD (https://
(1–2): 9–32. doi:10.1023/a:1007507221352 (https://doi.org/10.102
arxiv.org/archive/cs.SD)]. 3%2Fa%3A1007507221352).
342. Torres-Sospedra, Joaquin, et al. "UJIIndoorLoc-Mag: A new 353. Nathan, Ran; et al. (2012). "Using tri-axial acceleration data to
database for magnetic field-based localization problems." Indoor identify behavioral modes of free-ranging animals: general
Positioning and Indoor Navigation (IPIN), 2015 International concepts and tools illustrated for griffon vultures" (https://www.ncbi.
Conference on. IEEE, 2015. nlm.nih.gov/pmc/articles/PMC3284320). The Journal of
343. Berkvens, Rafael, Maarten Weyn, and Herbert Peremans. "Mean Experimental Biology. 215 (6): 986–996. doi:10.1242/jeb.058602 (h
Mutual Information of Probabilistic Wi-Fi Localization (https://www.r ttps://doi.org/10.1242%2Fjeb.058602). PMC 3284320 (https://www.
esearchgate.net/profile/Raf_Berkvens/publication/284154212_Mea ncbi.nlm.nih.gov/pmc/articles/PMC3284320). PMID 22357592 (http
n_Mutual_Information_of_Probabilistic_Wi-Fi_Localization/links/56 s://pubmed.ncbi.nlm.nih.gov/22357592).
4c6b7508aeab8ed5e92fcb.pdf)." Indoor Positioning and Indoor 354. Anguita, Davide, et al. "Human activity recognition on smartphones
Navigation (IPIN), 2015 International Conference on. Banff, using a multiclass hardware-friendly support vector machine (http
Canada: IPIN. 2015. s://upcommons.upc.edu/bitstream/handle/2117/101769/IWAAL201
344. Paschke, Fabian, et al. "Sensorlose Zustandsüberwachung an 2.pdf)." Ambient assisted living and home care. Springer Berlin
Synchronmotoren." Proceedings. 23. Workshop Computational Heidelberg, 2012. 216–223.
Intelligence, Dortmund, 5.-6. Dezember 2013. KIT Scientific 355. Su, Xing; Tong, Hanghang; Ji, Ping (2014). "Activity recognition
Publishing, 2013. with smartphone sensors". Tsinghua Science and Technology. 19
345. Lessmeier, Christian, et al. "Data Acquisition and Signal Analysis (3): 235–249. doi:10.1109/tst.2014.6838194 (https://doi.org/10.110
from Measured Motor Currents for Defect Detection in 9%2Ftst.2014.6838194). S2CID 62751498 (https://api.semanticsch
Electromechanical Drive Systems (https://www.researchgate.net/pr olar.org/CorpusID:62751498).
ofile/Olaf_Enge-Rosenblatt/publication/264441239_Data_Acquisiti 356. Kadous, Mohammed Waleed. Temporal classification: Extending
on_and_Signal_Analysis_from_Measured_Motor_Currents_for_De the classification paradigm to multivariate time series (https://pdfs.s
fect_Detection_in_Electromechanical_Drive_Systems/links/53df97 emanticscholar.org/4bad/c3f0ad169ed9ec7d073375e9b168fa9f6c8
e90cf2a768e49bb3b9.pdf)." f.pdf). Diss. The University of New South Wales, 2002.
346. Ugulino, Wallace, et al. "Wearable computing: Accelerometers’ data 357. Graves, Alex, et al. "Connectionist temporal classification: labelling
classification of body postures and movements (http://groupware.se unsegmented sequence data with recurrent neural networks (https://
condlab.inf.puc-rio.br/public/papers/2012.Ugulino.WearableComput mediatum.ub.tum.de/doc/1292048/file.pdf)." Proceedings of the
ing.HAR.Classifier.RIBBON.pdf) Archived (https://web.archive.org/w 23rd international conference on Machine learning. ACM, 2006.
eb/20200925222906/http://groupware.secondlab.inf.puc-rio.br/publi 358. Velloso, Eduardo, et al. "Qualitative activity recognition of weight
c/papers/2012.Ugulino.WearableComputing.HAR.Classifier.RIBBO lifting exercises (https://www.perceptualui.org/publications/velloso1
N.pdf) 25 September 2020 at the Wayback Machine." Advances in 3_ah.pdf)." Proceedings of the 4th Augmented Human International
Artificial Intelligence-SBIA 2012. Springer Berlin Heidelberg, 2012. Conference. ACM, 2013.
52–61.
359. Mortazavi, Bobak Jack, et al. "Determining the single best axis for
347. Schneider, Jan; et al. (2015). "Augmenting the senses: a review on exercise repetition recognition and counting on smartwatches (htt
sensor-based learning support" (https://www.ncbi.nlm.nih.gov/pmc/ p://www.thehabitslab.com/assets/papers/28.pdf) Archived (https://w
articles/PMC4367401). Sensors. 15 (2): 4097–4133. eb.archive.org/web/20211104043511/https://www.thehabitslab.co
Bibcode:2015Senso..15.4097S (https://ui.adsabs.harvard.edu/abs/2 m/assets/papers/28.pdf) 4 November 2021 at the Wayback
015Senso..15.4097S). doi:10.3390/s150204097 (https://doi.org/10. Machine." Wearable and Implantable Body Sensor Networks
3390%2Fs150204097). PMC 4367401 (https://www.ncbi.nlm.nih.go (BSN), 2014 11th International Conference on. IEEE, 2014.
v/pmc/articles/PMC4367401). PMID 25679313 (https://pubmed.ncb
360. Sapsanis, Christos, et al. "Improving EMG based Classification of
i.nlm.nih.gov/25679313).
basic hand movements using EMD (https://www.researchgate.net/pr
348. Madeo, Renata CB, Clodoaldo AM Lima, and Sarajane M. Peres. ofile/Christos_Sapsanis/publication/257602303_Improving_EMG_b
"Gesture unit segmentation using support vector machines: ased_classification_of_basic_hand_movements_using_EMD/links/
segmenting gestures from rest positions (https://tarjomefa.com/wp-c 56dfb7fd08ae979addef64a2/Improving-EMG-based-classification-o
ontent/uploads/2016/11/5781-English.pdf)." Proceedings of the f-basic-hand-movements-using-EMD.pdf)." Engineering in Medicine
28th Annual ACM Symposium on Applied Computing. ACM, 2013. and Biology Society (EMBC), 2013 35th Annual International
349. Lun, Roanna; Zhao, Wenbing (2015). "A survey of applications and Conference of the IEEE. IEEE, 2013.
human motion recognition with Microsoft Kinect" (https://engagedsc 361. Andrianesis, Konstantinos; Tzes, Anthony (2015). "Development
holarship.csuohio.edu/cgi/viewcontent.cgi?article=1417&context=e and control of a multifunctional prosthetic hand with shape memory
nece_facpub). International Journal of Pattern Recognition and alloy actuators". Journal of Intelligent & Robotic Systems. 78 (2):
Artificial Intelligence. 29 (5): 1555008. 257–289. doi:10.1007/s10846-014-0061-6 (https://doi.org/10.100
doi:10.1142/s0218001415550083 (https://doi.org/10.1142%2Fs021 7%2Fs10846-014-0061-6). S2CID 207174078 (https://api.semantic
8001415550083). scholar.org/CorpusID:207174078).
350. Theodoridis, Theodoros, and Huosheng Hu. "Action classification 362. Banos, Oresti; et al. (2014). "Dealing with the effects of sensor
of 3d human models using dynamic ANNs for mobile robot displacement in wearable activity recognition" (https://www.ncbi.nl
surveillance (https://cswww.sx.ac.uk/staff/hhu/Papers/ROBIO07-66. m.nih.gov/pmc/articles/PMC4118358). Sensors. 14 (6): 9995–
pdf) Archived (https://web.archive.org/web/20190806015015/https:// 10023. Bibcode:2014Senso..14.9995B (https://ui.adsabs.harvard.e
cswww.sx.ac.uk/staff/hhu/Papers/ROBIO07-66.pdf) 6 August 2019 du/abs/2014Senso..14.9995B). doi:10.3390/s140609995 (https://do
at the Wayback Machine." Robotics and Biomimetics, 2007. ROBIO i.org/10.3390%2Fs140609995). PMC 4118358 (https://www.ncbi.nl
2007. IEEE International Conference on. IEEE, 2007. m.nih.gov/pmc/articles/PMC4118358). PMID 24915181 (https://pub
351. Etemad, Seyed Ali, and Ali Arya. "3D human action recognition and med.ncbi.nlm.nih.gov/24915181).
style transformation using resilient backpropagation neural 363. Stisen, Allan, et al. "Smart Devices are Different: Assessing and
networks." Intelligent Computing and Intelligent Systems, 2009. Mitigating Mobile Sensing Heterogeneities for Activity Recognition
ICIS 2009. IEEE International Conference on. Vol. 4. IEEE, 2009. (h (https://www.researchgate.net/profile/Henrik_Blunck/publication/30
ttps://ieeexplore.ieee.org/abstract/document/5357690/) 1464144_Smart_Devices_are_Different_Assessing_and_Mitigatin
352. Altun, Kerem; Barshan, Billur; Tunçel, Orkun (2010). "Comparative gMobile_Sensing_Heterogeneities_for_Activity_Recognition/links/
study on classifying human activities with miniature inertial and 585a4c4908ae3852d256f186.pdf)." Proceedings of the 13th ACM
magnetic sensors". Pattern Recognition. 43 (10): 3605–3620. Conference on Embedded Networked Sensor Systems. ACM,
Bibcode:2010PatRe..43.3605A (https://ui.adsabs.harvard.edu/abs/2 2015.
010PatRe..43.3605A). doi:10.1016/j.patcog.2010.04.019 (https://do 364. Bhattacharya, Sourav, and Nicholas D. Lane. "From Smart to Deep:
i.org/10.1016%2Fj.patcog.2010.04.019). hdl:11693/11947 (https://h Robust Activity Recognition on Smartwatches using Deep Learning
dl.handle.net/11693%2F11947). (http://discovery.ucl.ac.uk/1503672/1/deepwatch_wristsense.pdf)."
365. Bacciu, Davide; et al. (2014). "An experimental characterization of
reservoir computing in ambient assisted living applications". Neural
Computing and Applications. 24 (6): 1451–1464.
doi:10.1007/s00521-013-1364-4 (https://doi.org/10.1007%2Fs0052
1-013-1364-4). hdl:11568/237959 (https://hdl.handle.net/11568%2F
237959). S2CID 14124013 (https://api.semanticscholar.org/CorpusI
D:14124013).
366. Palumbo, Filippo; Barsocchi, Paolo; Gallicchio, Claudio; Chessa, Stefano; Micheli, Alessio (2013). "Multisensor Data Fusion for Activity Recognition Based on Reservoir Computing" (https://link.springer.com/chapter/10.1007/978-3-642-41043-7_3). Evaluating AAL Systems Through Competitive Benchmarking. Communications in Computer and Information Science. Vol. 386. pp. 24–35. doi:10.1007/978-3-642-41043-7_3 (https://doi.org/10.1007%2F978-3-642-41043-7_3). ISBN 978-3-642-41042-0.
367. Reiss, Attila, and Didier Stricker. "Introducing a new benchmarked dataset for activity monitoring (https://www.researchgate.net/profile/Attila_Reiss/publication/235348485_Introducing_a_New_Benchmarked_Dataset_for_Activity_Monitoring/links/00b7d5309d19ca4346000000/Introducing-a-New-Benchmarked-Dataset-for-Activity-Monitoring.pdf)." Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012.
368. Roggen, Daniel, et al. "OPPORTUNITY: Towards opportunistic activity and context recognition systems (https://infoscience.epfl.ch/record/138648/files/RoggenFoCaHoFaTrLuPiBaKuFeHoRiChMi09.pdf)." World of Wireless, Mobile and Multimedia Networks & Workshops, 2009. WoWMoM 2009. IEEE International Symposium on a. IEEE, 2009.
369. Kurz, Marc, et al. "Dynamic quantification of activity recognition capabilities in opportunistic systems (https://www.researchgate.net/profile/Marc_Kurz/publication/220271166_Dynamic_Quantification_of_Activity_Recognition_Capabilities_in_Opportunistic_Systems/links/09e4150f66b480c97a000000/Dynamic-Quantification-of-Activity-Recognition-Capabilities-in-Opportunistic-Systems.pdf)." Vehicular Technology Conference (VTC Spring), 2011 IEEE 73rd. IEEE, 2011.
370. Sztyler, Timo, and Heiner Stuckenschmidt. "On-body localization of wearable devices: an investigation of position-aware activity recognition (https://sensor.informatik.uni-mannheim.de/publications/presentation/percom2016.pdf)." Pervasive Computing and Communications (PerCom), 2016 IEEE International Conference on. IEEE, 2016.
371. Zhi, Ying Xuan; Lukasik, Michelle; Li, Michael H.; Dolatabadi, Elham; Wang, Rosalie H.; Taati, Babak (2018). "Automatic Detection of Compensation During Robotic Stroke Rehabilitation Therapy" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788403). IEEE Journal of Translational Engineering in Health and Medicine. 6: 2100107. doi:10.1109/JTEHM.2017.2780836 (https://doi.org/10.1109%2FJTEHM.2017.2780836). ISSN 2168-2372 (https://www.worldcat.org/issn/2168-2372). PMC 5788403 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788403). PMID 29404226 (https://pubmed.ncbi.nlm.nih.gov/29404226).
372. Dolatabadi, Elham; Zhi, Ying Xuan; Ye, Bing; Coahran, Marge; Lupinacci, Giorgia; Mihailidis, Alex; Wang, Rosalie; Taati, Babak (23 May 2017). The toronto rehab stroke pose dataset to detect compensation during stroke rehabilitation therapy. ACM. pp. 375–381. doi:10.1145/3154862.3154925 (https://doi.org/10.1145%2F3154862.3154925). ISBN 9781450363631. S2CID 24581930 (https://api.semanticscholar.org/CorpusID:24581930).
373. "Toronto Rehab Stroke Pose Dataset" (https://www.kaggle.com/derekdb/toronto-robot-stroke-posture-dataset).
374. Jung, Merel M.; Poel, Mannes; Poppe, Ronald; Heylen, Dirk K. J. (1 March 2017). "Automatic recognition of touch gestures in the corpus of social touch". Journal on Multimodal User Interfaces. 11 (1): 81–96. doi:10.1007/s12193-016-0232-9 (https://doi.org/10.1007%2Fs12193-016-0232-9). ISSN 1783-8738 (https://www.worldcat.org/issn/1783-8738). S2CID 1802116 (https://api.semanticscholar.org/CorpusID:1802116).
375. Jung, M.M. (Merel) (1 June 2016). "Corpus of Social Touch (CoST)" (https://data.4tu.nl/articles/dataset/Corpus_of_Social_Touch_CoST_/12696869). University of Twente. doi:10.4121/uuid:5ef62345-3b3e-479c-8e1d-c922748c9b29 (https://doi.org/10.4121%2Fuuid%3A5ef62345-3b3e-479c-8e1d-c922748c9b29).
376. Aeberhard, S., D. Coomans, and O. De Vel. "Comparison of classifiers in high dimensional settings." Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep 92-02 (1992).
377. Basu, Sugato. "Semi-supervised clustering with limited background knowledge (http://www.aaai.org/Papers/AAAI/2004/AAAI04-138.pdf)." AAAI. 2004.
378. Tüfekci, Pınar (2014). "Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods". International Journal of Electrical Power & Energy Systems. 60: 126–140. doi:10.1016/j.ijepes.2014.02.027 (https://doi.org/10.1016%2Fj.ijepes.2014.02.027).
379. Kaya, Heysem, Pınar Tüfekci, and Fikret S. Gürgen. "Local and global learning methods for predicting power of a combined gas & steam turbine." International conference on emerging trends in computer and electronics engineering (ICETCEE'2012), Dubai. 2012.
380. Baldi, Pierre; Sadowski, Peter; Whiteson, Daniel (2014). "Searching for exotic particles in high-energy physics with deep learning". Nature Communications. 5: 2014. arXiv:1402.4735 (https://arxiv.org/abs/1402.4735). Bibcode:2014NatCo...5.4308B (https://ui.adsabs.harvard.edu/abs/2014NatCo...5.4308B). doi:10.1038/ncomms5308 (https://doi.org/10.1038%2Fncomms5308). PMID 24986233 (https://pubmed.ncbi.nlm.nih.gov/24986233). S2CID 195953 (https://api.semanticscholar.org/CorpusID:195953).
381. Baldi, Pierre; Sadowski, Peter; Whiteson, Daniel (2015). "Enhanced Higgs Boson to τ+ τ− Search with Deep Learning". Physical Review Letters. 114 (11): 111801. arXiv:1410.3469 (https://arxiv.org/abs/1410.3469). Bibcode:2015PhRvL.114k1801B (https://ui.adsabs.harvard.edu/abs/2015PhRvL.114k1801B). doi:10.1103/physrevlett.114.111801 (https://doi.org/10.1103%2Fphysrevlett.114.111801). PMID 25839260 (https://pubmed.ncbi.nlm.nih.gov/25839260). S2CID 2339142 (https://api.semanticscholar.org/CorpusID:2339142).
382. Adam-Bourdarios, C.; Cowan, G.; Germain-Renaud, C.; Guyon, I.; Kégl, B.; Rousseau, D. (2015). "The Higgs Machine Learning Challenge" (https://higgsml.lal.in2p3.fr/). Journal of Physics: Conference Series. 664 (7): 072015. Bibcode:2015JPhCS.664g2015A (https://ui.adsabs.harvard.edu/abs/2015JPhCS.664g2015A). doi:10.1088/1742-6596/664/7/072015 (https://doi.org/10.1088%2F1742-6596%2F664%2F7%2F072015).
383. Baldi, Pierre; Cranmer, Kyle; Faucett, Taylor; Sadowski, Peter; Whiteson, Daniel (2016). "Parameterized neural networks for high-energy physics". The European Physical Journal C. 76 (5): 235. arXiv:1601.07913 (https://arxiv.org/abs/1601.07913). Bibcode:2016EPJC...76..235B (https://ui.adsabs.harvard.edu/abs/2016EPJC...76..235B). doi:10.1140/epjc/s10052-016-4099-4 (https://doi.org/10.1140%2Fepjc%2Fs10052-016-4099-4). S2CID 254108545 (https://api.semanticscholar.org/CorpusID:254108545).
384. Ortigosa, I.; Lopez, R.; Garcia, J. "A neural networks approach to residuary resistance of sailing yachts prediction". Proceedings of the International Conference on Marine Engineering MARINE. 2007.
385. Gerritsma, J., R. Onnink, and A. Versluis. Geometry, resistance and stability of the delft systematic yacht hull series. Delft University of Technology, 1981.
386. Liu, Huan, and Hiroshi Motoda. Feature extraction, construction and selection: A data mining perspective (https://books.google.com/books?id=zi_0EdWW5fYC). Springer Science & Business Media, 1998.
387. Reich, Yoram. Converging to Ideal Design Knowledge by Learning. [Carnegie Mellon University], Engineering Design Research Center, 1989.
388. Todorovski, Ljupčo; Džeroski, Sašo (1999). "Experiments in Meta-level Learning with ILP" (https://link.springer.com/chapter/10.1007/978-3-540-48247-5_11). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. pp. 98–106. doi:10.1007/978-3-540-48247-5_11 (https://doi.org/10.1007%2F978-3-540-48247-5_11). ISBN 978-3-540-66490-1. S2CID 39382993 (https://api.semanticscholar.org/CorpusID:39382993).
389. Wang, Yong. A new approach to fitting linear models in high dimensional spaces (http://www.cs.waikato.ac.nz/~ml/publications/2000/thesis.pdf). Diss. The University of Waikato, 2000.
390. Kibler, Dennis; Aha, David W.; Albert, Marc K. (1989). "Instance‐based prediction of real‐valued attributes" (https://escholarship.org/uc/item/68f860zb). Computational Intelligence. 5 (2): 51–57. doi:10.1111/j.1467-8640.1989.tb00315.x (https://doi.org/10.1111%2Fj.1467-8640.1989.tb00315.x). S2CID 40800413 (https://api.semanticscholar.org/CorpusID:40800413).
391. Palmer, Christopher R., and Christos Faloutsos. "Electricity based external similarity of categorical attributes (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.989&rep=rep1&type=pdf)." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2003. 486–500.
392. Tsanas, Athanasios; Xifara, Angeliki (2012). "Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools". Energy and Buildings. 49: 560–567. doi:10.1016/j.enbuild.2012.03.003 (https://doi.org/10.1016%2Fj.enbuild.2012.03.003).
393. De Wilde, Pieter (2014). "The gap between predicted and measured energy performance of buildings: A framework for investigation". Automation in Construction. 41: 40–49. doi:10.1016/j.autcon.2014.02.009 (https://doi.org/10.1016%2Fj.autcon.2014.02.009).
394. Brooks, Thomas F., D. Stuart Pope, and Michael A. Marcolini. Airfoil self-noise and prediction (https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19890016302.pdf). Vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division, 1989.
395. Draper, David. "Assessment and propagation of model uncertainty (http://www2.denizyuret.com/ref/draper/assessment-and-propagation.pdf)." Journal of the Royal Statistical Society, Series B (Methodological) (1995): 45–97.
396. Lavine, Michael (1991). "Problems in extrapolation illustrated with space shuttle O-ring data". Journal of the American Statistical Association. 86 (416): 919–921. doi:10.1080/01621459.1991.10475132 (https://doi.org/10.1080%2F01621459.1991.10475132).
397. Wang, Jun, Bei Yu, and Les Gasser. "Concept tree based clustering visualization with shaded similarity matrices (https://www.researchgate.net/profile/Bei_Yu2/publication/228407462_Concept_Tree_Based_Ordering_for_Shaded_Similarity_Matrix/links/00b7d5175607b61d2e000000.pdf)." Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
398. Pettengill, Gordon H.; Ford, Peter G.; Johnson, William T. K.; Raney, R. Keith; Soderblom, Laurence A. (1991). "Magellan: Radar Performance and Data Products" (https://www.science.org/doi/abs/10.1126/science.252.5003.260). Science. 252 (5003): 260–265. Bibcode:1991Sci...252..260P (https://ui.adsabs.harvard.edu/abs/1991Sci...252..260P). doi:10.1126/science.252.5003.260 (https://doi.org/10.1126%2Fscience.252.5003.260). PMID 17769272 (https://pubmed.ncbi.nlm.nih.gov/17769272). S2CID 43398343 (https://api.semanticscholar.org/CorpusID:43398343).
399. Aharonian, F.; et al. (2008). "Energy spectrum of cosmic-ray electrons at TeV energies". Physical Review Letters. 101 (26): 261104. arXiv:0811.3894 (https://arxiv.org/abs/0811.3894). Bibcode:2008PhRvL.101z1104A (https://ui.adsabs.harvard.edu/abs/2008PhRvL.101z1104A). doi:10.1103/PhysRevLett.101.261104 (https://doi.org/10.1103%2FPhysRevLett.101.261104). hdl:2440/51450 (https://hdl.handle.net/2440%2F51450). PMID 19437632 (https://pubmed.ncbi.nlm.nih.gov/19437632). S2CID 41850528 (https://api.semanticscholar.org/CorpusID:41850528).
400. Bock, R. K.; et al. (2004). "Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope". Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. 516 (2): 511–528. Bibcode:2004NIMPA.516..511B (https://ui.adsabs.harvard.edu/abs/2004NIMPA.516..511B). doi:10.1016/j.nima.2003.08.157 (https://doi.org/10.1016%2Fj.nima.2003.08.157).
401. Li, Jinyan; et al. (2004). "Deeps: A new instance-based lazy discovery and classification system" (https://doi.org/10.1023%2Fb%3Amach.0000011804.08528.7d). Machine Learning. 54 (2): 99–124. doi:10.1023/b:mach.0000011804.08528.7d (https://doi.org/10.1023%2Fb%3Amach.0000011804.08528.7d).
402. Villaescusa-Navarro, Francisco; et al. (2022). "The CAMELS Multifield Data Set: Learning the Universe's Fundamental Parameters with Artificial Intelligence". The Astrophysical Journal Supplement Series. 259 (2): 61. arXiv:2109.10915 (https://arxiv.org/abs/2109.10915). Bibcode:2022ApJS..259...61V (https://ui.adsabs.harvard.edu/abs/2022ApJS..259...61V). doi:10.3847/1538-4365/ac5ab0 (https://doi.org/10.3847%2F1538-4365%2Fac5ab0). S2CID 237604997 (https://api.semanticscholar.org/CorpusID:237604997).
403. Siebert, Lee, and Tom Simkin. "Volcanoes of the world: an illustrated catalog of Holocene volcanoes and their eruptions." (2014).
404. Sikora, Marek; Wróbel, Łukasz (2010). "Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines" (https://www.infona.pl/resource/bwmeta1.element.baztech-article-BPZ5-0008-0008). Archives of Mining Sciences. 55 (1): 91–114.
405. Sikora, Marek, and Beata Sikora. "Rough natural hazards monitoring." Rough Sets: Selected Methods and Applications in Management and Engineering. Springer London, 2012. 163–179.
406. Addor, Nans; Newman, Andrew J.; Mizukami, Naoki; Clark, Martyn P. (20 October 2017). "The CAMELS data set: catchment attributes and meteorology for large-sample studies" (https://hess.copernicus.org/articles/21/5293/2017/). Hydrology and Earth System Sciences. 21 (10): 5293–5313. Bibcode:2017HESS...21.5293A (https://ui.adsabs.harvard.edu/abs/2017HESS...21.5293A). doi:10.5194/hess-21-5293-2017 (https://doi.org/10.5194%2Fhess-21-5293-2017). ISSN 1607-7938 (https://www.worldcat.org/issn/1607-7938).
407. Newman, A. J.; Clark, M. P.; Sampson, K.; Wood, A.; Hay, L. E.; Bock, A.; Viger, R. J.; Blodgett, D.; Brekke, L.; Arnold, J. R.; Hopson, T. (14 January 2015). "Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance" (https://hess.copernicus.org/articles/19/209/2015/). Hydrology and Earth System Sciences. 19 (1): 209–223. Bibcode:2015HESS...19..209N (https://ui.adsabs.harvard.edu/abs/2015HESS...19..209N). doi:10.5194/hess-19-209-2015 (https://doi.org/10.5194%2Fhess-19-209-2015). ISSN 1607-7938 (https://www.worldcat.org/issn/1607-7938).
408. Alvarez-Garreton, Camila; Mendoza, Pablo A.; Boisier, Juan Pablo; Addor, Nans; Galleguillos, Mauricio; Zambrano-Bigiarini, Mauricio; Lara, Antonio; Puelma, Cristóbal; Cortes, Gonzalo; Garreaud, Rene; McPhee, James (13 November 2018). "The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies – Chile dataset" (https://hess.copernicus.org/articles/22/5817/2018/). Hydrology and Earth System Sciences. 22 (11): 5817–5846. Bibcode:2018HESS...22.5817A (https://ui.adsabs.harvard.edu/abs/2018HESS...22.5817A). doi:10.5194/hess-22-5817-2018 (https://doi.org/10.5194%2Fhess-22-5817-2018). ISSN 1607-7938 (https://www.worldcat.org/issn/1607-7938). S2CID 133955609 (https://api.semanticscholar.org/CorpusID:133955609).
409. Chagas, Vinícius B. P.; Chaffe, Pedro L. B.; Addor, Nans; Fan, Fernando M.; Fleischmann, Ayan S.; Paiva, Rodrigo C. D.; Siqueira, Vinícius A. (8 September 2020). "CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil" (https://essd.copernicus.org/articles/12/2075/2020/). Earth System Science Data. 12 (3): 2075–2096. Bibcode:2020ESSD...12.2075C (https://ui.adsabs.harvard.edu/abs/2020ESSD...12.2075C). doi:10.5194/essd-12-2075-2020 (https://doi.org/10.5194%2Fessd-12-2075-2020). ISSN 1866-3516 (https://www.worldcat.org/issn/1866-3516). S2CID 234737197 (https://api.semanticscholar.org/CorpusID:234737197).
410. Coxon, Gemma; Addor, Nans; Bloomfield, John P.; Freer, Jim; Fry, Matt; Hannaford, Jamie; Howden, Nicholas J. K.; Lane, Rosanna; Lewis, Melinda; Robinson, Emma L.; Wagener, Thorsten (12 October 2020). "CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain" (https://essd.copernicus.org/articles/12/2459/2020/). Earth System Science Data. 12 (4): 2459–2483. Bibcode:2020ESSD...12.2459C (https://ui.adsabs.harvard.edu/abs/2020ESSD...12.2459C). doi:10.5194/essd-12-2459-2020 (https://doi.org/10.5194%2Fessd-12-2459-2020). ISSN 1866-3516 (https://www.worldcat.org/issn/1866-3516). S2CID 226192657 (https://api.semanticscholar.org/CorpusID:226192657).
411. Fowler, Keirnan J. A.; Acharya, Suwash Chandra; Addor, Nans; Chou, Chihchung; Peel, Murray C. (6 August 2021). "CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia" (https://essd.copernicus.org/articles/13/3847/2021/). Earth System Science Data. 13 (8): 3847–3867. Bibcode:2021ESSD...13.3847F (https://ui.adsabs.harvard.edu/abs/2021ESSD...13.3847F). doi:10.5194/essd-13-3847-2021 (https://doi.org/10.5194%2Fessd-13-3847-2021). ISSN 1866-3516 (https://www.worldcat.org/issn/1866-3516). S2CID 238796784 (https://api.semanticscholar.org/CorpusID:238796784).
412. Klingler, Christoph; Schulz, Karsten; Herrnegger, Mathew (16 September 2021). "LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe" (https://essd.copernicus.org/articles/13/4529/2021/). Earth System Science Data. 13 (9): 4529–4565. Bibcode:2021ESSD...13.4529K (https://ui.adsabs.harvard.edu/abs/2021ESSD...13.4529K). doi:10.5194/essd-13-4529-2021 (https://doi.org/10.5194%2Fessd-13-4529-2021). ISSN 1866-3516 (https://www.worldcat.org/issn/1866-3516). S2CID 240533508 (https://api.semanticscholar.org/CorpusID:240533508).
413. Yeh, I–C (1998). "Modeling of strength of high-performance concrete using artificial neural networks". Cement and Concrete Research. 28 (12): 1797–1808. doi:10.1016/s0008-8846(98)00165-3 (https://doi.org/10.1016%2Fs0008-8846%2898%2900165-3).
414. Zarandi, MH Fazel; et al. (2008). "Fuzzy polynomial neural networks for approximation of the compressive strength of concrete". Applied Soft Computing. 8 (1): 488–498. Bibcode:2008ApSoC...8...79S (https://ui.adsabs.harvard.edu/abs/2008ApSoC...8...79S). doi:10.1016/j.asoc.2007.02.010 (https://doi.org/10.1016%2Fj.asoc.2007.02.010).
415. Yeh, I. "Modeling slump of concrete with fly ash and superplasticizer." Computers and Concrete 5.6 (2008): 559–572.
416. Gencel, Osman; et al. (2011). "Comparison of artificial neural networks and general linear model approaches for the analysis of abrasive wear of concrete". Construction and Building Materials. 25 (8): 3486–3494. doi:10.1016/j.conbuildmat.2011.03.040 (https://doi.org/10.1016%2Fj.conbuildmat.2011.03.040).
417. Dietterich, Thomas G., et al. "A comparison of dynamic reposing and tangent distance for drug activity prediction (http://papers.nips.cc/paper/781-a-comparison-of-dynamic-reposing-and-tangent-distance-for-drug-activity-prediction.pdf)." Advances in Neural Information Processing Systems (1994): 216–216.
418. Buscema, Massimo, William J. Tastle, and Stefano Terzi. "Meta net: A new meta-classifier family (https://www.researchgate.net/profile/Massimo_Buscema/publication/13731626_MetaNet_The_Theory_of_Independent_Judges/links/0deec52baf2937fc8e000000.pdf)." Data Mining Applications Using Artificial Adaptive Systems. Springer New York, 2013. 141–182.
419. Amoradnejad, Issa; Amoradnejad, Rahimberdi; et al. (2022). "Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people" (http://workshop-proceedings.icwsm.org/abstract?id=2022_82). Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM). 3: 1–4. doi:10.36190/2022.82 (https://doi.org/10.36190%2F2022.82). S2CID 249668669 (https://api.semanticscholar.org/CorpusID:249668669).
420. "Age Dataset" (https://github.com/Moradnejad/AgeDataset). GitHub. 7 June 2022.
421. "Synthetic Fundus Dataset" (https://web.archive.org/web/20211129155047/http://math.unipa.it/cvalenti/fundus/). Archived from the original (http://math.unipa.it/cvalenti/fundus/) on 29 November 2021. Retrieved 22 February 2023.
422. Lo Castro, Dario; et al. (2020). "A visual framework to create photorealistic retinal vessels for diagnosis purposes". Journal of Biomedical Informatics. 108: 103490. doi:10.1016/j.jbi.2020.103490 (https://doi.org/10.1016%2Fj.jbi.2020.103490). PMID 32640292 (https://pubmed.ncbi.nlm.nih.gov/32640292). S2CID 220429697 (https://api.semanticscholar.org/CorpusID:220429697).
423. Ingber, Lester (1997). "Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography". Physical Review E. 55 (4): 4578–4593. arXiv:physics/0001052 (https://arxiv.org/abs/physics/0001052). Bibcode:1997PhRvE..55.4578I (https://ui.adsabs.harvard.edu/abs/1997PhRvE..55.4578I). doi:10.1103/PhysRevE.55.4578 (https://doi.org/10.1103%2FPhysRevE.55.4578). S2CID 6390999 (https://api.semanticscholar.org/CorpusID:6390999).
424. Hoffmann, Ulrich; Vesin, Jean-Marc; Ebrahimi, Touradj; Diserens, Karin (2008). "An efficient P300-based brain–computer interface for disabled subjects". Journal of Neuroscience Methods. 167 (1): 115–125. CiteSeerX 10.1.1.352.4630 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.352.4630). doi:10.1016/j.jneumeth.2007.03.005 (https://doi.org/10.1016%2Fj.jneumeth.2007.03.005). PMID 17445904 (https://pubmed.ncbi.nlm.nih.gov/17445904). S2CID 9648828 (https://api.semanticscholar.org/CorpusID:9648828).
425. Donchin, Emanuel; Spencer, Kevin M.; Wijesinghe, Ranjith (2000). "The mental prosthesis: assessing the speed of a P300-based brain-computer interface". IEEE Transactions on Rehabilitation Engineering. 8 (2): 174–179. doi:10.1109/86.847808 (https://doi.org/10.1109%2F86.847808). PMID 10896179 (https://pubmed.ncbi.nlm.nih.gov/10896179).
426. Detrano, Robert; et al. (1989). "International application of a new probability algorithm for the diagnosis of coronary artery disease". The American Journal of Cardiology. 64 (5): 304–310. doi:10.1016/0002-9149(89)90524-9 (https://doi.org/10.1016%2F0002-9149%2889%2990524-9). PMID 2756873 (https://pubmed.ncbi.nlm.nih.gov/2756873).
427. Bradley, Andrew P (1997). "The use of the area under the ROC curve in the evaluation of machine learning algorithms" (http://espace.library.uq.edu.au/view/UQ:8925/pr-t.pdf) (PDF). Pattern Recognition. 30 (7): 1145–1159. Bibcode:1997PatRe..30.1145B (https://ui.adsabs.harvard.edu/abs/1997PatRe..30.1145B). doi:10.1016/s0031-3203(96)00142-2 (https://doi.org/10.1016%2Fs0031-3203%2896%2900142-2). S2CID 13806304 (https://api.semanticscholar.org/CorpusID:13806304).
428. Street, W. N.; Wolberg, W. H.; Mangasarian, O. L. (1993). "Nuclear feature extraction for breast tumor diagnosis" (https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1905/0000/Nuclear-feature-extraction-for-breast-tumor-diagnosis/10.1117/12.148698.short). In Acharya, Raj S; Goldgof, Dmitry B (eds.). Biomedical Image Processing and Biomedical Visualization (http://digital.library.wisc.edu/1793/59692). Vol. 1905. pp. 861–870. doi:10.1117/12.148698 (https://doi.org/10.1117%2F12.148698). S2CID 14922543 (https://api.semanticscholar.org/CorpusID:14922543).
429. Demir, Cigdem, and Bülent Yener. "Automated cancer diagnosis based on histopathological images: a systematic survey (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.1199&rep=rep1&type=pdf)." Rensselaer Polytechnic Institute, Tech. Rep (2005).
430. Abuse, Substance. "Mental Health Services Administration, Results from the 2010 National Survey on Drug Use and Health: Summary of National Findings, NSDUH Series H-41, HHS Publication No. (SMA) 11-4658." Rockville, MD: Substance Abuse and Mental Health Services Administration 201 (2011).
431. Hong, Zi-Quan; Yang, Jing-Yu (1991). "Optimal discriminant plane for a small number of samples and design method of classifier on the plane". Pattern Recognition. 24 (4): 317–324. Bibcode:1991PatRe..24..317H (https://ui.adsabs.harvard.edu/abs/1991PatRe..24..317H). doi:10.1016/0031-3203(91)90074-f (https://doi.org/10.1016%2F0031-3203%2891%2990074-f).
432. Li, Jinyan, and Limsoon Wong. "Using rules to analyse bio-medical data: a comparison between C4. 5 and PCL." Advances in Web-Age Information Management. Springer Berlin Heidelberg, 2003. 254–265.
433. Güvenir, H. Altay, et al. "A supervised machine learning algorithm for arrhythmia analysis (http://repository.bilkent.edu.tr/bitstream/handle/11693/27699/bilkent-research-paper.pdf?sequence=1)." Computers in Cardiology 1997. IEEE, 1997.
434. Lagus, Krista, et al. "Independent variable group analysis in learning compact representations for data (http://users.ics.aalto.fi/ahonkela/papers/Lagus05akrr.pdf)." Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), T. Honkela, V. Könönen, M. Pöllä, and O. Simula, Eds., Espoo, Finland. 2005.
435. Strack, Beata, et al. "Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records (http://downloads.hindawi.com/journals/bmri/2014/781670.pdf)." BioMed Research International 2014; 2014
436. Rubin, Daniel J (2015). "Hospital readmission of patients with diabetes". Current Diabetes Reports. 15 (4): 1–9. doi:10.1007/s11892-015-0584-7 (https://doi.org/10.1007%2Fs11892-015-0584-7). PMID 25712258 (https://pubmed.ncbi.nlm.nih.gov/25712258). S2CID 3908599 (https://api.semanticscholar.org/CorpusID:3908599).
437. Antal, Bálint; Hajdu, András (2014). "An ensemble-based system for automatic screening of diabetic retinopathy". Knowledge-Based Systems. 60 (2014): 20–27. arXiv:1410.8576 (https://arxiv.org/abs/1410.8576). Bibcode:2014arXiv1410.8576A (https://ui.adsabs.harvard.edu/abs/2014arXiv1410.8576A). doi:10.1016/j.knosys.2013.12.023 (https://doi.org/10.1016%2Fj.knosys.2013.12.023). S2CID 13984326 (https://api.semanticscholar.org/CorpusID:13984326).
438. Haloi, Mrinal (2015). "Improved Microaneurysm Detection using Deep Neural Networks". arXiv:1505.04424 (https://arxiv.org/abs/1505.04424) [cs.CV (https://arxiv.org/archive/cs.CV)].
439. ELIE, Guillaume PATRY, Gervais GAUTHIER, Bruno LAY, Julien ROGER, Damien. "ADCIS Download Third Party: Messidor Database" (http://www.adcis.net/en/Download-Third-Party/Messidor.htmldownload.php). adcis.net. Retrieved 25 February 2018.
440. Decencière, Etienne; Zhang, Xiwei; Cazuguel, Guy; Lay, Bruno; Cochener, Béatrice; Trone, Caroline; Gain, Philippe; Ordonez, Richard; Massin, Pascale (26 August 2014). "Feedback on a Publicly Distributed Image Database: The Messidor Database" (https://doi.org/10.5566%2Fias.1155). Image Analysis & Stereology. 33 (3): 231–234. doi:10.5566/ias.1155 (https://doi.org/10.5566%2Fias.1155). ISSN 1854-5165 (https://www.worldcat.org/issn/1854-5165).
441. Bagirov, A. M.; et al. (2003). "Unsupervised and supervised data classification via nonsmooth and global optimization". Top. 11 (1): 1–75. CiteSeerX 10.1.1.1.6429 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.6429). doi:10.1007/bf02578945 (https://doi.org/10.1007%2Fbf02578945). S2CID 14165678 (https://api.semanticscholar.org/CorpusID:14165678).
442. Fung, Glenn, et al. "A fast iterative algorithm for fisher discriminant using heterogeneous kernels (https://jinbo-bi.uconn.edu/wp-content/uploads/sites/2638/2018/12/icml04_kernel.pdf)." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
443. Quinlan, John Ross, et al. "Inductive knowledge acquisition: a case study." Proceedings of the Second Australian Conference on Applications of expert systems. Addison-Wesley Longman Publishing Co., Inc., 1987.
444. Zhou, Zhi-Hua; Jiang, Yuan (2004). "NeC4. 5: neural ensemble based C4. 5". IEEE Transactions on Knowledge and Data Engineering. 16 (6): 770–773. CiteSeerX 10.1.1.1.8430 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.8430). doi:10.1109/tkde.2004.11 (https://doi.org/10.1109%2Ftkde.2004.11). S2CID 1024861 (https://api.semanticscholar.org/CorpusID:1024861).
445. Er, Orhan; et al. (2012). "An approach based on probabilistic neural network for diagnosis of Mesothelioma's disease". Computers & Electrical Engineering. 38 (1): 75–81. doi:10.1016/j.compeleceng.2011.09.001 (https://doi.org/10.1016%2Fj.compeleceng.2011.09.001).
446. Er, Orhan, A. Çetin Tanrikulu, and Abdurrahman Abakay. "Use of artificial intelligence techniques for diagnosis of malignant pleural mesothelioma (https://dergipark.org.tr/download/article-file/54521)." Dicle Tıp Dergisi 42.1 (2015).
447. Li, Michael H.; Mestre, Tiago A.; Fox, Susan H.; Taati, Babak (25 July 2017). "Vision-Based Assessment of Parkinsonism and Levodopa-Induced Dyskinesia with Deep Learning Pose Estimation" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6219082). Journal of Neuroengineering and Rehabilitation. 15 (1): 97. arXiv:1707.09416 (https://arxiv.org/abs/1707.09416). Bibcode:2017arXiv170709416L (https://ui.adsabs.harvard.edu/abs/2017arXiv170709416L). doi:10.1186/s12984-018-0446-z (https://doi.org/10.1186%2Fs12984-018-0446-z). PMC 6219082 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6219082). PMID 30400914 (https://pubmed.ncbi.nlm.nih.gov/30400914).
448. Li, Michael H.; Mestre, Tiago A.; Fox, Susan H.; Taati, Babak (May 2018). "Automated assessment of levodopa-induced dyskinesia: Evaluating the responsiveness of video-based features". Parkinsonism & Related Disorders. 53: 42–45. doi:10.1016/j.parkreldis.2018.04.036 (https://doi.org/10.1016%2Fj.parkreldis.2018.04.036). ISSN 1353-8020 (https://www.worldcat.org/issn/1353-8020). PMID 29748112 (https://pubmed.ncbi.nlm.nih.gov/29748112). S2CID 13666294 (https://api.semanticscholar.org/CorpusID:13666294).
449. "Parkinson's Vision-Based Pose Estimation Dataset | Kaggle" (https://www.kaggle.com/limi44/parkinsons-visionbased-pose-estimation-dataset/home). kaggle.com. Retrieved 22 August 2018.
451. Javadi, Soroush; Mirroshandel, Seyed Abolghasem (2019). "A novel deep learning method for automatic assessment of human sperm images". Computers in Biology and Medicine. 109: 182–194. doi:10.1016/j.compbiomed.2019.04.030 (https://doi.org/10.1016%2Fj.compbiomed.2019.04.030). ISSN 0010-4825 (https://www.worldcat.org/issn/0010-4825). PMID 31059902 (https://pubmed.ncbi.nlm.nih.gov/31059902). S2CID 146809768 (https://api.semanticscholar.org/CorpusID:146809768).
452. "soroushj/mhsma-dataset: MHSMA: The Modified Human Sperm Morphology Analysis Dataset" (https://github.com/soroushj/mhsma-dataset). github.com. Retrieved 3 May 2019.
453. Clark, David, Zoltan Schreter, and Anthony Adams. "A quantitative comparison of dystal and backpropagation." Proceedings of 1996 Australian Conference on Neural Networks. 1996.
454. Jiang, Yuan, and Zhi-Hua Zhou. "Editing training data for kNN classifiers with neural network ensemble (https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/isnn04a.pdf)." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 356–361.
455. Ontañón, Santiago, and Enric Plaza. "On similarity measures based on a refinement lattice." Case-Based Reasoning Research and Development. Springer Berlin Heidelberg, 2009. 240–255.
456. "PLF data inventory" (https://github.com/Animal-Data-Inventory/PLFDataInventory). GitHub. 5 November 2021.
457. Higuera, Clara; Gardiner, Katheleen J.; Cios, Krzysztof J. (2015). "Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4482027). PLOS ONE. 10 (6): e0129126. Bibcode:2015PLoSO..1029126H (https://ui.adsabs.harvard.edu/abs/2015PLoSO..1029126H). doi:10.1371/journal.pone.0129126 (https://doi.org/10.1371%2Fjournal.pone.0129126). PMC 4482027 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4482027). PMID 26111164 (https://pubmed.ncbi.nlm.nih.gov/26111164).
458. Ahmed, Md Mahiuddin; et al. (2015). "Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4368539). PLOS ONE. 10 (3): e0119491. Bibcode:2015PLoSO..1019491A (https://ui.adsabs.harvard.edu/abs/2015PLoSO..1019491A). doi:10.1371/journal.pone.0119491 (https://doi.org/10.1371%2Fjournal.pone.0119491). PMC 4368539 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4368539). PMID 25793384 (https://pubmed.ncbi.nlm.nih.gov/25793384).
459. Langley, PAT (2014). "Trading off simplicity and coverage in incremental concept learning" (https://web.archive.org/web/20190806184005/https://www.westmont.edu/~iba/pubs/hillary-paper.pdf) (PDF). Machine Learning Proceedings. 1988: 73. Archived from the original (https://www.westmont.edu/~iba/pubs/hillary-paper.pdf) (PDF) on 6 August 2019. Retrieved 6 August 2019.
460. "Mushroom Data Set 2020" (https://mushroom.mathematik.uni-marburg.de/). mushroom.mathematik.uni-marburg.de. Retrieved 6 April 2021.
461. Wagner, Dennis; Heider, Dominik; Hattab, Georges (14 April 2021). "Mushroom data creation, curation, and simulation to support classification tasks" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8046754). Scientific Reports. 11 (1): 8134. Bibcode:2021NatSR..11.8134W (https://ui.adsabs.harvard.edu/abs/2021NatSR..11.8134W). doi:10.1038/s41598-021-87602-3 (https://doi.org/10.1038%2Fs41598-021-87602-3). ISSN 2045-2322 (https://www.worldcat.org/issn/2045-2322). PMC 8046754 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8046754). PMID 33854157 (https://pubmed.ncbi.nlm.nih.gov/33854157).
462. Cortez, Paulo, and Aníbal de Jesus Raimundo Morais. "A data mining approach to predict forest fires using meteorological data." (2007).
463. Farquad, M. A. H.; Ravi, V.; Raju, S. Bapi (2010). "Support vector regression based hybrid rule extraction methods for forecasting". Expert Systems with Applications. 37 (8): 5577–5589. doi:10.1016/j.eswa.2010.02.055 (https://doi.org/10.1016%2Fj.eswa.2010.02.055).
450. Shannon, Paul; et al. (2003). "Cytoscape: a software environment 464. Fisher, Ronald A (1936). "The use of multiple measurements in
for integrated models of biomolecular interaction networks" (https:// taxonomic problems". Annals of Eugenics. 7 (2): 179–188.
www.ncbi.nlm.nih.gov/pmc/articles/PMC403769). Genome doi:10.1111/j.1469-1809.1936.tb02137.x (https://doi.org/10.1111%2
Research. 13 (11): 2498–2504. doi:10.1101/gr.1239303 (https://doi. Fj.1469-1809.1936.tb02137.x). hdl:2440/15227 (https://hdl.handle.n
org/10.1101%2Fgr.1239303). PMC 403769 (https://www.ncbi.nlm.ni et/2440%2F15227).
h.gov/pmc/articles/PMC403769). PMID 14597658 (https://pubmed.n
cbi.nlm.nih.gov/14597658).
465. Ghahramani, Zoubin, and Michael I. Jordan. "Supervised learning from incomplete data via an EM approach (http://papers.nips.cc/paper/767-supervised-learning-from-incomplete-data-via-an-em-approach.pdf)." Advances in neural information processing systems 6. 1994.
466. Mallah, Charles; Cope, James; Orwell, James (2013). "Plant leaf classification using probabilistic integration of shape, texture and margin features" (https://www.researchgate.net/publication/266632357). Signal Processing, Pattern Recognition and Applications. 5: 1.
467. Yahiaoui, Itheri, Olfa Mzoughi, and Nozha Boujemaa. "Leaf shape descriptor for tree species identification (http://www.cmlab.csie.ntu.edu.tw/~zenic/Data/Download/ICME2012/Conference/data/4711a254.pdf) Archived (https://web.archive.org/web/20190806184006/http://www.cmlab.csie.ntu.edu.tw/~zenic/Data/Download/ICME2012/Conference/data/4711a254.pdf) 6 August 2019 at the Wayback Machine." Multimedia and Expo (ICME), 2012 IEEE International Conference on. IEEE, 2012.
468. Tan, Ming, and Larry Eshelman. "Using weighted networks to represent classification knowledge in noisy domains (https://www.sciencedirect.com/science/article/pii/B9780934613644500189)." Proceedings of the Fifth International Conference on Machine Learning. 1988.
469. Charytanowicz, Małgorzata, et al. "Complete gradient clustering algorithm for features analysis of x-ray images (http://home.agh.edu.pl/~kulpi/publ/Charytanowicz_Niewczas_Kulczycki_Kowalski_Lukasik_Zak_-_Information_Technologies_in_Biomedicine_-_2010.pdf)." Information technologies in biomedicine. Springer Berlin Heidelberg, 2010. 15–24.
470. Sanchez, Mauricio A.; et al. (2014). "Fuzzy granular gravitational clustering algorithm for multivariate data". Information Sciences. 279: 498–511. doi:10.1016/j.ins.2014.04.005 (https://doi.org/10.1016%2Fj.ins.2014.04.005).
471. Blackard, Jock A.; Dean, Denis J. (1999). "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables". Computers and Electronics in Agriculture. 24 (3): 131–151. CiteSeerX 10.1.1.128.2475 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.2475). doi:10.1016/s0168-1699(99)00046-0 (https://doi.org/10.1016%2Fs0168-1699%2899%2900046-0). S2CID 13985407 (https://api.semanticscholar.org/CorpusID:13985407).
472. Fürnkranz, Johannes. "Round robin rule learning (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.9520)." Proceedings of the 18th International Conference on Machine Learning (ICML-01): 146–153. 2001.
473. Li, Song; Assmann, Sarah M.; Albert, Réka (2006). "Predicting essential components of signal transduction networks: a dynamic model of guard cell abscisic acid signaling" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564158). PLOS Biol. 4 (10): e312. arXiv:q-bio/0610012 (https://arxiv.org/abs/q-bio/0610012). Bibcode:2006q.bio....10012L (https://ui.adsabs.harvard.edu/abs/2006q.bio....10012L). doi:10.1371/journal.pbio.0040312 (https://doi.org/10.1371%2Fjournal.pbio.0040312). PMC 1564158 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564158). PMID 16968132 (https://pubmed.ncbi.nlm.nih.gov/16968132).
474. Munisami, Trishen; et al. (2015). "Plant Leaf Recognition Using Shape Features and Colour Histogram with K-nearest Neighbour Classifiers" (https://doi.org/10.1016%2Fj.procs.2015.08.095). Procedia Computer Science. 58: 740–747. doi:10.1016/j.procs.2015.08.095 (https://doi.org/10.1016%2Fj.procs.2015.08.095).
475. Li, Bai (2016). "Atomic potential matching: An evolutionary target recognition approach based on edge features". Optik. 127 (5): 3162–3168. Bibcode:2016Optik.127.3162L (https://ui.adsabs.harvard.edu/abs/2016Optik.127.3162L). doi:10.1016/j.ijleo.2015.11.186 (https://doi.org/10.1016%2Fj.ijleo.2015.11.186).
476. Nilsback, Maria-Elena, and Andrew Zisserman. "A visual vocabulary for flower classification (http://www.robots.ox.ac.uk/~men/papers/nilsback_cvpr06.pdf)." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
477. Giselsson, Thomas M.; et al. (2017). "A Public Image Database for Benchmark of Plant Seedling Classification Algorithms". arXiv:1711.05458 (https://arxiv.org/abs/1711.05458) [cs.CV (https://arxiv.org/archive/cs.CV)].
478. Muresan, Horea; Oltean, Mihai (2018). "Fruit recognition from images using deep learning" (https://www.researchgate.net/publication/321475443). Acta Univ. Sapientiae, Informatica. 10 (1): 26–42. doi:10.2478/ausi-2018-0002 (https://doi.org/10.2478%2Fausi-2018-0002).
479. Oltean, Mihai; Muresan, Horea (2017). "A dataset with fruit images on Kaggle" (https://www.kaggle.com/moltean/fruits).
480. Nakai, Kenta; Kanehisa, Minoru (1991). "Expert system for predicting protein localization sites in gram-negative bacteria". Proteins: Structure, Function, and Bioinformatics. 11 (2): 95–110. doi:10.1002/prot.340110203 (https://doi.org/10.1002%2Fprot.340110203). PMID 1946347 (https://pubmed.ncbi.nlm.nih.gov/1946347). S2CID 27606447 (https://api.semanticscholar.org/CorpusID:27606447).
481. Ling, Charles X., et al. "Decision trees with minimal costs (https://cling.csd.uwo.ca/cs860/ICML04-Ling.pdf)." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
482. Mahé, Pierre, et al. "Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum (https://academic.oup.com/bioinformatics/article/30/9/1280/237488)." Bioinformatics (2014): btu022.
483. Barbano, Duane; et al. (2015). "Rapid characterization of microalgae and microalgae mixtures using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS)" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4536233). PLOS ONE. 10 (8): e0135337. Bibcode:2015PLoSO..1035337B (https://ui.adsabs.harvard.edu/abs/2015PLoSO..1035337B). doi:10.1371/journal.pone.0135337 (https://doi.org/10.1371%2Fjournal.pone.0135337). PMC 4536233 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4536233). PMID 26271045 (https://pubmed.ncbi.nlm.nih.gov/26271045).
484. Horton, Paul; Nakai, Kenta (1996). "A probabilistic classification system for predicting the cellular localization sites of proteins" (https://www.aaai.org/Papers/ISMB/1996/ISMB96-012.pdf) (PDF). ISMB-96 Proceedings. 4: 109–15. PMID 8877510 (https://pubmed.ncbi.nlm.nih.gov/8877510).
485. Allwein, Erin L.; Schapire, Robert E.; Singer, Yoram (2001). "Reducing multiclass to binary: A unifying approach for margin classifiers" (http://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf) (PDF). The Journal of Machine Learning Research. 1: 113–141.
486. Mayr, Andreas; Klambauer, Guenter; Unterthiner, Thomas; Hochreiter, Sepp (2016). "DeepTox: Toxicity Prediction Using Deep Learning" (http://bioinf.jku.at/research/DeepTox/tox21.html). Frontiers in Environmental Science. 3: 80. doi:10.3389/fenvs.2015.00080 (https://doi.org/10.3389%2Ffenvs.2015.00080).
487. Lavin, Alexander; Ahmad, Subutai (12 October 2015). Evaluating Real-time Anomaly Detection Algorithms – the Numenta Anomaly Benchmark. p. 38. arXiv:1510.03336 (https://arxiv.org/abs/1510.03336). doi:10.1109/ICMLA.2015.141 (https://doi.org/10.1109%2FICMLA.2015.141). ISBN 978-1-5090-0287-0. S2CID 6842305 (https://api.semanticscholar.org/CorpusID:6842305).
488. Iurii D. Katser; Vyacheslav O. Kozitsin. "SKAB GitHub repository" (https://github.com/waico/skab). GitHub. Retrieved 12 January 2021.
489. Iurii D. Katser; Vyacheslav O. Kozitsin (2020). "Skoltech Anomaly Benchmark (SKAB)" (https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab). Kaggle. doi:10.34740/KAGGLE/DSV/1693952 (https://doi.org/10.34740%2FKAGGLE%2FDSV%2F1693952). Retrieved 12 January 2021.
490. Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891. doi:10.1007/s10618-015-0444-8 (https://doi.org/10.1007%2Fs10618-015-0444-8). ISSN 1384-5810 (https://www.worldcat.org/issn/1384-5810). S2CID 1952214 (https://api.semanticscholar.org/CorpusID:1952214).
491. Ann-Kathrin Hartmann, Tommaso Soru, Edgard Marx. Generating a Large Dataset for Neural Question Answering over the DBpedia Knowledge Base (https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base). 2018.
492. Tommaso Soru, Edgard Marx, Diego Moussallem, Andre Valdestilhas, Diego Esteves, Ciro Baron. SPARQL as a Foreign Language. 2018.
493. Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. A Vietnamese Dataset for Evaluating Machine Reading Comprehension (https://www.aclweb.org/anthology/2020.coling-main.233.pdf). COLING 2020.
494. Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. Enhancing Lexical-Based Approach With External Knowledge for Vietnamese Multiple-Choice Machine Reading Comprehension (https://ieeexplore.ieee.org/document/9247161). IEEE Access. 2020.
495. Anantha, Raviteja; Vakulenko, Svitlana; Tu, Zhucheng; Longpre, Shayne; Pulman, Stephen; Chappidi, Srinivas (2020). "Open-Domain Question Answering Goes Conversational via Question Rewriting". arXiv:2010.04898 (https://arxiv.org/abs/2010.04898) [cs.IR (https://arxiv.org/archive/cs.IR)].
496. Khashabi, Daniel; Min, Sewon; Khot, Tushar; Sabharwal, Ashish; Tafjord, Oyvind; Clark, Peter; Hajishirzi, Hannaneh (November 2020). "UNIFIEDQA: Crossing Format Boundaries with a Single QA System" (https://aclanthology.org/2020.findings-emnlp.171). Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics: 1896–1907. arXiv:2005.00700 (https://arxiv.org/abs/2005.00700). doi:10.18653/v1/2020.findings-emnlp.171 (https://doi.org/10.18653%2Fv1%2F2020.findings-emnlp.171). S2CID 218487109 (https://api.semanticscholar.org/CorpusID:218487109).
497. Taskmaster (https://github.com/google-research-datasets/Taskmaster), Google Research Datasets, 17 December 2022, retrieved 7 January 2023
498. Byrne, Bill; Krishnamoorthi, Karthik; Sankar, Chinnadhurai; Neelakantan, Arvind; Duckworth, Daniel; Yavuz, Semih; Goodrich, Ben; Dubey, Amit; Cedilnik, Andy; Kim, Kyu-Young (1 September 2019). "Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset". arXiv:1909.05358 (https://arxiv.org/abs/1909.05358) [cs.CL (https://arxiv.org/archive/cs.CL)].
499. Yasunaga, Michihiro; Liang, Percy (21 November 2020). "Graph-based, Self-Supervised Program Repair from Diagnostic Feedback" (https://proceedings.mlr.press/v119/yasunaga20a.html). International Conference on Machine Learning. PMLR: 10799–10808. arXiv:2005.10636 (https://arxiv.org/abs/2005.10636).
500. Wang, Yizhong; Mishra, Swaroop; Alipoormolabashi, Pegah; Kordi, Yeganeh; Mirzaei, Amirreza; Arunkumar, Anjana; Ashok, Arjun; Dhanasekaran, Arut Selvan; Naik, Atharva; Stap, David; Pathak, Eshaan; Karamanolakis, Giannis; Lai, Haizhi Gary; Purohit, Ishan; Mondal, Ishani (24 October 2022). "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks". arXiv:2204.07705 (https://arxiv.org/abs/2204.07705) [cs.CL (https://arxiv.org/archive/cs.CL)].
501. Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Quan Ngoc; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (7 August 2016), The LAMBADA dataset (https://zenodo.org/record/2630551), doi:10.5281/zenodo.2630551 (https://doi.org/10.5281%2Fzenodo.2630551), retrieved 7 January 2023
502. Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Ngoc Quan; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (August 2016). "The LAMBADA dataset: Word prediction requiring a broad discourse context" (https://aclanthology.org/P16-1144). Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics: 1525–1534. doi:10.18653/v1/P16-1144 (https://doi.org/10.18653%2Fv1%2FP16-1144). hdl:10230/32702 (https://hdl.handle.net/10230%2F32702). S2CID 2381275 (https://api.semanticscholar.org/CorpusID:2381275).
503. Wei, Jason; Bosma, Maarten; Zhao, Vincent; Guu, Kelvin; Yu, Adams Wei; Lester, Brian; Du, Nan; Dai, Andrew M.; Le, Quoc V. (10 February 2022). "Finetuned Language Models are Zero-Shot Learners" (https://openreview.net/forum?id=gEZrGCozdqR). arXiv:2109.01652 (https://arxiv.org/abs/2109.01652).
504. "Working with ATT&CK | MITRE ATT&CK®" (https://attack.mitre.org/resources/working-with-attack/). attack.mitre.org. Retrieved 14 January 2023.
505. "CAPEC - Common Attack Pattern Enumeration and Classification (CAPEC™)" (https://capec.mitre.org/). capec.mitre.org. Retrieved 14 January 2023.
506. "CVE - Home" (https://cve.mitre.org/cve/). cve.mitre.org. Retrieved 14 January 2023.
507. "CWE - Common Weakness Enumeration" (https://cwe.mitre.org/index.html). cwe.mitre.org. Retrieved 14 January 2023.
508. Lim, Swee Kiat; Muis, Aldrian Obaja; Lu, Wei; Ong, Chen Hui (July 2017). "MalwareTextDB: A Database for Annotated Malware Articles" (https://aclanthology.org/P17-1143). Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics: 1557–1567. doi:10.18653/v1/P17-1143 (https://doi.org/10.18653%2Fv1%2FP17-1143). S2CID 7816596 (https://api.semanticscholar.org/CorpusID:7816596).
509. "USENIX" (https://www.usenix.org/). USENIX. Retrieved 19 January 2023.
510. "APTnotes | Read the Docs" (https://readthedocs.org/projects/aptnotes/). readthedocs.org. Retrieved 19 January 2023.
511. "Cryptography and Security authors/titles recent submissions" (https://arxiv.org/list/cs.CR/recent). arxiv.org. Retrieved 19 January 2023.
512. "Holistic Info-Sec for Web Developers - Fascicle 0" (https://f0.holisticinfosecforwebdevelopers.com/). f0.holisticinfosecforwebdevelopers.com. Retrieved 20 January 2023.
513. "Holistic Info-Sec for Web Developers - Fascicle 1" (https://f1.holisticinfosecforwebdevelopers.com/). f1.holisticinfosecforwebdevelopers.com. Retrieved 20 January 2023.
514. Vincent, Adam. "Web Services Hacking and Hardening" (https://owasp.org/www-pdf-archive/Web_Services_Hacking_and_Hardening.pdf) (PDF). owasp.org.
515. McCray, Joe. "Advanced SQL Injection" (https://defcon.org/images/defcon-17/dc-17-presentations/defcon-17-joseph_mccray-adv_sql_injection.pdf) (PDF). defcon.org.
516. Shah, Shreeraj. "Blind SQL injection discovery & exploitation technique" (https://blueinfy.com/wp/blindsql.pdf) (PDF). blueinfy.com.
517. Palcer, C. C. "Ethical hacking" (https://blueinfy.com/wp/blindsql.pdf) (PDF). textfiles.
518. "Hacking Secrets Revealed - Information and Instructional Guide" (https://www.onlinepot.org/security/HackersSecrets.pdf) (PDF).
519. Park, Alexis. "Hack any website" (https://defcon.org/images/defcon-11/dc-11-presentations/dc-11-Gentil/dc-11-gentil.pdf) (PDF).
520. Cerrudo, Cesar; Martinez Fayo, Esteban. "Hacking Databases for Owning your Data" (https://www.blackhat.com/presentations/bh-europe-07/Cerrudo/Whitepaper/bh-eu-07-cerrudo-WP-up.pdf) (PDF). blackhat.
521. O'Connor, Tj. "Violent Python-A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers" (https://github.com/reconSF/python/blob/master/Syngress.Violent.Python.a.Cookbook.for.Hackers.2013.pdf) (PDF). GitHub.
522. Grand, Joe. "Hardware Reverse Engineering: Access, Analyze, & Defeat" (https://media.blackhat.com/bh-dc-11/Grand/BlackHat_DC_2011_Grand-Workshop.pdf) (PDF). blackhat.
523. Chang, Jason V. "Computer Hacking: Making the Case for National Reporting Requirement" (https://cyber.harvard.edu/sites/cyber.law.harvard.edu/files/ComputerHacking.pdf) (PDF). cyber.harvard.edu.
524. "National Cybersecurity Strategies Repository" (https://www.itu.int:443/en/ITU-D/Cybersecurity/Pages/National-Strategies-repository.aspx). ITU. Retrieved 20 January 2023.
525. Chen, Yanlin (31 August 2022), Cyber Security Natural Language Processing (https://github.com/Ychen463/Cyber), retrieved 20 January 2023
526. "https://twitter.com/blackorbird" (https://twitter.com/blackorbird). Twitter. Retrieved 20 January 2023.
527. Zampieri, Marcos; Malmasi, Shervin; Nakov, Preslav; Rosenthal, Sara; Farra, Noura; Kumar, Ritesh (16 April 2019). "Predicting the Type and Target of Offensive Posts in Social Media". arXiv:1902.09666 (https://arxiv.org/abs/1902.09666) [cs.CL (https://arxiv.org/archive/cs.CL)].
528. "Threat reports" (https://www.ncsc.gov.uk/section/keep-up-to-date/threat-reports). www.ncsc.gov.uk. Retrieved 20 January 2023.
529. "Category: APT reports | Securelist" (https://securelist.com/category/apt-reports/). securelist.com. Retrieved 23 January 2023.
530. "Your Cybersecurity News Connection - Cyber News | CyberWire" (https://thecyberwire.com/). The CyberWire. Retrieved 23 January 2023.
531. "News" (https://www.databreaches.net/news/). Retrieved 23 January 2023.
532. "Cybernews" (https://cybernews.com/). Cybernews.
533. "HIPAA Journal" (https://www.hipaajournal.com/). HIPAA Journal. Retrieved 23 January 2023.
534. "BleepingComputer" (https://www.bleepingcomputer.com/). BleepingComputer. Retrieved 23 January 2023.
535. "Homepage" (https://therecord.media/). The Record from Recorded Future News. Retrieved 23 January 2023.
536. "HackRead | Latest Cyber Crime - InfoSec- Tech - Hacking News" (https://www.hackread.com/). 8 January 2022. Retrieved 23 January 2023.
537. "Securelist | Kaspersky's threat research and reports" (https://securelist.com/). securelist.com. Retrieved 31 January 2023.
538. Harshaw, Christopher R.; Bridges, Robert A.; Iannacone, Michael D.; Reed, Joel W.; Goodall, John R. (5 April 2016). "GraphPrints: Towards a Graph Analytic Method for Network Anomaly Detection" (https://doi.org/10.1145/2897795.2897806). Proceedings of the 11th Annual Cyber and Information Security Research Conference. CISRC '16. New York, NY, USA: Association for Computing Machinery: 1–4. doi:10.1145/2897795.2897806 (https://doi.org/10.1145%2F2897795.2897806). ISBN 978-1-4503-3752-6.
539. "Farsight Security, cyber security intelligence solutions" (https://www.farsightsecurity.com/). Farsight Security. Retrieved 13 February 2023.
540. "Schneier on Security" (https://www.schneier.com/). www.schneier.com. Retrieved 13 February 2023.
541. "#1 in Cloud Security & Endpoint Cybersecurity" (https://www.trendmicro.com/en_us/business.html). Trend Micro. Retrieved 13 February 2023.
542. "The Hacker News | #1 Trusted Cybersecurity News Site" (https://thehackernews.com/). The Hacker News. Retrieved 13 February 2023.
543. "Krebs on Security – In-depth security news and investigation" (https://krebsonsecurity.com/). Retrieved 25 February 2023.
544. "MITRE D3FEND Knowledge Graph" (https://d3fend.mitre.org/). d3fend.mitre.org. Retrieved 31 March 2023.
545. "MITRE | ATLAS™" (https://atlas.mitre.org/). atlas.mitre.org. Retrieved 31 March 2023.
546. "MITRE Engage™ | An Adversary Engagement Framework from MITRE" (https://engage.mitre.org/). Retrieved 1 April 2023.
547. "Hacking Tutorials - The best Step-by-Step Hacking Tutorials" (https://www.hackingtutorials.org/). Hacking Tutorials. Retrieved 1 April 2023.
548. "TCFD Knowledge Hub" (https://www.tcfdhub.org/). TCFD Knowledge Hub. Retrieved 3 February 2023.
549. "ResponsibilityReports.com" (https://www.responsibilityreports.com/). www.responsibilityreports.com. Retrieved 3 February 2023.
550. "About — IPCC" (https://www.ipcc.ch/about/). Retrieved 20 February 2023.
551. "Alliance for Research on Corporate Sustainability | ARCS serves as a vehicle for advancing rigorous academic research on corporate sustainability issues" (https://corporate-sustainability.org/). corporate-sustainability.org. Retrieved 2 March 2023.
552. Mehra, Srishti; Louka, Robert; Zhang, Yixun (26 March 2022). "ESGBERT: Language Model to Help with Classification Tasks Related to Companies Environmental, Social, and Governance Practices". Embedded Systems and Applications: 183–190. arXiv:2203.16788 (https://arxiv.org/abs/2203.16788). doi:10.5121/csit.2022.120616 (https://doi.org/10.5121%2Fcsit.2022.120616). ISBN 9781925953657. S2CID 247825524 (https://api.semanticscholar.org/CorpusID:247825524).
553. This article incorporates text (https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever) available under the CC BY 4.0 license.
554. Diggelmann, Thomas; Boyd-Graber, Jordan; Bulian, Jannis; Ciaramita, Massimiliano; Leippold, Markus (2 January 2021). "CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims". arXiv:2012.00614 (https://arxiv.org/abs/2012.00614) [cs.CL (https://arxiv.org/archive/cs.CL)].
555. "climate-news-db" (http://www.climate-news-db.com/). www.climate-news-db.com. Retrieved 3 February 2023.
556. "Climatext" (http://www.sustainablefinance.uzh.ch/en/research/climate-fever/climatext.html). www.sustainablefinance.uzh.ch. Retrieved 19 February 2023.
557. "Greenbiz" (https://www.greenbiz.com/). www.greenbiz.com. Retrieved 2 March 2023.
558. "Explore the @Reuters Hot List of 1,000 top climate scientists" (https://www.reuters.com/investigates/special-report/climate-change-scientists-list/). Reuters. Retrieved 22 March 2023.
559. "Blogs | Alliance for Research on Corporate Sustainability" (https://corporate-sustainability.org/blogs/). corporate-sustainability.org. Retrieved 27 March 2023.
560. "Greenbiz" (https://www.greenbiz.com/). www.greenbiz.com. Retrieved 29 March 2023.
561. "CSR News" (https://www.csrwire.com/press_releases). www.csrwire.com. Retrieved 29 March 2023.
562. "CDP Homepage" (https://www.cdp.net/en). www.cdp.net. Retrieved 29 March 2023.
563. "Hybrid cloud blog" (https://content.cloud.redhat.com/blog). content.cloud.redhat.com. Retrieved 9 April 2023.
564. "Production-Grade Container Orchestration" (https://kubernetes.io/). Kubernetes. Retrieved 9 April 2023.
565. "Home | Official Red Hat OpenShift Documentation" (https://docs.openshift.com/). docs.openshift.com. Retrieved 9 April 2023.
566. "Cloud Native Computing Foundation" (https://www.cncf.io/). Cloud Native Computing Foundation. Retrieved 9 April 2023.
567. CNCF Community Presentations (https://github.com/cncf/presentations/blob/2ff57e4d78f6d70bb1fd5daf81e76f04a54c8520/kubernetes/README.md), Cloud Native Computing Foundation (CNCF), 11 April 2023, retrieved 11 April 2023
568. "Red Hat - We make open source technologies for the enterprise" (https://www.redhat.com/en). www.redhat.com. Retrieved 1 May 2023.
569. Brown, Michael Scott, Michael J. Pelosi, and Henry Dirska. "Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks (http://www.academia.edu/download/46729605/BrownPelosiDirska79880027.pdf)." Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2013. 27–41.
570. Shen, Kao-Yi; Tzeng, Gwo-Hshiung (2015). "Fuzzy Inference-Enhanced VC-DRSA Model for Technical Analysis: Investment Decision Aid". International Journal of Fuzzy Systems. 17 (3): 375–389. doi:10.1007/s40815-015-0058-8 (https://doi.org/10.1007%2Fs40815-015-0058-8). S2CID 68241024 (https://api.semanticscholar.org/CorpusID:68241024).
571. Quinlan, J. Ross (1987). "Simplifying decision trees". International Journal of Man-Machine Studies. 27 (3): 221–234. CiteSeerX 10.1.1.18.4267 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.4267). doi:10.1016/s0020-7373(87)80053-6 (https://doi.org/10.1016%2Fs0020-7373%2887%2980053-6).
572. Hamers, Bart; Suykens, Johan AK; De Moor, Bart (2003). "Coupled transductive ensemble learning of kernel models" (http://ftp.esat.kuleuven.be/pub/SISTA/hamers/BH_clm.pdf) (PDF). Journal of Machine Learning Research. 1: 1–48.
573. Shmueli, Galit, Ralph P. Russo, and Wolfgang Jank. "The BARISTA: a model for bid arrivals in online auctions (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1196438025)." The Annals of Applied Statistics (2007): 412–441.
574. Peng, Jie, and Hans-Georg Müller. "Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908052)." The Annals of Applied Statistics (2008): 1056–1077.
575. Eggermont, Jeroen, Joost N. Kok, and Walter A. Kosters. "Genetic programming for data classification: Partitioning the search space (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8725&rep=rep1&type=pdf)." Proceedings of the 2004 ACM symposium on Applied computing. ACM, 2004.
576. Moro, Sérgio; Cortez, Paulo; Rita, Paulo (2014). "A data-driven approach to predict the success of bank telemarketing". Decision Support Systems. 62: 22–31. doi:10.1016/j.dss.2014.03.001 (https://doi.org/10.1016%2Fj.dss.2014.03.001). hdl:10071/9499 (https://hdl.handle.net/10071%2F9499). S2CID 14181100 (https://api.semanticscholar.org/CorpusID:14181100).
577. Payne, Richard D.; Mallick, Bani K. (2014). "Bayesian Big Data Classification: A Review with Complements". arXiv:1411.5653 (https://arxiv.org/abs/1411.5653) [stat.ME (https://arxiv.org/archive/stat.ME)].
578. Akbilgic, Oguz; Bozdogan, Hamparsum; Balaban, M. Erdal (2014). "A novel Hybrid RBF Neural Networks model as a forecaster". Statistics and Computing. 24 (3): 365–375. doi:10.1007/s11222-013-9375-7 (https://doi.org/10.1007%2Fs11222-013-9375-7). S2CID 17764829 (https://api.semanticscholar.org/CorpusID:17764829).
579. Jabin, Suraiya. "Stock market prediction using feed-forward artificial neural network (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.677.8985&rep=rep1&type=pdf)." Int. J. Comput. Appl. (IJCA) 99.9 (2014).
580. Yeh, I-Cheng; Che-hui, Lien (2009). "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients". Expert Systems with Applications. 36 (2): 2473–2480. doi:10.1016/j.eswa.2007.12.020 (https://doi.org/10.1016%2Fj.eswa.2007.12.020).
581. Lin, Shu Ling (2009). "A new two-stage hybrid approach of credit risk in banking industry". Expert Systems with Applications. 36 (4): 8333–8341. doi:10.1016/j.eswa.2008.10.015 (https://doi.org/10.1016%2Fj.eswa.2008.10.015).
582. Pelckmans, Kristiaan; et al. (2005). "The differogram: Non-parametric noise variance estimation and its use for model selection". Neurocomputing. 69 (1): 100–122. doi:10.1016/j.neucom.2005.02.015 (https://doi.org/10.1016%2Fj.neucom.2005.02.015).
583. Bay, Stephen D.; et al. (2000). "The UCI KDD archive of large data sets for data mining research and experimentation". ACM SIGKDD Explorations Newsletter. 2 (2): 81–85. CiteSeerX 10.1.1.15.9776 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.9776). doi:10.1145/380995.381030 (https://doi.org/10.1145%2F380995.381030). S2CID 534881 (https://api.semanticscholar.org/CorpusID:534881).
584. Lucas, D. D.; et al. (2015). "Designing optimal greenhouse gas observing networks that consider performance and cost" (https://doi.org/10.5194%2Fgi-4-121-2015). Geoscientific Instrumentation, Methods and Data Systems. 4 (1): 121. Bibcode:2015GI......4..121L (https://ui.adsabs.harvard.edu/abs/2015GI......4..121L). doi:10.5194/gi-4-121-2015 (https://doi.org/10.5194%2Fgi-4-121-2015).
585. Pales, Jack C.; Keeling, Charles D. (1965). "The concentration of atmospheric carbon dioxide in Hawaii". Journal of Geophysical Research. 70 (24): 6053–6076. Bibcode:1965JGR....70.6053P (https://ui.adsabs.harvard.edu/abs/1965JGR....70.6053P). doi:10.1029/jz070i024p06053 (https://doi.org/10.1029%2Fjz070i024p06053).
586. Sigillito, Vincent G., et al. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10.3 (1989): 262–266.
593. Meek, Christopher, Bo Thiesson, and David Heckerman. "The Learning Curve Method Applied to Clustering (https://www.microsoft.com/en-us/research/wp-content/uploads/2001/01/lc-aistats.pdf)." AISTATS. 2001.
594. Fanaee-T, Hadi; Gama, Joao (2013). "Event labeling combining ensemble detectors and background knowledge" (http://repositorio.inesctec.pt/handle/123456789/3506). Progress in Artificial Intelligence. 2 (2–3): 113–127. doi:10.1007/s13748-013-0040-3 (https://doi.org/10.1007%2Fs13748-013-0040-3). S2CID 3345087 (https://api.semanticscholar.org/CorpusID:3345087).
595. Giot, Romain, and Raphaël Cherrier. "Predicting bikeshare system usage up to one day ahead (https://hal.archives-ouvertes.fr/docs/01/06/59/83/PDF/paper_final.pdf)." Computational intelligence in vehicles and transportation systems (CIVTS), 2014 IEEE symposium on. IEEE, 2014.
596. Zhan, Xianyuan; et al. (2013). "Urban link travel time estimation using large-scale taxi data with partial information". Transportation Research Part C: Emerging Technologies. 33: 37–49. doi:10.1016/j.trc.2013.04.001 (https://doi.org/10.1016%2Fj.trc.2013.04.001).
597. Moreira-Matias, Luis; et al. (2013). "Predicting taxi–passenger demand using streaming data" (http://repositorio.inesctec.pt/handle/123456789/5356). IEEE Transactions on Intelligent Transportation Systems. 14 (3): 1393–1402. doi:10.1109/tits.2013.2262376 (https://doi.org/10.1109%2Ftits.2013.2262376). S2CID 14764358 (https://api.semanticscholar.org/CorpusID:14764358).
598. Hwang, Ren-Hung; Hsueh, Yu-Ling; Chen, Yu-Ting (2015). "An effective taxi recommender system based on a spatio-temporal factor analysis model". Information Sciences. 314: 28–40. doi:10.1016/j.ins.2015.03.068 (https://doi.org/10.1016%2Fj.ins.2015.03.068).
599. H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Commun. ACM, 57(7):86–94, July 2014.
600. Caltrans PeMS (http://pems.dot.ca.gov/)
601. Meusel, Robert, et al. "The Graph Structure in the Web—Analyzed on Different Aggregation Levels (https://www.nowpublishers.com/article/OpenAccessDownload/JWS-0003)." The Journal of Web Science 1.1 (2015).
602. Kushmerick, Nicholas. "Learning to remove internet advertisements (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.5686&rep=rep1&type=pdf)." Proceedings of the third annual conference on Autonomous Agents. ACM, 1999.
603. Fradkin, Dmitriy, and David Madigan. "Experiments with random projections for machine learning (https://www.researchgate.net/profile/Dmitriy_Fradkin/publication/2573186_Experiments_with_Random_Projections_for_Machine_Learning/links/0fcfd50b6230aaf309000000.pdf)." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
604. This data was used in the American Statistical Association
587. Zhang, Kun, and Wei Fan. "Forecasting skewed biased stochastic Statistical Graphics and Computing Sections 1999 Data Exposition.
ozone days: analyses, solutions and beyond (http://citeseerx.ist.ps
605. Ma, Justin, et al. "Identifying suspicious URLs: an application of
u.edu/viewdoc/download?doi=10.1.1.218.9860&rep=rep1&type=pd
large-scale online learning (https://cseweb.ucsd.edu/~voelker/pubs/
f)." Knowledge and Information Systems14.3 (2008): 299–326. mal-url-icml09.pdf)."Proceedings of the 26th annual international
588. Reich, Brian J., Montserrat Fuentes, and David B. Dunson. conference on machine learning. ACM, 2009.
"Bayesian spatial quantile regression (https://www.ncbi.nlm.nih.gov/
606. Levchenko, Kirill, et al. "Click trajectories: End-to-end analysis of
pmc/articles/PMC3583387/)." Journal of the American Statistical
the spam value chain (http://www.icir.org/christian/publications/2011
Association (2012).
-oakland-trajectory.pdf)." Security and Privacy (SP), 2011 IEEE
589. Kohavi, Ron (1996). "Scaling Up the Accuracy of Naive-Bayes Symposium on. IEEE, 2011.
Classifiers: A Decision-Tree Hybrid". KDD. 96.
607. Mohammad, Rami M., Fadi Thabtah, and Lee McCluskey. "An
590. Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of assessment of features related to phishing websites using an
online and batch versions of bagging and boosting." Proceedings of automated technique (http://eprints.hud.ac.uk/16229/1/The_7th_ICI
the seventh ACM SIGKDD international conference on Knowledge TST_2012_Conference_-An_Assessment_of_Features_Related_t
discovery and data mining. ACM, 2001. o_Phishing_Websites_using_an_Automated_Technique.pd
591. Bay, Stephen D (2001). "Multivariate discretization for set mining". f)."Internet Technology And Secured Transactions, 2012
Knowledge and Information Systems. 3 (4): 491–512. International Conference for. IEEE, 2012.
CiteSeerX 10.1.1.217.921 (https://citeseerx.ist.psu.edu/viewdoc/su 608. Singh, Ashishkumar, et al. "Clustering Experiments on Big
mmary?doi=10.1.1.217.921). doi:10.1007/pl00011680 (https://doi.or Transaction Data for Market Segmentation (https://dl.acm.org/citatio
g/10.1007%2Fpl00011680). S2CID 10945544 (https://api.semantic n.cfm?id=2644161)." Proceedings of the 2014 International
scholar.org/CorpusID:10945544). Conference on Big Data Science and Computing. ACM, 2014.
592. Ruggles, Steven (1995). "Sample designs and sampling errors". 609. Bollacker, Kurt, et al. "Freebase: a collaboratively created graph
Historical Methods. 28 (1): 40–46. database for structuring human knowledge (http://citeseerx.ist.psu.e
doi:10.1080/01615440.1995.9955312 (https://doi.org/10.1080%2F0 du/viewdoc/download?doi=10.1.1.538.7139&rep=rep1&type=pdf)."
1615440.1995.9955312). Proceedings of the 2008 ACM SIGMOD international conference on
Management of data. ACM, 2008.
610. Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data (https://www.aclweb.org/anthology/P09-1113)." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
611. Mesterharm, Chris, and Michael J. Pazzani. "Active learning using on-line algorithms (http://research.cs.rutgers.edu/~mesterha/active-online.pdf) Archived (https://web.archive.org/web/20170922013803/http://research.cs.rutgers.edu/~mesterha/active-online.pdf) 22 September 2017 at the Wayback Machine." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
612. Wang, Shusen; Zhang, Zhihua (2013). "Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling" (http://www.jmlr.org/papers/volume14/wang13c/wang13c.pdf) (PDF). The Journal of Machine Learning Research. 14 (1): 2729–2769. arXiv:1303.4207 (https://arxiv.org/abs/1303.4207). Bibcode:2013arXiv1303.4207W (https://ui.adsabs.harvard.edu/abs/2013arXiv1303.4207W).
613. "The Pile" (https://pile.eleuther.ai/). pile.eleuther.ai. Retrieved 14 April 2022.
614. "JSON Lines" (https://jsonlines.org/). jsonlines.org. Retrieved 14 April 2022.
615. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 (https://arxiv.org/abs/2101.00027) [cs.CL (https://arxiv.org/archive/cs.CL)].
616. Cohen, Vanya. "OpenWebTextCorpus" (https://skylion007.github.io/OpenWebTextCorpus/). OpenWebTextCorpus. Retrieved 9 January 2023.
617. "openwebtext · Datasets at Hugging Face" (https://huggingface.co/datasets/openwebtext). huggingface.co. 16 November 2022. Retrieved 9 January 2023.
618. Cattral, Robert; Oppacher, Franz; Deugo, Dwight (2002). "Evolutionary data mining with automatic rule generalization" (https://web.archive.org/web/20190806015013/https://pdfs.semanticscholar.org/c068/ea7807367573f4b5f98c0681fca665e9ef74.pdf) (PDF). Recent Advances in Computers, Computing and Communications: 296–300. S2CID 18625415 (https://api.semanticscholar.org/CorpusID:18625415). Archived from the original (https://pdfs.semanticscholar.org/c068/ea7807367573f4b5f98c0681fca665e9ef74.pdf) (PDF) on 6 August 2019.
619. Burton, Ariel N.; Kelly, Paul H.J. (2006). "Performance prediction of paging workloads using lightweight tracing". Future Generation Computer Systems. Elsevier BV. 22 (7): 784–793. doi:10.1016/j.future.2006.02.003 (https://doi.org/10.1016%2Fj.future.2006.02.003). ISSN 0167-739X (https://www.worldcat.org/issn/0167-739X).
620. Bain, Michael; Muggleton, Stephen (1994). "Learning optimal chess strategies". Machine Intelligence. Oxford University Press, Inc. 13.
621. Quinlan, J. R. (1983). "Learning efficient classification procedures and their application to chess end games". Machine Learning: An Artificial Intelligence Approach. 1: 463–482. doi:10.1007/978-3-662-12405-5_15 (https://doi.org/10.1007%2F978-3-662-12405-5_15). ISBN 978-3-662-12407-9.
622. Shapiro, Alen D. (1987). Structured induction in expert systems. Addison-Wesley Longman Publishing Co., Inc.
623. Matheus, Christopher J.; Rendell, Larry A. (1989). "Constructive Induction on Decision Trees" (http://www.academia.edu/download/40413240/Constructive_Induction_On_Decision_Trees20151126-4470-tjt71n.pdf) (PDF). IJCAI. 89.
624. Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.
625. Ruotsalo, Tuukka; Aroyo, Lora; Schreiber, Guus (2009). "Knowledge-based linguistic annotation of digital cultural heritage collections" (http://dare.ubvu.vu.nl/bitstream/handle/1871/24407/243319.pdf?sequence=3) (PDF). IEEE Intelligent Systems. 24 (2): 64–75. doi:10.1109/MIS.2009.32 (https://doi.org/10.1109%2FMIS.2009.32). hdl:1871.1/9f6091aa-9596-46a9-9251-f11edeeb28b7 (https://hdl.handle.net/1871.1%2F9f6091aa-9596-46a9-9251-f11edeeb28b7). S2CID 6667472 (https://api.semanticscholar.org/CorpusID:6667472).
626. Li, Lihong; Chu, Wei; Langford, John; Wang, Xuanhui (2011). "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms". Proceedings of the fourth ACM international conference on Web search and data mining. pp. 297–306. arXiv:1003.5956 (https://arxiv.org/abs/1003.5956). doi:10.1145/1935826.1935878 (https://doi.org/10.1145%2F1935826.1935878). ISBN 9781450304931. S2CID 744200 (https://api.semanticscholar.org/CorpusID:744200).
627. Yeung, Kam Fung, and Yanyan Yang. "A proactive personalized mobile news recommendation system (https://ieeexplore.ieee.org/abstract/document/5633837/)." Developments in E-systems Engineering (DESE), 2010. IEEE, 2010.
628. Gass, Susan E.; Roberts, J. Murray (2006). "The occurrence of the cold-water coral Lophelia pertusa (Scleractinia) on oil and gas platforms in the North Sea: colony growth, recruitment and environmental controls on distribution". Marine Pollution Bulletin. 52 (5): 549–559. Bibcode:2006MarPB..52..549G (https://ui.adsabs.harvard.edu/abs/2006MarPB..52..549G). doi:10.1016/j.marpolbul.2005.10.002 (https://doi.org/10.1016%2Fj.marpolbul.2005.10.002). PMID 16300800 (https://pubmed.ncbi.nlm.nih.gov/16300800).
629. Gionis, Aristides; Mannila, Heikki; Tsaparas, Panayiotis (2007). "Clustering aggregation". ACM Transactions on Knowledge Discovery from Data. 1 (1): 4. CiteSeerX 10.1.1.709.528 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.709.528). doi:10.1145/1217299.1217303 (https://doi.org/10.1145%2F1217299.1217303). S2CID 433708 (https://api.semanticscholar.org/CorpusID:433708).
630. Obradovic, Zoran, and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Technical Report, Center for Information Science and Technology Temple University, 2004.
631. Van Der Putten, Peter; van Someren, Maarten (2000). "CoIL challenge 2000: The insurance company case". Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report. 9: 1–43.
632. Mao, K. Z. (2002). "RBF neural network center selection based on Fisher ratio class separability measure". IEEE Transactions on Neural Networks. 13 (5): 1211–1217. doi:10.1109/tnn.2002.1031953 (https://doi.org/10.1109%2Ftnn.2002.1031953). PMID 18244518 (https://pubmed.ncbi.nlm.nih.gov/18244518).
633. Olave, Manuel; Rajkovic, Vladislav; Bohanec, Marko (1989). "An application for admission in public school systems" (http://kt.ijs.si/MarkoBohanec/pub/Nursery89.pdf) (PDF). Expert Systems in Public Administration. 1: 145–160.
634. Lizotte, Daniel J.; Madani, Omid; Greiner, Russell (2012). "Budgeted Learning of Naive-Bayes Classifiers". arXiv:1212.2472 (https://arxiv.org/abs/1212.2472) [cs.LG (https://arxiv.org/archive/cs.LG)].
635. Lebowitz, Michael (1986). Concept learning in a rich input domain: Generalization-based memory (https://books.google.com/books?id=f9RylgKpHZsC&q=%22Concept+learning+in+a+rich+input+domain:+Generalization-based+memory%22&pg=PA193). Machine Learning: An Artificial Intelligence Approach. Vol. 2. pp. 193–214. ISBN 9780934613002.
636. Yeh, I-Cheng; Yang, King-Jang; Ting, Tao-Ming (2009). "Knowledge discovery on RFM model using Bernoulli sequence". Expert Systems with Applications. 36 (3): 5866–5871. doi:10.1016/j.eswa.2008.07.018 (https://doi.org/10.1016%2Fj.eswa.2008.07.018).
637. Lee, Wen-Chen; Cheng, Bor-Wen (2011). "An intelligent system for improving performance of blood donation" (http://www.airitilibrary.com/Publication/alDetailedMesh?docid=10220690-201104-201105050019-201105050019-173-185). Journal of Quality Vol. 18 (2): 173.
638. Schmidtmann, Irene, et al. "Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage (http://www.krebsregister-nrw.de/fileadmin/user_upload/dokumente/Evaluation/EKR_NRW_Evaluation_Abschlussbericht_2009-06-11.pdf)." Abschlußbericht vom 11 (2009).
639. Sariyar, Murat; Borg, Andreas; Pommerening, Klaus (2011). "Controlling false match rates in record linkage using extreme value theory". Journal of Biomedical Informatics. 44 (4): 648–654. doi:10.1016/j.jbi.2011.02.008 (https://doi.org/10.1016%2Fj.jbi.2011.02.008). PMID 21352952 (https://pubmed.ncbi.nlm.nih.gov/21352952).
640. Candillier, Laurent, and Vincent Lemaire. "Design and Analysis of the Nomao challenge Active Learning in the Real-World (https://web.archive.org/web/20181206102406/https://pdfs.semanticscholar.org/1647/fc91cfe3e68ef3c41d727b7292ce20482b11.pdf)." Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD. 2012.
641. Marquez, Ivan Garrido. "A Domain Adaptation Method for Text Classification based on Self-adjusted Training Approach (http://ccc.inaoep.mx/~mmontesg/tesis%20estudiantes/TesisMaestria-IvanGarrido.pdf)." (2013).
642. Nagesh, Harsha S., Sanjay Goil, and Alok N. Choudhary. "Adaptive Grids for Clustering Massive Data Sets." SDM. 2001.
643. Kuzilek, Jakub, et al. "OU Analyse: analysing at-risk students at The Open University (http://oro.open.ac.uk/42529/1/__userdata_documents4_ctb44_Desktop_analysing-at-risk-students-at-open-university.pdf)." Learning Analytics Review (2015): 1–16.
644. Siemens, George, et al. Open Learning Analytics: an integrated & modularized platform (http://search.ror.unisa.edu.au/record/UNISA_ALMA11143300720001831/media/digital/open/9915909179101831/12143300710001831/13143328550001831/pdf). Diss. Open University Press, 2011.
645. Barlacchi, Gianni; De Nadai, Marco; Larcher, Roberto; Casella, Antonio; Chitic, Cristiana; Torrisi, Giovanni; Antonelli, Fabrizio; Vespignani, Alessandro; Pentland, Alex; Lepri, Bruno (2015). "A multi-source dataset of urban life in the city of Milan and the Province of Trentino" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4622222). Scientific Data. 2: 150055. Bibcode:2015NatSD...250055B (https://ui.adsabs.harvard.edu/abs/2015NatSD...250055B). doi:10.1038/sdata.2015.55 (https://doi.org/10.1038%2Fsdata.2015.55). ISSN 2052-4463 (https://www.worldcat.org/issn/2052-4463). PMC 4622222 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4622222). PMID 26528394 (https://pubmed.ncbi.nlm.nih.gov/26528394).
646. Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013). "OpenML: networked science in machine learning". SIGKDD Explorations. 15 (2): 49–60. arXiv:1407.7722 (https://arxiv.org/abs/1407.7722). doi:10.1145/2641190.2641198 (https://doi.org/10.1145%2F2641190.2641198). S2CID 4977460 (https://api.semanticscholar.org/CorpusID:4977460).
647. Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH (2017). "PMLB: a large benchmark suite for machine learning evaluation and comparison" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5725843). BioData Mining. 10: 36. arXiv:1703.00512 (https://arxiv.org/abs/1703.00512). Bibcode:2017arXiv170300512O (https://ui.adsabs.harvard.edu/abs/2017arXiv170300512O). doi:10.1186/s13040-017-0154-4 (https://doi.org/10.1186%2Fs13040-017-0154-4). PMC 5725843 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5725843). PMID 29238404 (https://pubmed.ncbi.nlm.nih.gov/29238404).
648. "Off The Shelf Datasets" (https://appen.com/off-the-shelf-datasets/). appen.com. Appen. Retrieved 30 December 2020.
649. "Open Source Datasets" (https://appen.com/resources/datasets/). appen.com. Appen. Retrieved 30 December 2020.
Retrieved from "https://en.wikipedia.org/w/index.php?title=List_of_datasets_for_machine-learning_research&oldid=1165735668"
