# RF

Corsello Research Foundation

Time Series Data
Concepts

Introduction
 Sensors and other continual monitoring data collection efforts are used in many fields.
 These forms of data collection share a common underlying premise:
  A fixed set of data fields is collected at regular time intervals over a long period of time.
  This is the essence of a time series.

 How time series data is collected and used directly influences the storage methodology that should be used.
 Time series data is a form of temporal data that is managed as a set.
 It is the uniformity of the collection that enables and favors specific treatment for management.


Temporal Concepts
 The term “temporal” in this construct means “with respect to time” or “of or pertaining to time”, where our key term is “data”.
 Temporal data is any data whose value is measured with respect to, or bound by, time.
 Temporal data is about bounding the validity or relevance of a specific data point in time.
  If a river is measured to be 15 °C, that measurement is only valid at the point in time the measurement was taken.

Time is an intrinsic concept familiar to us all.
  It marks the “when” of all events.
  All events may be marked by “when” they occur.

 

All measurements are collected in time.
    A temperature (say 15 °C) is a value measured at a point in time (and space).
    Measurements are “taken” at the time “now” when the measurement occurs.
    At any point in time after the measurement was recorded, it can be referred to by the time of the measurement.
    This basic concept implies that all data is temporal in nature.

For any data measurement:
 The value measured (15 °C) is non-temporal.
 15 °C is only a value; it is the “thing” measured (“a river”) that makes the measurement temporal.

For certain very specific applications, the measured quantity cannot vary over time and the measurement is therefore temporally static.
 This does not imply that the data is not temporal, only that the temporal validity of the measurement is equivalent to the temporal life of the item measured.

These are two distinct concepts: the temporality of the measurement and the temporality of the item measured.


Time Series
 A time series is defined as a fixed structure of data collected repeatedly over time at fixed intervals.
 This definition is very broad and, as such, allows for variability in several areas.


Time Domain
 A single time series data set has a time domain marking the start and end of the time series.
 For continual monitoring scenarios, the end may be thought of as both “now” and “the end of time”.
 Since the data is a time series, “now” represents the current last record.


Time Interval
 For any time series, there is a fixed interval between value points.
  For example, “every five minutes” is the interval for a time series of data points collected at five-minute intervals.
 It is this exact concept that permits a time series to store only the data collected, and not the time each value was measured at.

 A time series stores only two actual times:
  Start date/time
  End date/time

 The time series also stores a single interval value: the return period or sampling interval separating discrete readings.
  Five minutes in our example.
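
 A minimal sketch of how the time of any reading can be recovered from the start time and the interval alone. The function name `time_at` and the sample values are illustrative assumptions, not part of any particular product.

```python
from datetime import datetime, timedelta

def time_at(start: datetime, interval: timedelta, index: int) -> datetime:
    """Return the timestamp implied by a zero-based position in the series."""
    return start + index * interval

start = datetime(2024, 1, 1, 5, 0)
interval = timedelta(minutes=5)
values = [15.0, 15.1, 15.3, 15.2]  # only the readings are stored

# The end of the series follows from start, interval and the number of values.
end = time_at(start, interval, len(values) - 1)
```

 Here the four readings cover 5:00 through 5:15 without a single per-value timestamp being stored.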


Measurement Interval
 An important related concept of time series data is the actual measurement interval.
 If a measurement is taken every five minutes, what is the collection method for the measure?
 Suppose a temperature measurement is recorded every five minutes on the “0” and “5” (e.g. 5:00, 5:05).

 Is the measure:
  An instantaneous temperature?
  An average temperature over the previous interval?
  An average over a split interval (5:00 recorded, sampled from 4:57:30 to 5:02:30)?

 This information is not part of the time series itself, but is instead metadata about the series.
 An important concept here is that, in a continual monitoring time series, sensors replaced over time may measure using different approaches.
 In the case of different measurement intervals, the time series should be split for consistency.


Interval Examples


Relationship to Temporal Data
 Time series data is a special case of temporal data.

 A time series is temporal in that each measurement within the time series may be treated as a single temporal measurement.
 The fixed interval of measures makes the treatment of the data special, whereas the data itself is not special in any way.

 A single time series may have thousands or millions of individual measurements, each spaced at fixed intervals.
 If a time series were to have only a single measurement (the degenerate case), it would be exactly a temporal measure.
 Any collection of temporal measures that are evenly spaced in time may be treated as a time series.
 It is possible to construct a time series from non-evenly spaced data via an interpolation process.
 It is common to abstract detailed measures (such as hourly temperatures at uneven intervals, i.e. sparse data) into more abstract time series such as daily, weekly or monthly means.
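
 As an illustration of the interpolation process, the sketch below places unevenly spaced samples onto a fixed grid using linear interpolation. Linear interpolation is only one possible choice; the function name, the use of plain numbers for times, and the sample data are all assumptions made for the example.

```python
from bisect import bisect_right

def resample_linear(times, values, start, interval, count):
    """Linearly interpolate irregular (time, value) samples onto a fixed grid.

    times must be sorted ascending; grid points outside the sampled range
    yield None rather than an extrapolated value.
    """
    out = []
    for i in range(count):
        t = start + i * interval
        if t < times[0] or t > times[-1]:
            out.append(None)
            continue
        j = bisect_right(times, t)
        if j == len(times):          # t equals the last sample time
            out.append(values[-1])
            continue
        t0, t1 = times[j - 1], times[j]
        v0, v1 = values[j - 1], values[j]
        out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out

# Times are minutes since an arbitrary origin (hypothetical sample data).
times = [0, 4, 11, 15]
values = [10.0, 14.0, 21.0, 25.0]
regular = resample_linear(times, values, start=0, interval=5, count=4)
```

 The result is a regular five-minute series that can be stored and treated like any other time series.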


Time Series and Temporal Representation


Collection
 Time series data may be collected in any of a number of ways.
 A simulation or application may generate a time series directly:
  A single run of an application may generate a full time series at once.
  An application may also append to a time series each time it runs.

 In the latter case, it is critical that the application is consistent in each run to maintain the integrity of the time series temporal offsets.
 It is often desirable to know which run produced which part of the time series.

 In the collection of time series data from sensors or manual entry:
  Each subsequent “round” of collection is conceptually separate from the previous “round” of collection.
  In the case of a field-deployed sensor (non-telemetry), each time the sensor is changed out or data is downloaded, a new time series is created for that batch of data.

 This is critical in that each deployment of a sensor may overlap slightly with another, may leave short gaps, or may be skewed slightly (every five minutes, but on the “1s” and “6s”).


Collection Example


Virtual Time Series
 The concept of multiple time series collections that align with collection efforts establishes a need for a “virtual time series”.
 This virtual time series is the defined “global” time series for a collection definition (fields, interval and domain).
 It is composed of individual “physical” time series that each contain the actual data records for one collection effort.
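
 A minimal sketch of assembling a virtual time series from physical ones. The function name, the use of plain numbers for times, and both policies (gaps stay NaN, the later deployment wins on overlap) are assumptions made for illustration; a real system would choose its own rules.

```python
import math

def build_virtual(start, interval, count, physical):
    """Assemble one virtual series from physical series.

    physical is a list of (start, values) pairs sharing the same interval.
    Gaps stay NaN; on overlap the later entry in the list wins. Readings
    that do not fall on the virtual grid (skewed deployments) are skipped
    here and would need realignment first.
    """
    virtual = [math.nan] * count
    for p_start, p_values in physical:
        for k, v in enumerate(p_values):
            offset = (p_start + k * interval) - start
            idx, rem = divmod(offset, interval)
            if rem == 0 and 0 <= idx < count:
                virtual[idx] = v
    return virtual

# Three deployments: the second overlaps the first, the third follows a gap.
deployments = [(0, [1.0, 2.0, 3.0]), (10, [9.0, 4.0]), (25, [6.0])]
virtual = build_virtual(start=0, interval=5, count=6, physical=deployments)
```

 The virtual series here has six slots; slot 2 holds the value from the second deployment and slot 4 remains missing.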


Time Series Use
 The long-term purpose of a time series is no different than that of any data: use.

 How time series data is used will influence the approach used for storage, to ensure that adequate performance and storage volumes are available to handle the demand.
 It is the nature of how time series data is used that most influences its special treatment.
 In many cases a time series is used as a whole (the entire series) rather than as individual measures.
 Without such a directed form of use, the notion of a time series would be irrelevant as a separate entity from the more general temporal data.
 The cost of storage and transmission can greatly affect the performance of applications using time series data; it is this cost that suggests special treatment of time series to reduce size and increase access performance.


Methodologies of Use

Random Extraction
 The most basic form of use for a time series is random extraction.
 A user needs data from a time series based upon a set of criteria known only to the user at extraction time (not planned or expected at data collection time).
 This is one of the most common scenarios for any data use and has large implications for storage format.
 For random extraction, a user may request “all records where temperature is over 32”.
 This form of access results in a search over the time series to extract the individual elements matching the criteria provided.
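
 A minimal sketch of random extraction over a sequentially stored series. The function name and sample readings are assumptions; the point is that every value must be visited because the criteria are not known in advance.

```python
from datetime import datetime, timedelta

def extract_where(start, interval, values, predicate):
    """Scan the whole series and return (time, value) pairs that match."""
    return [
        (start + i * interval, v)
        for i, v in enumerate(values)
        if predicate(v)
    ]

start = datetime(2024, 7, 1, 12, 0)
interval = timedelta(minutes=5)
temperatures = [31.5, 32.4, 30.9, 33.1]

# "All records where temperature is over 32"
hot = extract_where(start, interval, temperatures, lambda v: v > 32)
```

 Note that the result is no longer evenly spaced, so it is a set of temporal records rather than a time series.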


Temporal Extraction
 The easiest form of extraction from a time series is temporal extraction.
 The user wants a portion of the time series between two dates.
 This results in a new time series being returned, bounded by the most constrained of the user-defined limits and the time series' internal limits.
  For example, a request that starts prior to the start of the time series itself returns data beginning at the series start.
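
 A minimal sketch of temporal extraction with the limits clamped to the series itself. The function name and sample data are assumptions for illustration.

```python
import math
from datetime import datetime, timedelta

def extract_between(start, interval, values, lo, hi):
    """Return (new_start, new_values) for the part of the series in [lo, hi]."""
    end = start + (len(values) - 1) * interval
    lo, hi = max(lo, start), min(hi, end)        # most constrained limits
    if lo > hi:
        return None, []
    first = math.ceil((lo - start) / interval)   # first grid point >= lo
    last = math.floor((hi - start) / interval)   # last grid point <= hi
    return start + first * interval, values[first:last + 1]

start = datetime(2024, 1, 1, 0, 0)
interval = timedelta(minutes=5)
values = [1.0, 2.0, 3.0, 4.0, 5.0]               # 00:00 through 00:20

# The request starts before the series does, so the series start wins.
new_start, part = extract_between(
    start, interval, values,
    lo=datetime(2023, 12, 31, 23, 0),
    hi=datetime(2024, 1, 1, 0, 12),
)
```

 Because the result is a contiguous slice, it is itself a time series with the same interval and a new start time.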


Complete Delivery

 The best-case use scenario for a time series is complete delivery.
 Notice this is not an extraction, in that the entire data set is delivered as a whole.
 No processing is required beyond integrating the “physical” time series into the “virtual” record.


Enumeration

 Once delivered, a user will generally “walk through” the data in some manner toward a goal.
 For example, to compute the average of a time series, a full forward-scrolling read is performed to sum all values in the time series.
 This is a complete linear access from start to finish.


Linear and Partial Access
 Linear Access
  Linear or sequential access is the direct reading of the time series in time order of the data.
  Linear access has no special requirements and is one common access scenario.

 Partial Access
  Only a portion of the data may need to be reviewed.
  The access will only need to visit a portion of the data points within the time series.


Random Access
 The user may need to access any point within the time series at any time.
 The user must be able to “move” within the time series at will.
 Random access is the most complex form of access for any data structure, and is commonly required.
 One common example of random access is sorting.
  If a user wanted to sort a time series by temperature rather than by time, they would use both linear access to enumerate and random access to read specific items.

 More significantly, random access allows for access by data field, such as temperature (e.g. get the record for temperature = 26).
 This form of random access is closely related to random extraction and has similar impacts on performance.


Index or Ordinal
 Index or ordinal access to a time series is access by time offset or by positional “offset” into the time series (e.g. the 26th data point in the series).
 Index access is closely related to random access.
  It is in fact a mechanism for random access without the performance issues of other forms of random access.
 In general, index access is the only form of random access with low performance costs.
  It still has implications for large-volume time series.
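
 A minimal sketch of index access by time offset. Because the interval is fixed, the position of a reading can be computed directly rather than searched for. The function name and sample data are assumptions for illustration.

```python
from datetime import datetime, timedelta

def index_of(start, interval, when):
    """Zero-based position of the reading taken at `when`.

    Raises ValueError if `when` precedes the series or does not fall
    exactly on its sampling grid.
    """
    idx, rem = divmod(when - start, interval)
    if rem or idx < 0:
        raise ValueError("time is not on the sampling grid of this series")
    return idx

start = datetime(2024, 1, 1, 5, 0)
interval = timedelta(minutes=5)
values = [15.0 + 0.1 * i for i in range(30)]

position = index_of(start, interval, datetime(2024, 1, 1, 7, 5))
reading = values[position]   # the 26th data point in the series
```

 The lookup cost does not depend on the length of the series, which is why this form of random access is inexpensive.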

Storage
Placing the Bytes on the Disk

Storage
 There are many well-defined formats for the storage and transport of time series data, such as:
  CDF (Common Data Format)
  NetCDF (Network Common Data Form)

 There are many databases and applications that support time series data, such as:
  Aquarius
  Historis
  Temporal Analyst
  GrADS
  Timescape XDB
  HEC-DSS

 There is a common thread across all time series formats:
  A time series is a set of data delimited in time by a fixed interval with a fixed start date (our general definition).
  In specific implementations, there may be constraints on the data stored in a single time series (the fields) or on the maximum size of the time series when stored (Aquarius, for example, is limited by the underlying database).

 When planning time series storage, the collection and use of the data to be stored must be considered to ensure adequate capacity and performance.
 Each type of data to be stored in a time series (the field set) requires a dedicated time series store.
  For example, a water quality time series cannot store sediment data (the fields are different).
  A combined water/sediment time series may be created that stores both together as a single entity.


Storage Mechanisms
 A time series may be stored:
  In a relational database management system (RDBMS)
  In flat files
  As XML

 The selection of storage location (e.g. flat file or RDBMS) will influence how the data within that location is structured.
 For example, in an RDBMS, each time series could be stored as:
  A dedicated table
  A set of rows in a shared table
  A single row in a shared table


Field Storage
 An important aspect of the time series is the fields within the series.
 If a time series stores only a single parameter (such as temperature), the time series storage is relatively trivial.
 If the time series stores a complex data structure, the storage of the time series will be equally complex.


Storage Basics
 For storage on a computer:
  Data must be reduced to bytes that are written to and read from disk.
  Even in an RDBMS, the same is true.

 In any programming language or RDBMS, there is a set of specific, well-known data types that can be converted directly between bytes and the data type (such as a 32-bit integer or text string).
 Each language and database may use a different way of converting between bytes and data types:
  A 32-bit integer written by Java does not necessarily have the same byte pattern as a 32-bit integer written by Visual Basic.

 The conversion of a data type to bytes is called “serialization” and the reverse is called “deserialization”.
 This is an ongoing issue in computer science and affects all computing applications.
 As long as a single platform performs all operations across the lifecycle, there is no measurable issue.
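
 One common source of such differences is byte order; that this is what the language comparison above has in mind is an assumption. The sketch below serializes the same 32-bit integer in both byte orders using Python's standard `struct` module and shows that a reader assuming the wrong order recovers a different value.

```python
import struct

value = 1_000_000

big = struct.pack(">i", value)     # big-endian ("network") byte order
little = struct.pack("<i", value)  # little-endian byte order

# Deserialization recovers the value only when the reader assumes the
# same byte order the writer used.
same = struct.unpack(">i", big)[0]
mismatched = struct.unpack("<i", big)[0]
```

 A storage format that fixes the byte order explicitly, as done here with `>` and `<`, avoids depending on the platform that wrote the data.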

 

 The most consistent format across all platforms is text, which is a strong indicator of why XML has been so successful: everything in XML is represented as text.
 The comparison of data (such as during search) requires the processing software to “understand” the data stored.
 Because of this, the storage format used should be aligned with the ultimate patterns of use and the limitations of the storage platforms (for example, maximum allowed field lengths in an RDBMS).


Storage Considerations
 It is critical that storage designers consider:
  Volume (size)
  Access speed (read and write)
  General performance

 If most access will enumerate a data set, the selected storage mechanism should favor that form of access.
 If random access is still needed, then no optimizations for enumeration should be used that make random access unusable.
 This is always a trade-off and must be evaluated on a case-by-case basis.

Time Series Field Concepts
 Each time series may have multiple fields of data collected.
 Each time series may have different fields collected than another time series.
 Given both of these premises, the design of the data fields within a time series may be of considerable importance.
 Time series data may be stored in any number of ways using various technologies.
 In each of these technologies, the time series and the data values are related and may be treated differently based upon the specific technology used.


Single Field Time Series
 This form of time series has a single value collected at each time interval.
  It may be thought of and treated as a basic “value stream” of discrete values for the single field at the fixed interval of the time series.
  The field and storage design for this type of time series only needs to deal with the most primitive anomaly: missing data values.

 Within any time series it must be expected that some individual value points may be corrupt or otherwise missing from the series.

 In any time series that uses IEEE 754 compliant single (32-bit) or double (64-bit) precision floating point numbers, there is a built-in “not a number” (NaN) value.
  In this case, no special handling is required for the time series except to expect that NaN values may be present anywhere within the value stream.
  If a single field time series stores data in another format, such as integer or string values, accommodations must be made for the absence of a value within the value stream.
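
 A minimal sketch of what "expecting NaN values" means for a consumer of the value stream. The function name and sample readings are assumptions for illustration.

```python
import math

def mean_ignoring_missing(values):
    """Average a value stream, skipping NaN placeholders for missing data."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan

# Two readings are missing, but every position in the stream is still occupied,
# so the fixed spacing of the series is preserved.
stream = [15.0, math.nan, 15.4, 15.2, math.nan]
average = mean_ignoring_missing(stream)
```

 Without the filter, a single NaN would propagate through the sum and make the whole average NaN.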

 For the design of single field time series data, there are two basic approaches:
  Time coupled
  Sequential

 A time coupled single value series stores each record within the time series as a (T,V) pair of time (T) and value (V).
  This set of pairs becomes the time series.
 A sequential single value series provides all records within the time series as a stream of values, with only a single time stored indicating the start of the series and a single interval indicating the temporal spacing of the values within the series.
  In this manner, the time series may be thought of simply as an array of values.
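
 The two approaches can be sketched side by side. The class and method names below are hypothetical; the sequential form stores one start time, one interval and an array of values, and the time coupled form is derived from it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SequentialSeries:
    start: datetime
    interval: timedelta
    values: list

    def to_time_coupled(self):
        """Expand into explicit (T, V) pairs."""
        return [(self.start + i * self.interval, v)
                for i, v in enumerate(self.values)]

seq = SequentialSeries(datetime(2024, 1, 1), timedelta(hours=1), [15.0, 15.2, 15.1])
pairs = seq.to_time_coupled()
```

 The time coupled form repeats a timestamp for every value, which costs storage but lets each record stand alone; the sequential form is more compact but depends on the values staying in order.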


Multiple Value Time Series
 Each temporal record within the time series has a set of multiple fields.
 Based on the definition of a time series, all records within a single time series have exactly the same set of fields.
 Each time series defines its own set of fields, which may result in arbitrarily many time series field sets within an organization's corpus of time series data.

 The pattern for storing multiple field time series data can take several forms.
 The most basic form is to treat each field within the time series as a distinct, single field time series.
 This approach isolates each data field as a distinct time series and provides the ability to distribute the storage of each time series to different storage locations.
 There is, however, the overhead of additional storage for the time series metadata.


Hub and Spoke Model
 A basic expansion of the single field time series pattern for multiple fields is to create a “hub and spoke” or “star” pattern for the time series.
 The core time series metadata is recorded as a single entity, with each field modeled as a discrete time series data value stream.


Field Interleaved
 A time series is stored as a series of value streams.
 Each value stream is complete for the time series, containing all values for a single field.
 This model most closely resembles the result of the hub and spoke model, where each parameter is isolated as a series.
 The total time series has one value stream per field, each of which can be easily enumerated.

 If values of multiple fields must be accessed together, there is additional overhead for enumerating multiple streams.
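
 A minimal sketch of a field interleaved layout, with one complete value stream per field. The field names and readings are hypothetical.

```python
from array import array

# One complete value stream per field (field names are hypothetical).
series = {
    "temperature": array("d", [15.0, 15.2, 15.1]),
    "ph":          array("d", [7.1, 7.0, 7.2]),
}

# Enumerating one field touches only that field's stream.
mean_temperature = sum(series["temperature"]) / len(series["temperature"])

# Accessing fields together means walking several streams in step.
records = list(zip(series["temperature"], series["ph"]))
```

 The single-field average reads one array; rebuilding whole records requires coordinating every stream, which is the overhead noted above.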


Field Interleaved Example


Interval Interleaved
 The fields are stored in order within each temporal interval.
 This makes the temporal interval the primary unit of separation between data records.
 Within a single temporal record, the fields are consecutive in a pre-defined order.
 Interval interleaved storage provides rapid enumeration of the time series when all fields are used in the enumeration.
 If it is most common to enumerate the time series to access only a single parameter, there is overhead in transporting and skipping unused fields to access the required field.
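
 A minimal sketch of an interval interleaved layout, where each interval is one fixed-size record holding all fields in a pre-defined order. The two fields (temperature, then pH) and the little-endian double encoding are assumptions for illustration.

```python
import struct

RECORD = struct.Struct("<dd")   # temperature, pH: order fixed in advance

rows = [(15.0, 7.1), (15.2, 7.0), (15.1, 7.2)]
blob = b"".join(RECORD.pack(*row) for row in rows)

# Enumerating all fields reads one record per interval, in order.
decoded = list(RECORD.iter_unpack(blob))

# Reading a single field still steps over the unused field in every record.
ph_only = [RECORD.unpack_from(blob, i * RECORD.size)[1] for i in range(len(rows))]
```

 Every record is 16 bytes here, so reading only pH still transports the 8 bytes of temperature in each interval.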


Interval Interleaved Example


Coupled Interleaved
 For any time series where general enumeration involves specific known groups of fields, a hybrid of field and interval interleaving may be used.
 It allows groups of fields to be represented as field interleaved, with the remainder of the dataset interval interleaved.
 It provides fast enumeration for the coupled fields while avoiding the cost of skipping unused fields.
 If the coupling of fields is not known at design time, this representation is difficult to plan for.

 Use of this pattern has the overhead of both interleaving methods if enumerating uncoupled fields (e.g. field 1 and field 5 in the example).


Coupled Interleaved Example


RDBMS Storage Patterns

RDBMS Storage

 Time series data can be stored in a number of ways within an RDBMS.
 Time series data may be stored as temporal records, one value per row.
 Likewise, time series data can be compacted into a single field and stored as a binary large object (BLOB) or XML.


Flat Temporal
 In the flat temporal model of storing time series data, there is no notion of a time series.
 All data is simply stored as temporal records.
 This is the simplest method of storing temporal data overall.
  Provides good performance for random access
  Suffers from poor insert performance (mainly when indexed)
  Has relatively slow sequential access performance due to the table-scan nature of retrieval


Flat Time Series
 Each time series is “registered” in a time series table that defines only the time series reference information (metadata).
 All the actual data for the time series is stored in a values table.
  Each record in the values table stores a single time series record (point in time).
 In most cases, each time series will have a different set of fields and will therefore be best represented by a separate values table.
 This results in a single “master” time series table and multiple values tables.
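
 A minimal sketch of the flat time series layout using Python's built-in `sqlite3` module. The table names, column names and sample rows are all hypothetical; any RDBMS could be substituted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_series (
        series_id        INTEGER PRIMARY KEY,
        name             TEXT NOT NULL,
        start_time       TEXT NOT NULL,
        interval_seconds INTEGER NOT NULL
    );
    CREATE TABLE water_quality_values (
        series_id   INTEGER NOT NULL REFERENCES time_series(series_id),
        seq         INTEGER NOT NULL,   -- position within the series
        temperature REAL,
        ph          REAL,
        PRIMARY KEY (series_id, seq)
    );
""")
conn.execute(
    "INSERT INTO time_series VALUES (1, 'river gauge', '2024-01-01T00:00:00', 300)"
)
conn.executemany(
    "INSERT INTO water_quality_values VALUES (1, ?, ?, ?)",
    [(0, 15.0, 7.1), (1, 15.2, 7.0), (2, 15.1, 7.2)],
)

# Query by data value using plain SQL.
warm = conn.execute(
    "SELECT seq, temperature FROM water_quality_values "
    "WHERE series_id = 1 AND temperature > 15.05 ORDER BY seq"
).fetchall()
```

 A second field set, such as sediment data, would get its own values table while sharing the same master time series table.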


Flat Time Series Example
 Provides similar performance characteristics to the flat temporal model.
 The time series table allows for retrieval based upon a specific time series instance.
 Allows a long-term time series (such as continual monitoring) to be identified in a single values table.

 If random access to data is the most common form of use, this model will yield the best overall performance characteristics and allow for query by data values with no special software capabilities.


Entity Time Series
 An entire time series is treated as an entity.
  Individual data values are treated simply as atoms within the entity.
  The time series is stored as a single record in a database table.

 Entity time series storage has multiple “flavors”, each with subtle differences that improve some aspect of the time series storage size or performance:
  Flat BLOB
  Flat XML
  External File
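
 A minimal sketch of the flat BLOB flavor, again using Python's built-in `sqlite3` module. The table and column names are hypothetical, and the values are encoded as 64-bit floats in the machine's native byte order, which is an assumption that only holds when the same platform reads and writes the data.

```python
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE series_entity (
        series_id        INTEGER PRIMARY KEY,
        start_time       TEXT NOT NULL,
        interval_seconds INTEGER NOT NULL,
        data             BLOB NOT NULL
    )
""")

# The whole series becomes one row; the values are opaque to the database.
values = array("d", [15.0, 15.2, 15.1])
conn.execute(
    "INSERT INTO series_entity VALUES (?, ?, ?, ?)",
    (1, "2024-01-01T00:00:00", 300, values.tobytes()),
)

(blob,) = conn.execute(
    "SELECT data FROM series_entity WHERE series_id = 1"
).fetchone()
restored = array("d")
restored.frombytes(blob)
```

 Complete delivery is a single row read, but the database can no longer search by data value without decoding the BLOB.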


Entity Time Series Example


Dynamic Time Series
 A further refinement for RDBMS storage of time series data is to dynamically structure the storage rather than use fixed elements as in the previous methodologies.
 Dynamic time series storage is a broad class of methodologies that attempt to gain advantages in performance and size for managing time series data within an RDBMS.
 In all dynamic time series storage strategies, data within the values fields may be encoded as BLOB or XML data.
 In dynamic storage, the time series is simply broken into multiple individual records, each of which contains multiple data values.


Fixed Size Dynamic Storage
 Each record has a target size limit (e.g. 100 KB, 10 MB) for the field storing data values.
 The data value encoding software is responsible for breaking the time series into “chunks” of data values that do not exceed this size limit.
 The goal is to encode as many discrete values as possible, in time order, without exceeding this size limit.
 In this manner, each record will contain all values between a minimum and maximum time.
 There are two basic sub-strategies for fixed size time series storage:
  Time Window
  Entity Window


Fixed Size Dynamic Example
 In the time window strategy, the time series values table maintains a start date and end date for each time series record, indicating the bounds of the data stored within that record.

 The entity window strategy is very similar, except that if the time series records are all of fixed size, the exact maximum number of data values that can be stored within a single record is known a priori.
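
 A minimal sketch of fixed size chunking with time window bounds. The function name, the 8-byte little-endian float encoding, and the use of plain numbers for times are assumptions for illustration.

```python
import struct

def chunk_fixed_size(start, interval, values, max_bytes):
    """Split values into records whose encoded size stays within max_bytes.

    Each record is (window_start, window_end, payload), with the values
    encoded as little-endian 64-bit floats.
    """
    per_record = max_bytes // 8          # 8 bytes per encoded value
    if per_record < 1:
        raise ValueError("max_bytes is too small to hold a single value")
    records = []
    for first in range(0, len(values), per_record):
        part = values[first:first + per_record]
        payload = struct.pack(f"<{len(part)}d", *part)
        window_start = start + first * interval
        window_end = start + (first + len(part) - 1) * interval
        records.append((window_start, window_end, payload))
    return records

# Ten values at a five-unit interval, at most 32 bytes (four values) per record.
records = chunk_fixed_size(start=0, interval=5, values=[1.0] * 10, max_bytes=32)
```

 Because every value has the same encoded size in this sketch, the number of values per record is known in advance, which is the condition the entity window strategy relies on.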


Entity Window Computation
 In this strategy, the time series itself indicates the number of values stored within a record, and the “offset” to any value is computed from that number.

 Once the computation is completed:
  recordOffset indicates the sequenceId (zero-based) of the record containing the value
  elementOffset indicates which value within the record is to be returned
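
 The computation itself is not reproduced in this text. One reading consistent with the descriptions of recordOffset and elementOffset, assuming a zero-based position in the series and a fixed number of values per record, is integer division with remainder. The function and parameter names below are hypothetical.

```python
def locate(index, values_per_record):
    """Map a zero-based position in the series to (recordOffset, elementOffset)."""
    record_offset, element_offset = divmod(index, values_per_record)
    return record_offset, element_offset

# With 1000 values per record, find the value at zero-based position 2599.
record_offset, element_offset = locate(2599, 1000)
```

 Under these assumptions the value is element 599 of the record with sequenceId 2.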

 

 The need for this computation makes random access possible but slightly computationally costly.
 For enumeration of data, there is no such overhead cost.


Fixed Entities Dynamic Storage
 The fixed entities dynamic storage strategy is similar to fixed size storage, except:
  Each record contains an exact number of interval records
  This holds regardless of the size required to store that number of records

 It is imperative that the number of records stored fits within the constraints of the underlying RDBMS.
 This form of storage is most similar to the entity window strategy of fixed size storage.


Conclusion
 Every organization must evaluate its information strategy and time series data needs to ensure that adequate planning and effective implementations support an effective lifecycle for all users.
 There are many considerations for each type of time series data that makes up the organizational information corpus.
 Data modeling and implementation planning are critical to ensure the proper entities are captured in a repeatable, standardized and maintainable manner.
 Time series data can be reduced to a simple set of concepts and a small set of general patterns for implementation.
 Each actual time series data set within the organization can use these concepts and patterns to create an effective and efficient implementation that can be reused over the organization's lifetime.
 Each time series data type will need to be evaluated separately and the most effective storage patterns used.

Questions
There are no silver bullets