TIME INDEXING

The flexible framework for time-based applications.

Time Indexing — An Introduction

Dr Stuart Clayman

Time Index Technologies

sclayman@timeindexing.com

March 2003

Introduction

This document gives an overview of the "Time Index Technologies" time-indexing architecture. The architecture provides a framework for building new time-indexed applications, or for adding time-indexed capabilities to existing applications.

In a time-indexed system, all data is stored by recording the time the data was put into the system together with the data itself. Data is retrieved by asking for a data item at a particular time, or by asking for a list of data items in a specified time interval, or by asking for a set of data items based on some selection criteria. From the basic operations, more powerful time-based functionality can be achieved. These include: compositing, where selections of data from many time-indexed sources are collated and pulled together to form a new time-index; merging, where data from many time-indexed sources are combined solely on the basis of a common time value; and cross-referencing, where data in one time-indexed source can be accessed and aligned based on the time in another time-indexed source.
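The basic operations described above can be sketched in a few lines of Python. This is only an illustration: the class name `TimeIndex`, its method names, and the choice that "a data item at a particular time" means the latest item at or before that time are all assumptions, not the product's actual interface.

```python
from bisect import bisect_left, bisect_right


class TimeIndex:
    """Hypothetical in-memory time-ordered container (illustration only)."""

    def __init__(self):
        self._times = []  # timestamps, kept in non-decreasing order
        self._data = []   # data items, parallel to _times

    def append(self, timestamp, data):
        # Data is stored together with its time, always at the end.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("timestamps must not go backwards")
        self._times.append(timestamp)
        self._data.append(data)

    def at(self, timestamp):
        """Return the latest item recorded at or before `timestamp`."""
        pos = bisect_right(self._times, timestamp)
        if pos == 0:
            raise KeyError("no data at or before this time")
        return self._data[pos - 1]

    def between(self, start, end):
        """Return (time, data) pairs with start <= time < end."""
        lo = bisect_left(self._times, start)
        hi = bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._data[lo:hi]))

    def select(self, predicate):
        """Return (time, data) pairs whose data satisfies `predicate`."""
        return [(t, d) for t, d in zip(self._times, self._data) if predicate(d)]


# Made-up stock prices at times 10, 20 and 30.
prices = TimeIndex()
prices.append(10, 101.5)
prices.append(20, 102.0)
prices.append(30, 99.75)
```

Because the timestamps are kept in order, retrieval by time or by interval is a binary search rather than a scan; only selection by criteria has to look at every item.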

The applications themselves may have many different forms. They may be in the areas of multi-media such as video and audio, financial data applications such as stock price analysis and visualization, control systems and monitoring data, log file processing, and versioning of files.

This document will present the main features of time-indexing, including what the technology of time-indexing is, the main properties of time-indexing, together with some scenarios in which time-indexing can be used in real day-to-day situations.

What Is The Technology?

The design is based on proven academic research that is being brought to market and commercialised. The original research had a narrow focus, namely, to record and play back Internet-based video conferences. This commercial version has a broader scope than the original work and encapsulates ideas from commercial applications used in the financial and data analysis arenas which deal with time-series data. This has been done in order to increase the applicability of time-indexing to a much wider range of data sets and applications. The concept of time-indexing was novel when the original research was done, and with the inclusion of the extra ideas the result is cutting-edge technology.

The design of time-indexing defines two main things: first, the set of operations on the container; second, the set of data formats that the container can be in. The fundamentals of time-indexing technology are, therefore, programming language independent. From the definitions of the operations and the data formats, a language-specific binding can be written which allows applications to utilize time-indexing in virtually any programming language.

Time-indexing has a set of properties and features which allow applications to store and access data held at particular times. The data itself is held in a time-indexed container, and each application is free to process the data it retrieves in an application-specific way. Time-indexing does not interfere with the data itself, but provides a time navigation mechanism to get to data.

Time-indexing is not a specific application in its own right; rather, it is a shell for building time-based applications. In the same way, a traditional database is not a specific application, but a shell for building data-centric applications. In both cases, if there is no data and no application code, then neither has any specific function. It is only when data is put in and application code is written that some useful function emerges.

In time-indexing, time is the primary item, and time-indexing is specialized for time-based tasks. Everything is presented as streams of time, with data attached at particular times. More general storage mechanisms, such as databases, make many compromises to do the task of data manipulation well, but do time-indexing poorly.

The Architecture

The implementation itself has a layered architecture, with an application independent core providing a foundation for various kinds of applications suitable for different business domains.

The diagram below shows the architecture:

The three lowest levels, which are the indexing core, provide an implementation of the set of operations on the container, and the set of formats for the container.

Layer one is how the time-index is formatted on a disc. It is a static entity, being just the bytes on the disc.
Layer two provides input and output services, to get the data to and from the disc and into memory.
Layer three is the implementation of the container operations. The implementation can be in any programming language as the format on disc is language independent.
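To make the split between the lowest layers concrete, the following Python sketch writes and reads records in an invented byte layout. The actual on-disc format is not described in this document, so the header fields, their sizes, and the type tags `TEXT` and `IMAGE` are purely hypothetical. The point is only that a fixed, language-independent byte layout (layer one) can be read and written by a small input and output layer (layer two) in any language.

```python
import io
import struct

# Hypothetical record header, big-endian with no padding:
# 8-byte signed timestamp, 2-byte type tag, 4-byte payload length.
HEADER = struct.Struct(">qHI")


def write_record(stream, timestamp, type_tag, payload):
    """Append one record: the header followed by the raw payload bytes."""
    stream.write(HEADER.pack(timestamp, type_tag, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield (timestamp, type_tag, payload) for each record in the stream."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of data
        timestamp, type_tag, length = HEADER.unpack(header)
        yield timestamp, type_tag, stream.read(length)


TEXT, IMAGE = 1, 2        # invented type tags
disc = io.BytesIO()       # stands in for a file on disc
write_record(disc, 1000, TEXT, b"server started")
write_record(disc, 2500, IMAGE, bytes([0x47, 0x49, 0x46]))
disc.seek(0)
records = list(read_records(disc))
```

Each record carries its own type tag and length, so elements of different types and sizes can sit side by side in the same index.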

The next two levels are for index presentation. In these layers more complex operations on time-indexes are implemented.

Layer four is for plug-in components that provide a bridge between the domain of time-indexing and the application domain. Such plug-in components include data specific recorders and players, that rely on the indexing core for their operation. Examples might be a log file recorder, which would add one log entry at a time, or an audio player, which would select a single audio sample from the time index and play just the one sample. Also at this level are plug-in matching components which inspect or match the data but do not attempt to play it back.
Layer five implements the more complex operations that give time-indexing much of its power. These include operations such as merging, cross-referencing, and compositing.
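The three complex operations can be sketched as follows, treating each index as a plain Python list of (timestamp, data) pairs already in time order. The function names and the sample data are assumptions made for illustration, and the real operations may behave differently in detail.

```python
from bisect import bisect_right
from heapq import merge as heap_merge

# Two made-up indexes: audio chunks and the slides shown during a talk.
audio = [(0, "a0"), (40, "a1"), (80, "a2")]
slides = [(10, "title"), (60, "results")]


def merge_indexes(*indexes):
    """Merging: combine sources solely on the basis of their time values."""
    return list(heap_merge(*indexes, key=lambda item: item[0]))


def cross_reference(times_from, target):
    """Cross-referencing: for each time in one index, find the entry
    that was current in another index at that time (None if there is none)."""
    target_times = [t for t, _ in target]
    result = []
    for t, _ in times_from:
        pos = bisect_right(target_times, t)
        result.append((t, target[pos - 1][1] if pos else None))
    return result


def composite(selections):
    """Compositing: collate chosen (index, start, end) selections
    from many sources into one new index."""
    chosen = [[(t, d) for t, d in index if start <= t < end]
              for index, start, end in selections]
    return merge_indexes(*chosen)
```

For example, `cross_reference(audio, slides)` answers the question "which slide was showing when each audio chunk started?".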

The topmost layer is for the applications themselves. These applications utilize the core operations on time-indexes, the complex operations such as merging, compositing, and cross-referencing, and access the plug-in components. At this level many kinds of application can be written.

Properties of Time-Indexing

An overview of the technology has been presented, together with the architecture of time-indexing. This section discusses the properties of time-indexing in more detail.

Time-indexing has a set of operations on a data container and stores its data using its well-defined formats for all the time-based data. Everything is presented as streams of time, with data attached at particular times. The difference between the time-indexing architecture, presented here, and existing approaches to storing data where time is a key element, is that time-indexing provides a consistent and coherent framework for doing any time-based selection and processing.

The time-indexes rely on the fact that time is well ordered, and this ordering is maintained by an index. In fact, they are time-ordered containers of data. They give access to data at given timestamps, but do nothing special with the data. The indexes themselves are data agnostic. This is an unusual property for a data container.

From Nanoseconds to Millennia

Time-indexes have been presented in general terms, but in this section a more detailed discussion is made.

Time for most people is usually represented by a value on a clock or a watch, and usually constitutes an hour and a number of minutes. Sometimes people consider the date, with the year, the month, and the day, to be correlated with the time. In most instances, this is an adequate representation of time. There are situations, however, where time has to be considered on a different scale from the usual perception. Time may be viewed at a very small scale of increments, resolving to microseconds or nanoseconds, or at a very large scale resolving to millions of years.

The time values allowed in this technology encapsulate all of the above scenarios. Therefore, a time-index can store data with timestamps being seconds apart, or it can store data with timestamps just microseconds apart. Alternatively, it can store data with timestamps that are thousands of years apart. Such a wide range of timestamps gives this technology a broad range of applicability where time is involved.
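One way to see how a single representation can span such a range is to count whole nanoseconds in an integer that is allowed to grow without limit. The sketch below does this in Python, whose integers are unbounded. It is an assumption made for illustration, not a description of how the product stores its timestamps, and the year length used is an approximation.

```python
NANOS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 31_557_600  # 365.25 days, an approximation


def from_seconds(seconds, nanos=0):
    """Timestamp as a whole number of nanoseconds."""
    return seconds * NANOS_PER_SECOND + nanos


def from_years(years):
    """A span of years expressed in the same nanosecond unit."""
    return years * SECONDS_PER_YEAR * NANOS_PER_SECOND


# Two times of day only microseconds apart (17:20:06 is 62406 seconds after midnight).
packet_a = from_seconds(62406, 748_556_000)  # 17:20:06.748556
packet_b = from_seconds(62406, 748_618_000)  # 17:20:06.748618

# A span of a million years, held in exactly the same unit.
long_span = from_years(1_000_000)
```

Values at both extremes compare and sort with ordinary integer arithmetic. Note that a million years in nanoseconds no longer fits in a 64-bit integer, which is one reason fixed-width timestamp types struggle with this range.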

By having such accuracy and spread in the times held in a time-index, data can be presented back to the user at that level of accuracy. This means that it becomes possible for an application writer to choose how to present data to end-users. It can be done in bulk, like a report, or it can be done in real-time. That is, data is presented an item at a time, where consecutive items are separated by a gap equal to the gap between their timestamps. Such real-time data presenters are ideal in multi-media or simulation frameworks.
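The difference between bulk and real-time presentation can be sketched as below. The function `replay` and its `speed` parameter are hypothetical; the sketch assumes the entries are (timestamp in seconds, data) pairs in time order.

```python
import time


def replay(entries, sleep=time.sleep, speed=1.0):
    """Yield data items one at a time, pausing between consecutive items
    for the gap between their timestamps (divided by `speed`)."""
    previous = None
    for timestamp, data in entries:
        if previous is not None:
            sleep((timestamp - previous) / speed)
        previous = timestamp
        yield data


def report(entries):
    """Bulk presentation: hand back every item at once."""
    return [data for _, data in entries]
```

Passing a different `sleep` function, or a `speed` greater than one, lets the same player drive a simulation faster than real time.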

If an application needs the type of the data that is being held in a time-index, this can be held as well. Whether it be a number, a string, a text file, an image, an audio segment, or a video frame, all of these can be stored. The type held is not a feature of the whole index; each individual element in a time-index can hold the type of the data held for that entry, even if the type varies. Furthermore, the size of the data for each element is not limited. Each element can hold data that is a different size from any of the other elements. Finally, the total number of time and data values that can be held in each index is also unlimited, meaning that the number of elements can run to billions.

Basically, a time-index can hold a few items of data or can expand to be a massive data set. The times held are resolved to any accuracy required. Such attributes mean that it becomes possible to reconsider how data is stored, what data is stored, and for how long data is stored.

Data Security and Data Integrity

Once data is in an index it cannot be changed; it is immutable. There are no operations to change data held at a particular timestamp. Data can only be appended to the end of a time-index. Also, there are no operations to change timestamps; they too are immutable.

The lack of modify or update operations may seem like a major drawback, but rather, the converse applies. The advantage is that data is secure as it can never be altered. With this attribute, data integrity is also maintained, as it is not possible to take parts of data away.

Consider how today's computer systems usually replace existing data with the latest version. The original data is considered to be out-of-date, and its value is lost forever. Both data security and integrity are lost when the update occurs. When using time-indexing, rather than changing a data value, a new value is appended to the index at the time the change occurs. With the time-indexing technology presented here, every version of the data can be saved. Asking the time-index for the latest version of the data gets the most up-to-date value, but one can go back in time and find previous values. No data is ever lost.
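A minimal sketch of this append-only behaviour is given below. The class `VersionHistory` and its methods are invented for illustration, and the stored values are made up. There is deliberately no method to update or delete an entry.

```python
from bisect import bisect_right


class VersionHistory:
    """Hypothetical append-only history of a single value."""

    def __init__(self):
        self._times = []
        self._values = []

    def record(self, timestamp, value):
        # A change is stored as a new entry; nothing is overwritten.
        if self._times and timestamp <= self._times[-1]:
            raise ValueError("new versions must be later than existing ones")
        self._times.append(timestamp)
        self._values.append(value)

    def latest(self):
        """The most up-to-date value."""
        return self._values[-1]

    def as_of(self, timestamp):
        """The value that was current at `timestamp`, or None if none existed."""
        pos = bisect_right(self._times, timestamp)
        return self._values[pos - 1] if pos else None


address = VersionHistory()
address.record(2001, "12 High Street")
address.record(2003, "4 Mill Lane")
```

Here `address.latest()` gives the current value, while `address.as_of(2002)` goes back in time to the value that applied before the change.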

Sharing and Overlaying

By having a well-defined set of operations and a set of common formats, the need for different formats to hold time related data can be significantly reduced. Data which was in a file and originally intended for just one tool, can be utilized in a range of tools by using plug-in components such as data formatters. The number of tools that can access the same data is increased. Time-indexing is, therefore, a catalyst for a high degree of data sharing.

Data sharing not only means that different tools can utilize a wider range of data sets, it also allows the same data set to be used concurrently by more than a single tool. It also means that the same data can be viewed by different tools without the intermediate step of converting and copying data from one format to another to suit the tool.

Time-indexing not only promotes data sharing, it also promotes data overlaying, where data sets are aggregated and visualized together in the same tool. By overlaying previously unrelated data it becomes possible to observe and expose new patterns and relationships in that data. These patterns and relationships may not have been previously understood as no formal connection had been made between them.

By converting existing applications to use time-indexing they can be made to interact with data originally destined for other applications. New application areas which have not traditionally held time data, can be augmented with time-indexing to bring about an increase in application functionality and effectiveness for such applications.

Continuous vs Discrete

Once time-indexing has been added to an application, the way the time-index is used will vary from application to application. As the data in these applications can have different internal structures, the time-indexing can be used differently. The main variance in the way time-indexing is used for data depends on whether that data is continuous, such as multi-media streams, or discrete, such as stock prices or log file entries. Continuous data is time sensitive within the data; discrete data is time sensitive for each piece of data.

An example of continuous data is audio data. One audio file can be split into many subcomponents (audio samples), with each audio sample having its own time-index item. In this case, the single audio file will be divided such that there are many time-index items for the one audio file.

An example of discrete data is the stock price. A stock price system could hold each value of the price, with each version being indexed within one time-index item. In this case, there is one version of the stock value for one time index item.
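The two cases can be contrasted in a short sketch. The function `index_continuous`, the sample rate, and the chunk size are assumptions chosen to keep the arithmetic simple, and the data values are made up.

```python
def index_continuous(samples, sample_rate, chunk_size):
    """Split one stream of samples into many (start_time_seconds, chunk) items."""
    return [(start / sample_rate, samples[start:start + chunk_size])
            for start in range(0, len(samples), chunk_size)]


# Continuous: one audio file (ten stand-in samples) becomes several index items.
audio = list(range(10))
audio_index = index_continuous(audio, sample_rate=4, chunk_size=4)

# Discrete: each price is already a complete item with its own time.
price_index = [(0.0, 101.5), (1.5, 102.0)]
```

In the continuous case the timestamps are derived from the position within the data, whereas in the discrete case each value arrives with its own timestamp.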

Summary

To summarize, time-indexing assumes that time is ordered, with times ranging from millions of years down to nanoseconds. The indexes themselves can hold any kind of data, with billions of separate data items being held. The main properties of time-indexing are time and data immutability ensuring data security and data integrity, data sharing, concurrent access, data overlaying, plus the ability to hold both continuous and discrete data.

Other Data Storage Technology

In this section there is a brief description of tools and applications that are either designed for time-series or have a capability to hold time and data.

The one that most people know best is the relational database. Relational databases are set (or relation) based. They work by joining sets, selecting columns and rows from sets to form new sets. They are not specialized for any particular purpose, but are general purpose. Ordering of data can be requested, but has to be done on demand.

In applications that are time-based it is possible to use the heuristic that time is strongly ordered to provide specialized optimizations. The lack of speciality in this area means that relational databases can fail to behave well or perform adequately for ordered data.

Joe Celko, the well-known author of SQL books and articles, writes in one of his articles:

"SQL is not good at things that assume an ordering, such as time series. If you have read any of my books, you know that a time series usually involves self-joins, with two or three copies of the table representing past, present, and future events relative to each point in time."

Relational databases do not just fail for time-indexing but in various other areas too. This has been seen often in the area of free text systems, where large volumes of text are processed. Relational databases seem unable to perform the search and matching operations sufficiently well. Only specialized text engines are able to do well in this area.

Another failing of relational databases is that for them to do ordering, the database has to know the type of the data. Relational databases do have a TIMESTAMP type, but the range of this is too limited. This makes relational databases unsuitable where timestamps have to be very accurate or very large.

The tools that have been specifically designed for time series applications have been optimized to fit in with the needs of the financial markets. Examples of these include FAME, Kx, S-Plus, and Vision. They can process integer, string, and money data values very powerfully. Most use vector processing primitives to manipulate the data values in bulk. They have many built-in grouping and statistical functions that allow financial data to be analysed, producing reports on things like moving averages of stock prices, or performing auto-regression for forecasting.

The disadvantage of these time series data systems is that they are unable to deal with large text data, or with binary data such as images, audio, or video, in the time series. Furthermore, the resolution of their timestamps has been optimized for stock price data. They generally do not resolve times below one second granularity, or above years. They are, therefore, only suitable for applications where times are within the usual human perception.

Usage Scenarios

In this section there is a presentation of usage scenarios for the time-indexing architecture. The first scenario is for discrete data and considers the issue of log file management. The second scenario is for continuous data and considers the issue of downloading audio data from the web.

Log File Processing

Some application areas have always held time information for each of the data items held. One example is log file data, where applications write special entries into a log file. For each entry there is a field in the line which is the time and date of the log entry. The way to access that time and the associated data is dependent on the format of the log file. Each application may write its log files using a different format. Having data in different formats has the consequence that there are different mechanisms or tools for each of the formats. Each tool can only utilize data in a familiar format. A tool for one format will generally not work with log file data in another format. To do identical time-based tasks on log files in different formats requires re-inventing the wheel for each format. Time-indexing solves this problem by having the time-based operations defined on the container for any time-based data.

The log files shown for this example are for a mail server and for a web server. Both applications will add an entry in their logs when a significant event occurs. The entries are discrete and individual pieces of data. By looking at the formats of the two log files presented below, it is apparent that the way the time is presented and where it is in each individual entry is different.

Below is a sample from a mail server log file:

Mar 17 14:01:03 netvista postfix/pickup[21336]: 0779CAB0B: uid=0 from=
Mar 17 14:01:03 netvista postfix/cleanup[21598]: 0779CAB0B: message-id=<20030317140102.0779CAB0B@mail.timeindexing.com>
Mar 17 14:01:03 netvista postfix/nqmgr[3029]: 0779CAB0B: from=, size=664, nrcpt=1 (queue active)
Mar 17 14:01:04 netvista postfix/local[21600]: 0779CAB0B: to=, relay=local, delay=2, status=sent ("|/usr/bin/procmail -Y -a $DOMAIN")
Mar 17 14:59:21 netvista postfix/smtpd[21845]: connect from unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/smtpd[21845]: 0E1C7AAAD: client=unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/cleanup[21847]: 0E1C7AAAD: message-id=<1047913159.2358.73.camel@netvista.timeindexing.com>
Mar 17 14:59:22 netvista postfix/smtpd[21845]: disconnect from unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/nqmgr[3029]: 0E1C7AAAD: from=, size=821, nrcpt=1 (queue active)
Mar 17 14:59:26 netvista postfix/smtp[21849]: 0E1C7AAAD: to=, relay=smtp.nildram.co.uk[195.112.4.54], delay=4, status=sent (250 Ok: queued as CA0781E22B4)

The time is at the start of the line and has the format: month day time. An example is Mar 17 14:01:03. The rest of the data is related to a mail message.

Below is a sample from a web server log file:

81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/ HTTP/1.1" 200 1341 "http://www.timeindexing.com/timeindexing/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/favicon.gif HTTP/1.1" 404 327 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/movieonly.jpg HTTP/1.1" 404 329 "http://www.timeindexing.com/docs/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/summary.html HTTP/1.1" 200 11243 "http://www.timeindexing.com/docs/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/print.css HTTP/1.1" 200 418 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/normal.css HTTP/1.1" 200 455 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/diagrams/architecture.gif HTTP/1.1" 200 7268 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/diagrams/web-audio.gif HTTP/1.1" 200 3407 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"

The time is at the middle of the line and has the format: [day/month/year:time +0000]. An example is [17/Mar/2003:16:46:54 +0000]. The first field is the IP address of the machine that connected to the web server, and the fields after the time are the page that was requested from the server.

A final example is from a network packet sniffer log file:

17:20:06.748556 63.236.73.20.http > netvista.timeindexing.com.3867: . 1:1409(1408) ack 152 win 5792 <nop,nop,timestamp 627422282 62477830> (DF)
17:20:06.748618 netvista.timeindexing.com.3867 > 63.236.73.20.http: . ack 1997 win 8448 <nop,nop,timestamp 62478301 627422282> (DF)
17:20:06.772297 205.156.51.200.http > netvista.timeindexing.com.3868: . 501:1001(500) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478293>
17:20:06.772351 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1001 win 7500 <nop,nop,timestamp 62478303 1056862170> (DF)
17:20:06.780926 205.156.51.200.http > netvista.timeindexing.com.3868: . 1001:1501(500) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478293>
17:20:06.780990 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1501 win 8500 <nop,nop,timestamp 62478304 1056862170> (DF)
17:20:06.869071 205.156.51.200.http > netvista.timeindexing.com.3868: P 1501:1710(209) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478303>
17:20:06.869137 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1710 win 8500 <nop,nop,timestamp 62478313 1056862170> (DF)
17:20:07.338818 205.156.51.200.http > netvista.timeindexing.com.3868: P 1710:1715(5) ack 161 win 65500 <nop,nop,timestamp 1056862171 62478313>
17:20:07.338871 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1715 win 8500 <nop,nop,timestamp 62478360 1056862171> (DF)

The time is also at the start of the line, but in this case the format is: hour:minute:seconds.microseconds. An example is 17:20:06.748556. This time format has no year, or month, or day. There is no way to select data by time outside of a very narrow scope.
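To show what a data-specific recorder has to do with these three formats, the sketch below converts each one into a common timestamp using Python's standard `datetime` module. The function names are hypothetical. Because the mail and sniffer formats omit part of the date, the missing year or day has to be supplied by the caller, and the sketch assumes those two logs are written in UTC.

```python
from datetime import date, datetime, timezone

# Note: %b matches English month abbreviations only under an English or C locale.


def parse_mail_time(line, year):
    # "Mar 17 14:01:03 ..." carries no year, so the caller must supply one.
    stamp = " ".join(line.split()[:3])
    parsed = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: logged in UTC


def parse_web_time(line):
    # "... [17/Mar/2003:16:46:54 +0000] ..." is complete, including the offset.
    stamp = line[line.index("[") + 1:line.index("]")]
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


def parse_sniffer_time(line, day):
    # "17:20:06.748556 ..." is a time of day only, so the caller supplies the date.
    clock = datetime.strptime(line.split()[0], "%H:%M:%S.%f").time()
    return datetime.combine(day, clock, tzinfo=timezone.utc)  # assumption: UTC
```

Once all three kinds of entry carry the same kind of timestamp, they can be ordered, selected, and compared with one another regardless of where they came from.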

One of the major concerns with management of servers such as mail servers and web servers is what to do with the log files. Many system managers keep the logs for each day separately, and after 7 days the logs are removed. However, this approach fragments the log files, which contain much useful data, and eventually throws that data away. This makes it difficult to do long and medium term statistics and analysis on server usage and behaviour. Other system managers realise that the fragmentation is problematic, and that 7 days' worth of log data is not enough. They resort to keeping one large log file. These managers now have the problem of how to get selections of data, say for a particular day or a particular hour, out of the log file in order to process it further. The times and dates in the log file are hard to process in their textual form. Furthermore, any scheme used to get data out of one log file will not work with a different log file because the time formats are different. The format of dates in the mail server logs is different from that in the web server logs, which in turn is different from that in the network sniffer logs.

This is where time-indexing comes in. By using a time-index as a container for the log file, data selections for periods of time can be easily retrieved. This is because time-indexing has the operations required to process times. The system manager can keep log files as long as he needs because any time-based selection can be easily made. Moreover, a time-based selection for one indexed log will work on any other indexed log. That is, if he requests all the data between midnight on March 1st and midnight on April 1st for one indexed log, only those values will be presented. This will work for all indexed logs.
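The March selection described above might look like the following sketch, where each indexed log is modelled as a time-ordered Python list of (time, entry) pairs. The function `select_period` and the shortened log entries are invented for illustration.

```python
from bisect import bisect_left
from datetime import datetime


def select_period(indexed_log, start, end):
    """Return the entries with start <= time < end from a time-ordered log."""
    times = [t for t, _ in indexed_log]
    return indexed_log[bisect_left(times, start):bisect_left(times, end)]


# Made-up, abbreviated entries standing in for indexed mail and web logs.
mail_log = [
    (datetime(2003, 2, 28, 23, 59, 0), "mail: queue active"),
    (datetime(2003, 3, 17, 14, 1, 3), "mail: status=sent"),
    (datetime(2003, 4, 1, 0, 0, 0), "mail: connect"),
]
web_log = [
    (datetime(2003, 3, 17, 16, 46, 54), "web: GET /docs/"),
    (datetime(2003, 4, 2, 9, 0, 0), "web: GET /docs/summary.html"),
]

# Midnight on March 1st up to, but not including, midnight on April 1st.
march = (datetime(2003, 3, 1), datetime(2003, 4, 1))
march_mail = select_period(mail_log, *march)
march_web = select_period(web_log, *march)
```

The same call works on both logs because the selection is made on the timestamps alone and never looks inside the entries.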

Although the indexed log may get quite large over time, this is not really an issue given the size of discs these days. Of more importance are the benefits of using time-indexing. The data security and data integrity of the content of the log files are maintained. It becomes possible to make any time-based selection of entries from the indexed logs. Most importantly, the effort, and therefore the cost, of managing log file data is reduced.

Media Download on the Web

To highlight the power of time-indexing for continuous data consider the issue of media playback on the web. The current approach is to click on a link in a web page and wait for the data to arrive at the browser. A media server can send the whole media file before it is played, or stream it to the browser and have the media played more immediately. The media is usually presented to the user in a player that has an interface similar to a CD player with various buttons and a slider which can be used to determine the position to play from. At first the slider is at the beginning of the media because no data has been received by the player.

At some time later the user can choose to move the slider to any point in the media, and playback is started from that point. For example, if the user wishes to play from the middle, the slider can be dragged to the middle point of the slider. The main disadvantages are that the control the user has is very small and the resolution of the slider is coarse, in that it is nearly impossible for the user to choose an exact spot to start playing from. To find a particular piece usually requires quite a bit of trial and error movement before success is had.

The process described above often happens, even for a short 3 minute audio file. It is massively exacerbated for a large presentation that may be an hour long. Imagine the difficulty of wanting to see 30 seconds of playback within that hour. Such a desire is nearly impossible using current playback technology. This is where the introduction of time-indexing can help, and provide a solution to these problems.

Time-indexing can help in the selection of particular playback as it allows playback from an exact location in a media stream. Rather than presenting the whole of the media to the user, time-indexing allows the media to be played from the server at a particular point in the media.

In the following diagram the left-hand part shows how the current media playback looks in a web browser. The right-hand part shows how playback from multiple exact locations would look in a web browser.

A further benefit of time-indexing is that it allows not just the playback from a particular point, but also the playback of a segment of the media. That is, playback from a particular point for a fixed amount of time. Without time-indexing each segment that a media maintainer wishes to present has to be hand selected and placed in its own file on the server. Because of the operations time-indexing supports, the segment does not have to be copied into a new file; it can be played directly out of a time-indexed media file. This saves both time and cost for the maintainers of the media data as they do not need to make multiple files.
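Segment selection can be sketched as a lookup over the start times of the chunks in a time-indexed media file. The function `segment` and the 10-second chunking are assumptions made for illustration; a real index would hold the media data alongside each start time.

```python
from bisect import bisect_left, bisect_right


def segment(chunk_times, start, duration):
    """Return the chunk numbers needed to play from `start` for `duration`."""
    # The chunk containing `start` is the last one that begins at or before it.
    first = max(bisect_right(chunk_times, start) - 1, 0)
    # Stop before the first chunk that begins at or after the end of the segment.
    last = bisect_left(chunk_times, start + duration)
    return range(first, last)


# A hypothetical hour-long presentation indexed in 10-second chunks.
chunk_times = list(range(0, 3600, 10))

# Play 30 seconds starting 30 minutes and 5 seconds into the presentation.
wanted = segment(chunk_times, start=1805, duration=30)
```

Only the chunks in `wanted` need to be sent to the player, so the segment is served straight from the single indexed file and nothing is copied.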

Using time-indexing for media playback saves on resources, as only one copy of a media file is needed. Even if different segments are needed they can all be played out of the single time-indexed media file. This is a direct consequence of the data sharing and concurrency capabilities that time-indexing brings. A further advantage of this approach is that network bandwidth utilization is reduced, as the playback of the media does not always necessitate sending the whole media file. As we have seen, it is possible to transmit just a segment of the media, which can be significantly smaller than the whole media file. Reducing the network bandwidth used saves costs in that area also, and frees up bandwidth for other usage. It also saves the end-user time and effort, as the guesswork of finding key positions is eliminated. It is done just once, by the media maintainer.

Conclusions

The technology for time-indexing provides a way of utilizing a specialized container for data that is time-ordered. The container has been designed from its inception to be optimized for building applications that have inherently time-ordered data. The first thing defined is the set of operations on the container; the second is the set of data formats that the container can be in. Until now there has been no common data format to represent time-based data. Time Index Technologies defines that format and the core library of operations required for time-indexed applications to use that format. Time is the primary key that drives everything; it is not some secondary field buried in the data. Time-indexing brings a harmonising effect to applications that deal with time-based data.

Through harmonisation comes commonality. The combination of data immutability and the commonality of format allows broader goals to be achieved, such as data sharing, data integrity, data overlaying, time cross-referencing, and powerful data analysis capabilities. By using the high level operations that allow multiple time-indexes to be utilized, more complex time-indexed applications can be created.

Time-indexing has a broad range of applications in the arenas of multi-media like video and audio, for financial data applications which rely on time-series analysis, and in data visualization applications such as engineering, and control and feedback systems.

The Time Index Technologies system is unique in that it can index multi-media data such as video or audio streams and can also index data such as log file entries and stock prices. It can be used in applications which rely on time-series analysis, and in data visualization applications such as engineering, and control and feedback systems. Furthermore, time-indexing allows the combination of discrete and continuous data to be utilized in the same application. Time-based cross-referencing can be done for any data set, irrespective of whether it is continuous data or discrete data.

Another unique factor is the broad range of timestamps that can be held. These range from the nanosecond level through milliseconds, days, and on to millennia. The kinds of data that can be held range from simple types, to large text items, to binary data, which also adds a level of uniqueness. Another important feature is that time-indexes can hold literally billions of items in each index, if required. As time-indexes retain every piece of data ever entered into the index, rather than updating and modifying data, the technology makes it possible to peer back into the past to find out what values existed at a particular time.