ABSTRACT

Over the last decade, the size of the data seen by computational problems has grown immensely. There appear to be more web pages than human beings, and we have successfully indexed those pages. Routers generate huge traffic logs, on the order of terabytes, in a short time. The same explosion of data is felt in the observational sciences, because our capabilities of measurement have grown significantly. Computational resources, in comparison, have not increased at the same rate. In particular, the ability to access data at random has itself turned out to be a resource. In many settings, such as monitoring a network or estimating the costs of query plans, no individual input item is significant by itself; the quantity of interest is the aggregate picture that emerges from the data. In several scenarios, such as network monitoring, some data are never stored but merely used to infer the aggregate health of the network. At the same time, in several data-intensive computations, making passes over the data has been found to be significantly more efficient than accessing it at random.

This has brought the data stream model to the fore. The model consists of an algorithm with a small random access memory, typically sublinear in the input size, operating in passes over the input. Any input item not explicitly stored is inaccessible to the algorithm in the same pass. The model captures the essence of a monitoring process that is allowed to observe a system unobtrusively using some small “extra” space. Of particular interest is the one-pass model, in which the input may simply not be stored at all.

The data stream model poses several challenges. Intriguingly, even simple problems, which were thought to be fully understood at small scales, have turned out to be formidable at the current scale of data and have required reexamination. As mentioned earlier, the aggregate picture that emerges from the data, or the synopsis, is often the desideratum.
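To make the model concrete, the following minimal sketch (in Python) illustrates one classical one-pass algorithm, the Misra-Gries frequent-items summary; it is offered only as an illustration of the model's constraints, not as a result of this work. The algorithm keeps at most k-1 counters, a quantity sublinear in the stream length, sees each item exactly once, and any item it does not retain is lost for the remainder of the pass.

    # A hedged sketch of a one-pass, small-memory streaming algorithm:
    # the Misra-Gries frequent-items summary (illustrative only).
    def misra_gries(stream, k):
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                # Room to track a new item with its own counter.
                counters[item] = 1
            else:
                # No room: decrement every counter; drop those that hit zero.
                # Items not tracked here are gone -- they cannot be revisited.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    if __name__ == "__main__":
        # Any item occurring more than n/k times in a stream of length n
        # is guaranteed to survive among the returned counters.
        stream = ["a", "b", "a", "c", "a", "a", "b", "d", "a"]
        print(misra_gries(stream, k=3))

The point of the sketch is the synopsis: the dictionary of counters is the small aggregate picture retained, while the stream itself is never stored.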