The exabyte challenge

9 November 2005

Our approach to dealing with data has essentially remained unchanged for the past 25 centuries ...

One exabyte (10¹⁸ bytes) is a rough - and probably conservative - estimate of the size of everything ever written, composed, filmed, painted, or in any other way 'recorded' by humans. By 2010, virtually all of this vast amount of data will be online - and most of us will be able to access it from our homes, our mobiles, and other kinds of wearable devices. This constitutes a major change to our lives that is already raising new issues about how to collect and process the data, how to use it in research, and how it will affect society.

Take photos from a digital camera, for example. There are billions of digital cameras around the world these days. Each high-resolution photo is a couple of megabytes and most people have hundreds, if not thousands, sitting on their computer - the digital equivalent of shoeboxes full of paper prints. If you don't annotate and categorise them immediately, it will never happen and you are unlikely to ever look at those pictures again. But what if the computer could find a particular photo for you, without you having to categorise the collection first? Will computers ever adequately respond to queries such as 'find me that picture of Lisa and me on Christmas Eve'?

The emergence of the World Wide Web in the past decade demonstrates an alternative to the divide-and-conquer approach of categorisation and indexing. For example, Google's search engine leaves the data in one enormous 'heap' and provides query-driven dynamic views on the data. But this syntax-based approach is already showing its limits, particularly when it comes to integrating data from diverse sources and formats (images, sound, text), incorporating the semantics of the data and dealing with complex, interlinked data. Peter Flach, of the Department of Computer Science, believes that this new complexity requires a new way of thinking and a new way of dealing with the data. Computing devices with 'data awareness' are needed - devices that make sense of the exabytes of data at our fingertips.


While semantics, data fusion and complex data have all been studied widely in computer science and related disciplines, the exabyte challenge is about taking these techniques to the next level, in order to stop us from 'drowning in data while starving for knowledge'. A key factor here is interdisciplinarity. A deep understanding of the nature of scientific data, and of the scientific aims of the investigation in particular, is crucial - hence the need to combine research and resources in a University-wide research theme. Today, most universities are highly fragmented environments that fail to exploit the considerable synergy that could result from combining diverse research areas. The Exabyte Informatics research theme, which includes research groups from each of the University's six faculties, provides a unique opportunity to truly exploit this synergy.

Flach and his group have already been thinking about creating a scientist's desktop - an environment something like Microsoft Office, but for scientists (and without the patronising paper clip!). Given a free hand, what would he want it to be like? Well, it could be something that proactively searches on the web for things a particular type of scientist might be interested in. It would effectively look over your shoulder while you are working and build up an idea of your profile - what kind of scientist you are, what kind of information you are seeking - and try to pre-empt your requirements. A simple example would be downloading an academic paper, and the computer could then start searching for all the papers referenced in it. You might not ever look at them, or even download them onto your computer - you just need to know that they are out there and available.
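
The 'follow the references' idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of anything Flach's group has built: the citation data, the paper identifiers and the function names are all hypothetical, and a real system would look references up in an online index rather than in a small in-memory table.

```python
from collections import deque

# Hypothetical toy citation data: paper id -> ids of the papers it references.
REFERENCES = {
    "paper-A": ["paper-B", "paper-C"],
    "paper-B": ["paper-D"],
    "paper-C": ["paper-D", "paper-E"],
    "paper-D": [],
    "paper-E": ["paper-A"],  # cycles are possible, so visited papers are tracked
}


def lookup_references(paper_id):
    """Stand-in for a real lookup service (an assumption for this sketch)."""
    return REFERENCES.get(paper_id, [])


def collect_references(start, lookup, max_depth=2):
    """Breadth-first walk: papers reachable from `start` within max_depth hops."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand papers at the depth limit
        for ref in lookup(paper):
            if ref not in seen:
                seen.add(ref)
                found.append(ref)
                queue.append((ref, depth + 1))
    return found


print(collect_references("paper-A", lookup_references, max_depth=1))
# ['paper-B', 'paper-C']
print(collect_references("paper-A", lookup_references, max_depth=2))
# ['paper-B', 'paper-C', 'paper-D', 'paper-E']
```

The depth limit reflects the point made above: the desktop only needs to know that the papers are out there, so it can record what it finds without downloading everything.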

But such data mining brings its own problems. While the introduction of the web has encouraged scientists towards more openness and sharing, in fields such as bioinformatics their data may be their scientific capital - not everyone is willing to share. That raises new issues of how to link these databases together and reap the benefits of doing so, while preserving the privacy of each database. These are complicated issues that would require input from the University's Centre for Information Technology and Law. And then there is the impact on society and education. How do we make use of all these data in teaching? The web itself does not determine how we are going to use it. It is just a piece of technology, like a blackboard or an overhead projector, and how we use it determines how useful it is. But it is the educationalists who must ask what they are trying to achieve with the web and understand what can and cannot be done. It is not the job of the computer scientist to tell them how it should be used.

Initially, this revolution will happen from the bottom up, via a number of pilot projects that will start to raise people's awareness of the issues. There is a need for a coordinated effort - people talking to each other and determining the best way to achieve these things. Flach thinks that although things have moved at an incredible rate since the '90s, we are still in a transitional period of innovation. But we cannot sustain this rate of change. In 50 to 100 years' time the situation will be much more stable because the technologies will have proved themselves. Getting there, by making optimal use of the enormous potential of computing technology, will be as challenging as it will be rewarding.

Information object	How many bytes?
A binary decision	1 bit
A single text character	1 byte
A typewritten page	2 kilobytes (KB)
The complete works of Shakespeare	5 megabytes (MB)
A library floor of academic journals	100 gigabytes (GB)
The print collections of the US Library of Congress	10 terabytes (TB)
All printed material in the world	200 petabytes (PB)
All words ever spoken by human beings	5 exabytes (EB)
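
To get a feel for the scale in the table, the figures can be compared with a little arithmetic. The sketch below assumes decimal (SI) units, in line with the 10¹⁸-byte definition of an exabyte used in this article; the helper names are illustrative only.

```python
# Decimal (SI) byte units, matching the 10^18-byte exabyte used in the article.
UNITS = {
    "B": 1,
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def to_bytes(value, unit):
    """Convert a (value, unit) figure from the table into bytes."""
    return value * UNITS[unit]


def how_many_fit(big, small):
    """How many copies of `small` fit into `big`; each is a (value, unit) pair."""
    return to_bytes(*big) // to_bytes(*small)


one_exabyte = (1, "EB")
shakespeare = (5, "MB")            # the complete works of Shakespeare
library_of_congress = (10, "TB")   # print collections of the US Library of Congress

print(how_many_fit(one_exabyte, shakespeare))          # 200000000000
print(how_many_fit(one_exabyte, library_of_congress))  # 100000
```

On these figures, one exabyte holds roughly 200 billion copies of the complete works of Shakespeare, or 100,000 copies of the Library of Congress print collections.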

Peter Flach/Department of Computer Science