Discussion:
Run-time overhead of text-based storage formats for numerical data
Rune Allnor
2009-11-08 20:22:11 UTC
Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.

The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:

1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.

Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.

I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers per IO cycle:

Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of 160e-9/6e-9 ≈ 26x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune


/***************************************************************************/
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
#include <vector>

int main()
{
    const size_t NumElements = 1000000;
    std::vector<double> SourceBuffer;
    std::vector<double> DestinationBuffer;

    for (size_t n = 0; n < NumElements; ++n)
    {
        SourceBuffer.push_back(n);
        DestinationBuffer.push_back(0);
    }

    time_t rawtime;
    struct tm* timeinfo;

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    std::string message(asctime(timeinfo));
    message.erase(message.size() - 1);   // strip the trailing '\n' from asctime()

    std::cout << message << " : Binary IO cycles started" << std::endl;

    // "Binary" IO: assume the file buffer already holds the native
    // representation, so an element-wise copy is all that is needed.
    const size_t NumBinaryIOCycles = 1000;
    for (size_t n = 0; n < NumBinaryIOCycles; ++n)
    {
        for (size_t m = 0; m < NumElements; ++m)
        {
            DestinationBuffer[m] = SourceBuffer[m];
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : " << NumBinaryIOCycles
              << " Binary IO cycles completed" << std::endl;

    std::stringstream ss;
    const size_t NumTextFormatIOCycles = 100;

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : Text-format IO cycles started" << std::endl;

    // Text-format IO: format every value into the stringstream,
    // then parse the whole stream back out again.
    for (size_t n = 0; n < NumTextFormatIOCycles; ++n)
    {
        size_t m;
        for (m = 0; m < NumElements; ++m)
            ss << SourceBuffer[m];

        m = 0;
        while (!ss.eof())
        {
            ss >> DestinationBuffer[m];
            ++m;
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : " << NumTextFormatIOCycles
              << " Text-format IO cycles completed" << std::endl;

    return 0;
}
Seungbeom Kim
2009-11-09 14:50:56 UTC
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
I don't see textual formats "universally preferred". Who said that?
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Actual numbers may vary, but it is an established fact that text formats
take more space and more processing time, and no one objected to that.
So, if your application cannot afford that overhead, you don't have a
choice, and you go binary. However, other applications may afford that
overhead and instead enjoy the benefits that textual formats offer:

- human readability
- transparency
- portability (I'm not talking about preserving the exact precision,
but about being free of issues such as encoding, endianness, etc.)
- flexibility (Upgrading from 32-bit int to 64-bit int is a breeze.)
- manipulability (You can use text-based utilities such as awk or perl,
and even text editors to modify some parts.)

... especially when you consider that in many (not all) situations, storage is less of a problem nowadays than it used to be (and maybe processing time too), and that the difference in processing times of text and binary is only a fraction of the total processing time.

// I'm afraid I'm just repeating what has been discussed over there. :(

If you're interested enough, see the section "The Importance of Being
Textual" from "The Art of Unix Programming" by Eric Steven Raymond,
at <http://www.catb.org/~esr/writings/taoup/html/ch05s01.html>.

YMMV, of course. No one tells you that you /should/ use a textual format, and you shouldn't tell others that they /should/ use a binary format, either.
The decision is, as always, a trade-off between different values.
No one knows your objectives and constraints better than you do, and
while others can present the pros and cons of the options, it's your
job to understand them and make the decision. (Just note that worrying
about performance is justified only after an actual measurement.)
--
Seungbeom Kim

DeMarcus
2009-11-09 14:56:27 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
Please don't see it as a horror. You're right that binary files are
faster but text files are nice for debugging and backward compatibility.

In one piece of software we used binary files to store configurations. Then suddenly we wanted to add an item to the configuration, which made the old configuration files incompatible with the new software version. To support the old configuration files we had to write a converter, and soon we realized that we couldn't keep writing version converters every time we wanted to add an item. That's where XML came in handy.

Cheers,
Daniel
Ulrich Eckhardt
2009-11-09 14:51:35 UTC
Post by Rune Allnor
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO.
All files are binary. ;)
Post by Rune Allnor
Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
This is the same school as the one that suggests not doing any early
optimisations.
Post by Rune Allnor
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
Compression?
Post by Rune Allnor
2) Text-formatted numerical data take significantly longer to read and
write than binary formats.
Do they? I don't really believe you. The point is that the IO itself takes lots of time anyway.
Post by Rune Allnor
Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.
...as you state yourself.
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data.
Fair choice.
Post by Rune Allnor
The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
Not a fair choice. You have completely omitted the conversion from the on-disk representation to your in-memory representation. Things that differ are endianness, sizes, alignment and padding.
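To illustrate the kind of conversion that gets skipped - a minimal sketch of mine, not from the original post, assuming the file stores 64-bit IEEE-754 doubles in big-endian byte order and that the host's double is IEEE-754 as well:

#include <cstring>
#include <stdint.h>

// Rebuild a double from 8 big-endian bytes taken out of a file buffer.
// The byte order is handled explicitly, so this works on both little-
// and big-endian hosts (as long as double and uint64_t share a layout).
double double_from_bigendian(const unsigned char bytes[8])
{
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | bytes[i];        // most significant byte first

    double value;
    std::memcpy(&value, &bits, sizeof value); // reinterpret the bit pattern
    return value;
}

A loop calling something like this per element is exactly the part that the memcpy-only test leaves out.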
Post by Rune Allnor
And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.
IMHO this matters less for file formats than for protocols, but otherwise I agree, a comparison/warning would be useful.
Post by Rune Allnor
std::stringstream ss;
[...]
Post by Rune Allnor
for (m = 0; m < NumElements; ++m)
ss << SourceBuffer[m];
Wrong: You are writing the numbers without any separating character, making
it impossible to read them afterwards.
Post by Rune Allnor
while(!ss.eof())
{
ss >> DestinationBuffer[m];
++m;
}
Wrong: Use the idiomatic "while(s >> val)". Your loop will probably overflow
the buffer by reading one past the end. Actually, with the error above, I
have no clue what your loop does, you should have checked correctness, too.
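For reference, a minimal sketch of that idiomatic pattern - my illustration, not code from the original post, and it assumes the values were written with separating whitespace, which the code above does not do:

#include <istream>
#include <vector>

// Read whitespace-separated doubles until extraction fails (end of data
// or a parse error). Nothing is stored after a failed read, so the
// destination cannot be overrun.
void read_all(std::istream& is, std::vector<double>& dest)
{
    double val;
    while (is >> val)          // the loop condition tests the stream state
        dest.push_back(val);
}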


Further notes:
1. C++ IOStreams are a complex formatting and parsing framework using
plugins for pretty much any operation. Every use of a plugin amounts to a
lookup of the plugin and a virtual function call, with all the restrictions
that imposes on the optimizer. I would try to optimize that part first
before dumping a textual file layout.
2. Apart from the two glitches above, which are easily caught, textual
formatting is pretty easy to get right. However, I dare you to write
portable code to write a sequence of double values to a "packed binary"
file. This is far from trivial.
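As a sketch of what the first note might amount to in practice - again my illustration, not from the original post - one can format into a flat character buffer with snprintf and hand the stream a single bulk write, avoiding the per-value facet lookup and virtual call of operator<<:

#include <cstdio>
#include <string>
#include <vector>

// Format a vector of doubles as text into one big string, so the stream
// (or file) sees a single large write instead of one insertion per value.
std::string format_all(const std::vector<double>& values)
{
    std::string out;
    out.reserve(values.size() * 32);     // rough guess at the space per value
    char buf[64];
    for (std::vector<double>::size_type i = 0; i < values.size(); ++i)
    {
        int len = std::snprintf(buf, sizeof buf, "%.17g ", values[i]);
        if (len > 0)
            out.append(buf, len);
    }
    return out;
}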

Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932


Neil Butterworth
2009-11-09 14:51:58 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer.
Well, I can't speak for those people, but I would prefer text files for exactly the reasons you suggest, provided those are of overriding importance for the particular application. So if the application is concerned with data transfer, I would use XML for portability; if it requires a configuration file, I would use a text format to make it easy for users to read and edit.

However, if I wanted performance, I would use a binary format FOR THE
FILES WHERE PERFORMANCE IS THE PRIMARY REQUIREMENT. I don't think that
anyone is suggesting that a SQL database (for example) should be
implemented using text files for its indexes and tables. It would make
sense though for such a database to use text files for configuration etc.

You seem to have set up a straw man, and one that has very little to do
with C++, I would add.

Neil Butterworth
Nick Hounsome
2009-11-09 14:56:57 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
That's not a school of thought - It's a fact. They are preferred and
for those reasons.

That doesn't mean you never really need the performance, but it would have to be quite a large data set or quite a stringent performance requirement to make binary preferable.
Post by Rune Allnor
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
In 25 years of programming I have never come across a case (for files) where this has been a problem, and the rate at which storage capacities increase suggests to me that it never will be for any "normal" application.
Post by Rune Allnor
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Again - never in my experience.
In network protocols, YES - because you can never have too much performance in low-level, general-purpose protocols - but in application files I have never had a problem.

A slight optimisation that you might be interested in is to use hex -
This is still portable and readable but can be read and written
without multiplications or divisions.
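A minimal sketch of that idea - mine, not from the original post - under the assumption of 64-bit IEEE-754 doubles: each value's raw bit pattern is written as 16 hex digits, which round-trips exactly and parses with shifts rather than decimal conversion:

#include <cstring>
#include <iomanip>
#include <iostream>
#include <stdint.h>

// Write a double as the 16 hex digits of its bit pattern, and read it back.
// Assumes double and uint64_t are both 64 bits wide (IEEE-754 binary64).
void write_hex(std::ostream& os, double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    os << std::hex << std::setw(16) << std::setfill('0') << bits << ' ';
}

bool read_hex(std::istream& is, double& value)
{
    uint64_t bits;
    if (!(is >> std::hex >> bits))
        return false;
    std::memcpy(&value, &bits, sizeof value);
    return true;
}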
Post by Rune Allnor
Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.
In other words, they are of minor significance - otherwise they would dwarf these things.
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are
If you really worry about performance you will never use the C++ I/O
library conversions - The fastest way to write an integer will almost
certainly be itoa()/atoi() and (if you have it) read()/write()
Post by Rune Allnor
represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
If you are really, really, really speed-obsessed, the way to go is to map a binary file into memory rather than using ANY I/O library at all (mmap on POSIX systems; not sure about Windows).

Try it - You'll be impressed.
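A minimal POSIX sketch of that approach - mine, not from the original post - assuming the file simply contains a contiguous array of native-format doubles (the file name is made up):

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close
#include <cstdio>

int main()
{
    int fd = open("test.raw", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); close(fd); return 1; }

    // Map the whole file: the doubles become directly addressable memory,
    // and the kernel pages them in on demand.
    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const double* data = static_cast<const double*>(p);
    const unsigned long count = st.st_size / sizeof(double);

    double sum = 0.0;
    for (unsigned long i = 0; i < count; ++i)   // touch every value
        sum += data[i];
    std::printf("%lu values, sum = %g\n", count, sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}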
Post by Rune Allnor
The output on my computer is (do note the _different_ numbers of IO cycles in the two cases!):
Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed
A little bit of math produces *average*, *crude* numbers per IO cycle:
Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle
which in turn means there is an overhead on the order of 160e-9/6e-9 ≈ 26x associated with the text formats.
Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.
So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.
They are not wrong. They are just irrelevant to 99% of all
applications.
Post by Rune Allnor
And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.
Rune
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.

I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
Rune Allnor
2009-11-09 20:03:35 UTC
Post by Nick Hounsome
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.
I am working among the 1%. I have seen companies lose business because of poorly performing software where the poor performance went undetected. That is, the people whose job it was to know did not know about the major performance issues.

In one company, which did 24/7 survey jobs and stored the data in text format, merely reading 24 hrs' worth of data from text-formatted files imposed some 3-5 hrs of idle time on the human operators. It wouldn't have been a big deal if those 3-5 hrs had come as one block (the operators in question could have had a long break if they had), but these 3-5 hrs were interspersed throughout the process, tying the operators down in front of their terminals.

From a human standpoint, there are several time scales. Most (all?) readers of this newsgroup are computer programmers, so they know what it means to be in 'The Zone', where time just flies and work gets done.

Now, if you can get a job done with operator idle time of less than a second, the operator can stretch his neck, yawn, have a sip of coffee, and remain in 'The Zone' afterwards.

If the idle time is a couple of seconds, the waiting starts to become noticeable and thus annoying. If the waiting time becomes ten seconds or more, an operator already in 'The Zone' is yanked out of 'The Zone'. If ten seconds of operator idle time is commonplace in the application, the operator never reaches 'The Zone' in the first place.

Once we start talking about minutes of operator idle time, operators go away to have a cup of coffee, read the newspaper, surf the net, flirt with the 20-year-old blonde at the switchboard - whatever. Once that happens, productivity numbers reach the point where companies go out of business.
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
No. The performance issues I worry about are the ones that kick users out of business. No one cares if 15 seconds or 50 seconds is the most representative number for reading 100 MBytes of text-formatted numeric data, when the same amount of binary-formatted data can easily be loaded in 0.3 seconds.

These details have a profound impact where I work. The only reason this is not recognized is the omnipresent misconception that the extra time spent working with text-formatted numeric data is insignificant.

People who know their programming craft would know that one uses binary data formats for numeric data by default, and only deviates towards text-based formats where one can get away with them (file sizes less than about 5-10 MBytes).

Rune
DeMarcus
2009-11-09 22:06:46 UTC
[...]
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
No. The performance issues I worry about are the ones that kick users out of business. No one cares if 15 seconds or 50 seconds is the most representative number for reading 100 MBytes of text-formatted numeric data, when the same amount of binary-formatted data can easily be loaded in 0.3 seconds.
You have a point, but if you asked me to solve the problem, I would
probably try to keep data in the easiest way, i.e. still as text. Why?
Because if it's supposed to be read by a human in the end, text is the
native format. Just as much as a picture's native format is binary and
not XML.

Then I would ask myself: how do we speed this up? As someone suggested, you could use mmap on *nix systems (come to think of it, most configuration files on *nix are actually text files, probably loaded with mmap).

Now, let's say mmap doesn't solve the problem; what do we do next? I would look into compressors. Then you can store compressed files on disk, still in their native text format, and suddenly you have made much of the disk access time disappear with minimal hassle. You don't have to come up with a strange binary format to work around disk latencies.

I used to be part of developing a real-time system where the disks just couldn't sustain real-time transfer rates. We made an adapter class called Compressor, taking pointers to source and destination memory, chained it with our File class, and solved the problem with minimal effort. When the disks got faster a couple of years later we just removed the compressor.
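A minimal sketch of that kind of adapter - mine, not from the original post - using zlib's one-shot compress()/uncompress() on a text buffer before it goes to disk; the function names and the convention of storing the uncompressed size alongside the compressed block are made up for illustration:

#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Shrink a text buffer before it is written to disk; the on-disk bytes
// are binary, but the payload stays in its native text format.
std::vector<unsigned char> compress_text(const std::string& text)
{
    uLongf destLen = compressBound(text.size());
    std::vector<unsigned char> out(destLen);
    if (compress(&out[0], &destLen,
                 reinterpret_cast<const Bytef*>(text.data()),
                 text.size()) != Z_OK)
        throw std::runtime_error("compress failed");
    out.resize(destLen);               // keep only the bytes actually used
    return out;
}

// Expand it again after reading; the caller must have stored the
// original (uncompressed) size next to the compressed block.
std::string expand_text(const std::vector<unsigned char>& packed,
                        uLongf originalSize)
{
    std::string text(originalSize, '\0');
    if (uncompress(reinterpret_cast<Bytef*>(&text[0]), &originalSize,
                   &packed[0], packed.size()) != Z_OK)
        throw std::runtime_error("uncompress failed");
    return text;
}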


Cheers,
Daniel
Graziano
2009-11-10 08:23:36 UTC
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.

In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.

If it is just a bunch of numbers (say, up to some thousands) I will go for sure with a documented XML format.
Nick Hounsome
2009-11-10 12:56:15 UTC
Post by Graziano
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.
In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.
If it is just a bunch of numbers (say, up to some thousands) I will go for sure with a documented XML format.
And I would do exactly the same in your situation, because I've read about how big those files can be - but you are in the 1% (and I'd use memory mapping).

The important thing in this case is to provide the API for readers.
Rune Allnor
2009-11-10 13:03:28 UTC
Post by Graziano
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.
In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.
I know what binary file format to use with the data in question.

My problem has been a bit more fundamental than that. When I ask
decision-makers on what grounds text-based file formats were
chosen, people either respond with "text files are so convenient"
or a blank stare.

In other words, strategic decisions that directly affect the
ability to meet deadlines are taken as a matter of course,
without evaluating the operational impact on the process - or
even without the awareness that an alternative existed at all.

Which is why I would like the trade-offs involved to at least
be mentioned in upcoming textbooks on programming in general
and C++ in particular.

Rune
Konstantin Oznobikhin
2009-11-12 12:14:39 UTC
{ Please confine follow ups to matters that are on-topic for clc++m. -mod }

On 10 Nov, 16:03, Rune Allnor <***@tele.ntnu.no> wrote:

[snip]
Post by Rune Allnor
My problem has been a bit more fundamental than that. When I ask
decision-makers on what grounds text-based file formats were
chosen, people either respond with "text files are so convenient"
or a blank stare.
In other words, strategic decisions that directly affect the
ability to meet deadlines are taken as a matter of course,
without evaluating the operational impact on the process - or
even without the awareness that an alternative existed at all.
And how did it happen that the file format is a *strategic* decision? If it is that hard to switch to binary files then you definitely have more fundamental problems, and they are not related to the file format at all. The issue is the inability of those "decision-makers" to develop software according to requirements. If it is not as fast as you need today, it might be very fast but do something irrelevant tomorrow.
Post by Rune Allnor
Which is why I would like the trade-offs involved to at least
be mentioned in upcoming textbooks on programming in general
and C++ in particular.
Probably, it'd help much more if they read books about requirements engineering, QA and good software design instead of ones about text vs binary files - and those topics are already covered in the corresponding books.

--
Konstantin.
Seungbeom Kim
2009-11-10 08:22:33 UTC
Post by Rune Allnor
Post by Nick Hounsome
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.
I am working among the 1%. I have seen companies lose business because of poorly performing software where the poor performance went undetected. That is, the people whose job it was to know did not know about the major performance issues.
So what? I, as well as other posters I guess, understand that your
application may require performance that can't be satisfied by a textual
format, and stated so. Did anyone say that you're not among the 1% or
that you should switch from textual to binary? Do you have anything
to refute from the other replies in this thread?

Frankly I don't understand your point. "But I..." or "But my..."
isn't very meaningful when others didn't preclude your case.
--
Seungbeom Kim

Rune Allnor
2009-11-10 12:59:20 UTC
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that

1) The speed penalty imposed by using text formats for numerical
data is totally absent in the USENET discussions and printed
literature on C++.

2) The speed penalty imposed by using text formats for numerical
data is one of the main bottlenecks in the data processing
chains I have seen where I work.

3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.

4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.

5) The speed penalty imposed by using text formats for numerical data is a *design* *choice*, on a par with using an O(N log N) quicksort instead of an O(N^2) bubble sort. As far as I am concerned, people are free to use text formats if

a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.

Once one or both of these factors no longer applies, text-based formats are out of the picture. As for the "human readability" question, that is irrelevant unless the contents of the file are meant to be inspected by humans.

6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.

Rune
mzdude
2009-11-10 21:30:49 UTC
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
I think the whole text vs. binary debate for large data files is a little facetious. Use what works. I work for a company that acquires spectral data and stores it in native binary format (for speed). We often acquire hundreds of MB or very large fractions of a GB worth of data. If you can honestly tell me that a human is going to sift through a text file looking for inaccuracies in the data, then that person is deluding themselves and has way too much time on their hands.

We provide a DLL interface to read our data file format. One routine will extract the data in native binary format for speed, the other will extract the data and return it in text format. This was done to promote language interoperability. In tests done long ago, when 300 MHz PCs ruled the earth, binary extraction was at least an order of magnitude faster. IIRC it was about 70 times faster.

Now that our instruments are being used in the medical community, we have additional tamper-detection requirements. Storing data in text makes it very tempting (and easy) for a human to fire up a text editor and manipulate the data. It can still be done using a binary editor, but it's much harder to do.

So I would add that for security reasons as well as speed, text file formats do not work for us.
Jens Schmidt
2009-11-12 01:18:40 UTC
mzdude wrote:

[comb formatting repaired]
Post by mzdude
Now that our instruments are being used in the medical community, we
have additional tamper detection requirements. Storing data in text
makes it very tempting (and easy) for a human to fire up a text editor
and manipulate the data. It can still be done using a binary editor,
but it's much harder to do.
So I would add for security reasons as well as speed text file formats
do not work for us.
Using a binary editor is not very different to using a text editor. In
fact, Emacs can edit both. You may be in for a bad surprise when that
happens.
Security by obscurity is just no security at all. If you really want tamper detection, then use digital signatures. You will get away with just binary files only if you don't have to certify your security measures, or if the person issuing the certificate is not worth his/her money.
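If the goal is tamper *detection* rather than secrecy, even a keyed MAC over the file contents does better than relying on the format being binary; a full digital signature works the same way with asymmetric keys. A minimal sketch - mine, not from the original post - using OpenSSL's one-shot HMAC() with SHA-256 (key management is deliberately left out, and the function name is made up):

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <string>
#include <vector>

// Compute an HMAC-SHA256 tag over a file's bytes. Store the tag next to
// the file and recompute/compare it on load: any edit - with a text
// editor or a binary editor - changes the tag.
std::vector<unsigned char> file_tag(const std::vector<unsigned char>& contents,
                                    const std::string& key)
{
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;
    HMAC(EVP_sha256(),
         key.data(), static_cast<int>(key.size()),
         contents.empty() ? NULL : &contents[0], contents.size(),
         md, &md_len);
    return std::vector<unsigned char>(md, md + md_len);
}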
--
Greetings,
Jens Schmidt


Rune Allnor
2009-11-12 18:29:52 UTC
Post by Jens Schmidt
[comb formatting repaired]
Post by mzdude
Now that our instruments are being used in the medical community, we
have additional tamper detection requirements. Storing data in text
makes it very tempting (and easy) for a human to fire up a text editor
and manipulate the data. It can still be done using a binary editor,
but it's much harder to do.
So I would add for security reasons as well as speed text file formats
do not work for us.
Using a binary editor is not very different to using a text editor. In
fact, Emacs can edit both. You may be in for a bad surprise when that
happens.
Security by obscurity is just no security at all.
Binary formats present a first layer of defence against tampering, more or less on the same level as, say, the plastic tape law enforcement agencies string around crime scenes: casual passers-by are kept at some distance.

True, the binary format alone presents no serious defence against a determined adversary, but people who just happen to come across the files are encouraged to keep their distance.

Rune
Francis Glassborow
2009-11-10 21:30:39 UTC
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data is totally absent in the USENET discussions and printed
literature on C++.
Because it is irrelevant to those involved. And any halfway competent programmer would understand that there is an overhead for using a text file.
Post by Rune Allnor
2) The speed penalty imposed by using text formats for numerical
data is one of the main bottlenecks in the data processing
chains I have seen where I work.
Which means that a properly qualified programmer would recognise that
this is one of the minority cases where a binary format would be useful.
BTW in another post I mentioned that I do use binary formats for scratch
files (exactly because there is no advantage in using a text format).

In addition, it should be noted that a programmer who does not recognise when to use a binary file format probably does not understand the dangers of using one either.
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kinds of performance hits are almost invariably the consequence of using the wrong algorithms.
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical data is a *design* *choice*, on a par with using an O(N log N) quicksort instead of an O(N^2) bubble sort. As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
The point is that for the overwhelming majority using text formats is win-win. I would expect to see information about when to use binary formats in specialist books on areas where it matters (authors of general texts have to trim the content to meet criteria provided by publishers, and something that is important to a very small minority would almost invariably be cut).
Post by Rune Allnor
Once one or both of these factors no longer applies, text-based formats are out of the picture. As for the "human readability" question, that is irrelevant unless the contents of the file are meant to be inspected by humans.
True, but readability also extends to tools - except that then we often call it portability.
Post by Rune Allnor
6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.
See above. Unfortunately far too many programmers consider reading to be
an arcane art and never read books. Worse they think they know all the
answers and never listen to advice from others.

A programmer who cannot recognise that text formats affect performance
is not fit for anything other than grunt-work.

Actually I would suggest that such things as choice of file formats are
design issues and if an employer chooses not to employ a competent
designer with knowledge of the area he gets all he deserves.
Rune Allnor
2009-11-11 05:15:23 UTC
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
*I* am not using anyone. I happen to work among people who
ought to know these things, but don't.
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in MATLAB, which is an increasingly popular language for these kinds of things. The script first generates ten million random numbers and writes them to file in both ASCII and binary double-precision floating-point formats. The files are then read straight back in, hopefully mitigating the effects of file caches etc.:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])


t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])


t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])


t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.05/0.109 ≈ 220x faster than text writes.
Binary reads are 42.2/0.328 ≈ 130x faster than text reads.

These numbers are representative for where I work.
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choise*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sorts algorithms.
What I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
The point is that for the overwhelming majority using text formats is win-win. I would expect to see information about when to use binary formats in specialist books on areas where it matters (authors of general texts have to trim the content to meet criteria provided by publishers, and something that is important to a very small minority would almost invariably be cut).
My point is that the people who only know a little programming would also benefit from at least having seen these things mentioned.

I don't mind you and other textbook authors arguing fiercely for one approach and against the other, *provided* both approaches are at least mentioned, and preferably described in terms of pros & cons.
Post by Francis Glassborow
Post by Rune Allnor
6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.
See above. Unfortunately far too many programmers consider reading to be
an arcane art and never read books. Worse they think they know all the
answers and never listen to advice from others.
A programmer who cannot recognise that text formats affect performance
is not fit for anything other than grunt-work.
Actually I would suggest that such things as choice of file formats are
design issues and if an employer chooses not to employ a competent
designer with knowledge of the area he gets all he deserves.
I might agree with you in both cases, if the programmers and designers had access to textbooks where these questions are discussed.

As of right now, they don't.

Rune
Seungbeom Kim
2009-11-12 00:59:11 UTC
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in matlab, which is an increasingly
popular language for these kinds of things. [...]
You can discuss it in a MATLAB forum, then. MATLAB measurements in a C++ forum don't mean much, because the people here don't know (and may not be interested in, either) what's going on inside MATLAB. Gratuitous inefficiencies inside MATLAB, if any, cannot be used to justify any argument about C++.
Post by Rune Allnor
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.05/0.109 ≈ 220x faster than text writes.
Binary reads are 42.2/0.328 ≈ 130x faster than text reads.
These numbers are representative for where I work.
Again, if MATLAB is representative for where you work, please visit
a MATLAB forum. Otherwise, there was a C++ test program by James Kanze
in the comp.lang.c++ thread you mentioned, and the numbers given by
that program are much more persuasive and convincing, at least here
in a C++ newsgroup. Or you can suggest a better C++ program, of course.
Post by Rune Allnor
My point is that the people who only know a little programming would also benefit from at least having seen these things mentioned.
I don't mind you and other textbook authors arguing fiercely for one approach and against the other, *provided* both approaches are at least mentioned, and preferably described in terms of pros & cons.
It may not be a job for language textbooks, as I mentioned earlier,
though it definitely is for numerical programming textbooks, or
more general programming books dealing with choosing data formats.

It is very natural and acceptable that language textbooks focus on
the language features and that for the sake of simplicity they default
to a text data format that's easier to understand and debug. They
don't want the readers to struggle with other issues when they don't understand the language features very well yet.

You are welcome to write a book of your own, of course.
--
Seungbeom Kim

Francis Glassborow
2009-11-12 12:09:21 UTC
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
*I* am not using anyone. I happen to work among people who
ought to know these things, but don't.
Then they are inadequately trained for the job. As many know, I am a self-taught amateur, and I have always been aware of the overhead of using text files (and once again note that I sometimes use binary for scratch files, i.e. files that are only used within a single run of a program, exactly because of the performance gain). No one ever had to tell me, because it was blatantly obvious that converting to/from text format would take time.
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in MATLAB, which is an increasingly popular language for these kinds of things. The script first generates ten million random numbers and writes them to file in both ASCII and binary double-precision floating-point formats. The files are then read straight back in, hopefully mitigating
No, almost certainly aggravating it. MATLAB is rather good at using caches, and I think the timings you give below have a great deal to do with that.
Post by Rune Allnor
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Wow! That is a data transfer rate of around a gigabyte per second (ten million doubles is 80 MB). What drives are you using? I suspect you are not measuring what you think you are; I suspect that drive caches are coming into the equation somewhere. Of course I am no numerical expert and hardware performance keeps improving, but a hard drive delivering a sustained gigabyte/sec transfer rate is rather beyond my expectations.

I think what is happening is that the extra size of the text data is
blowing the cache limits and so part of those timings relate to the real
STR for the drive system.

The only way to measure this kind of thing is to get data for a range of different values of N. I think you will find that the graph has one or more discontinuities.

When I get an idle moment I will write a C++ program to investigate (my experience suggests something more like a factor of 5 or 10).
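In the meantime, a minimal sketch of such a sweep - mine, not the program promised above - timing a memcpy-style 'binary' pass against stringstream text formatting for a range of N (clock() is coarse, so the numbers are only indicative):

#include <cstdio>
#include <cstring>
#include <ctime>
#include <sstream>
#include <vector>

int main()
{
    // Sweep N over several decades and report seconds per pass for a
    // memcpy-style copy versus text formatting through a stringstream.
    for (unsigned long N = 1000; N <= 10000000; N *= 10)
    {
        std::vector<double> src(N), dst(N);
        for (unsigned long i = 0; i < N; ++i)
            src[i] = i * 0.5;

        std::clock_t t0 = std::clock();
        std::memcpy(&dst[0], &src[0], N * sizeof(double));
        std::clock_t t1 = std::clock();

        std::stringstream ss;
        for (unsigned long i = 0; i < N; ++i)
            ss << src[i] << ' ';
        double v;
        unsigned long k = 0;
        while (k < N && ss >> v)
            dst[k++] = v;
        std::clock_t t2 = std::clock();

        std::printf("N=%8lu  binary %.4f s  text %.4f s\n", N,
                    double(t1 - t0) / CLOCKS_PER_SEC,
                    double(t2 - t1) / CLOCKS_PER_SEC);
    }
    return 0;
}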
Rune Allnor
2009-11-12 18:31:57 UTC
On 12 Nov, 13:09, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
wow! that is a data transfer rate of around a gigabyte per second. What
drives are you using?
The thing is listed as 300 MBytes/s.
Post by Francis Glassborow
I suspect you are not measuring what you think you
are.
I think I am measuring the time delay from the point when the terminal goes into 'busy' mode, where the user is prevented from interacting with the program, until the program accepts user input again. Those are the numbers that matter.
Post by Francis Glassborow
I suspect that drive caches are coming into the equation somewhere.
Of course I am no numerical expert and hardware performance keeps
improving but a hard-drive delivering a sustained gigabyte/sec transfer
rate is rather beyond my expectations.
I think what is happening is that the extra size of the text data is
blowing the cache limits and so part of those timings relate to the real
STR for the drive system.
The only way to measure this kind of thing is to get data for a range of different values of N. I think you will find that the graph has one or more discontinuities.
Maybe. As for the cache effects, here is the output after
I ran the ASCII write/read cycle before the binary write/read
cycle:

Wrote binary data in 0.15625 seconds
Read binary data in 0.32813 seconds
Wrote ASCII data in 22.4844 seconds
Read ASCII data in 43.9844 seconds

I can't see any significant differences in the numbers
that would indicate severe cache effects.
Post by Francis Glassborow
When I get an idle moment I will write a C++ program to investigate (my experience suggests something more like a factor of 5 or 10).
I would be very interested in seeing that kind of thing.
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().

Rune
Bart van Ingen Schenau
2009-11-15 19:32:18 UTC
Post by Rune Allnor
On 12 Nov, 13:09, Francis Glassborow
Post by Francis Glassborow
When I get an idle moment I will write a C++ program to investigate
(me experience suggests something more like a factor of 5 or 10)
I would be very interested in seeing that kind of thing.
Here are some measurements that I made. The measurements were made with
the unix tool 'time', which gives the elapsed time for a complete
program run including startup and shutdown.
To simulate that the data is generated and consumed by different
programs (so in-memory buffering is not possible), reading and writing
are done in separate runs. This has the added advantage that you get
separate measurements for reading and writing.

I have tested text file, native binary files and big-endian IEEE
double-format binary files with the following results:

Writing 10000000 values:
Native binary: 1.716 s (reference value)
IEEE binary: 2.708 s (1.58 times slower)
Text: 22.465 s (13.09 times slower)

Reading 10000000 values:
Native binary: 1.032 s (reference value)
IEEE binary: 2.208 s (2.14 times slower)
Text: 8.693 s (8.42 times slower)

As you can see, there is a factor 8 to 13 between text and native
binary. When comparing with a precisely specified binary format, the
difference becomes about half that factor.
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification. It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.

The program I tested with is this
<start code>
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>
#include <cfloat>
#include <math.h>

//#define _DEBUG

using namespace std;

typedef enum {
TEXT,
NATIVE,
IEEE,
} Operation;

ostream& write_text(ostream& os, double val)
{
return os << val << ' ';
}

ostream& write_native(ostream& os, double val)
{
return os.write(reinterpret_cast<const char*>(&val), sizeof(val));
}

ostream& write_ieee(ostream& os, double val)
{
int power;
double significand;
unsigned char sign;
unsigned long long mantissa;
unsigned char bytes[8];

if(val<0)
{
sign=1;
val = -val;
}
else
{
sign=0;
}
significand = frexp(val,&power);

if (power < -1022 || power > 1023)
{
cerr << "ieee754: exponent out of range" << endl;
os.setstate(ios::failbit);
}
else
{
power += 1022;
}
mantissa = (significand-0.5) * pow(2,53);

bytes[0] = ((sign & 0x01) << 7) | ((power & 0x7ff) >> 4);
bytes[1] = ((power & 0xf)) << 4 |
((mantissa & 0xfffffffffffffLL) >> 48);
bytes[2] = (mantissa >> 40) & 0xff;
bytes[3] = (mantissa >> 32) & 0xff;
bytes[4] = (mantissa >> 24) & 0xff;
bytes[5] = (mantissa >> 16) & 0xff;
bytes[6] = (mantissa >> 8) & 0xff;
bytes[7] = mantissa & 0xff;
return os.write(reinterpret_cast<const char*>(bytes), 8);
}

istream& read_text(istream& is, double& val)
{
return is >> val;
}

istream& read_native(istream& is, double& val)
{
return is.read(reinterpret_cast<char*>(&val), sizeof(val));
}

istream& read_ieee(istream& is, double& val)
{
unsigned char bytes[8];

is.read(reinterpret_cast<char*>(bytes), 8);
if (is)
{
int power;
double significand;
unsigned char sign;
unsigned long long mantissa;

mantissa = ( ((unsigned long long)bytes[7]) |
(((unsigned long long)bytes[6]) << 8) |
(((unsigned long long)bytes[5]) << 16) |
(((unsigned long long)bytes[4]) << 24) |
(((unsigned long long)bytes[3]) << 32) |
(((unsigned long long)bytes[2]) << 40) |
(((unsigned long long)bytes[1]) << 48) )
& 0xfffffffffffffLL;
significand = (mantissa/pow(2,53)) + 0.5;
power = (((bytes[1] >> 4) |
(((unsigned int)bytes[0]) << 4)) & 0x7ff) - 1022;
sign = bytes[0] >> 7;
val = ldexp(significand, power);
if (sign) val = -val;
}
return is;
}

int main(int argc, char** argv)
{
if (argc != 5)
{
cerr << "Usage: " << argv[0] << " <r(ead)|w(rite)>" <<
" <t(ext)|n(ative)|i(eee)> <N> <filename>" << endl;
return EXIT_FAILURE;
}

bool read_mode = (tolower(argv[1][0]) == 'r');
unsigned long num = strtoul(argv[3], NULL, 0);
Operation op_mode;

switch (tolower(argv[2][0]))
{
case 't': default: op_mode = TEXT; break;
case 'n': case 'b': op_mode = NATIVE; break;
case 'i': op_mode = IEEE; break;
}

//TODO: Insert timing code here

if (read_mode)
{
ifstream is(argv[4], (op_mode == TEXT ? ios::in : ios::binary));
double value;

for (unsigned long count = 0; count < num; count++)
{
switch (op_mode)
{
case TEXT: read_text (is, value); break;
case NATIVE: read_native(is, value); break;
case IEEE: read_ieee (is, value); break;
}
if (!is)
{
if (is.eof())
{
cerr << "Unexpected EOF after reading " << count
<< " values from file \"" << argv[4] << '"' << endl;
}
else
{
cerr << "Read error after reading " << count
<< " values from file \"" << argv[4] << '"' << endl;
}
break;
}
#ifdef _DEBUG
else
{
cout << value << '\n';
}
#endif
}
}
else
{
ofstream os(argv[4], (op_mode == TEXT ? ios::out : ios::binary));
double value;

for (unsigned long count = 0; count < num; count++)
{
value = rand();
switch (op_mode)
{
case TEXT: write_text (os, value); break;
case NATIVE: write_native(os, value); break;
case IEEE: write_ieee (os, value); break;
}
if (!os)
{
cerr << "Write error after writing " << count
<< " values to file \"" << argv[4] << '"' << endl;
break;
}
#ifdef _DEBUG
else
{
cout << value << '\n';
}
#endif
}
}

//TODO: Insert timing code here

}
<end code>
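For reference, a minimal sketch of what could go at the two TODO markers
above, assuming plain <ctime>/clock() resolution is good enough at these
run lengths (the variable names are only illustrative):

<start code>
// at the first TODO (needs #include <ctime>):
std::clock_t t_start = std::clock();

// ... the read or write loop runs here ...

// at the second TODO:
std::clock_t t_stop = std::clock();
std::cerr << (t_stop - t_start) / static_cast<double>(CLOCKS_PER_SEC)
          << " seconds" << std::endl;
<end code>

With something like this in place, the timings above can be reproduced by
running the program in write mode and then read mode for each format, with
arguments along the lines of "w i 10000000 test.ieee" followed by
"r i 10000000 test.ieee".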
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-16 14:27:21 UTC
Permalink
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either

1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.

and you are in deep trouble.

And do keep in mind, there are plenty of people around who
are likely to insist on doing things in native language (which
would likely apply to locale settings) as a matter of principle:
"I will not accept having to switch to, or be dominated by, a
foreign language in order to get my local, native-language job
done!"
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Bart van Ingen Schenau
2009-11-16 21:04:25 UTC
Permalink
Post by Rune Allnor
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either
1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.
When a program is dealing with data that is primarily meant to be
processed by other programs (so, the data format is an data-exchange
format), then whatever locale the user switches to should have no effect
whatever on the files that are produced or consumed.

In other words, two program runs that only differ in the locale that was
set by the user should produce identical data-exchange files.

<snip>
Post by Rune Allnor
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few
hundred numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).
They may want to do that, but then it is *their* responsibility to leave
the files in a state in which they can still be processed by the intended
applications.
Do you make your I/O routines so robust that they can handle any
arbitrary garbage that was inserted in the file? Or the removal of some
data?
Or do you bail out at a certain point and say: "This file has been
corrupted. It is no use to me"?

And note that binary files are even more sensitive to inquisitive users.
If their tooling can't even read a text-formatted file without
destroying it, I shudder to think about what that tooling would do with
a binary file.
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-17 02:19:19 UTC
Permalink
Post by Rune Allnor
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either
1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.
and you are in deep trouble.
When you are forced to deal with considering binary I/O for
performance reasons, you are not dealing with a general purpose
application where the users will be totally clueless. With a
specialized application there must be a learning curve for the
user. If unqualified people attempt to use specialized applications
you get what you should expect. You cannot idiot proof things; idiots
are clever and cunning and always improving. I spent 32 years in the
military so I feel qualified to speak to the advancing quality of
stupidity.

This hypothetical specification would need to include details and no
competent programmer would force the user to muck around with system
settings; you would simply control the application I/O locale setting
in the code.
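As a concrete illustration of controlling the I/O locale in the code rather
than relying on system settings, a minimal sketch (the file name and value
are only placeholders):

<start code>
#include <fstream>
#include <locale>

int main()
{
    // The data-exchange file is pinned to the classic "C" locale,
    // independent of whatever locale the user or the environment selects:
    // '.' as decimal separator, no digit grouping.
    std::ofstream os("delivery.txt");
    os.imbue(std::locale::classic());
    os << 3.14159 << '\n';
    os.close();

    // The reader pins the same locale, so both ends agree by construction.
    std::ifstream is("delivery.txt");
    is.imbue(std::locale::classic());
    double d;
    is >> d;
}
<end code>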
Post by Rune Allnor
And do keep in minds, there are plenty of people around who
are likely to insist on doing things in native language (which
"I will not accept having to switch to, or be dominated by, a
foreign language in order to get my local, native-language job
done!"
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).
These appear to be local (to your domain) problems. Most applications
aren't facing nor solving these problems. There are plenty of real
problems to go around. Sounds like borrowed trouble.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-17 13:49:26 UTC
Permalink
Post by stan
These appear to be local (to your domain) problems.
Local to my geography / locale. Not application.
Post by stan
Most applications
aren't facing nor solving these problems.
Maybe these problems are hidden in the 100% English-speaking
locales. In anything but 100% English locales, one would be
wise to watch out for these kinds of things.
Post by stan
There are plenty of real
problems to go around. Sounds like borrowed trouble.
Nope. On my first survey trip I all of a sudden got a
furious client representative almost crushing the door
of my office. It turned out the people who produced the
end delivery data files had used a computer with
locale settings that text-formatted floating point data
with comma, not dot, as decimal mark.

The end result was that we had to go through a month's
worth of previous data deliveries and replace comma decimal
separators with dots. Thankfully, the data were tab
separated in the file, so the correction could be done
in a matter of hours. If the data had been comma separated,
we would have been in real trouble.

I can assure you that there was nothing 'borrowed',
whatsoever, with that experience.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-18 02:15:54 UTC
Permalink
Post by Rune Allnor
Post by stan
These appear to be local (to your domain) problems.
Local to my geography / locale. Not application.
No, I meant in your personal experience domain. This isn't an
unsolvable problem out here in the wild. I'm not claiming that i18n
problems don't exist or are trivial, but they don't pose an unsolvable
problem for I/O performance or portability.

A user who needs locale specific display is very different from an I/O
bound application. Different problems; different solutions.
Post by Rune Allnor
Post by stan
Most applications
aren't facing nor solving these problems.
Maybe these problems are hidden in the 100% English-speaking
locales. In anything but 100% English locales, one would be
wise to watch out for these kinds of things.
Failure to specify an output format that you intend to pass around, and
that thus requires some portability, is incompetence, not a technical
problem. Technology will never solve incompetence or stupidity.

Locales are not a data file format issue; they are a user display
issue. If the users demand locale-correct viewing of the data,
then you clearly have to rule out binary entirely, unless you are
talking to old people (that really hurt) who read things in octal.
Post by Rune Allnor
Post by stan
There are plenty of real
problems to go around. Sounds like borrowed trouble.
Nope. On my first survey trip I all of a sudden got a
furious client representative almost crushing the door
of my office. It turned out the people who produced the
end delivery data files had used a computer with
locale settings that text-formatted floating point data
with comma, not dot, as decimal mark.
The end result was that we had to go through a month's
worth of previous data deliveries and replace comma decimal
separators with dots. Thankfully, the data were tab
separated in the file, so the correction could be done
in a matter of hours. If the data had been comma separated,
we would have been in real trouble.
sed 's/,/./g'
or better
tr ',' '.'

Having humans do this manually is silly.
Post by Rune Allnor
I can assure you that there was nothing 'borrowed',
whatsoever, with that experience.
The problems you describe are the result of incompetence. If the
problem you face is stupidity then I/O formats aren't really in the
problem domain; hence you borrow the trouble.

In a case of incompletely specified formats, neither binary nor text
will be satisfactory.

If you need to pass data around among users who demand different
locales then you need to specify a portable data format and then
create a viewer/editor for users.

In other words focus on the problem not a symptom.

When you have a performance problem, find a performance solution after
profiling. When you have a data portability problem, clarify the
specification. When you have to work for stupid people, work on your
resume and networking, or seek help coping.

IMHO, fixing general textbooks to match your specialized domain is a
non-starter. Your view of binary versus text formats might be correct
for your field, but it doesn't match the 99% of programming in other
fields. Programming is a complicated task and designing data formats
is not really one of the fundamental things a beginner should be
worrying about.

In fact for absolute beginners in a survey course, text is the only
reasonable format since clarity of understanding is what matters most
and performance and even portability is on the back burner. Somewhere
between intermediate and advanced programmers should be learning about
performance and portability issues. Future managers should not factor
into the material covered in a beginner programmer course. Again,
focus on the problem instead of the symptom.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-13 02:09:57 UTC
Permalink
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
<snip>
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
I find the 100x or 200x claims very hard to swallow even in special
cases and they are simply wrong in the general case. The raw I/O plus
parsing performance difference is less than 100x, or the programmer is
not terribly good at this sort of thing. Given a context where I/O
turns out to be a critical bottleneck, it's not unreasonable for a
competent programmer to bypass canned library I/O and resort to custom
routines as needed.
Post by Rune Allnor
Below is a test I wrote in matlab, which is an increasingly
popular language for these kinds of things. The script first
generates ten million random numbers, and writes them to file
on both ASCII and binary double precision floating point formats.
The files are then read straight back in, hopefully mitigating
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
These numbers are representative for where I work.
Are you programming MATLAB or C++? I'm confused because this is a C++
newsgroup, so MATLAB's I/O performance is not really relevant,
convincing, or germane.

Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choice*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sort algorithms.
As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
You can't really know the speed penalty without some code and some
benchmarks. Are you including prototypes in this *design* phase?
Post by Rune Allnor
Post by Francis Glassborow
The point is that for the overwhelming majority using text formats is
win-win. I would expect to see information about when to use binary
formats in specialist books on areas where it matters (authors of
general texts have to trim the content to meet criteria provided by
publishers and something that is important to a very small minority
would almost invariably be cut).
My point is that the people who only know a little programming
also would benefit from at least having seen these things
mentioned.
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are you referring to here? I've never seen one that
didn't give a passing nod to tradeoffs.

I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead. When you get down to specialized
fields then things change and specialized books usually spend more
time on relevant specialized concerns.

In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).

Not one person has claimed that binary is absolute evil, but most have
cautioned that experience leads to the conclusion that binary formats
can lead to unpleasant surprises and that the competent programmer
must conduct due diligence to rule out possible future contexts where
binary becomes as productive as a broken bridge.

Any programmer who hasn't been bitten by a program that outlived the
hardware it was written on hasn't been programming very long.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-13 18:40:33 UTC
Permalink
Post by stan
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
<snip>
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
I find the 100x or 200x claims very hard to swallow even in special
cases and they are simply wrong in the general case. The raw I/O plus
parsing performance difference is less than 100x, or the programmer is
not terribly good at this sort of thing. Given a context where I/O
turns out to be a critical bottleneck, it's not unreasonable for a
competent programmer to bypass canned library I/O and resort to custom
routines as needed.
What seems to happen is that users assemble a set of general-
purpose SW products, and use different SW products to handle
different sub-tasks of the processing.

The efficiency of each SW package depends on a number of
factors outside anyone's control:

- The skill of the programmer(s) who implemented the SW
- The implementational details of the languages and/or
libraries used for the SW
- The intended scale of the SW - some of these things
are clearly intended for industrial scale; others
might have been intended more or less as toys.

and so on.

As far as I can tell, few if any have reviewed these
processing chains from an efficiency POV.
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.

Matlab is an increasingly popular language to use for
these kinds of things (as I said, few if any are
aware of efficiency questions), so by selecting to use
text files in the process, one exposes the user to
matlab's inefficiencies.
Post by stan
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choice*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sort algorithms.
As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
You can't really know the speed penalty without some code and some
benchmarks. Are you including prototypes in this *design* phase?
I design this processing chain for efficiency, that is, so the
end user has a chance of meeting his deadlines. The design
is based on avoiding the known bottlenecks in the present chain.
Post by stan
Post by Rune Allnor
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are you referring to here? I've never seen one that
didn't give a passing nod to tradeoffs.
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
that also treat more general aspects of programming:

- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"

In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.

None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Post by stan
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.

These people can at most be expected to have one intro class
of programming (and quite a few didn't even have that). It is
*those* people, who were forced to take that one programming
class they didn't really want to take, who need to know these
things.

Get the two specialized performance issues of text/binary
file formats and parallelization at least mentioned in the
intro books, and the specialists can start talking with
the domain experts without running the risk of coming across
as calling them amateurs or idiots and so on.

And yes, that's a real risk: It seems to be one of the
fundamental human traits that being corrected at very
fundamental levels is perceived as an accusation of being
incompetent, stupid, those kinds of things.

A process uses text files for data storage, and now
sees data handling as a major bottleneck. If somebody points
out that "text files are at least one order of magnitude
slower than binary files" to whoever made the decision to
use text files, this is often enough perceived by the
decision maker as a statement like "you are an incompetent
fool."

Not because of any way or manner the statement was made.
Because of the obvious correctness of the stated fact.
The more basic and obvious the correction is, the worse
the perceived implicit accusation against whoever made
the wrong decision in the first place.
Post by stan
When you get down to specialized
fields then things change and specialized books usually spend more
time on relavant specialized concerns.
Those are relevant for the programmers. I am talking
about the decision makers, who at best have superficial
exposure to programming literature.
Post by stan
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.

Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? Where I live, comma
(ASCII character 44) plays that role. The dot, if used,
might sometimes be found as a digit grouping character.

The assignment

double one_million = 1.000.000,00;

might be a valid declaration in my locale.

Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.

Of course, one can limit oneself to accessing only the
files one's own computer generated, with known character
sets and locales, but then one suffers from the same
limitations as with binary files.

Once such arguments are taken into account, time penalties
using text files sky-rocket.
Post by stan
Not one person has claimed that binary is absolute evil, but most have
cautioned that experience leads to the conclusion that binary formats
can lead to unpleasant surprises and that the competent programmer
must conduct due diligence to rule out possible future contexts where
binary becomes as productive as a broken bridge.
Any programmer who hasn't been bitten by a program that outlived the
hardware it was written on hasn't been programming very long.
The problem is not hardware, the problem is architectures.
There are only so many architectures. True, the details
about converting from one to another can be nasty, but the
main problem is to find the docs about the details of each
architecture.

As for C++, one main problem is that it has no fixed-size
data types like char8_t, uint32_t and so on, which are needed
to write portable access SW for the contents of the
binary files.

As I understand it, such types will become standardized with
C++0x?
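For what it is worth, a sketch of how such fixed-width types can already be
used for portable binary access, assuming a C99-style <stdint.h> is
available (most current compilers ship one; C++0x's <cstdint> standardizes
it) and that the host uses IEEE 754 doubles; the function name is only
illustrative:

<start code>
#include <stdint.h>   // uint64_t; <cstdint> in C++0x
#include <cstring>
#include <ostream>

// Writes a double as 8 bytes in big-endian order, independent of the
// host's byte order and integer sizes. Assumes the host representation
// is already IEEE 754; only the byte order is normalized here.
std::ostream& write_be_double(std::ostream& os, double val)
{
    uint64_t bits;
    std::memcpy(&bits, &val, sizeof bits);   // grab the bit pattern

    unsigned char bytes[8];
    for (int i = 0; i < 8; ++i)              // most significant byte first
        bytes[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));

    return os.write(reinterpret_cast<const char*>(bytes), 8);
}
<end code>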

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Bart van Ingen Schenau
2009-11-15 19:32:47 UTC
Permalink
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
Matlab is an increasingly popular language to use for
these kinds of things (as I said, few if anyone are
aware of efficiency questions), so by selecting to use
text files in the process, one exposes the user to
matlab's inefficiencies.
But if you have a problem with Matlab's inefficiencies, why do C++
textbooks have to be changed to address that issue?
Most likely, the issues involved are completely independent of the
language you use, and that alone is reason enough not to insist that it
must be addressed in C++-specific books, but in domain-specific books so
that also the others that choose a different language (such as Java or
Matlab) can benefit from the knowledge.

<snip>
Post by Rune Allnor
Post by stan
Post by Rune Allnor
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are your referring to here? I've never seen one that
didn't give a passing nod to tradeoff's.
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.
None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Curiously, none of the books you mention are about the specific domain
for which you write software.
Additionally, most of the books are entry-level C++ books, and I
consider using binary files in a proper way to be way beyond entry level
(more like expert level).
To me, it is therefore not surprising that you can't find any treatise
on the merits of binary files in those books. If it isn't beyond the
level of the intended readership, it is outside the scope that the
authors want to address.
Post by Rune Allnor
Post by stan
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.
These people can at most be expected to have one intro class
of programming (and quite a few didn't even have that). It is
*those* people, who were forced to take that one programming
class they didn't really want to take, who need to know these
things.
No, they *don't* need to know all the factors involved in those issues.
The domain experts need to know when some decisions might require
knowledge that is outside their realm of expertise and know to ask an
expert in that other domain for input.
Post by Rune Allnor
Get the two specialized performance issues of tex/binary
file formats and parallelization at least mentioned in the
intro books, and the specialists can start talking with
the domain experts without running the risk of coming across
as calling them amateurs or idiots and so on.
I can tell you that you are in for a rude awakening.
I have a university degree in Chemical Engineering with a specialisation
in Software Engineering. The specialisation took about a third of the
entire four-year course and exists for the express purpose to train
people that are capable of bridging the gap between the domain experts
on the chemistry side and the programmers on the other side.
So, you think you can teach in a single introductory course what my
university set up a complete specialisation for?

<snip>
Post by Rune Allnor
Post by stan
When you get down to specialized
fields then things change and specialized books usually spend more
time on relavant specialized concerns.
Those are relevant for the programmers. I am talking
about the decision makers, who at best have superficial
exposure to programming literature.
And that means the issue must either be addressed in the relevant domain
literature, or the programmers must challenge the decisions that they
fear are not made on proper grounds.
Post by Rune Allnor
Post by stan
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.
The exact same reasoning holds true for binary files, when you consider
using them across different architectures.
In both cases, the file format must be rigorously defined.
Post by Rune Allnor
Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? Where I live, comma
(ASCII character 44) plays that role. The dot, if used,
might sometimes be found as a digit grouping character.
The assignment
double one_million = 1.000.000,00;
might be a valid declaration in my locale.
Err, no. Because C and C++ do not allow for locale-dependent number
formatting in the source files.

On the other hand, if you have the snippet
double x;
std::cin >> x;
and the user types the input
1,000
What value will be stored in x? Will it be one, or one thousand?

For situations like this, it is usually not possible to allow fully
unconstrained formatting of the input files.
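A small sketch that makes the ambiguity concrete; the hand-written facet
below stands in for a comma-grouping locale, since named locales such as
"en_US" are platform-specific:

<start code>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

// A numpunct facet in which ',' separates groups of three digits,
// as in many real locales.
struct grouped_numpunct : std::numpunct<char>
{
    char do_thousands_sep() const { return ','; }
    std::string do_grouping() const { return "\3"; }
};

int main()
{
    double x;

    std::istringstream a("1,000");          // default "C" locale
    a >> x;
    std::cout << "classic locale reads:  " << x << '\n';   // 1 (stops at ',')

    std::istringstream b("1,000");
    b.imbue(std::locale(b.getloc(), new grouped_numpunct));
    b >> x;
    std::cout << "grouping locale reads: " << x << '\n';   // 1000
}
<end code>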
Post by Rune Allnor
Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.
Of course, one can limit oneself to accessing only the
files one's own computer generated, with known character
sets and locales, but then one suffers from the same
limitations as with binary files.
Yes. The limitation that you have to specify the file format beforehand
in detail.
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-15 19:31:52 UTC
Permalink
Post by Rune Allnor
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary. And what is that book doing on your shelf?
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.

Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
Post by Rune Allnor
In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.
None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Please note that all the books you mention are concerned with C++ or, in
my first book, the basics of elementary programming. The choice between
binary or text format data files has _nothing_ to do with C++. It is a
program design issue and every language would have the same issues.

A good introductory programming course aimed at people who are
considering software development as a career choice would cover issues
of data format and at least mention that one criterion for choice would
be performance.

It would only be a specific C++ issue if C++ was inherently inefficient
at text-based data storage. It isn't, though the 'out of the box' library
facilities are often an order of magnitude less efficient than what can
be developed for special purposes.

Disk speed is so slow compared with CPU speeds that single element
read/writes are dominated by such things as latency issues. Those issues
are usually masked by use of various caching strategies, by the program
runtime, the OS and the hardware.
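For illustration, one way the per-element overhead can be avoided when the
data are already contiguous in memory; a minimal sketch, assuming a
non-empty std::vector<double> and a stream opened in binary mode:

<start code>
#include <cstddef>
#include <fstream>
#include <vector>

// One large write: the stream, OS and hardware buffering get to work
// on a single big chunk.
void write_block(std::ofstream& os, const std::vector<double>& v)
{
    os.write(reinterpret_cast<const char*>(&v[0]),
             static_cast<std::streamsize>(v.size() * sizeof(double)));
}

// One call per element: every element pays the per-operation cost of the
// stream before any disk activity is even involved.
void write_per_element(std::ofstream& os, const std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        os.write(reinterpret_cast<const char*>(&v[i]), sizeof(double));
}
<end code>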

I will repeat what I have written before, but more explicitly:

Any program designer of tools for processing large quantities of
numerical data who is unfamiliar with the caching, conversion issues
etc. is unqualified.

Lastly on a matter of personal relationships. When I have someone who
has entirely missed a vital issue I try to apply a bit of tact. In the
case in point I would not tell anyone to use binary formatted data.
Instead I would say something like:

"I wonder if writing/reading the raw data might help."

If and when the person comes back to tell me that it did, I would
respond something along the lines of "I must remember that for future use"

There are so many other design issues that should have been considered
that I think that whether or not introductory programming courses
introduce the costs of using text data files for storing numerical data
is of very little importance. However a course on application design
should certainly cover such issues.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-16 14:27:23 UTC
Permalink
On 15 Nov, 20:31, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary.
Then don't mention performance. Portability between character
encodings and locales vs portability between architectures might
be more interesting.
Post by Francis Glassborow
And what is that book doing on your shelf?
Taking up space.

Seriously, I bought it because I found that I needed
a freshman's eye on C++. At the time when I bought it,
I had returned to C++ after having stayed away for a
long time, only to find that the language had transformed
almost beyond recognition.

And there is the chance that I might teach or supervise
programming students. Your books are ideal starting points
for people who want to learn C++. Knowing that such books
exist at all, is half the job done.
Post by Francis Glassborow
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.
Don't assume anything. A large and growing fraction of present
'programmers' have matlab as their first - and all too often,
only - programming language.

Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."

Both statements are true, but contemporary matlab is powerful
enough that one can do almost anything with it. Not efficiently,
not easily, but just about anything one can do with software
can be done by matlab. So no one 'needs' to learn anything else
than matlab. Until they hit a brick wall, that is.

To give an idea about what one is up against:

For historical reasons, it is 'common knowledge' among matlab
users that fundamental algorithm primitives like FOR and WHILE
loops, SWITCH-CASE statements, conditional tests and so on, are
'evil'. Matlab is an interpreted language, and the matlab
interpreter seems to have had a bug that caused these kinds of
constructs to take orders of magnitude longer than necessary.
The bug was corrected a few years ago, but the damage was
already done: Decades of conditioning had caused matlab
users to consider nuts'n bolts programming constructs as
'evil.'

Such attitudes have been transferred to the majority of
'programmers' that leave the educational institutions
these days.
Post by Francis Glassborow
Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
It's still worth mentioning in a book the size of Stroustrup's
"Programming" (>1200 pages), if only in an appendix or comment
box. In a recent book,

http://www.amazon.com/Masterminds-Programming-Conversations-Creators-Languages/dp/0596515170/ref=sr_1_12?ie=UTF8&s=books&qid=1258369784&sr=8-12

Stroustrup comments that "The [initial C++ language design] was
something that was extremely expressive and flexible, yet ran
at a speed that challenged assembler..." (lower half of page 2,
available as preview at the page linked above).

But again, speed might not be the best reason to mention this
for the novice programmer. Locales and character encodings
might be more appropriate motivations.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Neil Butterworth
2009-11-16 21:04:17 UTC
Permalink
Post by Rune Allnor
Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."
I taught computing on various courses in two British Universities in the
early 1980s, and I can assure you that there was little "formal training" in
programming back then either. Also, Matlab was heavily used back then -
it is not some new phenomenon, as you seem to think. Once again, you
seem to be setting up straw men so you can knock them down, and once
again your post has minimal C++ content.

Neil Butterworth
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-17 02:17:52 UTC
Permalink
Post by Rune Allnor
On 15 Nov, 20:31, Francis Glassborow
Post by Francis Glassborow
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary.
Then don't mention performance. Portability between character
encodings and locales vs portability between architectures might
be more interesting.
Portability is clearly not the biggest problem a beginner faces. In
fact one could argue it's better to let a beginner get bitten the
first time so the lesson is clear and meaningful.
Post by Rune Allnor
Post by Francis Glassborow
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.
Don't assume anything. A large and growing fraction of present
'programmers' have matlab as their first - and all too often,
only - programming language.
Every "good" writer is forced to define or assume a target audience as
part of the writing process. When the assumptions are spelled out
clearly up front the burden shifts to the reader.
Post by Rune Allnor
Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."
Both statements are true, but contemporary matlab is powerful
enough that one can do almost anything with it. Not efficiently,
not easily, but just about anything one can do with software
can be done by matlab. So no one 'needs' to learn anything else
than matlab. Until they hit a brick wall, that is.
For historical reasons, it is 'common knowledge' among matlab
users that fundamental algorithm primitives like FOR and WHILE
loops, SWITCH-CASE statements, conditional tests and so on, are
'evil'. Matlab is an interpreted language, and the matlab
interpreter seems to have had a bug that caused these kinds of
constructs to take orders of magnitude longer than necessary.
The bug was corrected a few years ago, but the damage was
already done: Decades of conditioning had caused matlab
users to consider nuts'n bolts programming constructs as
'evil.'
Such attitudes have been transferred to the majority of
'programmers' that leave the educational institutions
these days.
Once again with the Matlab discussion in C++. First I'll address a
point made earlier in the thread where you suggested that Matlab was
written in Java. Matlab at its core is basically some scripting
around old Fortran libraries for dealing with matrices and linear
algebra. These days some of the GUI stuff is done with Java but the
actual language is not.

Next you mention loops in Matlab. This is not historical; it's simply
the nature of the linear algebra library. Loops are a terrible way to
do matrix math; much more efficient algorithms exist. Hence the best
advice is to avoid looping over vectors or matrices. This has nothing
to do with non matrix operations, but it is commonly misunderstood by
many.

I doubt that many programmers get their first programming experience
in Matlab. It is nearly universal for Engineering schools
and some other Science departments but even there other intro
programming courses are required. Nearly all of those same students
will also be exposed to Mathematica and its programming. They will be
exposed to a variety of programming experiences. They should gain an
appreciation that programming context matters.

Most programmers don't see their first programming experience in a
classroom and I doubt the number of working programmers with
engineering degrees is growing much. Your claim about the number of
people who have been confused by Matlab is hard to swallow. You seem
to be having trouble accepting that your little view of programming
based on where you work today is very different from the vast majority
of the programming universe. Because everyone in your office knows
Matlab doesn't extrapolate well.

Any student who fails to see the difference between programming Matlab
vs C++ or Java isn't really cut out for programming and will certainly
have a long road before becoming a productive competent programmer.

I really don't see any relationship between Matlab and C++, which is
the on topic context of this newsgroup. The two are more different
than alike and nearly any conclusion drawn from one will not apply to
the other. As for I/O you are stuck with the provided library with
Matlab, while C++ is capable of doing low-level hardware stuff if
necessary: horses for courses.
Post by Rune Allnor
Post by Francis Glassborow
Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
It's still worth mentioning in a book the size of Stroustrup's
"Programming" (>1200 pages), if only in an appendix or comment
box. In a recent book,
http://www.amazon.com/Masterminds-Programming-Conversations-Creators-Languages/dp/0596515170/ref=sr_1_12?ie=UTF8&s=books&qid=1258369784&sr=8-12
Stroustrup comments that "The [initial C++ language design] was
something that was extremely expressive and flexible, yet ran
at a speed that challenged assembler..." (lower half of page 2,
available as preview at the page linked above).
But again, speed might not be the best reason to mention this
for the novice programmer. Locales and character encodings
might be more appropriate motivations.
The context of this thread was about binary vs text I/O operations. If
speed is a problem then you won't be using the standard libraries and
locales and character sets shouldn't be a show stopper. Even if you
use standard libraries nothing forces you to use the system
locale. You can always change the locale to meet your needs. The issue
of portability is moot since we are comparing to binary. You have to
define your output explicitly in either case.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-15 19:55:01 UTC
Permalink
Post by Rune Allnor
(... snipsnipsnip ...)
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.
These people can at most be expected to have one intro class
of programming (...)
Get the two specialized performance issues of text/binary
file formats and parallelization at least mentioned in the
intro books, (...)
I agree that *technical* decisions are often made by people who do not
have the necessary technical insight. I fear this is a very simple truth
in every industry.
Given your angle on the text vs. binary problem, maybe one truth could
be learned for programming books, especially for C++, which is a
language that tends to be chosen when performance is important:
*If* an author mentions the benefits of text files like portability,
human readability etc. it should be only fair to also mention the trade
offs involved, namely that non-portable formats or non-human-friendly
formats can be an order of magnitude faster.

However, I'm not so naive to think this would better the situation
significantly. For example: I'm working on a project where the
underlying file format *is* binary, but it's so freaky complicated that
a simple text base format would be just as fast.
Post by Rune Allnor
(...)
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.
Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? (...)
Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.
(...)
Once such arguments are taken into account, time penalties
using text files sky-rocket.
This is not a problem of text format per se. One could just go with a
specified locale (e.g. English) on every platform. If the programmer
naively uses locale dependent functions, it is a bug like every other.
When using binary formats, the programmer has to know how different
binary encodings will affect the processing of his files. When using
text formats, the programmer has to know how different text formatting
rules affect the processing of his files.
(I know that too many programmers are not aware of locale implications,
but I think there are also too many who do not even know that there is
such a thing as different floating point representations.)

br,
Martin
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Edward Rosten
2009-11-17 14:01:45 UTC
Permalink
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
The GUI is Java based. The guts are written in C++--the executable has
C++ symbols in it. Given that this is about a test of the I/O
performance of a large, popular C++ program, I would claim that
the results are very relevant to the I/O performance of C++.

-Ed


--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-18 02:16:38 UTC
Permalink
Post by Edward Rosten
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
The GUI is Java based. The guts are written in C++--the executable has
C++ symbols in it. Given that this is about a test of the I/O
performance of a large, popular C++ program, I would claim that
the results are very relevant to the I/O performance of C++.
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terrible performance in any language.

Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.

dogs run fast
dogs eat dog food
All fast runners eat dog food.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Edward Rosten
2009-11-18 13:51:22 UTC
Permalink
Post by stan
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terribly performance in any language.
One could take that to mean that the builtin I/O facilities are
terrible.
Post by stan
Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.
dogs run fast
dogs eat dog food
All fast runners eat dog food.
I understand your point that one can't generalize from one C++ program
to all C++ programs. Of course taken to the extreme, that makes all
benchmarks, and indeed this entire thread pointless. Let me rephrase:

Rune Allnor posted a benchmark indicating slow text I/O in C++ and
posted the program.
Rune Allnor then posted another benchmark indicating slow text I/O in
a large, popular and generally well-respected C++ program.
People have then posted various microbenchmarks indicating that as
expected, text has a serialization penalty.

It is true that we cannot generalize from those three examples to
assume that text is slower than binary. But I also know it to be true
having worked with images in both text-based and binary formats in
C++.

-Ed
--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-19 01:04:51 UTC
Permalink
Post by Edward Rosten
Post by stan
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terrible performance in any language.
One could take that to mean that the builtin I/O facilities are
terrible.
A more rational and less biased statement might be that they are not
optimized for every possible case. No standard general-purpose library
is going to beat a well-designed domain-specific solution. If
performance is REALLY critical, you can't neglect adding hardware to
meet the challenge.
Post by Edward Rosten
Post by stan
Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.
dogs run fast
dogs eat dog food
All fast runners eat dog food.
I understand your point that one can't generalize from one C++ program
to all C++ programs. Of course taken to the extreme, that makes all
It's not really an extreme. One thing doesn't follow the other.
Post by Edward Rosten
Rune Allnor posted a benchmark indicating slow text I/O in C++ and
posted the program.
Rune Allnor then posted another benchmark indicating slow text I/O in
a large, popular and generally well-respected C++ program.
People have then posted various microbenchmarks indicating that as
expected, text has a serialization penalty.
Benchmarks are hard, and the trivialized examples offered have indeed
been pointless. Matlab is NOT highly respected for its I/O
capability. The general-purpose library I/O is basically targeted at
allowing conversion from an external data format into a human-readable
display, including i18n and other issues. The layers involved are
non-trivial and clearly are not a realistic solution to specific
numerical I/O problems of converting from external data storage into
internal processing representations.

I don't claim there is no penalty, but I don't accept the 100x claims
either. Programming is about compromises and trade-offs. Any
performance-driven design decisions will ultimately turn on
context-specific benchmarks and not dicey rules of thumb, particularly
when I/O is involved.
Post by Edward Rosten
It is true that we cannot generalize from those three examples to
assume that text is slower than binary. But I also know it to be true
having worked with images in both text based and binary formats in C+
+.
I don't claim binary is always a bad choice. I don't really doubt
that for you and Rune, in your shops and your projects it really is
the correct decision. I do question that binary is inherently superior
to text. Context is vital here. In the general case and especially
during development, text offers advantages that can't be casually
disregarded because of poor reasoning; such as claiming that since
Matlab is slow text should be avoided.

It's not unlike a mechanic who refuses to use Phillips-head
screwdrivers because his slotted screwdrivers have better handles. I'm
claiming it's wrong to rush to judgement. Competence demands that one
become very familiar with the tools and use the appropriate tool for
the job at hand.

Claims that binary is better or text is better are like claiming that
an apple is better than a fountain pen.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Seungbeom Kim
2009-11-11 05:18:01 UTC
Permalink
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data [...]
2) The speed penalty imposed by using text formats for numerical
data [...]
3) The speed penalty imposed by using text formats for numerical
data [...]
4) The speed penalty imposed by using text formats for numerical
data [...]
5) The speed penalty imposed by using text formats for numerical
data [...]
6) The speed penalty imposed by using text formats for numerical
data [...]
As you have stated, the points you have made so far relate only to
applications dealing heavily with numerical data. Then why are you
accusing general C++ language book authors and Usenet participants?
Take a numerical processing book, or go to a numerical processing
group, and if it's said there that textual formats should be preferred,
then that's where you should make accusations and claims. (And by the
way, choosing data formats is only barely on-topic for a "C++ language"
textbook or forum.)

I'm not saying that you should not discuss such matters elsewhere,
such as here. It's just that you keep emphasizing your particular
application area and its special needs, while others are stating
more common cases and more general choices, and NOT refuting your
points in your area in particular. You and others are going parallel
to each other and not making any more progress. Your repeated claims
here would make sense only if you were arguing that binary formats
should be preferred *by default* in most, if not all, areas -- that's
where you could refute others' claims -- but you have clearly stated
above that your points are meant only for numerical applications.
This is why I said I didn't understand your point: are you just
explaining your situation, or trying to change others' opinions?
--
Seungbeom Kim

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-11 19:26:51 UTC
Permalink
Post by Seungbeom Kim
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data [...]
2) The speed penalty imposed by using text formats for numerical
data [...]
3) The speed penalty imposed by using text formats for numerical
data [...]
4) The speed penalty imposed by using text formats for numerical
data [...]
5) The speed penalty imposed by using text formats for numerical
data [...]
6) The speed penalty imposed by using text formats for numerical
data [...]
As you have stated, the points you have made so far relate only to
applications dealing heavily with numerical data. Then why are you
accusing general C++ language book authors and Usenet participants?
Because C++ is considered by most (although maybe not by the
regulars here) as a language for applications where high
efficiency is paramount. Even so, how and when to handle binary
files through C++ is not treated in any of the learning material
I have seen.
Post by Seungbeom Kim
Take a numerical processing book, or go to a numerical processing
group, and if it's said there that textual formats should be preferred,
then that's where you should make accusations and claims. (And by the
way, choosing data formats is only barely on-topic for a "C++ language"
textbook or forum.)
Yesterday I summarized in another post a number of responses
I have received on these questions over the past couple of
weeks:

http://groups.google.no/group/comp.lang.c++.moderated/msg/9421be7a5c6189be

The general impression is that the C++ community, as it comes
across in this and other USENET groups as well as in the textbooks,
is largely unfamiliar with using binary file formats at all.
Post by Seungbeom Kim
I'm not saying that you should not discuss such matters elsewhere,
such as here. It's just that you keep emphasizing your particular
application area and its special needs, while others are stating
more common cases and more general choices, and NOT refuting your
points in your area in particular.
Correct. But there is a general trend towards dismissing
the stated problem as irrelevant, or my solution as a
misconception. Again, review the opinions expressed in the
post I refer to above.
Post by Seungbeom Kim
You and others are going parallel
to each other and not making any more progress. Your repeated claims
here would make sense only if you were arguing that binary formats
should be preferred *by default* in most, if not all, areas -- that's
where you could refute others' claims -- but you have clearly stated
above that your points are meant only for numerical applications.
That's where the problem is easy to find, because that's where it
shows up first. These things don't show up unless the file sizes are
5-10 MBytes or more; only then do the delays associated with data
loading become noticeable to human users.

Once e.g. XML files (which need to be parsed, which takes at least
as much time as merely converting the numbers) start reaching those
kinds of sizes, text-based formats will become annoying in other
applications as well.
Post by Seungbeom Kim
This is why I said I didn't understand your point: are you just
explaining your situation, or trying to change others' opinions?
I am trying to make influential people here - both regular posters
on c.l.c++.m and textbook authors who might be lurking - aware of
the problem. Only when the teachers start addressing a problem
will it be reasonable to expect students to know.

I have posted numbers to demonstrate what I am talking about
on a number of occasions, e.g.

http://groups.google.no/group/comp.lang.c++.moderated/msg/2863e5d312a93f97

The common first reaction is that "This is not C++, so this
is irrelevant!" Then all the arguments we have seen in this
and recent threads appear; that

- There are faster ways than operator>> and operator<<
- The algorithm must be wrong
- My measurements are not 'exact'

and so on.

The fact is that very few people even suspected that these
numbers differ by orders of magnitude. The numbers referred to
above, while not C++, are representative of the delays and
bottlenecks where I work. We can quarrel about reducing the
absolute numbers by a factor of 3 or maybe 5 by using efficient
parsers, but that requires rewriting the file parsers in
every single program already in use out there. As well as
educating the programmers etc.

The net effect is far larger if one spends the same effort
on educating the same programmers and designers about binary
files.

But in order to do that, one needs to educate the educators.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Nick Hounsome
2009-11-10 12:56:02 UTC
Permalink
Post by Rune Allnor
Post by Nick Hounsome
The text book authors are writing for the 99% not the 1% so they are
not going to change.
I am working among the 1%. I have seen companies lose business
because of poorly performing software that went undetected. That
is, the people whose job it was to know did not know about the
major performance issues.
In one company, which did 24/7 survey jobs and stored the data
in text format, merely reading 24 hrs worth of data from text-
formatted files imposed some 3-5 hrs of idle time on the human
operators. It wouldn't have been a big deal if those 3-5 hrs were
concentrated in one block (the operators in question could have had
a long break if they were), but these 3-5 hrs were interspersed
throughout the process, tying the operators down in front of their terminals.
But how much of that 3-5 hours was reading and how much was processing
the data?
I can't imagine a process in which the reading was the bulk of that
time.
Post by Rune Allnor
From a human standpoint, there are several time scales. Most
(all?) readers of this newsgroup are computer programmers, so
they know what it means to be in 'The Zone', where time just
flies and work gets done.
Now, if you can get a job done with operator idle time of less
than a second, the operator can stretch his neck, yawn, have
a sip of coffee, and remain in 'The Zone' afterwards.
If the idle time is a couple of seconds, the waiting time
starts to become noticeable and thus annoying. If the waiting
time becomes ten seconds or more, an operator already in 'The
Zone' is yanked out of 'The Zone'. If ten seconds of operator
idle time is commonplace in the application, the operator never
reaches 'The Zone' in the first place.
Once we start talking about minutes of operator idle time,
operators go away to have a cup of coffee, read the newspaper,
surf the net, flirt with the 20-year-old blonde at the
switchboard - whatever. Once that happens, productivity numbers
reach the point where companies go out of business.
I agree with everything you say here.
It's just that my experience is that you can't reduce minutes to
seconds unless the fundamental design is bad.
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with
the wrong performance issues.
No. The performance issues I worry about are the ones that
kick users out of business. No one cares if 15 seconds or 50
seconds is the most representative number for reading 100 MBytes
of text-formatted numeric data, when the same amount of binary-
formatted data can easily be loaded in 0.3 seconds.
You're mixing reading and loading. Loading = reading and processing.
Processing is the biggest user of time in almost all systems otherwise
they aren't doing anything useful.
If I take your figures at face value you can't be doing anything with
the data.
Post by Rune Allnor
These details have a profound impact where I work. The only
reason this is not recognized is the omnipresent misconception
that the extra time spent working with text-formatted numeric data
is insignificant.
But you are undermining your own argument.
The people writing on this thread would hardly be saying that it was
insignificant if it had cost them their jobs; therefore it hasn't cost
them their jobs; therefore it IS insignificant in all the projects that
they've worked on. In other words, it IS insignificant for MOST people
MOST of the time, just as I said.
Post by Rune Allnor
People who know their programming craft would know that one uses
binary data formats for numeric data by default, and only deviates
towards text-based formats where one can get away with them
(file sizes less than about 5-10 MBytes).
You have it backwards.

One uses text formats by default and only deviates towards binary
formats when you know that you have a problem and have demonstrated
that binary formats will solve it.

Either that or you are right and Meyers, Stroustrup, all the C++ book
writers and all the designers of the C and C++ I/O libraries are
wrong.

P.S. As I think I already mentioned: if you really want the ultimate
in speed, then use memory mapping.
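
(A minimal editorial sketch of the memory-mapping approach, using POSIX
mmap; it is not code from the thread, and "mydata.bin" is assumed to hold
raw native-endian doubles. On Windows one would use
CreateFileMapping/MapViewOfFile instead.)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const char* path = "mydata.bin";   // hypothetical file of raw doubles
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) != 0 || sb.st_size == 0) { close(fd); return 1; }

    // Map the whole file; the OS pages data in on demand, no read() copies.
    void* p = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const double* data = static_cast<const double*>(p);
    const size_t n = sb.st_size / sizeof(double);

    double sum = 0.0;                  // touch the data so pages get faulted in
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    std::printf("%lu doubles, sum = %g\n", (unsigned long)n, sum);

    munmap(p, sb.st_size);
    close(fd);
    return 0;
}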
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-11 05:16:26 UTC
Permalink
Post by Nick Hounsome
Post by Rune Allnor
Post by Nick Hounsome
The text book authors are writing for the 99% not the 1% so they are
not going to change.
I am working among the 1%. I have seen companies lose business
because of poorly performing software that went undetected. That
is, the people whose job it was to know did not know about the
major performance issues.
In one company, which did 24/7 survey jobs and stored the data
in text format, merely reading 24 hrs worth of data from text-
formatted files imposed some 3-5 hrs of idle time on the human
operators. It wouldn't have been a big deal if those 3-5 hrs were
concentrated in one block (the operators in question could have had
a long break if they were), but these 3-5 hrs were interspersed
throughout the process, tying the operators down in front of their terminals.
But how much of that 3-5 hours was reading and how much was processing
the data?
I can't imagine a process in which the reading was the bulk of that
time.
I don't remember the details (this was several years ago), but
the gross numbers were more or less that the process in question
produced some 200 files on the order of 100 MBytes/file during
24 hours of survey. Each file needed to pass through two or three
read/write cycles during processing. That's about 600 read/writes
at some 30 seconds of operator idle time each, i.e. about 18000
seconds. That's 5 hours wasted, right there. With a 24-hr deadline,
that hurts.
Post by Nick Hounsome
Post by Rune Allnor
From a human standpoint, there are several time scales. Most
(all?) readers of this newsgroup are computer programmers, so
they know what it means to be in 'The Zone', where time just
flies and work gets done.
Now, if you can get a job done with operator idle time of less
than a second, the operator can stretch his neck, yawn, have
a sip of coffee, and remain in 'The Zone' afterwards.
If the idle time is a couple of seconds, the waiting time
starts to become noticeable and thus annoying. If the waiting
time becomes ten seconds or more, an operator already in 'The
Zone' is yanked out of 'The Zone'. If ten seconds of operator
idle time is commonplace in the application, the operator never
reaches 'The Zone' in the first place.
Once we start talking about minutes of operator idle time,
operators go away to have a cup of coffee, read the newspaper,
surf the net, flirt with the 20-year-old blonde at the
switchboard - whatever. Once that happens, productivity numbers
reach the point where companies go out of business.
I agree with everything you say here.
It's just that my experience is that you can't reduce minutes to
seconds unless the fundamental design is bad.
The fundamentally bad design is to use text-formatted files.
I posted a demo I made in Matlab, which is an increasingly
popular language for these kinds of things, in a reply to
Glassborow. Look at the numbers there - binary files are
100-200x faster than text formats.
Post by Nick Hounsome
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with
the wrong performance issues.
No. The performance issues I worry about are the ones that
kick users out of business. No one cares if 15 seconds or 50
seconds is the most representative number for reading 100 MBytes
of text-formatted numeric data, when the same amount of binary-
formatted data can easily be loaded in 0.3 seconds.
You're mixing reading and loading. Loading = reading and processing.
Processing is the biggest user of time in almost all systems otherwise
they aren't doing anything useful.
If I take your figures at face value you can't be doing anything with
the data.
I only have so much time to get the job done. We can argue
over semantics till the cows come home - the deadlines stand
whether I am 'reading' or 'loading' the data.
Post by Nick Hounsome
Post by Rune Allnor
These details have a profound impact where I work. The only
reason this is not recognized is the omnipresent misconception
that the extra time spent working with text-formatted numeric data
is insignificant.
But you are undermining your own argument.
The people writing on this thread would hardly be saying that it was
insignificant if it had cost them their jobs; therefore it hasn't cost
them their jobs; therefore it IS insignificant in all the projects that
they've worked on. In other words, it IS insignificant for MOST people
MOST of the time, just as I said.
Well, the *intention* of what people write might be benign,
but that's not how it comes across. Some of the reactions I
have received in similar threads here and on comp.lang.c++:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6
Post by Nick Hounsome
1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.

Paying attention to speed "borders on fraud."

http://groups.google.no/group/comp.lang.c++.moderated/msg/555d4053471fd368

[RA] > 2) Text-formatted numerical data take significantly longer to
read and write than binary formats.
[NH] Again - Never in my experience.

http://groups.google.no/group/comp.lang.c++.moderated/msg/eed5649d9ba5c2cb

[FG] This is a classic example of the 'speed at any cost' school of
thought.

http://groups.google.no/group/comp.lang.c++.moderated/msg/8aec2b00e7ab7ee5

[NH] You seem to be obsessed with speed

From such excerpts I can only conclude that most people are
oblivious to the problem and its implications.
Post by Nick Hounsome
Post by Rune Allnor
People who know their programming craft would know that one uses
binary data formats for numeric data by default, and only deviates
towards text-based formats where one can get away with them
(file sizes less than about 5-10 MBytes).
You have it backwards.
One uses text formats by default and only deviates towards binary
formats when you know that you have a problem and have demonstrated
that binary formats will solve it.
Either that or you are right and Meyers, Stroustrup, all the C++ book
writers and all the designers of the C and C++ I/O libraries are
wrong.
No. The authors you list aren't wrong. In order to be wrong,
one must make a statement that can be proved or demonstrated
to be false.

The fact is that except for Stroustrup mentioning ios_base::binary
more or less in passing in his chapter on file streams, none of the
C++ textbooks I have seen mention the issue at all.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Ulrich Eckhardt
2009-11-11 19:23:55 UTC
Permalink
Rune Allnor wrote:
[ about "binary" files ]
Post by Rune Allnor
The fact is that except for Stroustrup mentioning ios_base::binary
more or less in passing in his chapter on file streams, none of the
C++ textbooks I have seen mention the issue at all.
ios_base::binary does something completely different.
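
(Editorial clarification sketch, not part of Uli's post: std::ios::binary
only suppresses newline/end-of-file translation; it does not by itself give
you a "binary format". To store the raw object representation you still use
the unformatted write()/read() members, as in this minimal example. File
names are arbitrary, and the raw layout is native and non-portable.)

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    std::vector<double> v(3, 3.14);

    // Formatted output: decimal text, even though the stream is in binary mode.
    std::ofstream txt("as_text.dat", std::ios::binary);
    for (std::size_t i = 0; i < v.size(); ++i)
        txt << v[i] << '\n';

    // Unformatted output: the raw 8-byte object representation of each double.
    std::ofstream bin("as_raw.dat", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(&v[0]),
              static_cast<std::streamsize>(v.size() * sizeof(double)));
    return 0;
}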

Uli
--
Sator Laser GmbH
Managing Director (Geschäftsführer): Thorsten Föcking, Amtsgericht Hamburg HR B62 932


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-09 19:57:25 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
about binary file IO. Over the course of the discussion, I discovered to my
amazement - and, quite frankly, horror - that there seems to be a school of
thought that text-based storage formats are universally preferable to binary text
formats for reasons of portability and human readability.
This is a classic example of the 'speed at any cost' school of thought.
The trouble with binary formats is that they are not portable (sometimes
to the extent of not being portable between releases of the same program
-- I once met a young programmer who was going through the agony of
having used binary data formats which could no longer be read correctly
by the second release of his program, much to the dismay of his customers).
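
(Editorial sketch of one common way to blunt that portability problem,
not something from Francis's post: give the binary file a small, explicit,
versioned header with fixed-width fields, so a reader can at least detect
a layout change instead of silently misreading the data. The magic string,
version number and field names here are invented for illustration; the
doubles are still written in native IEEE 754/endianness, so byte-swapping
would be needed for cross-architecture portability.)

#include <cstdio>
#include <cstring>
#include <vector>
#include <stdint.h>   // <cstdint> in C++11; older MSVC needs a substitute

struct FileHeader {
    char     magic[8];    // identifies the producer, e.g. "SURVEY\0\0"
    uint32_t version;     // bump this whenever the record layout changes
    uint32_t reserved;
    uint64_t count;       // number of doubles that follow the header
};

bool write_doubles(const char* path, const std::vector<double>& v)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;

    FileHeader h;
    std::memset(&h, 0, sizeof h);
    std::memcpy(h.magic, "SURVEY\0", 8);
    h.version = 1;
    h.count   = v.size();

    bool ok = std::fwrite(&h, sizeof h, 1, f) == 1;
    if (ok && !v.empty())
        ok = std::fwrite(&v[0], sizeof(double), v.size(), f) == v.size();
    std::fclose(f);
    return ok;
}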

The space overhead is hardly important when even simple desktop machines
can have over a terabyte of disk storage at less than it used to cost
(in 1979) to buy a couple of boxes of single-density 5.25" floppy disks.

Speed could be an issue in some cases, but measure first before choosing
to optimise. I rarely use binary files other than for scratch files
during a single execution of a program (where they are fine as long as
the program does not crash and burn).
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
REH
2009-11-09 19:58:06 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
about binary file IO. Over the course of the discussion, I discovered to my
amazement - and, quite frankly, horror - that there seems to be a school of
thought that text-based storage formats are universally preferable to binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that
1) Binary files are about 70-20% of the file size of the text files, depending
on the number of significant digits stored in the text files and other
formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and write
than binary formats.
Metrics only matter when they matter. Are the larger file size or
the increased processing time actual issues? If you are not constrained
for space or time, and the text files have value-add (e.g., convenience,
portability, etc.), then use them. Bigger and slower do not always
equate to bad.

REH
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-12 00:59:13 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two
details that
1) Binary files are about 70-20% of the file size of the text files,
depending
on the number of significant digits stored in the text files and
other
formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and
write
than binary formats.
Timings are difficult to compare, since the exact numbers depend on
buffering
strategies, buffer sizes, disk speeds, network bandwidths and so on.
I have therefore sketched a 'distilled' test (code below) to test what
overheads
are involved with formatting numerical data back and forth between
(...)
To everything that has been said in other replies I would like to add a
small test sample of mine.
It's as basic as I thought I could get and just as inaccurate and
insufficient as every other single test. (Code follows at the end)

-- Run 1 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 279 ms
ASCII write = 2048 ms (Factor 7.3)
Binary read = 211 ms
ASCII read = 1283 ms (Factor 6.1)

-- Run 2 -- (1e7)
build type: RELEASE
generate data (10000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 11329 ms
ASCII write = 20014 ms (Factor 1.8)
Binary read = 2252 ms
ASCII read = 12922 ms (Factor 5.7)

-- Run 3 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 313 ms
ASCII write = 1911 ms (Factor 6.1)
Binary read = 212 ms
ASCII read = 1277 ms (Factor 6.0)


So what gives?
Binary is unsurprisingly faster.
Apparently somewhere between a factor of 5 and 10 on my box here. (And if
something else is going on, such as in Run 2, it may not even be that
much faster.)

Bottom line for me:
1.) Binary *is* definitely faster.
2.) The difference is a small (single-digit) factor.
3.) *If* you need that speed, use binary.

br,
Martin

### CODE ###

// Missing pieces added so the sample compiles as posted (Windows; link winmm.lib):
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdexcept>
#include <vector>
#include <windows.h>   // DWORD
#include <mmsystem.h>  // timeGetTime()

typedef std::vector<double> dvec_t;
typedef DWORD tdiff_t;

// The helpers are defined below main(), so declare them here.
tdiff_t now();
tdiff_t write_binary(dvec_t const& data);
tdiff_t read_binary(dvec_t & data);
tdiff_t write_ascii(dvec_t const& data);
tdiff_t read_ascii(dvec_t & data);

int main()
{
using namespace std;
srand( (unsigned)time( NULL ) );

cout << "build type: " <<
#ifndef NDEBUG
"DEBUG"
#else
"RELEASE"
#endif
<< endl;

const int n = 1e6;

cout << "generate data (" << n << " doubles) ...\n";
dvec_t outdata(n, 3.14);
for(int i=0; i<n; ++i) {
outdata[i] *= double(rand());
}

cout << "start writing ...\n";
const tdiff_t wbin = write_binary(outdata);
const tdiff_t wtxt = write_ascii(outdata);

dvec_t b_in, t_in;

cout << "start reading ...\n";
const tdiff_t rbin = read_binary(b_in);
const tdiff_t rtxt = read_ascii(t_in);

cout << "timings:\n";
cout << "Binary write = " << wbin << " ms\n";
cout << "ASCII write = " << wtxt << " ms\n";
cout << "Binary read = " << rbin << " ms\n";
cout << "ASCII read = " << rtxt << " ms\n";

// cout << "check results ...\n";
// cout << "Binary w/r == equality: " << (outdata==b_in?"yes":"no") << endl;
// cout << "ASCII w/r == equality: " << (outdata==t_in?"yes":"no") << endl;
return 0;
}


tdiff_t now()
{
return timeGetTime(); // from winmm.lib + mmsystem.h (Windows)
};

tdiff_t write_binary(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "wb");
for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fwrite(&(*i), sizeof(double), 1, f);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t read_binary(dvec_t & data)
{
data.clear();
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "rb");
if(!f)
throw std::runtime_error("No file!");
double dRead;
while(fread(&dRead, sizeof(double), 1, f) == 1) {
data.push_back(dRead);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t write_ascii(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "wb");
for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fprintf(f, "%le\n", *i);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t read_ascii(dvec_t & data)
{
data.clear();
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "rb");
if(!f)
throw std::runtime_error("No file!");
double dRead;

while(fscanf(f, "%le", &dRead) == 1) {
data.push_back(dRead);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Konstantin Oznobikhin
2009-11-12 13:27:07 UTC
Permalink
On 12 Nov, 03:59, "Martin B." <***@gmx.at> wrote:

[snip]
Post by Martin B.
To everything that has been said in other replies I would like to add a
small test sample of mine.
It's as basic as I thought I could get and just as inaccurate and
insufficient as every other single test. (Code follows at the end)
And being that simple and inaccurate, it is close to pointless.
Still, I found it useful to make a somewhat more accurate and
realistic test, and got these numbers:

generate data (10000000 doubles) ...
timings:
Binary write = 1555 ms
ASCII write = 13695 ms (Factor 8.80)
Binary read = 134 ms
ASCII read = 9400 ms (Factor 70.14)
check results ...
Binary w/r == equality: yes
ASCII w/r == equality: NO

What is interesting here is that the size of the text file is less than
that of the binary one (74 MB against 76 MB). Also, as you can see, the
text file doesn't contain the exact values we wrote there, so the
comparison is not entirely fair. Modifying the test further to store
exact values, I got these numbers:

generate data (10000000 doubles) ...
timings:
Binary write = 1469 ms
ASCII write = 21537 ms (Factor 14.66)
Binary read = 140 ms
ASCII read = 16318 ms (Factor 116.55)
check results ...
Binary w/r == equality: yes
ASCII w/r == equality: yes

Now the size of the text file is about twice as big as the binary one
and the difference becomes quite significant.
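
(Editorial note and sketch, not from Konstantin's post: the earlier
equality failure is what you would expect from %g's default six
significant digits, and printing 17 significant digits - enough to
round-trip any IEEE 754 double - is what roughly doubles the text file.
A tiny self-contained check, with an arbitrary test value:)

#include <cstdio>

int main()
{
    const double x = 3.14 * 12345.0;   // arbitrary value, many significant digits
    char buf[64];
    double y = 0.0, z = 0.0;

    std::sprintf(buf, "%g", x);        // default: 6 significant digits
    std::sscanf(buf, "%lf", &y);

    std::sprintf(buf, "%.17g", x);     // enough digits to round-trip a double
    std::sscanf(buf, "%lf", &z);

    std::printf("6 digits round-trip exactly: %s\n", x == y ? "yes" : "no");
    std::printf("17 digits round-trip exactly: %s\n", x == z ? "yes" : "no");
    return 0;
}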

[snip]
Post by Martin B.
So what gives?
Binary is unsurprisingly faster.
Apparently somewhere between factor 5 and 10 on my box here. (And if,
something else is going on, such as in Run2, it may not even be that
much faster).
As you can see, the factor can be as much as 100 (and that was only a
ten-minute optimization). And still this test shows (more or less
accurately) the difference for sequential data processing only. If you
need random data access, then a text file is probably not an option at all.
Post by Martin B.
1.) Binary *is* definitely faster.
+1
Post by Martin B.
2.) The difference is a small (single-digit) factor.
The difference can be a large (triple-digit) factor.
Post by Martin B.
3.) *If* you need that speed, use binary.
+1 again. Text files are very convenient for debugging/testing at
least, even if you'll never look at the real data.
Post by Martin B.
br,
Martin
### CODE ###
[snip]

And here is my modification of the code.

// Missing headers added so the sample compiles as posted (MSVC; link winmm.lib):
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdexcept>
#include <vector>
#include <tchar.h>     // _tmain / _TCHAR
#include <windows.h>   // DWORD
#include <mmsystem.h>  // timeGetTime()

using namespace std;

typedef vector<double> dvec_t;
typedef DWORD tdiff_t;

tdiff_t now()
{
return timeGetTime(); // from winmm.lib + mmsystem.h (Windows)
};

void write_binary(dvec_t const& data, FILE* f)
{
const dvec_t::size_type size = data.size();
fwrite(&size, sizeof(size), 1, f);

if (size == 0)
return;

const size_t chunk_size = 1024 * 1024 / sizeof(double);
const dvec_t::value_type *begin = &data[0];
const dvec_t::value_type *const end = &data[0] + size;
const size_t full_chunks = size / chunk_size;
for (size_t i = 0; i < full_chunks; ++i, begin += chunk_size)
{
fwrite(begin, chunk_size * sizeof(double), 1, f);
}

if (begin < end)
fwrite(begin, (end - begin) * sizeof(double), 1, f);
}

tdiff_t write_binary(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "wb");
write_binary(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void read_binary(dvec_t & data, FILE* f)
{
dvec_t::size_type size;
fread(&size, sizeof(size), 1, f);

if (size == 0)
return;

data.resize(size);

const size_t chunk_size = 1024 * 1024 / sizeof(double);
dvec_t::value_type *begin = &data[0];
dvec_t::value_type *const end = &data[0] + size;
const size_t full_chunks = size / chunk_size;
for (size_t i = 0; i < full_chunks; ++i, begin += chunk_size)
{
fread(begin, chunk_size * sizeof(double), 1, f);
}

if (begin < end)
fread(begin, (end - begin) * sizeof(double), 1, f);
}

tdiff_t read_binary(dvec_t & data)
{
dvec_t().swap(data);
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "rb");
if(!f)
throw std::runtime_error("No file!");

read_binary(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void write_ascii(dvec_t const& data, FILE* f)
{
const dvec_t::size_type size = data.size();
fprintf(f, "%lu\n", size);

if (size == 0)
return;

for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fprintf(f, "%lg\n", *i);
}
}

tdiff_t write_ascii(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "wb");
write_ascii(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void read_ascii(dvec_t & data, FILE* f)
{
dvec_t::size_type size;
fscanf(f, "%lu", &size);

if (size == 0)
return;

data.resize(size);

dvec_t::value_type *begin = &data[0];
dvec_t::value_type *const end = &data[0] + size;
for (; begin != end; ++begin)
{
fscanf(f, "%lg", begin);
}
}

tdiff_t read_ascii(dvec_t & data)
{
dvec_t().swap(data);
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "rb");
if(!f)
throw std::runtime_error("No file!");

read_ascii(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

int _tmain(int argc, _TCHAR* argv[])
{
srand( (unsigned)time( NULL ) );

cout << "build type: " <<
#ifndef NDEBUG
"DEBUG"
#else
"RELEASE"
#endif
<< endl;

const int n = 10000000;

cout << "generate data (" << n << " doubles) ...\n";
dvec_t outdata(n, 3.14);
for(int i=0; i<n; ++i) {
outdata[i] *= double(rand());
}

cout << "start writing ...\n";
const tdiff_t wbin = write_binary(outdata);
const tdiff_t wtxt = write_ascii(outdata);

dvec_t b_in, t_in;

cout << "start reading ...\n";
const tdiff_t rbin = read_binary(b_in);
const tdiff_t rtxt = read_ascii(t_in);

cout << "timings:\n";
cout << "Binary write = " << wbin << " ms\n";
cout << "ASCII write = " << wtxt << " ms\n";
cout << "Binary read = " << rbin << " ms\n";
cout << "ASCII read = " << rtxt << " ms\n";

cout << "check results ...\n";
cout << "Binary w/r == equality: " << (outdata==b_in?"yes":"no")
<< endl;
cout << "ASCII w/r == equality: " << (outdata==t_in?"yes":"no")
<< endl;

return 0;
}

--
Konstantin Oznobikhin.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Hyman Rosen
2009-11-12 18:38:31 UTC
Permalink
Post by Konstantin Oznobikhin
ASCII w/r == equality: NO
generate data (10000000 doubles) ...
ASCII w/r == equality: yes
Now the size of the text file is about twice as big as the binary one
and the difference becomes quite significant.
And you also need to make sure that the doubles you're
generating span the full range of numbers you expect in
your actual program, including very small and very large
numbers. Ideally, when you write them out as text (and I
assume this means in decimal), you should be writing them
out with the smallest number of digits which will convert
back into exactly the original number. I don't believe
there is any standardized way to do this, so there's an
added complication.

You might want to try writing your numbers out as textified
binary and see if that speeds things up - binary floating
point numbers all look like ±1.mantissa×2^±exponent, with
exceptions for 0, ±∞, and error forms (NaN). Doing it this
way will still give you a portable representation, but you
can avoid all the work of going from binary to decimal, and
that should make things faster.
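
(Editorial sketch, not from Hyman's post: where the C library supports
C99, the "%a" conversion is essentially the 'textified binary' form he
describes - a hexadecimal mantissa plus a power-of-two exponent - and it
round-trips exactly with no decimal conversion work; C++11 later added
std::hexfloat for iostreams. Printing 17 significant decimal digits with
"%.17g" also round-trips IEEE 754 doubles, though not with the minimal
digit count asked for above.)

#include <cstdio>

int main()
{
    const double x = 0.1;              // not exactly representable in binary
    char buf[64];
    double y = 0.0;

    std::sprintf(buf, "%a", x);        // e.g. 0x1.999999999999ap-4
    std::printf("hexfloat form: %s\n", buf);

    std::sscanf(buf, "%la", &y);       // read the hexfloat back
    std::printf("round-trips exactly: %s\n", x == y ? "yes" : "no");
    return 0;
}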
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-13 20:57:08 UTC
Permalink
I have snipped all the code because it is just consuming space and the
reader can look back if they are interested.
The comment I want to make on both programs is that they use fscanf and
fprintf. Those functions have a reputation (deserved in my experience)
for poor performance. This is due, at least in part, to the fact that
those functions have to determine what type is to be written/read before
making the conversion.

As I/O is generally a slow operation, the performance hit in using
xprintf and xscanf (substitute for x as appropriate) is usually
acceptable. However, like the problem of binary versus text, when large
volumes of data are being written/read they are not the right tool.

I note also that the binary is being read and written in large chunks.
Yes, binary makes this easy (when you want data in substantial blocks),
but it is possible to read and write text in large blocks as well, as
opposed to one item at a time (with all the associated function-call
overheads).
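
(Editorial sketch of what 'text in large blocks' can look like, under
the assumption of a plain one-number-per-line format: values are
formatted into a large in-memory buffer first and the whole block is
handed to fwrite, instead of one fprintf call per value. The 1 MB block
size is an arbitrary choice.)

#include <cstdio>
#include <cstddef>
#include <vector>

void write_ascii_blocked(const std::vector<double>& data, std::FILE* f)
{
    const std::size_t block = 1 << 20;      // flush roughly every 1 MB of text
    std::vector<char> buf;
    buf.reserve(block);

    char line[64];
    for (std::size_t i = 0; i < data.size(); ++i) {
        const int len = std::sprintf(line, "%.17g\n", data[i]);
        buf.insert(buf.end(), line, line + len);
        if (buf.size() + sizeof line > block) {   // block nearly full: write it out
            std::fwrite(&buf[0], 1, buf.size(), f);
            buf.clear();
        }
    }
    if (!buf.empty())
        std::fwrite(&buf[0], 1, buf.size(), f);
}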

Once again, I emphasise that I have nothing against the use of binary
when appropriate, but it is much harder to benchmark binary v text than
most programmers seem to think.

When we program in C++ it is vital to understand what the bottlenecks
are and how to cope with them. C++ gives us the tools but it is up to
the programmer to learn to use their tools correctly. And that includes
selecting the appropriate tools for the needs of a problem domain.

The problem is much less with technical books and textbooks than it is
with the individual attitudes of too many programmers. Many programmers
lack craftsmanship (and that includes taking pride in one's work) and
have little, if any, understanding of choosing the right tool for the job.

I hasten to add that those who read and write in this newsgroup are
already an elite because they actually value discussing their work with
others and realise that they can learn from each other. But look how
small a fraction of programmers ever come near this newsgroup (or any
other newsgroup related to programming).
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-15 19:32:01 UTC
Permalink
Post by Francis Glassborow
(...)
The comment I want to make on both programs is that they use fscanf and
fprintf. Those functions have a reputation (deserved in my experience)
for poor performance. (...)
(...) However like the problem of binary versus text, when large
volumes of data are being written/read they are not the right tool.
(...)
When we program in C++ it is vital to understand what the bottlenecks
are and how to cope with them. C++ gives us the tools but it is up to
the programmer to learn to use their tools correctly. And that includes
selecting the appropriate tools for the needs of a problem domain.
While we are at it - xprintf may not be the fastest, but it's waaay
faster than iostreams, at least on the VS/MS implementation.
I think it's a shame that all presumably "fast" C++ functions in this
area seem to be C ones, with a C-ish interface (atof, strtod, ...)
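
(Editorial sketch of the strtod-style approach being alluded to: slurp
the whole text file into one buffer and walk it with strtod, which skips
the per-call format-string parsing that fscanf("%lg", ...) repeats for
every value. The file name and format are assumptions; how much this
gains over iostreams depends entirely on the implementation.)

#include <cstdio>
#include <cstdlib>
#include <vector>

bool read_ascii_strtod(const char* path, std::vector<double>& out)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (size < 0) { std::fclose(f); return false; }

    std::vector<char> buf(size + 1);
    const std::size_t got = std::fread(&buf[0], 1, size, f);
    std::fclose(f);
    buf[got] = '\0';                   // strtod wants a terminated string

    const char* p = &buf[0];
    for (;;) {
        char* end;
        const double v = std::strtod(p, &end);
        if (end == p) break;           // nothing left that parses as a number
        out.push_back(v);
        p = end;
    }
    return true;
}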

br,
Martin
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Jonathan Thornburg
2009-11-12 01:06:33 UTC
Permalink
Post by Rune Allnor
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two
details that
[[binary files are smaller & processing them is faster]]

I think it's safe to say that people who say that text is *always*
best have probably never dealt with multi-hundred-terabyte RAID arrays
that are 90% full. The arguments about who (of a dozen users) gets
to delete 50 terabytes in the next 12 hours can get pretty ugly....
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to test what
overheads
are involved with formatting numerical data back and forth between
text and
binary formats. To eliminate the impact of peripherical devices, I
have used
a std::stringstream to store the data. The binary bufferes are
represented
by vectors, and I have assumed that a memcpy from the file buffer to
the
destination memory location is all that is needed to import the binary
format
from the file buffer. (If there are significant run-time overheads
associated with
moving NATIVE binary formats to the destination, please let me
know.)
A related issue is that people doing heavy-duty floating-point work
often use fancy binary data-file libraries like FITS, HDF5, NetCDF,
GRIB, etc. These typically provide storage of multi-dimensional arrays
(including dimension metadata) in a variety of datatypes, with automagic
conversion from on-disk to host formats for things like endianness and
floating-point formats. The on-disk formats are almost always binary
for just the reasons you outline (text is too bulky and too slow).

If you wanted to extend your benchmark in that direction, it would
be interesting to benchmark (say) HDF5 as well as "raw binary".
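
(Editorial sketch of what the HDF5 leg of such a benchmark could look
like, using the HDF5 1.8 C API from C++; the file and dataset names are
arbitrary, error checking is omitted, and this is an illustration rather
than code from the thread.)

#include <vector>
#include "hdf5.h"

void write_hdf5(const std::vector<double>& data)   // assumes data is non-empty
{
    hid_t file  = H5Fcreate("mydata.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { data.size() };              // one-dimensional dataset
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/samples", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &data[0]);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}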

ciao,
--
-- "Jonathan Thornburg [remove -animal to reply]" <***@astro.indiana-zebra.edu>
Dept of Astronomy, Indiana University, Bloomington, Indiana, USA
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]