Discussion:
Run-time overhead of text-based storage formats for numerical data
Rune Allnor
2009-11-08 20:22:11 UTC
Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.

The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:

1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.

Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.

I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers per IO cycle:

Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of 160e-9/6e-9 ≈ 26x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune


/***************************************************************************/
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
#include <vector>

int main()
{
    const size_t NumElements = 1000000;
    std::vector<double> SourceBuffer;
    std::vector<double> DestinationBuffer;

    for (size_t n = 0; n < NumElements; ++n)
    {
        SourceBuffer.push_back(n);
        DestinationBuffer.push_back(0);
    }

    time_t rawtime;
    struct tm* timeinfo;

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    std::string message(asctime(timeinfo));
    message.erase(message.size() - 1);   // strip the trailing '\n' from asctime()

    std::cout << message << " : Binary IO cycles started" << std::endl;

    // "Binary" IO: assume the file buffer already holds the native
    // representation, so an element-wise copy is all that is needed.
    const size_t NumBinaryIOCycles = 1000;
    for (size_t n = 0; n < NumBinaryIOCycles; ++n)
    {
        for (size_t m = 0; m < NumElements; ++m)
        {
            DestinationBuffer[m] = SourceBuffer[m];
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : " << NumBinaryIOCycles
              << " Binary IO cycles completed" << std::endl;

    std::stringstream ss;
    const size_t NumTextFormatIOCycles = 100;

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : Text-format IO cycles started" << std::endl;

    // Text-format IO: format every value into the stringstream,
    // then parse the whole stream back out again.
    for (size_t n = 0; n < NumTextFormatIOCycles; ++n)
    {
        size_t m;
        for (m = 0; m < NumElements; ++m)
            ss << SourceBuffer[m];

        m = 0;
        while (!ss.eof())
        {
            ss >> DestinationBuffer[m];
            ++m;
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    message = std::string(asctime(timeinfo));
    message.erase(message.size() - 1);

    std::cout << message << " : " << NumTextFormatIOCycles
              << " Text-format IO cycles completed" << std::endl;

    return 0;
}
Seungbeom Kim
2009-11-09 14:50:56 UTC
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
I don't see textual formats "universally preferred". Who said that?
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Actual numbers may vary, but it is an established fact that text formats
take more space and more processing time, and no one objected to that.
So, if your application cannot afford that overhead, you don't have a
choice, and you go binary. However, other applications may afford that
overhead and instead enjoy the benefits that textual formats offer:

- human readability
- transparency
- portability (I'm not talking about preserving the exact precision,
but about being free of issues such as encoding, endianness, etc.)
- flexibility (Upgrading from 32-bit int to 64-bit int is a breeze.)
- manipulability (You can use text-based utilities such as awk or perl,
and even text editors to modify some parts.)

... especially when you consider that in many (not all) situations, storage is less of a problem nowadays than it used to be (and maybe processing time too), and that the difference in processing times of text and binary is only a fraction of the total processing time.

// I'm afraid I'm just repeating what has been discussed over there. :(

If you're interested enough, see the section "The Importance of Being
Textual" from "The Art of Unix Programming" by Eric Steven Raymond,
at <http://www.catb.org/~esr/writings/taoup/html/ch05s01.html>.

YMMV, of course. No one tells you that you /should/ use a textual format, and you shouldn't tell others that they /should/ use a binary format, either.
The decision is, as always, a trade-off between different values.
No one knows your objectives and constraints better than you do, and
while others can present the pros and cons of the options, it's your
job to understand them and make the decision. (Just note that worrying
about performance is justified only after an actual measurement.)
--
Seungbeom Kim

DeMarcus
2009-11-09 14:56:27 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
Please don't see it as a horror. You're right that binary files are
faster but text files are nice for debugging and backward compatibility.

In one piece of software we used binary files to store configurations. Then suddenly we wanted to add an item to the configuration, which made the old configuration files incompatible with the new software version. To support the old configuration files we had to write a converter, and soon we realized that we couldn't keep writing version converters every time we wanted to add an item. That's where XML came in handy.

Cheers,
Daniel
Ulrich Eckhardt
2009-11-09 14:51:35 UTC
Post by Rune Allnor
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO.
All files are binary. ;)
Post by Rune Allnor
Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
This is the same school as the one that suggests not doing any early
optimisations.
Post by Rune Allnor
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
Compression?
Post by Rune Allnor
2) Text-formatted numerical data take significantly longer to read and
write than binary formats.
Do they? I don't really believe you. The point is that the IO itself takes lots of time anyway.
Post by Rune Allnor
Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.
...as you state yourself.
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data.
Fair choice.
Post by Rune Allnor
The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
Not a fair choice. You have completely omitted the conversion from the on-disk representation to your in-memory representation. Things that differ are endianness, sizes, alignment and padding.
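To illustrate the kind of conversion that gets skipped - a minimal sketch of mine, not from the original post, assuming the file stores 64-bit IEEE-754 doubles in big-endian byte order and that the host's double is IEEE-754 as well:

#include <cstring>
#include <stdint.h>

// Rebuild a double from 8 big-endian bytes taken out of a file buffer.
// The byte order is handled explicitly, so this works on both little-
// and big-endian hosts (as long as double and uint64_t share a layout).
double double_from_bigendian(const unsigned char bytes[8])
{
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | bytes[i];        // most significant byte first

    double value;
    std::memcpy(&value, &bits, sizeof value); // reinterpret the bit pattern
    return value;
}

A loop calling something like this per element is exactly the part that the memcpy-only test leaves out.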
Post by Rune Allnor
And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.
IMHO this matters less for file formats than for protocols, but otherwise I agree, a comparison/warning would be useful.
Post by Rune Allnor
std::stringstream ss;
[...]
Post by Rune Allnor
for (m = 0; m < NumElements; ++m)
ss << SourceBuffer[m];
Wrong: You are writing the numbers without any separating character, making
it impossible to read them afterwards.
Post by Rune Allnor
while(!ss.eof())
{
ss >> DestinationBuffer[m];
++m;
}
Wrong: Use the idiomatic "while(s >> val)". Your loop will probably overflow
the buffer by reading one past the end. Actually, with the error above, I
have no clue what your loop does, you should have checked correctness, too.
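For reference, a minimal sketch of that idiomatic pattern - my illustration, not code from the original post, and it assumes the values were written with separating whitespace, which the code above does not do:

#include <istream>
#include <vector>

// Read whitespace-separated doubles until extraction fails (end of data
// or a parse error). Nothing is stored after a failed read, so the
// destination cannot be overrun.
void read_all(std::istream& is, std::vector<double>& dest)
{
    double val;
    while (is >> val)          // the loop condition tests the stream state
        dest.push_back(val);
}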


Further notes:
1. C++ IOStreams are a complex formatting and parsing framework using
plugins for pretty much any operation. Every use of a plugin amounts to a
lookup of the plugin and a virtual function call, with all the restrictions
that imposes on the optimizer. I would try to optimize that part first
before dumping a textual file layout.
2. Apart from the two glitches above, which are easily caught, textual
formatting is pretty easy to get right. However, I dare you to write
portable code to write a sequence of double values to a "packed binary"
file. This is far from trivial.
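As a sketch of what the first note might amount to in practice - again my illustration, not from the original post - one can format into a flat character buffer with snprintf and hand the stream a single bulk write, avoiding the per-value facet lookup and virtual call of operator<<:

#include <cstdio>
#include <string>
#include <vector>

// Format a vector of doubles as text into one big string, so the stream
// (or file) sees a single large write instead of one insertion per value.
std::string format_all(const std::vector<double>& values)
{
    std::string out;
    out.reserve(values.size() * 32);     // rough guess at the space per value
    char buf[64];
    for (std::vector<double>::size_type i = 0; i < values.size(); ++i)
    {
        int len = std::snprintf(buf, sizeof buf, "%.17g ", values[i]);
        if (len > 0)
            out.append(buf, len);
    }
    return out;
}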

Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932


Neil Butterworth
2009-11-09 14:51:58 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer.
Well, I can't speak for those people, but I would prefer text files for exactly the reasons you suggest, provided those are of overriding importance for the particular application. So if the application is concerned with data transfer, I would use XML for portability; if it requires a configuration file, I would use a text format to make it easy for users to read and edit.

However, if I wanted performance, I would use a binary format FOR THE
FILES WHERE PERFORMANCE IS THE PRIMARY REQUIREMENT. I don't think that
anyone is suggesting that a SQL database (for example) should be
implemented using text files for its indexes and tables. It would make
sense though for such a database to use text files for configuration etc.

You seem to have set up a straw man, and one that has very little to do
with C++, I would add.

Neil Butterworth
Nick Hounsome
2009-11-09 14:56:57 UTC
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
That's not a school of thought - It's a fact. They are preferred and
for those reasons.

That doesn't mean you never really need the performance, but it would have to be quite a large data set or quite a stringent performance requirement to make binary preferable.
Post by Rune Allnor
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and the other formatting glyphs.
In 25 years of programming I have never come across a case (for files) where this has been a problem, and the rate at which storage capacities increase suggests to me that it never will be for any "normal" application.
Post by Rune Allnor
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Again - never in my experience.
In network protocols, YES - because you can never have too much performance in low-level, general-purpose protocols - but in application files I have never had a problem.

A slight optimisation that you might be interested in is to use hex -
This is still portable and readable but can be read and written
without multiplications or divisions.
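A minimal sketch of that idea - mine, not from the original post - under the assumption of 64-bit IEEE-754 doubles: each value's raw bit pattern is written as 16 hex digits, which round-trips exactly and parses with shifts rather than decimal conversion:

#include <cstring>
#include <iomanip>
#include <iostream>
#include <stdint.h>

// Write a double as the 16 hex digits of its bit pattern, and read it back.
// Assumes double and uint64_t are both 64 bits wide (IEEE-754 binary64).
void write_hex(std::ostream& os, double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    os << std::hex << std::setw(16) << std::setfill('0') << bits << ' ';
}

bool read_hex(std::istream& is, double& value)
{
    uint64_t bits;
    if (!(is >> std::hex >> bits))
        return false;
    std::memcpy(&value, &bits, sizeof value);
    return true;
}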
Post by Rune Allnor
Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.
In other words, they are of minor significance - otherwise they would dwarf these things.
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to measure the overhead involved in formatting numerical data back and forth between text and binary representations. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are
If you really worry about performance you will never use the C++ I/O
library conversions - The fastest way to write an integer will almost
certainly be itoa()/atoi() and (if you have it) read()/write()
Post by Rune Allnor
represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
If you are really, really, really speed-obsessed, the way to go is to map a binary file into memory rather than using ANY I/O library at all (mmap on POSIX systems; not sure about Windows).

Try it - You'll be impressed.
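A minimal POSIX sketch of that approach - mine, not from the original post - assuming the file simply contains a contiguous array of native-format doubles (the file name is made up):

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close
#include <cstdio>

int main()
{
    int fd = open("test.raw", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); close(fd); return 1; }

    // Map the whole file: the doubles become directly addressable memory,
    // and the kernel pages them in on demand.
    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const double* data = static_cast<const double*>(p);
    const unsigned long count = st.st_size / sizeof(double);

    double sum = 0.0;
    for (unsigned long i = 0; i < count; ++i)   // touch every value
        sum += data[i];
    std::printf("%lu values, sum = %g\n", count, sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}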
Post by Rune Allnor
The output on my computer is (do note the _different_ numbers of IO cycles in the two cases!):
Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed
A little bit of math produces *average*, *crude* numbers per IO cycle:
Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle
which in turn means there is an overhead on the order of 160e-9/6e-9 ≈ 26x associated with the text formats.
Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.
So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.
They are not wrong. They are just irrelevant to 99% of all
applications.
Post by Rune Allnor
And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.
Rune
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.

I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
Rune Allnor
2009-11-09 20:03:35 UTC
Post by Nick Hounsome
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.
I am working among the 1%. I have seen companies lose business because of poorly performing software where the poor performance went undetected. That is, the people whose job it was to know did not know about the major performance issues.

In one company, which did 24/7 survey jobs and stored the data in text format, merely reading 24 hrs' worth of data from text-formatted files imposed some 3-5 hrs of idle time on the human operators. It wouldn't have been a big deal if those 3-5 hrs had come as one block (the operators in question could have had a long break if they had), but these 3-5 hrs were interspersed throughout the process, tying the operators down in front of their terminals.

From a human standpoint, there are several time scales. Most (all?) readers of this newsgroup are computer programmers, so they know what it means to be in 'The Zone', where time just flies and work gets done.

Now, if you can get a job done with operator idle time of less than a second, the operator can stretch his neck, yawn, have a sip of coffee, and remain in 'The Zone' afterwards.

If the idle time is a couple of seconds, the waiting starts to become noticeable and thus annoying. If the waiting time becomes ten seconds or more, an operator already in 'The Zone' is yanked out of 'The Zone'. If ten seconds of operator idle time is commonplace in the application, the operator never reaches 'The Zone' in the first place.

Once we start talking about minutes of operator idle time, operators go away to have a cup of coffee, read the newspaper, surf the net, flirt with the 20-year-old blonde at the switchboard - whatever. Once that happens, productivity numbers reach the point where companies go out of business.
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
No. The performance issues I worry about are the ones that kick users out of business. No one cares if 15 seconds or 50 seconds is the most representative number for reading 100 MBytes of text-formatted numeric data, when the same amount of binary-formatted data can easily be loaded in 0.3 seconds.

These details have a profound impact where I work. The only reason this is not recognized is the omnipresent misconception that the extra time spent working with text-formatted numeric data is insignificant.

People who know their programming craft would know that one uses binary data formats for numeric data by default, and only deviates towards text-based formats where one can get away with them (file sizes less than about 5-10 MBytes).

Rune
DeMarcus
2009-11-09 22:06:46 UTC
[...]
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
No. The performance issues I worry about are the ones that kick users out of business. No one cares if 15 seconds or 50 seconds is the most representative number for reading 100 MBytes of text-formatted numeric data, when the same amount of binary-formatted data can easily be loaded in 0.3 seconds.
You have a point, but if you asked me to solve the problem, I would
probably try to keep data in the easiest way, i.e. still as text. Why?
Because if it's supposed to be read by a human in the end, text is the
native format. Just as much as a picture's native format is binary and
not XML.

Then I would ask myself: how do we speed this up? As someone suggested, you could use mmap on *nix systems (come to think of it, most configuration files on *nix are actually text files, probably loaded with mmap).

Now, let's say mmap doesn't solve the problem; what do we do next? I would look into compressors. Then you can store compressed files on disk, still in their native text format, and suddenly you have made much of the disk access time disappear with minimal hassle. You don't have to come up with a strange binary format to work around disk latencies.

I used to be part of developing a real-time system where the disks just couldn't sustain real-time transfer rates. We made an adapter class called Compressor, taking pointers to source and destination memory, chained it with our File class, and solved the problem with minimal effort. When the disks got faster a couple of years later we just removed the compressor.
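A minimal sketch of that kind of adapter - mine, not from the original post - using zlib's one-shot compress()/uncompress() on a text buffer before it goes to disk; the function names and the convention of storing the uncompressed size alongside the compressed block are made up for illustration:

#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Shrink a text buffer before it is written to disk; the on-disk bytes
// are binary, but the payload stays in its native text format.
std::vector<unsigned char> compress_text(const std::string& text)
{
    uLongf destLen = compressBound(text.size());
    std::vector<unsigned char> out(destLen);
    if (compress(&out[0], &destLen,
                 reinterpret_cast<const Bytef*>(text.data()),
                 text.size()) != Z_OK)
        throw std::runtime_error("compress failed");
    out.resize(destLen);               // keep only the bytes actually used
    return out;
}

// Expand it again after reading; the caller must have stored the
// original (uncompressed) size next to the compressed block.
std::string expand_text(const std::vector<unsigned char>& packed,
                        uLongf originalSize)
{
    std::string text(originalSize, '\0');
    if (uncompress(reinterpret_cast<Bytef*>(&text[0]), &originalSize,
                   &packed[0], packed.size()) != Z_OK)
        throw std::runtime_error("uncompress failed");
    return text;
}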


Cheers,
Daniel
Graziano
2009-11-10 08:23:36 UTC
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.

In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.

If it is just a bunch of numbers (say, up to some thousands) I will go for sure with a documented XML format.
Nick Hounsome
2009-11-10 12:56:15 UTC
Post by Graziano
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.
In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.
If it is just a bunch of numbers (say, up to some thousands) I will go for sure with a documented XML format.
And I would do exactly the same in your situation, because I've read about how big those files can be - but you are in the 1% (and I'd use memory mapping).

The important thing in this case is to provide the API for readers.
Rune Allnor
2009-11-10 13:03:28 UTC
Post by Graziano
Well, out here there are also programmers working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with convenient APIs, that allow indexing, transformation to text, XML or code, I/O filters (say, compression), missing values, units, etc., and A LOT of publicly available applications to view, plot and explore the data.
In any case, have a look, for example, at the HDF4 or HDF5 formats, the NetCDF format, or the like.
I know what binary file format to use with the data in question.

My problem has been a bit more fundamental than that. When I ask
decision-makers on what grounds text-based file formats were
chosen, people either respond with "text files are so convenient"
or a blank stare.

In other words, strategic decisions that directly affect the
ability to meet deadlines are taken as a matter of course,
without evaluating the operational impact on the process - or
even without the awareness that an alternative existed at all.

Which is why I would like the trade-offs involved to at least
be mentioned in upcoming textbooks on programming in general
and C++ in particular.

Rune
Konstantin Oznobikhin
2009-11-12 12:14:39 UTC
{ Please confine follow ups to matters that are on-topic for clc++m. -mod }

On 10 Nov, 16:03, Rune Allnor <***@tele.ntnu.no> wrote:

[snip]
Post by Rune Allnor
My problem has been a bit more fundamental than that. When I ask
decision-makers on what grounds text-based file formats were
chosen, people either respond with "text files are so convenient"
or a blank stare.
In other words, strategic decisions that directly affect the
ability to meet deadlines are taken as a matter of course,
without evaluating the operational impact on the process - or
even without the awareness that an alternative existed at all.
And how did it happen that the file format is a *strategic* decision? If it is that hard to switch to binary files then you definitely have more fundamental problems, and they are not related to the file format at all. The issue is the inability of those "decision-makers" to develop software according to requirements. If it is not as fast as you need today, it might be very fast but do something irrelevant tomorrow.
Post by Rune Allnor
Which is why I would like the trade-offs involved to at least
be mentioned in upcoming textbooks on programming in general
and C++ in particular.
Probably, it'd help much more if they read books about requirements engineering, QA and good software design instead of ones about text vs binary files - and those topics are already covered in the corresponding books.

--
Konstantin.
Seungbeom Kim
2009-11-10 08:22:33 UTC
Post by Rune Allnor
Post by Nick Hounsome
The textbook authors are writing for the 99%, not the 1%, so they are not going to change.
I am working among the 1%. I have seen companies lose business because of poorly performing software where the poor performance went undetected. That is, the people whose job it was to know did not know about the major performance issues.
So what? I, as well as other posters I guess, understand that your
application may require performance that can't be satisfied by a textual
format, and stated so. Did anyone say that you're not among the 1% or
that you should switch from textual to binary? Do you have anything
to refute from the other replies in this thread?

Frankly I don't understand your point. "But I..." or "But my..."
isn't very meaningful when others didn't preclude your case.
--
Seungbeom Kim

Rune Allnor
2009-11-10 12:59:20 UTC
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that

1) The speed penalty imposed by using text formats for numerical
data is totally absent in the USENET discussions and printed
literature on C++.

2) The speed penalty imposed by using text formats for numerical
data is one of the main bottlenecks in the data processing
chains I have seen where I work.

3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.

4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.

5) The speed penalty imposed by using text formats for numerical data is a *design* *choice*, on a par with using an O(N log N) quicksort instead of an O(N^2) bubble sort. As far as I am concerned, people are free to use text formats if

a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.

Once one or both of these factors no longer applies, text-based formats are out of the picture. As for the "human readability" question, that is irrelevant unless the contents of the file are meant to be inspected by humans.

6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.

Rune
mzdude
2009-11-10 21:30:49 UTC
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
I think the whole text vs. binary debate for large data files is a little facetious. Use what works. I work for a company that acquires spectral data and stores it in native binary format (for speed). We often acquire hundreds of MB or very large fractions of a GB worth of data. If you can honestly tell me that a human is going to sift through a text file looking for inaccuracies in the data, then that person is deluding themselves and has way too much time on their hands.

We provide a DLL interface to read our data file format. One routine will extract the data in native binary format for speed, the other will extract the data and return it in text format. This was done to promote language interoperability. In tests done long ago, when 300 MHz PCs ruled the earth, binary extraction was at least an order of magnitude faster. IIRC it was about 70 times faster.

Now that our instruments are being used in the medical community, we have additional tamper-detection requirements. Storing data in text makes it very tempting (and easy) for a human to fire up a text editor and manipulate the data. It can still be done using a binary editor, but it's much harder to do.

So I would add that for security reasons as well as speed, text file formats do not work for us.
Jens Schmidt
2009-11-12 01:18:40 UTC
mzdude wrote:

[comb formatting repaired]
Post by mzdude
Now that our instruments are being used in the medical community, we
have additional tamper detection requirements. Storing data in text
makes it very tempting (and easy) for a human to fire up a text editor
and manipulate the data. It can still be done using a binary editor,
but it's much harder to do.
So I would add for security reasons as well as speed text file formats
do not work for us.
Using a binary editor is not very different to using a text editor. In
fact, Emacs can edit both. You may be in for a bad surprise when that
happens.
Security by obscurity is just no security at all. If you really want tamper detection, then use digital signatures. You will get away with just binary files only if you don't have to certify your security measures, or if the person issuing the certificate is not worth his/her money.
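If the goal is tamper *detection* rather than secrecy, even a keyed MAC over the file contents does better than relying on the format being binary; a full digital signature works the same way with asymmetric keys. A minimal sketch - mine, not from the original post - using OpenSSL's one-shot HMAC() with SHA-256 (key management is deliberately left out, and the function name is made up):

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <string>
#include <vector>

// Compute an HMAC-SHA256 tag over a file's bytes. Store the tag next to
// the file and recompute/compare it on load: any edit - with a text
// editor or a binary editor - changes the tag.
std::vector<unsigned char> file_tag(const std::vector<unsigned char>& contents,
                                    const std::string& key)
{
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;
    HMAC(EVP_sha256(),
         key.data(), static_cast<int>(key.size()),
         contents.empty() ? NULL : &contents[0], contents.size(),
         md, &md_len);
    return std::vector<unsigned char>(md, md + md_len);
}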
--
Greetings,
Jens Schmidt


Rune Allnor
2009-11-12 18:29:52 UTC
Post by Jens Schmidt
[comb formatting repaired]
Post by mzdude
Now that our instruments are being used in the medical community, we
have additional tamper detection requirements. Storing data in text
makes it very tempting (and easy) for a human to fire up a text editor
and manipulate the data. It can still be done using a binary editor,
but it's much harder to do.
So I would add for security reasons as well as speed text file formats
do not work for us.
Using a binary editor is not very different to using a text editor. In
fact, Emacs can edit both. You may be in for a bad surprise when that
happens.
Security by obscurity is just no security at all.
Binary formats present a first layer of defence against tampering, more or less on the same level as, say, the plastic tape law enforcement agencies string around crime scenes: casual passers-by are kept at some distance.

True, the binary format alone presents no serious defence against a determined adversary, but people who just happen to come across the files are encouraged to keep their distance.

Rune
Francis Glassborow
2009-11-10 21:30:39 UTC
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data is totally absent in the USENET discussions and printed
literature on C++.
Because it is irrelevant to those involved. And any halfway competent programmer would understand that there is an overhead for using a text file.
Post by Rune Allnor
2) The speed penalty imposed by using text formats for numerical
data is one of the main bottlenecks in the data processing
chains I have seen where I work.
Which means that a properly qualified programmer would recognise that
this is one of the minority cases where a binary format would be useful.
BTW in another post I mentioned that I do use binary formats for scratch
files (exactly because there is no advantage in using a text format).

In addition, it should be noted that a programmer who does not recognise when to use a binary file format probably does not understand the dangers of using one either.
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kinds of performance hits are almost invariably the consequence of using the wrong algorithms.
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical data is a *design* *choice*, on a par with using an O(N log N) quicksort instead of an O(N^2) bubble sort. As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
The point is that for the overwhelming majority using text formats is win-win. I would expect to see information about when to use binary formats in specialist books on areas where it matters (authors of general texts have to trim the content to meet criteria provided by publishers, and something that is important to a very small minority would almost invariably be cut).
Post by Rune Allnor
Once one or both of these factors no longer applies, text-based formats are out of the picture. As for the "human readability" question, that is irrelevant unless the contents of the file are meant to be inspected by humans.
True, but readability also extends to tools - except that then we often call it portability.
Post by Rune Allnor
6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.
See above. Unfortunately far too many programmers consider reading to be
an arcane art and never read books. Worse they think they know all the
answers and never listen to advice from others.

A programmer who cannot recognise that text formats affect performance
is not fit for anything other than grunt-work.

Actually I would suggest that such things as choice of file formats are
design issues and if an employer chooses not to employ a competent
designer with knowledge of the area he gets all he deserves.
Rune Allnor
2009-11-11 05:15:23 UTC
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
*I* am not using anyone. I happen to work among people who
ought to know these things, but don't.
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in MATLAB, which is an increasingly popular language for these kinds of things. The script first generates ten million random numbers and writes them to file in both ASCII and binary double-precision floating-point formats. The files are then read straight back in, hopefully mitigating the effects of file caches etc.:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])


t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])


t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])


t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.05/0.109 ≈ 220x faster than text writes.
Binary reads are 42.2/0.328 ≈ 130x faster than text reads.

These numbers are representative for where I work.
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choise*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sorts algorithms.
What I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
The point is that for the overwhelming majority using text formats is win-win. I would expect to see information about when to use binary formats in specialist books on areas where it matters (authors of general texts have to trim the content to meet criteria provided by publishers, and something that is important to a very small minority would almost invariably be cut).
My point is that the people who only know a little programming would also benefit from at least having seen these things mentioned.

I don't mind you and other textbook authors arguing fiercely for one approach and against the other, *provided* both approaches are at least mentioned, and preferably described in terms of pros & cons.
Post by Francis Glassborow
Post by Rune Allnor
6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.
See above. Unfortunately far too many programmers consider reading to be
an arcane art and never read books. Worse they think they know all the
answers and never listen to advice from others.
A programmer who cannot recognise that text formats affect performance
is not fit for anything other than grunt-work.
Actually I would suggest that such things as choice of file formats are
design issues and if an employer chooses not to employ a competent
designer with knowledge of the area he gets all he deserves.
I might agree with you in both cases, if the programmers and designers had access to textbooks where these questions are discussed.

As of right now, they don't.

Rune
Seungbeom Kim
2009-11-12 00:59:11 UTC
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in matlab, which is an increasingly
popular language for these kinds of things. [...]
You can discuss it in a MATLAB forum, then. MATLAB measurements in a C++ forum don't mean much, because the people here don't know (and may not be interested in, either) what's going on inside MATLAB. Gratuitous inefficiencies inside MATLAB, if any, cannot be used to justify any argument about C++.
Post by Rune Allnor
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.05/0.109 ≈ 220x faster than text writes.
Binary reads are 42.2/0.328 ≈ 130x faster than text reads.
These numbers are representative for where I work.
Again, if MATLAB is representative for where you work, please visit
a MATLAB forum. Otherwise, there was a C++ test program by James Kanze
in the comp.lang.c++ thread you mentioned, and the numbers given by
that program are much more persuasive and convincing, at least here
in a C++ newsgroup. Or you can suggest a better C++ program, of course.
Post by Rune Allnor
My point is that the people who only know a little programming would also benefit from at least having seen these things mentioned.
I don't mind you and other textbook authors arguing fiercely for one approach and against the other, *provided* both approaches are at least mentioned, and preferably described in terms of pros & cons.
It may not be a job for language textbooks, as I mentioned earlier,
though it definitely is for numerical programming textbooks, or
more general programming books dealing with choosing data formats.

It is very natural and acceptable that language textbooks focus on
the language features and that for the sake of simplicity they default
to a text data format that's easier to understand and debug. They
don't want the readers to struggle with other issues when they don't understand the language features very well yet.

You are welcome to write a book of your own, of course.
--
Seungbeom Kim

Francis Glassborow
2009-11-12 12:09:21 UTC
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
3) The speed penalty imposed by using text formats for numerical
data is totally unknown among the people whose job it is to
set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
*I* am not using anyone. I happen to work among people who
ought to know these things, but don't.
Then they are inadequately trained for the job. As many know, I am a self-taught amateur, and I have always been aware of the overhead of using text files (and once again note that I sometimes use binary for scratch files, i.e. files that are only used within a single run of a program, exactly because of the performance gain). No one ever had to tell me, because it was blatantly obvious that converting to/from text format would take time.
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
Below is a test I wrote in MATLAB, which is an increasingly popular language for these kinds of things. The script first generates ten million random numbers and writes them to file in both ASCII and binary double-precision floating-point formats. The files are then read straight back in, hopefully mitigating
No, almost certainly aggravating it. MATLAB is rather good at using caches, and I think the timings you give below have a great deal to do with that.
Post by Rune Allnor
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Wow! That is a data transfer rate of around a gigabyte per second (ten million doubles is 80 MB). What drives are you using? I suspect you are not measuring what you think you are; I suspect that drive caches are coming into the equation somewhere. Of course I am no numerical expert and hardware performance keeps improving, but a hard drive delivering a sustained gigabyte/sec transfer rate is rather beyond my expectations.

I think what is happening is that the extra size of the text data is
blowing the cache limits and so part of those timings relate to the real
STR for the drive system.

The only way to measure this kind of thing is to get data for a range of different values of N. I think you will find that the graph has one or more discontinuities.

When I get an idle moment I will write a C++ program to investigate (my experience suggests something more like a factor of 5 or 10).
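In the meantime, a minimal sketch of such a sweep - mine, not the program promised above - timing a memcpy-style 'binary' pass against stringstream text formatting for a range of N (clock() is coarse, so the numbers are only indicative):

#include <cstdio>
#include <cstring>
#include <ctime>
#include <sstream>
#include <vector>

int main()
{
    // Sweep N over several decades and report seconds per pass for a
    // memcpy-style copy versus text formatting through a stringstream.
    for (unsigned long N = 1000; N <= 10000000; N *= 10)
    {
        std::vector<double> src(N), dst(N);
        for (unsigned long i = 0; i < N; ++i)
            src[i] = i * 0.5;

        std::clock_t t0 = std::clock();
        std::memcpy(&dst[0], &src[0], N * sizeof(double));
        std::clock_t t1 = std::clock();

        std::stringstream ss;
        for (unsigned long i = 0; i < N; ++i)
            ss << src[i] << ' ';
        double v;
        unsigned long k = 0;
        while (k < N && ss >> v)
            dst[k++] = v;
        std::clock_t t2 = std::clock();

        std::printf("N=%8lu  binary %.4f s  text %.4f s\n", N,
                    double(t1 - t0) / CLOCKS_PER_SEC,
                    double(t2 - t1) / CLOCKS_PER_SEC);
    }
    return 0;
}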
Rune Allnor
2009-11-12 18:31:57 UTC
On 12 Nov, 13:09, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
wow! that is a data transfer rate of around a gigabyte per second. What
drives are you using?
The thing is listed as 300 MBytes/s.
Post by Francis Glassborow
I suspect you are not measuring what you think you
are.
I think I am measuring the time delay from the point when the terminal goes into 'busy' mode, where the user is prevented from interacting with the program, until the program accepts user input again. Those are the numbers that matter.
Post by Francis Glassborow
I suspect that drive caches are coming into the equation somewhere.
Of course I am no numerical expert and hardware performance keeps
improving but a hard-drive delivering a sustained gigabyte/sec transfer
rate is rather beyond my expectations.
I think what is happening is that the extra size of the text data is
blowing the cache limits and so part of those timings relate to the real
STR for the drive system.
The only way to measure this kind of thing is to get data for a range of different values of N. I think you will find that the graph has one or more discontinuities.
Maybe. As for the cache effects, here is the output after
I ran the ASCII write/read cycle before the binary write/read
cycle:

Wrote binary data in 0.15625 seconds
Read binary data in 0.32813 seconds
Wrote ASCII data in 22.4844 seconds
Read ASCII data in 43.9844 seconds

I can't see any significant differences in the numbers
that would indicate severe cache effects.
Post by Francis Glassborow
When I get an idle moment I will write a C++ program to investigate (my experience suggests something more like a factor of 5 or 10).
I would be very interested in seeing that kind of thing.
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().

Rune
Bart van Ingen Schenau
2009-11-15 19:32:18 UTC
Post by Rune Allnor
On 12 Nov, 13:09, Francis Glassborow
Post by Francis Glassborow
When I get an idle moment I will write a C++ program to investigate
(me experience suggests something more like a factor of 5 or 10)
I would be very interested in seeing that kind of thing.
Here are some measurements that I made. The measurements were made with
the unix tool 'time', which gives the elapsed time for a complete
program run including startup and shutdown.
To simulate that the data is generated and consumed by different
programs (so in-memory buffering is not possible), reading and writing
are done in separate runs. This has the added advantage that you get
separate measurements for reading and writing.

I have tested text file, native binary files and big-endian IEEE
double-format binary files with the following results:

Writing 10000000 values:
Native binary: 1.716 s (reference value)
IEEE binary: 2.708 s (1.58 times slower)
Text: 22.465 s (13.09 times slower)

Reading 10000000 values:
Native binary: 1.032 s (reference value)
IEEE binary: 2.208 s (2.14 times slower)
Text: 8.693 s (8.42 times slower)

As you can see, there is a factor 8 to 13 between text and native
binary. When comparing with a precisely specified binary format, the
difference becomes about half that factor.
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification. It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.

The program I tested with is this
<start code>
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>
#include <cfloat>
#include <math.h>

//#define _DEBUG

using namespace std;

typedef enum {
TEXT,
NATIVE,
IEEE,
} Operation;

ostream& write_text(ostream& os, double val)
{
return os << val << ' ';
}

ostream& write_native(ostream& os, double val)
{
return os.write(reinterpret_cast<const char*>(&val), sizeof(val));
}

ostream& write_ieee(ostream& os, double val)
{
int power;
double significand;
unsigned char sign;
unsigned long long mantissa;
unsigned char bytes[8];

if(val<0)
{
sign=1;
val = -val;
}
else
{
sign=0;
}
significand = frexp(val,&power);

if (power < -1022 || power > 1023)
{
cerr << "ieee754: exponent out of range" << endl;
os.setstate(ios::failbit);
}
else
{
power += 1022;
}
mantissa = (significand-0.5) * pow(2,53);

bytes[0] = ((sign & 0x01) << 7) | ((power & 0x7ff) >> 4);
bytes[1] = ((power & 0xf)) << 4 |
((mantissa & 0xfffffffffffffLL) >> 48);
bytes[2] = (mantissa >> 40) & 0xff;
bytes[3] = (mantissa >> 32) & 0xff;
bytes[4] = (mantissa >> 24) & 0xff;
bytes[5] = (mantissa >> 16) & 0xff;
bytes[6] = (mantissa >> 8) & 0xff;
bytes[7] = mantissa & 0xff;
return os.write(reinterpret_cast<const char*>(bytes), 8);
}

istream& read_text(istream& is, double& val)
{
return is >> val;
}

istream& read_native(istream& is, double& val)
{
return is.read(reinterpret_cast<char*>(&val), sizeof(val));
}

istream& read_ieee(istream& is, double& val)
{
unsigned char bytes[8];

is.read(reinterpret_cast<char*>(bytes), 8);
if (is)
{
int power;
double significand;
unsigned char sign;
unsigned long long mantissa;

mantissa = ( ((unsigned long long)bytes[7]) |
(((unsigned long long)bytes[6]) << 8) |
(((unsigned long long)bytes[5]) << 16) |
(((unsigned long long)bytes[4]) << 24) |
(((unsigned long long)bytes[3]) << 32) |
(((unsigned long long)bytes[2]) << 40) |
(((unsigned long long)bytes[1]) << 48) )
& 0xfffffffffffffLL;
significand = (mantissa/pow(2,53)) + 0.5;
power = (((bytes[1] >> 4) |
(((unsigned int)bytes[0]) << 4)) & 0x7ff) - 1022;
sign = bytes[0] >> 7;
val = ldexp(significand, power);
if (sign) val = -val;
}
return is;
}

int main(int argc, char** argv)
{
if (argc != 5)
{
cerr << "Usage: " << argv[0] << " <r(ead)|w(rite)>" <<
" <t(ext)|n(ative)|i(eee)> <N> <filename>" << endl;
return EXIT_FAILURE;
}

bool read_mode = (tolower(argv[1][0]) == 'r');
unsigned long num = strtoul(argv[3], NULL, 0);
Operation op_mode;

switch (tolower(argv[2][0]))
{
case 't': default: op_mode = TEXT; break;
case 'n': case 'b': op_mode = NATIVE; break;
case 'i': op_mode = IEEE; break;
}

//TODO: Insert timing code here

if (read_mode)
{
ifstream is(argv[4], (op_mode == TEXT ? ios::in : ios::binary));
double value;

for (unsigned long count = 0; count < num; count++)
{
switch (op_mode)
{
case TEXT: read_text (is, value); break;
case NATIVE: read_native(is, value); break;
case IEEE: read_ieee (is, value); break;
}
if (!is)
{
if (is.eof())
{
cerr << "Unexpected EOF after reading " << count
<< " values from file \"" << argv[4] << '"' << endl;
}
else
{
cerr << "Read error after reading " << count
<< " values from file \"" << argv[4] << '"' << endl;
}
break;
}
#ifdef _DEBUG
else
{
cout << value << '\n';
}
#endif
}
}
else
{
ofstream os(argv[4], (op_mode == TEXT ? ios::out : ios::binary));
double value;

for (unsigned long count = 0; count < num; count++)
{
value = rand();
switch (op_mode)
{
case TEXT: write_text (os, value); break;
case NATIVE: write_native(os, value); break;
case IEEE: write_ieee (os, value); break;
}
if (!os)
{
cerr << "Write error after writing " << count
<< " values to file \"" << argv[4] << '"' << endl;
break;
}
#ifdef _DEBUG
else
{
cout << value << '\n';
}
#endif
}
}

//TODO: Insert timing code here

}
<end code>
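For reference, a minimal sketch of what could go at the two TODO markers
above, assuming plain <ctime>/clock() resolution is good enough at these
run lengths (the variable names are only illustrative):

<start code>
// at the first TODO (needs #include <ctime>):
std::clock_t t_start = std::clock();

// ... the read or write loop runs here ...

// at the second TODO:
std::clock_t t_stop = std::clock();
std::cerr << (t_stop - t_start) / static_cast<double>(CLOCKS_PER_SEC)
          << " seconds" << std::endl;
<end code>

With something like this in place, the timings above can be reproduced by
running the program in write mode and then read mode for each format, with
arguments along the lines of "w i 10000000 test.ieee" followed by
"r i 10000000 test.ieee".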
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-16 14:27:21 UTC
Permalink
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either

1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.

and you are in deep trouble.

And do keep in mind, there are plenty of people around who
are likely to insist on doing things in native language (which
would likely apply to locale settings) as a matter of principle:
"I will not accept having to switch to, or be dominated by, a
foreign language in order to get my local, native-language job
done!"
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Bart van Ingen Schenau
2009-11-16 21:04:25 UTC
Permalink
Post by Rune Allnor
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either
1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.
When a program is dealing with data that is primarily meant to be
processed by other programs (so, the data format is an data-exchange
format), then whatever locale the user switches to should have no effect
whatever on the files that are produced or consumed.

In other words, two program runs that only differ in the locale that was
set by the user should produce identical data-exchange files.

<snip>
Post by Rune Allnor
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few
hundred numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).
They may want to do that, but then it is *their* responsibility to leave
the files in a state in which they can still be processed by the intended
applications.
Do you make your I/O routines so robust that they can handle any
arbitrary garbage that was inserted in the file? Or the removal of some
data?
Or do you bail out at a certain point and say: "This file has been
corrupted. It is no use to me"?

And note that binary files are even more sensitive to inquisitive users.
If their tooling can't even read a text-formatted file without
destroying it, I shudder to think about what that tooling would do with
a binary file.
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-17 02:19:19 UTC
Permalink
Post by Rune Allnor
Post by Bart van Ingen Schenau
Post by Rune Allnor
One that takes stuff like locales, input validation and
error checking into account, that is; not just the plain
calls to std::atof() or std::atoi().
The program does not do extensive input validation, because it is
assumed that the input files are created by another automated system
working to the same interface specification.
Famous last words. The specification might be the same; what
matters are the actual locale settings. It only takes a human
user to either
1) Not be aware of the importance of locales and never
specify locales, let alone set them according to spec
2) Switch from specified to local locale (ouch...) to
write a native-language memo or email.
and you are in deep trouble.
When you are forced to deal with considering binary I/O for
performance reasons, you are not dealing with a general purpose
application where the users will be totally clueless. With a
specialized application there must be a learning curve for the
user. If unqualified people attempt to use specialized applications
you get what you should expect. You cannot idiot proof things; idiots
are clever and cunning and always improving. I spent 32 years in the
military so I feel qualified to speak to the advancing quality of
stupidity.

This hypothetical specification would need to include details and no
competent programmer would force the user to muck around with system
settings; you would simply control the application I/O locale setting
in the code.
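As a concrete illustration of controlling the I/O locale in the code rather
than relying on system settings, a minimal sketch (the file name and value
are only placeholders):

<start code>
#include <fstream>
#include <locale>

int main()
{
    // The data-exchange file is pinned to the classic "C" locale,
    // independent of whatever locale the user or the environment selects:
    // '.' as decimal separator, no digit grouping.
    std::ofstream os("delivery.txt");
    os.imbue(std::locale::classic());
    os << 3.14159 << '\n';
    os.close();

    // The reader pins the same locale, so both ends agree by construction.
    std::ifstream is("delivery.txt");
    is.imbue(std::locale::classic());
    double d;
    is >> d;
}
<end code>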
Post by Rune Allnor
And do keep in minds, there are plenty of people around who
are likely to insist on doing things in native language (which
"I will not accept having to switch to, or be dominated by, a
foreign language in order to get my local, native-language job
done!"
Post by Bart van Ingen Schenau
It is unreasonable to
assume that humans will write files containing more than a few hundred
numbers.
Maybe, but humans might be likely to tinker with the
numbers, either by poking around in the files or by
using some file viewer that somehow alters the contents
of the data (e.g. by changing end-of-line encodings).
These appear to be local (to your domain) problems. Most applications
aren't facing nor solving these problems. There are plenty of real
problems to go around. Sounds like borrowed trouble.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-17 13:49:26 UTC
Permalink
Post by stan
These appear to be local (to your domain) problems.
Local to my geography / locale. Not application.
Post by stan
Most applications
aren't facing nor solving these problems.
Maybe these problems are hidden in the 100% English-speaking
locales. In anything but 100% English locales, one would be
wise to watch out for these kinds of things.
Post by stan
There are plenty of real
problems to go around. Sounds like borrowed trouble.
Nope. On my first survey trip I all of a sudden got a
furious client representative almost crushing the door
of my office. It turned out the people who produced the
end delivery data files had used a computer with
locale settings that text-formatted floating point data
with comma, not dot, as decimal mark.

The end result was that we had to go through a month's
worth of previous data deliveries and replace comma decimal
separators with dots. Thankfully, the data were tab
separated in the file, so the correction could be done
in a matter of hours. If the data had been comma separated,
we would have been in real trouble.

I can assure you that there was nothing 'borrowed',
whatsoever, with that experience.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-18 02:15:54 UTC
Permalink
Post by Rune Allnor
Post by stan
These appear to be local (to your domain) problems.
Local to my geography / locale. Not application.
No, I meant in your personal experience domain. This isn't an
unsolvable problem out here in the wild. I'm not claiming that i18n
problems don't exist or are trivial, but they don't pose an unsolvable
problem for I/O performance or portability.

A user who needs locale specific display is very different from an I/O
bound application. Different problems; different solutions.
Post by Rune Allnor
Post by stan
Most applications
aren't facing nor solving these problems.
Maybe these problems are hidden in the 100% English-speaking
locales. In anything but 100% English locales, one would be
wise to watch out for these kinds of things.
Failure to specify an output format that you intend to pass around, and
that thus requires some portability, is incompetence, not a technical
problem. Technology will never solve incompetence or stupidity.

Locales are not a data file format issue; they are a user display
issue. If the users demand locale-correct viewing of the data,
then you clearly have to rule out binary entirely, unless you are
talking to old people (that really hurt) who read things in octal.
Post by Rune Allnor
Post by stan
There are plenty of real
problems to go around. Sounds like borrowed trouble.
Nope. On my first survey trip I all of a sudden got a
furious client representative almost crushing the door
of my office. It turned out the people who produced the
end delivery data files had used a computer with
locale settings that text-formatted floating point data
with comma, not dot, as decimal mark.
The end result was that we had to go through a month's
worth of previous data deliveries and replace comma decimal
separators with dots. Thankfully, the data were tab
separated in the file, so the correction could be done
in a matter of hours. If the data had been comma separated,
we would have been in real trouble.
sed 's/,/./g'
or better
tr ',' '.'

Having humans do this manually is silly.
Post by Rune Allnor
I can assure you that there was nothing 'borrowed',
whatsoever, with that experience.
The problems you describe are the result of incompetence. If the
problem you face is stupidity then I/O formats aren't really in the
problem domain; hence you borrow the trouble.

In a case of incompletely specified formats, neither binary nor text
will be satisfactory.

If you need to pass data around among users who demand different
locales then you need to specify a portable data format and then
create a viewer/editor for users.

In other words focus on the problem not a symptom.

When you have a performance problem, find a performance solution after
profiling. When you have a data portability problem, clarify the
specification. When you have to work for stupid people, work on your
resume and networking, or seek help coping.

IMHO, fixing general textbooks to match your specialized domain is a
non-starter. Your view of binary versus text formats might be correct
for your field, but it doesn't match the 99% of programming in other
fields. Programming is a complicated task and designing data formats
is not really one of the fundamental things a beginner should be
worrying about.

In fact for absolute beginners in a survey course, text is the only
reasonable format since clarity of understanding is what matters most
and performance and even portability is on the back burner. Somewhere
between intermediate and advanced programmers should be learning about
performance and portability issues. Future managers should not factor
into the material covered in a beginner programmer course. Again,
focus on the problem instead of the symptom.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-13 02:09:57 UTC
Permalink
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
<snip>
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
I find the 100x or 200x claims very hard to swallow even in special
cases and they are simply wrong in the general case. The raw I/O plus
parsing performance difference is less than 100x, or the programmer is
not terribly good at this sort of thing. Given a context where I/O
turns out to be a critical bottleneck, it's not unreasonable for a
competent programmer to bypass canned library I/O and resort to custom
routines as needed.
Post by Rune Allnor
Below is a test I wrote in matlab, which is an increasingly
popular language for these kinds of things. The script first
generates ten million random numbers, and writes them to file
on both ASCII and binary double precision floating point formats.
The files are then read straight back in, hopefully mitigating
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
These numbers are representative for where I work.
Are you programming MATLAB or C++? I'm confused because this is a C++
newsgroup, so MATLAB's I/O performance is not really relevant,
convincing, or germane.

Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choice*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sort algorithms.
As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
You can't really know the speed penalty without some code and some
benchmarks. Are you including prototypes in this *design* phase?
Post by Rune Allnor
Post by Francis Glassborow
The point is that for the overwhelming majority using text formats is
win-win. I would expect to see information about when to use binary
formats in specialist books on areas where it matters (authors of
general texts have to trim the content to meet criteria provided by
publishers and something that is important to a very small minority
would almost invariably be cut).
My point is that the people who only know a little programming
also would benefit from at least having seen these things
mentioned.
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are you referring to here? I've never seen one that
didn't give a passing nod to tradeoffs.

I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead. When you get down to specialized
fields then things change and specialized books usually spend more
time on relevant specialized concerns.

In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).

Not one person has claimed that binary is absolute evil, but most have
cautioned that experience leads to the conclusion that binary formats
can lead to unpleasant surprises and that the competent programmer
must conduct due diligence to rule out possible future contexts where
binary becomes as productive as a broken bridge.

Any programmer who hasn't been bitten by a program that outlived the
hardware it was written on hasn't been programming very long.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-13 18:40:33 UTC
Permalink
Post by stan
Post by Rune Allnor
On 10 Nov, 22:30, Francis Glassborow
<snip>
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
4) The speed penalty imposed by using text formats for numerical
data can easily be on the order of 100x or 200x relative to
using binary data, depending on implementations of the software
that accesses the file - not all applications are written in
C++; not all C++ applications are efficient.
I frankly do not believe that. Those kind of performance hits are almost
invariably the consequence of using the wrong algorithms.
I find the 100x or 200x claims very hard to swallow even in special
cases and they are simply wrong in the general case. The raw I/O plus
parsing performance difference is less than 100x, or the programmer is
not terribly good at this sort of thing. Given a context where I/O
turns out to be a critical bottleneck, it's not unreasonable for a
competent programmer to bypass canned library I/O and resort to custom
routines as needed.
What seems to happen is that users assemble a set of general-
purpose SW products, and use different SW products to handle
different sub-tasks of the processing.

The efficiency of each SW package depends on a number of
factors outside anyone's control:

- The skill of the programmer(s) who implemented the SW
- The implementational details of the languages and/or
libraries used for the SW
- The intended scale of the SW - some of these things
are clearly intended for industrial scale; others
might have been intended more or less as toys.

and so on.

As far as I can tell, few if any have reviewed these
processing chains from an efficiency POV.
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.

Matlab is an increasingly popular language to use for
these kinds of things (as I said, few if any are
aware of efficiency questions), so by selecting to use
text files in the process, one exposes the user to
matlab's inefficiencies.
Post by stan
Post by Rune Allnor
Post by Francis Glassborow
Post by Rune Allnor
5) The speed penalty imposed by using text formats for numerical
data is a *design* *choice*, on a par with using O(NlgN) quick
sort algorithms instead of O(N^2) bubble sort algorithms.
As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
You can't really know the speed penalty without some code and some
benchmarks. Are you including prototypes in this *design* phase?
I design this processing chain for efficiency, that is, so the
end user has a chance of meeting his deadlines. The design
is based on avoiding the known bottlenecks in the present chain.
Post by stan
Post by Rune Allnor
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are you referring to here? I've never seen one that
didn't give a passing nod to tradeoffs.
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
that also treat more general aspects of programming:

- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"

In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.

None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Post by stan
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.

These people can at most be expected to have one intro class
of programming (and quite a few didn't even have that). It is
*those* people, who were forced to take that one programming
class they didn't really want to take, who need to know these
things.

Get the two specialized performance issues of text/binary
file formats and parallelization at least mentioned in the
intro books, and the specialists can start talking with
the domain experts without running the risk of coming across
as calling them amateurs or idiots and so on.

And yes, that's a real risk: It seems to be one of the
fundamental human traits that being corrected at very
fundamental levels is perceived as an accusation of being
incompetent, stupid, those kinds of things.

A process uses text files for data storage, and now
sees data handling as a major bottleneck. If somebody points
out that "text files are at least one order of magnitude
slower than binary files" to whoever made the decision to
use text files, this is often enough perceived by the
decision maker as a statement like "you are an incompetent
fool."

Not because of any way or manner the statement was made.
Because of the obvious correctness of the stated fact.
The more basic and obvious the correction is, the worse
the perceived implicit accusation against whoever made
the wrong decision in the first place.
Post by stan
When you get down to specialized
fields then things change and specialized books usually spend more
time on relavant specialized concerns.
Those are relevant for the programmers. I am talking
about the decision makers, who at best have superficial
exposure to programming literature.
Post by stan
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.

Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? Where I live, comma
(ASCII character 44) plays that role. The dot, if used,
might sometimes be found as a digit grouping character.

The assignment

double one_million = 1.000.000,00;

might be a valid declaration in my locale.

Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.

Of course, one can limit oneself to accessing only the
files one's own computer generated, with known character
sets and locales, but then one suffers from the same
limitations as with binary files.

Once such arguments are taken into account, time penalties
using text files sky-rocket.
Post by stan
Not one person has claimed that binary is absolute evil, but most have
cautioned that experience leads to the conclusion that binary formats
can lead to unpleasant surprises and that the competent programmer
must conduct due diligence to rule out possible future contexts where
binary becomes as productive as a broken bridge.
Any programmer who hasn't been bitten by a program that outlived the
hardware it was written on hasn't been programming very long.
The problem is not hardware, the problem is architectures.
There are only so many architectures. True, the details
about converting from one to another can be nasty, but the
main problem is to find the docs about the details of each
architecture.

As for C++, one main problem is that it has no fixed-size
data types like char8_t, uint32_t and so on, which are needed
to write portable access SW for the contents of the
binary files.

As I understand it, such types will become standardized with
C++0x?
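For what it is worth, a sketch of how such fixed-width types can already be
used for portable binary access, assuming a C99-style <stdint.h> is
available (most current compilers ship one; C++0x's <cstdint> standardizes
it) and that the host uses IEEE 754 doubles; the function name is only
illustrative:

<start code>
#include <stdint.h>   // uint64_t; <cstdint> in C++0x
#include <cstring>
#include <ostream>

// Writes a double as 8 bytes in big-endian order, independent of the
// host's byte order and integer sizes. Assumes the host representation
// is already IEEE 754; only the byte order is normalized here.
std::ostream& write_be_double(std::ostream& os, double val)
{
    uint64_t bits;
    std::memcpy(&bits, &val, sizeof bits);   // grab the bit pattern

    unsigned char bytes[8];
    for (int i = 0; i < 8; ++i)              // most significant byte first
        bytes[i] = static_cast<unsigned char>(bits >> (56 - 8 * i));

    return os.write(reinterpret_cast<const char*>(bytes), 8);
}
<end code>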

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Bart van Ingen Schenau
2009-11-15 19:32:47 UTC
Permalink
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
Matlab is an increasingly popular language to use for
these kinds of things (as I said, few if anyone are
aware of efficiency questions), so by selecting to use
text files in the process, one exposes the user to
matlab's inefficiencies.
But if you have a problem with Matlab's inefficiencies, why do C++
textbooks have to be changed to address that issue?
Most likely, the issues involved are completely independent of the
language you use, and that alone is reason enough not to insist that it
must be addressed in C++-specific books, but in domain-specific books so
that also the others that choose a different language (such as Java or
Matlab) can benefit from the knowledge.

<snip>
Post by Rune Allnor
Post by stan
Post by Rune Allnor
I don't mind you and other textbook authors arguing fiercly
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.
What textbooks are your referring to here? I've never seen one that
didn't give a passing nod to tradeoff's.
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.
None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Curiously, none of the books you mention are about the specific domain
for which you write software.
Additionally, most of the books are entry-level C++ books, and I
consider using binary files in a proper way to be way beyond entry level
(more like expert level).
To me, it is therefore not surprising that you can't find any treatise
on the merits of binary files in those books. If it isn't beyond the
level of the intended readership, it is outside the scope that the
authors want to address.
Post by Rune Allnor
Post by stan
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.
These people can at most be expected to have one intro class
of programming (and quite a few didn't even have that). It is
*those* people, who were forced to take that one programming
class they didn't really want to take, who need to know these
things.
No, they *don't* need to know all the factors involved in those issues.
The domain experts need to know when some decisions might require
knowledge that is outside their realm of expertise and know to ask an
expert in that other domain for input.
Post by Rune Allnor
Get the two specialized performance issues of tex/binary
file formats and parallelization at least mentioned in the
intro books, and the specialists can start talking with
the domain experts without running the risk of coming across
as calling them amateurs or idiots and so on.
I can tell you that you are in for a rude awakening.
I have a university degree in Chemical Engineering with a specialisation
in Software Engineering. The specialisation took about a third of the
entire four-year course and exists for the express purpose to train
people that are capable of bridging the gap between the domain experts
on the chemistry side and the programmers on the other side.
So, you think you can teach in a single introductory course what my
university set up a complete specialisation for?

<snip>
Post by Rune Allnor
Post by stan
When you get down to specialized
fields then things change and specialized books usually spend more
time on relavant specialized concerns.
Those are relevant for the programmers. I am talking
about the decision makers, who at best have superficial
exposure to programming literature.
And that means the issue must either be addressed in the relevant domain
literature, or the programmers must challenge the decisions that they
fear are not made on proper grounds.
Post by Rune Allnor
Post by stan
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.
The exact same reasoning holds true for binary files, when you consider
using them across different architectures.
In both cases, the file format must be rigorously defined.
Post by Rune Allnor
Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? Where I live, comma
(ASCII character 44) plays that role. The dot, if used,
might sometimes be found as a digit grouping character.
The assignment
double one_million = 1.000.000,00;
might be a valid declaration in my locale.
Err, no. Because C and C++ do not allow for locale-dependent number
formatting in the source files.

On the other hand, if you have the snippet
double x;
std::cin >> x;
and the user types the input
1,000
What value will be stored in x? Will it be one, or one thousand?

For situations like this, it is usually not possible to allow fully
unconstrained formatting of the input files.
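A small sketch that makes the ambiguity concrete; the hand-written facet
below stands in for a comma-grouping locale, since named locales such as
"en_US" are platform-specific:

<start code>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

// A numpunct facet in which ',' separates groups of three digits,
// as in many real locales.
struct grouped_numpunct : std::numpunct<char>
{
    char do_thousands_sep() const { return ','; }
    std::string do_grouping() const { return "\3"; }
};

int main()
{
    double x;

    std::istringstream a("1,000");          // default "C" locale
    a >> x;
    std::cout << "classic locale reads:  " << x << '\n';   // 1 (stops at ',')

    std::istringstream b("1,000");
    b.imbue(std::locale(b.getloc(), new grouped_numpunct));
    b >> x;
    std::cout << "grouping locale reads: " << x << '\n';   // 1000
}
<end code>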
Post by Rune Allnor
Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.
Of course, one can limit oneself to accessing only the
files one's own computer generated, with known character
sets and locales, but then one suffers from the same
limitations as with binary files.
Yes. The limitation that you have to specify the file format beforehand
in detail.
Post by Rune Allnor
Rune
Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-15 19:31:52 UTC
Permalink
Post by Rune Allnor
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary. And what is that book doing on your shelf?
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.

Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
Post by Rune Allnor
In addition there are some books like Meyer's Effective C++
books, Dewhurst's books on gotchas and common knowledge, as
well as more specialized books on the STL, templates and
efficient C++.
None of these mention the speed/system portability/user locales
trade-offs involved when choosing between text and binary file
formats.
Please note that all the books you mention are concerned with C++ or, in
my first book, the basics of elementary programming. The choice between
binary or text format data files has _nothing_ to do with C++. It is a
program design issue and every language would have the same issues.

A good introductory programming course aimed at people who are
considering software development as a career choice would cover issues
of data format and at least mention that one criterion for choice would
be performance.

It would only be a specific C++ issue if C++ was inherently inefficient
at text-based data storage. It isn't, though the 'out of the box' library
facilities are often an order of magnitude less efficient than what can
be developed for special purposes.

Disk speed is so slow compared with CPU speeds that single element
read/writes are dominated by such things as latency issues. Those issues
are usually masked by use of various caching strategies, by the program
runtime, the OS and the hardware.
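For illustration, one way the per-element overhead can be avoided when the
data are already contiguous in memory; a minimal sketch, assuming a
non-empty std::vector<double> and a stream opened in binary mode:

<start code>
#include <cstddef>
#include <fstream>
#include <vector>

// One large write: the stream, OS and hardware buffering get to work
// on a single big chunk.
void write_block(std::ofstream& os, const std::vector<double>& v)
{
    os.write(reinterpret_cast<const char*>(&v[0]),
             static_cast<std::streamsize>(v.size() * sizeof(double)));
}

// One call per element: every element pays the per-operation cost of the
// stream before any disk activity is even involved.
void write_per_element(std::ofstream& os, const std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        os.write(reinterpret_cast<const char*>(&v[i]), sizeof(double));
}
<end code>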

I will repeat what I have written before, but more explicitly:

Any program designer of tools for processing large quantities of
numerical data who is unfamiliar with the caching, conversion issues
etc. is unqualified.

Lastly on a matter of personal relationships. When I have someone who
has entirely missed a vital issue I try to apply a bit of tact. In the
case in point I would not tell anyone to use binary formatted data.
Instead I would say something like:

"I wonder if writing/reading the raw data might help."

If and when the person comes back to tell me that it did, I would
respond something along the lines of "I must remember that for future use"

There are so many other design issues that should have been considered
that I think that whether or not introductory programming courses
introduce the costs of using text data files for storing numerical data
is of very little importance. However a course on application design
should certainly cover such issues.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-16 14:27:23 UTC
Permalink
On 15 Nov, 20:31, Francis Glassborow
Post by Francis Glassborow
Post by Rune Allnor
Tradeoffs in general, yes. No text-book I am aware of
on C++ mentions the trade-offs between text and binary
file formats. The C++ books in my bookshelf right now
- Stroustrup: "The C++ Programming Language"
- Stroustrup: "Programming"
- Glassborow: "You can do it!"
- Glassborow: "You can program in C++"
- Koenig & Moo: "Accelerated C++"
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary.
Then don't mention performance. Portability between character
encodings and locales vs portability between architectures might
be more interesting.
Post by Francis Glassborow
And what is that book doing on your shelf?
Taking up space.

Seriously, I bought it because I found that I needed
a freshman's eye on C++. At the time when I bought it,
I had returned to C++ after having stayed away for a
long time, only to find that the language had transformed
almost beyond recognition.

And there is the chance that I might teach or supervise
programming students. Your books are ideal starting points
for people who want to learn C++. Knowing that such books
exist at all, is half the job done.
Post by Francis Glassborow
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.
Don't assume anything. A large and growing fraction of present
'programmers' have matlab as their first - and all too often,
only - programming language.

Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."

Both statements are true, but contemporary matlab is powerful
enough that one can do almost anything with it. Not efficiently,
not easily, but just about anything one can do with software
can be done by matlab. So no one 'needs' to learn anything else
than matlab. Until they hit a brick wall, that is.

To give an idea about what one is up against:

For historical reasons, it is 'common knowledge' among matlab
users that fundamental algorithm primitives like FOR and WHILE
loops, SWITCH-CASE statements, conditional tests and so on, are
'evil'. Matlab is an interpreted language, and the matlab
interpreter seems to have had a bug that caused these kinds of
constructs to take orders of magnitude longer than necessary.
The bug was corrected a few years ago, but the damage was
already done: Decades of conditioning had caused matlab
users to consider nuts'n bolts programming constructs as
'evil.'

Such attitudes have been transferred to the majority of
'programmers' that leave the educational institutions
these days.
Post by Francis Glassborow
Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
It's still worth mentioning in a book the size of Stroustrup's
"Programming" (>1200 pages), if only in an appendix or comment
box. In a recent book,

http://www.amazon.com/Masterminds-Programming-Conversations-Creators-Languages/dp/0596515170/ref=sr_1_12?ie=UTF8&s=books&qid=1258369784&sr=8-12

Stroustrup comments that "The [initial C++ language design] was
something that was extremely expressive and flexible, yet ran
at a speed that challenged assembler..." (lower half of page 2,
available as preview at the page linked above).

But again, speed might not be the best reason to mention this
for the novice programmer. Locales and character encodings
might be more appropriate motivations.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Neil Butterworth
2009-11-16 21:04:17 UTC
Permalink
Post by Rune Allnor
Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."
I taught computing on various courses in two British Universities in the
early 1980s, and I can assure you that there was little "formal training" in
programming back then either. Also, Matlab was heavily used back then -
it is not some new phenomenon, as you seem to think. Once again, you
seem to be setting up straw men so you can knock them down, and once
again your post has minimal C++ content.

Neil Butterworth
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-17 02:17:52 UTC
Permalink
Post by Rune Allnor
On 15 Nov, 20:31, Francis Glassborow
Post by Francis Glassborow
I am simply flabbergasted. "You Can Do It!" target readership would be
completely confused by mention of performance issues between text and
binary.
Then don't mention performance. Portability between character
encodings and locales vs portability between architectures might
be more interesting.
Portability is clearly not the biggest problem a beginner faces. In
fact one could argue it's better to let a beginner get bitten the
first time so the lesson is clear and meaningful.
Post by Rune Allnor
Post by Francis Glassborow
"You Can Program in C++" assumes that the reader already knows about
programming. I must apologise for assuming that a programmer
instinctively realises that turning data into text and back again will
take time.
Don't assume anything. A large and growing fraction of present
'programmers' have matlab as their first - and all too often,
only - programming language.
Every "good" writer is forced to define or assume a target audience as
part of the writing process. When the assumptions are spelled out
clearly up front the burden shifts to the reader.
Post by Rune Allnor
Having spent a couple of decades in and around universities, it
seems that 'proper' programming training for the masses is a thing
of the past. Students use matlab. Students are exposed to matlab
during 'domain' classes, are expected to pick up matlab on their
own, and never receive formal training: "Matlab is easy to learn."
"Matlab is not a proper programming language, and thus not worth
teaching systematically."
Both statements are true, but contemporary matlab is powerful
enough that one can do almost anything with it. Not efficiently,
not easily, but just about anything one can do with software
can be done by matlab. So no one 'needs' to learn anything else
than matlab. Until they hit a brick wall, that is.
For historical reasons, it is 'common knowledge' among matlab
users that fundamental algorithm primitives like FOR and WHILE
loops, SWITCH-CASE statements, conditional tests and so on, are
'evil'. Matlab is an interpreted language, and the matlab
interpreter seems to have had a bug that caused these kinds of
constructs to take orders of magnitude longer than necessary.
The bug was corrected a few years ago, but the damage was
already done: Decades of conditioning had caused matlab
users to consider nuts'n bolts programming constructs as
'evil.'
Such attitudes have been transferred to the majority of
'programmers' that leave the educational institutions
these days.
Once again with the Matlab discussion in C++. First I'll address a
point made earlier in the thread where you suggested that Matlab was
written in Java. Matlab at its core is basically some scripting
around old Fortran libraries for dealing with matrices and linear
algebra. These days some of the GUI stuff is done with Java but the
actual language is not.

Next you mention loops in Matlab. This is not historical; it's simply
the nature of the linear algebra library. Loops are a terrible way to
do matrix math; much more efficient algorithms exist. Hence the best
advice is to avoid looping over vectors or matrices. This has nothing
to do with non matrix operations, but it is commonly misunderstood by
many.

I doubt that many programmers get their first programming experience
in Matlab. It is nearly universal for Engineering schools
and some other Science departments but even there other intro
programming courses are required. Nearly all of those same students
will also be exposed to Mathematica and its programming. They will be
exposed to a variety of programming experiences. They should gain an
appreciation that programming context matters.

Most programmers don't see their first programming experience in a
classroom and I doubt the number of working programmers with
engineering degrees is growing much. Your claim about the number of
people who have been confused by Matlab is hard to swallow. You seem
to be having trouble accepting that your little view of programming
based on where you work today is very different from the vast majority
of the programming universe. Because everyone in your office knows
Matlab doesn't extrapolate well.

Any student who fails to see the difference between programming Matlab
vs C++ or Java isn't really cut out for programming and will certainly
have a long road before becoming a productive competent programmer.

I really don't see any relationship between Matlab and C++, which is
the on topic context of this newsgroup. The two are more different
than alike and nearly any conclusion drawn from one will not apply to
the other. As for I/O you are stuck with the provided library with
Matlab, while C++ is capable of doing low-level hardware stuff if
necessary: horses for courses.
Post by Rune Allnor
Post by Francis Glassborow
Again, I am sure that both Bjarne Stroustrup and Andy Koenig felt that
performance issues were inappropriate in the context of what they were
writing.
It's still worth mentioning in a book the size of Stroustrup's
"Programming" (>1200 pages), if only in an appendix or comment
box. In a recent book,
http://www.amazon.com/Masterminds-Programming-Conversations-Creators-Languages/dp/0596515170/ref=sr_1_12?ie=UTF8&s=books&qid=1258369784&sr=8-12
Stroustrup comments that "The [initial C++ language design] was
something that was extremely expressive and flexible, yet ran
at a speed that challenged assembler..." (lower half of page 2,
available as preview at the page linked above).
But again, speed might not be the best reason to mention this
for the novice programmer. Locales and character encodings
might be more appropriate motivations.
The context of this thread was about binary vs text I/O operations. If
speed is a problem then you won't be using the standard libraries and
locales and character sets shouldn't be a show stopper. Even if you
use standard libraries nothing forces you to use the system
locale. You can always change the locale to meet your needs. The issue
of portability is moot since we are comparing to binary. You have to
define your output explicitly in either case.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-15 19:55:01 UTC
Permalink
Post by Rune Allnor
(... snipsnipsnip ...)
I'll allow that we're mixing simple technical books in with
textbooks. Textbooks in particular are targeted at survey type
courses where the coverage is very broad and general. In that context
text almost always comes out ahead.
Sure. But students should know about the alternatives and
trade-offs as early as possible. The sad fact is that it
is not the skilled programmers that make the strategic
decisions among users: In large projects that use (as opposed
to develop) software and where software configurations might
prove to be bottlenecks, domain specialists will be making the
decisions.
These people can at most be expected to have one intro class
of programming (...)
Get the two specialized performance issues of text/binary
file formats and parallelization at least mentioned in the
intro books, (...)
I agree that *technical* decisions are often made by people who do not
have the necessary technical insight. I fear this is a very simple truth
in every industry.
Given your angle on the text vs. binary problem, maybe one truth could
be learned for programming books, especially for C++, which is a
language that tends to be chosen when performance is important:
*If* an author mentions the benefits of text files like portability,
human readability etc. it should be only fair to also mention the trade
offs involved, namely that non-portable formats or non-human-friendly
formats can be an order of magnitude faster.

However, I'm not so naive to think this would better the situation
significantly. For example: I'm working on a project where the
underlying file format *is* binary, but it's so freaky complicated that
a simple text base format would be just as fast.
Post by Rune Allnor
(...)
In a specialized field, portability is less likely to be a
concern. The playing field is different than for a more general
purpose application where hard experience teaches that failure to
account for maintenance and portability leads to Bad Things (tm).
That argument might make sense to somebody who lives and
works in an environment where 100% of communications are in
English. Once you work where different linguistic locales
are in use, its validity might not be quite as obvious.
Are you sure you can trust that the dot (ASCII character 46)
is always the decimal separator sign? (...)
Once you start addressing these things, you all of a
sudden need some data input validation step, using regular
expressions and so on, to ensure that the numbers in
the text file are given on a format your parser can handle.
(...)
Once such arguments are taken into account, time penalties
using text files sky-rocket.
This is not a problem of text format per se. One could just go with a
specified locale (e.g. English) on every platform. If the programmer
naively uses locale dependent functions, it is a bug like every other.
When using binary formats, the programmer has to know how different
binary encodings will affect the processing of his files. When using
text formats, the programmer has to know how different text formatting
rules affect the processing of his files.
(I know that too many programmers are not aware of locale implications,
but I think there are also too many who do not even know that there is
such a thing as different floating point representations.)

br,
Martin
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Edward Rosten
2009-11-17 14:01:45 UTC
Permalink
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
The GUI is Java based. The guts are written in C++--the executable has
C++ symbols in it. Given that this is about a test of the I/O
performance of a large, popular C++ program, I would claim that
the results are very relevant to the I/O performance of C++.

-Ed


--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-18 02:16:38 UTC
Permalink
Post by Edward Rosten
Post by Rune Allnor
Post by stan
Is your argument that because MATLAB's I/O is slower for your test
that all applications must be similar?
Matlab is, as I understand it, based on Java, so the
problem might be one with Java's IO facilities.
The GUI is Java based. The guts are written in C++--the executable has
C++ symbols in it. Given that this is about a test of the I/O
performance of a large, popular C++ program, I would claim that
the results are very relevant to the I/O performance of C++.
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terrible performance in any language.

Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.

dogs run fast
dogs eat dog food
All fast runners eat dog food.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Edward Rosten
2009-11-18 13:51:22 UTC
Permalink
Post by stan
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terribly performance in any language.
One could take that to mean that the builtin I/O facilities are
terrible.
Post by stan
Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.
dogs run fast
dogs eat dog food
All fast runners eat dog food.
I understand your point that one can't generalize from one C++ program
to all C++ programs. Of course taken to the extreme, that makes all
benchmarks, and indeed this entire thread pointless. Let me rephrase:

Rune Allnor posted a benchmark indicating slow text I/O in C++ and
posted the program.
Rune Allnor then posted another benchmark indicating slow text I/O in
a large, popular and generally well-respected C++ program.
People have then posted various microbenchmarks indicating that as
expected, text has a serialization penalty.

It is true that we cannot generalize from those three examples to
assume that text is slower than binary. But I also know it to be true
having worked with images in both text-based and binary formats in
C++.

-Ed
--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
stan
2009-11-19 01:04:51 UTC
Permalink
Post by Edward Rosten
Post by stan
If C++ limited you to supplied libraries like Java, then you might
have a point. This case indicates that Matlab doesn't consider I/O
performance a top priority. I can write applications that have
terrible performance in any language.
One could take that to mean that the builtin I/O facilities are
terrible.
A more rational and less biased statement might be that they are not
optimized for every possible case. No standard general-purpose library
is going to beat a well-designed domain-specific solution. If
performance is REALLY critical, you can't neglect adding hardware to
meet the challenge.
Post by Edward Rosten
Post by stan
Matlab has slow text I/O
Matlab uses C++
C++ has slow text I/O.
dogs run fast
dogs eat dog food
All fast runners eat dog food.
I understand your point that one can't generalize from one C++ program
to all C++ programs. Of course taken to the extreme, that makes all
It's not really an extreme. One thing doesn't follow the other.
Post by Edward Rosten
Rune Allnor posted a benchmark indicating slow text I/O in C++ and
posted the program.
Rune Allnor then posted another benchmark indicating slow text I/O in
a large, popular and generally well-respected C++ program.
People have then posted various microbenchmarks indicating that as
expected, text has a serialization penalty.
Benchmarks are hard, and the trivialized examples offered have indeed
been pointless. Matlab is NOT highly respected for its I/O
capability. The general-purpose library I/O is basically targeted at
allowing conversion from an external data format into a human-readable
display, including i18n and other issues. The layers involved are
non-trivial and clearly are not a realistic solution to specific
numerical I/O problems of converting from external data storage into
internal processing representations.

I don't claim there is no penalty, but I don't accept the 100x claims
either. Programming is about compromises and trade-offs. Any
performance-driven design decisions will ultimately turn on
context-specific benchmarks and not dicey rules of thumb, particularly
when I/O is involved.
Post by Edward Rosten
It is true that we cannot generalize from those three examples to
assume that text is slower than binary. But I also know it to be true
having worked with images in both text based and binary formats in C+
+.
I don't claim binary is always a bad choice. I don't really doubt
that for you and Rune, in your shops and your projects it really is
the correct decision. I do question that binary is inherently superior
to text. Context is vital here. In the general case and especially
during development, text offers advantages that can't be casually
disregarded because of poor reasoning; such as claiming that since
Matlab is slow text should be avoided.

It's not unlike a mechanic who refuses to use Phillips-head
screwdrivers because his slotted screwdrivers have better handles. I'm
claiming it's wrong to rush to judgement. Competence demands that one
become very familiar with the tools and use the appropriate tool for
the job at hand.

Claims that binary is better or text is better are like claiming that
an apple is better than a fountain pen.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Seungbeom Kim
2009-11-11 05:18:01 UTC
Permalink
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data [...]
2) The speed penalty imposed by using text formats for numerical
data [...]
3) The speed penalty imposed by using text formats for numerical
data [...]
4) The speed penalty imposed by using text formats for numerical
data [...]
5) The speed penalty imposed by using text formats for numerical
data [...]
6) The speed penalty imposed by using text formats for numerical
data [...]
As you have stated, the points you have made so far relate only to
applications dealing heavily with numerical data. Then why are you
accusing general C++ language book authors and Usenet participants?
Take a numerical processing book, or go to a numerical processing
group, and if it's said there that textual formats should be preferred,
then that's where you should make accusations and claims. (And by the
way, choosing data formats is only barely on-topic for a "C++ language"
textbook or forum.)

I'm not saying that you should not discuss such matters elsewhere,
such as here. It's just that you keep emphasizing your particular
application area and its special needs, while others are stating
more common cases and more general choices, and NOT refuting your
points in your area in particular. You and others are going parallel
to each other and not making any more progress. Your repeated claims
here would make sense only if you were arguing that binary formats
should be preferred *by default* in most, if not all, areas -- that's
where you could refute others' claims -- but you have clearly stated
above that your points are meant only for numerical applications.
This is why I said I didn't understand your point: are you just
explaining your situation, or trying to change others' opinions?
--
Seungbeom Kim

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-11 19:26:51 UTC
Permalink
Post by Seungbeom Kim
Post by Rune Allnor
Post by Seungbeom Kim
Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical
data [...]
2) The speed penalty imposed by using text formats for numerical
data [...]
3) The speed penalty imposed by using text formats for numerical
data [...]
4) The speed penalty imposed by using text formats for numerical
data [...]
5) The speed penalty imposed by using text formats for numerical
data [...]
6) The speed penalty imposed by using text formats for numerical
data [...]
As you have stated, the points you have made so far relate only to
applications dealing heavily with numerical data. Then why are you
accusing general C++ language book authors and Usenet participants?
Because C++ is considered by most (although maybe not by the
regulars here) as a language for applications where high
efficiency is paramount. Even so, how and when to handle binary
files through C++ is not treated in any of the learning material
I have seen.
Post by Seungbeom Kim
Take a numerical processing book, or go to a numerical processing
group, and if it's said there that textual formats should be preferred,
then that's where you should make accusations and claims. (And by the
way, choosing data formats is only barely on-topic for a "C++ language"
textbook or forum.)
Yesterday I summarized in another post a number of responses
I have received on these questions over the past couple of
weeks:

http://groups.google.no/group/comp.lang.c++.moderated/msg/9421be7a5c6189be

The general impression is that the C++ community, as it comes
across in this and other USENET groups as well as in the textbooks,
is largely unfamiliar with using binary file formats at all.
Post by Seungbeom Kim
I'm not saying that you should not discuss such matters elsewhere,
such as here. It's just that you keep emphasizing your particular
application area and its special needs, while others are stating
more common cases and more general choices, and NOT refuting your
points in your area in particular.
Correct. But there is a general trend towards dismissing
the stated problem as irrelevant, or my solution as a
misconception. Again, review the opinions expressed in the
post I refer to above.
Post by Seungbeom Kim
You and others are going parallel
to each other and not making any more progress. Your repeated claims
here would make sense only if you were arguing that binary formats
should be preferred *by default* in most, if not all, areas -- that's
where you could refute others' claims -- but you have clearly stated
above that your points are meant only for numerical applications.
That's where the problem is easy to find, because that's where it
shows up first. These things don't show up unless the file sizes are
5-10 MBytes or more; only then do the delays associated with data
loading become noticeable to human users.

Once e.g. XML files (which need to be parsed, which takes at least
as much time as merely converting the numbers) start reaching those
kinds of sizes, text-based formats will become annoying in other
applications as well.
Post by Seungbeom Kim
This is why I said I didn't understand your point: are you just
explaining your situation, or trying to change others' opinions?
I am trying to make influential people here - both regular posters
on c.l.c++.m and textbook authors who might be lurking - aware of
the problem. Only when the teachers start addressing a problem
will it be reasonable to expect students to know.

I have posted numbers to demonstrate what I am talking about
on a number of occasions, e.g.

http://groups.google.no/group/comp.lang.c++.moderated/msg/2863e5d312a93f97

The common first reaction is that "This is not C++, so this
is irrelevant!" Then all the arguments we have seen in this
and recent threads appear; that

- There are faster ways than operator>> and operator<<
- The algorithm must be wrong
- My measurements are not 'exact'

and so on.

The fact is that very few people even suspected that these
numbers differ by orders of magnitude. The numbers referred to
above, while not C++, are representative of the delays and
bottlenecks where I work. We can quarrel about reducing the
absolute numbers by a factor of 3 or maybe 5 by using efficient
parsers, but that requires rewriting the file parsers in
every single program already in use out there. As well as
educating the programmers etc.

The net effect is far larger if one spends the same effort
on educating the same programmers and designers about binary
files.

But in order to do that, one needs to educate the educators.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Nick Hounsome
2009-11-10 12:56:02 UTC
Permalink
Post by Rune Allnor
Post by Nick Hounsome
The text book authors are writing for the 99% not the 1% so they are
not going to change.
I am working among the 1%. I have seen companies lose business
because of poorly performing software that went undetected. That
is, the people whose job it was to know did not know about the
major performance issues.
In one company, which did 24/7 survey jobs and stored the data
in text format, merely reading 24 hrs worth of data from text-
formatted files imposed some 3-5 hrs of idle time on the human
operators. It wouldn't have been a big deal if those 3-5 hrs were
concentrated in one block (the operators in question could have had
a long break if they were), but these 3-5 hrs were interspersed
throughout the process, tying the operators down in front of their terminals.
But how much of that 3-5 hours was reading and how much was processing
the data?
I can't imagine a process in which the reading was the bulk of that
time.
Post by Rune Allnor
From a human standpoint, there are several time scales. Most
(all?) readers of this newsgroup are computer programmers, so
they know what it means to be in 'The Zone', where time just
flies and work gets done.
Now, if you can get a job done with operator idle time of less
than a second, the operator can stretch his neck, yawn, have
a sip of coffee, and remain in 'The Zone' afterwards.
If the idle time is a couple of seconds, the waiting time
starts to become noticeable and thus annoying. If the waiting
time becomes ten seconds or more, an operator already in 'The
Zone' is yanked out of 'The Zone'. If ten seconds of operator
idle time is commonplace in the application, the operator never
reaches 'The Zone' in the first place.
Once we start talking about minutes of operator idle time,
operators go away to have a cup of coffee, read the newspaper,
surf the net, flirt with the 20-year-old blonde at the
switchboard - whatever. Once that happens, productivity numbers
reach the point where companies go out of business.
I agree with everything you say here.
It's just that my experience is that you can't reduce minutes to
seconds unless the fundamental design is bad.
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with
the wrong performance issues.
No. The performance issues I worry about are the ones that
kick users out of business. No one cares if 15 seconds or 50
seconds is the most representative number for reading 100 MBytes
of text-formatted numeric data, when the same amount of binary-
formatted data can easily be loaded in 0.3 seconds.
You're mixing reading and loading. Loading = reading and processing.
Processing is the biggest user of time in almost all systems otherwise
they aren't doing anything useful.
If I take your figures at face value you can't be doing anything with
the data.
Post by Rune Allnor
These details have a profound impact where I work. The only
reason this is not recognized is the omnipresent misconception
that the extra time spent working with text-formatted numeric data
is insignificant.
But you are undermining your own argument.
The people writing on this thread would hardly be saying that it was
insignificant if it had cost them their jobs; therefore it hasn't cost
them their jobs; therefore it IS insignificant in all the projects that
they've worked on. In other words, it IS insignificant for MOST people
MOST of the time, just as I said.
Post by Rune Allnor
People who know their programming craft would know that one uses
binary data formats for numeric data by default, and only deviates
towards text-based formats where one can get away with them
(file sizes less than about 5-10 MBytes).
You have it backwards.

One uses text formats by default and only deviates towards binary
formats when you know that you have a problem and have demonstrated
that binary formats will solve it.

Either that or you are right and Meyers, Stroustrup, all the C++ book
writers and all the designers of the C and C++ I/O libraries are
wrong.

P.S. As I think I already mentioned: if you really want the ultimate
in speed, then use memory mapping.
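
(A minimal editorial sketch of the memory-mapping approach, using POSIX
mmap; it is not code from the thread, and "mydata.bin" is assumed to hold
raw native-endian doubles. On Windows one would use
CreateFileMapping/MapViewOfFile instead.)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const char* path = "mydata.bin";   // hypothetical file of raw doubles
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) != 0 || sb.st_size == 0) { close(fd); return 1; }

    // Map the whole file; the OS pages data in on demand, no read() copies.
    void* p = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const double* data = static_cast<const double*>(p);
    const size_t n = sb.st_size / sizeof(double);

    double sum = 0.0;                  // touch the data so pages get faulted in
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    std::printf("%lu doubles, sum = %g\n", (unsigned long)n, sum);

    munmap(p, sb.st_size);
    close(fd);
    return 0;
}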
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Rune Allnor
2009-11-11 05:16:26 UTC
Permalink
Post by Nick Hounsome
Post by Rune Allnor
Post by Nick Hounsome
The text book authors are writing for the 99% not the 1% so they are
not going to change.
I am working among the 1%. I have seen companies lose business
because of poorly performing software that went undetected. That
is, the people whose job it was to know did not know about the
major performance issues.
In one company, which did 24/7 survey jobs and stored the data
in text format, merely reading 24 hrs worth of data from text-
formatted files imposed some 3-5 hrs of idle time on the human
operators. It wouldn't have been a big deal if those 3-5 hrs were
concentrated in one block (the operators in question could have had
a long break if they were), but these 3-5 hrs were interspersed
throughout the process, tying the operators down in front of their terminals.
But how much of that 3-5 hours was reading and how much was processing
the data?
I can't imagine a process in which the reading was the bulk of that
time.
I don't remember the details (this was several years ago), but
the gross numbers were more or less that the process in question
produced some 200 files on the order of 100 MBytes/file during
24 hours of survey. Each file needed to pass through two or three
read/write cycles during processing. That's about 600 read/writes
at some 30 seconds of operator idle time each, i.e. about 18000
seconds. That's 5 hours wasted, right there. With a 24-hr deadline,
that hurts.
Post by Nick Hounsome
Post by Rune Allnor
From a human standpoint, there are several time scales. Most
(all?) readers of this newsgroup are computer programmers, so
they know what it means to be in 'The Zone', where time just
flies and work gets done.
Now, if you can get a job done with operator idle time of less
than a second, the operator can stretch his neck, yawn, have
a sip of coffee, and remain in 'The Zone' afterwards.
If the idle time is a couple of seconds, the waiting time
starts to become noticeable and thus annoying. If the waiting
time becomes ten seconds or more, an operator already in 'The
Zone' is yanked out of 'The Zone'. If ten seconds of operator
idle time is commonplace in the application, the operator never
reaches 'The Zone' in the first place.
Once we start talking about minutes of operator idle time,
operators go away to have a cup of coffee, read the newspaper,
surf the net, flirt with the 20-year-old blonde at the
switchboard - whatever. Once that happens, productivity numbers
reach the point where companies go out of business.
I agree with everything you say here.
It's just that my experience is that you can't reduce minutes to
seconds unless the fundamental design is bad.
The fundamentally bad design is to use text-formatted files.
I posted a demo I made in Matlab, which is an increasingly
popular language for these kinds of things, in a reply to
Glassborow. Look at the numbers there - binary files are
100-200x faster than text formats.
Post by Nick Hounsome
Post by Rune Allnor
Post by Nick Hounsome
I enjoy your posts, Rune, but IMHO you really do get carried away with
the wrong performance issues.
No. The performance issues I worry about are the ones that
kick users out of business. No one cares if 15 seconds or 50
seconds is the most representative number for reading 100 MBytes
of text-formatted numeric data, when the same amount of binary-
formatted data can easily be loaded in 0.3 seconds.
You're mixing reading and loading. Loading = reading and processing.
Processing is the biggest user of time in almost all systems otherwise
they aren't doing anything useful.
If I take your figures at face value you can't be doing anything with
the data.
I only have so much time to get the job done. We can argue
over semantics till the cows come home - the deadlines stand
whether I am 'reading' or 'loading' the data.
Post by Nick Hounsome
Post by Rune Allnor
These details have a profound impact where I work. The only
reason this is not recognized is the omnipresent misconception
that the extra time spent working with text-formatted numeric data
is insignificant.
But you are undermining your own argument.
The people writing on this thread would hardly be saying that it was
insignificant if it had cost them their jobs; therefore it hasn't cost
them their jobs; therefore it IS insignificant in all the projects that
they've worked on. In other words, it IS insignificant for MOST people
MOST of the time, just as I said.
Well, the *intention* of what people write might be benign,
but that's not how it comes across. Some of the reactions I
have received in similar threads here and on comp.lang.c++:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6
Post by Nick Hounsome
1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.

Paying attention to speed "borders on fraud."

http://groups.google.no/group/comp.lang.c++.moderated/msg/555d4053471fd368

[RA] > 2) Text-formatted numerical data take significantly longer to
read and write than binary formats.
[NH] Again - Never in my experience.

http://groups.google.no/group/comp.lang.c++.moderated/msg/eed5649d9ba5c2cb

[FG] This is a classic example of the 'speed at any cost' school of
thought.

http://groups.google.no/group/comp.lang.c++.moderated/msg/8aec2b00e7ab7ee5

[NH] You seem to be obsessed with speed

From such excerpts I can only conclude that most people are
oblivious to the problem and its implications.
Post by Nick Hounsome
Post by Rune Allnor
People who know their programming craft would know that one uses
binary data formats for numeric data by default, and only deviates
towards text-based formats where one can get away with them
(file sizes less than about 5-10 MBytes).
You have it backwards.
One uses text formats by default and only deviates towards binary
formats when you know that you have a problem and have demonstrated
that binary formats will solve it.
Either that or you are right and Meyers, Stroustrup, all the C++ book
writers and all the designers of the C and C++ I/O libraries are
wrong.
No. The authors you list aren't wrong. In order to be wrong,
one must make a statement that can be proved or demonstrated
to be false.

The fact is that except for Stroustrup mentioning ios_base::binary
more or less in passing in his chapter on file streams, none of the
C++ textbooks I have seen mention the issue at all.

Rune
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Ulrich Eckhardt
2009-11-11 19:23:55 UTC
Permalink
Rune Allnor wrote:
[ about "binary" files ]
Post by Rune Allnor
The fact is that except for Stroustrup mentioning ios_base::binary
more or less in passing in his chapter on file streams, none of the
C++ textbooks I have seen mention the issue at all.
ios_base::binary does something completely different.
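
(Editorial clarification sketch, not part of Uli's post: std::ios::binary
only suppresses newline/end-of-file translation; it does not by itself give
you a "binary format". To store the raw object representation you still use
the unformatted write()/read() members, as in this minimal example. File
names are arbitrary, and the raw layout is native and non-portable.)

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    std::vector<double> v(3, 3.14);

    // Formatted output: decimal text, even though the stream is in binary mode.
    std::ofstream txt("as_text.dat", std::ios::binary);
    for (std::size_t i = 0; i < v.size(); ++i)
        txt << v[i] << '\n';

    // Unformatted output: the raw 8-byte object representation of each double.
    std::ofstream bin("as_raw.dat", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(&v[0]),
              static_cast<std::streamsize>(v.size() * sizeof(double)));
    return 0;
}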

Uli
--
Sator Laser GmbH
Managing Director (Geschäftsführer): Thorsten Föcking, Amtsgericht Hamburg HR B62 932


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-09 19:57:25 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
about binary file IO. Over the course of the discussion, I discovered to my
amazement - and, quite frankly, horror - that there seems to be a school of
thought that text-based storage formats are universally preferable to binary text
formats for reasons of portability and human readability.
This is a classic example of the 'speed at any cost' school of thought.
The trouble with binary formats is that they are not portable (sometimes
to the extent of not being portable between releases of the same program
-- I once met a young programmer who was going through the agony of
having used binary data formats which could no longer be read correctly
by the second release of his program, much to the dismay of his customers).
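
(Editorial sketch of one common way to blunt that portability problem,
not something from Francis's post: give the binary file a small, explicit,
versioned header with fixed-width fields, so a reader can at least detect
a layout change instead of silently misreading the data. The magic string,
version number and field names here are invented for illustration; the
doubles are still written in native IEEE 754/endianness, so byte-swapping
would be needed for cross-architecture portability.)

#include <cstdio>
#include <cstring>
#include <vector>
#include <stdint.h>   // <cstdint> in C++11; older MSVC needs a substitute

struct FileHeader {
    char     magic[8];    // identifies the producer, e.g. "SURVEY\0\0"
    uint32_t version;     // bump this whenever the record layout changes
    uint32_t reserved;
    uint64_t count;       // number of doubles that follow the header
};

bool write_doubles(const char* path, const std::vector<double>& v)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;

    FileHeader h;
    std::memset(&h, 0, sizeof h);
    std::memcpy(h.magic, "SURVEY\0", 8);
    h.version = 1;
    h.count   = v.size();

    bool ok = std::fwrite(&h, sizeof h, 1, f) == 1;
    if (ok && !v.empty())
        ok = std::fwrite(&v[0], sizeof(double), v.size(), f) == v.size();
    std::fclose(f);
    return ok;
}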

The space overhead is hardly important when even simple desktop machines
can have over a terabyte of disk storage at less than it used to cost
(in 1979) to buy a couple of boxes of single-density 5.25" floppy disks.

Speed could be an issue in some cases, but measure first before choosing
to optimise. I rarely use binary files other than for scratch files
during a single execution of a program (where they are fine as long as
the program does not crash and burn).
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
REH
2009-11-09 19:58:06 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
about binary file IO. Over the course of the discussion, I discovered to my
amazement - and, quite frankly, horror - that there seems to be a school of
thought that text-based storage formats are universally preferable to binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that
1) Binary files are about 70-20% of the file size of the text files, depending
on the number of significant digits stored in the text files and other
formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and write
than binary formats.
Metrics only matter when they matter. Are the larger file size or
the increased processing time actual issues? If you are not constrained
for space or time, and the text files have value-add (e.g., convenience,
portability, etc.), then use them. Bigger and slower do not always
equate to bad.

REH
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-12 00:59:13 UTC
Permalink
Post by Rune Allnor
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two
details that
1) Binary files are about 70-20% of the file size of the text files,
depending
on the number of significant digits stored in the text files and
other
formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and
write
than binary formats.
Timings are difficult to compare, since the exact numbers depend on
buffering
strategies, buffer sizes, disk speeds, network bandwidths and so on.
I have therefore sketched a 'distilled' test (code below) to test what
overheads
are involved with formatting numerical data back and forth between
(...)
To everything that has been said in other replies I would like to add a
small test sample of mine.
It's as basic as I thought I could get and just as inaccurate and
insufficient as every other single test. (Code follows at the end)

-- Run 1 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 279 ms
ASCII write = 2048 ms (Factor 7.3)
Binary read = 211 ms
ASCII read = 1283 ms (Factor 6.1)

-- Run 2 -- (1e7)
build type: RELEASE
generate data (10000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 11329 ms
ASCII write = 20014 ms (Factor 1.8)
Binary read = 2252 ms
ASCII read = 12922 ms (Factor 5.7)

-- Run 3 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 313 ms
ASCII write = 1911 ms (Factor 6.1)
Binary read = 212 ms
ASCII read = 1277 ms (Factor 6.0)


So what gives?
Binary is unsurprisingly faster.
Apparently somewhere between a factor of 5 and 10 on my box here. (And if
something else is going on, such as in Run 2, it may not even be that
much faster.)

Bottom line for me:
1.) Binary *is* definitely faster.
2.) The difference is a small (single-digit) factor.
3.) *If* you need that speed, use binary.

br,
Martin

### CODE ###

// Missing pieces added so the sample compiles as posted (Windows; link winmm.lib):
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdexcept>
#include <vector>
#include <windows.h>   // DWORD
#include <mmsystem.h>  // timeGetTime()

typedef std::vector<double> dvec_t;
typedef DWORD tdiff_t;

// The helpers are defined below main(), so declare them here.
tdiff_t now();
tdiff_t write_binary(dvec_t const& data);
tdiff_t read_binary(dvec_t & data);
tdiff_t write_ascii(dvec_t const& data);
tdiff_t read_ascii(dvec_t & data);

int main()
{
using namespace std;
srand( (unsigned)time( NULL ) );

cout << "build type: " <<
#ifndef NDEBUG
"DEBUG"
#else
"RELEASE"
#endif
<< endl;

const int n = 1e6;

cout << "generate data (" << n << " doubles) ...\n";
dvec_t outdata(n, 3.14);
for(int i=0; i<n; ++i) {
outdata[i] *= double(rand());
}

cout << "start writing ...\n";
const tdiff_t wbin = write_binary(outdata);
const tdiff_t wtxt = write_ascii(outdata);

dvec_t b_in, t_in;

cout << "start reading ...\n";
const tdiff_t rbin = read_binary(b_in);
const tdiff_t rtxt = read_ascii(t_in);

cout << "timings:\n";
cout << "Binary write = " << wbin << " ms\n";
cout << "ASCII write = " << wtxt << " ms\n";
cout << "Binary read = " << rbin << " ms\n";
cout << "ASCII read = " << rtxt << " ms\n";

// cout << "check results ...\n";
// cout << "Binary w/r == equality: " << (outdata==b_in?"yes":"no") << endl;
// cout << "ASCII w/r == equality: " << (outdata==t_in?"yes":"no") << endl;
return 0;
}


tdiff_t now()
{
return timeGetTime(); // from winmm.lib + mmsystem.h (Windows)
};

tdiff_t write_binary(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "wb");
for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fwrite(&(*i), sizeof(double), 1, f);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t read_binary(dvec_t & data)
{
data.clear();
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "rb");
if(!f)
throw std::runtime_error("No file!");
double dRead;
while(fread(&dRead, sizeof(double), 1, f) == 1) {
data.push_back(dRead);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t write_ascii(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "wb");
for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fprintf(f, "%le\n", *i);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

tdiff_t read_ascii(dvec_t & data)
{
data.clear();
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "rb");
if(!f)
throw std::runtime_error("No file!");
double dRead;

while(fscanf(f, "%le", &dRead) == 1) {
data.push_back(dRead);
}
fclose(f);
const tdiff_t stop = now();
return stop-start;
}
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Konstantin Oznobikhin
2009-11-12 13:27:07 UTC
Permalink
On 12 Nov, 03:59, "Martin B." <***@gmx.at> wrote:

[snip]
Post by Martin B.
To everything that has been said in other replies I would like to add a
small test sample of mine.
It's as basic as I thought I could get and just as inaccurate and
insufficient as every other single test. (Code follows at the end)
And being that simple and inaccurate, it is close to pointless.
Still, I found it useful to make a somewhat more accurate and
realistic test, and got these numbers:

generate data (10000000 doubles) ...
timings:
Binary write = 1555 ms
ASCII write = 13695 ms (Factor 8.80)
Binary read = 134 ms
ASCII read = 9400 ms (Factor 70.14)
check results ...
Binary w/r == equality: yes
ASCII w/r == equality: NO

What is interesting here is that the size of the text file is less than
that of the binary one (74 MB against 76 MB). Also, as you can see, the
text file doesn't contain the exact values we wrote there, so the
comparison is not entirely fair. Modifying the test further to store
exact values, I got these numbers:

generate data (10000000 doubles) ...
timings:
Binary write = 1469 ms
ASCII write = 21537 ms (Factor 14.66)
Binary read = 140 ms
ASCII read = 16318 ms (Factor 116.55)
check results ...
Binary w/r == equality: yes
ASCII w/r == equality: yes

Now the size of the text file is about twice as big as the binary one
and the difference becomes quite significant.
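
(Editorial note and sketch, not from Konstantin's post: the earlier
equality failure is what you would expect from %g's default six
significant digits, and printing 17 significant digits - enough to
round-trip any IEEE 754 double - is what roughly doubles the text file.
A tiny self-contained check, with an arbitrary test value:)

#include <cstdio>

int main()
{
    const double x = 3.14 * 12345.0;   // arbitrary value, many significant digits
    char buf[64];
    double y = 0.0, z = 0.0;

    std::sprintf(buf, "%g", x);        // default: 6 significant digits
    std::sscanf(buf, "%lf", &y);

    std::sprintf(buf, "%.17g", x);     // enough digits to round-trip a double
    std::sscanf(buf, "%lf", &z);

    std::printf("6 digits round-trip exactly: %s\n", x == y ? "yes" : "no");
    std::printf("17 digits round-trip exactly: %s\n", x == z ? "yes" : "no");
    return 0;
}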

[snip]
Post by Martin B.
So what gives?
Binary is unsurprisingly faster.
Apparently somewhere between factor 5 and 10 on my box here. (And if,
something else is going on, such as in Run2, it may not even be that
much faster).
As you can see, the factor can be as much as 100 (and that was only a
ten-minute optimization). And still this test shows (more or less
accurately) the difference for sequential data processing only. If you
need random data access, then a text file is probably not an option at all.
Post by Martin B.
1.) Binary *is* definitely faster.
+1
Post by Martin B.
2.) The difference is a small (single-digit) factor.
The difference can be a large (triple-digit) factor.
Post by Martin B.
3.) *If* you need that speed, use binary.
+1 again. Text files are very convenient for debugging/testing at
least, even if you'll never look at the real data.
Post by Martin B.
br,
Martin
### CODE ###
[snip]

And here is my modification of the code.

// Missing headers added so the sample compiles as posted (MSVC; link winmm.lib):
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <stdexcept>
#include <vector>
#include <tchar.h>     // _tmain / _TCHAR
#include <windows.h>   // DWORD
#include <mmsystem.h>  // timeGetTime()

using namespace std;

typedef vector<double> dvec_t;
typedef DWORD tdiff_t;

tdiff_t now()
{
return timeGetTime(); // from winmm.lib + mmsystem.h (Windows)
};

void write_binary(dvec_t const& data, FILE* f)
{
const dvec_t::size_type size = data.size();
fwrite(&size, sizeof(size), 1, f);

if (size == 0)
return;

const size_t chunk_size = 1024 * 1024 / sizeof(double);
const dvec_t::value_type *begin = &data[0];
const dvec_t::value_type *const end = &data[0] + size;
const size_t full_chunks = size / chunk_size;
for (size_t i = 0; i < full_chunks; ++i, begin += chunk_size)
{
fwrite(begin, chunk_size * sizeof(double), 1, f);
}

if (begin < end)
fwrite(begin, (end - begin) * sizeof(double), 1, f);
}

tdiff_t write_binary(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "wb");
write_binary(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void read_binary(dvec_t & data, FILE* f)
{
dvec_t::size_type size;
fread(&size, sizeof(size), 1, f);

if (size == 0)
return;

data.resize(size);

const size_t chunk_size = 1024 * 1024 / sizeof(double);
dvec_t::value_type *begin = &data[0];
dvec_t::value_type *const end = &data[0] + size;
const size_t full_chunks = size / chunk_size;
for (size_t i = 0; i < full_chunks; ++i, begin += chunk_size)
{
fread(begin, chunk_size * sizeof(double), 1, f);
}

if (begin < end)
fread(begin, (end - begin) * sizeof(double), 1, f);
}

tdiff_t read_binary(dvec_t & data)
{
dvec_t().swap(data);
const tdiff_t start = now();
FILE* f = fopen("mydata.bin", "rb");
if(!f)
throw std::runtime_error("No file!");

read_binary(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void write_ascii(dvec_t const& data, FILE* f)
{
const dvec_t::size_type size = data.size();
fprintf(f, "%lu\n", size);

if (size == 0)
return;

for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
fprintf(f, "%lg\n", *i);
}
}

tdiff_t write_ascii(dvec_t const& data)
{
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "wb");
write_ascii(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

void read_ascii(dvec_t & data, FILE* f)
{
dvec_t::size_type size;
fscanf(f, "%lu", &size);

if (size == 0)
return;

data.resize(size);

dvec_t::value_type *begin = &data[0];
dvec_t::value_type *const end = &data[0] + size;
for (; begin != end; ++begin)
{
fscanf(f, "%lg", begin);
}
}

tdiff_t read_ascii(dvec_t & data)
{
dvec_t().swap(data);
const tdiff_t start = now();
FILE* f = fopen("mydata.text", "rb");
if(!f)
throw std::runtime_error("No file!");

read_ascii(data, f);
fclose(f);
const tdiff_t stop = now();
return stop-start;
}

int _tmain(int argc, _TCHAR* argv[])
{
srand( (unsigned)time( NULL ) );

cout << "build type: " <<
#ifndef NDEBUG
"DEBUG"
#else
"RELEASE"
#endif
<< endl;

const int n = 10000000;

cout << "generate data (" << n << " doubles) ...\n";
dvec_t outdata(n, 3.14);
for(int i=0; i<n; ++i) {
outdata[i] *= double(rand());
}

cout << "start writing ...\n";
const tdiff_t wbin = write_binary(outdata);
const tdiff_t wtxt = write_ascii(outdata);

dvec_t b_in, t_in;

cout << "start reading ...\n";
const tdiff_t rbin = read_binary(b_in);
const tdiff_t rtxt = read_ascii(t_in);

cout << "timings:\n";
cout << "Binary write = " << wbin << " ms\n";
cout << "ASCII write = " << wtxt << " ms\n";
cout << "Binary read = " << rbin << " ms\n";
cout << "ASCII read = " << rtxt << " ms\n";

cout << "check results ...\n";
cout << "Binary w/r == equality: " << (outdata==b_in?"yes":"no")
<< endl;
cout << "ASCII w/r == equality: " << (outdata==t_in?"yes":"no")
<< endl;

return 0;
}

--
Konstantin Oznobikhin.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Hyman Rosen
2009-11-12 18:38:31 UTC
Permalink
Post by Konstantin Oznobikhin
ASCII w/r == equality: NO
generate data (10000000 doubles) ...
ASCII w/r == equality: yes
Now the size of the text file is about twice as big as the binary one
and the difference becomes quite significant.
And you also need to make sure that the doubles you're
generating span the full range of numbers you expect in
your actual program, including very small and very large
numbers. Ideally, when you write them out as text (and I
assume this means in decimal), you should be writing them
out with the smallest number of digits which will convert
back into exactly the original number. I don't believe
there is any standardized way to do this, so there's an
added complication.

You might want to try writing your numbers out as textified
binary and see if that speeds things up - binary floating
point numbers all look like ±1.mantissa×2^±exponent, with
exceptions for 0, ±∞, and error forms (NaN). Doing it this
way will still give you a portable representation, but you
can avoid all the work of going from binary to decimal, and
that should make things faster.
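
(Editorial sketch, not from Hyman's post: where the C library supports
C99, the "%a" conversion is essentially the 'textified binary' form he
describes - a hexadecimal mantissa plus a power-of-two exponent - and it
round-trips exactly with no decimal conversion work; C++11 later added
std::hexfloat for iostreams. Printing 17 significant decimal digits with
"%.17g" also round-trips IEEE 754 doubles, though not with the minimal
digit count asked for above.)

#include <cstdio>

int main()
{
    const double x = 0.1;              // not exactly representable in binary
    char buf[64];
    double y = 0.0;

    std::sprintf(buf, "%a", x);        // e.g. 0x1.999999999999ap-4
    std::printf("hexfloat form: %s\n", buf);

    std::sscanf(buf, "%la", &y);       // read the hexfloat back
    std::printf("round-trips exactly: %s\n", x == y ? "yes" : "no");
    return 0;
}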
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Francis Glassborow
2009-11-13 20:57:08 UTC
Permalink
I have snipped all the code because it is just consuming space and the
reader can look back if they are interested.
The comment I want to make on both programs is that they use fscanf and
fprintf. Those functions have a reputation (deserved in my experience)
for poor performance. This is due, at least in part, to the fact that
those functions have to determine what type is to be written/read before
making the conversion.

As I/O is generally a slow operation, the performance hit in using
xprintf and xscanf (substitute for x as appropriate) is usually
acceptable. However, like the problem of binary versus text, when large
volumes of data are being written/read they are not the right tool.

I note also that the binary is being read and written in large chunks.
Yes, binary makes this easy (when you want data in substantial blocks),
but it is possible to read and write text in large blocks as well, as
opposed to one item at a time (with all the associated function-call
overheads).
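
(Editorial sketch of what 'text in large blocks' can look like, under
the assumption of a plain one-number-per-line format: values are
formatted into a large in-memory buffer first and the whole block is
handed to fwrite, instead of one fprintf call per value. The 1 MB block
size is an arbitrary choice.)

#include <cstdio>
#include <cstddef>
#include <vector>

void write_ascii_blocked(const std::vector<double>& data, std::FILE* f)
{
    const std::size_t block = 1 << 20;      // flush roughly every 1 MB of text
    std::vector<char> buf;
    buf.reserve(block);

    char line[64];
    for (std::size_t i = 0; i < data.size(); ++i) {
        const int len = std::sprintf(line, "%.17g\n", data[i]);
        buf.insert(buf.end(), line, line + len);
        if (buf.size() + sizeof line > block) {   // block nearly full: write it out
            std::fwrite(&buf[0], 1, buf.size(), f);
            buf.clear();
        }
    }
    if (!buf.empty())
        std::fwrite(&buf[0], 1, buf.size(), f);
}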

Once again, I emphasise that I have nothing against the use of binary
when appropriate, but it is much harder to benchmark binary v text than
most programmers seem to think.

When we program in C++ it is vital to understand what the bottlenecks
are and how to cope with them. C++ gives us the tools but it is up to
the programmer to learn to use their tools correctly. And that includes
selecting the appropriate tools for the needs of a problem domain.

The problem is much less with technical books and textbooks than it is
with the individual attitudes of too many programmers. Many programmers
lack craftsmanship (and that includes taking pride in one's work) and
have little, if any, understanding of choosing the right tool for the job.

I hasten to add that those who read and write in this newsgroup are
already an elite because they actually value discussing their work with
others and realise that they can learn from each other. But look how
small a fraction of programmers ever come near this newsgroup (or any
other newsgroup related to programming).
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Martin B.
2009-11-15 19:32:01 UTC
Permalink
Post by Francis Glassborow
(...)
The comment I want to make on both programs is that they use fscanf and
fprintf. Those functions have a reputation (deserved in my experience)
for poor performance. (...)
(...) However like the problem of binary versus text, when large
volumes of data are being written/read they are not the right tool.
(...)
When we program in C++ it is vital to understand what the bottlenecks
are and how to cope with them. C++ gives us the tools but it is up to
the programmer to learn to use their tools correctly. And that includes
selecting the appropriate tools for the needs of a problem domain.
While we are at it - xprintf may not be the fastest, but it's waaay
faster than iostreams, at least on the VS/MS implementation.
I think it's a shame that all presumably "fast" C++ functions in this
area seem to be C ones, with a C-ish interface (atof, strtod, ...)
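
(Editorial sketch of the strtod-style approach being alluded to: slurp
the whole text file into one buffer and walk it with strtod, which skips
the per-call format-string parsing that fscanf("%lg", ...) repeats for
every value. The file name and format are assumptions; how much this
gains over iostreams depends entirely on the implementation.)

#include <cstdio>
#include <cstdlib>
#include <vector>

bool read_ascii_strtod(const char* path, std::vector<double>& out)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (size < 0) { std::fclose(f); return false; }

    std::vector<char> buf(size + 1);
    const std::size_t got = std::fread(&buf[0], 1, size, f);
    std::fclose(f);
    buf[got] = '\0';                   // strtod wants a terminated string

    const char* p = &buf[0];
    for (;;) {
        char* end;
        const double v = std::strtod(p, &end);
        if (end == p) break;           // nothing left that parses as a number
        out.push_back(v);
        p = end;
    }
    return true;
}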

br,
Martin
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Jonathan Thornburg
2009-11-12 01:06:33 UTC
Permalink
Post by Rune Allnor
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two
details that
[[binary files are smaller & processing them is faster]]

I think it's safe to say that people who say that text is *always*
best have probably never dealt with multi-hundred-terabyte RAID arrays
that are 90% full. The arguments about who (of a dozen users) gets
to delete 50 terabytes in the next 12 hours can get pretty ugly....
Post by Rune Allnor
I have therefore sketched a 'distilled' test (code below) to test what
overheads
are involved with formatting numerical data back and forth between
text and
binary formats. To eliminate the impact of peripherical devices, I
have used
a std::stringstream to store the data. The binary bufferes are
represented
by vectors, and I have assumed that a memcpy from the file buffer to
the
destination memory location is all that is needed to import the binary
format
from the file buffer. (If there are significant run-time overheads
associated with
moving NATIVE binary formats to the destination, please let me
know.)
A related issue is that people doing heavy-duty floating-point work
often use fancy binary data-file libraries like FITS, HDF5, NetCDF,
GRIB, etc. These typically provide storage of multi-dimensional arrays
(including dimension metadata) in a variety of datatypes, with automagic
conversion from on-disk to host formats for things like endianness and
floating-point formats. The on-disk formats are almost always binary
for just the reasons you outline (text is too bulky and too slow).

If you wanted to extend your benchmark in that direction, it would
be interesting to benchmark (say) HDF5 as well as "raw binary".
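
(Editorial sketch of what the HDF5 leg of such a benchmark could look
like, using the HDF5 1.8 C API from C++; the file and dataset names are
arbitrary, error checking is omitted, and this is an illustration rather
than code from the thread.)

#include <vector>
#include "hdf5.h"

void write_hdf5(const std::vector<double>& data)   // assumes data is non-empty
{
    hid_t file  = H5Fcreate("mydata.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { data.size() };              // one-dimensional dataset
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/samples", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, &data[0]);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}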

ciao,
--
-- "Jonathan Thornburg [remove -animal to reply]" <***@astro.indiana-zebra.edu>
Dept of Astronomy, Indiana University, Bloomington, Indiana, USA
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]