Post by Ulrich Eckhardt
Post by Edward Diener
I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.
This confuses a few things. The internally used character type is converted
to external bytes, regardless of whether the internal type is char or
wchar_t. Now, the external type is always represented as char (which is a
bit unfortunate, unsigned char would have been better), so this might be
the cause of your confusion.
I do not know what you mean by internal type and external type in your
reply. Please explain.
My assumption was that a wofstream and wifstream reads and writes
directly what is considered "wide characters" for that implementation,
however the implementation chooses to define a "wide character". In
Windows and VC++ a wchar_t would naturally be a UTF-16 code unit. Yet I
find in the VC++ implementation that when I try to stream wide
characters to a wofstream, VC++ does a wide character to multibyte
conversion and then writes the resulting multibyte character to the
file. Similarly, because of the way it implements wofstream, when I use
a wifstream to read a wide character in a file to a wchar_t variable,
wifstream reads the character from the file as a multibyte character
and converts it back to a wchar_t before setting my wchar_t variable
with it.
Notice that using wide streams in the VC++ implementation it is
impossible to write wide characters to a file unchanged. This seemed
really amazing to me.
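To make the behavior concrete, here is a minimal sketch (the file name
and contents are my own illustrative example) of what happens with the
default locale:

#include <fstream>

int main()
{
    std::wofstream out("demo.txt"); // default "C" locale
    out << L"AB";                   // two wide characters in...
}
// ...but on disk demo.txt is two bytes, 0x41 0x42: each wchar_t was
// passed through the locale's codecvt<wchar_t, char> facet and
// written out as a narrow (multibyte) character.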
When I questioned this on a Microsoft C++ online forum I was told that
the latest C++ standard in section 27.8.1 File Streams mandated this
behavior and that was probably the reason why Microsoft implemented
wofstream and wifstream the way they did; in other words, they were
only following the C++ standard.
In section 27.8.1, footnote 306 I find:
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should
convert between wide character sequences and multibyte character sequences."
This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multibyte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a UTF-16 or
UTF-32 sequence of characters, or a sequence of any other character
type, even though C++ does support a native character type, wchar_t,
which does not have to be a multibyte character.
Furthermore, C++ programmers may implement their own character type of
more than one byte via a library. Not allowing these other character
types in a file cannot be right.
Post by Ulrich Eckhardt
Note: the external character type is actually part of the fstream template
parameters. The conversion takes place using the codecvt<> facets of the
locale.
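For instance (my own sketch, not from the original posts; the locale
name and file name are illustrative), the facet in effect can be
changed by imbuing a different locale before the file is opened:

#include <fstream>
#include <locale>

int main()
{
    std::wofstream out;
    // The wide-to-multibyte conversion is performed by the
    // codecvt<wchar_t, char, mbstate_t> facet of the stream's locale,
    // so replacing the locale replaces the conversion.
    out.imbue(std::locale(""));   // e.g. the user's native locale
    out.open("data.txt");
    out << L"text";
}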
Post by Edward Diener
This was all great news to me. I had always assumed that the wofstream
and wifstream output and input wide characters respectively.
The standard doesn't even require any particular representation for a
wchar_t, so what would "output wide characters" mean? No, sorry, if you
want to write to a file, you should first know the file format (e.g.
various Unicode encodings) and then convert your internal representation to
the external one accordingly.
I disagree completely with you. Surely if I have a sequence of wide
characters, I should be able to write those characters to a file stream
as wide characters. It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again. Why must
codecvt convert them to multibyte characters? I did not ask for that,
and I don't want it. The default should naturally be to leave them as is.
In the practical case which I ran up against, I needed to emulate an OS
file type which was written as a file of UTF-16 characters. So my wide
characters happened to be valid UTF-16 code units. But when I tried to
write them out as wide characters they got reduced back down to
multibyte characters in the file, and naturally the file format was now
wrong.
I know I am supposed to write my own codecvt, replacing the default one
for my wide character stream, so that the characters are left alone and
each wide character is not converted to its multibyte equivalent. But I
do not believe this should have been necessary in the first place.
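For reference, such a replacement facet might look roughly like the
sketch below (my own, untested against any particular implementation;
it copies the raw bytes of each wchar_t in host byte order and does no
BOM handling):

#include <cstring>
#include <fstream>
#include <locale>

// Pass-through facet: moves the raw bytes of each wchar_t to and from
// the file instead of doing a locale-based wide/multibyte conversion.
class raw_wchar_codecvt
    : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    virtual result do_out(std::mbstate_t&,
                          const wchar_t* from, const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        // Copy whole wchar_t units while they fit in the output buffer.
        while (from != from_end &&
               std::size_t(to_end - to) >= sizeof(wchar_t))
        {
            std::memcpy(to, from, sizeof(wchar_t));
            to += sizeof(wchar_t);
            ++from;
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    virtual result do_in(std::mbstate_t&,
                         const char* from, const char* from_end,
                         const char*& from_next,
                         wchar_t* to, wchar_t* to_end,
                         wchar_t*& to_next) const
    {
        // Reassemble wchar_t units from the raw bytes in the file.
        while (to != to_end &&
               std::size_t(from_end - from) >= sizeof(wchar_t))
        {
            std::memcpy(to, from, sizeof(wchar_t));
            from += sizeof(wchar_t);
            ++to;
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    virtual bool do_always_noconv() const throw() { return false; }
    virtual int do_encoding() const throw() { return sizeof(wchar_t); }
    virtual int do_max_length() const throw() { return sizeof(wchar_t); }
};

Usage would be to imbue before opening, e.g.:

    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new raw_wchar_codecvt));
    out.open("utf16.dat", std::ios::out | std::ios::binary);

On a platform where wchar_t is 16 bits wide this writes the UTF-16 code
units unchanged; where it is 32 bits the output is raw UTF-32 units, so
the facet is only a starting point for a portable solution.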
I always liked the C++ mindset of "trust the programmer". The footnote I
quoted in 27.8.1 appears to go against that belief. But perhaps I have
misinterpreted that footnote.
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]