Discussion:
Reading and writing files using wifstream and wofstream
Edward Diener
2009-02-22 23:05:55 UTC
{ Note: the same article was recently posted to [comp.std.c++]. -mod }

I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.

Where in the C++ standard is this specified ?

Why was this specified in the C++ standard ?

This was all great news to me. I had always assumed that the wofstream
and wifstream output and input wide characters respectively.

Does the same output and input processing occur with all wide
character streams ? I can hardly believe that. If the file streams are
the only IO stream where this was specified, why were they made the
exception ?
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Ulrich Eckhardt
2009-02-23 18:24:06 UTC
Post by Edward Diener
I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.
This confuses a few things. The internally used character type is converted
to external bytes, regardless of whether the internal type is char or
wchar_t. Now, the external type is always represented as char (which is a
bit unfortunate, unsigned char would have been better), so this might be
the cause of your confusion.

Note: the external character type is actually part of the fstream template
parameters. The conversion takes place using the codecvt<> facets of the
locale.
Post by Edward Diener
This was all great news to me. I had always assumed that the wofstream
and wifstream output and input wide characters respectively.
The standard doesn't even require any particular representation for a
wchar_t, so what would "output wide characters" mean? No, sorry, if you
want to write to a file, you should first know the file format (e.g.
various Unicode encodings) and then convert your internal representation to
the external one accordingly.
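To make that concrete, here is a minimal sketch (mine, not from the thread; the function name and the choice of UTF-16LE are assumptions) of converting an internal wide string to an explicitly chosen external byte encoding, rather than relying on the stream's default conversion:

```cpp
#include <fstream>
#include <string>

// Sketch: serialize a wide string as UTF-16LE bytes, i.e. pick the
// external encoding explicitly. Assumes the wchar_t values are valid
// UTF-16 code units (true for BMP characters; surrogate pairs pass
// through unchanged, non-BMP wchar_t values on 4-byte platforms are
// truncated).
std::string to_utf16le_bytes(const std::wstring& ws)
{
    std::string bytes;
    bytes.reserve(ws.size() * 2);
    for (std::wstring::const_iterator it = ws.begin(); it != ws.end(); ++it) {
        unsigned long cu = static_cast<unsigned long>(*it) & 0xFFFFul;
        bytes += static_cast<char>(cu & 0xFF);        // low byte first
        bytes += static_cast<char>((cu >> 8) & 0xFF); // then high byte
    }
    return bytes;
}
```

The resulting byte string can then be written through a plain ofstream opened with std::ios::binary, so no further conversion is applied on the way out.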

Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932


Edward Diener
2009-02-23 23:59:52 UTC
Post by Ulrich Eckhardt
Post by Edward Diener
I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.
This confuses a few things. The internally used character type is converted
to external bytes, regardless of whether the internal type is char or
wchar_t. Now, the external type is always represented as char (which is a
bit unfortunate, unsigned char would have been better), so this might be
the cause of your confusion.
I do not know what you mean by internal type and external type in your
reply. Please explain.

My assumption was that a wofstream and wifstream reads and writes
directly what is considered "wide characters" for that implementation,
however the implementation chooses to define a "wide character". In
Windows and VC++ the wchar_t would naturally be a UTF16 character. Yet I
find in the VC++ implementation that when I try to stream wide
characters to a wofstream, VC++ does a wide character to multibyte
conversion and then writes the resulting multibyte character to the
file. Similarly, because of the way it implements wofstream, when I use
a wifstream to read a wide character in a file to a wchar_t variable,
wifstream reads the character from the file as a multibyte character
and converts it back to a wchar_t before setting my wchar_t variable
with it.

Notice that using wide streams in the VC++ implementation, it is
impossible to write wide characters to a file. This seemed really
amazing to me.

When I questioned this on a Microsoft C++ online forum I was told that
the latest C++ standard in section 27.8.1 File Streams mandated this
behavior and that was probably the reason why Microsoft implemented
wofstream and wifstream the way they did, in other words they were only
following the C++ standard.

In section 27.8.1, footnote 306 I find:

"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should
convert between wide character sequences and multibyte character sequences."

This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multi-byte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a Unicode 16 or
32 bit sequence of characters, or any other type of character even
though C++ does support a native character type, wchar_t, which does not
have to be a multibyte character.

Furthermore, C++ programmers may implement their own character types, of
more than one byte, via a library. Not allowing these other character
types in a file cannot be right.
Post by Ulrich Eckhardt
Note: the external character type is actually part of the fstream template
parameters. The conversion takes place using the codecvt<> facets of the
locale.
Post by Edward Diener
This was all great news to me. I had always assumed that the wofstream
and wifstream output and input wide characters respectively.
The standard doesn't even require any particular representation for a
wchar_t, so what would "output wide characters" mean? No, sorry, if you
want to write to a file, you should first know the file format (e.g.
various Unicode encodings) and then convert your internal representation to
the external one accordingly.
I disagree completely with you. Surely if I have a sequence of wide
characters, I should be able to write those characters to a file stream
as wide characters. It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again. Why must
codecvt convert them to multibyte characters ? I did not ask for that
and I don't want it. The default should naturally be to leave them as is.

In the practical case which I ran up against, I needed to emulate an OS
file type which was written as a file of UTF16 characters. So my wide
characters happened to be valid UTF16 characters. But when I tried to
write them out as wide characters, they got reduced back down to
multibyte characters in the file, and naturally the file format was now
wrong.

I know I am supposed to write my own codecvt to tell the default codecvt
for my wide character stream to leave the characters alone and not
convert each wide character to its multibyte equivalent. But I do not
believe this should have been necessary in the first place.

I always liked the C++ mindset of "trust the programmer". The footnote I
quoted in 27.8.1 appears to go against that belief. But perhaps I have
misinterpreted that footnote.
Marsh Ray
2009-02-24 07:48:29 UTC
On Feb 23, 5:59 pm, Edward Diener
Post by Edward Diener
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should
convert between wide character sequences and multibyte character sequences."
This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multi-byte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a Unicode 16 or
32 bit sequence of characters, or any other type of character even
though C++ does support a native character type, wchar_t, which does not
have to be a multibyte character.
Often the term "multibyte encoding" is used to mean the encoding is
not constrained to use a fixed organization of bytes (code units,
whatever) per character. In other words, "anything goes", and the
system is not going to make assumptions about how characters map to
bytes without knowing the details of the specific encoding. Multibyte
seems to be the most general class of encoding schemes and has fixed-
size schemes as a subset.
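To make the "no fixed number of bytes per character" point concrete (a sketch of mine, not from the thread), the length of a UTF-8 sequence can be read off its lead byte, which is exactly why you cannot index characters directly in a multibyte encoding:

```cpp
#include <cstddef>

// Sketch: number of bytes in a UTF-8 sequence, determined from the
// lead byte alone. ASCII stays one byte; other characters take 2-4.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)         return 1; // 0xxxxxxx: ASCII subset
    if ((lead >> 5) == 0x6)  return 2; // 110xxxxx
    if ((lead >> 4) == 0xE)  return 3; // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx
    return 0; // continuation byte or invalid lead byte
}
```

A fixed-width scheme is then just the special case where this function would return the same value for every lead byte.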

So I would be really surprised if writing UTF-16 would fall afoul of
an international standard's intent to require "multibyte" characters.

Some APIs however (Win32 comes to mind) tend to toss around terms like
"multibyte", "ASCII", "ANSI", etc. to the point that they have an
implied meaning that isn't really correct.
Post by Edward Diener
I disagree completely with you. Surely if I have a sequence of wide
characters, I should be able to write those characters to a file stream
as wide characters. It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again.
Then just write bytes, not characters. Use something other than text
streams if you don't want it to manage your text.
Post by Edward Diener
Why must
codecvt convert them to multibyte characters ? I did not ask for that
and I don't want it. The default should naturally be to leave them as is.
It seems even on UTF-16-centric systems the great majority of "plain
text files" and code written to work with them expects some poorly-
specified ASCII-based mostly-narrow encoding scheme. On some systems,
this even changes based on program environment variables specifying
locales and so on.

It's a mess. Sometimes I wish everyone would just standardize on UTF-8
(until the need for a case-insensitive string comparison function comes
up).

- Marsh
Edward Diener
2009-02-24 18:55:08 UTC
Post by Marsh Ray
On Feb 23, 5:59 pm, Edward Diener
Post by Edward Diener
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should
convert between wide character sequences and multibyte character sequences."
This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multi-byte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a Unicode 16 or
32 bit sequence of characters, or any other type of character even
though C++ does support a native character type, wchar_t, which does not
have to be a multibyte character.
Often the term "multibyte encoding" is used to mean the encoding is
not constrained to use a fixed organization of bytes (code units,
whatever) per character. In other words, "anything goes", and the
system is not going to make assumptions about how characters map to
bytes without knowing the details of the specific encoding. Multibyte
seems to be the most general class of encoding schemes and has fixed-
size schemes as a subset.
So I would be really surprised if writing UTF-16 would fall afoul of
an international standard's intent to require "multibyte" characters.
So you are saying that the term 'multibyte' in the above refers only to
the fact that it is a sequence of bytes and not to what is commonly
known as "multibyte encodings" as opposed to "wide character encodings" ?
Post by Marsh Ray
Some APIs however (Win32 comes to mind) tend to toss around terms like
"multibyte", "ASCII", "ANSI", etc. to the point that they have an
implied meaning that isn't really correct.
Post by Edward Diener
I disagree completely with you. Surely if I have a sequence of wide
characters, I should be able to write those characters to a file stream
as wide characters. It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again.
Then just write bytes, not characters. Use something other than text
streams if you don't want it to manage your text.
That was my eventual solution when I realized that the default codecvt
for wofstream using VC++ was to coerce my wide characters to multibyte
encoding. But this means that instead of using wofstream I had to use
ofstream and treat each one of my characters as two single-byte
characters. That's a kludge. I am trying to write C++ and I do not like
kludges.
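The workaround described above can be sketched like this (the function name is mine; nonportable by design, since the output depends on sizeof(wchar_t) and host byte order, and on VC++ it happens to produce UTF-16LE for BMP text):

```cpp
#include <fstream>
#include <string>

// Sketch of the kludge: open a narrow stream in binary mode and dump
// the wchar_t buffer as raw bytes, bypassing wide-stream code
// conversion entirely. On VC++ (2-byte little-endian wchar_t) this
// yields UTF-16LE output; on other platforms the result differs.
void write_wide_raw(const char* filename, const std::wstring& ws)
{
    std::ofstream out(filename, std::ios::out | std::ios::binary);
    out.write(reinterpret_cast<const char*>(ws.data()),
              static_cast<std::streamsize>(ws.size() * sizeof(wchar_t)));
}
```

The point of calling it a kludge stands: the file format is now whatever the compiler's in-memory representation happens to be.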
Post by Marsh Ray
Post by Edward Diener
Why must
codecvt convert them to multibyte characters ? I did not ask for that
and I don't want it. The default should naturally be to leave them as is.
It seems even on UTF-16-centric systems the great majority of "plain
text files" and code written to work with them expects some poorly-
specified ASCII-based mostly-narrow encoding scheme. On some systems,
this even changes based on program environment variables specifying
locales and so on.
It's a mess. Sometimes I wish everyone would just standardize on UTF-8
(until the need for a case-insensitive string comparison function comes
up).
Windows is UTF16. That's what MS promotes. I use their compiler and the
latest C++ techniques and I can't write UTF16 files without a kludge. If
the 'multibyte' in the footnote I quoted does indeed just mean a
'sequence of bytes' rather than a 'multibyte encoding', then VC++'s
default encoding for wofstream is wrong IMO.
Marsh Ray
2009-02-24 20:34:42 UTC
On Feb 24, 12:55 pm, Edward Diener
Post by Edward Diener
Post by Marsh Ray
On Feb 23, 5:59 pm, Edward Diener
Post by Edward Diener
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should
convert between wide character sequences and multibyte character sequences."
...
Post by Edward Diener
So you are saying that the term 'multibyte' in the above refers only to
the fact that it is a sequence of bytes and not to what is commonly
known as "multibyte encodings" as opposed to "wide character encodings" ?
No, I'm saying that "multibyte" as I've come across the term seems to
mean that the number of bytes per char is not known to be fixed, i.e.,
you can't use a simple array index operation to access characters.
Since this is essentially a negative statement, my guess is that the
standard's intent was to allow for truly multibyte encodings without
prohibiting fixed-width encodings.

Another way to look at it is that UTF-8 is effectively a fixed-width
encoding for a large set of common strings, and the standard would
certainly not prohibit the writing of those!
Post by Edward Diener
Windows is UTF16. That's what MS promotes. I use their compiler and the
latest C++ techniques and I can't write UTF16 files without a kludge. If
the 'multibyte' in the footnote I quoted does indeed just mean a
'sequence of bytes' rather than a 'multibyte encoding', then VC++'s
default encoding for wofstream is wrong IMO.
Yep. Even writing through std::wcout to a console (which natively supports
Unicode) filters your text through some narrow code page. The defaults
for that kind of thing are probably chosen on the basis of what breaks
the least code.

I've never seen any of MS's code use iostream; they usually promote
proprietary libraries or APIs for writing to files. Just wait 'til you
need to open a filename with non-ASCII chars in it.

C++03 wchar_t is not hard-wired for UTF-16 though, so the current
situation is perhaps understandable. C++0x should improve things
significantly in these areas.

- Marsh
Ulrich Eckhardt
2009-02-24 13:26:46 UTC
Post by Edward Diener
Post by Ulrich Eckhardt
Post by Edward Diener
I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.
This confuses a few things. The internally used character type is
converted to external bytes, regardless of whether the internal type is
char or wchar_t. Now, the external type is always represented as char
(which is a bit unfortunate, unsigned char would have been better), so
this might be the cause of your confusion.
I do not know what you mean by internal type and external type in your
reply. Please explain.
Files store bytes, i.e. raw memory without any intrinsic meaning. To your
application, that is the external type. Internally, your application could
use whatever it wants to represent strings, be it char, wchar_t or
something else and with whatever meaning attached (Unicode, some
codepage..), but when it wants to store those strings on disk, it must
eventually convert those strings to bytes.
Post by Edward Diener
My assumption was that a wofstream and wifstream reads and writes
directly what is considered "wide characters" for that implementation,
however the implementation chooses to define a "wide character". In
Windows and VC++ the wchar_t would naturally be a UTF16 character.
Actually, the way that wchar_t is written to file is not defined. Just using
memcpy()-like semantics is as okay as any other. However, that also means
it would be stupid for the programmer to rely on those unspecified
semantics; they should make a choice.
Post by Edward Diener
Notice that using wide streams in the VC++ implementation it is
impossible to write wide characters to a file. This appeared to me
really amazing.
That is not true. You can and should select an encoding, using an appropriate
codecvt facet. There is one problem though: the C++ standard doesn't define
any. Many implementations come with some facets though, some even use UTF-8
per default for wchar_t streams.
Post by Edward Diener
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should convert between
wide character sequences and multibyte character sequences."
This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multi-byte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a Unicode 16 or
32 bit sequence of characters, or any other type of character even
though C++ does support a native character type, wchar_t, which does not
have to be a multibyte character.
Sorry, but the wording in the above quote is not good and your understanding is
wrong. You can write codepages or UTF-32, even though those are not
considered multibyte encodings.
Post by Edward Diener
It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again. Why must
codecvt convert them to multibyte characters ? I did not ask for that
and I don't want it. The default should naturally be to leave them as is.
In the practical case which I ran up against, I needed to emulate an OS
file type which was written as a file of UTF16 characters.
Little or big endian? See, that is the point: you should choose your
external encoding, not take some unspecified default that is, as you see,
not portable. The standard is trying to make you make a choice, as it can't
provide a sane default. The default you want would actually be a bad one,
as it doesn't give a uniform encoding and thus makes files unreadable on
other systems. Note that nowadays, using UTF-8 would be a sane default, but
I think it wasn't as widely accepted then as it is now.

Uli
Edward Diener
2009-02-24 18:55:42 UTC
Post by Ulrich Eckhardt
Post by Edward Diener
Post by Ulrich Eckhardt
Post by Edward Diener
I was told that the C++ standard requires that wide characters written
to an output stream via wofstream be converted to a narrow character
before being written and that characters being read in using wifstream
be read as single characters and converted to a wide character after
being read.
This confuses a few things. The internally used character type is
converted to external bytes, regardless of whether the internal type is
char or wchar_t. Now, the external type is always represented as char
(which is a bit unfortunate, unsigned char would have been better), so
this might be the cause of your confusion.
I do not know what you mean by internal type and external type in your
reply. Please explain.
Files store bytes, i.e. raw memory without any intrinsic meaning. To your
application, that is the external type. Internally, your application could
use whatever it wants to represent strings, be it char, wchar_t or
something else and with whatever meaning attached (Unicode, some
codepage..), but when it wants to store those strings on disk, it must
eventually convert those strings to bytes.
Ok, I understand that.
Post by Ulrich Eckhardt
Post by Edward Diener
My assumption was that a wofstream and wifstream reads and writes
directly what is considered "wide characters" for that implementation,
however the implementation chooses to define a "wide character". In
Windows and VC++ the wchar_t would naturally be a UTF16 character.
Actually, the way that wchar_t is written to file is not defined. Just using
memcpy()-like semantics is as okay as any other. However, that also means
it would be stupid for the programmer to rely on those unspecified
semantics; they should make a choice.
Whether it is defined or not by the C++ standard, it certainly seems
natural to me that if I am writing a wchar_t to a file using wofstream
that the file now contains that 'wchar_t', not a 'char'.
Post by Ulrich Eckhardt
Post by Edward Diener
Notice that using wide streams in the VC++ implementation it is
impossible to write wide characters to a file. This appeared to me
really amazing.
That is not true. You can and should select an encoding, using an appropriate
codecvt facet. There is one problem though: the C++ standard doesn't define
any. Many implementations come with some facets though, some even use UTF-8
per default for wchar_t streams.
You are correct. What I should have said is that it was impossible using
the default codecvt. Writing wide characters to a file in VC++ using
wide streams requires me to create my own codecvt to do it. That still
seems ridiculous to me.
Post by Ulrich Eckhardt
Post by Edward Diener
"A File is a sequence of multibyte characters. In order to provide the
contents as a wide character sequence, filebuf should convert between
wide character sequences and multibyte character sequences."
This makes no sense to me and I question why it is in the standard. I
have no idea why the C++ standard decided that a file is a sequence of
multi-byte characters. It basically says to me that the C++ committee
has mandated that a file created under C++ can never be a Unicode 16 or
32 bit sequence of characters, or any other type of character even
though C++ does support a native character type, wchar_t, which does not
have to be a multibyte character.
Sorry, but the wording in the above quote is not good and your understanding is
wrong. You can write codepages or UTF-32, even though those are not
considered multibyte encodings.
Then the wording above is poor, as you say. Evidently, in your
interpretation of it, the wording in the footnote does not mean
'multibyte encoding' when it refers to 'multibyte characters', but just
'sequence of bytes'.
Post by Ulrich Eckhardt
Post by Edward Diener
It is irrelevant what encoding those characters
represent when I write them, although of course some program should
understand the encoding if it needs to read them back in again. Why must
codecvt convert them to multibyte characters ? I did not ask for that
and I don't want it. The default should naturally be to leave them as is.
In the practical case which I ran up against, I needed to emulate an OS
file type which was written as a file of UTF16 characters.
Little or big endian? See, that is the point: you should choose your
external encoding, not take some unspecified default that is, as you see,
not portable.
I assume that whether it is little endian or big endian resolves to
whether a 'wchar_t' in the compiler ( VC++ 9 ) is little endian or big
endian ( it's little endian ).
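Whether a memory-layout dump comes out little or big endian can be checked directly (a sketch of mine, not from the thread), since an "as is" write of wchar_t values inherits exactly the host's byte order:

```cpp
// Sketch: determine host byte order by inspecting a 16-bit probe
// value through an unsigned char pointer. This ordering is what any
// memcpy-style dump of wchar_t data would produce.
bool host_is_little_endian()
{
    unsigned short probe = 0x0102;
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(&probe);
    return p[0] == 0x02; // low-order byte stored first => little endian
}
```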
Post by Ulrich Eckhardt
The standard is trying to make you make a choice, as it can't
provide a sane default. The default you want would actually be a bad one,
as it doesn't give a uniform encoding and thus make files unreadable on
other systems. Note that nowadays, using UTF-8 would be a sane default, but
I think that wasn't as widely accepted as it is now.
I still do not understand why you, or anybody, would think that if I
have a wchar_t being output to a wide stream the default should be
anything other than outputting that wchar_t as is without any code
conversion via a codecvt.
Frank Birbacher
2009-02-25 19:51:37 UTC
Hi!
Post by Edward Diener
Whether it is defined or not by the C++ standard, it certainly seems
natural to me that if I am writing a wchar_t to a file using wofstream
that the file now contains that 'wchar_t', not a 'char'.
Before understanding locales and wide streams I would have agreed. But
limiting the wide streams to only input/output exactly a 'wchar_t' is
pretty silly.
Post by Edward Diener
You are correct. What I should have said is that it was impossible using
the default codecvt. Writing wide characters to a file in VC++ using
wide streams requires me to create my own codecvt to do it. That still
seems ridiculous to me.
Well, yes, it seems ridiculous. You should not need to write your own
facet. The whole locale system is actually pretty smart. You just need
to select the correct facet for your system. This is system/compiler
specific. An example:

On my Gentoo Linux system the glibc (C library) is responsible for
locales. I have the following locales installed (first column is the
country code suffixed with a name (like @euro), second the encoding used):

de_DE ISO-8859-1
de_DE@euro ISO-8859-15
de_DE.UTF-8 UTF-8

Unfortunately my system cannot support UTF-16 :( Well, given this
knowledge I can use wstreams to operate on __ANY__ encoding.

#include <iostream>
#include <fstream>
#include <locale>

using namespace std;

void writeTest(const char* const filename, locale const& filelocale)
{
    wofstream file;
    file.imbue(filelocale);
    file.open(filename);

    //compiler needs to understand source file encoding
    //in order to parse this string literal:
    file << L"Test using umlauts ÄÖÜäöü\n";

    file.close();
}

void readTest(const char* const filename, locale const& filelocale)
{
    wifstream file;
    file.imbue(filelocale);
    file.open(filename);

    //converts from filelocale to global locale:
    wcout << file.rdbuf();

    file.close();
}

int main()
{
    //implementation specific
    const char* const localename = "de_DE.UTF-8";

    //set global locale which is used as default for all streams:
    //static function std::locale::global
    locale::global(locale(localename));

    //create a locale object from a different string:
    const locale filelocale("de_DE"); //my ISO-8859-1 locale

    static const char* const filename = "testLocale.txt";
    writeTest(filename, filelocale);
    readTest(filename, filelocale);
}
Post by Edward Diener
I still do not understand why you, or anybody, would think that if I
have a wchar_t being output to a wide stream the default should be
anything other than outputting that wchar_t as is without any code
conversion via a codecvt.
The codecvt enables you to use different encodings simultaneously. This
is good. For your problem it boils down to specifying the correct locale
which your compiler vendor or system vendor should supply. So you need to
figure out which locale string refers to the UTF-16 locale you need.

HTH,
Frank
Edward Diener
2009-02-26 04:43:06 UTC
Post by Frank Birbacher
Hi!
Post by Edward Diener
Whether it is defined or not by the C++ standard, it certainly seems
natural to me that if I am writing a wchar_t to a file using wofstream
that the file now contains that 'wchar_t', not a 'char'.
Before understanding locales and wide streams I would have agreed. But
limiting the wide streams to only input/output exactly a 'wchar_t' is
pretty silly.
I never said that wide streams should be limited to only input/output
wchar_ts. What I said is that the default should be that wide streams
output wide characters, in other words that no code conversion takes place.

That a wide stream should input/output wide characters by default,
without any code conversion taking place, is so apparent to me that
arguments otherwise still make no sense to me.
Post by Frank Birbacher
Post by Edward Diener
You are correct. What I should have said is that it was impossible using
the default codecvt. Writing wide characters to a file in VC++ using
wide streams requires me to create my own codecvt to do it. That still
seems ridiculous to me.
Well, yes, it seems ridiculous. You should not need to write your own
facet. The whole locale system is actually pretty smart. You just need
to select the correct facet for your system. This is system/compiler
On my Gentoo Linux system the glibc (c library) is responsible for
locales. I have the following locales installed (first column is the country
de_DE ISO-8859-1
de_DE.UTF-8 UTF-8
Unfortunately my system cannot support UTF-16 :( Well, given this
knowledge I can use wstreams to operate on __ANY__ encoding.
#include <iostream>
#include <fstream>
#include <locale>
using namespace std;
void writeTest(const char* const filename, locale const& filelocale)
{
wofstream file;
file.imbue(filelocale);
file.open(filename);
//compiler needs to understand source file encoding
file << L"Test using umlauts ÄÖÜäöü\n";
file.close();
}
void readTest(const char* const filename, locale const& filelocale)
{
wifstream file;
file.imbue(filelocale);
file.open(filename);
wcout << file.rdbuf();
file.close();
}
int main()
{
//implementation specific
const char* const localename = "de_DE.UTF-8";
//static function std::locale::global
locale::global(locale(localename));
const locale filelocale("de_DE"); //my ISO-8859-1 locale
static const char* const filename = "testLocale.txt";
writeTest(filename, filelocale);
readTest(filename, filelocale);
}
I am glad that just setting the correct locale will change the way that
streams work, without having to deal with codecvt directly. I am glad to
be shown this technique. But once again I cannot understand why the
default locale for a wide stream would not write wide characters.
Post by Frank Birbacher
Post by Edward Diener
I still do not understand why you, or anybody, would think that if I
have a wchar_t being output to a wide stream the default should be
anything other than outputting that wchar_t as is without any code
conversion via a codecvt.
The codecvt enables you to use different encodings simultaneously. This
is good. For your problem it boils down to specifying the correct locale
which your compiler vendor or system vendor shall supply. So you need to
figure out which locale string refers to the UTF-16 locale you need.
OK, I will try to find this out regarding VC++ 9. I recall attempting to
do this in the past only to hit a brick wall: no locale other than the
standard "C" locale was provided.
Ulrich Eckhardt
2009-02-27 00:08:06 UTC
Permalink
[...] once again I can not understand why the default
locale for a wide stream would not write wide characters.
Have you ever helped someone troubleshoot code like this:

struct foo {
char x;
int y;
double z;
};
foo f = ...;
fwrite(&f, sizeof f, 1, some_stream);  /* raw in-memory bytes go to the file */

...where the author wondered why they suddenly can't read their files on
system X or with compiler Y? The error in this code is that the author did
not define a file format for their data but simply used the compiler's
in-memory layout, which is anything but constant.

This laziness makes files nonportable, as neither the layout nor the size
of doubles, ints, or the structure itself is fixed; yet it is a beginners'
mistake made over and over again. What you are asking for is essentially
the same thing, i.e. a bitwise copy of compiler-dependent data into a file.

Why would anyone in a sane state of mind make such an atrocity the default?
Instead, a sane default would either be a file format that allows easily
reading back the content on any system (like UTF-8) or one that raises a
big red flag to the programmer to notify them that what they are doing is
wrong.


Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932


Edward Diener
2009-02-27 06:45:37 UTC
Permalink
Post by Ulrich Eckhardt
[...] once again I can not understand why the default
locale for a wide stream would not write wide characters.
struct foo {
char x;
int y;
double z;
};
foo f = ...;
fwrite( some_stream, &f, ...);
...where the author wondered why suddenly they can't read their files on
system X or with compiler Y? The error in this code is that the author did
not define a file format for their data but just used the in-memory layout
of the compiler, which is by far not a constant.
This laziness makes files nonportable, as neither the layout and size of
doubles, ints or the structure is defined, but seems to be a beginners'
mistake made over and over again. What you are asking for is essentially
the same, i.e. a bitwise copy of compiler-dependent things into a file.
Yes. I am concerned that what I write out is what I described, not
whether on another system/compiler it can be read back in. If I really
were concerned about portability I would choose a library, like Boost's
Serialization library, or perhaps conversion to XML ( the Serialization
library also has an XML archive which can be used ).
Post by Ulrich Eckhardt
Why would anyone in a sane state of mind make such an atrocity the default?
Because they do not care about what you care about when outputting data
to a file. They assume that they get what they are asking for, a bitwise
rendering of the data which they are outputting.
Post by Ulrich Eckhardt
Instead, a sane default would either be a file format that allows easily
reading back the content on any system (like UTF-8) or one that raises a
big red flag to the programmer to notify them that what they are doing is
wrong.
I understand your point but when I write out to a file a wchar_t on a
system whose basic support is for the UTF16 encoding, I expect a file
which has a sequence of UTF16, aka 16 bit, characters, not a file which
has some equivalent multibyte encoding. If I wanted code conversion I
could use some locale or codecvt to do it. That the default should
perform such a code conversion seems logical to you and illogical to me.

However what you have told me, and others have told me, is that the C++
standard does not mandate that code conversion between internal wide
characters and external single character sequences has to take place,
but only that the resulting file stream is viewed as a sequence of
single characters.

Those who claimed on a Microsoft forum that the code conversion between
wide characters and multibyte encodings, which VC++ 9 does using the
default locale, is part of the C++ standard said that it was mandated by
27.8.1 and footnote 306. In footnote 306 I read the term 'multibyte
character sequences' to refer to just a sequence of single characters
and not a multibyte encoding. Thus in my reading of 27.8.1 and footnote
306, it is perfectly valid to output a wchar_t to its bit equivalent of
two bytes which is little endian on Windows. That Microsoft chose not to
do so is not the problem of the C++ standard it seems to me, but I just
wanted to clarify it here. At least I can go back to the Microsoft forum
and argue that the C++ standard does not enforce the VC++ 9 behavior.
Bart van Ingen Schenau
2009-02-27 15:08:32 UTC
Permalink
Post by Edward Diener
I understand your point but when I write out to a file a wchar_t on a
system whose basic support is for the UTF16 encoding, I expect a file
which has a sequence of UTF16, aka 16 bit, characters, not a file
which has some equivalent multibyte encoding.
The file will contain a sequence of multibyte characters anyway, because
a UTF-16 encoding also fits the C++ definition of a multibyte character.

In clause 1.3.8, the C++ standard defines the term 'multibyte character'
as "a sequence of one or more bytes representing a member of the
extended character set of either the source or the execution
environment. The extended character set is a superset of the basic
character set (2.2)."
Additionally, I can find no requirement that an extended character set
must use the same representations for characters that are also a part
of the basic character set.
Post by Edward Diener
If I wanted code
conversion I could use some locale or codecvt to do it. That the
default should signify such code conversion seems logical to you and
illogical to me.
However what you have told me, and others have told me, is that the
C++ standard does not mandate that code conversion between internal
wide characters and external single character sequences has to take
place, but only that the resulting file stream is viewed as a sequence
of single characters.
You are right. The conversion that is mandated is between a 'sequence of
bytes' and a 'sequence of wide characters'. This conversion can be a
degenerate one that simply copies the bytes of each wide character, so
that multiple bytes in the file make up one wide character.
Post by Edward Diener
Those who claimed on a Microsoft forum that the code conversion
between wide characters and multibyte encodings, which VC++ 9 does
using the default locale, is part of the C++ standard said that it was
mandated by 27.8.1. amd footnote 306. In footnote 306 I read the term
'multibyte character sequences' to refer to just a sequence of single
characters and not a multibyte encoding. Thus in my reading of 27.8.1
and footnote 306, it is perfectly valid to output a wchar_t to its bit
equivalent of two bytes which is little endian on Windows. That
Microsoft chose not to do so is not the problem of the C++ standard it
seems to me, but I just wanted to clarify it here. At least I can go
back to the Microsoft forum and argue that the C++ standard does not
enforce the VC++ 9 behavior.
Yes. It seems that those others misinterpreted the term 'multibyte
character' to a narrower meaning than it has in C++.

Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

Frank Birbacher
2009-02-27 15:09:26 UTC
Permalink
Hi!
Post by Edward Diener
I understand your point but when I write out to a file a wchar_t on a
system whose basic support is for the UTF16 encoding, I expect a file
which has a sequence of UTF16, aka 16 bit, characters, not a file which
has some equivalent multibyte encoding.
UTF-16 is a multibyte encoding, and it is not a fixed-width multibyte
encoding. For "higher" Unicode characters the translation into UTF-16
results in _two_ two-byte code units (a high and a low surrogate), that
is, 4 bytes in total. The fixed-width encoding that maps 16-bit values
directly to two-byte codes is called "UCS-2" and can only represent the
first 2^16 characters of Unicode. I guess this is what you meant.

Well, remember Unicode has more than 2^16 characters defined, which is
why there are 32bit wchar_t implementations and UCS-4 and UTF-32.

Frank
Jeff Koftinoff
2009-02-27 23:56:22 UTC
Permalink
Post by Frank Birbacher
UTF-16 is a multibyte encoding, and it is not a fixed-multibyte
encoding. For "higher" Unicode characters the translation into UTF-16
results in _two_ _two_-byte sequences (high and low surrogates), that
means 4 bytes in total. The encoding of 16bit characters to exactly
these two-byte codes is called "UCS-2" and can only represent the first
2^16 characters of Unicode. I guess this is what you meant.
Well, remember Unicode has more than 2^16 characters defined, which is
why there are 32bit wchar_t implementations and UCS-4 and UTF-32.
And in fact the surprising thing is that even in UTF-32 a single
'grapheme' can require multiple UTF-32 code units to describe...

--jeffk++