Discussion:
[BitPim-devel] Unicode string handling
Joe Pham
2006-04-10 22:48:04 UTC
Permalink
After doing some work with the new unicode string, I'm throwing in my $0.02:

1. We need to propely handle unicode 'terminator'.
2. By default, the 'encoding', 'read_encoding', and 'write_encoding' should be None, which then does no encoding/decoding (old behaviour). The reasons for this might be: (a) coders may want to do their own weird encode/decode, (b) some phones use STRING fields as data fields.
3. Need to decide on how to handle value keys like 'sizeinbytes', 'constant', truncation processing, etc. for both string and unicode.

-Joe Pham



_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!
Simon C
2006-04-11 03:25:11 UTC
Permalink
Post by Joe Pham
3. Need to decide on how to handle value keys like
'sizeinbytes', 'constant', truncation processing, etc. for
both string and unicode.
I have taken the string keywords/parameters as applying to the strings
written to or read from the phone not bitpim, not to the unicode strings
they are converted to.
Post by Joe Pham
1. We need to propely handle unicode 'terminator'.
At the moment the user sets the terminator as a parameter. There is a
problem at the moment with handling terminators of more than one byte when
the length is restricted, the code assumes the terminator is 1 byte long.
Ther terminator parameter has no effect on the string when it is converted
to unicode.
Post by Joe Pham
2. By default, the 'encoding', 'read_encoding', and
'write_encoding' should be None, which then does no
encoding/decoding (old behaviour). The reasons for this
might be: (a) coders may want to do their own weird
encode/decode, (b) some phones use STRING fields as data fields.
If maintaining compatability with existing code is most important I think we
should just have a different class for unicde strings.
I think the better "techincial" solution is to have one STRING class so that
all strings look the same to the common code regardless of the phone type
and fix the phones that use the string as data to use the DATA class instead
(or list of bytes).


Simon
Joe Pham
2006-04-11 05:20:56 UTC
Permalink
Post by Simon C
I have taken the string keywords/parameters as applying to the
strings written to or read from the phone
As an exmample, you're writing a unicode string to the phone and its physical length is more 'sizeinbytes', how would you truncate the string/value/field?
Post by Simon C
There is a problem at the moment with handling terminators of more
than one byte
Untill this issue is addressed, current handling of unicode strings will not work in all cases.
Post by Simon C
If maintaining compatability with existing code is most important
IMHO, I think it is just because we may not have access to all the phones implemented in BitPim to easily make (and test) any updates.
Post by Simon C
fix the phones that use the string as data to use the DATA class
instead (or list of bytes).
Ideally yes, however my previous point stands.

-Joe Pham




_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!
Roger Binns
2006-04-11 05:49:36 UTC
Permalink
Post by Joe Pham
As an exmample, you're writing a unicode string to the phone and
its physical length is more 'sizeinbytes', how would you truncate
the string/value/field?
Surely the Unicode string is first encoded into what the phone
uses, and then sizeinbytes truncates that? (Did you notice how
prescient I was in making sure that this 'size' field could
never be ambiguous :-)
Post by Joe Pham
Post by Simon C
There is a problem at the moment with handling terminators of more
than one byte
Untill this issue is addressed, current handling of unicode strings
will not work in all cases.
The terminator applies to the string on the phone which is in
the phone encoding. I don't see where the problem is.

Roger
Simon C
2006-04-11 05:58:35 UTC
Permalink
Post by Simon C
Post by Simon C
I have taken the string keywords/parameters as applying to
the strings
Post by Simon C
written to or read from the phone
As an exmample, you're writing a unicode string to the phone
and its physical length is more 'sizeinbytes', how would you
truncate the string/value/field?
You would truncate it until it fits the size available, the byte length is
the limiting factor not the number of characters.
I have implemented it to work this way at the moment if you look at the
_update function.
Also python has interesting behavior with the 'len' of byte srings, it
returns the number of bytes not characters, for unicode strings it returns
the number of characters.
Post by Simon C
Post by Simon C
There is a problem at the moment with handling terminators
of more than
Post by Simon C
one byte
Untill this issue is addressed, current handling of unicode
strings will not work in all cases.
None of the phones use more than a single byte for terminating characters
afaik, so the limitation is theoretical, it was also present before I made
any changes.
Conversion to unicode has always happened in bitpim, the changes to the
string class move it to a different point in the code, the advantage to the
doing it this way is that we can now control the character set used for the
converstion.
Post by Simon C
Post by Simon C
If maintaining compatability with existing code is most important
IMHO, I think it is just because we may not have access to
all the phones implemented in BitPim to easily make (and
test) any updates.
I think the safest way to do this would be to deprecate the STRING class and
leave it untouched and introduce a new one for unicode support. We would
then be able to migrate the existing code over as time allowed.
It would be possible to add the graceful degrade to ascii feature to the
existing STRING class without significant risk, its just a one or two line
change.

I see your point about backward compatability, it will not take long to add
a separate class. Anyone object?


Simon
Joe Pham
2006-04-11 22:16:50 UTC
Permalink
Post by Simon C
You would truncate it until it fits the size available, the byte
length is the limiting factor not the number of characters. I have
implemented it to work this way at the moment if you look at the
_update function.
Yes, I see that you back off 1-char at a time until the encoding result fits the length. However, the single-byte terminator (if present) still would mess it up. Consider:

100 STRING { 'encoding': 'utf_16le', 'terminator': ',' } test

then

*.test='abc' would produce 'a\x00b\x00c\x00,' which would certainly fail to decode.
Post by Simon C
None of the phones use more than a single byte for terminating
characters afaik,
I believe that is true.
Post by Simon C
so the limitation is theoretical,
Not quite, I'm working on a Motorola phone that supports UTF_16LE natively and expects a u'\x00' ('\x00\x00') as the terminator.
Post by Simon C
it was also present before I made any changes.
Also true. But your current implementation leaves no way around the encoding/decoding process (hence my previous suggestion of leaving 'encoding' as None).
Post by Simon C
I think the safest way to do this would be to deprecate the STRING
class and leave it untouched and introduce a new one for unicode
support.
I'm beginning to think that it's not such a bad idea after all!
Post by Simon C
The terminator applies to the string on the phone which is in the
phone encoding. I don't see where the problem is.
The problem is when the phone (a Motorola in this case) supports unicode string natively (UTF_16LE) with a terminator of '\x00\x00', and anything after that is garbage which most certainly will fail to decode.

-Joe Pham



_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!
Roger Binns
2006-04-12 04:47:01 UTC
Permalink
Post by Joe Pham
*.test='abc' would produce 'a\x00b\x00c\x00,' which would certainly fail to decode.
That is because you are treating the terminator as a byte. The
terminator should be in the encoding of the string. Additionally
there is no reason why the terminator can't be specified as
u',' or ",\x00".

I don't see any reason why this unified string class won't work
providing careful attention is paid to when conversions happen
and what encoding pieces of data (like the terminator) are in.

I think it is especially dangerous abandoning old phones. If we
don't bring their code up to date then they will slowly but
surely diverge from the main codebase. When someone finally
does use them, the exception will be confusing and not similar
to current exceptions. To fix anything would require two
steps. One is to bring the code up to date, and then the
second would be to fix the underlying issue. The difficulty
of the first (which gets harder and harder over time) means
the latter becomes less and less likely to be fixable.
See the lgtm_520 module for a good example of exactly this
happening.

Quite frankly BitPim is a community project. If people in
that community stop paying attention, or don't care about
particular models then I don't see the point of BitPim
also doing so.

Roger
Simon C
2006-04-11 23:22:20 UTC
Permalink
Post by Joe Pham
Not quite, I'm working on a Motorola phone that supports
UTF_16LE natively and expects a u'\x00' ('\x00\x00') as the
terminator.
Post by Simon C
it was also present before I made any changes.
Also true. But your current implementation leaves no way
around the encoding/decoding process (hence my previous
suggestion of leaving 'encoding' as None).
Oh bugger, it sounds like this has to be fixed, I was quietly hoping that
no-one would need more than single-byte byte strings :(. I'll spend some
time fixing up the code, I'll try to get something done in the next day.
Post by Joe Pham
Post by Simon C
I think the safest way to do this would be to deprecate the STRING
class and leave it untouched and introduce a new one for unicode
support.
I'm beginning to think that it's not such a bad idea after all!
I'll do this then. I'll call it USTRING.
Post by Joe Pham
Post by Simon C
The terminator applies to the string on the phone which is
in the phone
Post by Simon C
encoding. I don't see where the problem is.
The problem is when the phone (a Motorola in this case)
supports unicode string natively (UTF_16LE) with a terminator
of '\x00\x00', and anything after that is garbage which most
certainly will fail to decode.
The function that extracts the string from the packet (readfrombuffer) stops
when it hits the terminator even if there is more data available.
The current code is hard coded for one-byte terminators, but once fixed it
should work. I think the terminator would have to be assessed on a byte
boundary which is a multiple of its length to prevent \x8000\x0080 from
triggering terminator detection.
The terminator is currently treated as an integer not a character, so I
think an extra parameter "terminator_length" is going to be required to make
0 into a 16-bit 0.

Simon
Joe Pham
2006-04-12 01:45:57 UTC
Permalink
I'll spend some time fixing up the code, I'll try to get something
done in the next day.
If we all decide to go with a new class, I'd suggest restoring the STRING class and working on the new one. The STRING class works fine for string fields and needs no changes at this time. The terminator issue applies to unicode strings only, which would hopefully be handled properly in the new class.

-Joe Pham



_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!
Simon C
2006-04-12 02:20:58 UTC
Permalink
Post by Joe Pham
If we all decide to go with a new class, I'd suggest
restoring the STRING class and working on the new one. The
STRING class works fine for string fields and needs no
changes at this time. The terminator issue applies to
unicode strings only, which would hopefully be handled
properly in the new class.
Agreed.

Simon
Simon C
2006-04-12 06:16:27 UTC
Permalink
Post by Roger Binns
I think it is especially dangerous abandoning old phones. If
we don't bring their code up to date then they will slowly
but surely diverge from the main codebase. When someone
finally does use them, the exception will be confusing and
not similar to current exceptions. To fix anything would
require two steps. One is to bring the code up to date, and
then the second would be to fix the underlying issue. The
difficulty of the first (which gets harder and harder over
time) means the latter becomes less and less likely to be fixable.
See the lgtm_520 module for a good example of exactly this happening.
Quite frankly BitPim is a community project. If people in
that community stop paying attention, or don't care about
particular models then I don't see the point of BitPim also doing so.
How do we conclude this?


Simon
Joe Pham
2006-04-12 12:28:55 UTC
Permalink
Post by Roger Binns
That is because you are treating the terminator as a byte.
Because that's how it is currently defined and implemented. That's not to say it can't be changed, but it is what it is.
Post by Roger Binns
The terminator should be in the encoding of the string. Additionally
there is no reason why the terminator can't be specified as u','
or ",\x00".
Agreed, and that was why I pointed out this issue wrt unicode processing.
Post by Roger Binns
I don't see any reason why this unified string class won't work
providing careful attention is paid to when conversions happen and
what encoding pieces of data (like the terminator) are in.
I don't think so either, but some underlying changes must be made to support unicode strings. The issue is not to break current phones with the changes, but allows coders to trannsition their phone support to unicode strings smoothly and properly. That was why I suggested the idea of not doing any unicode processing if no codec is specified. Otherwise, I have no preference between a unified STRING class and a new unicode STRING class.
Post by Roger Binns
I think it is especially dangerous abandoning old phones.
I don't think that's what we're saying, but we do need access to the phone to test out any changes. In my particular case, I no longer have access to the Samsung A650 and LG VX8100/9800.
Post by Roger Binns
If people in that community stop paying attention, or don't care
about particular models then I don't see the point of BitPim also
doing so.
I think we do care, otherwise we wouldn't be concerned about breaking support to existing/old phones.

-Joe Pham



_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!
Roger Binns
2006-04-13 03:25:57 UTC
Permalink
Post by Joe Pham
Post by Roger Binns
That is because you are treating the terminator as a byte.
Because that's how it is currently defined and implemented.
That's not to say it can't be changed, but it is what it is.
The terminator should definitely be in the phone encoding since
the data comes back from the phone in the phone's encoding.
Post by Joe Pham
That was why I suggested the idea of not doing any unicode
processing if no codec is specified.
Ultimately unicode processing has to happen somewhere. At the
moment it happens when we try to store the strings into the
database when the user gets a bewildering exception. I way
prefer making that a known exception somewhat closer to the
code that deals with phone data. ie today we already generate
exceptions with non-ascii data. Having that exception move
into one place in the STRING class seems way more sensible.
Post by Joe Pham
Post by Roger Binns
I think it is especially dangerous abandoning old phones.
I don't think that's what we're saying,
If there are two string classes then that is exactly what
is happening. The old phones will all end up using STRING
and the new code/models will use UNICODE. And newly
developed and updated code (eg new phonebook, calendar,
memo etc) will only be tested against UNICODE and slowly
lose compatibility with STRING. Let a period of time pass
and it will be a big developer burden to update the old
phone code. That is why I think it is better to update the
code anyway. If people do use and do encounter issues
they will be way easier to fix.
Post by Joe Pham
Post by Roger Binns
If people in that community stop paying attention, or don't care
about particular models then I don't see the point of BitPim also
doing so.
I think we do care, otherwise we wouldn't be concerned about
breaking support to existing/old phones.
That is ultimately your call. I personally don't care about
breaking support for old phones. I'd rather have the code be
up to date so if someone does use it we can easily fix it,
than let it languish, working in theory, but not be able to
fix it if someone reports an issue.

Roger
s***@bitpim.org
2006-04-13 03:30:48 UTC
Permalink
Post by Joe Pham
If we all decide to go with a new class, I'd suggest
restoring the STRING class and working on the new one. The
STRING class works fine for string fields and needs no
changes at this time. The terminator issue applies to
unicode strings only, which would hopefully be handled
properly in the new class.
I've checked in the USTRING class and restored STRING.
I've also converted 9 phones to use the new class and tested them as much as
I can. Between them they exercise most of the new functionality.
I think we should plan to delete the old STRING class within some time
period (4-6 weeks) meaning we will have to upgrade all the other phones
(~40) but without the pressure of having to get it all done before the next
build, the new class is contains all the same functionality as the old.
Stephen can you do the sanyo/samsungs you have? I have a couple of sanyos so
I could have a go at these if you want.

We also need to fix the CSVSTRING class to fully implement support for phone
codesets.

The new class allows you to set a multi-byte terminator. To declare a string
with a two byte 0 terminator using utf-16le you would do.
40 USTRING { 'terminator':0, 'terminator_length':2, 'encoding':'utf-16le' }
name
The 2 byte terminator counts towards the number of bytes when calculating
'sizeinbytes'. The keywords only affect the string written to/read from the
phone.

I have not tested with a terminator longer than 1 byte, but I can look at
logs if it does not work.

Simon
Simon C
2006-04-13 03:41:17 UTC
Permalink
Post by Roger Binns
The terminator should definitely be in the phone encoding
since the data comes back from the phone in the phone's encoding.
Out of the 350+ terminators defined in the phones so far only about 9% are
characters, the rest are 0 or non-character codes e.g. 0xA, and this is not
counting STRINGs without explicitly defined terminators that use the default
0.

I don't think it is worth the time to code and debug the ability to use
characters *and* integers as terminators.

Simon
Roger Binns
2006-04-13 04:48:32 UTC
Permalink
Post by Simon C
Out of the 350+ terminators defined in the phones so far only about 9% are
characters, the rest are 0 or non-character codes e.g. 0xA, and this is not
counting STRINGs without explicitly defined terminators that use the default
0.
I don't think it is worth the time to code and debug the ability to use
characters *and* integers as terminators.
Changing them all to characters/strings would make the most sense.
It is generally bad offering too much choice since that makes
it easier for bugs to exist in the code, and makes more codepaths
to maintain.

Roger
Simon C
2006-04-13 05:30:43 UTC
Permalink
Post by Roger Binns
Changing them all to characters/strings would make the most sense.
I don't think it is possible to only use strings/characters.
The pm225 uses the byte 0xA as a terminator for a variable length string,
not a valid iso-8859-1 character, it will give a codec error.
The iso-8859-x sets don't define a NUL so iso-8859-x cstrings would have the
same problem.


Simon
Roger Binns
2006-04-13 06:49:47 UTC
Permalink
Post by Simon C
I don't think it is possible to only use strings/characters.
The pm225 uses the byte 0xA as a terminator for a variable length string,
not a valid iso-8859-1 character, it will give a codec error.
The iso-8859-x sets don't define a NUL so iso-8859-x cstrings would have the
same problem.
The terminator is not part of the string returned and hence isn't
supplied to the codec for conversion. Technically 0x0a is part
of Unicode and ISO-8859-1.

http://www.unicode.org/charts/PDF/U0000.pdf

Roger
Simon C
2006-04-13 07:06:41 UTC
Permalink
Post by Roger Binns
The terminator is not part of the string returned and hence
isn't supplied to the codec for conversion. Technically 0x0a
is part of Unicode and ISO-8859-1.
What I mean is that if we =define terminators as characters/strings, as I
thought you suggested, instead of integers (like we do now) we will not be
able to support terminators that are not representable as characters.

Simon
Roger Binns
2006-04-13 07:35:21 UTC
Permalink
Post by Simon C
What I mean is that if we =define terminators as characters/strings, as I
thought you suggested, instead of integers (like we do now) we will not be
able to support terminators that are not representable as characters.
The stated problem was that we allow multiple types for terminators
including integers and strings. We should only allow one type with
very well defined semantics. I don't care if it is strings, integers
or sequence of integers. It should just be consistent. The secondary
point is that the terminator is not passed to the codecs so it doesn't
matter if Python's codecs leave particular codepoints in or out.

Roger
Simon C
2006-04-13 07:46:20 UTC
Permalink
Post by Roger Binns
The stated problem was that we allow multiple types for
terminators including integers and strings. We should only
allow one type with very well defined semantics. I don't
care if it is strings, integers or sequence of integers. It
should just be consistent. The secondary point is that the
terminator is not passed to the codecs so it doesn't matter
if Python's codecs leave particular codepoints in or out.
Meanings were garbled in e-mail. We are in violent agreement :-)
As it stands in SVN the new code uses integers only for terminators, and
does not pass the terminator to python's codecs.

Simon
Joe Pham
2006-04-14 18:48:32 UTC
Permalink
The old phones will all end up using STRING and the new code/models
will use UNICODE. And newly developed and updated code (eg new
phonebook, calendar, memo etc) will only be tested against UNICODE
and slowly lose compatibility with STRING. Let a period of time
pass and it will be a big developer burden to update the old phone
code.
I see your point, though I don't completely agree with it. But in any case, Simon' approach with coming up with a new class and phasing out the old one would make this a non-issue altogether.

-Joe Pham



_____________________________________________________________________
Call Anyone, Anytime, Anywhere in the World - FREE!
Free Internet calling from NetZero Voice
Visit http://www.netzerovoice.com today!

Loading...