[BitPim-devel] Phone encodings (revisited)

Simon C

2006-04-01 16:24:10 UTC

Roger,

As a partial fix for the encoding problem can we do the following?
1) Add an encoding type attribute to the phone profile (default None). It is
up to the developer to specify the phone's encoding; otherwise the default
(ascii) is used.
2) When send/get phone data is selected, change the default encoding to what
the phone supports.
3) When send/get completes (or on exception), set the encoding back to the
default (ascii).

This still leaves graceful failure to be implemented.

I did some tests on my phone and it worked retrieving and sending non-ascii
characters using iso-8859-1.

For changing the encoding I used (in mainwindow):

    def __init__(self):
        # remember the interpreter's startup default (normally ascii)
        self.default_unicode_encoding = sys.getdefaultencoding()

    def BPSetUnicodeEncoding(self, encoding):
        # Python 2 only: switches the process-wide default encoding
        if encoding is None:
            encoding = self.default_unicode_encoding
        print "setting encoding to " + repr(encoding)
        if sys.getdefaultencoding() != encoding:
            # set locale for phone
            if not hasattr(sys, 'setdefaultencoding'):
                # site.py deletes sys.setdefaultencoding; reload(sys) brings
                # it back but resets stdout/stderr, so save and restore them
                savedStdout = sys.stdout
                savedStderr = sys.stderr
                reload(sys)
                sys.stdout = savedStdout
                sys.stderr = savedStderr
            sys.setdefaultencoding(encoding)

Simon

Roger Binns

2006-04-01 18:39:57 UTC

Post by Simon C
As a partial fix for the encoding problem can we do the following?

It isn't a fix - it is sweeping the problem under the rug.

Post by Simon C
sys.setdefaultencoding(encoding)

That affects all threads including the main gui one. The best thing
to do is to fix this properly.

Go and fix one model writing the code needed to get it right.
(Remember things like the phone filesystem may be in a different
encoding or may not support non-ascii chars). Rinse and repeat
on a separate and totally unrelated model. Do for a 3rd model.
Don't forget the pelephone model.

At this point you'll have an idea as to how to abstract this
out and provide common code for each model to use.

Roger

Simon C

2006-04-02 00:07:54 UTC

Post by Roger Binns
Go and fix one model writing the code needed to get it right.
(Remember things like the phone filesystem may be in a
different encoding or may not support non-ascii chars).
Rinse and repeat on a separate and totally unrelated model.
Do for a 3rd model.
Don't forget the pelephone model.

Ok, I've done this for three phones with the phonebook names, and allowed for
ascii and another encoding to co-exist on the same phone. It is compatible
with Pelephone.

I added a new PACKET data type of UNICODE_STRING, which is the same as STRING
except for writetobuffer and readfrombuffer: they convert the data to and
from unicode when reading/writing the buffer. Internally it stores the
string as unicode; the conversion occurs as the data is written to or read
from the phone.
The encoding used is set by a function in prototypes and is global for all
UNICODE_STRINGs; it can be changed at any time. The __init__ of the phone
profile sets the encoding to the 'phone_encoding' attribute of the profile
if defined, or to ascii if not.
If we need to support more than one encoding on a phone, we could create
another string class, UNICODE_STRING_2.

To add support for an encoding on a phone the developer has to do this.
1) add the attribute phone_encoding=XXXX where XXXX is the encoding used.
2) In the protocol file use UNICODE_STRING for strings that are encoded.

I'm not going to change the p_brew packets to use this. I don't know if
non-ascii works and I don't want to bugger up my phone trying.

Simon
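
The UNICODE_STRING design described above can be sketched roughly as follows. This is an illustration in modern Python, not BitPim's actual code: set_phone_encoding, the bytearray standing in for the buffer, and the method bodies are assumptions. Only the names UNICODE_STRING, writetobuffer, readfrombuffer and phone_encoding come from the message.

```python
# Hypothetical sketch of a string field that is unicode internally and
# converts only at the buffer boundary, using one module-wide encoding.
_phone_encoding = "ascii"


def set_phone_encoding(encoding=None):
    """Select the charset used on the wire; None falls back to ascii."""
    global _phone_encoding
    _phone_encoding = encoding or "ascii"


class UNICODE_STRING:
    def __init__(self, value=""):
        self.value = value  # always held as unicode text

    def writetobuffer(self, buf):
        # buf is assumed to be a bytearray; BitPim's real buffer class differs
        buf.extend(self.value.encode(_phone_encoding))

    def readfrombuffer(self, data):
        self.value = bytes(data).decode(_phone_encoding)


# A profile with phone_encoding = "iso-8859-1" would trigger this call.
set_phone_encoding("iso-8859-1")
buf = bytearray()
UNICODE_STRING("Fern\u00e1ndez").writetobuffer(buf)

echoed = UNICODE_STRING()
echoed.readfrombuffer(buf)
```

Because the setting is shared by every field, a phone that mixes charsets cannot be described this way, which is the limitation the rest of the thread goes on to address.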

Joe Pham

2006-04-02 04:48:15 UTC

Post by Simon C
I added a new PACKET data type of UNICODE_STRING,

I don't think there's a need for this new type. All STRING should be able to handle unicode. Ideally, the Phone class can:

1. Define 0 (default/as-is), 1 or more charsets that this phone supports.
2. Common routines to iterate and handle unicode conversion.

To add unicode handling to a phone model, (1) define the appropriate charset, (2) update the conversion routines.

-Joe Pham


Simon C

2006-04-02 18:04:51 UTC

Post by Joe Pham

Post by Simon C
I added a new PACKET data type of UNICODE_STRING,

I don't think there's a need for this new type. All STRING

How can the STRING class determine which charset to use if the phone
supports more than one?

Simon

Joe Pham

2006-04-02 18:55:40 UTC

Post by Simon C
How can the STRING class determine which charset to use if the phone
supports more than one?

It should iterate through the supported charsets until the conversion is successful. If none is found, it should behave as-is.

-Joe Pham


Roger Binns

2006-04-02 19:53:36 UTC

Post by Joe Pham
It should iterate through the supported charsets until the
conversion is successful.

That is by far the worst way of doing it! If you had a random
stream of bytes and wanted to convert to unicode then telling
it to use almost any encoding would work. Conversely going
from unicode to a particular byte encoding will often give
you errors, even if the encoding is correct. The *ONLY* way
to do conversions correctly is if you know what the encoding is.

Joel has a good article on Unicode:

The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets
(No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

Roger
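
A short demonstration of why trial decoding misleads, written as a sketch in modern Python. The helper guess_decode and the sample data are made up for illustration. Latin-1 assigns a character to every byte value, so it "succeeds" on any input, while encoding can fail even when the encoding is the right one for the phone.

```python
def guess_decode(raw, charsets):
    """Try each charset in turn and return the first that does not raise."""
    for name in charsets:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, raw


# Suppose the phone really sent UTF-8.
data = "Fern\u00e1ndez".encode("utf-8")

# ascii fails on the first high byte, latin-1 then accepts anything,
# so utf-8 is never tried and the text comes out mangled.
charset, text = guess_decode(data, ["ascii", "latin-1", "utf-8"])

# Every possible byte value decodes under latin-1 without error.
all_bytes_decode = len(bytes(range(256)).decode("latin-1")) == 256

# Going the other way can fail with the correct encoding:
# i-with-macron (U+012B) has no latin-1 representation.
try:
    "Gr\u012bd".encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```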

Simon C

2006-04-02 19:11:26 UTC

Post by Joe Pham

Post by Simon C
How can the STRING class determine which charset to use if the phone
supports more than one?

It should iterate through the supported charsets until the
conversion is successful. If none is found, it should behave as-is.

This will prevent control over the charset of a particular field in a
packet.
Say the filesystem on a phone uses the ascii charset and the phonebook uses
latin-1.
A user creates a directory with a name containing an accented e.
The STRING class will iterate through the supported charsets, convert it to
latin-1 and send it to the phone instead of giving an error.

Simon

Simon C

2006-04-02 20:58:39 UTC

Unicode decode errors occur if the phone is sending us text in a charset we
are not expecting. This can be fixed by using the UNICODE_STRING and
configuring it to match the phone. I propose adding a new exception
(PhoneStringDecodeException) for this case, but no exception handler, no rug
sweeping here:-).

Unicode encode errors occur if the user enters text into bitpim (or gets
data from another phone) that we cannot convert into the charset of the
phone. In the case of ascii we can try to silently degrade. If another
charset is used, or no graceful degrading is possible, I propose to add
a new exception (PhoneStringEncodeException) which is handled in the gui
exception handler. Screenshot attached.

I plan to change the regular string class to use unicode internally, it will
then be the same as the UNICODE_STRING class, but hardcoded to use ascii.
I've found some graceful degrading code examples for latin-1 on-line that I
plan to add.

Thoughts?

Simon
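
The two proposed exceptions could look something like the sketch below. The exception names are the ones proposed in the message. The helper functions decode_from_phone and encode_for_phone, the constructor arguments and the message text are assumptions made for illustration, and the silent degrading for ascii is left out.

```python
class PhoneStringDecodeException(Exception):
    """The phone sent bytes that are not valid in the configured charset."""

    def __init__(self, raw, encoding):
        super().__init__("cannot decode %r as %s" % (raw, encoding))
        self.raw = raw
        self.encoding = encoding


class PhoneStringEncodeException(Exception):
    """The text cannot be represented in the phone's charset."""

    def __init__(self, text, encoding):
        super().__init__("cannot encode %r as %s" % (text, encoding))
        self.text = text
        self.encoding = encoding


def decode_from_phone(raw, encoding="ascii"):
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # phone-specific error, distinguishable from other unicode problems
        raise PhoneStringDecodeException(raw, encoding)


def encode_for_phone(text, encoding="ascii"):
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        raise PhoneStringEncodeException(text, encoding)
```

Wrapping the built-in errors this way lets a GUI handler catch only the encode case while decode failures propagate as ordinary bugs.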

Roger Binns

2006-04-02 23:11:31 UTC

Post by Simon C
I plan to change the regular string class to use unicode internally, it will
then be the same as the UNICODE_STRING class, but hardcoded to use ascii.

I like a combination of this and your earlier plan. Make the STRING
class take an encoding parameter which defaults to ascii. Internally
it stores as UNICODE but reads and writes in the encoding.

If all variants of a phone use the same encoding then it can just be
added in the p_ file. If there is some other factor that determines
the encoding, then com_ file will need to specify the parameter
at decode/encode time. Note that PACKETs can take a field of type
P which is accessible to all other fields.

The really difficult part is to make sure that round trips work
correctly. For example if a contact is created named Grīd
but is written to the phone as Grid. Later on when we
read back from the phone, we'll just see Grid. (Even
worse we may have dropped that letter). Similarly we
may write out a ringtone with that name and then read
back in a ringtone with the munged name. The usual
solution to this is some sort of escaping. However
we have to be careful as the phone won't like it.
Also names are stored in index files and may not match
the name ultimately written to disk.

I think it would be best if there are two parameters for
encoding - one for reading and one for writing.

The good thing about handling this internally in the STRING
class is we can generate a specific exception as you showed
which means that encoding issues dealing with the phone
won't be confused with encoding issues elsewhere.

Roger
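
The round-trip problem can be reproduced with Python's built-in error handlers. This is a sketch in modern Python; the helper differs_after_round_trip is a made-up name, and ascii stands in for whatever charset the phone accepts.

```python
original = "Gr\u012bd"  # "Grid" with a macron over the i

# "replace" substitutes a question mark; "ignore" drops the letter entirely.
replaced = original.encode("ascii", "replace")
dropped = original.encode("ascii", "ignore")


def differs_after_round_trip(text, encoding="ascii", errors="replace"):
    """True when writing then reading back would not reproduce the text."""
    return text.encode(encoding, errors).decode(encoding) != text
```

A check like this only detects that data would be lost. It does not recover the original name, which is why some form of escaping or a merge policy is still needed.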

Simon C

2006-04-08 05:29:20 UTC

Post by Roger Binns
I like a combination of this and your earlier plan. Make the STRING
class take an encoding parameter which defaults to ascii. Internally
it stores as UNICODE but reads and writes in the encoding.

During testing I've found a problem with internal storage as unicode.
Some packets contain "deleted" data, and their strings are garbage. Other
parts of the code check the packet, determine that it is garbage, and never
use it; but if we convert to unicode on read, the conversion throws an
exception on the garbage.
I've changed the code to convert on "getvalue" instead: a just-in-time
conversion, which throws an exception if the conversion fails.
If a packet is read from the phone and then written back without the value
being fetched, no conversion occurs and the same data as read is written back.

Simon
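
The just-in-time conversion might be sketched as below. LazyString and its method bodies are assumptions for illustration in modern Python; only the idea of converting in getvalue rather than on read comes from the message.

```python
class LazyString:
    """Keeps the raw bytes from the phone and decodes only on demand."""

    def __init__(self, encoding="ascii"):
        self._encoding = encoding
        self._raw = b""
        self._text = None

    def readfrombuffer(self, raw):
        # No decoding here, so garbage in a "deleted" packet is harmless.
        self._raw = bytes(raw)
        self._text = None

    def getvalue(self):
        if self._text is None:
            # Raises UnicodeDecodeError if the bytes are not valid text.
            self._text = self._raw.decode(self._encoding)
        return self._text

    def setvalue(self, text):
        self._raw = text.encode(self._encoding)
        self._text = text

    def writetobuffer(self):
        # Untouched fields are written back byte for byte.
        return self._raw


garbage = b"\xff\xfe\x00junk"
field = LazyString()
field.readfrombuffer(garbage)
```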

Simon C

2006-04-03 01:23:55 UTC

Post by Roger Binns
I like a combination of this and your earlier plan. Make the
STRING class take an encoding parameter which defaults to
ascii. Internally it stores as UNICODE but reads and writes
in the encoding.
If all variants of a phone use the same encoding then it can
just be added in the p_ file. If there is some other factor
that determines the encoding, then com_ file will need to
specify the parameter at decode/encode time. Note that
PACKETs can take a field of type P which is accessible to all
other fields.
I think it would be best if there are two parameters for
encoding - one for reading and one for writing.

Examples with new keywords:
* STRING { 'encoding': 'latin-1' } field1
* STRING { 'read_encoding': 'latin-1' } field # ascii used for writing
* STRING { 'write_encoding': 'latin-1' } field # ascii used for reading
* STRING { 'read_encoding': 'latin-1', 'write_encoding': 'latin-2' } field
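
One way the three keywords could be resolved into a read/write pair is sketched below. The function resolve_encodings is hypothetical, and letting a specific read_encoding or write_encoding override a general 'encoding' when both are given is an assumption; the thread does not say what should happen in that case.

```python
def resolve_encodings(options):
    """Return (read_encoding, write_encoding) for a STRING field's options."""
    default = options.get("encoding", "ascii")
    return (
        options.get("read_encoding", default),
        options.get("write_encoding", default),
    )


both = resolve_encodings({"encoding": "latin-1"})
read_only = resolve_encodings({"read_encoding": "latin-1"})
write_only = resolve_encodings({"write_encoding": "latin-1"})
split = resolve_encodings(
    {"read_encoding": "latin-1", "write_encoding": "latin-2"}
)
```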

Post by Roger Binns
The really difficult part is to make sure that round trips
work correctly. For example if a contact is created named
Grid but is written to the phone as Grid. Later on when we
read back from the phone, we'll just see Grid. (Even worse
we may have dropped that letter). Similarly we may write out
a ringtone with that name and then read back in a ringtone
with the munged name. The usual solution to this is some
sort of escaping. However we have to be careful as the phone
won't like it.
Also names are stored in index files and may not match the
name ultimately written to disk.

I am working on simple dumbing down: removing accents and making
substitutions, with no dropping of characters.

We will have the problem of distinguishing real change in the phones data
from munged data.

Contacts should be OK because of the index, although we will dumb down when
we read back and is that really a bad thing?

We already munge the ringtone filenames (RINGTONE_FILENAME_CHARS) not that
we shouldn't fix it.
Are you ok with handling unmunging later as a separate item? I cannot think
of how to do it using escaping, some fields are very limited on space.

Simon
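
A minimal sketch of this kind of dumbing down in modern Python, using the standard unicodedata module. It is not the code that was checked in: the function name dumb_down and the small SUBSTITUTIONS table are assumptions, and characters with no ascii equivalent are passed through for the caller to handle.

```python
import unicodedata

# Letters that have no accent to strip need an explicit replacement.
SUBSTITUTIONS = {
    "\u00df": "ss",  # sharp s
    "\u00e6": "ae",
    "\u00c6": "AE",
    "\u00f8": "o",
    "\u00d8": "O",
}


def dumb_down(text):
    """Strip accents and apply substitutions without dropping characters."""
    out = []
    for ch in text:
        ch = SUBSTITUTIONS.get(ch, ch)
        # NFKD splits an accented letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(
            c for c in decomposed if not unicodedata.combining(c)
        )
        out.append(stripped or ch)  # never lose a character outright
    return "".join(out)
```

For example, dumb_down turns "Stéphane" into "Stephane" rather than dropping the accented letter.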


Roger Binns

2006-04-03 01:49:23 UTC

Post by Simon C

Post by Roger Binns
The really difficult part is to make sure that round trips
work correctly. For example if a contact is created named
Grid but is written to the phone as Grid.

Hey your mailer munged what I wrote. The first "grīd"
has an i with a macron (horizontal line) over it.

Post by Simon C
We will have the problem of distinguishing real change in the phones data
from munged data.

The phonebook merge code believes the name already in BitPim
over that coming from the phone.

Post by Simon C
Contacts should be OK because of the index, although we will dumb down when
we read back and is that really a bad thing?

I have friends with accented letters in their names. It is unbelievably
bad how they keep being stripped off in places. (Even worse is the guy
whose name is Stéphane. Many places assume he is actually a she and
unable to spell her own name, Stephanie.)

Post by Simon C
We already munge the ringtone filenames (RINGTONE_FILENAME_CHARS) not that
we shouldn't fix it.

That is for names as the user adds to the BitPim user interface. There
have been cases in the past where people have downloaded ringtones
over the air and they have had accented characters in the filename.
(One even had a backslash IIRC).

Post by Simon C
Are you ok with handling unmunging later as a separate item? I cannot think
of how to do it using escaping, some fields are very limited on space.

I don't think anything should be done for tomorrow's build. After that
stick some sort of plan in dev-doc. Worst case we could use UTF-7 :-)

Roger

Simon C

2006-04-03 03:20:50 UTC

Post by Roger Binns
The really difficult part is to make sure that round trips work
correctly. For example if a contact is created named Grid but is
written to the phone as Grid.

Hey your mailer munged what I wrote. The first "grid"
has an i with a macron (horizontal line) over it.

I'm using Outlook. Sometimes it munges so badly that I see precisely nothing:
I get a 6K e-mail with no displayed characters and have to view the
archive on sourceforge. At least it degraded gracefully in this case. If you
know of a good mac client that can read an outlook .pst file...

I don't think anything should be done for tomorrow's build.

The modified STRING will be ready in time and a couple of phones adjusted
and tested, do you want me to hold off with this change?
I tried the vx4400, it dislikes non-ascii characters, but at least has the
good sense to display a ?. This phone will stay ascii only.

Simon

Roger Binns

2006-04-03 03:45:08 UTC

If you know of a good mac client that can read an outlook .pst file...

The standard Mac mailer can do pop3 so you could forward all your mail
to yourself.

Using COM against Outlook (script from Python-com) you can get at
all the messages and dump them in whatever format you want.

Alternatively you can use Thunderbird to get them out of Outlook
on the Windows box and then copy the Thunderbird data over to
the Mac.

Post by Roger Binns
I don't think anything should be done for tomorrow's build.

The modified STRING will be ready in time and a couple of phones adjusted
and tested, do you want me to hold off with this change?

Given we failed to release the last two builds, it is worth waiting
until we release something :-)

I tried the vx4400, it dislikes non-ascii characters, but at least has the
good sense to display a ?. This phone will stay ascii only.

You can use ISO-8859-1 with the 4400. It supports Spanish.

Roger

Greg Pratt

2006-04-03 04:06:29 UTC

On Sunday, 02 April 2006, Simon C is rumored to have said...

Post by Simon C

Post by Roger Binns
I don't think anything should be done for tomorrow's build.

The modified STRING will be ready in time and a couple of phones adjusted
and tested, do you want me to hold off with this change?
I tried the vx4400, it dislikes non-ascii characters, but at least has the
good sense to display a ?. This phone will stay ascii only.

Actually, I was still using a VX4400 when I first started using BitPim.
At that point, I discovered that if I put "Fernández" into BitPim's
phone book and wrote that to the phone, it would show up correctly on
the phone side. While it's possible there are some characters that
aren't used correctly on the phone, it seems clear that the VX4400 uses
ISO 8859-1 (Latin-1) or some close variant.

At some point, however -- I don't recall when -- I noticed this same name
would be written to my phone as "FernÃ¡ndez". It looked fine in BitPim,
but something changed, and raw UTF-8 characters were being written out
to the phone. I never reported it as a bug at the time, as I didn't
know enough about how BitPim worked to give a proper diagnosis inside of
the package. (I barely do now.)

So, ISO-8859-1 should be a close enough match for the VX4400. I'm sure
some test entries could be created to see what, if any, characters from
that set are *not* copied correctly.

--
Gregory Pratt ***@panix.com
East Rutherford, NJ, USA http://www.panix.com/~gp/
"The only good spammer is a dead spammer."
PGP Key Fingerprint: DC60 FCDE 91E2 3D41 91A3 45DB B474 3D3A 3621 AAFE
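
The symptom described above is what results when UTF-8 bytes are displayed by a device that reads them as ISO 8859-1. A short sketch in modern Python, with variable names invented for illustration:

```python
name = "Fern\u00e1ndez"

# What a Latin-1 phone expects: one byte per character.
latin1_bytes = name.encode("iso-8859-1")

# What was apparently being sent instead: a-acute becomes two bytes.
utf8_bytes = name.encode("utf-8")

# The phone shows each of those two bytes as its own Latin-1 character.
shown_on_phone = utf8_bytes.decode("iso-8859-1")
```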

Roger Binns

2006-04-03 04:25:15 UTC

Post by Greg Pratt
While it's possible there are some characters that
aren't used correctly on the phone, it seems clear
that the VX4400 uses ISO 8859-1 (Latin-1) or some
close variant.

In the testing that I did at the time, all I found
was close compliance with 8859-1.

The original version of BitPim was not Unicode.
Roughly speaking, the bytes that came in were the
bytes that went out. However the user interface
also ran in its own codeset. Consequently the
data would get mangled depending on where it came
from and/or where you edited it. The advent of
switching to unicode fixed all that :-)

Roger

Simon C

2006-04-04 05:35:27 UTC

Post by Roger Binns
I don't think anything should be done for tomorrow's build.
After that stick some sort of plan in dev-doc. Worst case we
could use UTF-7 :-)

I've checked in the changes for the STRING class. I've added a help file for
the exception handler, it needs a help ID and gui.py needs updating with it.

p_lgvx8100.py contains an example of how to add support for non-ascii
encoding, search for 'encoding' with the quotes.
I also added strings_in_bitpim.html under the dev-docs with a bit more info.

Simon

Joe Pham

2006-04-05 22:16:54 UTC

Post by Simon C
I've checked in the changes for the STRING class. I've added a help
file for the exception handler, it needs a help ID and gui.py needs
updating with it.

This change broke the Filesystem 'Backup directory' and 'Backup entire tree' functions.

-Joe Pham


Simon C

2006-04-05 23:39:15 UTC

Post by Joe Pham
This change broke the Filesystem 'Backup directory' and
'Backup entire tree' functions.

Patched. The zip library does not work with unicode, so for any phones
that support non-ascii characters the backups will not be correctly named.
Add looking for a new zip library to the todo list?

Simon

Joe Pham

2006-04-06 00:32:08 UTC

zi.filename=k[strip:].encode('ascii', 'replace')

For encoding, the replacement char is a "?", which is an illegal windows filename char.

-Joe Pham
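
One possible way around the "?" problem is to pick the placeholder explicitly instead of relying on the codec's 'replace' handler. The sketch below is in modern Python; safe_archive_name, the choice of "_" and the set of characters treated as illegal are assumptions. The forward slash is kept because zip member names use it as the path separator.

```python
# Characters Windows rejects in file names, minus "/" (zip path separator).
ILLEGAL = set('<>:"\\|?*')


def safe_archive_name(name, placeholder="_"):
    """Replace non-ascii and Windows-illegal characters with a placeholder."""
    return "".join(
        ch if ord(ch) < 128 and ch not in ILLEGAL else placeholder
        for ch in name
    )


# The built-in handler produces the problematic character.
builtin = "caf\u00e9.mid".encode("ascii", "replace")

cleaned = safe_archive_name("brew/ringer/caf\u00e9?.mid")
```

Distinct names can still collide after substitution, so a real fix would also need to handle duplicates.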


Joe Pham

2006-04-08 17:28:56 UTC

Post by Simon C
During testing I've found a problem with internal storage as unicode.

Since this affects all the phones, I think it's better to do a thorough test on it. I'm planning a build next Monday to clean up all the known discrepancies, and to leave this one out. That would allow more test time as well as give us a reasonably stable build. IMHO, being able to handle Unicode is nice, but not worth the risk at this point.

-Joe Pham


Simon C

2006-04-08 18:45:09 UTC

Post by Joe Pham
Since this affects all the phones, I think it's better to do a thorough
test on it. I'm planning a build next Monday to clean up all the known
discrepancies, and to leave this one out. That would allow more test time as
well as give us a reasonably stable build. IMHO, being able to handle
Unicode is nice, but not worth the risk at this point.

To get an idea of coverage, has anyone done any phone testing this week with
these changes included?
I've tried several phones (including a couple of sanyos) and so far have
only found 2 issues which I've fixed.

Simon

Simon C

2006-04-10 04:39:21 UTC

Post by Joe Pham

Post by Simon C
During testing I've found a problem with internal storage as unicode.

Since this affects all the phones, I think it's better to do
a thorough test on it. I'm planning a build next Monday to
clean up all the known discrepancies, and to leave this one out.
That would allow more test time as well as give us a
reasonably stable build. IMHO, being able to handle Unicode
is nice, but not worth the risk at this point.

The CSVSTRING class needs to be modified to handle non-ascii charsets in
order for some of the samsungs and gsm phones to have non-ascii charset
capabilities.
I cannot test this class, so someone else will have to change it. I've
re-organised the STRING class to make this easier.

Simon
