[Bitpim-devel] Re: CSV Export Code / Merging urls

Roger Binns

2004-05-29 03:23:29 UTC

[Adit has been working on some of the import/export code. This is a
response to his most recent email to me which I have posted here so
it ends up in the archives, as well as being of interest to others.]

Post by Adit Panchal
The only problem is, currently, the import csv dialog only permits 1 url

Actually it permits as many as you want. The trick is that you
simply give all the columns the same name ("Web Page"). The first column
named "Web Page" becomes the first URL, the second becomes the second, etc.
(At some point I should really write documentation about that. The same
principle applies to almost any of the columns. :-)
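
The repeated-column convention described above can be sketched as follows. This is only an illustration, not BitPim's import code: the function name collect_repeated_columns and the sample data are made up, and the sketch assumes the first CSV line holds the headers.

```python
import csv
import io

def collect_repeated_columns(csv_text):
    """Group each row's values by header name.

    Columns that share a header (e.g. several "Web Page" columns)
    end up in one list, in left-to-right column order.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)  # assumption: first line is the header row
    rows = []
    for record in reader:
        entry = {}
        for header, value in zip(headers, record):
            if value:  # skip empty cells
                entry.setdefault(header, []).append(value)
        rows.append(entry)
    return rows

# Hypothetical sample data with two columns both named "Web Page".
sample = (
    "Name,Web Page,Web Page\n"
    "Alice,example.com,example.org\n"
)
rows = collect_repeated_columns(sample)
print(rows[0]["Web Page"])  # ['example.com', 'example.org']
```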

Post by Adit Panchal
url which is a plain type (neither home nor business).

You can use the following columns:

Web Page
Home Web Page
Business Web Page

Post by Adit Panchal
there is an entry for home or business url in the phonebook, the new
value will get compared, but if it needs to be replaced, it adds it as
a new url item without a type, instead of replacing. I didn't know if
we should replace it, since there is a possibility that the user may
want to add an additional url instead of replacing it. Let me know if I
should change this behavior.

This issue actually arises for all fields. Later on, when we have proper
sync methods and code, a more complicated algorithm will have to generate
transaction logs from incoming data and merge them into the existing
data, taking into account that the same transactions could already have
been applied from a different data source.

If it is a complete toss-up whether the imported data or the
existing data is correct, use the existing data. For CSV import it
is probably the imported data that is more correct, unless it is
a simpler form of the existing data (e.g. if the imported phone number is
123 456 7890 and the existing one is (123) 456-7890, then the existing
one is preferred).
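
A minimal sketch of that preference rule for phone numbers. The helper names digits_only and choose_phone are hypothetical, and treating "same digits" as "same number, keep the existing formatting" is an assumption made for illustration.

```python
import re

def digits_only(number):
    """Strip everything except digits so formatting differences vanish."""
    return re.sub(r"\D", "", number)

def choose_phone(existing, imported):
    """Pick which value to keep after a CSV import.

    Assumption: if both values contain the same digits, the imported one
    is just another spelling of the existing one, so the existing value
    wins. Otherwise the imported value is taken as more correct.
    """
    if digits_only(existing) == digits_only(imported):
        return existing
    return imported

kept = choose_phone("(123) 456-7890", "123 456 7890")
changed = choose_phone("(123) 456-7890", "123 456 7899")
print(kept)     # (123) 456-7890
print(changed)  # 123 456 7899
```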

At some point I would also like to calculate a score of how certain
the import code is that it did the right thing. That way users
will be presented with the least certain first and can adjust them
and ignore most of the rest.

Post by Adit Panchal
http://datamining.anu.edu.au/projects/linkage.html

The software looks great, but sadly is under a license that doesn't
allow it to be used with GPL projects. See this link
http://www.gnu.org/philosophy/license-list.html

The febrl license is a modified version of the Mozilla license but with
stuff removed, including the section that allowed for multiple licensing.
I have emailed the febrl people to see if they are prepared to do anything
about this.

Post by Adit Panchal
I found their string comparison routines (specifically winkler) to be
more accurate than the built-in python difflib. You can run
stringcmp.py to see a comparison of test matches.

I would recommend adding a layer of indirection so that either winkler or
difflib can be used. Something like this. I just made up the method names
and code, but it gives the general idea:

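    # Method names are made up; this only shows the shape of the idea.
    try:
        import winkler
        def cmpfunc(foo, bar):
            return winkler.routine(foo, bar)/3
    except ImportError:
        # winkler is not installed, so fall back to difflib
        import difflib
        def cmpfunc(foo, bar):
            f = difflib.routine(foo, bar, 7)
            if len(f): return f[0]
            return 0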
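
For comparison, here is a runnable sketch of the same indirection using a difflib call that does exist, difflib.SequenceMatcher(...).ratio(). The optional import is an assumption: it supposes that a stringcmp module exposes a callable winkler(a, b) returning a score between 0.0 and 1.0, which has not been checked against febrl. The name compare_strings is also made up.

```python
import difflib

try:
    # Assumption: an optional stringcmp module provides winkler(a, b).
    from stringcmp import winkler as _optional_compare
except ImportError:
    _optional_compare = None

def compare_strings(first, second):
    """Return a similarity score between 0.0 and 1.0."""
    if _optional_compare is not None:
        return _optional_compare(first, second)
    # Fallback: the standard library's sequence matcher.
    return difflib.SequenceMatcher(None, first, second).ratio()

print(compare_strings("sourceforge.net", "sourceforge.net"))
print(compare_strings("sourceforge.net", "sourceforge.org"))
```

Callers only ever see compare_strings, so swapping the underlying routine later does not touch the merge code.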

Roger

Adit Panchal

2004-06-01 07:10:51 UTC

I got some more work done on the merging this weekend.

I added a separate routine called comparestrings(origitem, impitem)
which does the try/except blocks with winkler and difflib as you
mentioned. I tested that and it works fine when stringcmp.py isn't
present.

I also added some code that will update the type if the entry
matches but the type is different. For example, in the phonebook, you
have "sourceforge.net" as home url. You import a CSV which has
"sourceforge.net" as a business url. Since we assume that the CSV is
more correct in this regard, we don't import the new entry, but we do
want to update the type. Therefore, the existing entry's type gets
changed from home to business. (I hope this is the behavior we want.
We can throw a flag in there, so that when the transaction log gets
written, we can have it take notice and do something else.)
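
The type-update behavior described above could look roughly like this. It is a sketch, not the submitted patch: merge_url is a hypothetical name, and representing each url as a dict with "url" and an optional "type" key is an assumption about the phonebook layout.

```python
def merge_url(existing_urls, imported):
    """Merge one imported url entry into a list of existing entries.

    Assumption: entries are dicts with a "url" key and an optional
    "type" key ("home" or "business"); a missing type means plain.
    """
    key = imported["url"].lower()
    for entry in existing_urls:
        if entry["url"].lower() == key:
            # Same url: keep the entry, but trust the CSV's type if given.
            if imported.get("type"):
                entry["type"] = imported["type"]
            return existing_urls
    # No match: add the imported url as a new entry.
    existing_urls.append(dict(imported))
    return existing_urls

urls = [{"url": "sourceforge.net", "type": "home"}]
merge_url(urls, {"url": "SourceForge.net", "type": "business"})
merge_url(urls, {"url": "bitpim.org"})
print(urls)
```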

Also, the regular expression now converts everything to lowercase, so
case irregularities shouldn't be a problem.

I tested the code again and it works as expected if there are multiple
entries of the same type and also mixed entries (home, business and
plain). In my test file, if you set the 3 url columns to "web page"
(plain), it doesn't add the entry if it already exists, but adds new
ones if they don't exist. Also, in my test file, if you set all 3 url
columns to "home" (the same would obviously apply to business) for
example, the types will be switched to home if there is a match, and
the 2nd time if there is no match, the existing value will be
overwritten. To me this seems rather strange, but it does follow the
rules of replacing if different, and not replacing if the same. Should
we have another flag so that a 2nd home column (if different from the
first) gets its own new entry instead of overwriting the
original?

The final thing I wanted to mention was that sometimes during the
import dialog I would get a Bus Error or a Segmentation Fault. This
would usually happen if I would try to import and cancel after
verification of data, but before the final import OK button is pressed.
Sometimes it would happen after going to import -> CSV... for the 2nd
or 3rd time during testing. Not really sure what the problem here is,
but it doesn't happen to me anywhere else in the program.

Another thing I was wondering about before I head to sleep: should we
prepend "http://" if the url does not contain the url
prefix? I wasn't planning to do so, because someone might use "ftp://"
or some other protocol and that would make things messy. I think
stripping the prefixes at merge time is good enough.

Sorry if there are any mistakes in the above message or things don't
make sense (I need sleep). The code and examples should speak for
themselves. You can play with the column headers in the import dialog
to see different results with different combinations. Please let me
know if you see any outstanding issues. The file is renamed to .zio
from .zip because neither comcast nor sourceforge likes zip files.
*shrugs*

Adit

Roger Binns

2004-06-02 06:05:23 UTC

Post by Adit Panchal
I got some more work done on the merging this weekend.

Thanks, I have merged it in with minor changes.

Post by Adit Panchal
I added a separate routine called comparestrings (origitem, impitem)
which does the try, exception blocks with winkler and difflib as you
mentioned. I tested that and it works fine when stringcmp.py isn't
present.

I emailed the febrl folks and they will be doing a new release of febrl
over the next few months that will be dual licensed under ANUOS and
GPL. They may also do an earlier release that is just to change the
license.

Post by Adit Panchal
business url. (I hope this is the behavior we want. We can throw a flag
in there, so that when the transaction log gets written, we can have it
take notice and do something else.)

At the moment all I would want is some sort of number as to how certain
the code is that it has done the right thing. There isn't any infrastructure
yet to record that certainty though. The final number shown in the UI
will include the certainty that the right existing entry was matched
combined with the certainty that the merging operation did the right
thing. It is the least certain ones that will require most user
attention.
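
One way such a combined number could be computed and used is sketched below. Multiplying the two certainties is an assumption chosen for illustration (the thread does not say how they would be combined), and the names and sample values are made up.

```python
def overall_certainty(match_certainty, merge_certainty):
    """Combine the two certainties; the product is an assumed choice."""
    return match_certainty * merge_certainty

# Hypothetical results: (entry name, match certainty, merge certainty)
results = [
    ("Alice", 0.95, 0.9),
    ("Bob", 0.6, 0.8),
    ("Carol", 1.0, 1.0),
]

# Least certain first, so the user reviews the doubtful merges at the top.
review_order = sorted(results, key=lambda r: overall_certainty(r[1], r[2]))
print([name for name, _, _ in review_order])  # ['Bob', 'Alice', 'Carol']
```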

Post by Adit Panchal
Also, the regular expression now converts everything to lowercase, so
case irregularities shouldn't be a problem.

I made the "cleaner" routine a passed-in parameter.
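
A sketch of what passing the cleaner in might look like. The names compare_cleaned and clean_url are hypothetical, and the difflib-based scoring is just a stand-in for whichever comparison routine is in use.

```python
import difflib
import re

def compare_cleaned(first, second, cleaner=None):
    """Compare two strings after running both through an optional cleaner."""
    if cleaner is not None:
        first, second = cleaner(first), cleaner(second)
    return difflib.SequenceMatcher(None, first, second).ratio()

def clean_url(url):
    """Example cleaner: trim, lowercase, and drop a leading http://."""
    return re.sub(r"^http://", "", url.strip().lower())

with_cleaner = compare_cleaned("HTTP://SourceForge.net", "sourceforge.net", clean_url)
without_cleaner = compare_cleaned("HTTP://SourceForge.net", "sourceforge.net")
print(with_cleaner, without_cleaner)
```

Each field type can then supply its own cleaner (urls, phone numbers, names) without the comparison code knowing about any of them.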

Post by Adit Panchal
(if different from the
first) gets its own new entry instead of overwriting the
original?

Pick whatever you think is most consistent :-)

Post by Adit Panchal
Not really sure what the problem here is,
but it doesn't happen to me anywhere else in the program.

Seg faults/bus errors can be tracked in your debugger. On the Mac,
wxWindows 2.4 is a bit flaky and the import dialogs are a little
richer than other windows. I think the one showing you the results
of the import could use some better design as it takes up too much
space and is complicated. Ideas on a post card ...

Post by Adit Panchal
Another thing I was wondering about before I head to sleep: should we
prepend "http://" if the url does not contain the url
prefix? I wasn't planning to do so, because someone might use "ftp://"
or some other protocol and that would make things messy. I think
stripping the prefixes at merge time is good enough.

I would strip them only if they are http. For all other protocols (including
https) leave it on. Outlook likes to force the presence of http://,
Evolution leaves it as whatever you typed in, and Palm Desktop doesn't
even know about URLs. The http:// prefix is also messy and of no value on
phones. All the browsers also treat something without a scheme as http
(or in some cases as a file).
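
The stripping rule above is small enough to sketch directly. The function name strip_http_prefix is made up, and matching the scheme case-insensitively is an assumption.

```python
def strip_http_prefix(url):
    """Remove a leading http:// and leave every other scheme alone."""
    prefix = "http://"
    if url.lower().startswith(prefix):
        return url[len(prefix):]
    return url

print(strip_http_prefix("http://sourceforge.net"))   # sourceforge.net
print(strip_http_prefix("https://sourceforge.net"))  # https://sourceforge.net
print(strip_http_prefix("ftp://ftp.example.com"))    # ftp://ftp.example.com
```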

Post by Adit Panchal
make sense (I need sleep). The code and examples should speak for
themselves.

It would be nice to have multiple CSV files. The first (the base) is
imported into a clean phonebook, and then the rest get merged in various
horrific ways with that. Your test file is good for testing just the one field.
There is also experiments\gentestdata.py which makes huge test cases
(it is more of a stress tester than anything else).

Roger

Adit Panchal

2004-06-03 16:41:54 UTC

Post by Roger Binns
At the moment all I would want is some sort of number as to how certain
the code is that it has done the right thing. There isn't any infrastructure
yet to record that certainty though. The final number shown in the UI
will include the certainty that the right existing entry was matched
combined with the certainty that the merging operation did the right
thing. It is the least certain ones that will require most user
attention.

I was thinking about this, and I can return the value that the
comparison routine spits back as to how certain the match is. If it is
a 100% match, it will return 1.0; otherwise it will return a value
between 0.0 and 1.0. While I was looking at the rest of the code, there
appears to be some redundancy, as you already wrote comparison
code to calculate the score to determine which records match each
other. That is different from what I am doing, which is trying to
determine how close the imported field of a specific entry is to the
phonebook field. But then I am getting confused and it is going in a
circle in my head, because your code is comparing those same values,
but in the more general sense to determine the best match to merge. *help*

Post by Roger Binns

Post by Adit Panchal
(if different from the
first) gets its own new entry instead of overwriting the
original?

Pick whatever you think is most consistent :-)

I left the code as is (it overwrites the existing entry), but in my
opinion, it should add a new field. That way, if the user wants to keep
the original entry, it is at his discretion. If that is ok with you, I
will change it in my next diff.

Maybe down the road once there is a transaction log in place, we can
have the import results dialog, for each field that changed, show a
drop down box to change back to the original? That way, if the user
doesn't intervene, the default behavior is to overwrite. If they do
intervene, they can keep their existing data. Something to think about
I suppose.

Post by Roger Binns
I would strip them only if they are http. For all other protocols (including
https) leave it on. Outlook likes to force the presence of http://,
Evolution leaves it as whatever you typed in, and Palm Desktop doesn't
even know about URLs. The http:// prefix is also messy and of no value on
phones. All the browsers also treat something without a scheme as http
(or in some cases as a file).

I fixed this in the attached diff. I was also wondering, if we don't
need to replace the entry, should we also strip it on entries which are
already in the phonebook, or should that be done in another function
somewhere else?

Post by Roger Binns
It would be nice to have multiple CSV files. The first (the base) is
imported into a clean phonebook, and then the rest get merged in various
horrific ways with that. Your test file is good for testing just the one field.
There is also experiments\gentestdata.py which makes huge test cases
(it is more of a stress tester than anything else).

I am working on making better test files. Something that is human
readable - I tried making a shorter version of the phonebook generated
from gentestdata.py, and working with that, but my head started
spinning.

Apparently the best way to use that import dialog is to include the
headers as the first line. It makes things so much easier (and faster)
to import. I suppose I should have tried that a long time ago. :)

Adit