Roger Binns
2004-05-29 03:23:29 UTC
[Adit has been working on some of the import/export code. This is a
response to his most recent email to me which I have posted here so
it ends up in the archives, as well as being of interest to others.]
> The only problem is, currently, the import csv dialog only permits 1 url

Actually it permits as many as you want. The trick is that you
simply name all the columns the same thing ("Web Page"). The first column
named "Web Page" becomes the first URL, the second becomes the second, etc.
(At some point I should really write doc about that. The same principle
applies to almost any of the columns. :-)

> url which is a plain type (neither home nor business).

You can use the following columns:

  Web Page
  Home Web Page
  Business Web Page
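
To illustrate the repeated-column trick, here is a rough sketch of how a row with several identically named columns could be turned into a list of URLs. This is not the actual import code. The entry layout (a dict with "url" and an optional "type") and the function name are assumptions made up for the example. It uses csv.reader rather than csv.DictReader, because DictReader keeps only one value when header names repeat.

```python
import csv
import io

# Column name -> URL type; None means a plain URL (neither home nor business).
URL_COLUMNS = {
    "Web Page": None,
    "Home Web Page": "home",
    "Business Web Page": "business",
}

def urls_from_row(header, row):
    """Collect every URL column in order, so repeated names yield several URLs."""
    urls = []
    for name, value in zip(header, row):
        value = value.strip()
        if name in URL_COLUMNS and value:
            entry = {"url": value}            # hypothetical entry layout
            if URL_COLUMNS[name] is not None:
                entry["type"] = URL_COLUMNS[name]
            urls.append(entry)
    return urls

sample = io.StringIO(
    "Name,Web Page,Web Page,Home Web Page\n"
    "Alice,http://a.example,http://b.example,http://home.example\n"
)
reader = csv.reader(sample)
header = next(reader)
rows = [urls_from_row(header, row) for row in reader]
```

The first "Web Page" column ends up as the first URL and the second as the second, which is the behaviour described above.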

> there is an entry for home or business url in the phonebook, the new
> value will get compared, but if it needs to be replaced, it adds it as
> a new url item without a type, instead of replacing. I didn't know if
> we should replace it, since there is a possibility that the user may
> want to add an additional url instead of replacing it. Let me know if I
> should change this behavior.

This issue actually arises for all fields. Later on, when we have proper
sync methods and code, a more complicated algorithm will have to generate
transaction logs from incoming data and merge them into the existing
data, taking into account that the same transactions could already have
been applied from a different data source.

If it is a complete toss-up whether the imported data or the
existing data is correct, use the existing data. For CSV import it
is probably the imported data that is more correct, unless it is
a simpler form of the existing data (e.g. if the imported phone number is
123 456 7890 and the existing one is (123) 456-7890, then the existing
one is preferred).
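
As a rough sketch of that rule for phone numbers: the function names are made up, and treating "simpler form" as "same digits but shorter text" is my assumption for the example, not something the import code is known to do.

```python
def digits_only(value):
    """Strip everything except digits so formatting differences are ignored."""
    return "".join(ch for ch in value if ch.isdigit())

def choose_phone(existing, imported):
    """Pick which phone number string to keep after a CSV import."""
    if not existing:
        return imported
    if not imported:
        return existing
    if digits_only(existing) == digits_only(imported):
        # Same number: keep the more richly formatted one. On a tie the
        # existing value wins (assumption: longer text means more formatting).
        return imported if len(imported) > len(existing) else existing
    # Genuinely different numbers: for CSV import, trust the imported data.
    return imported
```

With the numbers from the example, choose_phone("(123) 456-7890", "123 456 7890") keeps the existing value.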
At some point I would also like to calculate a score of how certain
the import code is that it did the right thing. That way users
will be presented with the least certain first and can adjust them
and ignore most of the rest.
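
One possible shape for this, purely as a sketch: score each match between an imported name and an existing entry, then sort so the least certain come first. Using the difflib similarity ratio as the certainty score is an assumption for illustration, and the function names are made up.

```python
from difflib import SequenceMatcher

def match_certainty(imported_name, existing_name):
    """Return a score in [0, 1]; 1.0 means the two names are identical."""
    return SequenceMatcher(
        None, imported_name.lower(), existing_name.lower()
    ).ratio()

def review_order(matches):
    """Sort (imported_name, existing_name) pairs, least certain first."""
    scored = [(match_certainty(imp, ex), imp, ex) for imp, ex in matches]
    scored.sort(key=lambda item: item[0])
    return scored

matches = [
    ("John Smith", "John Smith"),
    ("J. Smith", "John Smith"),
    ("Bob", "Robert Jones"),
]
order = review_order(matches)
```

The user would then walk the list from the top, fix the doubtful matches, and stop once the scores look trustworthy.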

> http://datamining.anu.edu.au/projects/linkage.html

The software looks great, but sadly is under a license that doesn't
allow it to be used with GPL projects. See this link:

http://www.gnu.org/philosophy/license-list.html

The febrl license is a modified version of the Mozilla license but with
stuff removed, including the section that allowed for multiple licensing.
I have emailed the febrl people to see if they are prepared to do anything
about this.

> I found their string comparison routines (specifically winkler) to be
> more accurate than the built-in python difflib. You can run
> stringcmp.py to see a comparison of test matches.

I would recommend adding a layer of indirection so that either winkler or
difflib can be used. Something like this. I just made up the method names
and code, but it gives the general idea:

    try:
        import winkler
        def cmpfunc(foo,bar):
            return winkler.routine(foo,bar)/3
    except ImportError:
        # winkler not available, fall back to difflib
        import difflib
        def cmpfunc(foo,bar):
            f=difflib.routine(foo,bar,7)
            if len(f): return f[0]
            return 0
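
For what it's worth, here is a version of the same idea where the fallback path actually runs, using difflib.SequenceMatcher from the standard library. The winkler module name and its routine() function remain made up, as above, and returning a similarity between 0 and 1 is an assumption about what callers would want.

```python
import difflib

try:
    import winkler  # hypothetical module name; not a real package reference
except ImportError:
    winkler = None

def cmpfunc(foo, bar):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    if winkler is not None:
        # routine() is a placeholder for whatever the module really provides.
        return winkler.routine(foo, bar)
    # Standard-library fallback: ratio() is 1.0 for identical strings.
    return difflib.SequenceMatcher(None, foo, bar).ratio()
```

The rest of the code only ever calls cmpfunc, so swapping in a better comparison later needs no changes elsewhere.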
Roger