OpenRefine: Support BibTex import into records

Created on 15 Oct 2012 · 16 Comments · Source: OpenRefine/OpenRefine

_Original author: thadguidry (November 11, 2010 19:38:44)_

Ed Laurent (Freebase expert) would like to be able to use Google Refine to import BibTex files and then load data about citations and scholarly works into Freebase. He currently uses EndNote, which can export to XML, BibTex, and EndNote formats, among others. Unfortunately, his other tools only support BibTex, hence the need for this use case. (Note: Some academics also use Mendeley.com and the Zotero Firefox plugin, which also support the BibTex and EndNote formats.) (SubNote: Mendeley.com has a public API! Darn, it's no longer truly public; it requires free registration.)

_Original issue: http://code.google.com/p/google-refine/issues/detail?id=195_

enhancement good second issue help wanted import imported from old code repo new data format Medium

All 16 comments

_From tfmorris on November 11, 2010 20:08:44:_
I've checked a few possible parser candidates:

JabRef - GPL

javabib - GPL

j4bib - BSD license, no recent activity, http://sourceforge.net/projects/j4bib/files/

bibparse - no stated license (even in source zip), author's home page hasn't been updated since 2005 after fairly regular updates before that, so he may be retired or deceased http://ftp.math.utah.edu/pub//bibparse/

I'll take a look at j4bib unless someone comes up with a better alternative. It's not a very complex format, so writing from scratch is an option as well.

_From mcnamara.[email protected] on November 15, 2010 02:10:19:_
Writing from scratch wouldn't be too hard at all; it's just a format of key:value pairs, especially as an importer probably wouldn't need to do validation.
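To make the "key:value pairs" idea concrete, here is a minimal sketch in Python of a from-scratch reader for a single entry. It is not the parser proposed for Refine; the function name `parse_bibtex_entry` and the sample record are made up for illustration, and it assumes one well-formed `@type{key, field = value, ...}` entry with no `@string` macros, `#` concatenation, or validation.

```python
def parse_bibtex_entry(text):
    """Parse one '@type{key, field = value, ...}' entry into a dict (no validation)."""
    text = text.strip()
    at = text.index("@")
    open_brace = text.index("{", at)
    entry_type = text[at + 1:open_brace].strip().lower()
    body = text[open_brace + 1:text.rindex("}")]
    cite_key, _, rest = body.partition(",")
    record = {"type": entry_type, "key": cite_key.strip()}

    i, n = 0, len(rest)
    while i < n:
        eq = rest.find("=", i)
        if eq == -1:
            break
        # Field name: everything between the previous value and '='.
        name = rest[i:eq].strip().strip(",").strip().lower()
        j = eq + 1
        while j < n and rest[j].isspace():
            j += 1
        if j < n and rest[j] == "{":
            # Braced value: scan to the matching close brace, allowing nesting.
            depth, k = 0, j
            while k < n:
                if rest[k] == "{":
                    depth += 1
                elif rest[k] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                k += 1
            value = rest[j + 1:k]
        elif j < n and rest[j] == '"':
            # Quoted value: runs to the next double quote.
            k = rest.find('"', j + 1)
            if k == -1:
                k = n
            value = rest[j + 1:k]
        else:
            # Bare value (for example a number): runs to the next comma.
            k = rest.find(",", j)
            if k == -1:
                k = n
            value = rest[j:k].strip()
        record[name] = value
        i = k + 1
    return record


SAMPLE = """@book{knuth1984,
  author = {Donald E. Knuth},
  title = {The {TeX}book},
  year = 1984,
  publisher = "Addison-Wesley"
}"""

record = parse_bibtex_entry(SAMPLE)
print(record["author"], "|", record["title"])
```

Nested braces are kept verbatim in the value, so any LaTeX markup inside a field would still need separate handling.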

_From [email protected] on November 15, 2010 21:17:29:_
I frequently use BibTex so I give this +1!

_From thadguidry on November 19, 2010 23:08:46:_
Attached a single BibTex record from a Google Books export [[http://books.google.com/books?id=d1tIAAAAYAAJ&pg=PR3#v=onepage&q&f=false]] for quality checking of diacritic characters once this feature is implemented.

_From [email protected] on September 28, 2011 11:26:17:_
I attached a more complicated record from Web of Science (first article for the query "google"). Note especially the multiple values in some fields.

Google Refine would be great for address cleaning and such things... Does it have an "address guesser"?

_From tfmorris on October 15, 2011 17:31:39:_
Some additional possibilities for starting points:

bibtex2rdf - Apache 2.0 license, JavaCC grammar
http://sourceforge.net/projects/bibtex2rdf/

ANTLR grammar for BibTex - no stated license
http://stackoverflow.com/questions/7583982/bibtex-grammar-for-antlr

MIT SIMILE bibtex-converter - MIT License, JavaCC grammar - doesn't attempt to interpret LaTex
http://code.google.com/p/simile-widgets/source/browse/babel/trunk/converters/bibtex-converter
https://simile.mit.edu/repository/babel/trunk/converters/bibtex-converter/

j4bib (mentioned above) - BSD license, uses JLex and CUP
https://downloads.sourceforge.net/project/j4bib/j4bib/j4bib-0.2/j4bib-src-0.2.tar.gz

I take back what I said last year about the format being simple. On the surface it is, but because one can embed arbitrary LaTex code, you'd need a full parser/renderer to faithfully parse everything. Even for a basic level of support, you'd need to handle things like LaTex character composition, e.g. {\'E}mile
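As an illustration of the basic level of support mentioned here, the following Python sketch rewrites simple brace-wrapped accent commands such as `{\'E}` or `{\'{E}}` into the corresponding Unicode characters. The names `COMBINING`, `ACCENTED`, and `decode_latex_accents` are hypothetical, only five accent commands are covered, and anything beyond a single accented ASCII letter is left untouched; it is nowhere near a full LaTeX interpreter.

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining mark (a small, incomplete table).
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}

# Matches the brace-wrapped forms {\'E} and {\'{E}} for one ASCII letter.
ACCENTED = re.compile(r"""\{\\(?P<a>['`"^~])\{?(?P<c>[A-Za-z])\}?\}""")


def decode_latex_accents(text):
    """Replace simple accent commands with precomposed Unicode characters."""
    def repl(match):
        composed = match.group("c") + COMBINING[match.group("a")]
        # NFC folds letter + combining mark into a single code point.
        return unicodedata.normalize("NFC", composed)
    return ACCENTED.sub(repl, text)


print(decode_latex_accents(r"{\'E}mile"))
```

Composing through Unicode combining marks avoids maintaining a table of every accented letter, at the cost of ignoring commands that have no combining-mark equivalent.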

_From [email protected] on December 13, 2011 18:54:21:_
If the latex thingy is a problem, maybe a RIS importer can be used, which does not allow latex commands.

Almost all bib databases can export RIS or bibtex, and there are some bibtex to RIS converters, which should help if you are stuck with bibtex exports.

_From tfmorris on December 13, 2011 20:29:26:_
Thanks for the suggestion. The entity substitution issue that I mentioned as an example of LaTex processing is actually pretty simple, so we'd probably do that first and see if it covers the bulk of what people need.

RIS or EndNote XML would be other bibliographic data formats to consider supporting for import, but I'm not sure they'd replace BibTex since many of the BibTex files are old hand-maintained bibliographies, not necessarily exports from a bib. web site or program.

_From [email protected] on December 13, 2011 21:51:10:_
The interesting things for bibliometricians are probably the name and address cleaning parts, and maybe even name disambiguation: is "Chen, C" of the first work in the list the same "Chen, C" as in the 1245th work? Or is "Meyer-Lüdenscheid, CW" the same as "Meyer Luedenscheid, C"? Unfortunately, in the end, this is manual work, so I'm not sure how Refine can help here. A string comparer which clusters names based on a string-distance function would be nice, as would a clustering algorithm based on the keywords, words in titles, or words in addresses (there are quite a few papers on author name disambiguation which use such methods) or on the results of a Google query (if there are similar authors and a Google query based on both titles returns some results, it is probably because of the author's webpage, which lists both works).
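A rough Python sketch of the string-distance idea: fold diacritics and punctuation, then greedily group names whose similarity ratio passes a threshold. This is not Refine's clustering; `normalize`, `cluster_names`, and the 0.85 threshold are assumptions chosen for illustration, and the output is only a list of candidates for a human to confirm.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop diacritics, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).split())


def cluster_names(names, threshold=0.85):
    """Greedy clustering: a name joins the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        key = normalize(name)
        for cluster in clusters:
            ratio = SequenceMatcher(None, key, normalize(cluster[0])).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


names = ["Meyer-Lüdenscheid, CW", "Chen, C", "Meyer Luedenscheid, C"]
for group in cluster_names(names):
    print(group)
```

String similarity alone cannot tell two different people called "Chen, C" apart, which is where the keyword, title, or address signals would have to come in.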

The name disambiguating part is probably interesting for others as well: merging two address databases, ...

_From tfmorris on December 13, 2011 22:36:03:_
We're getting off-topic (at least for this issue), so we should probably move the discussion to the mailing list/Google Groups, but Refine excels (so to speak) at precisely the kind of thing you're talking about -- allowing for and amplifying human judgments.

Facets based on author name clusters, edit distances, keywords, and a number of other things are possible. Various types of name cleanup are among the current major uses of Refine.

As I said, if you want to discuss bibliographic data use cases more, let's move it to the list/group.

(Zotero, Mendeley, EndNote, etc.) Bibtex -> (JabRef) CSV -> OpenRefine

A problem with exporting from Mendeley to Bibtex is that non-ASCII characters are converted into a LaTeX format, for example É becomes {\'{E}}.

My workaround was to export citations from Mendeley as EndNote XML and import that into OpenRefine. I had to join a few multi-valued cells after the import, but I avoided character encoding errors along the way.

@DFoltzMorrison The workflow "Bibtex -> JabRef : export to CSV -> Open Refine" explained above seems to work quite well.

In most cases, yes. But Mendeley -> Bibtex can cause character encoding issues. É becomes {\'{E}}, ö becomes {\"{o}}, etc. That's obviously an issue with Mendeley that hasn't been addressed (example 1, example 2)

That's why I've suggested Mendeley -> export to EndNote XML -> OpenRefine as an alternative.

Ouch, the Python bibtex parser mentioned in your first link is anything but simple. I hope there is something similar in Java, otherwise we will not see this importer anytime soon...

@ettorerizza There is now a robust Java Bibtex parser https://github.com/jbibtex/jbibtex
