How to parse dblp.xml?

The XML file is encoded in plain ASCII. Characters outside of the 7-bit range are represented by symbolic or numeric entities. All symbolic entities are defined in the DTD.
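To illustrate how such entities reach an application, here is a minimal sketch using the SAX parser from the standard Java distribution. The tiny document, its inline DTD with a single entity declaration, and the class name EntityDemo are assumptions made for this example; the real dblp.dtd is an external file that declares many more entities.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the dblp example parser.
final class EntityDemo {
    // Stand-in document: the internal DTD subset declares one symbolic
    // entity; the text uses it once and a numeric entity once.
    static final String XML =
        "<!DOCTYPE dblp [ <!ENTITY uuml \"&#252;\"> ]>"
      + "<dblp><author>J&uuml;rgen M&#252;ller</author></dblp>";

    // Returns all character data of the document with entities resolved.
    static String authorText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Both entity forms resolve to the same character, U+00FC.
        System.out.println(authorText(XML));
    }
}
```

The parser replaces both the symbolic and the numeric entity before the text is handed to the handler, so the application only sees ordinary Unicode characters.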

Detailed information on the XML structure of the dblp records and several design decisions can also be found in the following paper:

Example parser

As an example, we provide a simple parser to process the dblp data, written in Java. Please load the files

into a directory and compile them:

javac Parser.java

The dblp.xml and dblp.dtd files should be stored in the same directory. You may start the program with the command

java -mx900M -DentityExpansionLimit=1000000 Parser dblp.xml > out.txt

This seems to work for the Java virtual machine 1.5.* but not for 1.6.* and above. We do not yet understand the problem with Java VM 1.6, but it has been reported by others as well.

If you want to use Java 1.6+, you should download the Apache Xerces XML parser. It does not have the problem reported above. You only have to copy the file xercesImpl.jar from the Xerces distribution to a location covered by your classpath.

The machine should have more than 1.5G of main memory; the option -mx900M sets the heap space to 900M. The option -DentityExpansionLimit is necessary to resolve the (many, many) symbolic entities used in the large XML file. Depending on your machine, the program should run for a few minutes. The result is stored in 'out.txt' ...
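A -D option only defines a Java system property, so the limit can presumably also be set from code before the parser is created. The following is a sketch under that assumption; the class ParserSetup and its method are hypothetical and not part of Parser.java, and the property name is simply the one used on the command line above.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of the dblp example parser.
final class ParserSetup {
    // Property name taken from the command line shown above.
    static final String LIMIT_PROPERTY = "entityExpansionLimit";

    // Creates a SAX parser after making sure the limit property is defined.
    static SAXParser newParser() {
        // Keep a value given via -D on the command line; otherwise use
        // the value from the example command.
        if (System.getProperty(LIMIT_PROPERTY) == null) {
            System.setProperty(LIMIT_PROPERTY, "1000000");
        }
        try {
            return SAXParserFactory.newInstance().newSAXParser();
        } catch (Exception e) {
            throw new IllegalStateException("cannot create SAX parser", e);
        }
    }
}
```

Whether a particular parser implementation honours this property, and under which name, depends on the Java version, so the command line option remains the documented way.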

The first part of out.txt contains some simple statistics about the dblp data:

The main part of out.txt shows how we try to locate variations of name spellings:

Hongli Deng: Linda Shapiro - Linda G. Shapiro

This means: There is a person named 'Hongli Deng' who has coauthors 'Linda Shapiro' and 'Linda G. Shapiro', which may (or may not) be the same person.
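The exact comparison rule is not described here. As an illustration only, the sketch below assumes a very simple heuristic: two coauthor names are reported if they differ but share the same first and last name part. The class NameVariants and this heuristic are assumptions for this example and may differ from what the example parser actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the logic of the dblp example parser.
final class NameVariants {
    // Assumed heuristic: different strings with equal first and last token.
    static boolean maybeSame(String a, String b) {
        String[] x = a.trim().split("\\s+");
        String[] y = b.trim().split("\\s+");
        return !a.equals(b)
            && x[0].equals(y[0])
            && x[x.length - 1].equals(y[y.length - 1]);
    }

    // Compares all pairs of coauthors of one person and formats the hits
    // like the lines in out.txt.
    static List<String> report(String person, List<String> coauthors) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < coauthors.size(); i++) {
            for (int j = i + 1; j < coauthors.size(); j++) {
                String a = coauthors.get(i);
                String b = coauthors.get(j);
                if (maybeSame(a, b)) {
                    lines.add(person + ": " + a + " - " + b);
                }
            }
        }
        return lines;
    }
}
```

Comparing only the coauthors of one person keeps the number of pairs small; comparing all names in dblp against each other would be far more expensive.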

Details on Parser.java

This class contains the static main method and the methods necessary to use the XML SAX parser shipped with the standard Java distribution. It produces the first part of the statistics. The main approaches to parse XML are DOM and SAX parsers:

In our application we are only interested in person names and not in titles, conference names, page numbers, publication years etc. We view a publication as a list of author (or editor) fields; any other information is skipped. The 'startElement' method recognizes two situations:

The 'characters' method simply appends the input text to the 'Value' string. This should only happen if we are inside of an author or editor element. Without the test 'if (insidePerson)' the program remains correct, but it becomes very slow because it produces several million garbage objects.

The method 'endElement' works similarly to 'startElement':
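The interplay of the three callbacks can be sketched as follows. This is a simplified illustration and not the code of Parser.java: the class PersonHandler, the list 'persons' and the helper 'parse' are assumptions for this example, and a StringBuilder is used in place of the 'Value' string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects author and editor names only.
final class PersonHandler extends DefaultHandler {
    private boolean insidePerson = false;
    private final StringBuilder value = new StringBuilder();
    final List<String> persons = new ArrayList<>();

    private static boolean isPerson(String qName) {
        return qName.equals("author") || qName.equals("editor");
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (isPerson(qName)) {
            insidePerson = true;
            value.setLength(0); // start a fresh name
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text of titles, years etc. is dropped without creating objects.
        if (insidePerson) {
            value.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isPerson(qName)) {
            persons.add(value.toString());
            insidePerson = false;
        }
    }

    // Convenience helper for small inputs: parses a string and returns the names.
    static List<String> parse(String xml) {
        PersonHandler handler = new PersonHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.persons;
    }
}
```

A SAX parser may deliver the text of one element in several 'characters' calls, which is why the text is appended rather than assigned.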
