How to parse dblp.xml?
The encoding used for the XML file is plain ASCII. To represent characters outside the 7-bit range, we use symbolic or numeric entities. All symbolic entities are defined in the DTD.
Detailed information on the XML structure of the dblp records and several design decisions can also be found in the following paper:
- Michael Ley: DBLP - Some Lessons Learned. Proceedings of the VLDB Endowment, Volume 2: 1493-1500 (2009).
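The entity mechanism described above can be illustrated with a small sketch. It is not part of the dblp tools. The class name EntityDemo, the helper textOf and the sample author name are made up for this example, and the entity is declared in an inline DOCTYPE only to keep the example self-contained (for the real data the declarations come from dblp.dtd). The sketch parses a pure ASCII document that contains one symbolic and one numeric entity, and collects the resolved text:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Returns all character data of the document, with entities resolved by the parser.
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Resolved entities arrive here as ordinary characters,
                // possibly spread over several calls.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumption: an inline declaration stands in for the one in dblp.dtd.
        String xml = "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
                + "<dblp><author>J&uuml;rgen M&#252;ller</author></dblp>";
        System.out.println(textOf(xml));
    }
}
```

Both the symbolic form (&uuml;) and the numeric form (&#252;) reach the application as the same character, so the application code never sees the entities themselves.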
Example parser
As an example, we provide a simple parser to process the dblp data, written in Java. Please load the files
into a directory and compile them:
javac Parser.java
The dblp.xml and dblp.dtd files should be stored in the same directory. You may start the program with the command
java -mx900M -DentityExpansionLimit=1000000 Parser dblp.xml > out.txt
This seems to work for the Java virtual machine 1.5.* but not for 1.6.* and above. We do not yet understand the problem with Java VM 1.6, but it has also been reported by others.
If you want to use Java 1.6+, you should download the Apache Xerces XML parser. It does not have the problem reported above. You only have to copy the file xercesImpl.jar from the Xerces distribution to a location covered by your classpath.
The machine should have 1.5G of main memory; the option -mx900M sets the heap space to 900M. The option -DentityExpansionLimit is necessary to resolve the (many, many) symbolic entities used in the large XML file. Depending on your machine, the program should run for a few minutes. The result is stored in 'out.txt'.
The first part of out.txt contains some simple statistics about the dblp data:
- How many persons have a name with a given length.
- How many persons have published a given number of publications (or more - dblp is always incomplete).
- The program builds the coauthor graph and produces a simple histogram of the node degrees: How many persons have a given number of coauthors.
- Names are decomposed into name parts; the delimiters are spaces and '-'. How many persons have names composed of 1, 2, 3, ... parts.
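Two of the statistics listed above can be sketched in a few lines. This is only an illustration under our own assumptions, not the code from the example parser: the class NameStatsSketch, its method names and the single-letter sample names are hypothetical, and publications are simply modelled as lists of author names.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class NameStatsSketch {
    // Counts the parts of a name; delimiters are spaces and '-'.
    static int countNameParts(String name) {
        int parts = 0;
        for (String part : name.split("[ -]+")) {
            if (!part.isEmpty()) parts++;
        }
        return parts;
    }

    // Builds the coauthor graph and returns the number of distinct coauthors per person.
    static Map<String, Integer> coauthorDegrees(List<List<String>> publications) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (List<String> authors : publications) {
            for (String a : authors) {
                Set<String> neighbours = graph.computeIfAbsent(a, k -> new HashSet<>());
                for (String b : authors) {
                    if (!a.equals(b)) neighbours.add(b);
                }
            }
        }
        Map<String, Integer> degrees = new TreeMap<>();
        graph.forEach((person, neighbours) -> degrees.put(person, neighbours.size()));
        return degrees;
    }

    // Histogram: value -> number of persons having that value.
    static Map<Integer, Integer> histogram(Collection<Integer> values) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (int v : values) result.merge(v, 1, Integer::sum);
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> pubs = List.of(
                List.of("A", "B", "C"), List.of("A", "B"), List.of("D"));
        System.out.println(histogram(coauthorDegrees(pubs).values()));
        System.out.println(countNameParts("Linda G. Shapiro"));
    }
}
```

A set per person keeps repeated collaborations from being counted twice, so the node degree is the number of distinct coauthors.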
The main part of out.txt shows how we try to locate variations of name spellings:
Hongli Deng: Linda Shapiro - Linda G. Shapiro
This means: There is a person named 'Hongli Deng' who has coauthors 'Linda Shapiro' and 'Linda G. Shapiro', which may (or may not) be the same person.
Details on Parser.java
This class contains the static main method and the methods necessary to use the XML SAX parser shipped with the standard Java distribution. It produces the first part of the statistics. The two main approaches to parsing XML are DOM and SAX parsers:
- A DOM parser produces an in-memory tree representation of the XML input. This is nice for small or medium-sized XML documents, but it is not practical for a 400M document like dblp.xml.
- A SAX parser provides a lower-level callback interface. The methods 'startElement', 'endElement' and 'characters' are called when an open tag, an end tag, or any characters between the tags are recognized.
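To make the callback interface more concrete, here is a minimal sketch that records the order in which the SAX parser of the standard Java distribution invokes the three methods. The class SaxEventDemo, its events helper and the tiny sample document are made up for this illustration and are not taken from Parser.java.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {
    // Parses the XML string and returns the SAX events in document order.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text node in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<dblp><article key=\"k1\"><author>A. Author</author></article></dblp>";
        events(xml).forEach(System.out::println);
    }
}
```

Because the handler only sees one event at a time, any context (such as "we are inside an author element") has to be kept in fields of the handler.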
In our application we are only interested in person names and not in titles, conference names, page numbers, publication years etc. We view a publication as a list of author (or editor) fields, any other information is skipped. The 'startElement' method recognizes two situations:
- If the parser is located at the beginning of an author or editor field, it sets the Boolean variable 'insidePerson' to true.
- Bibliographic records are elements like 'article', 'inproceedings', etc. (mainly BibTeX terminology, see DTD). The open tags on the record level always contain the attribute 'key'. Our startElement method simply looks for 'key'-attributes. It stores the key and the recordTag.
The 'characters' method simply appends the input text to the 'Value' string. This should only happen if we are inside an author or editor element. Without the test 'if (insidePerson)' the program remains correct, but it becomes very slow because we produce several million garbage objects.
The method 'endElement' works similarly to 'startElement':
- If we are at the end of an author/editor element, we store the name in the temporary array 'persons'.
- As soon as we see the end of a publication record, we copy the information from the 'persons' array into a new array of the required size and call the constructor of the Publication class ...
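The interplay of the three methods described above can be sketched as follows. This is a simplified reconstruction based on the description, not the actual Parser.java: the class name PersonHandlerSketch, the bound MAX_PERSONS, the field numberOfPersons and the sample record keys are assumptions, a StringBuilder plays the role of the 'Value' string, and a map from record key to name array stands in for the Publication objects.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PersonHandlerSketch extends DefaultHandler {
    private static final int MAX_PERSONS = 5000;   // assumed upper bound per record

    private boolean insidePerson = false;
    private final StringBuilder value = new StringBuilder();
    private String key = null;
    private String recordTag = null;
    private final String[] persons = new String[MAX_PERSONS]; // temporary array
    private int numberOfPersons = 0;

    // Stand-in for the Publication objects: record key -> person names.
    final Map<String, String[]> publications = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("author") || qName.equals("editor")) {
            insidePerson = true;
            value.setLength(0);
            return;
        }
        String k = atts.getValue("key");   // only record-level tags carry 'key'
        if (k != null) {
            key = k;
            recordTag = qName;
            numberOfPersons = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insidePerson) value.append(ch, start, length); // skip all other text
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("author") || qName.equals("editor")) {
            insidePerson = false;
            if (numberOfPersons < MAX_PERSONS) persons[numberOfPersons++] = value.toString();
            return;
        }
        if (qName.equals(recordTag)) {
            // Copy the names into a new array of exactly the required size.
            publications.put(key, Arrays.copyOf(persons, numberOfPersons));
            key = null;
            recordTag = null;
            numberOfPersons = 0;
        }
    }

    static Map<String, String[]> parse(String xml) {
        PersonHandlerSketch handler = new PersonHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.publications;
    }

    public static void main(String[] args) {
        String xml = "<dblp><article key=\"sample/1\"><author>Hongli Deng</author>"
                + "<author>Linda G. Shapiro</author><title>T</title></article></dblp>";
        parse(xml).forEach((k, names) -> System.out.println(k + ": " + Arrays.toString(names)));
    }
}
```

Text inside elements such as the title never reaches the buffer, which is the effect of the 'if (insidePerson)' test discussed above.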