What do I find in dblp.xml?

The dblp XML format is modeled after the BibTeX *.bib file format. The format is defined in the DTD file in the same directory. Please understand that (by design) our DTD is not very strict: it places no restrictions on element order or multiplicity, and it even allows nonsensical combinations of child elements (e.g., ‹school› tags in ‹article› elements, or ‹editor› and ‹author› elements at the same time) that you will never find in the actual dblp data set. Our priority was to keep the definition clean and simple, not to model every aspect of the publication landscape.

More information on the XML structure of the dblp records and several design decisions can be found in the following paper:

In general, our XML is a shallow but very long list of XML records. The root element has several million child elements, but usually no element is deeper than level three. An excerpt of the XML file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>

[...]

<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<pages>97-105</pages>
<year>2010</year>
<volume>53</volume>
<journal>Commun. ACM</journal>
<number>3</number>
<ee>https://doi.acm.org/10.1145/1666420.1666444</ee>
<url>db/journals/cacm/cacm53.html#Gentry10</url>
</article>

[...]

<inproceedings key="conf/focs/Yao82a" mdate="2011-10-19">
<title>Theory and Applications of Trapdoor Functions (Extended Abstract)</title>
<author>Andrew Chi-Chih Yao</author>
<pages>80-91</pages>
<crossref>conf/focs/FOCS23</crossref>
<year>1982</year>
<booktitle>FOCS</booktitle>
<url>db/conf/focs/focs82.html#Yao82a</url>
<ee>https://doi.ieeecomputersociety.org/10.1109/SFCS.1982.45</ee>
</inproceedings>

[...]

<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
<url>https://www.wisdom.weizmann.ac.il/~oded/</url>
</www>

[...]
</dblp>
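
Because the file is a shallow but very long list of records, it is usually read as a stream rather than loaded as a whole. The following is a minimal sketch of that idea using Python's standard library. The inline `SAMPLE` and the helper name `iter_records` are illustrative assumptions; the real file also relies on dblp.dtd for its named entities, which this sketch does not load.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for dblp.xml (hypothetical sample; the real file is far larger
# and references dblp.dtd for its named entities, which is not handled here).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article key="journals/cacm/Gentry10" mdate="2010-04-26">
<author>Craig Gentry</author>
<title>Computing arbitrary functions of encrypted data.</title>
<year>2010</year>
</article>
<www mdate="2004-03-23" key="homepages/g/OdedGoldreich">
<author>Oded Goldreich</author>
<title>Home Page</title>
</www>
</dblp>
"""

def iter_records(source):
    """Yield (tag, key, mdate) for each level-1 record, one record at a time."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # elem is a direct child of the root element
            yield elem.tag, elem.get("key"), elem.get("mdate")
            elem.clear()  # discard the record's children once it has been handled

records = list(iter_records(io.BytesIO(SAMPLE)))
```

Tracking the nesting depth keeps the sketch independent of the concrete record element names.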

Level 1: data records

The children of the root element represent the individual data records that are stored in dblp. In general, there are two types of records: publication records and person records.

Publication records are inspired by the BibTeX syntax and are given by one of the following elements:

Please note that while the BibTeX type of a record defines certain categories of dblp data records, these record categories differ slightly from the publication types that are used throughout the dblp website.


Please note that while there is a record type for proceedings volumes, there is no record type for journal volumes. Consequently, the dblp XML file contains no data entities for whole journal volumes or series. This is a (sometimes unfortunate) heritage of the BibTeX data model.

Person records are described separately here.
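
The two record types can be told apart with a small helper. This is only a sketch under an assumption that the text above does not spell out: that person records are ‹www› elements whose key starts with "homepages/", as in the excerpt. The function name `record_kind` is hypothetical.

```python
import xml.etree.ElementTree as ET

def record_kind(record):
    """Return "person" or "publication" for a level-1 record element."""
    # Assumption: person records are <www> elements keyed under "homepages/".
    key = record.get("key", "")
    if record.tag == "www" and key.startswith("homepages/"):
        return "person"
    return "publication"

person = ET.fromstring('<www mdate="2004-03-23" key="homepages/g/OdedGoldreich"/>')
paper = ET.fromstring('<article key="journals/cacm/Gentry10" mdate="2010-04-26"/>')
kinds = (record_kind(person), record_kind(paper))
```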

All records share a number of common attributes:

The values of the publtype attribute are from a controlled vocabulary. Multiple publtypes can be provided as a space-separated list. The following table lists the publtypes in use for records. Here, scope denotes whether a type is used exclusively for publication records or for person records. In the past, certain slightly different publtype values were used; these are given as deprecated values. Note that the annotation is only partially complete. E.g., only a small number of edited publications are annotated as edited.

scope | current value | deprecated value | description
publication | encyclopedia | encyclopedia entry | Publication is a reference work, e.g., an encyclopedia article.
publication | informal | informal publication | Publication is gray literature, e.g., a preprint.
publication | edited | edited publication | Edited publication, e.g., an editorial or a news announcement.
publication | survey | -- | Publication is a survey article.
publication | data | -- | This publication is a data artifact.
publication | software | -- | This publication is a software artifact.
publication | withdrawn | -- | Publication was officially withdrawn by the publisher.
person | disambiguation | disambiguation page | The bibliography associated with this person record does not represent a single author. See here for details.
person | group | -- | The bibliography associated with this person record represents a group or a consortium of authors, and not just a single person. We usually don't know the composition of that group.
person | noshow | -- | Denotes an unlisted bibliography.
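
A publtype attribute value can be turned into a list of current values roughly as follows. This is a sketch, not an official parser: the mapping is copied from the table above, the function name `parse_publtypes` is hypothetical, and whether deprecated values ever occur combined with other values is an assumption.

```python
# Deprecated publtype values mapped to their current equivalents (from the table).
DEPRECATED = {
    "encyclopedia entry": "encyclopedia",
    "informal publication": "informal",
    "edited publication": "edited",
    "disambiguation page": "disambiguation",
}

def parse_publtypes(value):
    """Return the list of current publtype values encoded in an attribute value."""
    if not value:
        return []  # attribute missing or empty: no publtype annotation
    # Deprecated values contain spaces themselves, so rewrite them before splitting.
    for old, new in DEPRECATED.items():
        value = value.replace(old, new)
    return value.split()

current = parse_publtypes("informal withdrawn")
legacy = parse_publtypes("informal publication")
```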

Level 2: bibliographic metadata

Record elements do not contain any text, but they contain a number of child elements to specify the record's bibliographic metadata entries. See the Wikipedia page on BibTeX to learn which data entries are meaningful in which record type.

Note that, in contrast to BibTeX, there are no key elements, since the key is already an attribute of the record node. Also, there is a custom url element that specifies a local hyperlink relative to the dblp website's homepage.
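
Since url values may be relative to the homepage, they have to be resolved before use. A minimal sketch with Python's standard library follows; the base address `https://dblp.org/` and the helper name `resolve_url` are assumptions, so substitute whichever dblp mirror you actually use.

```python
from urllib.parse import urljoin

DBLP_HOME = "https://dblp.org/"  # assumed homepage; adjust for your mirror

def resolve_url(url_text, base=DBLP_HOME):
    """Resolve the text of a <url> element against the homepage."""
    # Relative values are joined onto the base; absolute URLs pass through unchanged.
    return urljoin(base, url_text)

local = resolve_url("db/journals/cacm/cacm53.html#Gentry10")
external = resolve_url("https://www.wisdom.weizmann.ac.il/~oded/")
```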

Most record elements can have one or more of the following optional attributes:

A detailed description of record elements can be found at How are data annotations used in dblp.xml.

Level 3: optional HTML markup

In the XML file, only title and booktitle elements may contain optional HTML markup, and only a select few markup elements are allowed:

In theory, the elements of this level may be nested arbitrarily deep to describe complex structures like formulas, e.g.

<i>x<sub>y<sup>2</sup></sub></i>


to describe an italic x that carries the subscript y, where y in turn carries the superscript 2. However, such cases are rare.
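
When only the plain text of a title is needed, the nested markup can be flattened. The sketch below uses a made-up title for illustration and assumes that simply dropping the markup is acceptable for your use case.

```python
import xml.etree.ElementTree as ET

# Hypothetical title element using the nested markup shown above.
title = ET.fromstring(
    "<title>On <i>x<sub>y<sup>2</sup></sub></i>-free graphs.</title>"
)

# itertext() walks all nested elements in document order and yields their text.
plain = "".join(title.itertext())

# The markup elements that were flattened, outermost first.
tags = [el.tag for el in title.iter() if el is not title]
```

Note that flattening loses the distinction between subscript and superscript, so `plain` is only suitable for search or display where that structure does not matter.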

Entities

The dblp XML file is encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as named entities in the DTD and used whenever necessary.

At the moment, most parts of dblp are restricted to ISO-8859-1 (latin-1) characters, i.e. the first 256 Unicode characters. With the exception of the ‹author› and ‹editor› elements, where you will still find only latin-1 characters, you may find numerical entities outside of this range. For example, ‹title› elements may contain Greek letters like ε, or the ‹note› elements of a person record may contain a Chinese name in the original Unicode spelling. All characters beyond the latin-1 range are given as numerical entities.
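
The encoding scheme can be sketched in Python as follows. This is an illustration only: it assumes that the named entities in the DTD match the standard HTML names for latin-1 characters (e.g. `&uuml;`), it does not escape XML special characters such as `&` or `<`, and the function names are hypothetical.

```python
import html
from html.entities import codepoint2name

def encode_char(ch):
    """Encode one character the way the text above describes."""
    cp = ord(ch)
    if cp < 128:
        return ch  # plain ASCII stays as it is
    if cp < 256 and cp in codepoint2name:
        return f"&{codepoint2name[cp]};"  # latin-1: named entity (assumed HTML names)
    return f"&#{cp};"  # everything else: numerical entity

def encode_text(text):
    return "".join(encode_char(ch) for ch in text)

encoded = encode_text("Jürgen ε")
decoded = html.unescape(encoded)  # resolves both named and numerical entities
```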