Abstract
VOTable is a new XML-based format for representing astronomical catalogues. It has been developed for use in the 'Virtual Observatory', an ambitious proposal to provide uniform, convenient access to disparate, geographically dispersed archives of astronomical data from software which runs on the computer on the astronomer's desktop. Catalogues, that is tables of the properties of celestial objects (celestial coordinates, brightness etc.), are very important in astronomy and constitute a significant component of the Virtual Observatory. The VOTable has been defined in terms of XML in order to take advantage of computer-industry standards and to utilise standard software and tools. At the same time it is important not to lose the previous investment in astronomy-specific standards, such as the tables variants of the FITS (Flexible Image Transport System) format. Also, astronomical tables are rich in 'metadata', which in this context means annotation, interpretable by either computers or humans, both of the tables and the individual columns that they contain. It is important that these metadata should be preserved with the table and the VOTable has features to permit this.
The development of the VOTable format is briefly reviewed, the format itself is described and a number of open issues, which may be addressed in future enhancements of the format, are discussed. Some of these issues only directly affect the astronomical community which uses VOTable. Others have wider implications. For example, enhancements to XML, particularly in the handling of bulk data, could allow significant improvements to the VOTable and similar formats.
Astronomy is an observational science which proceeds by acquiring and interpreting observations of celestial phenomena. Moreover, modern telescopes and instruments produce data at an enormously greater rate than hitherto and also generate them directly in a digital form. Observations are made over most of the electromagnetic spectrum, from gamma-rays to radio wavelengths and extremely varied types of observation are made, though direct images and spectra are common. One specific example is sky surveys, which produce pixel images of a substantial fraction of the sky. These surveys can be Terabytes in size. Once observations have been made it is desirable to preserve, share and reuse them, both because they are expensive to acquire and, in the case of astronomical objects which move or vary in brightness, are irreproducible.
These requirements have led to the idea of the 'Virtual Observatory' (VO). (The same acronym is used in the GRIDS community for 'Virtual Organisation'. The astronomer's 'Virtual Observatory' is a GRIDS 'Virtual Organisation', though it is not clear whether this increases or decreases the scope for confusion.) The VO comprises a collection of the major astronomical datasets and archives, dispersed geographically but connected logically, which astronomers are able to search and access using software which is simple to operate and runs on the computers on their desktops. There is a fortuitous confluence between the requirements of the VO and modern ideas in distributed computing, such as GRIDS, and related technologies such as XML and the Simple Object Access Protocol (SOAP). These developments provide much of the infrastructure needed for the VO. A number of projects to build the VO are now in progress, including AstroGrid in the UK, the National Virtual Observatory (NVO) in the US and the Astrophysical Virtual Observatory (AVO) in Europe. These projects are agreeing low-level inter-operability standards under the umbrella of the International Virtual Observatory Alliance (IVOA) [1].
Astronomical catalogues are tables of the positions and measured properties (such as brightness) of a list of celestial objects. Such catalogues are very important in astronomy and traditional printed catalogues have been used for hundreds of years. Most modern catalogues are produced in a computer-readable form and many are available; for example the Centre de Données astronomiques de Strasbourg (CDS) has assembled a collection of several thousand. Most of these catalogues are of moderate size, though there are a few large ones. For example, the catalogues constructed from the major sky surveys can contain entries for up to 10^9 or more objects. Further, there is little standardisation between catalogues: different quantities are tabulated, in different units, with different names used for similar quantities. Though most astronomical catalogues are lists of celestial objects, other sorts of tabular data are sometimes encountered, such as atomic line lists or X-ray event lists. Catalogues will be an important component of the VO.
The VOTable [2] is a new format specifically developed to represent subsets of rows and columns selected from catalogues in the VO and returned to the user for further analysis. The VOTable has been defined in terms of XML in order to take advantage of computer-industry standards and to utilise standard software and tools. However, the VOTable is not the first standard format for astronomical catalogues and it is important to continue to capitalise on the investment and expertise in these earlier astronomy-specific standards. The Flexible Image Transport System (FITS) [3] standard is easily the most important standard data format in astronomy. It was introduced in its original form in 1981. It has proven popular and enduring and is now ubiquitous. The original FITS standard addressed only image (or array, or pixel) data. Subsequently, however, two enhancements have been introduced for catalogue data: the ASCII table extension [4] and the binary tables extension [5]. The latter has proven the more popular because it provides a concise representation for numeric data.
The immediate ancestors of the VOTable format are Astrores [6], developed by the CDS, and the eXtensible Scientific Interchange Language (XSIL) [7], both of which were based on XML. The VOTable project started in October 2001 with discussions between the groups developing these two formats. The first draft of the VOTable standard was prepared in December 2001 and was discussed extensively at an 'inter-operability meeting' held in Strasbourg during January 2002. Version 1.0 of the VOTable was subsequently released on 15 April 2002. Other table formats are sometimes encountered in astronomy, such as the Tab-Separated Value (TSV) format [8] or the Small Text List used by Catalogue Utilities for Reporting, Selecting and Arithmetic (CURSA) [9]. Such formats are not specifically part of the ancestry of the VOTable, though some of the ideas in them have influenced its design.
The VOTable is only one component of the VO. In particular, formats will need to be defined for 'bulk data' such as images or spectra, retrieved from archives and also for the queries sent to remote catalogues and archives to select data. Work is underway on these formats, but none are as well-developed as the VOTable. We anticipate that most of these formats will also be based on XML, so the VOTable is acting as a prototype for future work.
HR    Star-name    R.A.        Dec.       V      Sp.Typ.  Par.    R.V.
                   Hours       Degrees    mag.            arcsec  km/sec
.
.
6601               17 43 46.9   -7  4 46  6.30   B1.5V            -26V
6602  83 HER       17 42 28.3  +24 33 50  5.52   K4III    +0.20   -27
6603  60 Beta OPH  17 43 28.3   +4 34  2  2.77   K2III    +0.33   -12V
6604               17 43 21.8  +14 17 43  6.24   F5II             -42
6605               17 40 35.9  +57 18 38  6.76R  K0               -14
.
.
.
Figure 1.
Figure 1 shows a few rows extracted from a typical astronomical catalogue. This extract is largely just a table of values, similar to any other tabular dataset. However, it illustrates a few points about astronomical catalogues.
Most columns contain numeric values, though a few contain character strings.
Missing values are common (for example column Par. has entries for only two of the five stars).
The Right Ascension (R.A.) and Declination (Dec.) columns list the celestial coordinates of the objects (which are broadly analogous to terrestrial longitude and latitude respectively). These quantities traditionally have units of hours and degrees respectively and are displayed with sexagesimal subdivisions into minutes and seconds. Though this format has often been a fruitful source of confusion and bugs in astronomical catalogue software, it is essentially a presentation issue and will not be discussed further here.
In addition to its name the physical units of each column are also shown.
This last point is an example of the metadata associated with the catalogue. In this context 'metadata' means annotation, interpretable by either computers or humans, both of the catalogue and the individual columns which it contains. Astronomical catalogues, like other astronomical and, indeed, scientific data, are usually rich in metadata, and these metadata are needed to interpret the catalogues correctly. An abstract description of the components which make up an astronomical catalogue might include the following:
description,
column details,
parameter details,
table of values.
The table of values is the table which constitutes the bulk of the catalogue. The remaining items are metadata. The description is a free-text description of the catalogue intended to be read by a human. The column details describe each of the columns. For each column, in addition to a name, various other items are specified, including the physical units, data type (int, float, double, etc.) and a brief description. Parameters are single items of information which pertain to the entire catalogue. At least conceptually they can be thought of as additional columns which have the same value for every row. In addition to a name and value they have other details which are similar to those of columns.
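The abstract components listed above can be sketched as a small data model. This is an illustrative sketch only: the class and attribute names (`Column`, `Parameter`, `Catalogue`, `expanded_row`) are assumptions made for this example and do not come from any standard. It shows how parameters can be treated, conceptually, as extra columns holding the same value in every row.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Column:
    """Metadata describing one column (attribute names are hypothetical)."""
    name: str
    datatype: str                  # e.g. "int", "float", "double", "char"
    unit: Optional[str] = None     # physical units, if any
    description: str = ""


@dataclass
class Parameter(Column):
    """Column-like details plus a single value for the whole catalogue."""
    value: Any = None


@dataclass
class Catalogue:
    description: str                                   # free text for humans
    columns: List[Column]
    parameters: List[Parameter] = field(default_factory=list)
    rows: List[tuple] = field(default_factory=list)    # the table of values

    def expanded_row(self, index: int) -> dict:
        """Return one row with the parameters appended as constant columns."""
        row = dict(zip((c.name for c in self.columns), self.rows[index]))
        row.update({p.name: p.value for p in self.parameters})
        return row


cat = Catalogue(
    description="Simple example catalogue.",
    columns=[Column("Star-Name", "char"), Column("V", "float", unit="mag")],
    parameters=[Parameter("AUTHOR", "char", value="J. Smith.")],
    rows=[("Procyon", 0.34), ("Vega", 0.03)],
)
print(cat.expanded_row(0))
```

Every row returned by `expanded_row` carries the same `AUTHOR` value, which is the sense in which a parameter behaves like a constant column.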
<?xml version="1.0"?>
<!DOCTYPE VOTABLE SYSTEM "http://us-vo.org/xml/VOTable.dtd">
<VOTABLE version="1.0">
  <DEFINITIONS>
    <COOSYS ID="catCelCoord" equinox="2000.0" epoch="2000.0" system="eq_FK5"/>
  </DEFINITIONS>
  <RESOURCE name="example">
    <DESCRIPTION>Simple example catalogue.</DESCRIPTION>
    <PARAM ID="AUTHOR" name="AUTHOR" datatype="char"
           arraysize="*"
           value="J. Smith.">
    </PARAM>
    <PARAM ID="INSTITUTE" name="INSTITUTE" datatype="char"
           arraysize="*"
           value="Mount Pumpkin Observatory.">
    </PARAM>
    <TABLE name="Stars">
      <FIELD ID="Star-Name" name="Star-Name" datatype="char"
             arraysize="10">
        <DESCRIPTION>Star name.</DESCRIPTION>
      </FIELD>
      <FIELD ID="RA" name="RA" datatype="float"
             unit="Hours" ref="catCelCoord" ucd="POS_EQ_RA">
        <DESCRIPTION>Right Ascension.</DESCRIPTION>
      </FIELD>
      <FIELD ID="DEC" name="DEC" datatype="float"
             unit="Degrees" ref="catCelCoord" ucd="POS_EQ_DEC">
        <DESCRIPTION>Declination.</DESCRIPTION>
      </FIELD>
      <FIELD ID="V" name="V" datatype="float" unit="Mag">
        <DESCRIPTION>V magnitude.</DESCRIPTION>
      </FIELD>
      <DATA>
        <TABLEDATA>
          <TR>
            <TD>Procyon</TD><TD>7.655</TD><TD>5.225</TD><TD>0.34</TD>
          </TR>
          <TR>
            <TD>Vega</TD><TD>18.616</TD><TD>38.784</TD><TD>0.03</TD>
          </TR>
          <TR>
            <TD>Sirius</TD><TD>6.753</TD><TD>-16.716</TD><TD>-1.47</TD>
          </TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
Figure 2.
The VOTable format is fully described by Williams et al. [2]. Here it will be described informally using a simple example. Figure 2 shows a simple VOTable in which the table consists of three rows and four columns. The first thing to note is that version 1.0 of the VOTable is defined using a DTD. We expect future versions to be defined using an XML schema.
The <DEFINITIONS> element includes the definition of the celestial coordinate system used in the VOTable. A VOTable can contain several catalogues, each of which is enclosed in a <RESOURCE> element; Figure 2 contains only one catalogue. The elements within each <RESOURCE> correspond broadly to the abstract components of an astronomical catalogue described in the previous section. The <RESOURCE> element starts with a <DESCRIPTION> which contains a free-text description of the catalogue intended to be read by a human. The <DESCRIPTION> tag is optional and can occur within many of the other elements. The <PARAM> tags specify parameters which pertain to the entire catalogue, here the author and the author's home institution. In this example both parameters are character strings, though in a real catalogue some are likely to be numeric.
The <TABLE> element contains the column definitions and associated tabular data. Each column is defined by a <FIELD> tag, with its various details being specified as attributes. The meanings of the various attributes are fairly obvious, at least for the purposes of this example. However, the ucd attribute requires explanation. The Unified Content Descriptor (UCD) [11] is a standard classification of columns developed by the CDS. It allows columns listing similar quantities in different catalogues to be identified automatically. Though columns usually contain scalar values they can contain vectors; for example, column Star-Name is a character vector column.
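As a rough illustration of how the ucd attribute lets software identify columns without relying on their names, the sketch below reads the <FIELD> elements with Python's standard `xml.etree.ElementTree` module. The function names `columns_by_ucd` and `vector_columns` are hypothetical, the embedded document is a cut-down version of the example, and treating any <FIELD> with an arraysize attribute as a vector column is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the example VOTable: column definitions only.
VOTABLE = """<VOTABLE version="1.0">
  <RESOURCE name="example">
    <TABLE name="Stars">
      <FIELD ID="Star-Name" name="Star-Name" datatype="char" arraysize="10"/>
      <FIELD ID="RA" name="RA" datatype="float" unit="Hours" ucd="POS_EQ_RA"/>
      <FIELD ID="DEC" name="DEC" datatype="float" unit="Degrees" ucd="POS_EQ_DEC"/>
      <FIELD ID="V" name="V" datatype="float" unit="Mag"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def columns_by_ucd(votable_xml):
    """Map each UCD found in the document to the name of its column."""
    root = ET.fromstring(votable_xml)
    return {
        f.get("ucd"): f.get("name")
        for f in root.iter("FIELD")
        if f.get("ucd") is not None
    }


def vector_columns(votable_xml):
    """List columns declared with an arraysize (assumed here to mean 'vector')."""
    root = ET.fromstring(votable_xml)
    return [f.get("name") for f in root.iter("FIELD") if f.get("arraysize")]


ucd_index = columns_by_ucd(VOTABLE)
print(ucd_index["POS_EQ_RA"], ucd_index["POS_EQ_DEC"])
print(vector_columns(VOTABLE))
```

An application could use such an index to find the coordinate columns of any catalogue, whatever the columns happen to be called.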
The <DATA> element holds the body of the table. In the example the table is represented using a <TABLEDATA> element and the columns and rows of the table are expressed inside <TABLEDATA> using entirely standard XML. The order in which the <FIELD> elements occur (top to bottom) corresponds to the order in which columns appear in every row of the table (left to right). The <TABLEDATA> element has the advantage of being entirely standard XML, but the character representation of numbers and the tags defining the table mean that it is a very inefficient way of representing tables, particularly ones which are mostly numeric. There are two alternatives to <TABLEDATA> which offer a more efficient representation: <FITS> and <BINARY>. The <FITS> tag allows the table to be held as a binary table in a separate FITS file. In this case the VOTable is acting as an XML wrapper for a standard FITS file. A <STREAM> tag within the <FITS> element gives the location of the FITS file as either a local file specifier or a URL. For example:
<FITS extnum="2"> <STREAM href="ftp://archive.cacr.caltech.edu/myfile.fit"/> </FITS>
(FITS files can contain several binary tables and other components; briefly, the extnum attribute specifies the one required.) FITS files contain metadata of their own. The VOTable specification imposes no requirement that the VOTable and FITS metadata be kept in step, nor does it prescribe which metadata are preferred if they differ. The <BINARY> element allows the table to be represented as a simple binary stream of bytes which may be in a separate file or encoded and included in the XML file. Again the <STREAM> tag gives the details. For example:
<BINARY> <STREAM file="file://usr/home/me/myfile.dat"/> </BINARY>
<FITS> and <BINARY> data may optionally be compressed with standard utilities such as gzip, again with <STREAM> giving the details. Finally, there are a couple of tensions between catalogues and XML which are worth mentioning. Firstly, catalogue data, being tabular, are inherently flat, whereas XML is inherently hierarchical. Secondly, and perhaps more importantly, the simple XML representation of columns would be to have bespoke tags for each column. So, for example, a row in a catalogue which has columns RA, DEC and V magnitude might be represented:
<ROW> <RA>7.655</RA><DEC>5.225</DEC><V>0.34</V> </ROW>
In an astronomical context where every catalogue has different columns such a scheme would necessitate having a separate DTD for every catalogue. Instead, in the VOTable we chose to abstract all the column details into the <FIELD> element and to use the standard <TABLEDATA> element for representing all tables in XML. This arrangement made it feasible to have a single DTD and, perhaps, made the introduction of the <FITS> and <BINARY> tags as alternative representations of tables more feasible.
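The practical consequence of abstracting the column details into <FIELD> is that one generic reader can handle any catalogue. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `CONVERTERS` mapping and the `read_rows` function are assumptions for illustration, cover only a few datatypes, and treat an empty cell as a missing value, which is a simplification rather than the rule laid down in the standard.

```python
import xml.etree.ElementTree as ET

# Illustrative, incomplete mapping from VOTable datatype names to Python types.
CONVERTERS = {"char": str, "int": int, "float": float, "double": float}

VOTABLE = """<VOTABLE version="1.0">
  <RESOURCE name="example">
    <TABLE name="Stars">
      <FIELD name="Star-Name" datatype="char" arraysize="10"/>
      <FIELD name="RA" datatype="float" unit="Hours"/>
      <FIELD name="DEC" datatype="float" unit="Degrees"/>
      <FIELD name="V" datatype="float" unit="Mag"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>Procyon</TD><TD>7.655</TD><TD>5.225</TD><TD>0.34</TD></TR>
          <TR><TD>Vega</TD><TD>18.616</TD><TD>38.784</TD><TD>0.03</TD></TR>
          <TR><TD>Sirius</TD><TD>6.753</TD><TD>-16.716</TD><TD>-1.47</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def read_rows(votable_xml):
    """Return the first table as a list of dicts keyed by column name."""
    root = ET.fromstring(votable_xml)
    table = root.find("./RESOURCE/TABLE")
    # FIELD order (top to bottom) gives the TD order (left to right).
    fields = [
        (f.get("name"), CONVERTERS[f.get("datatype")])
        for f in table.findall("FIELD")
    ]
    rows = []
    for tr in table.iterfind("./DATA/TABLEDATA/TR"):
        cells = [td.text for td in tr.findall("TD")]
        rows.append({
            name: (convert(text) if text else None)   # empty cell -> missing
            for (name, convert), text in zip(fields, cells)
        })
    return rows


rows = read_rows(VOTABLE)
print(rows[0])
```

Nothing in `read_rows` refers to RA, DEC or V by name, so the same code would read a catalogue with entirely different columns. With bespoke per-column tags that would not be possible without a per-catalogue DTD.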
Version 1.0 of the VOTable format was released in April 2002 and we anticipate that enhanced versions will be released in the future. This section briefly discusses some of the open issues which might be addressed in future versions of the format.
The FITS format is widespread in astronomy and is likely to remain so. Consequently, it is important to be able to continue to access FITS files. The VOTable was deliberately designed so that, with a few minor exceptions that are unlikely to be important in practice, any FITS binary table can be converted to a VOTable without loss of information. The converse, however, is not true.
One respect in which the two formats differ is that the VOTable is a streaming format whereas FITS tables are not. The reason is that in a FITS table an item of metadata (a 'keyword' in FITS jargon) specifying the number of rows in the table must be included in header information at the front of the table. Consequently a server transmitting a FITS table must have access to the complete table, so it can insert the number of rows in the header before it can start transmitting the table. Conversely, the VOTable format contains no knowledge of the number of rows in a table. Consequently a server can start transmitting a VOTable whilst the query creating the selection from which the VOTable is generated is still in progress. It remains to be seen whether this distinction proves important in practice.
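The difference can be sketched with two Python generators. This is a toy model under stated assumptions: `stream_votable`, `fits_style` and `running_query` are hypothetical names, the FITS side emits only a single simplified header card, and NAXIS2 is used as the row-count keyword because that is the keyword FITS table extensions use for the number of rows.

```python
from xml.sax.saxutils import escape


def stream_votable(rows):
    """Yield a VOTable piece by piece; no row count is needed up front."""
    yield '<VOTABLE version="1.0"><RESOURCE><TABLE>'
    yield '<FIELD name="Star-Name" datatype="char" arraysize="*"/>'
    yield '<FIELD name="V" datatype="float"/>'
    yield "<DATA><TABLEDATA>"
    for name, v in rows:                 # rows may still be arriving
        yield f"<TR><TD>{escape(name)}</TD><TD>{v}</TD></TR>"
    yield "</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"


def fits_style(rows):
    """Toy FITS-like writer: the header must state the number of rows."""
    buffered = list(rows)                # forces the whole selection to finish
    yield f"{'NAXIS2':<8}= {len(buffered):>20}"   # simplified header card
    for row in buffered:
        yield repr(row)                  # placeholder for the binary row data


consumed = []


def running_query():
    """Stand-in for a query in progress; records each row as it is produced."""
    for row in [("Procyon", 0.34), ("Vega", 0.03), ("Sirius", -1.47)]:
        consumed.append(row)
        yield row


votable_chunks = stream_votable(running_query())
first_chunk = next(votable_chunks)       # available before any row exists
rows_seen_by_votable = len(consumed)

fits_chunks = fits_style(running_query())
first_card = next(fits_chunks)           # only available once all rows exist
rows_seen_by_fits = len(consumed)

print(rows_seen_by_votable, rows_seen_by_fits)
```

The first piece of the VOTable is produced before the query has yielded anything, whereas the FITS-style writer has to drain the query before it can emit its first header card.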
The VOTable was invented as an interchange format for selections extracted from a catalogue and returned to a remote client via the Internet. However, experience with other formats, most notably FITS but also TSV, indicates that VOTable will probably also be used as a storage format (FITS began as an interchange format; remember that the acronym stands for 'Flexible Image Transport System'). Many data archives now use FITS as their storage format.
Many astronomical tables are large and the subsets extracted from them as a result of selections can also be large. Representing a table as a sequence of characters using the <TABLEDATA> element increases its size enormously compared to a binary representation. This increase in size is acceptable for small tables, but for large ones can be a serious problem. Moreover, the problem appears irrespective of whether the VOTable is being used as a transport format (there are more bytes to move) or as a storage format (the files are larger). The mechanisms invented in the VOTable to circumvent this problem, the <FITS> and <BINARY> tags, are something of a compromise and not really in the spirit of XML. Also, if the table is stored as a separate file in FITS or binary format then there is always the possibility that the files will become separated and one or the other will be lost. In the future we are likely to develop XML-based formats for retrieving 'bulk data' such as images and spectra and in these cases having an efficient representation of binary numeric data is even more important than it is for catalogues. Consequently, it would be very beneficial if an efficient method of representing binary data could be added to XML.
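The size penalty is easy to demonstrate on the nine numeric values of the example table. The sketch below is illustrative only: it assumes 4-byte big-endian floats for the binary form and base64 as the text encoding needed to embed binary data inside an XML file, and it ignores compression.

```python
import base64
import struct

# The numeric cells (RA, DEC, V) of the three example rows.
values = [7.655, 5.225, 0.34, 18.616, 38.784, 0.03, 6.753, -16.716, -1.47]
rows = [values[i:i + 3] for i in range(0, len(values), 3)]

# Character representation: one <TD> per value, one <TR> per row.
as_text = "".join(
    "<TR>" + "".join(f"<TD>{v}</TD>" for v in row) + "</TR>" for row in rows
).encode("ascii")

# Binary representation: 4-byte big-endian floats (an assumption), no markup.
as_binary = struct.pack(f">{len(values)}f", *values)

# Binary data included in the XML file must be text-encoded; base64 assumed.
as_base64 = base64.b64encode(as_binary)

print(len(as_text), len(as_binary), len(as_base64))
```

Even after the base64 overhead the encoded binary form is considerably smaller than the tagged character form, and a binary stream held in a separate file avoids that overhead too.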
An appendix to the VOTable standard described a <LINK> element, which was not part of version 1.0 but which might be included in future revisions of the standard. This element allows columns to be created (projected in the jargon of relational databases) on-the-fly when a table is read. As an example, suppose that a catalogue contained the column Star-Name and that in some external resource further details about the stars in the catalogue were available with the names listed in column Star-Name as the key or identifier. Then the <TABLE> element of the VOTable could contain a <LINK> tag of the form:
<LINK href="http://us-vo.org/lookup?Star=${Star-Name}"/>
When the table was read an additional column would be projected on-the-fly with ${Star-Name} substituted by the names of the stars. For the example in Figure 2 this column would become:
http://us-vo.org/lookup?Star=Procyon
http://us-vo.org/lookup?Star=Vega
http://us-vo.org/lookup?Star=Sirius
An application reading the table could use these URLs to access the remote data. This example is just one use of a potentially powerful facility.
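A reader might implement the projection along the following lines. This is a sketch, not the behaviour prescribed by the appendix: the function name `project_link_column` is hypothetical, and percent-encoding the substituted value with `urllib.parse.quote` is an assumption added here so that names containing spaces would still give usable URLs.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")   # matches ${column-name}


def project_link_column(href_template, column_names, rows):
    """Expand every ${column} placeholder in the template once per row."""
    urls = []
    for row in rows:
        record = dict(zip(column_names, row))
        urls.append(
            PLACEHOLDER.sub(
                lambda m: quote(str(record[m.group(1)])), href_template
            )
        )
    return urls


urls = project_link_column(
    "http://us-vo.org/lookup?Star=${Star-Name}",
    ["Star-Name", "V"],
    [("Procyon", 0.34), ("Vega", 0.03), ("Sirius", -1.47)],
)
for url in urls:
    print(url)
```

A regular expression is used rather than `string.Template` because column names such as Star-Name contain a hyphen, which `string.Template` does not accept in a placeholder name.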
Although the VOTable is primarily intended for representing catalogues of astronomical objects, it should also be capable of representing other sorts of astronomical tables, such as atomic line lists, X-ray event lists and the tables encountered in related disciplines such as Solar Physics and Solar-Terrestrial Physics. The VOTable is sufficiently flexible to handle these requirements, but for Solar and Solar-Terrestrial work the additional coordinate systems used in these disciplines will need to be supported. In version 1.0 the coordinate systems allowed are specified in the DTD, which makes adding new coordinate systems a major revision of the format.
Currently the <DESCRIPTION> element can contain only plain text. Future versions should be able to include HTML, or a subset thereof, with hyper-links to external URLs.
The VOTable, FITS and similar formats such as TSV are basically syntactic rather than semantic standards. That is, they are mostly concerned with how to represent items of information and do not ascribe meaning to particular items. Consequently, columns and parameters representing similar quantities appear with different names in different catalogues. For the level of inter-operability envisaged for the VO, with automatic identification of catalogues relevant to some query, it will be necessary to assign standard quantities with agreed meanings to catalogues. The CDS's UCDs for classifying columns are a very important step in this regard. At a higher level, astronomy also has a thesaurus of agreed terms [12]. However, there is nothing equivalent to the biologist's taxonomy of organisms or the chemist's nomenclature for compounds. There are similar problems with the units in which quantities are stored in catalogues. Though there are recommended standard units [13], a wider range of unstandardised units is encountered in practice. These deficiencies have ramifications in the wider VO, beyond the VOTable, and need to be addressed if the VO is to function as envisaged.
This paper has briefly summarised the VOTable XML format for representing astronomical catalogues. XML has proven a powerful and flexible tool which is more than capable of representing an astronomical catalogue. The document specifying version 1.0 of the VOTable format had ten authors in five different countries. A much larger group of people contributed to the discussions which resulted in the standard. Nonetheless it was possible to reach agreement relatively quickly. One reason was that a determined effort was made to progress the discussions and keep them focussed on a VO format for tabular data. However another factor was doubtless that the format was not being created ab ovo, but rather was being defined in terms of XML, a familiar and well-specified computer-industry standard.
Since version 1.0 was finalised in April 2002 a number of archive centres have started offering catalogues in the VOTable format and several applications have been developed which can read it. Many of these applications use standard XML parsers and tools. The availability of these tools speeds development by reducing the amount of code which has to be written.
There are, however, still a number of open issues. Some, such as agreeing semantic standards which assign a particular meaning to given items of information, are purely astronomical problems. Others are potentially relevant to the wider XML community. In particular the VOTable would greatly benefit from an efficient way of representing binary data in XML.
We are grateful to P.F. Ortiz, C.G. Page and M.B. Taylor for useful comments on the draft version of this paper. Any mistakes, of course, remain our own.
[1] IVOA, see URL: http://www.ivoa.net/
[2] VOTable: A Proposed XML Format for Astronomical Tables, version 1.0, R. Williams, F. Ochsenbein, C. Davenhall, D. Durand, P. Fernique, D. Giaretta, R. Hanisch, T. McGlynn, A. Szalay and A. Wicenec, 15 April 2002. See URL: http://vizier.u-strasbg.fr/doc/VOTable/
[3] The original FITS format is described in D.C. Wells, E.W. Greisen and R.H. Harten, 1981, Astron. Astrophys. Suppl, 44, pp363-370. The FITS format is now maintained and documented by the FITS Support Office of the Astrophysics Data Facility at the NASA Goddard Space Flight Center, see URL: http://fits.gsfc.nasa.gov/fits_home.html. Though FITS is basically an astronomical format it is sometimes mentioned in books about standard image formats. See, for example, D.C. Kay and J.R. Levine, 1995, Graphics File Formats, second edition (Windcrest/McGraw-Hill: New York), in particular Chapter 18, pp235-244.
[4] R.H. Harten, P. Grosbøl, E.W. Greisen and D.C. Wells, 1988, Astron. Astrophys. Suppl, 73, pp365-372.
[6] F. Ochsenbein, M. Albrecht, A. Brighton, P. Fernique, D. Guillaume, R.J. Hanisch, E. Shaya and A. Wicenec, 2000, in Astronomical Data Analysis Software and Systems IX, eds, N. Manset, C. Veillet and D. Crabtree (Astron. Soc. Pacific: San Francisco), Astron. Soc. Pacific Conference Series 216, pp83-86.
[7] K. Blackburn, A. Lazzarini, T. Prince and R. Williams, 1999, in HPCN'99 (Amsterdam), p513. See URL: http://citeseer.nj.nec.com/blackburn99xsil.html
[8] A.C. Davenhall, 2000, SSN/75.1: Writing Catalogue and Image Servers for GAIA and CURSA (Starlink). Note that in SSN/75 the TSV format is called the TST (Tab-Separated Table) format.
[9] A.C. Davenhall, 2001, SUN/190.10: CURSA: Catalogue and Table Manipulation Applications (Starlink). See also URL: http://www.starlink.rl.ac.uk/cursa/
[10] D. Hoffleit and C. Jachek, 1982, Bright Star Catalogue, Fourth Edition (Yale Univ. Observatory: New Haven, Connecticut).
[11] UCD, see URL: http://vizier.u-strasbg.fr/UCD/
[12] IAU (International Astronomical Union) Astronomy Thesaurus, see URL: http://msowww.anu.edu.au/library/thesaurus/
[13] The CDS has a list of standard units at URL: http://vizier.u-strasbg.fr/doc/catstd-3.2.htx and the IAU has one at URL: http://www.iau.org/IAU/Activities/nomenclature/units.html. The latter is taken from G.A. Wilkins, 1989, The IAU Style Manual, bound in D. McNally (ed.), 1990, Transactions of the IAU, XXB (Kluwer: Dordrecht). See also E.W. Greisen and M.R. Calabretta, 2002, Astron. Astrophys, 395, pp1061-1075 and in particular Tables 4-6, pp1070-1071.
AVO: Astrophysical Virtual Observatory
CDS: Centre de Données astronomiques de Strasbourg
CURSA: Catalogue Utilities for Reporting, Selecting and Arithmetic
Dec.: Declination
FITS: Flexible Image Transport System
IVOA: International Virtual Observatory Alliance
NVO: National Virtual Observatory
R.A.: Right Ascension
SOAP: Simple Object Access Protocol
TSV: Tab-Separated Value
UCD: Unified Content Descriptor
VO: Virtual Observatory
XSIL: eXtensible Scientific Interchange Language