XML Europe 2002

Filing and Processing Patent Data Using XML - A World Standard

Abstract

After close consultation with member States of the Patent Cooperation Treaty (PCT) Union, and taking into account active participation from the European Patent Office (EPO), Japan Patent Office (JPO) and United States Patent and Trademark Office (USPTO), the World Intellectual Property Organization (WIPO) has finalized instructions and standards to implement the electronic filing, processing, and storage of international applications for patents. The standards are intended to allow applicants to file an international patent application (E-PCT application) in electronic form which is acceptable to those Patent Offices around the world which have agreed to accept electronic filing. An important part of the standards is a set of XML DTDs to support E-PCT applications - including the authoring of patents using XML. In addition it is intended that national Patent Offices will use the standards as the basis for their own national electronic applications for patents. For complete details see: http://pcteasy.wipo.int/efiling_standards/EFPage.htm

Table of Contents

1. Introduction
2. The DTDs
3. Packaging the data
4. Procedural Data
5. The Application
5.1. The application DTD
5.2. Authoring the patent application
6. A note on the PCT e-filing DTD architecture
Acknowledgements
Biography

1. Introduction

The patent world may be an esoteric and 'hidden' world as far as data processing and publication go, but in terms of sheer volume of data processed the large patent offices can have few rivals. These offices include the European Patent Office (EPO), Japan Patent Office (JPO), United States Patent and Trademark Office (USPTO) and World Intellectual Property Organization (WIPO), which between them handle nearly 1 million patent applications per year; that is, they process approximately 500 000 pages of text and images per week. However, with the exception of the JPO, the vast majority of data still arrives on paper and, less commonly, may later be exchanged between patent offices, again on paper. It is an important priority for the major patent offices to move towards full electronic processing of patent data from start to finish.

The EPO and USPTO have fledgling online filing systems (for filing patent applications) in place now and WIPO are in the process of designing one. The JPO have had such a system since 1990 - being the first patent office in the world to accept online filing. The processing of this data, generally speaking, entails the data capture of procedural data (names, addresses, fee data, title of invention, etc.) and application data (abstract, description, claims and drawings). The latter data often involves costly data processing, capture and storage. If you do not know anything about patents, a reasonable analogy is a scientific periodical article which, on average, consists of 35 pages of text and images - it may contain tables, chemical and mathematical formulae, figures and a wide range of characters. For examples search: http://ep.espacenet.com

In order to be used in, for example, search databases all this data has to be scanned, converted (generally using optical character recognition (OCR)) and marked up - the EPO and USPTO already have very large full text databases based on SGML mark up. Without going into detail it is quite possible for the same data to be processed several times. All the offices mentioned use different computer systems, processes and software, which is not surprising, but more importantly (or critically) they all use different methods of tagging shared data.

Therefore, there was a real need to standardise data exchange, especially for international applications, both between the applicant and patent offices and from patent office to patent office. Over the last 18 months WIPO have, with the close cooperation of Trilateral (EPO, JPO, USPTO) Work Groups, started to develop a suite of legal rules, regulations and DTDs for the electronic filing and processing of international applications (see: http://pcteasy.wipo.int/efiling_standards/EFPage.htm ) - known as the E-PCT or Annex F standard. With the gradual implementation of the standard it is hoped that the enormous patent paper mountain, data re-entry and duplication of effort will be reduced. Although the standard is primarily aimed at international applications filed and processed under the Patent Cooperation Treaty (PCT), it is hoped that it will also be used by patent offices other than WIPO - this is certainly the case for the EPO, JPO and USPTO, all of whom have made a commitment to adopt, as far as possible, the standard discussed in this paper. A major part of the effort has been the creation of several XML DTDs - this paper gives some details about these DTDs and how they will be used in the near future.

2. The DTDs

So far 11 DTDs have been published in Annex F - on the WIPO web site: http://pcteasy.wipo.int/efiling_standards/schemaDocs/schemaDocs.htm. It is not the purpose of this paper to describe all these DTDs in detail but to give a general overview of how they are intended to be used. All the DTDs are based on several years' work analysing the requirements for online filing (legally and technically), the procedures, forms and methods currently in place in WIPO, and patent documentation in general. Where possible, existing standards have been used or referenced; for example, all the DTDs must conform to W3C XML version 1.0. There are already more draft DTDs on the WIPO web site cited above and work will continue over the next few years to build a complete suite of required DTDs. The DTDs so far published split into three main groups:

  • for packaging the data for online filing

  • for procedural data such as names and addresses

  • for "documents", that is, patent applications

Each one of these groups is described below.

3. Packaging the data

When an applicant wants to file a patent online, or a patent office wants to send data to another office, there is a requirement for a very high degree of security. Once the applicant has written the application and, usually, entered various data into an online form (for example see the EPO's online filing system at http://www.epoline.org/ and for the USPTO: http://www.uspto.gov/ebc/efs/index.html), the complete data has to be "packaged" for online transmission. To reach a high level of security, solutions implemented under the standard must satisfy the following four basic criteria for electronic data exchange:

(a) authentication – the process of validating an identity claimed by or for an entity;

(b) integrity – ability to verify that data is unchanged from its source and has not been accidentally or maliciously modified, altered, or destroyed;

(c) non-repudiation – ensure that strong and substantial evidence is available to the sender of data that the data has been delivered (with the cooperation of the recipient), and to the recipient of the sender’s identity, sufficient to prevent either from successfully denying having possessed the data; this includes the ability of a third party to verify the integrity and origin of the data;

(d) confidentiality – ensure that information can be read only by authorized entities.

The standard supports, in particular, a solution relying on a public key infrastructure (PKI) for authentication and data security in the Internet environment. However, it also envisages that there may in the future be other solutions which satisfy the above four security criteria.
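Criterion (b), integrity, can be illustrated on its own with a message digest. The following is a minimal sketch only, assuming SHA-256 as the digest algorithm; this paper does not state which algorithms the standard prescribes, and the function names are invented for illustration. In a PKI-based solution the digest would additionally be signed with the sender's private key, which is what contributes to authentication and non-repudiation; that step is not shown here.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return a hex digest of the data (SHA-256 is an assumption, not the standard's choice)."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_digest: str) -> bool:
    """Integrity check: recompute the digest and compare it with the one sent."""
    return hmac.compare_digest(digest(data), expected_digest)


# The sender computes the digest before transmission ...
original = b"example application text"
sent_digest = digest(original)

# ... and the recipient recomputes it on arrival.
received_ok = is_unchanged(original, sent_digest)
received_tampered = is_unchanged(original + b" (altered)", sent_digest)
```

A digest alone only detects modification; it says nothing about who sent the data, which is why the standard looks to a PKI for the remaining criteria.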

In order to achieve this level of security a packaging structure has been devised which is based on XML coding. Electronic international application submissions will contain many different types of documents and information. Text, images, and sequence listings can all be printed on paper, but each of these requires a different electronic representation. For example, text can be stored in “character codes,” while images can be stored in grids of picture elements called “bitmaps.” The concept is further complicated by the fact that most information can be stored in multiple electronic formats. Printed text can be optically scanned and stored as an image. The following figure gives an overview of all the types of data which could be contained in an XML package:

[Figure: overview of the types of data that can be contained in an XML package]

As can be seen in the figure above, an application will generally consist of several files; it is useful to assemble these files together into a single electronic “package” for transmission. Two package types are included under the standard: non-PKI and PKI-based packages. The wrapped application documents file (“WAD”) is a non-PKI package. The two forms of PKI-based packages are a wrapped and signed package (“WASP”) and a signed and encrypted package (“SEP”). All electronic document exchange files under the standard must first be packaged as a WAD. WAD, WASP, and SEP package types are permitted in the Applicant-to-Office sector, while only WASP or SEP package types are permitted in the Office-to-Office sector.
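The rules on which package type may be used in which sector can be written down as a small lookup. This is a sketch for illustration only: the class and function names are invented here, and only the three package types and the two sectors come from the text above.

```python
from enum import Enum


class PackageType(Enum):
    WAD = "wrapped application documents file"  # non-PKI
    WASP = "wrapped and signed package"         # PKI-based
    SEP = "signed and encrypted package"        # PKI-based


class Sector(Enum):
    APPLICANT_TO_OFFICE = "applicant-to-office"
    OFFICE_TO_OFFICE = "office-to-office"


# Package types permitted per sector, as described in the text.
PERMITTED = {
    Sector.APPLICANT_TO_OFFICE: {PackageType.WAD, PackageType.WASP, PackageType.SEP},
    Sector.OFFICE_TO_OFFICE: {PackageType.WASP, PackageType.SEP},
}


def is_permitted(package: PackageType, sector: Sector) -> bool:
    """Return True if the package type may be used in the given sector."""
    return package in PERMITTED[sector]


def uses_pki(package: PackageType) -> bool:
    """WAD is the only non-PKI package type."""
    return package is not PackageType.WAD
```

For example, `is_permitted(PackageType.WAD, Sector.OFFICE_TO_OFFICE)` is false, reflecting that an unsigned WAD may not be exchanged between offices.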

The DTD packaging the bulk of the data is called "package-data". This DTD references all the main contents (and other DTDs) of a patent application and also contains important data such as the electronic signature and the package type of the wrapped document. Here is the layout of the DTD:

[Figure: layout of the package-data DTD]

The E-PCT standard does, of course, go into a lot more detail regarding submission protocols, formats, etc. It is a complicated area and cannot be covered in this paper which is more concerned with data content than transmission protocols.

4. Procedural Data

To file a patent it is necessary, in most cases, to fill in a request form with details of name, address, inventor details, title of invention, etc. This data entry can be highly structured by using, for example, data entry forms, and this is in fact the case for all four big offices. Some WIPO applicants have been using software - called PCT-EASY - for several years to fill in an application form, store it on diskette, and then send in this diskette together with their paper application. The data is stored with SGML markup added and this is then fed into WIPO's procedural database without the need for data entry operators to re-type the data.

Therefore, converting this type of data to XML is relatively straightforward and the move to online filing equally so - the same software and data entry 'panels' being re-used with minimum re-configuration. As far as possible this data can then be transferred to other offices as the application moves around the patent world. However, as elsewhere, different tag sets have been used by each office for the same data. With the adoption of the E-PCT standard it is hoped that these differences will be greatly reduced. Some procedural data is relevant only to a specific office. In this case the office-specific tag can be preceded with a so-called country code, and the DTD must, of course, be transmitted to any receiving office with an explanation of the unique data element. However, under a new Patent Law Treaty (PLT) each national patent office will also have to use the E-PCT request.dtd for national patent applications. Therefore, the request.dtd will become not only an E-PCT application standard but also a standard for applications in the rest of the world. This is the structure of the request DTD:

[Figure: structure of the request DTD]
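The country-code convention for office-specific tags mentioned above might be handled along the following lines. This is a hedged sketch, not the rule as written in the standard: the hyphen separator, the lower-case form, the set of codes and the element name `example-field` are all assumptions made for illustration.

```python
# Illustrative subset of office codes; the real list is not given in this paper.
KNOWN_OFFICE_CODES = {"ep", "jp", "us"}


def office_specific_tag(country_code: str, name: str) -> str:
    """Build an office-specific element name by prefixing a country code.

    The "code-name" form is an assumption, not taken from the standard.
    """
    code = country_code.lower()
    if code not in KNOWN_OFFICE_CODES:
        raise ValueError(f"unknown office code: {country_code!r}")
    return f"{code}-{name}"


def owning_office(tag: str):
    """Return the office code if the tag carries a known prefix, else None."""
    prefix, separator, _ = tag.partition("-")
    if separator and prefix in KNOWN_OFFICE_CODES:
        return prefix
    return None


tag = office_specific_tag("US", "example-field")
```

A receiving office could use a check like `owning_office` to separate shared elements from those it needs the sender's explanation for.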

The advantage of a standard set of data elements in this area is also important to companies and agents who file a lot of patents since many of them have their own patent management systems where procedural data is stored and they can then transfer or convert that data to the standard data elements ready for online filing.

5. The Application

When an inventor/applicant writes up their invention, ready for filing with a patent office, they will almost certainly use a word processor. To file that patent they will, in the case of the EPO, USPTO and WIPO, print it out and send it by mail. Well over 90% of applications are filed on paper, with an average of 35 pages, so, as mentioned at the beginning, this means a lot of paper going through the mail and having to be processed by the patent offices. WIPO, for example, received 100,000 applications in 2000. Thus we have a situation where an original electronic file (created using a word processor) is turned into a paper file.

To be useful in any way at all to patent offices the paper file must be turned back into an electronic file. At the most basic this could simply be carried out by scanning the paper and storing it as a facsimile file with some indexing for retrieval (this is what happens at WIPO). To be of optimum use it should be captured with full text mark up and linked images, ideally, of course, in XML. The EPO and USPTO do capture the full text and images but they use SGML at the moment - as good as XML of course! - resulting in very efficient production chains creating several different products on various media and in various output formats; for example: internet databases, CD-ROMs, data exchange tapes and, of course, printed patents - all from the same data source in SGML. However, as one might imagine, all this costs a lot of time, effort and money when, at the birth of the document, it was already in electronic format!

In addition to applicant-to-office data transfer there is also a tremendous amount of data transfer from office-to-office. This data is almost invariably in a non-standard format. The EPO, for example, has to convert over 50 different formats for use in its central search database. Although the EPO and USPTO use SGML it is not exactly the same 'flavour' and the EPO loads and converts all USPTO filings onto their database.

The situation becomes even more complicated when one considers that the same application could be re-entered several times at different stages of its life. Here is a fairly common scenario: a US applicant types in their invention, prints it, and sends it to the USPTO. The USPTO capture the data as described already. Later the applicant decides they want wider (world-wide) protection and files with WIPO, who enter the same bibliographic data and scan the application. However, WIPO do not carry out patent searches; therefore the applicant must designate a search authority - in this example, the EPO. The EPO re-enter the bibliographic data and scan in the application and, later, capture the full text of the application. The same data has been processed at least four times. See for example US patent 5494920 A (filed 1994, published 1996); WO 96/05830 A1 (filed 1995, published 1996) and EP 0771200 B1 (filed 1995, published as a granted patent 2000) - this is the same patent; check it out on: http://ep.espacenet.com - here you can see all three documents.

In the near future we hope that another, less costly, scenario will be possible: the applicant authors in XML (validated against the E-PCT DTD) and files online with the USPTO, who then use that file for all by-products, as do, later, WIPO and the EPO - hopefully with as little interference with the original file as possible. In many cases the only difference will be the title page of the patent. If the E-PCT standard is strictly adhered to this should be possible and should, therefore, produce a higher quality patent at greatly reduced data entry costs. Authoring is discussed below.

5.1. The application DTD

The actual tag set is contained in the application-body dtd; this was constructed, as far as possible, from tags existing in the public domain and PCT rules and regulations (especially for the main parts and headings of patents); the following figure shows the overall structure:

[Figure: overall structure of the application-body DTD]

Within the description, generally the largest part of an application, there are more patent specific headings:

[Figure: patent-specific headings within the description]

The paragraph element contains more 'common' elements such as:

  • lists - <ol>, <ul>, etc.

  • bold <b>, italic <i>, etc.

  • figure references - <figref>

  • images - <img>

  • tables - the OASIS Open XML Exchange Table Model DTD, used "as is"

  • maths - the MathML2 DTD maintained by W3C, used "as is"
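To give a feel for how such inline elements sit inside a paragraph, here is a small sketch that parses one paragraph and pulls out its text and figure references. The element names <b>, <i>, <figref> and <img> are those listed above; the paragraph element name <p>, the `file` attribute and the sample sentence are assumptions made for illustration and are not taken from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph; <p> and the "file" attribute are assumed names.
SAMPLE = (
    "<p>As shown in <figref>Fig. 1</figref>, the <b>housing</b> holds "
    '<i>two</i> rollers.<img file="fig1.tif"/></p>'
)


def plain_text(paragraph: ET.Element) -> str:
    """Concatenate all character data, dropping the inline markup."""
    return "".join(paragraph.itertext())


def figure_references(paragraph: ET.Element) -> list:
    """Collect the text of every <figref> element in document order."""
    return [ref.text for ref in paragraph.iter("figref")]


paragraph = ET.fromstring(SAMPLE)
text = plain_text(paragraph)
refs = figure_references(paragraph)
inline_tags = sorted({child.tag for child in paragraph})
```

Because the markup is explicit, a search database can index the plain text while a publication system resolves the figure references and images separately.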

More specific to patents are patent citations which have their own special markup; not specific to patents are general citations to scientific literature. There is no common standard for such citations and therefore, after much research, a tag set was devised based on a number of sources: ISO 690 - Documentation - Bibliographic references - Content, form and structure, the Anglo-American cataloguing rules (AACR2), ISO 12083 - Electronic manuscript preparation and markup, DTD Association of American Publishers (AAP), Z39.59 DTD, European Working Group for SGML (EWS) MAJOUR DTD, SuperJournal Full Article DTD and others. For the data entry of patent and other citations it is envisaged that the user should have a 'pop-up' form to fill in so that there is no need to enter rather complex tagging. For the full DTD see: http://pcteasy.wipo.int/efiling_standards/schemaDocs/schemaDocs.htm.
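The 'pop-up' form idea can be sketched as a function that takes the fields a user types in and produces the markup for them. The element names used here (`patcit`, `country`, `doc-number`, `kind`) are placeholders chosen for illustration and should not be read as the tag set defined in the DTD; the sample values are those of the US patent cited earlier in this paper.

```python
import xml.etree.ElementTree as ET


def patent_citation(country: str, number: str, kind: str) -> ET.Element:
    """Turn three form fields into a citation element (element names are hypothetical)."""
    citation = ET.Element("patcit")
    ET.SubElement(citation, "country").text = country.strip().upper()
    ET.SubElement(citation, "doc-number").text = number.strip()
    ET.SubElement(citation, "kind").text = kind.strip().upper()
    return citation


# Values as a user might type them into the form, untidy spacing included.
element = patent_citation(" us", "5494920 ", "a")
markup = ET.tostring(element, encoding="unicode")
```

The user never sees the tags; the form normalises the input and the tool writes the markup.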

5.2. Authoring the patent application

Patents can be highly complex documents containing a wide character set, complex tables, images and mathematical formulae - very few, we believe, would be authored in 'pure' XML; most patent applicants and their representatives probably use MS Word®. We cannot expect that the application will be authored using an XML editor or the like. Therefore, we need a Word to XML converter which has built-in patent application templates. The USPTO have such a product, known as PASAT, and the EPO have recently started development of a similar product - both will, eventually, output to the same DTD - the application-body.dtd. It is not an easy matter - there are very few very good Word to XML tools around; added to that, the tool has to deal with a multitude of input formats such as CAD drawings in numerous image formats, MS Excel® tables, external mathematical formulae programs, etc.

The JPO and their applicants have had, since 1990, special computers and software for patent application authoring and filing and, since 1998, it has been possible to use general purpose PCs also. The output is not XML based but, nevertheless, reduces paper filing enormously (by 96%). The JPO has, with Trilateral (EPO and USPTO) partners and WIPO, committed to moving towards an XML system based on the E-PCT standards.

If the other patent offices can reach the level of online filing attained by the JPO it will be an enormous success, especially so if we all use the same DTD. We need to make a determined effort to give applicants the best tools available for complete online filing and the authoring of patent applications (with XML output). However, it is not purely a technical matter - there are legal, financial, training and marketing challenges (among others!) to overcome. Nevertheless, the E-PCT standards form an important part of meeting the business goals and we are really getting serious about XML!

6. A note on the PCT e-filing DTD architecture

by John Dunning, Consultant, WIPO

Given the large number of DTDs required for document authoring and exchange and the amount of data that will be shared and re-used, it was decided to construct the DTDs from standard components. In consideration of the ease of reuse of components and of working with XML syntax, the ability to easily generate quality documentation, and the limitations of DTDs for validation, the components were created as XML Schemas (http://www.w3.org/XML/Schema).

The approach taken in developing these schemas is what has been termed ‘the Salami Slice’ design (see “XML Schemas: Best Practices”, http://www.xfront.com/BestPracticesHomepage.html). In this method of schema construction each element definition is declared globally as a reusable component and is a valid schema in itself; parent elements include the child element components needed to complete their content models, thus making each component a possible document type. This approach allows for maximum reuse of components, and preserves element definitions across all document types (for example, an <address> element has the same definition among all the PCT electronic filing DTDs). By storing each element as a separate schema file, new schemas can be created very quickly, simply by importing pre-defined components. The repository of schema components is available to Offices via the internet, allowing an Office to use these standard components to create their own Schema and DTD resources.

Each element component that corresponds to a document type is compiled into a complete XML Schema (the <include> instructions replaced with the content of the <include>d schema), and that file is then translated into DTD syntax for release.
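The compilation step, in which each <include> is replaced by the content of the included schema, could be sketched as below. This is an illustration under stated assumptions rather than the tool actually used: the component file names and element names are invented, each component is skipped if it has already been pulled in (so that a shared component such as an address is not declared twice), and the final translation into DTD syntax is not shown.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
INCLUDE = f"{{{XS}}}include"


def component(body: str) -> str:
    """Wrap a fragment in an xs:schema element (helper for the sample data)."""
    return f'<xs:schema xmlns:xs="{XS}">{body}</xs:schema>'


# Hypothetical repository: one schema file per element component.
COMPONENTS = {
    "name.xsd": component('<xs:element name="name"/>'),
    "address.xsd": component('<xs:element name="address"/>'),
    "applicant.xsd": component(
        '<xs:include schemaLocation="name.xsd"/>'
        '<xs:include schemaLocation="address.xsd"/>'
        '<xs:element name="applicant"/>'
    ),
    "request.xsd": component(
        '<xs:include schemaLocation="applicant.xsd"/>'
        '<xs:include schemaLocation="address.xsd"/>'
        '<xs:element name="request"/>'
    ),
}


def compile_schema(name, components, seen=None):
    """Return a schema element with every include replaced by the included content."""
    seen = set() if seen is None else seen
    seen.add(name)
    root = ET.fromstring(components[name])
    compiled = ET.Element(root.tag, root.attrib)
    for child in root:
        if child.tag == INCLUDE:
            target = child.get("schemaLocation")
            if target not in seen:  # each component is inlined only once
                compiled.extend(compile_schema(target, components, seen))
        else:
            compiled.append(child)
    return compiled


compiled = compile_schema("request.xsd", COMPONENTS)
declared = [element.get("name") for element in compiled]
```

Here `request.xsd` plays the role of a document type: compiling it gathers the declarations of every component it depends on, directly or indirectly, into one self-contained schema.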

Acknowledgements

The Trilateral/WIPO XML Work Group also consists of: Mitsuru Sono and colleagues (JPO), Bruce B. Cox and Bill Stryjewski (USPTO) and John Dunning (WIPO). We are most grateful to our fellow members, especially Bruce Cox for chairing the group and John Dunning for the notes and work on DTD architecture. These standards are only made possible by the close cooperation and joint efforts of all members of the group and input from other WIPO member states.

Biography

Principal Administrator
Information Systems

Paul Brewin has been working at the European Patent Office (EPO) for 16 years - his main role has been in charge of the patent publication systems, which are SGML-based. Papers have been presented on these systems at previous conferences. He has been contributor and editor of several World Intellectual Property Organization (WIPO) standards. He is now also involved in online filing of patent applications and in particular authoring patents in XML. Over the last year he has worked with the US and Japan Patent Offices and WIPO on a suite of XML standards to be used by patent offices for electronic filing of patent applications.

Shiro Ankyu worked at the Japan Patent Office (JPO) for 10 years and is now working at the World Intellectual Property Organization (WIPO) as a systems engineer. His main job in the WIPO IT division is the establishment of standard data formats and structures, in particular, XML and DTDs.