Abstract
This paper presents an applied research project on content management that aims to create a university web portal using XML technology. The work was carried out to reorganize the obsolete HTML-based web site.
The objective of the paper is to analyse the main issues involved in migrating from HTML to XML-based content management.
The basic issues identified in creating and managing a medium-to-large university web site include homogeneous presentation, orientation, navigation and search facilities, and the availability of the resources and competences needed to keep content updated. The analysis also showed that only a minor part of the site information never changes, such as history or location; a part changes occasionally; a part changes regularly, such as courses and timetables; and another part is newly generated on a weekly basis, such as news, events, and seminar announcements.
Given this situation, a dynamic content management solution was judged effective and adopted. Issues of budget, which is low in public universities, and of scale, neither small nor as large as in big companies or universities, determined the choice of the tool for implementation and deployment. The Apache Cocoon open source XML tool was selected and its separation of concerns model exploited. Cocoon provides effective support for the needs of external and internal end users: navigating and browsing the content, checking information and publishing easily and quickly, controlling access and roles, guaranteeing security, and supporting workflow management.
The fundamental separation of content from presentation style and from logic is achieved by storing content in XML or in a database, accessing it through the Cocoon ESQL logicsheet, and presenting it with XSL stylesheets. This makes it possible to reuse the same content in different contexts by filtering out different information, to guarantee a consistent style for similar pages, and to differentiate content presentation across contexts.
XML technology is used here to represent and present information about news, teaching, research and services of the school. In addition, it is used to define and control the site navigation through a menu XML file and to share common contents. A further goal is to optimize the content maintenance process, so as to guarantee a higher-quality, up-to-date web site with useful information for both students and staff. Project development is a long, ongoing process that currently consists of the completion of some modules. In particular, the data about courses and curricula have been collected via XSP web pages and stored in a database so that they are also available to other applications already in use within the administration department. Contents are retrieved in XML format and uniformly formatted either as HTML pages or as PDF files for printing via XSL-FO and Apache FOP. The graphical design was accomplished by drawing on the expertise of the Applied Arts Department, on the basis of usability criteria and of XSL and CSS stylesheets.
Computer science university departments are expected to benefit from applying the concepts, design models, architectures, and technologies usually taught in their courses. Their value, capabilities, and competences can also be shown through a well-designed web site. "In the network economy, the web site becomes a company's primary interface to the customer" [Nielsen2000].
A relatively ambitious applied research project has been undertaken to reach this goal at SUPSI [SUPSI], one of the seven Swiss Universities of Applied Sciences, located in southern Switzerland, with about 3000 students (bachelor and postgraduate) and more than 400 employees.
This project aims to create a medium-to-large university web portal using XML technology for content management. The work was carried out to reorganize the obsolete HTML-based web site. The existing web site was composed of about two thousand static HTML pages, organized in 280 folders, developed over the previous four or five years by different people and without any consistency. The site involved dozens of potential content authors from three departments and three institutes. In this context there was a need to simplify web site management and create a uniform layout.
The initial goals of this project include the provision of internet and intranet functionalities that cover the needs of external and internal end users, such as browsing content, checking information and publishing easily and quickly, controlling access and roles, and guaranteeing security. In general, maintenance and content usefulness have been our primary concerns and requirements.
The objective of the paper is to analyse the main issues involved in the migration from the HTML to the XML web site. Identifying these issues may strongly influence the outcome of this process, in which many different variables have to be considered.
This section summarizes the current situation and trend of the Web world: both an evolution and a revolution.
Evolution: since the start of the new millennium the web has been evolving from a network of information nodes interconnected through hypertext links, as initially conceived and realized by means of the HTML language, towards new frontiers for representing and using hypermedia content, where the enormous mass of information so rapidly generated by humankind is efficiently managed. Within this context, XML (eXtensible Markup Language) [XML] emerges as the standard language for content markup, and Web Services emerge for developing interoperable distributed network applications, with standard languages for communication (SOAP) and for service description and retrieval (WSDL, UDDI).
This evolution brings benefits in information organization and discovery. From a technological point of view, it is supported by the XML formalism for representing and delivering information, and by Content Management Systems for producing web pages.
Revolution: these innovations, at the basis of the web, allow content to be used in a revolutionary, different, and deeper way, and will continue to do so. Market interests, technologies, and infrastructures have reached a level of maturity at which the need for machines able to understand and semantically process content strongly emerges. The need for the semantic web is now clear. Until now the web was oriented to humans. The new web is oriented towards machines and programs that are very fast at computations and logical deductions, which may involve a large number of pages and a large amount of information. At present, however, information is difficult to retrieve on the web, and information quality and accuracy are very low.
Solving this problem requires a change from the sterile HTML standard language used to represent web pages to a semantically rich language like XML. While the former focuses on the stylistic and formatting features of the page, the latter focuses on describing the content. We are referring here to contents which are manipulated on web servers in order to be aggregated, filtered, processed, etc., and finally presented in an updated and targeted way to the web consumer.
In this context, content can be informally defined as what is inside a container, i.e. the web page, considered separately from how it is presented: its style, colour, and position in the page.
The migration from HTML to XML requires a change from the simple editing of information pages, as in a traditional web site, to a more complex content management process that takes real user needs into consideration: navigating, publishing and checking information quickly and easily; configuring security and access control to determine who is allowed to see and publish contents; deciding how information is presented on the web site; and defining which workflow functionalities are available to the intranet user.
In summary, two strong points underlie the development carried out and provide the rationale for the migration from an HTML-based to an XML-based site:
the separation of content from style which produces data interoperability with portable and re-usable contents
the semantic markup which explicitly introduces meta information about the contents and makes possible an elaborated, more "intelligent" information treatment, setting both structural and content foundations towards a new "semantic" web where the main goal is to "Transform all information into valuable assets."
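The first of these points can be illustrated with a minimal sketch, written here in Python rather than with the XSL stylesheets used in the project. The element names and the sample course are hypothetical; the only purpose is to show a single content fragment being delivered through two different presentations, one complete and one filtered.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical content fragment: element names and values are illustrative only.
COURSE_XML = """
<course>
  <title>Web Technologies</title>
  <ects>4</ects>
  <objectives>Understand XML-based publishing.</objectives>
</course>
"""

def render_html(course: ET.Element) -> str:
    """Full view: every field, marked up for a browser."""
    return (
        f"<h1>{escape(course.findtext('title', ''))}</h1>"
        f"<p>ECTS: {escape(course.findtext('ects', ''))}</p>"
        f"<p>{escape(course.findtext('objectives', ''))}</p>"
    )

def render_summary(course: ET.Element) -> str:
    """Filtered view: only title and credits, as plain text."""
    return f"{course.findtext('title', '')} ({course.findtext('ects', '')} ECTS)"

# The content is parsed once and never modified by either presentation.
course = ET.fromstring(COURSE_XML)
print(render_html(course))
print(render_summary(course))
```

Because neither function alters the parsed content, a further presentation can be added without touching the source fragment.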
Before the migration process started, the SUPSI web site consisted of a collection of sites created and managed by different departments and institutes. As a consequence, each site had its own content and layout. This inconsistency in content and presentation led to low homogeneity and to disorientation when navigating. In addition, not all departments had the competences, resources, and time to keep content updated. These factors considerably reduced the usefulness of the site.
The solution adopted was to combine a decentralized model for maintaining the different sites, which guarantees an appropriate level of autonomy, with centralized control for unifying the style. A tool appropriate for this purpose is a content management system, in particular one that supports the XML format. This guarantees portable contents and allows different views or presentations to be associated with the same content, targeting it to different users and contexts.
This section describes the practical development process from the analysis of the HTML web site, to the architecture design, and some implementation details.
The starting point of this work was the analysis of the existing HTML web site, with the objective of identifying its main components and evaluating the content updating issue.
By analysing the web site, a number of main components or modules have been identified: teaching (didattica), research, services, general information about the school ("SUPSI briefly"), and SUPSI live.
By considering whether and how frequently content changes, it can be observed that:
a small part of the content, e.g. information about history, mission, and location in "SUPSI in breve" (SUPSI briefly), can be considered static;
a part changes regularly, e.g. courses, course timetables, and the calendar;
a part changes occasionally, e.g. the organization chart, and lecturers' details;
another part is newly generated on a weekly basis, e.g. news, events, and seminar announcements in "SUPSI live". In the existing SUPSI web site this dynamic part is the central part of the home page (see Figure 7).
In the HTML-based web site all contents are static pages, and content changes are managed through manual updating by the web master.
In the XML-based web site, a dynamic content management solution was judged effective and adopted.
The use of XML considerably simplifies dynamic content management, thanks to the availability of predefined structures and stylesheets and the possibility of defining facilities for multi-user and collaborative authoring. Within the project, taking into account the potential of XML and the size of the school's whole information system, priority has been given to the modules with more dynamic content and of greater interest to the school. Other modules will be considered and completed in subsequent phases.
In particular, the teaching module has been selected for starting the migration process. The next steps consisted of re-designing this module and choosing the supporting platform.
In the old version of the web site, the teaching module only addressed the problem of visualizing courses. It has now been enriched with a number of functionalities for managing the whole set of courses, and has been named course management in the new XML web site.
Its functionalities include:
insertion of a new course. This concerns the definition of metadata about a course which are entered by means of a validated form and are stored via XML in the central DB of the school;
definition of curricula by associating existing courses with semesters, an academic year, and a curriculum;
course and curricula updating, modification, deletion;
visualization of courses. The details of each course (syllabus) can be presented in a card in a uniform consistent layout, which can be delivered through different channels such as an HTML web page or a PDF page;
protected access.
These functionalities can be divided into internet and intranet functionalities and are available to each user according to his/her role privileges. For instance, deletion of courses is currently allowed only to the system administrator, while course visualization is open to all users.
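The role-based availability of these functionalities can be sketched as a simple lookup. The following Python fragment is an illustration under stated assumptions, not the access control mechanism used in the project: only the two rules mentioned above (deletion reserved to the system administrator, visualization open to everyone) come from the project, while the "lecturer" role and all action names are hypothetical.

```python
# Privileges per role. Only "administrator may delete" and "everyone may view"
# are taken from the text; the "lecturer" role and the action names are assumed.
PRIVILEGES = {
    "public": {"view_course"},
    "lecturer": {"view_course", "insert_course", "update_course"},
    "administrator": {
        "view_course", "insert_course", "update_course", "delete_course",
    },
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the action.

    Unknown roles get no privileges at all.
    """
    return action in PRIVILEGES.get(role, set())

print(is_allowed("public", "view_course"))
print(is_allowed("lecturer", "delete_course"))
print(is_allowed("administrator", "delete_course"))
```

Keeping the privileges in one table means that opening a function to a further role is a data change rather than a code change.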
A fundamental activity when migrating from HTML to XML is identifying the main content structures and formalizing them as an XML schema or DTD. In the course management module different structures have been defined, the most important being the course and the curriculum. The main components of the course structure, defined as an XML schema, include title, ECTS, duration, prerequisites, objectives, and contents. The curriculum content is created dynamically (on the fly or cached) by aggregating the contents of all the courses belonging to the specific curriculum.
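The aggregation of courses into a curriculum can be sketched as follows. This is a minimal Python illustration, not the project's schema: the element and attribute names, the sample courses, and the helper functions are hypothetical, and only a subset of the course components listed above is modelled.

```python
import xml.etree.ElementTree as ET

def make_course(acronym: str, title: str, ects: int) -> ET.Element:
    """Build a course element; only a subset of the components is modelled."""
    course = ET.Element("course", acronym=acronym)
    ET.SubElement(course, "title").text = title
    ET.SubElement(course, "ects").text = str(ects)
    return course

def build_curriculum(name: str, year: str, semesters: dict) -> ET.Element:
    """Aggregate existing course elements into one curriculum document.

    `semesters` maps a semester number to the list of its course elements.
    """
    curriculum = ET.Element("curriculum", name=name, year=year)
    for number, courses in sorted(semesters.items()):
        semester = ET.SubElement(curriculum, "semester", number=str(number))
        semester.extend(courses)
    return curriculum

def total_ects(curriculum: ET.Element) -> int:
    """Sum the credits of every course contained in the curriculum."""
    return sum(int(e.text) for e in curriculum.iter("ects"))

# Hypothetical sample data.
curriculum = build_curriculum(
    "Computer Science", "2003-2004",
    {
        1: [make_course("WEB1", "Web Technologies", 4)],
        2: [make_course("DB1", "Databases", 6), make_course("XML1", "XML", 3)],
    },
)
print(ET.tostring(curriculum, encoding="unicode"))
print(total_ects(curriculum))
```

The curriculum document holds no course data of its own: it is derived entirely from the course elements, which is what allows it to be generated on the fly or cached.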
Among the courses, the Semester practical projects and the Diploma practical projects, which culminate in a dissertation, deserved special treatment. The course management module provides facilities for dissertation cataloguing and content retrieval. This part was not present in the HTML web site. The normal procedure was to collect SUPSI Semester and Diploma dissertations in the library in paper format, making only titles, curriculum, and date available for on-line searching. In the XML web site, on the other hand, the content of dissertations is made available on-line, and a retrieval mechanism is provided on the basis of the integrated Lucene search engine [Lucene] and a Cocoon extension (action) for PDF full-text search. For each dissertation, XML metadata describe the work in terms of title, authors, keywords, technology used, etc. In addition, the system allows an electronic version of the dissertation to be uploaded in PDF format, indexed, and managed by the person responsible in the Department, who can proceed to publication upon acceptance. Finally, an end user can search across all existing dissertations for relevant works and access their online versions.
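The metadata-based part of this retrieval can be sketched with a naive substring match. The Python fragment below is a stand-in for illustration only: it does not use Lucene and does not index PDF text. The element names, the `status` attribute used to model publication upon acceptance, and the sample records are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; titles, keywords and the status attribute are invented.
CATALOGUE_XML = """
<dissertations>
  <dissertation status="published">
    <title>XML Content Management for Course Catalogues</title>
    <keywords><keyword>XML</keyword><keyword>CMS</keyword></keywords>
    <technology>Cocoon</technology>
  </dissertation>
  <dissertation status="published">
    <title>Mobile Sensor Networks</title>
    <keywords><keyword>wireless</keyword></keywords>
    <technology>Java</technology>
  </dissertation>
  <dissertation status="pending">
    <title>Full-Text Search in XML Archives</title>
    <keywords><keyword>XML</keyword><keyword>search</keyword></keywords>
    <technology>Java</technology>
  </dissertation>
</dissertations>
"""

def search(catalogue: ET.Element, term: str) -> list:
    """Return titles of published dissertations whose metadata mention `term`."""
    term = term.lower()
    hits = []
    for item in catalogue.iter("dissertation"):
        if item.get("status") != "published":
            continue  # not yet accepted for publication
        fields = [item.findtext("title", ""), item.findtext("technology", "")]
        fields += [k.text or "" for k in item.iter("keyword")]
        if any(term in field.lower() for field in fields):
            hits.append(item.findtext("title"))
    return hits

catalogue = ET.fromstring(CATALOGUE_XML)
print(search(catalogue, "xml"))
print(search(catalogue, "cocoon"))
```

A real search engine would add tokenization, ranking, and an index, but the division of labour is the same: structured metadata decide what is searchable and what is visible.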
Issues of budget, which is low in public universities, and of scale, neither small nor as large as in big companies or universities, determined the choice of the tool for implementation and deployment. The Apache Cocoon open source XML tool [Cocoon2] was selected and its separation of concerns model exploited (see Figure 1). According to this model, content, logic and presentation are kept separate. The independence of these parts also allows a neat identification of developer groups dedicated to each specific task, without any conflict. Thus, the entire web publishing process can be optimized, enhancing productivity and reducing management costs.
Cocoon supports interaction with many data sources, including file systems, RDBMS, LDAP, native XML databases, SAP (R) systems and network-based data sources. It provides a powerful multi-channel output mechanism that can tailor content delivery to the capabilities of different devices, with formats such as HTML, WML, PDF, SVG, RTF, and many others.
In addition, Cocoon provides effective support for the external and internal end users' needs: navigating and browsing the content, checking information and publishing in an easy and quick way, controlling access and roles, guaranteeing security, and supporting workflow management.
Other open source XML-based content management systems could have been chosen. For instance, the open source Wyona-Lenya [Lenya2004] was a good candidate because it already provided implementations of additional functions on top of Cocoon, but it was not sufficiently stable and mature to be safely adopted.
The fundamental separation of content from presentation style and from logic is obtained by storing content in a data layer (3-tier model), in XML or in a database, accessing it through the Cocoon ESQL logicsheet, and presenting it with XSL stylesheets. This makes it possible to reuse the same content in different contexts by filtering out different information, to guarantee a consistent style for similar pages, and to differentiate content presentation across contexts.
In particular, the data about courses and curricula have been collected via XSP web pages and stored in a database so that they are also available to other applications already in use within the administration department. Contents are retrieved in XML format and uniformly formatted either as HTML pages or as PDF files for printing via XSL-FO and Apache FOP. The graphical design was accomplished by drawing on the expertise of the Applied Arts Department, on the basis of usability criteria and of XSL and CSS stylesheets.
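The data flow from the database to an intermediate XML document and on to several views can be sketched as follows. This Python fragment is an analogy under stated assumptions: it uses an in-memory SQLite database and plain functions in place of the school's central DB, the ESQL logicsheet, and XSL stylesheets, and the table layout and sample rows are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Data layer: an in-memory database standing in for the central DB (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course (acronym TEXT PRIMARY KEY, title TEXT, ects INTEGER)"
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("WEB1", "Web Technologies", 4), ("DB1", "Databases", 6)],
)

def courses_as_xml(connection: sqlite3.Connection) -> ET.Element:
    """Logic layer: turn query results into a presentation-neutral XML tree."""
    root = ET.Element("courses")
    query = "SELECT acronym, title, ects FROM course ORDER BY acronym"
    for acronym, title, ects in connection.execute(query):
        course = ET.SubElement(root, "course", acronym=acronym)
        ET.SubElement(course, "title").text = title
        ET.SubElement(course, "ects").text = str(ects)
    return root

def title_list(root: ET.Element) -> list:
    """Presentation 1: titles only, e.g. for a navigation list."""
    return [c.findtext("title") for c in root.iter("course")]

def credit_table(root: ET.Element) -> dict:
    """Presentation 2: acronym to credits, e.g. for an administrative report."""
    return {c.get("acronym"): int(c.findtext("ects")) for c in root.iter("course")}

xml_root = courses_as_xml(conn)
print(title_list(xml_root))
print(credit_table(xml_root))
```

Both views read the same intermediate tree and filter out different information, so a change to the stored data reaches every presentation without editing any page.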
Hereafter, some of the "course management" functionalities are highlighted through the most relevant screenshots. The module is activated through the page shown in Figure 2, where a link bar with the main functionalities appears.
Once the user has been identified as an authorized user, he/she may choose any function available for course management (adding, deleting, modifying, etc.) according to his/her privileges. For instance, Figure 3 shows part of the form for modifying an existing course, including the primary key Acronym used to uniquely identify a course.
One interesting feature is the possibility of showing contents in different formats, e.g. HTML and PDF. Figure 4 and Figure 5 show the information of a specific curriculum in PDF and HTML format respectively.
From the list of courses in the curriculum page, it is possible to access the course details page (see Figure 6), where a link (top right icon) opens the PDF version.
A number of basic issues involved in the creation and management of an XML-based medium-to-large university web site are introduced here. They derive directly from our real experiment, which can be considered a case study in the development of an academic web site.
One of the main issues related to the structure of an academic web portal is that it often consists of a collection of sites created and managed by different departments, each with its own content type and layout. This produces a content and presentation inconsistency, with a consequent low homogeneity and disorientation.
Another general problem concerns the different availability of competences, resources, and time in different departments to keep content updated, considerably reducing the global university site quality and usefulness.
The use of a content management system to solve the previous issues implies changes in the content management process with respect to the HTML version, and the emergence of new professional roles, which change the way traditional actors, i.e. students, teachers, and administrators, approach the web site. The role of the web master responsible for organizing and updating HTML pages and collecting information from different sources and people (an intermediary role) is replaced by new distributed roles that are more focused on content than on its formatting. This makes the publication process more straightforward.
An implication of this situation is the need to train people to adopt new tools. Teachers, for instance, are obviously not required to edit text in raw XML format; the authoring process can be facilitated by input forms or supported by user-friendly tools, such as the course modification form previously mentioned in our solution.
Another problem related to the migration towards XML content management systems concerns making project managers aware of the potential of XML and of the global management costs. The use of a content management system implies an initial rise in cost, due to the need to re-organize the information corpora and to train people in new procedures and technologies; however, these initial costs are repaid by the long-term benefits of easy maintenance and reuse. The difficulty lies in understanding the real benefits and long-term effects.
In addition, there are other issues regarding the design of homogeneous interfaces and the provision of appropriate end-user facilities for searching and navigating.
Within the interface context, an important issue is identifying and pursuing the proper corporate identity, which will be the organization's own window on the net. In this case, the use of XML as the content markup language provides considerable benefits, mainly with respect to presentation and searching facilities. As already described, presentation is kept separate from content. Therefore, by using the XSLT language it is possible to guarantee homogeneity in content presentation. CSS, particularly in its latest versions, can achieve similar results with HTML pages. However, XSLT is more powerful, as it also addresses content filtering and transformation and allows many output formats to be produced.
Searching facilities can certainly be improved thanks to the availability of meta-information that describes the content and its semantics. The meta tag in HTML can provide a similar function. However, this tag only allows information to be attached to the whole page; it cannot mark a single “piece of content” or assign a specific structure to the document. Therefore, a correct definition of tags becomes a fundamental task for adding semantic value and for generating solid, durable content bricks.
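The difference between page-level and element-level meta-information can be made concrete with a small sketch. In the following Python fragment, whose element names and sample courses are hypothetical, the same term is searched in two different fields; a keyword attached to the whole page could not tell these two cases apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured content: each piece of content has its own tag.
COURSES_XML = """
<courses>
  <course>
    <title>Databases</title>
    <prerequisites>Programming</prerequisites>
  </course>
  <course>
    <title>Web Technologies</title>
    <prerequisites>Databases</prerequisites>
  </course>
</courses>
"""

def find_by_field(root: ET.Element, field: str, term: str) -> list:
    """Return titles of courses whose given field contains the term."""
    term = term.lower()
    return [
        course.findtext("title")
        for course in root.iter("course")
        if term in course.findtext(field, "").lower()
    ]

courses = ET.fromstring(COURSES_XML)
# Same term, different meaning depending on the element it appears in.
print(find_by_field(courses, "title", "databases"))
print(find_by_field(courses, "prerequisites", "databases"))
```

The first query finds the course that teaches databases, the second the course that requires it; the distinction exists only because the markup names each piece of content.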
This work describes the lessons learned from the migration from the HTML to the XML SUPSI web site. Migrating to XML does not only mean translating HTML pages to XML; it also implies using a content management system that exploits the XML format. This is particularly useful for managing dynamic content, as in the case of an academic web site, where most of the information changes occasionally or regularly.
The advantage of using the XML format is that the contents are portable and, by separating content from presentation, different views or presentations may be associated with the same content, targeting it to different users and contexts.
The main issues that emerged from the migration process can be summarized as follows:
difficulty in understanding the real benefits and long-term effects of an XML content management system; for instance, in the SUPSI web site project it was difficult to make other departments aware of the advantages of using the dynamic content management solution for their course management;
high initial costs; this factor will probably decrease in the future as more powerful tools for XML web authoring are created;
need for new professional roles for web content management, with the objective of transforming any content into a valuable asset;
need for specific training.
These issues have introduced some delays in the full accomplishment of the XML migration. As a consequence, the currently adopted solution is a mixed solution, where HTML contents and XML contents coexist (see Figure 7).
Since the features of an academic web site are quite general, this approach can be generalized to the web sites of companies and organizations with similar problems, needs, and requirements.
[Cocoon2] Cocoon, version 2.1.3, The Apache Software Foundation, http://xml.apache.org/cocoon/.
[Lenya2004] Apache Lenya, The Apache Software Foundation, 2004, http://cocoon.apache.org/lenya/.
[Lucene] Lucene XML search engine, http://jakarta.apache.org/lucene/.
[SUPSI] The SUPSI Web Site, Oct. 2003, http://www.supsi.ch.
[XML] The eXtensible Markup Language - XML, W3C, http://www.w3c.org/XML/.