MUSASHI - Publications

XML Table

1. Why XML (limitation of relational database)?
1.1. Limitation of table based data modeling
1.2. Disadvantage of database design
1.3. Data disappearance
1.4. System beyond reach of end users
2. Record all input data as XML
2.1. Input screen and record by XML
2.2. Process of XML data
2.3. Data update
2.4. Data consistency
2.5. System change
3. XML Table

MUSASHI describes the data it processes in XML. Readers are assumed to have a basic knowledge of XML.

1. Why XML (limitation of relational database)?

Before explaining why MUSASHI uses XML, please answer the following questions.

Can you retrieve the order a customer made one year ago?
Do you store in your computer all the information about the reasons for returns?
Are all customers' transactions stored in your computer?
Suppose that while you were settling a customer's account at a point-of-sale terminal, the customer went back to the showcase to fetch an article he or she had forgotten, and in the meantime you settled another customer's account as an interruption. Is this fact stored in your computer?
Does your computer store the information about changes entered at the point-of-sale terminal?
Is the history of updates to the customer master file recorded?
When you ask the information systems department to add a new attribute to your file, can it be done in a week?
Suppose it took ten minutes to get the result when you retrieved information from stored data with SQL. Do you feel that is very slow?

1.1. Limitation of table based data modeling

In most companies, the information system stores its data in a relational database. A relational database represents data as tables made up of attributes (columns) and lines (rows). In practice, however, we often find that a company's real workflow is difficult to model as a table. As a familiar example, consider trying to keep a diary in Excel. Real work consists of events that occur as time proceeds. In table-based data modeling, as in a relational database, daily operations are instead usually modeled by things such as articles, customers and order vouchers. With this kind of modeling it is difficult to represent a temporal relation such as an interruption at a point-of-sale terminal.

1.2. Disadvantage of database design

A further problem with relational databases is that database design is difficult and time consuming. When a database is designed, staff of the information systems department and the people working in daily operations first discuss which data need to be stored. They then start to design the data structure. Database experts spend a great deal of time examining details such as normalization and constraints in order to arrive at a perfect design, because once a database is used in daily operation, modifying it requires a lot of time.
This in turn makes users wait an unnecessarily long time for the result they want, so the necessary information is not available when it is needed. It sometimes happens that by the time the result comes out, the user no longer needs it. It also often happens that the result is not the one desired, and the user has to make another amendment request. Besides this gap between users and system developers, the difficulty of relational database design leads to two other issues, "data disappearance" and "system beyond reach of end users", which are discussed below.

1.3. Data disappearance

When designing a relational database, the system developer first examines which data ought to be stored in it. Data items other than those to be stored are ignored. As in the question posed above, if users do not need the information about changes, that information is dropped from the scope of the design. Yet in daily operation at a point-of-sale terminal, the information about changes is displayed on the register screen. Here is a story that illustrates the point. The president of a certain company, who did not know much about computers, saw a receipt at one of the company's shops and murmured that all the receipt information must be stored in a computer. This intuition is quite natural. In reality, however, information that was once judged unnecessary to store in the database does not exist in the computer at all. Quite often the original receipt information and the discount price are lost as well.
Such data loss causes a serious problem from the viewpoint of data mining. One cannot predict which data will become necessary for future data mining analysis. Information about changes may become necessary in an unexpected situation. We therefore believe that it is critically important to record all data that occur. Since a 40 GB hard disk costs only 300 dollars nowadays, hardware cost does not impose any constraint. The important issue is thus how to record the data. From this viewpoint, we have turned our attention to XML.

1.4. System beyond reach of end users

The final issue with current information systems is that end users cannot reach the system directly, because they are not familiar with the details of computers. Although a relational database is considered easy to construct, this is not the case in reality. It is hard for an ordinary end user to understand the data structure of a relational database. If the information systems department replies that an end user's request to add a new attribute to the existing database is impossible for such and such a reason, the user generally cannot argue back. A system based on an RDB is a black box for end users, although some users who are familiar with computers may be able to issue an SQL query. Even so, the problems of data disappearance and database design remain, and ordinary users complain, "Why is our company's information system so difficult to use?" and "Why is our system so slow?" if they have to wait ten minutes for the answer to a simple SQL query.
This is a matter of human psychology. Because the system is a black box to them, users feel that it is slow if they have to wait ten minutes for a query, or one month for a requested change to the database design. If they knew the system in depth, they might not feel that way. We believe that an information system should be understood more explicitly by every end user. In this respect, a relational database seems to be too complicated.

2. Record all input data as XML

The system we propose is very simple: all input data are recorded as XML. A company's information system receives data in various forms, such as articles scanned at a point-of-sale terminal, updates to the article master file, online orders from other companies, returns entered by sales staff, and so on. Data from these various input sources are usually stored and managed in a unified way by a relational database. As we have seen above, however, managing data with an RDB has many negative effects. To get rid of them, we propose to abandon the RDB and instead record all input data in chronological order as XML. For an end user, all that exists in the system is the input data, and all the information he or she needs is there. You do not need expensive and complicated software to view the content; all you have to do is open the file in an editor or a browser. If you feel scared to access the computer in the systems department, just ask for a copy of the file. It is a very simple solution.

This is our philosophy and idea. Here are several problems we have to cope with.

How do we record the data entered on an input screen as XML?
Is it simple to process the data recorded in XML as you wish? Can it be done fast?
How do we update inventory data? This is not as simple as just accumulating input data.
How do we maintain data integrity?
Is it simple to change the system specification such as addition of a new attribute?

2.1. Input screen and record by XML

We explain how to design an input screen and record its input data as XML, using a simple program for receiving orders as an example. Suppose a staff member, with a telephone receiver in one hand, enters data on the input screen shown in Figure 1.

Figure 1 Screen for order receipt

The data entered on this screen are the customer ID, the article number and the article quantity. The other data, such as customer name, article name, unit of quantity and unit price, are assumed to be looked up in the customer master file and the article master file and then displayed on the screen. It is also assumed that the amount and the total amount are calculated from the input data, and that the date and time are taken from the computer's internal clock.
Now, how can we describe the data items on this screen in XML? Someone who wants to avoid data duplication may think it sufficient to record only the newly entered data and the time, and may write the following XML document.

Figure 2
<?xml version="1.0" encoding="UTF-8"?>
<order_receipt number="0001">
  <input_date>2001/12/06</input_date>
  <input_time>10:15:20</input_time>
  <customer_ID>0001</customer_ID>
  <statement line="1">
    <article_No>010001</article_No>
    <quantity>10</quantity>
  </statement>
  <statement line="2">
    <article_No>010002</article_No>
    <quantity>1</quantity>
  </statement>
</order_receipt>

However, there are problems with this way of recording: 1) there is no guarantee that the data displayed from the article master file will continue to exist in the master file forever, 2) the program that computes the totals may have a bug, and above all, 3) a user cannot understand the details of the XML file at a glance (e.g. the name of article "010001"). For these reasons, we propose to record all the data on the screen as viewed by the staff. This results in the XML shown in Figure 3 below.

Figure 3
<?xml version="1.0" encoding="UTF-8"?>
<order_receipt number="0001">
  <input_date>2001/12/06</input_date>
  <input_time>10:15:20</input_time>
  <customer_ID>0001</customer_ID>
  <statement line="1">
    <article_No>010001</article_No>
    <article_name>US beef</article_name>
    <quantity>10</quantity>
    <unit>kg</unit>
    <unit_price>240</unit_price>
    <amount>2400</amount>
  </statement>
  <statement line="2">
    <article_No>010002</article_No>
    <article_name>meat ball</article_name>
    <quantity>1</quantity>
    <unit>sack</unit>
    <unit_price>1200</unit_price>
    <amount>1200</amount>
  </statement>
</order_receipt>
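
The idea of recording everything the staff sees, including looked-up and computed values, can be sketched in a few lines of Python. This is only an illustration and not part of MUSASHI: the function name build_order_receipt is hypothetical, and the element names use underscores as an assumption, since XML names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

def build_order_receipt(number, date, time, customer_id, lines):
    """Build an order receipt element that records every value shown on screen."""
    root = ET.Element("order_receipt", {"number": number})
    ET.SubElement(root, "input_date").text = date
    ET.SubElement(root, "input_time").text = time
    ET.SubElement(root, "customer_ID").text = customer_id
    for i, line in enumerate(lines, start=1):
        statement = ET.SubElement(root, "statement", {"line": str(i)})
        # Values copied from the master files are stored as displayed.
        for tag in ("article_No", "article_name", "quantity", "unit", "unit_price"):
            ET.SubElement(statement, tag).text = str(line[tag])
        # The amount is derived, but it is stored as well.
        amount = line["quantity"] * line["unit_price"]
        ET.SubElement(statement, "amount").text = str(amount)
    return root

receipt = build_order_receipt(
    "0001", "2001/12/06", "10:15:20", "0001",
    [
        {"article_No": "010001", "article_name": "US beef",
         "quantity": 10, "unit": "kg", "unit_price": 240},
        {"article_No": "010002", "article_name": "meat ball",
         "quantity": 1, "unit": "sack", "unit_price": 1200},
    ],
)
xml_text = ET.tostring(receipt, encoding="unicode")
print(xml_text)
```

Because the article name and the amount are written into the record itself, the file stays readable even if the master file changes later.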

This way of description resolves the problems posed above, and everyone can understand the contents of the data. But is this description sufficient? It simply records, as XML, the events that occurred in temporal order. How can we describe an event such as an interrupting input? We do not have a solution to this yet.
What follows is just our idea and has not been implemented. The way of recording we think best is to record all events that occur, in chronological order, as XML. Everything that runs on a computer is written in some programming language, and most current programming languages follow the principle of "structured programming": a unit of processing is described as a module, and modules are in turn organized as a tree structure. Thus, if each program writes out, as XML tags, what was done at the beginning and at the end of its execution, XML data can be obtained. All tags should carry a time attribute. However, things are not so simple. Ten years ago we could assume without any doubt that a program is executed in a fixed order from the top, but we cannot now: with the widespread use of GUIs, a program is invoked by events, in particular at the input screen. Therefore it is not certain that simply making each module output XML tags will work for our purpose.
Although some problems remain with accumulating data as XML, recording all input data as XML is still very effective because of its flexibility. We are planning to investigate how to accumulate history data as XML.
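
The unimplemented idea of having each module write tags at the beginning and end of its execution might look like the following Python sketch. Everything here is an assumption for illustration: the helper name traced, the attribute name start and the <end> element are hypothetical, and an XML closing tag cannot carry attributes, so the end time is written as a separate empty element. The sketch covers only the simple, top-down case and does not address the GUI event problem raised above.

```python
import io
import xml.etree.ElementTree as ET
from contextlib import contextmanager
from datetime import datetime
from xml.sax.saxutils import quoteattr

@contextmanager
def traced(name, out, clock=datetime.now):
    """Write an opening tag when a unit of work starts and close it when it ends."""
    out.write(f"<{name} start={quoteattr(clock().isoformat())}>")
    try:
        yield
    finally:
        # Closing tags cannot hold attributes, so record the end time separately.
        out.write(f"<end time={quoteattr(clock().isoformat())}/></{name}>")

log = io.StringIO()
with traced("order_receipt", log):
    with traced("statement", log):
        log.write("<quantity>10</quantity>")

# Nested modules yield nested elements, so the log parses as one XML tree.
root = ET.fromstring(log.getvalue())
print(log.getvalue())
```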

2.2. Process of XML data

How do we process data accumulated as XML? The trend toward using XML instead of an RDB to accumulate data as a database can be seen in the software development activity around Ygdracill. The data structure of XML is very flexible, but it suffers from slow processing. Also, considering the popularity of Microsoft Excel, presenting data as a table cannot be neglected because of its simplicity. With these points in mind, we developed XML Table. The overall idea is to accumulate data as XML and to process it as an XML Table: data processing in our system is done after converting the XML data to an XML Table. Details are given in the following sections.

2.3. Data update
2.4. Data consistency
2.5. System change

MUSASHI was developed to facilitate the implementation of data mining. However, we believe that our idea can also be applied to the development of operational information systems. In fact, a few companies have already done so.
We have not yet developed an application of MUSASHI to operational systems, and hence we do not currently go into the details of Sections 2.3 to 2.5, which will be given in the near future.

3. XML Table

Data processed by MUSASHI is either an XML Table or a plain text file. We have five years' experience of data processing based on plain text files, and have found such processing to be quite effective. The disadvantage is that an attribute has to be specified by its number, so we always have to pay attention to the ordering of the numbered attributes. To avoid this disadvantage, we developed MUSASHI, in which we do not need to care about the ordering because an attribute is specified by name rather than by number.

Now let us have a look at XML table. Figure 4 shows an XML table transformed from Figure 3.

Figure 4 XML table
<?xml version="1.0" encoding="UTF-8"?>
<xmltbl version="1.00">
<head>
  <field no="1"> <name>order receipt number</name> </field>
  <field no="2"> <name>statement_line_number</name> </field>
  <field no="3"> <name>article number</name> </field>
  <field no="4"> <name>article name</name> </field>
  <field no="5"> <name>quantity</name> </field>
  <field no="6"> <name>amount</name> </field>
</head>
<body><![CDATA[
0001 1 010001 US beef 10 2400
0001 2 010002 meat ball 1 1200
]]></body>
</xmltbl>

This data is obtained by converting the XML data of Figure 3 with the command shown in Figure 5. The -k parameter makes the region enclosed by the tag "/order receipt/statement" be treated as a single line, and the -f parameter outputs the order receipt number, the statement line number and the contents of "statement", i.e., article number, article name, quantity and amount. The result is the XML Table shown in Figure 4. See List of commands for details on how to use the commands.

Figure 5
xml2xt -k/order_receipt/statement -f order_receipt@number, statement@number, article number, article name, quantity, amount
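
To make the conversion concrete, the following Python sketch flattens each "statement" region into one space-delimited line. It only illustrates the idea and is not how xml2xt is implemented: the function flatten and its parameters are hypothetical, the sample data is cut down from Figure 3, and the element names with underscores are an assumption.

```python
import xml.etree.ElementTree as ET

ORDER_XML = """<order_receipt number="0001">
  <statement line="1">
    <article_No>010001</article_No>
    <quantity>10</quantity>
    <amount>2400</amount>
  </statement>
  <statement line="2">
    <article_No>010002</article_No>
    <quantity>1</quantity>
    <amount>1200</amount>
  </statement>
</order_receipt>"""

def flatten(xml_text, record_tag, root_attrs, record_attrs, child_tags):
    """Turn every record_tag element into one row of string values."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        row = [root.get(name, "") for name in root_attrs]        # e.g. receipt number
        row += [record.get(name, "") for name in record_attrs]   # e.g. line number
        row += [record.findtext(tag, default="") for tag in child_tags]
        rows.append(row)
    return rows

rows = flatten(ORDER_XML, "statement", ["number"], ["line"],
               ["article_No", "quantity", "amount"])
for row in rows:
    print(" ".join(row))
```

Each printed line corresponds to one line of the <body> part of an XML Table.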

As seen from Figure 4, an XML Table is entirely an XML document. The uppermost element <xmltbl> has two child elements, <head> and <body>. The <body> tag holds the data in table form, with LF as the line delimiter and a space as the attribute delimiter. This part is a data structure commonly used on Unix and can also be processed with awk and grep. Right after the <body> tag comes a <![CDATA[....]]> section, which instructs the XML parser to treat its content as it is. It is there so that the line feeds in the table reach the reader unchanged (otherwise an LF inside a tag is commonly treated like a space character). Commands dealing with XML Tables, however, simply search for the string "<body><![CDATA" and regard what follows it as the table data. A further benefit is that when the data is displayed by an application that can interpret XML, such as Internet Explorer, the user sees it as a well-structured table. Without the CDATA section, the data would be displayed as one run of character strings.
The <head> tag acts as a dictionary for the data: the name and position of each attribute are described by a <field> tag. Information other than name and position can also be stored in the <head> tag. Figure 6 illustrates such an XML Table.

Figure 6
<?xml version="1.0" encoding="UTF-8"?>
<xmltbl version="1.00">
<head>
  <title> XML Table sample </title>
  <comment> Write comments here </comment>
  <field no="1">
    <name>order_receipt_number</name>
    <comment>comments on attributes</comment>
    <sort priority="1"> </sort>
  </field>
  <field no="2"> <name>statement_line_number</name> </field>
  <field no="3"> <name>article number</name> </field>
  <field no="4"> <name>article name</name> </field>
  <field no="5">
    <name>quantity</name>
    <sort priority="2"> <numeric/><reverse/> </sort>
  </field>
  <field no="6"> <name>amount</name> </field>
</head>
<body><![CDATA[
0001 1 010001 US beef 10 2400
0001 2 010002 meat ball 1 1200
]]></body>
</xmltbl>
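
Reading such a file by attribute name rather than by position can be sketched as follows in Python. This is not a MUSASHI command: the function read_xmltbl is hypothetical, and the sample table is simplified under the assumption that neither field names nor values contain spaces, since a space is the attribute delimiter.

```python
import xml.etree.ElementTree as ET

XMLTBL = """<xmltbl version="1.00">
<head>
  <field no="1"><name>order_receipt_number</name></field>
  <field no="2"><name>statement_line_number</name></field>
  <field no="3"><name>article_number</name></field>
  <field no="4"><name>quantity</name></field>
  <field no="5"><name>amount</name></field>
</head>
<body><![CDATA[
0001 1 010001 10 2400
0001 2 010002 1 1200
]]></body>
</xmltbl>"""

def read_xmltbl(text):
    """Return (field names in position order, rows as name -> value dicts)."""
    root = ET.fromstring(text)
    fields = sorted(root.find("head").findall("field"),
                    key=lambda field: int(field.get("no")))
    names = [field.findtext("name").strip() for field in fields]
    body = root.findtext("body")
    # One table line per LF-terminated line, values separated by spaces.
    rows = [dict(zip(names, line.split()))
            for line in body.splitlines() if line.strip()]
    return names, rows

names, rows = read_xmltbl(XMLTBL)
total = sum(int(row["amount"]) for row in rows)
print(names)
print(total)
```

The caller asks for "amount" by name, so the code keeps working if the fields are reordered.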

Tags available in an XML Table are shown in tree form as follows.

Figure 7
xmltbl
  +head
    +title
    +comment
    +field@no
      +name
      +comment
      +sort@priority
        +reverse
        +numeric
  +body

The following explains what each tag means.

tag	explanation
<xmltbl>	tag that stands for xmltable (mandatory)
<xmltbl><head>	meta data of table data shown in <body> (mandatory)
<xmltbl><head><title>	title for the data (optional)
<xmltbl><head><comment>	comment for the data (optional)
<xmltbl><head><field>	attribute information (mandatory). The "no" attribute gives the position of the attribute as a number.
<xmltbl><head><field><name>	attribute name (mandatory)
<xmltbl><head><field><comment>	comment for the attribute (optional)
<xmltbl><head><field><sort>	This indicates that the data is sorted by this attribute (optional). The "priority" attribute indicates the priority of this attribute among those used for sorting. Most commands determine the algorithm actually applied.
<xmltbl><head><field><sort><numeric>	It is used if sorted by numerical order (optional)
<xmltbl><head><field><sort><reverse>	It is used if sorted in decreasing order (optional)
<xmltbl><body>	Data is inserted here (mandatory)
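
As a sketch of how the sort metadata in the table above could be read, the following Python code collects the <sort> information of each field in priority order. The function name sort_spec is hypothetical and the sample <head> is an abbreviated version of the one in Figure 6.

```python
import xml.etree.ElementTree as ET

HEAD = """<head>
  <field no="1">
    <name>order_receipt_number</name>
    <sort priority="1"> </sort>
  </field>
  <field no="2"><name>statement_line_number</name></field>
  <field no="5">
    <name>quantity</name>
    <sort priority="2"> <numeric/><reverse/> </sort>
  </field>
</head>"""

def sort_spec(head):
    """List the sorted attributes of a <head> element, highest priority first."""
    spec = []
    for field in head.findall("field"):
        sort = field.find("sort")
        if sort is None:
            continue  # this attribute is not a sort key
        spec.append({
            "priority": int(sort.get("priority")),
            "name": field.findtext("name").strip(),
            "numeric": sort.find("numeric") is not None,
            "reverse": sort.find("reverse") is not None,
        })
    return sorted(spec, key=lambda entry: entry["priority"])

spec = sort_spec(ET.fromstring(HEAD))
for entry in spec:
    print(entry)
```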

Copyright 2004 MUSASHI