
How data processing is done with MUSASHI

The commands in our system are developed following the UNIX philosophy: a wide variety of data processing tasks can be built by combining small commands, each of which has a single function.

As already explained, our system assumes that data is stored in an XML file. To run our commands, the XML file is first converted into an XML table. To process data in an XML table, we provide many commands, ranging from an operator that selects attributes to one that joins two tables.

Like most UNIX commands, all our commands read XML table(s) from standard input, process them, and write the result to standard output. Data is normally stored on a hard disk, so redirection is used to read the input from a file and to store the resulting output in a file. A typical example of data processing in the UNIX style is as follows.

standard input → process → standard output: mcut -f date <in.xt >out.xt

With in.xt as standard input, the command mcut (with the parameter -f date) processes in.xt and writes the output to out.xt.

More specifically, in.xt is an XML table as illustrated in Fig. 1, and mcut selects the attributes specified by the parameter -f. In the example in Fig. 1, the single attribute "date" is selected from in.xt, which consists of four attributes. The result is written to the file out.xt.
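
The same redirection pattern can be tried with the standard UNIX command cut. The sketch below is only an analogy: it operates on a plain comma-separated file rather than on an XML table, and the file names and sample data are invented for illustration.

```shell
#!/bin/sh
# Analogy only: standard `cut` on a CSV file, not a MUSASHI command on an XML table.
# File names and sample data are invented for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# A small table with four attributes per line: date, customer, item, amount.
cat >"$workdir/in.csv" <<'EOF'
20020101,A,milk,120
20020101,B,bread,200
20020102,A,milk,150
EOF

# standard input -> process -> standard output
# Select only the first attribute (date) and store the result in out.csv.
cut -d, -f1 <"$workdir/in.csv" >"$workdir/out.csv"

cat "$workdir/out.csv"
```

Here out.csv holds one date per input line, just as out.xt holds only the selected attribute.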

Let us consider how to combine more than one command. Two or more commands can be combined by the "pipe" function, which has long been used in UNIX. A pipe lays a pipeline between commands, by which the standard output of the preceding command is linked to the standard input of the succeeding command.

input → command1 | command2 → output

Note: The commands can be called from any environment that supports pipes, such as a shell (bash) or a scripting language such as Perl or Tcl. The standard UNIX shell (bash) is assumed throughout this document.

To illustrate the pipe function, let us extend the previous example by selecting attributes with xtcut and linking xtagg after it. This is done simply by typing the following.

xtcut -f date,amount <in.xt | xtagg -k date -f amount -csum >out.xt

The command xtcut simply selects the two attributes date and amount, which are then handed over to the command xtagg through the pipe. The command xtagg sums up (-csum) the values of the attribute "amount" for each date, using date as the key (-k date). The result is written to out.xt.
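
The select-then-aggregate pipeline can be imitated with standard UNIX tools. In the sketch below, cut and awk merely stand in for xtcut and xtagg. It works on a plain comma-separated file rather than an XML table, and the file names and sample data are invented for illustration.

```shell
#!/bin/sh
# Analogy only: `cut` and `awk` stand in for xtcut and xtagg.
# File names and sample data are invented for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Fields per line: date, customer, item, amount (no header line).
cat >"$workdir/in.csv" <<'EOF'
20020101,A,milk,120
20020101,B,bread,200
20020102,A,milk,150
EOF

# Stage 1: keep only date (field 1) and amount (field 4).
# Stage 2: sum the amount for each date, using date as the key.
# Stage 3: sort by date, because awk's "for (d in sum)" order is unspecified.
cut -d, -f1,4 <"$workdir/in.csv" |
  awk -F, '{ sum[$1] += $2 } END { for (d in sum) print d "," sum[d] }' |
  sort >"$workdir/out.csv"

cat "$workdir/out.csv"
```

Each stage reads the standard output of the stage before it, so no intermediate file is needed.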

Combining several simple commands in this manner enables us to carry out various complex computations. The idea can be compared to Lego blocks: various objects can be constructed by combining various types of blocks. The difference is that, when building with Lego, the final object may change during the building process, inspired by the process itself. With our commands, the final output is fixed and the set of available commands is known, but it is sometimes difficult to figure out how to combine them, which makes the task something like an intellectual puzzle.

In practice, simple data processing can be written as a combination of only a few commands, while complex processing may require more than ten commands. Thus, instead of writing a complex combination of all required commands on a single line, it is better to break it into several lines, so that each line contains a small number of commands and ends by writing the intermediate result to a work file.
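
The advice about work files can be sketched with standard UNIX tools. As before, this is only an analogy on a plain comma-separated file. The work file names, the sample data, and the threshold of 200 are assumptions made for illustration and are not part of MUSASHI.

```shell
#!/bin/sh
# Analogy only: standard tools on a CSV file stand in for MUSASHI commands.
# File names, sample data and the threshold 200 are invented for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Fields per line: date, customer, item, amount (no header line).
cat >"$workdir/in.csv" <<'EOF'
20020101,A,milk,120
20020101,B,bread,200
20020102,A,milk,150
20020103,C,rice,400
EOF

# Line 1: select date and amount, and write the first work file.
cut -d, -f1,4 <"$workdir/in.csv" >"$workdir/work1.csv"

# Line 2: sum the amount per date, sort by date, and write the second work file.
awk -F, '{ sum[$1] += $2 } END { for (d in sum) print d "," sum[d] }' \
  <"$workdir/work1.csv" | sort >"$workdir/work2.csv"

# Line 3: keep only the dates whose total amount exceeds 200.
awk -F, '$2 > 200' <"$workdir/work2.csv" >"$workdir/out.csv"

cat "$workdir/out.csv"
```

Because every line ends in a work file, each intermediate result can be inspected on its own when the final output is not what was expected.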

In the tutorial, we illustrate how our commands can be used to carry out tasks ranging from simple to complex. This should show how flexible the system is and how efficiently it can process a variety of tasks on large amounts of data.

