Benchmark


To give a reference point for the processing time of MUSASHI, a simple benchmark test is presented here.

Summary

Five scripts are run and their average elapsed time is measured (CPU time is not measured).
The data is generated randomly, and measurements are taken on data sets of 100,000, 500,000, 1,000,000 and 2,000,000 lines.

How we run the benchmark

The five processing scripts are simply run one after another and their average elapsed (wall-clock) time is measured. CPU time is not measured.
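The measurement idea can be sketched in a few lines of shell. This is only an illustration, not the code used by the benchmark itself: the helper names `elapsed_seconds` and `average_seconds` are hypothetical, and the sketch assumes a `date` command that supports `+%s` (seconds since the epoch), which gives whole-second resolution only.

```shell
#!/bin/sh
# Hypothetical helper (not part of the benchmark scripts): print the
# wall-clock time, in whole seconds, that a command takes to run.
# The command's output is discarded and its exit status is ignored.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || :
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: integer average of the durations passed as arguments.
# Prints 0 when no durations are given.
average_seconds() {
    total=0
    count=0
    for t in "$@"; do
        total=$((total + t))
        count=$((count + 1))
    done
    if [ "$count" -eq 0 ]; then
        echo 0
    else
        echo $((total / count))
    fi
}
```

For example, `average_seconds "$(elapsed_seconds ./a.sh)" "$(elapsed_seconds ./b.sh)"` would print the mean elapsed time of two scripts, where `a.sh` and `b.sh` are placeholder names. Because only elapsed time is recorded, anything else running on the machine inflates the result.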

Sample data

File name        Lines       Size             Number of distinct key values
                                              key1      key2     key3    key4
Dat100000.xt     100,000     approx. 10 MB    63,077    10,000   1,000   100
Dat500000.xt     500,000     approx. 50 MB    99,360    10,000   1,000   100
Dat1000000.xt    1,000,000   approx. 100 MB   99,997    10,000   1,000   100
Dat2000000.xt    2,000,000   approx. 200 MB   100,000   10,000   1,000   100

The content of each data file looks like the following (first ten records shown).

<?xml version="1.0" encoding="euc-jp"?>
<xmltbl version="1.1">
<header>
<field no="1" name="No"></field>
<field no="2" name="key1"></field>
<field no="3" name="key2"></field>
<field no="4" name="key3"></field>
<field no="5" name="key4"></field>
<field no="6" name="qtty"></field>
<field no="7" name="price"></field>
<field no="8" name="rand1"></field>
<field no="9" name="rand2"></field>
<field no="10" name="date1"></field>
<field no="11" name="time1"></field>
<field no="12" name="date2"></field>
<field no="13" name="time2"></field>
<field no="14" name="s1"></field>
<field no="15" name="s2"></field>
<field no="16" name="s3"></field>
<field no="17" name="s4"></field>
</header>
<body><![CDATA[
00000001 139438 103943 100394 100039 2 178 415 34174 20030701 000000 20040819 092934 123 123 123 123
00000002 178309 107830 100782 100078 3 256 726 67759 20030701 000000 20050626 184919 123 123 123 123
00000003 179843 107984 100798 100079 3 259 738 69084 20030701 000000 20050708 191124 123 123 123 123
00000004 191164 109116 100911 100090 3 281 828 78865 20030701 000000 20051006 215425 123 123 123 123
00000005 119755 101975 100197 100020 1 139 258 17168 20030701 000000 20040315 044608 123 123 123 123
00000006 133522 103352 100335 100033 2 167 368 29063 20030701 000000 20040703 080423 123 123 123 123
00000007 176822 107682 100767 100076 3 253 714 66474 20030701 000000 20050614 182754 123 123 123 123
00000008 127777 102777 100277 100027 2 155 322 24099 20030701 000000 20040518 064139 123 123 123 123
00000009 155396 105539 100553 100055 2 210 543 47962 20030701 000000 20041225 131922 123 123 123 123
00000010 147739 104773 100477 100047 2 195 481 41347 20030701 000000 20041024 112907 123 123 123 123
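The "number of distinct key values" figures in the sample data table can be checked with standard tools. The following is a minimal sketch, not a MUSASHI command: `count_distinct` is a hypothetical helper, and it assumes that each record sits on its own line, that fields are separated by whitespace, and that only record lines begin with a digit (so the XML header lines are skipped).

```shell
#!/bin/sh
# Hypothetical helper: count the distinct values of one field.
#   $1 = data file
#   $2 = 1-based field number (2 would be key1 in the sample layout)
# Assumption: record lines start with a digit; all other lines are ignored.
count_distinct() {
    awk -v col="$2" '/^[0-9]/ { print $col }' "$1" | sort -u | wc -l | tr -d ' '
}
```

Under these assumptions, `count_distinct Dat100000.xt 2` would report how many different key1 values the file contains.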

Run a benchmark test in your environment

1. Download the benchmark script set and extract it.

$ tar zxvf bench.tar.gz

2. A directory named bench should have been created. Move into that directory and check that it contains the three scripts listed below.

$ cd bench

Script name   Explanation
mkdat.sh      Generates the benchmark data. Eight data files with different numbers of lines are created in the indat directory.
bench.sh      Runs the benchmark and saves the measurement results in the result directory.
mkhtml.sh     Aggregates the measurement results in the result directory and generates a set of HTML documents in the directory whose name is specified in bench.sh.

3. Run mkdat.sh (this takes 5 to 10 minutes) and check that the eight files below have been created in the indat directory.

$ ./mkdat.sh

Category          File name       Explanation
Test data         Dat100.xt       100 lines
                  Dat500.xt       500 lines
                  Dat1000.xt      1,000 lines
                  Dat2000.xt      2,000 lines
Production data   Dat100000.xt    100,000 lines
                  Dat500000.xt    500,000 lines
                  Dat1000000.xt   1,000,000 lines
                  Dat2000000.xt   2,000,000 lines
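Checking for the eight files by eye is easy to get wrong, so a small helper can do it. This is a sketch under assumptions: `missing_files` is a hypothetical function, not part of the script set, and the file names passed to it should match whatever mkdat.sh actually writes (the capitalization shown in the table may differ from the names on disk).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every expected file that does
# not exist as a regular file in the given directory.
#   $1 = directory, remaining arguments = expected file names
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "$name"
    done
}
```

Calling `missing_files indat` followed by the eight expected names prints nothing when every file is present, and otherwise lists the ones that were not created.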

4. Run the benchmark script in test mode, which uses the small data sets (finishes within 1 minute).
"&>log" is appended so that the execution log is saved in the file log (as a result, nothing is displayed on the screen).

$ ./bench.sh all test &>log

Check the contents of log and verify that no error messages were output.
If the run completed normally, the result files should have been written to the result directory.
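A quick way to scan the log is to count the lines that mention an error. The sketch below is only a rough check: `count_error_lines` is a hypothetical helper, and it assumes that error messages contain the word "error" in some capitalization, which the benchmark scripts do not guarantee. Reading the log directly remains the reliable check.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a log file contain the
# word "error" (case-insensitive). Prints 0 when there are none.
#   $1 = log file
count_error_lines() {
    # grep -c exits with status 1 when the count is 0, so the status is ignored.
    grep -ci 'error' "$1" || true
}
```

After the test run, `count_error_lines log` printing 0 suggests that the run was clean under that assumption.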

5. Run the benchmark script in production mode.
"&>log" is appended so that the execution log is saved in the file log (as a result, nothing is displayed on the screen).
The production run takes 2 hours or more, depending on the specifications of the machine, so running it over a lunch break or overnight is recommended.

$ ./bench.sh all &>log

Note) This benchmark script does not measure actual CPU time with the time command or the like. It simply measures the real time from the start of each command to its end. For that reason, when running the bench.sh script, take care that no unnecessary processes are running in the background.

6. Enter the four items describing your current machine environment in the mkhtml.sh script (the details are explained inside the script).

7. Aggregate the results and generate the HTML files (finishes almost instantly).

$ ./mkhtml.sh

8. The HTML document "table.shtml" should have been created under the directory named after the benchmark name entered in step 6. Open it in a web browser and check its contents.
