Generating Artificial Data of Customer Purchase History

Making plausible data

Data mining researchers working on business applications face a serious problem: real business data is difficult to obtain for research use. We have developed a system that produces artificial customer purchase history data. Because the data is artificial, confidentiality is not a concern. (The MUSASHI tutorial uses this artificial data.)

When generating artificial data, it is important that the result looks like real data. The current system relies on a simple random number generator, so the data lacks realism in some respects. We plan to refine the system to make the data more realistic.

Downloading the artificial data

1) Download the archive containing the up-to-date scripts and data (approximately 290K).
2) Extract the files (tar zxvf basic.tar.gz).
3) Verify that the following files have been extracted under the basic directory.

File name | Explanation
Setup.sh | Parent script that generates the artificial data (it runs the four scripts below).
Mkjcfs.sh | Script that generates the JICFS product classification masters.
Mksyo.sh | Script that generates the product master.
Mktra.sh | Script that generates the customer purchase history data.
Mksepdat.sh | Script that splits the customer purchase history data by month, then compresses and stores it.
Dat/cust.xt | Customer master
Dat/jicfs1.xt | JICFS major classification master
Dat/jicfs2.xt | JICFS middle classification master
Dat/jicfs4.xt | JICFS minor classification master
Dat/jicfs6.xt | JICFS detailed classification master
Dat/syo.xt | Product master
Dat/dat.xt | Customer purchase history data
Dat/dat.xt.gz | Customer purchase history data split by month
Input/jicfs.txt | Original text data of the JICFS master
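
As a quick check for step 3, the following Python sketch lists any expected file that is missing after extraction. It is only an illustration: the function name missing_files is hypothetical, and the file names are copied from the table above, so their letter case may need adjusting to match the archive.

```python
from pathlib import Path

# File names copied from the table above; adjust the case if the archive differs.
EXPECTED_FILES = [
    "Setup.sh", "Mkjcfs.sh", "Mksyo.sh", "Mktra.sh", "Mksepdat.sh",
    "Dat/cust.xt", "Dat/jicfs1.xt", "Dat/jicfs2.xt", "Dat/jicfs4.xt",
    "Dat/jicfs6.xt", "Dat/syo.xt", "Dat/dat.xt", "Dat/dat.xt.gz",
    "Input/jicfs.txt",
]


def missing_files(base_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("basic")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```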

The archive already contains the data generated by the Setup.sh script (the files under the dat directory). If you only want to use the data, you do not need to run Setup.sh; simply use the files under the dat directory.
If you want to generate different artificial data by adjusting the parameters, refer to the explanations in the following sections, edit the scripts, and then run them.

Data generation scripts

The scripts carry out four tasks: 1) converting the JICFS code file, 2) generating the product master file, 3) generating the purchase history data, and 4) compiling the purchase history data by month. The JICFS code is a product classification code managed by the Distribution Systems Research Institute. The first script (mkjicfs.sh) converts the JICFS classification codes into XMLtable data (four files: major, middle, minor, and detailed classification). The second script (mksyo.sh) generates the product master file automatically from this JICFS code table. It registers several products under each detailed classification and also generates fields such as manufacturer code, brand code, and unit cost from random numbers. The third script (mktra.sh) generates the customer purchase history data and the customer master file in a similar way. The fourth script (mksepdat.sh) splits the purchase history data generated earlier into monthly files and stores them gzip-compressed.
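
The second task (generating the product master) could look roughly like the Python sketch below. It is not the code of mksyo.sh, which operates on XMLtable files: the function name build_product_master, the row layout, and the manufacturer and brand code formats are all hypothetical. The numeric defaults follow the parameter table later on this page, reading the unit cost default as mean 298 yen with standard deviation 100 yen.

```python
import random


def build_product_master(detail_codes, rng=random):
    """Return product rows for the given JICFS detailed classification codes."""
    products = []
    for code in detail_codes:
        # Products per classification: normal(4, 2); clamping at 0 is an assumption.
        count = max(0, round(rng.gauss(4, 2)))
        for _ in range(count):
            cost = round(rng.gauss(298, 100))  # unit cost in yen
            if cost <= 50:
                continue  # products costing 50 yen or less are dropped
            products.append({
                "product_id": f"{len(products) + 1:06d}",  # hypothetical numbering
                "jicfs6": code,
                "maker": f"M{rng.randrange(100):02d}",  # hypothetical code format
                "brand": f"B{rng.randrange(100):02d}",  # hypothetical code format
                "unit_cost": cost,
            })
    return products
```

Passing a seeded generator such as random.Random(0) makes a run repeatable, which is convenient when experimenting with the parameters.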
For details of how each script works, please refer to the comments in the script itself. Running the scripts requires MUSASHI-CORE 1.0.2 or later to be installed.
Because the scripts use random numbers, each run of Setup.sh produces different data.
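
The fourth task (splitting the purchase history by month and compressing each part) can be sketched in Python as follows. This is not the implementation of mksepdat.sh, which works on XMLtable files with MUSASHI commands. The record layout (dicts carrying a 'date' value in YYYYMMDD form), the function name split_by_month, and the output file naming are assumptions made for illustration.

```python
import csv
import gzip
from collections import defaultdict
from pathlib import Path


def split_by_month(records, out_dir):
    """Write one gzip-compressed CSV per month and return the created paths.

    records: iterable of dicts that share the same keys and carry a
    'date' value formatted as YYYYMMDD (an assumed layout).
    """
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["date"][:6]].append(rec)  # YYYYMM is the grouping key

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for month in sorted(by_month):
        rows = by_month[month]
        path = out / f"{month}.csv.gz"
        with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```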

Adjustable parameters

The table below lists the items that are determined by a rule during automatic data generation, together with their default settings.

Item | Calculation rule | Default
Number of products | Determined for each JICFS detailed classification by a normally distributed random number. | Normally distributed random number with mean 4 and standard deviation 2. (Below, this is written nrand(mean, standard deviation).)
Product classifications handled | Handling every classification would produce an enormous number of products, so only some classifications are selected. | Only middle classifications 11, 12, 13, and 14 are handled.
Unit cost | The same product always has the same unit cost. Determined by a normally distributed random number. | nrand(298 yen, 100 yen). Products whose unit cost comes out at 50 yen or less are deleted.
Number of customers per store | Store codes and customer counts are specified by hand. | 7 stores (A to G), 2350 customers in total
Customer visit time | Determined by a random number that makes certain times of day peak. | Visits peak at around 11:00 and 16:00.
Purchase quantity per customer visit | Determined by a normally distributed random number. | nrand(nrand(5, 2), 3). Values of 0 or less are excluded.
Customer visit interval | Determined by a normally distributed random number. | nrand(30 days, 10 days)
Customer first visit date | Determined by a random number so that the dates spread suitably around a given date. | 2001/07/01 + nrand(0, 200 days)
Customer last visit date | Determined by a random number so that the dates spread suitably around a given date. | 2003/07/01 + nrand(0, 200 days)
Purchased products | How often each product sells is varied by a random number. | nrand(midpoint of the product numbers, number of products / 3)
Sex | Male or female is chosen at random in a fixed ratio. | Uniform random number giving male : female = 1 : 5
Customer date of birth | Determined by a normally distributed random number. | "1960 + nrand(0, 10)"/01/01 + nrand(0, 365 days)
Selling price | For each product at a given store, the price changes from date to date with a certain probability. | On the first date, the price is calculated as unit cost × (1 + nrand(0.3, 0.1)). From the next date on, it is calculated from the previous date's price: with 90% probability the price of the previous row is reused; with 10% probability the price changes, adding 10% to the price of the previous row.
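
To make the nrand(mean, standard deviation) notation concrete, here is a minimal Python sketch of three rules from the table: purchase quantity, visit dates, and selling price. It is an interpretation, not the scripts' code. The function names are hypothetical; rounding to whole numbers, redrawing quantities of 0 or less, forcing visit intervals to at least one day, and reading "adding 10%" as multiplying the previous price by 1.1 are all assumptions.

```python
import random
from datetime import date, timedelta


def nrand(mean, sd, rng=random):
    """Draw one normally distributed value with the given mean and sd."""
    return rng.gauss(mean, sd)


def purchase_quantity(rng=random):
    """Items bought in one visit: nrand(nrand(5, 2), 3), excluding <= 0."""
    while True:
        qty = round(nrand(nrand(5, 2, rng), 3, rng))
        if qty > 0:
            return qty  # redrawing until positive is an assumption


def visit_dates(rng=random):
    """Visit dates for one customer between a random first and last date."""
    start = date(2001, 7, 1) + timedelta(days=round(nrand(0, 200, rng)))
    end = date(2003, 7, 1) + timedelta(days=round(nrand(0, 200, rng)))
    visits, day = [], start
    while day <= end:
        visits.append(day)
        # Interval nrand(30, 10) days; a minimum of 1 day is an assumption.
        day += timedelta(days=max(1, round(nrand(30, 10, rng))))
    return visits


def selling_prices(unit_cost, n_days, rng=random):
    """Daily selling prices (whole yen) for one product at one store."""
    if n_days <= 0:
        return []
    # First date: unit cost marked up by a factor of 1 + nrand(0.3, 0.1).
    price = unit_cost * (1 + nrand(0.3, 0.1, rng))
    prices = [round(price)]
    for _ in range(n_days - 1):
        # With 10% probability the price changes; otherwise it carries over.
        if rng.random() < 0.10:
            price *= 1.10  # assumed reading of "adding 10%"
        prices.append(round(price))
    return prices


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demonstration is repeatable
    print(purchase_quantity(rng))
    print(visit_dates(rng)[:3])
    print(selling_prices(300, 10, rng))
```

Each function accepts an optional generator, so the same seeded random.Random instance can drive all rules and reproduce a data set, whereas the default module-level generator gives different data on every run.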

Future schedule
