Data mining researchers working on business applications face a serious problem: real business data is difficult to obtain for research use. We have therefore developed a system that produces artificial customer purchase history data. Because the data is artificial, there are no confidentiality concerns. (The tutorial uses this artificial data.)
When generating artificial data, it is important that the result resembles real data. The current system generates its data with a simple random number generator, so it is not very realistic in some respects. We plan to refine it to make the data more realistic.
1) Download the archive containing the latest scripts and data (approximately 290 KB).
2) Extract the files (tar zxvf basic.tar.gz).
3) Verify that the following files have been extracted under the basic directory.
File name | Explanation |
Setup.sh | Parent script that generates the artificial data (it runs the four scripts below). |
mkjicfs.sh | Script that generates the JICFS product classification masters. |
mksyo.sh | Script that generates the product master. |
mktra.sh | Script that generates the customer purchase history data. |
mksepdat.sh | Script that splits the customer purchase history data by month, then compresses and stores it. |
dat/cust.xt | Customer master |
dat/jicfs1.xt | JICFS large classification master |
dat/jicfs2.xt | JICFS medium classification master |
dat/jicfs4.xt | JICFS small classification master |
dat/jicfs6.xt | JICFS fine classification master |
dat/syo.xt | Product master |
dat/dat.xt | Customer purchase history data |
dat/dat.xt.gz | Customer purchase history data split by month |
Input/jicfs.txt | Original text data of the JICFS master |
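The check in step 3 can also be done programmatically. The following is only a minimal sketch in Python, not part of the MUSASHI distribution: the function name missing_files is hypothetical, and the file names are taken from the listing above. Names are compared case-insensitively because the capitalization shown in the listing may not match the names on disk, and the monthly compressed files are left out because their exact names are not stated here.

```python
from pathlib import Path

# File names taken from the listing above (lowercased for comparison).
# The monthly .gz output is omitted because its exact naming is not stated.
EXPECTED = [
    "setup.sh", "mkjicfs.sh", "mksyo.sh", "mktra.sh", "mksepdat.sh",
    "dat/cust.xt", "dat/jicfs1.xt", "dat/jicfs2.xt", "dat/jicfs4.xt",
    "dat/jicfs6.xt", "dat/syo.xt", "dat/dat.xt", "input/jicfs.txt",
]


def missing_files(base_dir, expected=EXPECTED):
    """Return the expected relative paths that are absent under base_dir."""
    base = Path(base_dir)
    # Collect every file below base_dir as a lowercased, slash-separated path.
    present = {
        p.relative_to(base).as_posix().lower()
        for p in base.rglob("*")
        if p.is_file()
    }
    return [name for name in expected if name.lower() not in present]


if __name__ == "__main__":
    for name in missing_files("basic"):
        print("missing:", name)
```

An empty result means every listed file was found under the given directory.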
The archive already includes the data generated by the Setup.sh script (the files under the dat directory), so those who only want to use the data do not need to run Setup.sh. Please use the files under the dat directory.
Those who want to generate different artificial data by adjusting the parameters should refer to the explanation in the next section, modify the scripts accordingly, and run them.
The script consists of four steps: 1) conversion of the JICFS code file, 2) generation of the product master file, 3) generation of the purchase history data, and 4) compilation of the purchase history data by month. JICFS codes are the product classification codes managed by the Distribution Systems Research Institute. The first script (mkjicfs.sh) converts the JICFS classification codes into XML table data (four files: large, medium, small, and fine classification). The second script (mksyo.sh) automatically generates the product master file from these JICFS code tables. It registers several products for each fine classification item and generates additional fields such as manufacturer code, brand code, and unit cost from random numbers. The third script (mktra.sh) generates the customer purchase history data and the customer master file in a similar way. The fourth script (mksepdat.sh) splits the purchase history data generated in the previous step into monthly files and stores them gz-compressed.
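As an illustration of the fourth step, the sketch below groups purchase records by month and writes one gzip-compressed file per month. It is written in Python for readability and is not how mksepdat.sh is implemented (the actual script operates on XML table data with MUSASHI commands). The record layout (a "YYYYMMDD" date string paired with a text line), the function name split_by_month, and the output file names are all assumptions made for this example.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def split_by_month(records, out_dir):
    """Group (date, line) records by YYYYMM and write one .gz file per month.

    `records` is an iterable of (date, line) pairs where date is a
    "YYYYMMDD" string (an assumed layout, not the real .xt format).
    Returns a dict mapping "YYYYMM" to the path that was written.
    """
    by_month = defaultdict(list)
    for date, line in records:
        by_month[date[:6]].append(line)  # first six characters are YYYYMM

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = {}
    for month, lines in sorted(by_month.items()):
        path = out / f"{month}.txt.gz"  # hypothetical file naming
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write("\n".join(lines) + "\n")
        paths[month] = path
    return paths
```

Calling split_by_month(records, "monthly") on records from two different months would produce two compressed files, one per month.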
For details of how each script works, please refer to the comments in the script itself. Running the scripts requires MUSASHI-CORE 1.0.2 or later to be installed.
Note that because the scripts use random numbers, each run of Setup.sh generates different data.
As shown in the table below, each item is determined by a specific rule when the data is generated automatically.
Item | Calculation rule | Default |
Number of products | Determined for each JICFS fine classification by a normally distributed random number. | A normally distributed random number with mean 4 and standard deviation 2 (written below as nrand(mean, standard deviation)). |
Product classifications handled | Handling every classification would produce an enormous number of products, so only certain classifications are selected. | Only medium classifications 11, 12, 13, and 14 are handled. |
Unit cost | The same product always has the same unit cost. Determined by a normally distributed random number. | nrand(298 yen, 100 yen). Products whose unit cost comes out at 50 yen or less are deleted. |
Number of customers per store | The store codes and the number of customers are specified by hand. | Seven stores, A through G, with 2350 customers in total |
Customer visit time | Determined by a random number that peaks at certain times. | Visit times peak at around 11:00 and 16:00. |
Purchase quantity per customer visit | Determined by a normally distributed random number. | nrand(nrand(5,2),3). Values of 0 or less are excluded. |
Customer visit interval | Determined by a normally distributed random number. | nrand(30 days, 10 days) |
Customer visit start date | Determined by a random number so that the dates are distributed around a certain date. | 2001/07/01 + nrand(0, 200 days) |
Customer visit end date | Determined by a random number so that the dates are distributed around a certain date. | 2003/07/01 + nrand(0, 200 days) |
Purchased product | The frequency with which each product sells is varied by a random number. | nrand(midpoint of the product numbers, number of products / 3) |
Sex | Male or female is decided at random in a fixed ratio. | A uniform random number giving male : female = 1 : 5 |
Date of birth of customer | Determined by a normally distributed random number. | "1960 + nrand(0,10)"/01/01 + nrand(0, 365 days) |
Selling price | For each product at a given store, the price fluctuates from date to date with a certain probability. | On the first date, the price is calculated as unit cost × (1 + nrand(0.3, 0.1)). From the next date on, it is calculated from the price on the previous date. With 90% probability the same price as the previous row is used. With 10% probability the price changes, and 10% is added to the price of the previous row. |
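A few of the rules in the table above can be sketched in Python as follows. This is only an illustration of the stated rules, not the code used by the MUSASHI scripts. The function names are hypothetical, the unit cost parameters are read as mean 298 yen and standard deviation 100 yen, purchase quantities are rounded to whole numbers, and a quantity of 0 or less is handled by drawing again; all of these are assumptions of this sketch.

```python
import random


def nrand(mean, sd, rng=random):
    """Normally distributed random number, matching the nrand notation above."""
    return rng.gauss(mean, sd)


def product_costs(n, rng=random):
    """Draw n unit costs and delete products costing 50 yen or less.

    Assumes nrand(298, 100), i.e. mean 298 yen and standard deviation 100 yen.
    """
    costs = [nrand(298, 100, rng) for _ in range(n)]
    return [c for c in costs if c > 50]


def purchase_quantity(rng=random):
    """Quantity per visit: nrand(nrand(5, 2), 3), with 0 or less excluded.

    Rounding to an integer and redrawing on exclusion are assumptions.
    """
    while True:
        q = round(nrand(nrand(5, 2, rng), 3, rng))
        if q > 0:
            return q


def selling_prices(cost, n_days, rng=random):
    """Daily selling prices for one product at one store.

    Day 1: cost * (1 + nrand(0.3, 0.1)).
    Later days: keep the previous price with 90% probability,
    otherwise add 10% to the previous price.
    """
    if n_days <= 0:
        return []
    price = cost * (1 + nrand(0.3, 0.1, rng))
    prices = [price]
    for _ in range(n_days - 1):
        if rng.random() < 0.10:
            price *= 1.10
        prices.append(price)
    return prices
```

Passing a seeded generator such as random.Random(0) as rng makes a run repeatable, which is convenient when comparing parameter settings; with the default argument each run gives different values, as with Setup.sh.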
Copyright 2004 MUSASHI |