To create a report with hypotheses, click on the Create hypotheses tab and select:

- one or more (numerical or categorical) Experimental Observations for which you want to find hypotheses (right panel);
- the Background Knowledge you want to include in the hypotheses (left panel), i.e., the components of the hypothesis language; by default, all background knowledge apart from the observations is activated.

Next, click the Create button at the bottom to start the hypothesis formulation process. The radio buttons just above the Create button allow you to control (also while the hypothesis formulation process is running) the type of hypotheses searched for: low values, high values, or both. The default is to look for indications of both high and low values. In particular, if you intend to use the generated hypotheses for ranking new cases, the default setting is preferable.
While the hypothesis formulation process is running, you can at any time click the Stop button to (gently) interrupt the process and stick to the results produced so far.
By clicking on the Show Advanced button (top right corner) you can enter a name for your report and modify five parameters that influence the search for hypotheses.
Basically, the higher the values for the first four parameters, the more intensely DMax Assistant™ will search. A more intense search means:

- on the positive side: potentially better results;
- on the negative side: guaranteed longer time to complete.

These parameters thus allow you to explore the quality vs. speed tradeoff, even while the tool is running. You could, for instance, start with high values and lower them later on to speed things up (accepting a potential loss in quality).
To use these parameters effectively, you need some insight into the way DMax Assistant™ looks for hypotheses. Users happy with the default settings can safely skip the rest of this page.
The first step is to exhaustively try all hypotheses that refer to a single item of background knowledge. These single-item hypotheses are ranked on the basis of "interestingness": interesting hypotheses are those that group examples with an average target value that is unusually high or low (see Section Statistics for a discussion of the statistical tests used).
Next, the top N single item hypotheses are selected for extension. The idea is that the hypothesis might be improved by making it more restrictive, i.e., longer. Extra conditions might lead to a smaller group of covered examples, with a potentially more unusual average target value. You can control the N in "top N" via the Field of View Width parameter. By default, this parameter has value 5, which means at most the top 5 single item hypotheses will be selected for extension.
The above procedure is then repeated: the 2-item hypotheses are evaluated and ranked, the top N are selected for further extension to create 3-item hypotheses, and so on.
You can control the maximal number of background items that are combined via the Maximal rule length parameter. By default, this parameter has value 3, which means that hypotheses will be constructed in maximally 3 steps. Notice that in each step DMax Assistant™ will try to further improve the top N hypotheses by adding conditions that further isolate a subgroup with an unusual average target value.
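The search strategy above can be sketched as a simple beam search. The following is a minimal illustration, not the tool's actual implementation: a "condition" is modeled as a predicate on an example dictionary, the scoring function is a hypothetical stand-in for the statistical tests DMax Assistant™ actually uses, and the parameter names merely mirror the UI labels (Field of View Width, Maximal rule length).

```python
def mean(values):
    return sum(values) / len(values)

def score(hypothesis, examples):
    # Stand-in "interestingness": how far the subgroup's average target
    # deviates from the overall average (the real tool uses statistical tests).
    covered = [e for e in examples if all(cond(e) for cond in hypothesis)]
    if not covered:
        return 0.0
    overall = mean([e["target"] for e in examples])
    return abs(mean([e["target"] for e in covered]) - overall)

def search(conditions, examples, field_of_view_width=5, max_rule_length=3):
    # Step 1: exhaustively evaluate all single-item hypotheses.
    beam = sorted(([c] for c in conditions),
                  key=lambda h: score(h, examples), reverse=True)
    best = beam[0]
    # Steps 2..max_rule_length: extend only the top-N hypotheses.
    for _ in range(max_rule_length - 1):
        beam = [h + [c] for h in beam[:field_of_view_width]
                for c in conditions if c not in h]
        if not beam:
            break
        beam.sort(key=lambda h: score(h, examples), reverse=True)
        if score(beam[0], examples) > score(best, examples):
            best = beam[0]
    return best
```

The sketch shows why higher values make the search both slower and potentially more thorough: with Field of View Width = N, each round keeps only the N best hypotheses before extending them, and Maximal rule length bounds how many extension rounds are attempted.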
The steps above result in a hypothesis that is just specific enough to define an interesting subgroup. In many situations, it is desirable to add additional characteristics that are common to all members of the subgroup. To achieve this, we can stepwise specialize the hypothesis by adding conditions. The number of specialization steps can be controlled with the Max specialization steps parameter. If you set this parameter to 0, no specialization will happen. By default, maximally 8 specialization steps will be tried.
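The specialization phase can be illustrated in the same hypothetical style, reusing the toy representation of conditions as predicates on example dictionaries. The loop appends (up to a maximum number of steps) conditions that hold for every covered example, so the rule describes the subgroup more fully without changing which examples it covers; this is a sketch under those assumptions, not the tool's actual algorithm.

```python
def specialize(hypothesis, conditions, examples, max_steps=8):
    # Examples covered by the hypothesis found in the search phase.
    covered = [e for e in examples if all(cond(e) for cond in hypothesis)]
    for _ in range(max_steps):
        extra = next((c for c in conditions
                      if c not in hypothesis
                      and all(c(e) for e in covered)), None)
        if extra is None:
            break  # no further condition is shared by all members
        hypothesis = hypothesis + [extra]
    return hypothesis
```

Setting max_steps to 0 skips the loop entirely, matching the documented behaviour of the Max specialization steps parameter.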
The paragraphs above describe the strategy for finding one hypothesis. To find multiple hypotheses, DMax Assistant™ will repeat this strategy. Obviously, if we were to apply exactly the same procedure, we would obtain the same hypothesis in an endless loop. To avoid this, we use the concept of "fading data": the examples covered by a hypothesis fade away, such that in the next round they contribute less to the score of the hypotheses that cover them. As a result:

- the hypotheses found in previous rounds will get a lower score;
- other hypotheses, in particular those that cover yet uncovered examples, move to the top of the ranking;
- the process converges to a situation where all "unusual" examples have faded away and no more interesting rules can be found.

The speed with which examples fade away can be controlled with the Allow overlap parameter. Higher values for this parameter cause examples to fade away more slowly. As a result, single examples will contribute to multiple hypotheses, which will therefore tend to "overlap" (i.e., refer to similar examples). Notice that hypotheses can overlap even with low values, but their quality is then calculated merely on the basis of their coverage of "new" (i.e., not yet covered) examples.
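The fading mechanism can be sketched as example weights, again on the toy representation of conditions as predicates: each example starts with weight 1.0 and is multiplied by a fade factor every time an already-found hypothesis covers it, so explained examples contribute less in later rounds. Treating the Allow overlap parameter directly as that factor is an assumption made for illustration, but it reproduces the documented behaviour: values near 1.0 fade slowly (more overlap), values near 0.0 fade fast.

```python
def fading_weights(examples, found_hypotheses, allow_overlap=0.5):
    weights = [1.0] * len(examples)
    for hypothesis in found_hypotheses:
        for i, example in enumerate(examples):
            if all(cond(example) for cond in hypothesis):
                weights[i] *= allow_overlap  # covered: fade this example
    return weights
```

A weighted version of the scoring function would then multiply each covered example's contribution by its weight, which is what pushes previously found hypotheses down the ranking in later rounds.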
The fifth parameter, Stop after x very low confidence rules, can be selected or deselected. If it is selected, the search for new hypotheses to explain either low or high values is interrupted as soon as a user-defined number of weak hypotheses has been found. Notice that this stop criterion relies exclusively on training data: a weak hypothesis is one with a p-value above 0.5 as computed on the training set. Weak hypotheses can be interesting, as they may capture small subgroups of exceptional cases. However, in large datasets it may be necessary to limit the maximal number of such small groups that are returned.
© 2002-2010 DTAI, K.U.Leuven. All rights reserved.