Association Rules and Frequent Itemsets
1. Association rules are statements of the form {X1, X2, ..., Xn} ⇒ Y, meaning that if we find all of X1, X2, ..., Xn in the market basket, then we have a good chance of finding Y. The probability of finding Y in a basket that contains all of X1, X2, ..., Xn is called the confidence of the rule. We would normally search only for rules whose confidence is above a certain threshold. We may also ask that the confidence be significantly higher than it would be if items were placed at random into baskets. For example, we might find a rule like {milk, butter} ⇒ bread simply because a lot of people buy bread. However, the beer/diapers story asserts that the rule {diapers} ⇒ beer holds with confidence significantly greater than the fraction of baskets that contain beer.
2. Causality. Ideally, we would like to know that in an association rule the presence of X1, ..., Xn actually "causes" Y to be bought. However, "causality" is an elusive concept. Nevertheless, for market-basket data, the following test suggests what causality means. If we lower the price of diapers and raise the price of beer, we can lure diaper buyers, who are more likely to pick up beer while in the store, thus covering our losses on the diapers. That strategy works because "diapers causes beer." However, working it the other way round, running a sale on beer and raising the price of diapers, will not result in beer buyers buying diapers in any great numbers, and we lose money.
3. Frequent itemsets. In many (but not all) situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets. For example, we cannot run a good marketing strategy involving items that no one buys anyway. Thus, much data mining starts with the assumption that we only care about sets of items with high support; i.e., they appear together in many baskets. We then find association rules or causalities only involving a high-support set of items (i.e., {X1, ..., Xn, Y} must appear in at least a certain percent of the baskets, called the support threshold).
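
To make support and confidence concrete, here is a minimal Python sketch. It is not taken from these notes: the toy baskets and the names BASKETS, support, confidence, and frequent_itemsets are all illustrative assumptions, and the brute-force enumeration is only workable on tiny data.

```python
from itertools import combinations

# Hypothetical toy data (an assumption for illustration):
# each basket is the set of items bought together.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "butter", "bread", "beer"},
    {"bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing all of `lhs` that also contain `rhs`."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= b]
    if not with_lhs:
        return 0.0  # rule never applies; report zero rather than divide by zero
    return sum(rhs in b for b in with_lhs) / len(with_lhs)


def frequent_itemsets(baskets, threshold, max_size=3):
    """Brute-force list of itemsets (up to `max_size` items) with support >= threshold."""
    items = sorted(set().union(*baskets))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = support(combo, baskets)
            if s >= threshold:
                result[frozenset(combo)] = s
    return result


if __name__ == "__main__":
    # Confidence of {diapers} => beer versus the plain fraction of baskets with beer.
    print(confidence({"diapers"}, "beer", BASKETS))
    print(support({"beer"}, BASKETS))
    # Itemsets that meet a 30% support threshold.
    for itemset, s in frequent_itemsets(BASKETS, 0.3).items():
        print(sorted(itemset), round(s, 2))
```

On this toy data the rule {diapers} ⇒ beer has confidence 2/3, while beer appears in only half of the baskets, so the rule passes the "better than random placement" test. The rule {milk, butter} ⇒ bread has confidence 1, but bread is already in two thirds of the baskets, so part of that confidence just reflects how popular bread is.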