Data Mining Assignment 1 (HW1)

HW1

Due Date: Nov. 2

Submission requirements:

Please submit your solutions to our class website. Only hand in what is required below.

Upload the Clementine stream containing the assignment execution to our class website so that we may refer to it if necessary.

Part I: Written Assignment

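1. Suppose a data warehouse consists of four dimensions, date, product, vendor, and location, and two measures, sales_volume and sales_cost.

(a) Draw a star schema diagram for the data warehouse.

(b) Starting from the base cuboid [date, product, vendor, location], list the sales_volume of each vendor in Los Angeles for each year.

(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.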
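As background for Question 1(c), the following is a minimal sketch of what a bitmap index over a low-cardinality dimension looks like. The rows and the vendor names are made up for illustration, and Python integers stand in for the bit vectors; a real warehouse would use compressed bitmaps.

```python
# Toy fact-table rows; the values are hypothetical, not taken from the assignment.
rows = [
    {"location": "Los Angeles", "vendor": "V1"},
    {"location": "Chicago", "vendor": "V2"},
    {"location": "Los Angeles", "vendor": "V2"},
    {"location": "New York", "vendor": "V1"},
]

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into the list of row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

location_index = build_bitmap_index(rows, "location")
vendor_index = build_bitmap_index(rows, "vendor")

# A conjunctive selection becomes a single bitwise AND of two bitmaps.
la_and_v2 = location_index["Los Angeles"] & vendor_index["V2"]
print(matching_rows(la_and_v2))
```

Each distinct value costs one bit per row, which is why the structure suits dimensions with few distinct values and becomes expensive for high-cardinality ones.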

2. Suppose a hospital tested the age and body fat data for 18 randomly selected adults, with the following results:

age    %fat
23     9.5
23     26.5
27     7.8
27     17.8
39     31.4
41     25.9
47     27.4
49     27.2
50     31.2
52     34.6
54     42.5
54     28.8
56     33.4
57     30.2
58     34.1
58     32.9
60     41.2
61     35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.

(b) Draw the boxplots for age and %fat.

(c) Draw a scatter plot based on these two variables.

(d) Normalize the two variables based on min-max normalization.

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?
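The quantities asked for in Question 2 can be cross-checked with a short script. This is a sketch using only the Python standard library; the helper names `min_max` and `pearson` are illustrative, and the choice of the population standard deviation (`pstdev`) is an assumption, since the question does not say whether the population or sample formula is intended.

```python
import math
import statistics

# Age and body fat values as listed in Question 2.
age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def pearson(xs, ys):
    """Pearson's product moment coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

for name, values in (("age", age), ("fat", fat)):
    # pstdev is the population standard deviation; use statistics.stdev for the sample version.
    print(name, statistics.fmean(values), statistics.median(values), statistics.pstdev(values))

age_scaled = min_max(age)
fat_scaled = min_max(fat)
r = pearson(age, fat)
print(f"r = {r:.3f}")
```

A positive `r` indicates that the two variables tend to increase together; a negative one indicates the opposite.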

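3. The following are the sales figures of a certain product at a supermarket for 20 consecutive months (unit: hundred yuan):

21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26.

Apply equal-depth binning with a depth of 5 to the data above.

(a) Smooth the data by bin medians.

(b) Smooth the data by bin boundaries.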
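The binning procedure in Question 3 can be sketched as follows. The function names are illustrative, and sending a value that lies exactly halfway between the two boundaries to the lower boundary is an assumption; the question does not specify a tie rule.

```python
import statistics

# Monthly sales figures as listed in Question 3.
sales = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17,
         16, 20, 23, 22, 18, 24, 26, 25, 20, 26]

def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_median(bins):
    """Replace every value in a bin with that bin's median."""
    return [[statistics.median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = equal_depth_bins(sales, 5)
for original, med, bnd in zip(bins, smooth_by_median(bins), smooth_by_boundaries(bins)):
    print(original, med, bnd)
```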

4. A database has 5 transactions. Let min_sup = 40%.

TID    items_bought
1      O, N, K, E
2      D, O, K, E, Y
3      A, K, E, O
4      M, U, C, K
5      C, O, K, I, E

(a) Find all frequent itemsets using Apriori. Please present the frequent itemsets in each iteration.

PAGE 1 11/15/2015

(b) Find all frequent itemsets using FP-growth. Please present all the FP-trees, conditional pattern bases, and frequent patterns in the entire process.
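The level-wise search used by Apriori in part (a) can be sketched as below. This is a compact illustration of the join, prune, and count steps rather than an optimized implementation, and it does not cover FP-growth; the function name `apriori` and its return format are assumptions made for this example.

```python
from itertools import combinations

# The five transactions from Question 4.
transactions = [
    {"O", "N", "K", "E"},
    {"D", "O", "K", "E", "Y"},
    {"A", "K", "E", "O"},
    {"M", "U", "C", "K"},
    {"C", "O", "K", "I", "E"},
]

def apriori(transactions, min_sup):
    """Return a list of dicts; entry k-1 maps each frequent k-itemset to its support count."""
    n = len(transactions)

    def count(candidates):
        # Scan the database once and keep candidates that meet the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt for c, cnt in counts.items() if cnt / n >= min_sup}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    levels = []
    k = 1
    while level:
        levels.append(level)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count(candidates)
    return levels

levels = apriori(transactions, 0.4)
for k, level in enumerate(levels, start=1):
    print(k, {"".join(sorted(itemset)): cnt for itemset, cnt in level.items()})
```

Each printed line corresponds to one iteration, which matches the per-iteration presentation the question asks for.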

5. Consider the data set shown in Question 4. Assume min_sup = 40%, min_conf=40%.

(a) Please present the confidence for all the frequent 3-itemsets.

(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g. “A”, “B”, etc.):

Part II: Lab Assignment: Recommendation Systems

The goal of this assignment is to learn the use of market basket analysis for the purpose of making product purchase recommendations to the customers.

The data set contains transactions from a large supermarket. Each transaction is made by someone holding the loyalty card. We limited the total number of categories in this supermarket data to 20 categories for simplicity. The field value for a certain product in the transaction basket is 1 if the customer has bought it and 0 if he/she has not. The file named “Transactions” has data for 46243 transactions. The data are available from the class web page.

Your written submission should consist only of those deliverables marked “Hand-in”.

Market basket analysis aims to discover individual products, or groups of products, that tend to occur together in transactions. A business can use the knowledge obtained from a market basket analysis to recognize products frequently sold together, in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.

Run Apriori on the “Transactions” data set. Set the “Type” of “COD” to “Typeless”; set the “Direction” of the other 20 categories to “Both” and their “Type” to “Flag”. In the modeling node (Apriori node), set “Minimum antecedent support” to 5%, “Minimum confidence” to 50%, and “Maximum number of antecedents” to 5. In general, you should explore different values of these parameters to see what types of rules you get.

• Hand-in: The list of association rules generated by the model.

• Sort the rules by lift, support, and confidence, respectively, to see the rules identified.

Hand-in: For each case, choose the top 5 rules (note: make sure there are no redundant rules among the 5) and give 2-3 lines of comments. Many of the rules will be logically redundant and will therefore have to be eliminated after you think carefully about them.
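For reference, the three measures used to sort the rules can be computed by hand on a small example. The sketch below uses hypothetical 0/1 basket rows with made-up column names, not the 20 categories in the class data set, and the function names are illustrative; it is not a reproduction of what the Clementine Apriori node does internally.

```python
# Hypothetical flag-encoded baskets: 1 means the category was bought, 0 means it was not.
baskets = [
    {"bread": 1, "milk": 1, "beer": 0},
    {"bread": 1, "milk": 1, "beer": 1},
    {"bread": 1, "milk": 0, "beer": 0},
    {"bread": 0, "milk": 1, "beer": 0},
    {"bread": 0, "milk": 0, "beer": 1},
]

def fraction(rows, items):
    """Fraction of rows in which every listed item has flag 1."""
    return sum(all(r[i] == 1 for i in items) for r in rows) / len(rows)

def rule_measures(rows, antecedent, consequent):
    """Measures for the rule antecedent -> consequent (antecedent must occur at least once)."""
    ante = fraction(rows, antecedent)
    both = fraction(rows, antecedent + [consequent])
    confidence = both / ante
    lift = confidence / fraction(rows, [consequent])
    return {"antecedent_support": ante, "support": both,
            "confidence": confidence, "lift": lift}

m = rule_measures(baskets, ["bread"], "milk")
print(m)

# Rank a few candidate rules by lift, highest first.
rules = [(["bread"], "milk"), (["milk"], "bread"), (["bread"], "beer")]
ranked = sorted(rules, key=lambda r: rule_measures(baskets, *r)["lift"], reverse=True)
print(ranked)
```

A lift above 1 means the consequent is bought more often when the antecedent is present than it is overall; a lift below 1 means the opposite.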
