INFO 367
Chapter 2 Problems
2/6/17
2.1) Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
b. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions.
c. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known.
d. Identifying segments of similar customers.
e. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms.
f. Estimating the repair time required for an aircraft based on a trouble ticket.
g. Automated sorting of mail by zip code scanning.
h. Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.
2.2) Describe the difference in roles assumed by the validation partition and the test partition.
2.3) Consider the sample from the database of credit applicants in Table 2.5. Comment on the likelihood that it was sampled randomly, and whether it is likely to be a useful sample.
2.4) Consider the sample from a bank database shown in Table 2.6; it was selected randomly from a larger database to be the training set. Personal loan indicates whether a solicitation for a personal loan was accepted and is the response variable. A campaign is planned for a similar solicitation in the future, and the bank is looking for a model that will help identify likely responders. Examine the data carefully and indicate what your next step would be.
2.5) Using the concept of overfitting, explain why, when a model is fit to training data, zero error on those data is not necessarily good.
2.6) In fitting a model to classify prospects as purchasers or nonpurchasers, a certain company drew the training data from internal data that include demographic and purchase information. Future data to be classified will be lists purchased from other sources, with demographic data included. It was found that “refund issued” was a useful predictor in the training data. Why is this not an appropriate variable to include in the model?
2.7) A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records with missing values. About how many records would you expect to be removed?
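The arithmetic behind 2.7 can be sketched in a few lines. This is a minimal sketch, assuming each value is missing independently with probability 0.05 (the problem says only that the missing values are "spread randomly"); the variable names are illustrative.

```python
# Assumption (not stated outright in the problem): each value is missing
# independently with probability 0.05.
n_records = 1000
n_variables = 50
p_missing = 0.05

# A record is kept only if all 50 of its values are present.
p_complete = (1 - p_missing) ** n_variables

# Expected number of records with at least one missing value.
expected_removed = n_records * (1 - p_complete)

print(f"P(record complete) = {p_complete:.4f}")
print(f"Expected records removed = {expected_removed:.0f}")
```

Under that independence assumption, only about 8% of records are complete, so roughly 923 of the 1000 records would be dropped.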
2.8) Normalize the data.
| Age | Normalized Age | Income | Normalized Income |
|---|---|---|---|
| 25 | -1.313253 | 49000 | -0.790027 |
| 56 | 0.7567898 | 156000 | 0.9119774 |
| 65 | 1.35777 | 99000 | 0.0053022 |
| 32 | -0.845824 | 192000 | 1.4846144 |
| 41 | -0.244844 | 39000 | -0.949093 |
| 49 | 0.2893608 | 57000 | -0.662774 |

| Statistic | Age | Income |
|---|---|---|
| Average | 44.66667 | 98666.67 |
| Standard Deviation | 14.97554 | 62867.06 |
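The normalized values in the table for 2.8 appear to be z-scores: each value minus the column average, divided by the column's sample standard deviation (the n - 1 form, which is what reproduces 14.97554 and 62867.06). A minimal sketch that recomputes them with the Python standard library follows; the helper name `z_scores` is illustrative, not from the source.

```python
from statistics import mean, stdev

ages = [25, 56, 65, 32, 41, 49]
incomes = [49000, 156000, 99000, 192000, 39000, 57000]

def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]

age_z = z_scores(ages)
income_z = z_scores(incomes)

for a, az, inc, iz in zip(ages, age_z, incomes, income_z):
    print(f"{a:>3} {az:>10.6f} {inc:>7} {iz:>10.6f}")
```

Using the population standard deviation (`statistics.pstdev`) instead would give slightly larger z-scores in absolute value, so the choice is worth stating when reporting normalized data.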
2.9) Can normalizing the data change which records are furthest away from each other in terms of Euclidean distance?
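One way to explore 2.9 is to compare the furthest-apart pair of records before and after normalization. The sketch below reuses the six Age/Income records from problem 2.8 and assumes z-score normalization with the sample standard deviation; the function names are illustrative.

```python
from itertools import combinations
from math import dist
from statistics import mean, stdev

ages = [25, 56, 65, 32, 41, 49]
incomes = [49000, 156000, 99000, 192000, 39000, 57000]

def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def furthest_pair(points):
    """Return the 0-based index pair (i, j), i < j, with the largest Euclidean distance."""
    pairs = combinations(range(len(points)), 2)
    return max(pairs, key=lambda p: dist(points[p[0]], points[p[1]]))

raw_points = list(zip(ages, incomes))
norm_points = list(zip(z_scores(ages), z_scores(incomes)))

raw_pair = furthest_pair(raw_points)
norm_pair = furthest_pair(norm_points)

print("Furthest pair, raw data:       ", raw_pair)
print("Furthest pair, normalized data:", norm_pair)
```

On the raw data, income dominates the distance because its scale is thousands of times larger than that of age, so the furthest pair is simply the two records with the most extreme incomes. After normalization both variables contribute comparably, and a different pair comes out furthest.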
2.10) Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than Model B on the training data, but slightly less accurate than Model B on the validation data. Which model are you more likely to consider for final deployment?
2.11)