Solutions to the Exercises in "Learning From Data: A Short Course" (Last Updated: 2021-02-22)



This post collects my solutions to the exercises in the book Learning From Data: A Short Course by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, which is also a reference textbook for my machine learning course this semester (Semester B 2020/21).

If you find any mistake or problem in the solutions, please feel free to leave a comment and let me know; I would appreciate it.

Chapter 1: The Learning Problem

Exercise 1.1

Express each of the following tasks in the framework of learning from data by specifying the input space \(X\), output space \(Y\), target function \(f: X \to Y\), and the specifics of the data set that we will learn from.

(a) Medical diagnosis: A patient walks in with a medical history and some symptoms, and you want to identify the problem.

(b) Handwritten digit recognition (for example postal zip code recognition for mail sorting).

(c) Determining if an email is spam or not.

(d) Predicting how an electric load varies with price, temperature, and day of the week.

(e) A problem of interest to you for which there is no analytic solution, but you have data from which to construct an empirical solution.

Answer:

(a) Input space \(X\): the patient's medical history and symptoms. Output space \(Y\): whether the patient has the disease. Target function \(f\): the clinical diagnosis, based on experience, that maps the history and symptoms to a verdict on the disease. Data set: the main symptoms and medical histories associated with the target disease, which need to be as clear and correct as possible.

(b) Input space \(X\): handwritten digits. Output space \(Y\): the recognized digit. Target function \(f\): classifies a digit from its handwritten features. Data set: handwritten digit samples that are clear and cover as many different writing habits as possible.

(c) Input space \(X\): emails together with information about them, such as sender and time. Output space \(Y\): whether the email is spam or not. Target function \(f\): judges whether an email is spam based on keywords, sender, title, or other information. Data set: the keyword and sender sets need to be updated over time to improve the judgment.

(d) Input space \(X\): prices, temperatures, and days of the week. Output space \(Y\): the electric load. Target function \(f\): predicts the load from these inputs, for example with linear regression. Data set: as much historical data as possible, to make the prediction more accurate.

(e) Input space \(X\): the information describing the problem and some related data. Output space \(Y\): the solution to the problem. Target function \(f\): built from my own experience or research on the problem. Data set: the data should be relevant to the problem.

Exercise 1.2

Suppose that we use a perceptron to detect spam messages. Let's say that each email message is represented by the frequency of occurrence of keywords, and the output is \(+1\) if the message is considered spam.

(a) Can you think of some keywords that will end up with a large positive weight in the perceptron?

(b) How about keywords that will get a negative weight?

(c) What parameter in the perceptron directly affects how many borderline messages end up being classified as spam?

Answer:

(a)

For example, 'On Sale', 'Unsubscribe', 'for free', 'coupon', etc.

(b)

For example, 'schedule', 'weather', 'receipt', etc.

(c)

The bias (threshold) weight \(w_0\) directly affects how many borderline messages, i.e. messages whose score \(w^Tx\) is close to \(0\), end up being classified as spam.
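
As a minimal sketch (the keyword list, weights, and message below are made up for illustration), the threshold weight \(w_0\) decides which side of the boundary a borderline message falls on:

import numpy as np

x = np.array([1, 2, 0, 1])            # x_0 = 1 plus the frequencies of three keywords in a borderline message
w = np.array([-3.0, 1.0, 0.5, 1.0])   # w_0 = -3 is the threshold term; the rest are keyword weights

def is_spam(w, x):
    # perceptron decision: +1 (spam) if w.T x > 0, otherwise -1
    return 1 if w.dot(x) > 0 else -1

print(is_spam(w, x))                  # -3 + 2 + 0 + 1 = 0    -> -1 (not spam)
w2 = w.copy()
w2[0] = -2.5                          # a slightly larger w_0 lowers the effective threshold
print(is_spam(w2, x))                 # -2.5 + 2 + 0 + 1 = 0.5 -> +1 (spam)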

Exercise 1.3

The weight update rule in (1.3) has the nice interpretation that it moves in the direction of classifying \(x(t)\) correctly.

(a) Show that \(y(t)w^T(t)x(t) < 0\) . [Hint: \(x(t)\) is misclassified by \(w(t)\).]

(b) Show that \(y(t)w^T(t + 1)x(t) > y(t)w^T(t)x(t)\). [Hint: Use (1.3).]

(c) As far as classifying \(x(t)\) is concerned, argue that the move from \(w(t)\) to \(w(t + 1)\) is a move 'in the right direction'.

[PS: the (1.3) mentioned in the question is the update rule of the perceptron learning algorithm (PLA): \(w(t + 1) = w(t) + y(t)x(t)\).]

Answer:

(a)

Since

\(h(x) = \mathrm{sign}(w^Tx) = \begin{cases} +1, & w^Tx > 0 \\ -1, & w^Tx < 0 \end{cases}\), if \(x(t)\) is misclassified by \(w(t)\), then \(h(x(t))\) has the opposite sign from the label \(y(t)\).

That is, either \(h(x(t)) = +1\) and \(y(t) = -1\), or \(h(x(t)) = -1\) and \(y(t) = +1\).

Thus,

\(y(t)\,\mathrm{sign}(w^T(t)x(t)) = y(t)h(x(t)) = -1\), so \(y(t)\) and \(w^T(t)x(t)\) have opposite signs, which gives \(y(t)w^T(t)x(t) < 0\).

(b)

\(y(t)w^T(t + 1)x(t) = y(t)\big( w(t) + y(t)x(t) \big)^T x(t)\)

\(= y(t)w^T(t)x(t) + y^2(t)\,x^T(t)x(t)\)

\(= y(t)w^T(t)x(t) + x^T(t)x(t)\), since \(y^2(t) = 1\).

Moreover, since \(x_0 = 1\), we have \(x^T(t)x(t) \ge 1 > 0\).

In conclusion,

\(y(t)w^T(t + 1)x(t) = y(t)w^T(t)x(t) + x^T(t)x(t) > y(t)w^T(t)x(t)\).
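
A quick numerical check of this derivation (the weight vector and the point below are arbitrary, chosen only for illustration):

import numpy as np

w = np.array([1.0, -2.0, 0.5])   # current weights w(t)
x = np.array([1.0, 3.0, 1.0])    # x_0 = 1; a point x(t) misclassified by w(t)
y = 1                            # its label y(t)

before = y * w.dot(x)            # 1 - 6 + 0.5 = -4.5 < 0, i.e. misclassified, as in part (a)
w_next = w + y * x               # the PLA update (1.3)
after = y * w_next.dot(x)        # before + x.T x = -4.5 + 11.0 = 6.5

print(before, after, after > before)   # -4.5 6.5 True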

(c)

When \(x(t)\) is misclassified by \(w(t)\), there are two possible cases.

Case 1: \(y(t) = +1\), but \(h(x(t)) = \mathrm{sign}(w^T(t)x(t)) = -1\)

Figure: Exercise 1.3 part (c), Case 1.

Case 2: \(y(t) = -1\), but \(h(x(t)) = \mathrm{sign}(w^T(t)x(t)) = +1\)

Figure: Exercise 1.3 part (c), Case 2.

Thus,

in case 1, \(y(t) = +1\) while \(w^T(t)x(t) < 0\), so \(x(t)\) is misclassified as \(-1\) (the angle between \(w(t)\) and \(x(t)\) is more than 90°).

The update \(w(t+1) = w(t) + y(t)x(t) = w(t) + x(t)\) rotates the weight vector towards \(x(t)\), which increases \(w^T(t+1)x(t)\); the move is therefore in the right direction for classifying \(x(t)\) as \(+1\).

Similarly, in case 2, \(y(t) = -1\) while \(w^T(t)x(t) > 0\), so \(x(t)\) is misclassified as \(+1\) (the angle between \(w(t)\) and \(x(t)\) is less than 90°).

The update \(w(t+1) = w(t) + y(t)x(t) = w(t) - x(t)\) rotates the weight vector away from \(x(t)\), which decreases \(w^T(t+1)x(t)\); the move is therefore in the right direction for classifying \(x(t)\) as \(-1\).

Exercise 1.4

Let us create our own target function \(f\) and data set \(D\) and see how the perceptron learning algorithm works. Take \(d = 2\) so you can visualize the problem, and choose a random line in the plane as your target function, where one side of the line maps to \(+1\) and the other maps to \(-1\). Choose the inputs \(x_n\) of the data set as random points in the plane, and evaluate the target function on each \(x_n\) to get the corresponding output \(y_n\).
Now, generate a data set of size \(20\). Try the perceptron learning algorithm on your data set and see how long it takes to converge and how well the final hypothesis \(g\) matches your target \(f\). You can find other ways to play with this experiment in Problem 1.4.

Answer:

The target function \(f\) is represented by the weight vector \([6, 2, 4]\), i.e. \(f(x) = \mathrm{sign}(6 + 2x_1 + 4x_2)\), with the first component as the bias.

The PLA program, written in Python, is shown below.

# perceptron learning algorithm
import numpy as np

def getHypothesisVector(data):
    # data: rows of [x1, x2, label]
    # initialize the weight vector w from the first input point (bias weight set to 0)
    w = np.array([0, data[0][0], data[0][1]])
    update = True
    rounds = 0
    while update:
        rounds += 1

        # scan the data for a misclassified point
        for i in range(len(data)):
            x = np.array([1, data[i][0], data[i][1]])
            label = data[i][2]
            hypothesis = 1 if w.dot(x) > 0 else -1

            # if this point is misclassified, update w and rescan from the start
            if hypothesis * label <= 0:
                w = w + label * x
                update = True
                break
            # if every point is classified correctly, stop updating
            update = False
    return {'g': w, 'rounds': rounds}
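
A possible way to generate a data set and run the function above; the helper makeDataset, the sampling range, and the random seed are my own choices for illustration, assuming the target weight vector \([6, 2, 4]\) stated earlier:

import numpy as np

def makeDataset(target_w, size=20, low=-10, high=10, seed=0):
    # each row is [x1, x2, label], labelled by sign(target_w . [1, x1, x2])
    rng = np.random.default_rng(seed)
    points = rng.uniform(low, high, size=(size, 2))
    labels = np.sign(points.dot(target_w[1:]) + target_w[0])
    labels[labels == 0] = 1                  # put points on the boundary on the +1 side
    return np.column_stack([points, labels])

target = np.array([6, 2, 4])                 # f(x) = sign(6 + 2*x1 + 4*x2)
data = makeDataset(target, size=20)
result = getHypothesisVector(data)
print(result['g'], result['rounds'])

Since the data are labelled by a linear separator, PLA is guaranteed to converge; the number of rounds depends on the particular sample.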

Here are three cases for this function, each with a data set of size \(20\) but with different class ratios: case 1 contains \(10\) positive and \(10\) negative examples, case 2 has an unbalanced ratio of \(1:3\), and case 3 has an unbalanced ratio of \(3:1\).

Besides, I also generate a data set of size \(800\) with an unbalanced ratio of \(3:5\) as case 4, just for fun!

Case 1:

Exercise 1.4 Case 1.

Case 2:

Exercise 1.4 Case 2.

Case 3:

Exercise 1.4 Case 3.

Case 4:

Exercise 1.4 Case 4.

In conclusion,

with a small data set, the hypothesis \(g\) may deviate noticeably from the target function \(f\), and the algorithm needs fewer rounds to converge.

On the other hand, with a large enough data set, the hypothesis \(g\) stays close to the target function \(f\), but the algorithm takes more rounds to converge.

Exercise 1.5

Which of the following problems are more suited for the learning approach and which are more suited for the design approach?
(a) Determining the age at which a particular medical test should be performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection

Answer:

(a) Learning approach
(b) Design approach
(c) Learning approach
(d) Design approach
(e) Design approach

Exercise 1.6

For each of the following tasks, identify which type of learning is involved (supervised, reinforcement, or unsupervised) and the training data to be used. If a task can fit more than one type, explain how and describe the training data for each type.
(a) Recommending a book to a user in an online bookstore
(b) Playing tic tac toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: Deciding the maximum allowed debt for each bank customer

Answer:

(a) Supervised learning (online learning). Training data: the properties of the books and the user's interests.
(b) Reinforcement learning or unsupervised learning. Training data: the rules of tic-tac-toe and records of games played.
(c) Supervised learning or unsupervised learning. Training data: the properties of the movies together with type labels; with unsupervised learning, no type labels are needed.
(d) Reinforcement learning or unsupervised learning. Training data: how to play music, and music scores.
(e) Supervised learning (active learning). Training data: the relation between the maximum allowed debt and the customers.

Exercise 1.7

For each of the following learning scenarios in the above problem, evaluate the performance of \(g\) on the three points in \(\mathcal X\) outside \(\mathcal D\). To measure the performance, compute how many of the \(8\) possible target functions agree with \(g\) on all three points, on two of them, on one of them, and on none of them.

(a) \(\mathcal H\) has only two hypotheses, one that always returns ' \(•\)' and one that always returns '\(o\)'. The learning algorithm picks the hypothesis that matches the data set the most.
(b) The same \(\mathcal H\), but the learning algorithm now picks the hypothesis that matches the data set the least.
(c) \(\mathcal H = \{\text{XOR}\}\) (only one hypothesis which is always picked), where \(\text{XOR}\) is defined by \(\text{XOR}(x) = •\) if the number of \(1\)'s in \(x\) is odd and \(\text{XOR}(x) = o\) if the number is even.
(d) \(\mathcal H\) contains all possible hypotheses (all Boolean functions on three variables), and the learning algorithm picks the hypothesis that agrees with all training examples, but otherwise disagrees the most with the \(\text{XOR}\).

Figure for Exercise 1.7

Answer:

(a)

If \(g\) is the hypothesis that always returns '\(•\)': 1 target function agrees with \(g\) on all three points (\(f_8\)); 3 agree with \(g\) on two of them (\(f_4\), \(f_6\), \(f_7\)); 3 agree with \(g\) on one of them (\(f_2\), \(f_3\), \(f_5\)); 1 agrees with \(g\) on none of them (\(f_1\)).

If \(g\) is the hypothesis that always returns '\(o\)': 1 target function agrees with \(g\) on all three points (\(f_1\)); 3 agree with \(g\) on two of them (\(f_2\), \(f_3\), \(f_5\)); 3 agree with \(g\) on one of them (\(f_4\), \(f_6\), \(f_7\)); 1 agrees with \(g\) on none of them (\(f_8\)).

(b)

Since the algorithm now picks the hypothesis that matches the data set the least, the chosen \(g\) is the opposite of the one in part (a), so the counts are swapped:

If the data set favours '\(•\)' (so the algorithm picks the hypothesis that always returns '\(o\)'): 1 target function agrees with \(g\) on all three points (\(f_1\)); 3 agree with \(g\) on two of them (\(f_2\), \(f_3\), \(f_5\)); 3 agree with \(g\) on one of them (\(f_4\), \(f_6\), \(f_7\)); 1 agrees with \(g\) on none of them (\(f_8\)).

If the data set favours '\(o\)' (so the algorithm picks the hypothesis that always returns '\(•\)'): 1 target function agrees with \(g\) on all three points (\(f_8\)); 3 agree with \(g\) on two of them (\(f_4\), \(f_6\), \(f_7\)); 3 agree with \(g\) on one of them (\(f_2\), \(f_3\), \(f_5\)); 1 agrees with \(g\) on none of them (\(f_1\)).

(c)

According to the XOR hypothesis, \(g\) on the three points outside \(\mathcal D\) is:

\(g(101) = o\) (two 1's, even)

\(g(110) = o\) (two 1's, even)

\(g(111) = •\) (three 1's, odd)

Thus, 1 target function agrees with \(g\) on all three points (\(f_2\)); 3 agree with \(g\) on two of them (\(f_1\), \(f_4\), \(f_6\)); 3 agree with \(g\) on one of them (\(f_3\), \(f_5\), \(f_8\)); 1 agrees with \(g\) on none of them (\(f_7\)).

(d)

Since the algorithm picks the hypothesis that agrees with all training examples but otherwise disagrees the most with \(\text{XOR}\), \(g\) on the three points outside \(\mathcal D\) is:

\(g(101) = •\)

\(g(110) = •\)

\(g(111) = o\)

Thus, 1 target function agrees with \(g\) on all three points (\(f_7\)); 3 agree with \(g\) on two of them (\(f_3\), \(f_5\), \(f_8\)); 3 agree with \(g\) on one of them (\(f_1\), \(f_4\), \(f_6\)); 1 agrees with \(g\) on none of them (\(f_2\)).
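
A small enumeration can verify the counts in parts (c) and (d). The ordering of \(f_1, \dots, f_8\) below (all-'o' first, all-'•' last) is my assumption about how the book's figure lists them; the counts themselves do not depend on that ordering.

from itertools import product
from collections import Counter

# all 8 possible target functions on the three points 101, 110, 111
targets = list(product('o•', repeat=3))   # f_1 = (o,o,o), ..., f_8 = (•,•,•)

def agreement_counts(g):
    # for each target f, count on how many of the three points it agrees with g,
    # then tally how many targets fall into each agreement level (3, 2, 1, 0)
    return Counter(sum(a == b for a, b in zip(f, g)) for f in targets)

print(agreement_counts(('o', 'o', '•')))  # part (c): 1 target agrees on 3 points, 3 on 2, 3 on 1, 1 on 0
print(agreement_counts(('•', '•', 'o')))  # part (d): the same distribution of counts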

Exercise 1.8

If \(μ = 0.9\), what is the probability that a sample of 10 marbles will have \(\nu \le 0.1\)? [Hints: 1. Use binomial distribution. 2. The answer is a very small number.]

Answer:

\(\mu = P(\text{red}) = 0.9\), and the sample size is \(N = 10\).

The event \(\nu \le 0.1\) means that at most one of the 10 marbles in the sample is red, so

\(P(\nu \le 0.1) = P(\text{0 red}) + P(\text{1 red})\)

\(P(\text{0 red}) = 0.1^{10}\)

\(P(\text{1 red}) = C^1_{10} \times 0.9 \times 0.1^{9}\)

Thus,

\(P(\nu \le 0.1) = 0.1^{10} + C^1_{10} \times 0.9 \times 0.1^{9}\)

\(= 0.1^{9} \times (0.1 + 9)\)

\(= 9.1 \times 10^{-9}\)
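
As a quick sanity check (a sketch assuming scipy is available), the same number is the binomial CDF at 1 success out of 10 trials with success probability 0.9:

from scipy.stats import binom

# P(at most 1 red marble among 10 draws) with P(red) = 0.9
print(binom.cdf(1, n=10, p=0.9))   # ~9.1e-09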

Exercise 1.9

If \(μ = 0.9\), use the Hoeffding Inequality to bound the probability that a sample of 10 marbles will have \(\nu \le 0.1\) and compare the answer to the previous exercise.

Answer:

Since

\(\nu \le 0.1\) and \(\mu = 0.9\),

we know that

\(|\nu - \mu| \ge 0.8\), so we can take \(\epsilon = 0.8\).

Thus,

\(\mathbb{P}[|\nu - \mu| > \epsilon]\le 2e^{-2\epsilon^{2}N} = 2e^{-2 \times 0.8^2 \times 10} \approx 5.5215 \times 10^{-6}\)

In conclusion,

the bound obtained from the Hoeffding Inequality (\(\approx 5.52 \times 10^{-6}\)) is much larger than the exact probability calculated in Exercise 1.8 (\(9.1 \times 10^{-9}\)). This is expected: the Hoeffding Inequality is a general, distribution-free bound that must hold for any \(\mu\), so it is looser than the exact binomial calculation.
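
A short sketch comparing the two numbers (again assuming scipy is available):

import math
from scipy.stats import binom

N, eps = 10, 0.8
hoeffding = 2 * math.exp(-2 * eps**2 * N)    # ~5.52e-06
exact = binom.cdf(1, n=10, p=0.9)            # ~9.10e-09
print(hoeffding, exact, hoeffding >= exact)  # the bound indeed covers the exact probability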

