Nothing but NumPy: Understanding & Creating Binary Classification Neural Networks with…

Nothing but Numpy is a continuation of my neural network series. To view the previous blog in this series or for a refresher on neural networks you may click here.

This post continues from Understanding and Creating Neural Networks with Computational Graphs from Scratch.

It’s easy to feel lost when you have twenty browser tabs open trying to understand a complex concept and most of the writeups you come across regurgitate the same shallow explanations. In this second installment of Nothing but NumPy, I’ll again strive to give the reader a deeper understanding of neural networks as we delve deeper into a specific kind of neural network called a “Binary Classification Neural Network”. If you’ve read my previous post then this will seem very familiar.

Understanding “Binary Classification” will help us lay down major concepts that help us understand many of the choices we make in multi-classification, which is why this post will also serve as a prelude to “Understanding & Creating Softmax Layer with Computational Graphs from Scratch”.

This blog post is divided into two parts, the first part will be understanding the basics of a Binary Classification Neural Network and the second part will comprise the code for implementing everything learned from the first part.

Binary classification is a common machine learning task. It involves predicting whether a given example is part of one class or the other. The two classes can be arbitrarily assigned either a “0” or a “1” for mathematical representation, but more commonly the object/class of interest is assigned a “1” (positive label) and the rest a “0” (negative label). For example:

  • Is the given picture of a cat(1) or not-a-cat(0)?
  • Given a patient’s test results, is the tumor benign(0; harmless) or malignant(1; harmful)?
  • Given a person’s information (eg. age, education level, marital status, etc) as features, predict whether they make less than $50K(0) or more than $50K(1) a year.
  • Is the given email spam(1) or not-spam(0)?

In all the examples above the object/class of interest is assigned a positive label(1).

Most of the time it will be fairly obvious whether a given machine learning problem requires binary classification or not. A general rule of thumb is that binary classification helps us answer yes(1)/no(0) questions.

Now let’s build a simple 1-layer neural network(input and output layers only) and hand solve it to get a better picture. (We’ll make a neural network the same as the one elaborated in my previous post, but with one key difference: the output of the neural network is interpreted as a probability instead of a raw value.)

[Figure: the simple 1-layer binary classification neural network]

Let’s expand this neural network out to reveal its intricacies.

[Figure: the same network expanded to show the linear node, bias and Sigmoid node]

For those not familiar with all the different parts of a neural network I’ll go over each of them briefly. (A more detailed explanation is provided in my previous post)

[Figure: the expanded neural network with its parts labeled]
  • Inputs: x₁ and x₂ are the input nodes for two features that represent an example we want our neural network to learn from. Since input nodes form the first layer of the network they are collectively referred to as the “input layer”.
  • Weights: w₁ & w₂ represent the weight values that we associate with the inputs x₁ & x₂, respectively. Weights control the influence each input has in the calculation of the next node. A neural network “learns” these weights to make accurate predictions. Initially, weights are randomly assigned.
  • Linear Node(z): The “z” node creates a linear function out of all the inputs coming into it i.e z = w₁x₁+w₂ x₂+b
  • Bias: b represents the bias node. The bias node inserts an additive quantity into the linear function node(z). As the name suggests the bias sways the output so that it may better align with our desired output. The value of the bias is initialized to b=0 and is also learned during the training phase.
  • Sigmoid Node: This σ node, called the Sigmoid node, takes the input from a preceding linear node(z) and passes it through the following activation function, called the Sigmoid function(because of its S-shaped curve), also known as the Logistic function:
[Figure: the Sigmoid/Logistic function, σ(z) = 1/(1 + e⁻ᶻ)]

Sigmoid is one of the many “activation functions” used in neural networks. Activation functions are non-linear functions(not simple straight lines). They add non-linearity to a neural network by expanding its dimensionality, in turn, helping it learn complex things(for more details please refer to my previous post). Since it is the last node in our neural network, it is the output of the neural network and is, therefore, called the “output layer”.
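To make this concrete, here is a minimal NumPy sketch of the Sigmoid function (the function name and sample inputs are my own, just for illustration):

import numpy as np

def sigmoid(z):
    """Sigmoid/Logistic activation: squashes any real input into the open range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# large negative inputs land near 0, zero lands exactly on 0.5, large positive inputs land near 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # approximately [0.0000454, 0.5, 0.9999546]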

A linear node(z) combined with a bias node(b) and an activation node, such as the sigmoid node(σ), forms a “neuron” in an artificial neural network.

[Figure: a linear node, bias and Sigmoid activation together forming a single neuron]

In neural network literature, every neuron in an artificial neural network is assumed to have a linear node along with its corresponding bias, hence the linear node and bias nodes are not shown in neural network diagrams, as in Fig.1. To get a deeper understanding of the computations in a neural network I will continue to show expanded versions of neural networks in this blog post, as in Fig.2.

The use of a single Sigmoid/Logistic neuron in the output layer is the mainstay of a binary classification neural network. This is because the output of a Sigmoid/Logistic function can be conveniently interpreted as the estimated probability(p̂, pronounced p-hat) that the given input belongs to the “positive” class. How? Let’s delve a bit deeper.

The Sigmoid function squashes any input into the output range 0<σ<1. So, for example, if we were creating a neural network-based “cat(1) vs. not-cat(0)” detector, given images as input examples, our output layer will still be a single Sigmoid neuron, converting all the calculations from previous layers into p̂, a simple 0–1 output range.

[Figure: a cat vs. not-cat network whose output layer is a single Sigmoid neuron producing p̂]

We can then simply interpret p̂ as “What is the probability that the given input image is of a cat?”, where “cat” is the positive label. If p̂≈0, then it is highly unlikely that the input image is of a cat; on the other hand, if p̂≈1 then it is very likely that the input image is of a cat. Simply put, p̂ is how confident our neural network model is in predicting that the input is a cat i.e the positive class(1).

This can be mathematically summarized simply as a conditional probability:

[Figure: P(1 | x₁, x₂; w, b) = p̂ and P(0 | x₁, x₂; w, b) = 1 − p̂]

Since every binary classification neural net architecture has a single Sigmoid neuron in the output layer, as shown in Fig.6 above, the output of the Sigmoid (estimated probability) depends on the output of the linear node(z) associated with the neuron. If the value of the linear node(z) is :

[Figure: the Sigmoid curve: σ(z) > 0.5 when z > 0, σ(z) < 0.5 when z < 0, and σ(0) = 0.5]
  1. Greater than zero(z>0) then the output of the Sigmoid node is greater than 0.5(σ(z)>0.5), which can be interpreted as “The probability that the input image is of a cat is greater than 50%”.
  2. Less than zero(z<0) then the output of the Sigmoid node is less than 0.5(σ(z)<0.5), which can be interpreted as “The probability that the input image is of a cat is less than 50%”.
  3. Equal to zero(z=0) then the output of the Sigmoid node equals 0.5(σ(z)=0.5), which means that “The probability that the input image is of a cat is exactly 50%”.

Now that we know what everything represents in our neural network let’s see what calculations our binary classification neural network performs given the following data set:

[Figure: the AND-gate dataset and its plot on a 2-D plane]

The data above represents the AND logic gate, where the output is given a positive label(1) only when both the inputs are x₁=1 and x₂=1, all other cases are assigned a negative label(0). Each row of the data represents an example we want our neural network to learn from and then classify. I have also plotted the points on a 2-D plane so that it is easy to visualize(red dots represent points where the class(y) is 0 and the green cross represents the point where the class is 1). This data set also happens to be linearly separable i.e. we can draw a straight line to separate the positive labeled examples from the negative ones.

[Figure: a straight-line decision boundary separating the positive and negative examples]

The blue line shown above, called the decision boundary, separates our two classes. Above the line is our positive labeled example(green cross) and below the line are our negative labeled examples(red crosses). Behind the scenes, this blue line is formed by the z(linear function) node. We’ll later see how the neural network learns this decision boundary.

Like my previous blog post, first, we will perform Stochastic Gradient Descent, which is training a neural network using just one example from our training data. Then we’ll generalize our learnings from the stochastic process to Batch Gradient Descent(preferred method) where we train the neural network using all the examples in the training data.

Computations in a neural network move from left to right, this is called forward propagation. Let’s go through all the forward computations our neural network will perform when provided with just the first training example x₁ = 0 and x₂ = 0. Also, we’ll randomly initialize the weights to w₁=0.1 and w₂=0.6 and bias to b=0.

[Figure: forward propagation on the first example (x₁ = 0, x₂ = 0), giving p̂ = 0.5]

So, the prediction of the neural network is p̂=0.5. Recall, this is a binary classification neural network, so here p̂ represents the estimated probability that the input example, with features x₁=0 & x₂=0, belongs to the positive class(1). Our neural network currently thinks that there is a 0.5(or 50%) chance that the first training example belongs to the positive class (recall from the probability equation this equates to P(1∣ x₁, x₂; w,b)=p̂=0.5).
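As a sanity check, here is the same forward pass in a few lines of NumPy (a rough sketch; the variable names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# first AND-gate example and the randomly initialized parameters from above
x1, x2 = 0.0, 0.0
w1, w2, b = 0.1, 0.6, 0.0

z = w1 * x1 + w2 * x2 + b   # linear node: 0.1*0 + 0.6*0 + 0 = 0
p_hat = sigmoid(z)          # sigmoid node: sigma(0) = 0.5
print(z, p_hat)             # 0.0 0.5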

Yikes! This is kinda poor 😕, especially since the negative label is associated with the first example, i.e y=0. The estimated probability should be around p̂≈0; it should be very unlikely that the first example belongs to the positive class, so that, the chance of belonging to the negative class is high (i.e P(0∣ x₁, x₂; w,b)≈1-p̂≈1).

If you’ve read my previous post then you know that at this point we need a Loss function to help us out. So, what Loss function should we use to tell a binary classification neural network to correct its estimated probability? In comes the Binary Cross-Entropy Loss Function to our rescue.

Binary Cross-Entropy Loss Function

Note: in most programming languages “log” is the natural logarithm(log with base-e), denoted in mathematics as “ln”. For consistency between code and equations consider “log” as natural logarithm and not as “log₁₀”(log with base-10).

The Binary Cross-Entropy(BCE) Loss function is defined as follows :

[Figure: Loss = −(y·log(p̂) + (1 − y)·log(1 − p̂))]

All Loss functions essentially tell us how far our predicted output is from our desired output, for one example only. Simply put, a Loss function computes the error between prediction and actual value. Keeping that in view, the Binary Cross-Entropy(BCE) Loss function computes a different Loss when the associated label of a training example is y=1(positive) and a different Loss when the label is y=0(negative). Let’s see:

[Figure: the piecewise form: Loss = −log(p̂) when y = 1, and Loss = −log(1 − p̂) when y = 0]

Now it’s apparent that the BCE Loss function in Fig.12 is just an elegantly compressed version of the piecewise equation.
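As a sketch (names are mine), the BCE Loss is a one-liner in NumPy, and plugging in a few values shows the behaviour described above:

import numpy as np

def bce_loss(y, p_hat):
    """Binary Cross-Entropy Loss for a single example; equivalent to the piecewise form:
    -log(p_hat) when y = 1, and -log(1 - p_hat) when y = 0."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(bce_loss(1, 0.99))   # ~0.010  confident and correct -> tiny Loss
print(bce_loss(1, 0.01))   # ~4.605  confident and wrong   -> large Loss
print(bce_loss(0, 0.50))   # ~0.693  unsure                -> moderate Loss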

Let’s plot the above piecewise function to visualize what’s going on underneath.

[Figure: plots of the BCE Loss against p̂ for y = 1 and y = 0]

So, the BCE Loss function captures the intuition that the neural network should pay a high penalty(Loss→∞) when the estimated probability, with respect to the training example’s label, is completely wrong. On the other hand, the Loss should equal zero(Loss=0) when the estimated probability, with respect to the training example’s label, is correct. Simply put, the BCE Loss should equal zero in only two instances:

  1. if the example is positively labeled(y=1) the neural network model should be completely sure that the example belongs to the positive class i.e p̂=1.
  2. if the example is negatively labeled(y=0) the neural network model should be completely sure that the example does not belong to the positive class i.e p̂=0.

In neural networks, the gradient/derivative of the Loss function dictates whether to increase or decrease the weights and bias of a neural network. So let’s see what the derivative of the Binary Cross-Entropy(BCE) Loss function looks like:

[Figure: ∂Loss/∂p̂ = −y/p̂ + (1 − y)/(1 − p̂)]

We can also split the derivative into a piecewise function and visualize its effects:

[Figure: plots of the BCE Loss derivative for y = 1 and y = 0]

A positive derivative means we should decrease the weights, while a negative derivative means we should increase them. The steeper the slope(gradient), the more incorrect the prediction was. Let’s take a moment to make sure we understand this statement:

  • If the gradient is negative that would mean we are looking at the first Loss curve, where the actual label for the example is positive(y=1). The only way to drive Loss to zero would be to move in the opposite direction of the slope(gradient), from negative to positive. Therefore, we need to increase the weights and bias so that z = w₁x₁+w₂x₂+b > 0 (recall Fig.8) and in turn estimated probability of belonging to the positive class is p̂≈σ(z)≈1.
  • Similarly, when the gradient is positive we are looking at the second Loss curve where the actual label for the example is negative(y=0). The only way to drive the Loss to zero would again be to move in the opposite direction of the slope(gradient), this time from positive to negative. In this instance, we would need to decrease the weights and bias so that z = w₁x₁+w₂x₂+b < 0 and consequently estimated probability of belonging to the positive class p̂≈σ(z)≈0.

The explanation provided for BCE Loss up till now is sufficient for all intents and purposes, but the curious among you might be wondering where did this Loss function even come from and why not just use the Mean Squared Error Loss function like in the previous post? More on this later.

Now that we know the purpose of a Loss function and how the Binary Cross-Entropy Loss function works let’s calculate the BCE Loss on our current example(x₁ = 0 and x₂ = 0), for which our neural network estimated that the probability for belonging to the positive class is p̂=0.5 while its label(y) is y=0:

[Figure: Loss = −(0·log(0.5) + (1 − 0)·log(1 − 0.5)) = −log(0.5) ≈ 0.693]

The Loss is about 0.693(rounded to 3 decimal places). We can now use the derivative of the BCE Loss function to check if we need to increase or decrease the weights and bias, using the process called backpropagation; it is the opposite of forward propagation: we track backward from the output to the input. Backpropagation allows us to figure out how much of the Loss each part of the neural network was responsible for, so we can then adjust those parts of the neural network accordingly.

As shown in my previous post, we’ll employ the following graph technique for propagating the gradients back from the output layer to the input layer of the neural network:

[Figure: propagating gradients backward through the computational graph by multiplying upstream and local gradients]

At each node, we only have our local gradient computed(partial derivatives of that node). Then during backpropagation, as we are receiving numerical values of gradients from upstream, we multiply upstream gradients with local gradients and pass them on to their respective connected nodes. This is a generalization of the chain rule from calculus.

Let’s go over backpropagation step by step:

[Figure: the first backpropagation step, from the Loss into the network]

For the next calculation, we’ll need the derivative of the Sigmoid function, which forms the local gradient at the red node. The derivative of the Sigmoid function is(derived in detail in my previous post):

[Figure: the derivative of the Sigmoid function, dσ/dz = σ(z)(1 − σ(z))]

Now let’s use the derivative of the Sigmoid node and backpropagate the gradient further:

[Figure: backpropagating the gradient through the Sigmoid node]

Gradients should not propagate back to the input nodes( i.e red arrows should not travel towards the green nodes) as we do not want to change our input data, we only intend to change the weights associated with them.

[Figure: backpropagating the gradients to the weights and bias, but not to the input nodes]

Finally, we can update the parameters(weights and bias) of the neural network by performing gradient descent.

Gradient Descent

Gradient descent is adjusting the parameters of the neural network by moving in the negative direction of the gradient i.e away from a sloping region to a flatter region.

The general equation for gradient descent is:

[Figure: the gradient descent update: parameter = parameter − α·(∂Loss/∂parameter)]

The Learning Rate, α(pronounced alpha), is used to control the step size down the Loss curve(Fig. 21). The learning rate is a hyper-parameter of the neural network, which means it can’t be learned through the backpropagation of gradients and must be set by the creator of the neural network, ideally after some experimentation. For more information about the effects of the learning rate, you may refer to my previous post.

Notice that the gradient descent steps (blue arrows) keep getting smaller and smaller, that’s because as we move away from the sloping region to a flatter region, near the minimum point, the magnitude of the gradient also decreases resulting in progressively smaller steps.

We’ll set the learning rate(α) to α=1.
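If I’ve followed the figures correctly, only the bias receives a non-zero gradient for this example (∂Loss/∂b = 0.5, since x₁ = x₂ = 0 zero out the weight gradients), so the update step looks roughly like this sketch:

alpha = 1.0                      # learning rate

# gradients from the backpropagation above (weight gradients are zero because x1 = x2 = 0)
dw1, dw2, db = 0.0, 0.0, 0.5

# gradient descent: move every parameter against its gradient
w1 = 0.1 - alpha * dw1           # stays 0.1
w2 = 0.6 - alpha * dw2           # stays 0.6
b  = 0.0 - alpha * db            # becomes -0.5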

[Figure: updating the weights and bias with α = 1]

Now that we have updated the weights and bias(actually we were only able to update our bias in this training iteration) let’s do a forward propagation on the same example and calculate the new Loss to check if we’ve done the right thing.

[Figures: forward propagation with the updated parameters and the new, lower Loss]

Now the estimated probability of the 1ˢᵗ example belonging to the positive class(p̂) is down from 0.5 to approximately 0.378(rounded to 3 d.p) and consequently, the BCE Loss has reduced a bit, too, down from 0.693 to around 0.475(to 3 d.p).

Up till now, we have performed stochastic gradient descent. We have used only one example(x₁=0 and x₂=0), from our AND gate dataset of four examples, to perform a single training iteration(each training iteration is forward propagation, calculating Loss, followed by backward propagation and updating the weights through gradient descent).

We can continue on this path of updating the weights just by learning from one example at a time, but ideally, we’d like to learn from multiple examples at a time and reduce our Loss across all of them.

In batch gradient descent(also called full batch gradient descent) we use all the training examples in a dataset during each training iteration. (If batch gradient descent is not possible for some reason, e.g. size of all the training data is too big to fit into RAM or GPU, we may use a subset of the dataset in each training iteration, this is called mini-batch gradient descent.)

A batch is just a vector/matrix full of training examples.

Before we proceed with processing multiple examples we need to define a Cost function.

Binary Cross Entropy Cost Function

For batch gradient descent we need to adjust the Binary Cross-Entropy(BCE) Loss function to accommodate not just one example but all the examples in a batch. This adjusted Loss function is called the Cost function(also represented by the letter J in neural network literature and sometimes also called the objective function).

Instead of calculating the Loss on one example, the Cost function calculates average Loss across ALL the examples in the batch.

[Figure: the BCE Cost: the average of the BCE Loss over all m examples in the batch]

When performing batch gradient descent(or mini-batch gradient descent) we take the derivative with respect to the Cost function instead of the Loss function. So next, we’ll see how to take the derivative of the Binary Cross-Entropy Cost function, using a simple example and then generalizing from there.

The derivative of Binary Cross-Entropy Cost function

In vectorized form our BCE Cost function looks as follows:

[Figures: the vectorized BCE Cost and a worked example on a batch of two examples]

As expected the Cost is just the average of the Loss of the two examples, but all our calculations are vectorized, allowing us to compute the Binary Cross-Entropy Cost for a batch in one go. We prefer to use vectorized computations in neural networks as computer hardware(CPU and GPU) is better suited to batch computations in vectorized form. (Note: if we had just one example in the batch the BCE Cost would simply be calculating the BCE Loss, just like the stochastic gradient descent example we went through earlier)
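A minimal NumPy sketch of this Cost function (assuming, as in the figures, that examples are stored as columns so Y and P_hat have shape (1, m); the demo numbers are made up):

import numpy as np

def bce_cost(Y, P_hat):
    """Binary Cross-Entropy Cost: the average BCE Loss over all m examples in the batch."""
    m = Y.shape[1]
    return -(1.0 / m) * np.sum(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat))

Y = np.array([[1, 0]])             # labels of a toy 2-example batch
P_hat = np.array([[0.9, 0.3]])     # the network's estimated probabilities for those examples
print(bce_cost(Y, P_hat))          # (-log(0.9) - log(0.7)) / 2 ≈ 0.231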

Next, let’s derive the partial derivatives of this vectorized Cost function.

[Figure: deriving the partial derivatives of the vectorized Cost function]

From this, we can generalize the partial derivative of the Binary Cross-Entropy Cost function.

[Figure: the generalized partial derivative of the BCE Cost function]

A very important consequence of the Cost function is that since it calculates the average Loss across a batch of examples it also calculates the average of the gradient across the batch of examples, this helps in figuring out a less noisy general direction in which Loss across all examples decreases. In contrast, stochastic gradient descent(batch with only one example) gives a very noisy estimate of the gradients because it uses only one example per training iteration to guide gradient descent.

For vectorized(batched) computations, we need to adjust the linear node(z) of the neural network so that it accepts vectorized inputs, and, for the same reason, use the Cost function instead of the Loss function.

[Figure: the linear node in vectorized form, Z = W·X + b]

The Z node now computes the dot-product between an appropriately sized weight matrix(W) and the training data(X). The output of the Z node is now also a vector/matrix.

Now we can set up our data(X, W, b & Y) for vectorized computation.

[Figure: setting up X, W, b and Y for vectorized computation]

We are now finally ready to perform forward and backward propagation using Xₜᵣₐᵢₙ, Yₜᵣₐᵢₙ, W, and b.
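Here is roughly what that setup and the batched forward pass look like in NumPy (I’m assuming, as the figures suggest, that each example is a column of Xₜᵣₐᵢₙ and that W starts at the same values we used above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# AND-gate data: each column of X_train is one example (x1, x2); Y_train holds the labels
X_train = np.array([[0, 0, 1, 1],
                    [0, 1, 0, 1]])
Y_train = np.array([[0, 0, 0, 1]])

W = np.array([[0.1, 0.6]])   # weights, shape (1, 2)
b = np.zeros((1, 1))         # bias, shape (1, 1)

# vectorized forward propagation: every example processed in one go
Z = np.dot(W, X_train) + b   # shape (1, 4)
P_hat = sigmoid(Z)           # estimated probabilities, one per example
print(np.round(P_hat, 3))    # [[0.5   0.646 0.525 0.668]]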

(NOTE: All the results below are rounded to 3 decimal points, just for brevity)

[Figures: vectorized forward propagation on the whole AND-gate batch]

Through vectorized computations, we have performed forward propagation; calculating all the estimated probabilities for every example in the batch in one go.

Now we can calculate the BCE Cost on these output estimated probabilities(P̂ ). (Below, for legibility, I have highlighted the portions of the Cost function that are calculating the Loss on positive examples in blue and the negative examples in red)

[Figure: calculating the BCE Cost on the batch of estimated probabilities]

So, the Cost with our current weights and bias is approximately 0.720. Our goal now is to reduce this Cost using backpropagation and gradient descent. Let’s go through backpropagation step-by-step.

[Figures: step-by-step vectorized backpropagation from the Cost back to the weights and bias]

And just like that, we have computed all the gradients with respect to the Cost function in one go for our entire batch of training examples, using vectorized computations. We can now perform gradient descent to update the weights and bias.

(For those confused with how ∂Cost/∂W and ∂Cost/∂b are being calculated in the last backpropagation step please refer to my previous blog where I break down this computation, more specifically why derivatives of dot products result in transposed matrices)
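Continuing the snippet above, the whole backward pass can be sketched in a few vectorized lines (again my own names; the 1/m appears in dP_hat because the Cost averages the Loss over the batch):

m = Y_train.shape[1]   # number of examples in the batch

# derivative of the BCE Cost w.r.t. the estimated probabilities
dP_hat = (1.0 / m) * (-(Y_train / P_hat) + (1 - Y_train) / (1 - P_hat))

# backprop through the Sigmoid node: multiply by its local gradient P_hat * (1 - P_hat)
dZ = dP_hat * P_hat * (1 - P_hat)

# backprop through the linear node Z = W.X + b
dW = np.dot(dZ, X_train.T)               # shape (1, 2), same as W
db = np.sum(dZ, axis=1, keepdims=True)   # shape (1, 1), same as b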

[Figure: the gradient descent update of W and b]

To check if we have done the right thing we can use the new weights and bias to perform another forward propagation and calculate the new Cost.

[Figures: forward propagation with the updated parameters and the new Cost]

With one training iteration, we have reduced the Binary Cross Entropy Cost from 0.720 to around 0.618. We will need to perform multiple training iterations before we can converge to good weight and bias values that result in an overall low BCE Cost.

At this point, if you’d like to give it a go and perform the next backpropagation step yourself, as an exercise, here are the approximate gradients of Cost w.r.t weights(W) and bias(b) you should get(rounded to 3 d.p):

  • ∂Cost/∂W = [-0.002, 0.027]
  • ∂Cost/∂b =[0.239]

After about 5000 Epochs (an epoch is complete when the neural net goes through all the training examples in a training iteration) the Cost steadily decreases to about 0.003, our weights settle to around W = [10.678, 10.678], bias resolves to around b = [-16.186]. We see by the Cost Curve below that the network has converged to a good set of parameters(i.e W & b):

[Figure: the Cost Curve over 5000 epochs of training]

The Cost Curve(or Learning Curve) shows a neural network model’s performance over time. It is the Cost plotted after every few training iterations(or epochs). Note how quickly the Cost decreases initially but then asymptotes; recall Fig.21: this is because initially the magnitude of the gradient is high, but as we descend to the flatter region near the minimum Cost the magnitude of the gradient decreases, and further training only slightly improves the neural network parameters.

After the neural net has been trained for 5000 epochs the predicted output probabilities(p̂) on Xₜᵣₐᵢₙ are:

[[9.46258077e-08, 4.05463814e-03, 4.05463814e-03, 9.94323194e-01]]

Let’s break this down:

  1. for x₁=0, x₂=0, the predicted output is p̂ ≈ 9.46×10⁻⁸ ≈ 0.0000000946
  2. for x₁=0, x₂=1, the predicted output is p̂ ≈ 4.05×10⁻³ ≈ 0.00405
  3. for x₁=1, x₂=0, the predicted output is p̂ ≈ 4.05×10⁻³ ≈ 0.00405
  4. for x₁=1, x₂=1, the predicted output is p̂ ≈ 9.94×10⁻¹ ≈ 0.994

Recall that the labels are y = [0, 0, 0, 1]. So, only for the last example is the neural network 99.4% confident that it belongs to the positive class; for the rest, it’s less than 1% confident. Also, remember the probability equations from Fig.7? P(1)=p̂ and P(0)=1-p̂, so the predicted probabilities(p̂) confirm our neural network knows what it is doing 👌.

Now that we know that the neural network’s predicted probabilities are correct we need to define when the predicted class should be 1 and when it should be 0 i.e. classify the examples based on these probabilities. For this, we need to define a classification threshold (also called decision threshold). What’s that? Let’s get into it

Classification Threshold

In binary classification tasks, it is common to classify all the predictions of a neural network to the positive class(1) if the estimated probability(p̂ ) is greater than a certain threshold, and similarly, to the negative class(0) if the estimated probability is below the threshold.

This can be mathematically written as follows:

[Figure: the classification rule mapping p̂ and the threshold to the predicted class ŷ]

The value of the threshold defines how stringent our model is in assigning an input to the positive class. Suppose the threshold is thresh=0; then all the input examples will be assigned to the positive class, i.e. the predicted class(ŷ) will always be ŷ=1. Similarly, if thresh=1 then all the input examples will be assigned to the negative class, i.e. the predicted class(ŷ) will always be ŷ=0. (Recall that the sigmoid activation function asymptotes at either end, so it may come very close to 0 or 1 but will never output exactly 0 or 1.)

The Sigmoid/Logistic function provides a natural threshold value for us. Recall Fig.8 from earlier.

[Figure: the Sigmoid curve and the natural threshold of 0.5 at z = 0]

So, with the natural threshold of 0.5 the classes can be predicted as follows:

[Figure: predicting ŷ using the natural threshold of 0.5]

How do we interpret this? Well, if the neural network is at least 50%(0.5) confident that the input belongs to the positive class(1) then we’ll assign it to the positive class(ŷ=1), otherwise we’ll assign it to the negative class(ŷ=0).
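In code, thresholding the trained network’s outputs is a one-liner (a sketch; whether you use > or >= exactly at the threshold is a matter of convention):

import numpy as np

def classify(P_hat, thresh=0.5):
    """Convert estimated probabilities into hard 0/1 class predictions."""
    return (P_hat >= thresh).astype(int)

# the trained network's outputs on the AND-gate examples, from above
P_hat = np.array([[9.46e-08, 4.05e-03, 4.05e-03, 9.94e-01]])
print(classify(P_hat, thresh=0.5))   # [[0 0 0 1]] -- matches the labels y = [0, 0, 0, 1]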

Recall how we predicted in Fig.10 the neural network could separate the two classes in the AND gate dataset by drawing a line that separates the positive class(green cross) and negative class(red crosses). Well, the location of that line is defined by our threshold value. Let’s see:

[Figure: rewriting the threshold rule as an inequality on the linear node’s output z]

Recall after training our weights and bias converged to around W = [10.678, 10.678] and b = [-16.186], respectively. Let’s plug these into the inequality derived in Fig. 43, above.

[Figure: plugging the trained weights and bias into the inequality]

Further, realize this inequality gives us an equation of a line that separates our two classes:

[Figure: the resulting equation of the decision-boundary line]

This equation of a line marked in Fig.45 forms the Decision Boundary. The Decision Boundary is the line along which the neural network changes its prediction from positive to negative class and vice versa. All points(x₁,x₂) that fall on the line have the estimated probability of exactly 50% i.e p̂=0.5, all points above it have estimated probabilities of greater than 50% i.e p̂ >0.5, and all points that fall below the line have estimated probabilities of less than 50% i.e p̂<0.5.

We can visualize the decision boundary by shading the area green where the neural network predicts the positive class(1) and red where the neural net predicts the negative class(0).

[Figures: the decision boundary and the shaded prediction regions]

In most cases, we can set a threshold value of 0.5 in binary classification problems. So, what’s the take away after going this deep into understanding the threshold value? Should we just set it to 0.5 and forget about it? NO! In some cases, you’d want the threshold value to be high, for example, if you’re creating a cancer detection neural network model you’d want your neural network to be very confident, maybe at least 95%(0.95) or even 99%(0.99), that the patient has cancer, because if they don’t they may have to go through toxic chemotherapy for nothing. On the other hand, a cat-detector neural net model may be set to a low threshold, around 0.5 or so, because even if the neural net misclassifies a cat, it’s just a funny accident, no harm no foul.

Now to drive home the concept of classification threshold let’s visualize its effect on the location of the decision boundary and the resultant accuracy of the neural network model:

[Figures: decision boundaries, shaded prediction regions and point-to-boundary distances for classification thresholds ranging from 0.000000001 to 0.9999]

After training the neural network in the above four figures I have plotted the decision boundary(left), the shaded decision boundary(middle) and the shortest distance of each point from the decision boundary(right) with the classification threshold ranging from 0.000000001 to 0.9999.

The classification threshold is also a hyperparameter of the neural network model which needs to be tuned according to the problem at hand. The classification threshold doesn’t affect the neural network directly(it does not change the weights and bias); it is only used to convert the output probabilities back to binary representations of our classes, i.e. back to 1’s and 0’s.

On a final note, the decision boundary is not the property of the dataset, its shape(straight, curved, etc.) is the result of the weights and bias of the neural network and its location is the result of the value of the classification threshold.

We’ve learned a lot up till now, right?😅 For the most part, we know almost everything about binary classification problems and how to solve them through neural networks. Unfortunately, I’ve got some bad news, our Binary Cross-Entropy Loss function has a serious computational flaw, it is very unstable in its current form😱.

Don’t worry! With some simple maths, we’ll be able to solve this problem

Let’s take another look at the Binary Cross-Entropy(BCE) Loss function:

[Figure: the BCE Loss function and its piecewise form]

Note from the piecewise equation that all the characteristics for the Binary Cross-Entropy Loss function are dependent on the “log” function(recall, “log” here is the natural logarithm).

Let’s plot the log function and visualize its characteristics:

[Figure: the plot of the natural log function]

The log function in Binary Cross-Entropy Loss defines when the neural network pays a high penalty (Loss→∞) and when the neural network is correct (Loss→0). The domain of the log function is 0<x<∞ and its range is unbounded, -∞<log(x)<∞. More importantly, as x gets closer and closer to zero(x → 0) the value of log(x) tends to negative infinity(log(x) → -∞). So, small changes in values near zero have an extreme impact on the result of the Binary Cross-Entropy Loss function. Further, our computers can store numbers only to a certain floating-point precision, and functions that tend to infinity cause numerical overflow in computers (overflow is when a number is too big to be stored in computer memory; underflow is when it is too small). It turns out that the Binary Cross-Entropy function’s strength, the log function, is also its weakness, making it unstable near small values.

This has a dire effect on the calculation of the gradients, too. As the values get closer and closer to zero the gradient tends to approach infinity making the gradient calculations also unstable.

[Figure: the BCE gradient blowing up as p̂ approaches 0 or 1]

Consider the following example:

[Figure: an example where the BCE Loss calculation overflows]

Similarly, when calculating the gradients for the above example:

[Figure: the corresponding unstable gradient calculation]

Now let’s see how we can fix this:

[Figure: rewriting the BCE Loss in terms of z so the log no longer receives values near zero]

We have successfully taken the natural logarithm(log) function out of the danger zone! The range of “1+e⁻ᶻ” is greater than 1 (i.e 1+e⁻ᶻ>1); as a result, the range of the “log” function in the BCE Loss becomes greater than 0 (i.e log(1+e⁻ᶻ)>0). The overall Binary Cross-Entropy function is no longer critically unstable.

We can stop here but let’s go one step further and simplify the Loss function even more:

[Figure: the simplified BCE expression written in terms of log(1 + e⁻ᶻ)]

We’ve significantly simplified the Binary Cross-Entropy(BCE) expression, but there is a problem with it. Can you guess it, looking at the curve for “1+e⁻ᶻ” from Fig.53?

[Figure: the curve of 1 + e⁻ᶻ, which blows up for negative z]

The expression “1+e⁻ᶻ” tends to infinity for negative values (i.e 1+e⁻ᶻ →∞, when z<0)! So, unfortunately, this simplified expression overflows when a negative value is encountered. Let’s try to fix this.

[Figure: the alternative form of the Loss written in terms of log(eᶻ + 1)]

Now with this “eᶻ+1” expression, we have solved the problem of the log function being unstable at negative values. Unfortunately, now we face the opposite problem: the new Binary Cross-Entropy Loss function is unstable for large positive values 😕 because “eᶻ+1” tends to infinity for positive values (i.e eᶻ+1 →∞, when z>0)!

Let’s visualize the two exponential expressions:

[Figure: the curves of the two exponential expressions, 1 + e⁻ᶻ and eᶻ + 1]

We need to somehow combine these two simplified functions(in Fig.54 & 56) into one Binary Cross-Entropy(BCE) function so that the overall Loss function is stable across all values, positive and negative.

[Figure: the stable BCE Loss, combining the two forms piecewise depending on the sign of z]

Let’s confirm that it is doing the right calculation on negative and positive values:

[Figure: checking the stable form on negative and positive values of z]

Take a moment to understand this and try to piece it together with the piecewise stable Binary Cross-Entropy Loss function from Fig.58.

So, with some simple high-school-level math, we have solved the numerical flaw in the basic Binary Cross-Entropy function and created a Stable Binary Cross-Entropy Loss and Cost function.

Note that the previous “unstable” Binary Cross-Entropy Loss function took as inputs the label(y) and the probability from the last sigmoid node(p̂), but the new Stable Binary Cross-Entropy Loss function takes as input the label(y) and the value from the last linear node(z). The same goes for the stable Cost function.
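Putting the derivation together, a sketch of the stable Loss in NumPy looks like the following (this is the same max(z, 0) − z·y + log(1 + e^(−|z|)) trick TensorFlow describes for its sigmoid cross-entropy; the function name is mine):

import numpy as np

def stable_bce_loss(y, z):
    """Numerically stable BCE Loss computed from the linear node's output z (the logit):
         z >= 0:  z - z*y + log(1 + exp(-z))
         z <  0:     -z*y + log(1 + exp(z))
       combined into one expression so the exponential never receives a large positive input."""
    return np.maximum(z, 0) - z * y + np.log(1 + np.exp(-np.abs(z)))

# extreme logits no longer overflow: a wildly wrong prediction just gets a huge (finite) Loss
print(stable_bce_loss(1, np.array([1000.0, 0.0, -1000.0])))   # [0.    0.693 1000.]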

Now that we have a stable BCE Loss function and its corresponding BCE Cost function how do we find the stable gradient of the Binary Cross-Entropy function?

That answer has been in plain sight all along!

Recall the derivative of the Binary Cross Entropy Loss function(Fig.15):

[Figure: the derivative of the BCE Loss, ∂Loss/∂p̂ = −y/p̂ + (1 − y)/(1 − p̂)]

Also recall that during backpropagation this derivative flows into the Sigmoid node and multiplies with the local gradient at the sigmoid node, which is just the derivative of the Sigmoid function(Fig.19.b.):

[Figure: the Sigmoid node’s local gradient, σ(z)(1 − σ(z))]

Some beautiful mathematics takes place as we multiply the two derivatives:

[Figure: multiplying the two derivatives simplifies to ∂Loss/∂z = p̂ − y]

So to calculate the derivative ∂Loss/∂z we don’t even need to calculate the derivative of the Loss function or the derivative of the Sigmoid node instead we can just bypass the Sigmoid node and pass “p̂-y” as the upstream gradient to the last linear node(z)!

[Figure: passing p̂ − y directly to the last linear node, bypassing the Sigmoid node]
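Continuing the batched AND-gate sketch from earlier, the backward pass therefore collapses to a single line (the 1/m is there because the Cost averages over the batch):

# bypass the Sigmoid node entirely: the gradient flowing into the last linear node is just
# the averaged difference between the predictions and the labels
dZ_last = (1.0 / m) * (P_hat - Y_train)

# the linear node's gradients follow exactly as before
dW = np.dot(dZ_last, X_train.T)
db = np.sum(dZ_last, axis=1, keepdims=True)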

This optimization has two great benefits:

  1. We no longer have to use the unstable derivative of the Binary Cross-Entropy function.
  2. We also avoid multiplying with the saturating gradients of the Sigmoid function.

What is a saturating gradient? Recall the Sigmoid function curve

[Figure: the Sigmoid curve flattening out at both ends]

At either end the Sigmoid curve becomes flat. This becomes a huge problem in neural networks when the weights increase or decrease by a large amount such that the output of the associated linear node(z) becomes very big or very small. In these cases, the gradient(i.e the local gradient at the sigmoid node) becomes zero or very close to zero. So, when an incoming upstream gradient is multiplied with a very small or a zero local gradient at the Sigmoid node, not much or none of the upstream gradient value is able to pass through.

[Figure: a saturated Sigmoid gradient blocking the upstream gradient]

On a final note of this section, we could have found the derivative of the stable Binary Cross-Entropy function and reached the same conclusion, but I like the above explanation better as it helps us understand why we can bypass the last sigmoid node when backpropagating gradients in a binary classification neural network. For the sake of completeness, I’ve also derived it below:

[Figure: deriving the derivative of the stable BCE function directly]

Now let’s apply all that we’ve learned to the slightly more complicated XOR gate data, where we need a multilayer neural network(a deep neural network), since a simple straight line from a single-layer neural network won’t cut it(view my previous post for more information on this phenomenon):

[Figure: the XOR-gate dataset, which is not linearly separable]

To classify the data points of the XOR dataset we’ll use the following neural network architecture:

[Figure: the 2-layer neural network architecture used for the XOR data]

A layer in a neural network is any set of nodes at the same depth with tunable weights. The above neural network has two layers with tunable weights: the middle(hidden) layer and the final output layer.

Let’s expand out this 2-layer neural network before we proceed with forward and backward propagation:

[Figure: the 2-layer neural network, expanded]

Now we are ready to perform batch gradient descent, starting with forward propagation:
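As a rough NumPy sketch of what the figures below compute (the hidden-layer size and the random initialization here are my assumptions; use whatever Fig.89 shows):

import numpy as np
rng = np.random.default_rng(48)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data, examples as columns
X_train = np.array([[0, 0, 1, 1],
                    [0, 1, 0, 1]])
Y_train = np.array([[0, 1, 1, 0]])

n_hidden = 3                                      # hidden-layer size -- my assumption
W1 = rng.standard_normal((n_hidden, 2)) * 0.1     # hidden-layer weights, small random values
b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((1, n_hidden)) * 0.1     # output-layer weights
b2 = np.zeros((1, 1))

# forward propagation through both layers
Z1 = np.dot(W1, X_train) + b1
A1 = sigmoid(Z1)               # hidden-layer activations
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)               # output probabilities P_hat, shape (1, 4)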

[Figures: vectorized forward propagation through both layers]

We can now calculate the stable Cost:

[Figure: calculating the stable BCE Cost]

After the calculation of Cost, we can now move on to backpropagation and improving the weights and biases. Recall, we can bypass the last Sigmoid node with our optimization technique.

[Figures: step-by-step backpropagation through both layers, bypassing the last Sigmoid node]

Man, that was a lot!😅 But now we know everything in-depth about a Binary Classification Neural Network. Finally, let’s move on to gradient descent and update our weights.

[Figure: gradient descent updates for both layers’ weights and biases]

At this point, if you would like to perform the next training iteration yourself and further your understanding, the following are the approximate gradients you should get(rounded to 3 d.p):

[Figure: the approximate gradients after the first training iteration]

So, after 5000 epochs the Cost steadily decreases to about 0.0017 and we get the following Learning Curve and Decision Boundary when the classification threshold value is set to 0.5(in the coding section you can play around with the threshold value and see how it affects the decision boundary):

[Figures: the Learning Curve and the XOR decision boundary with the classification threshold at 0.5]

Before I conclude this section I want to answer some remaining questions, that might be bugging you:

1- Isn’t this just Logistic Regression?

Yes, a neural network with just one sigmoid neuron and no hidden layers, as in Fig.1, is logistic regression. A single-sigmoid-neuron neural net/logistic regression can classify simpler datasets that can be separated with just a straight line (like AND gate data). For a complicated dataset(such as XOR) feature engineering needs to be performed, by hand, to make a single-sigmoid-neuron neural net/logistic regression work adequately(explained in the previous post).

A multilayer neural network with multiple hidden layers and multiple neurons is called a deep neural network. A deep neural network can capture much more information about a dataset, than a single neuron, and can make classifications on complex datasets with little to no human intervention, the only caveat is that it needs much more training data than a simpler classification model such as a single-sigmoid-neuron neural net/logistic regression.

Further, the Binary Cross-Entropy Cost function for a single-sigmoid-neuron neural net/logistic regression is convex(u-shaped) with a guaranteed global minimum point. On the other hand, for a deep neural network, the Binary Cross-Entropy Cost function is not guaranteed to have a global minimum; practically this does not have a serious effect on training deep neural nets and research has shown this can be mitigated with more training data.

2- Can we use the raw output probabilities as/is?

Yes, raw probabilities from a neural network can also be used, depending on the type of problem you are trying to solve. For example, you train a binary classification model to predict the probability of a car accident at a junction per day, P(accident ∣ day). Suppose the probability is P(accident ∣ day)=0.08. So in a year at that junction, we can expect:

P(accident ∣ day) × 365 = 0.08 × 365 = 29.2 accidents

3- How to find the optimal classification threshold?

Accuracy is one metric to figure out the classification threshold. We would want a classification threshold that maximizes the accuracy of our model.

Unfortunately, in many real-world cases accuracy alone is a poor metric. This is especially evident in cases where the classes are skewed in a dataset(in simple terms, there are more examples of one class than the other). The AND gate we saw earlier also suffered from this problem; there is only one example of the positive class, the rest are of the negative class. If you go back and look at Fig.47.d., where we set the classification threshold so high(0.9999) that the model predicted the negative class for all our examples, you’ll see that the model’s accuracy is still 75%! This sounds pretty acceptable, but looking at the data it isn’t.

Consider another case where you are training a cancer detection model, but your 1000 patient dataset has only one example of a patient with cancer. Now if the model always outputs a negative class(i.e not-cancer, 0), regardless of input, you’d have a classifier that has a 99.9% accuracy on the dataset!

So, to deal with real-world problems many data scientists use metrics that employ the use of Precision and Recall.

Precision: How many of the positive predictions did the classifier get correct? (True Positives / Total number of Predicted Positives)

Recall: What proportion of positive examples was the classifier able to identify? (True Positives / Total number of Actual Positives)

Both these metrics can be visualized through a 2×2 matrix called the “confusion matrix”:

[Figure: the confusion matrix: true/false positives and true/false negatives]

Tuning the classification threshold is a tug of war between Precision and Recall. If the Precision is high (i.e high classification threshold) Recall will be Low and vice versa. Understanding the Precision vs. Recall trade-off is a topic that is beyond the scope of the post and will be a topic of a future Nothing but Numpy blog.

One common metric that most data scientists employ for tuning classification threshold, which combines both Precision and Recall, is the F1 score.
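For reference, here is a small sketch of how Precision, Recall and the F1 score (the harmonic mean of the two) can be computed from hard 0/1 predictions; the function name is mine:

import numpy as np

def precision_recall_f1(Y_true, Y_pred):
    tp = np.sum((Y_pred == 1) & (Y_true == 1))   # true positives
    fp = np.sum((Y_pred == 1) & (Y_true == 0))   # false positives
    fn = np.sum((Y_pred == 0) & (Y_true == 1))   # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1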

For the sake of brevity, the following questions have been given their own short post and serve as a supplement to our discussion(click/tap on the question to go to its respective post)

This implementation builds upon the code from the previous post(for more details you may review the coding section of the last post or read the documentation in the code).

The code for the Linear Layer class remains the same.

The code for the Sigmoid Layer class also remains the same:

The Binary Cross-Entropy(BCE) Cost function(and its variants) is the main new addition to the code from last time.

First, let’s look at the “unstable” Binary Cross-Entropy Cost function compute_bce_cost(Y, P_hat), which takes as arguments the true labels(Y) and the probabilities from the last Sigmoid layer(P_hat). This simple version of the Cost function returns the unstable version of the Binary Cross-Entropy Cost(cost) and its derivative with respect to the probabilities(dP_hat):
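(The embedded code isn’t reproduced here; the sketch below is what I understand the function to do, with the exact implementation living in the repository.)

import numpy as np

def compute_bce_cost(Y, P_hat):
    """'Unstable' BCE Cost and its derivative w.r.t. the probabilities P_hat (examples as columns)."""
    m = Y.shape[1]
    cost = -(1.0 / m) * np.sum(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat))
    dP_hat = (1.0 / m) * (-(Y / P_hat) + (1 - Y) / (1 - P_hat))
    return cost, dP_hat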

Now, let’s look at the stable version of the Binary Cross-Entropy Cost function compute_stable_bce_cost(Y, Z), which takes as arguments the true labels(Y) and the output from the last Linear layer(Z). This Cost function returns the stable version of the Binary Cross-Entropy Cost(cost), as calculated by TensorFlow, and the derivative with respect to the last linear layer(dZ_last):
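(Again a sketch of what I believe this function computes, using the max(Z, 0) − Z·Y + log(1 + e^(−|Z|)) form derived earlier; check the repository for the reference version.)

import numpy as np

def compute_stable_bce_cost(Y, Z):
    """Stable BCE Cost computed from the last linear layer's output Z, plus the gradient w.r.t. Z
    (the Sigmoid node is bypassed during backprop: dZ_last = (sigmoid(Z) - Y) / m)."""
    m = Y.shape[1]
    cost = (1.0 / m) * np.sum(np.maximum(Z, 0) - Z * Y + np.log(1 + np.exp(-np.abs(Z))))
    dZ_last = (1.0 / m) * (1.0 / (1.0 + np.exp(-Z)) - Y)
    return cost, dZ_last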

Finally, let’s also look at the way Keras implements the Binary Cross-Entropy Cost function. compute_keras_like_bce_cost(Y, P_hat, from_logits=False) takes as arguments the true labels(Y) and either the output from the last Linear layer(Z) or the last Sigmoid layer(P_hat), depending on the optional argument from_logits. If from_logits=False (default) then we assume P_hat contains probabilities that need to be converted to logits for computing the stable cost function. If from_logits=True then we assume P_hat contains the output from the last Linear node(Z) and the stable cost function can be computed directly. This function returns the Cost(cost) and the derivative with respect to the last linear layer(dZ_last).

At this point, you should open up the 1_layer_toy_network_on_Iris_petals notebook from this repository in a separate window and go over this blog and the notebook side-by-side.

We will use the Iris flower dataset, which happens to be one of the first datasets created for statistical analysis. The Iris dataset contains 150 examples of Iris flowers belonging to 3 species — Iris-setosa, Iris-versicolor and, Iris-virginica. Each example has 4 features — petal length, petal width, sepal length, and sepal width.

[Figure: the Iris flower dataset]

For our first Binary Classification neural network, we will create a 1-layer neural network, as in Fig.1, to discriminate between Iris-virginica vs. others, using only petal length and petal width as input features. So let’s build our neural network layers:

Now we can move on to training our neural network:

Notice that we are passing the derivative, dZ1, directly into the Linear layer Z1.backward(dZ1), bypassing the Sigmoid layer, A1, because of the optimization we came up with earlier.
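The training loop itself isn’t shown here, but it looks roughly like the following (the method names are the ones I recall from the previous post’s layer classes and may differ slightly from the notebook; the learning rate is a placeholder):

costs = []
learning_rate = 1   # placeholder value -- use whatever the notebook sets

for epoch in range(5000):
    # forward propagation through the single layer
    Z1.forward(X_train)
    A1.forward(Z1.Z)

    # stable BCE Cost and the gradient w.r.t. the linear layer's output
    cost, dZ1 = compute_stable_bce_cost(Y_train, Z1.Z)

    # backward propagation -- the Sigmoid layer A1 is bypassed, dZ1 goes straight into Z1
    Z1.backward(dZ1)

    # gradient descent update of the layer's weights and bias
    Z1.update_params(learning_rate=learning_rate)

    if epoch % 100 == 0:
        costs.append(cost)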

After running the loop for 5000 epochs, in the notebook, we see that the Cost steadily decreases to about 0.080.

Cost at epoch#4700: 0.08127062969243247
Cost at epoch#4800: 0.08099585868475366
Cost at epoch#4900: 0.08073032792428664
Cost at epoch#4999: 0.08047611054333165

Resulting in the following Learning Curve and Decision Boundary:

[Figures: the Learning Curve and the decision boundary for the Iris model]

Our model’s accuracy on the training data is:

The predicted outputs of first 5 examples: 
[[ 0. 0. 1. 0. 1.]]
The predicted prbabilities of first 5 examples:
[[ 0.012 0.022 0.542 0. 0.719]]
The accuracy of the model is: 96.0%

Check out the other notebooks in the repository. We’ll be building upon the things we learned in this blog in future Nothing but NumPy blogs; therefore, it would behoove you to create the layer classes(if you haven’t before) and the Binary Cross-Entropy Cost functions from memory as an exercise, and try recreating the AND gate example from the first part of this post.

This concludes the blog🙌🎉. Thank you for taking the time out to read this post, I hope you enjoyed.

For any questions feel free to reach out to me on Twitter @RafayAK

Nothing but NumPy: Understanding & Creating Binary Classification Neural Networks with… (115)
Nothing but NumPy: Understanding & Creating Binary Classification Neural Networks with… (2024)

FAQs

How do you create a neural network for binary classification? ›

Building a neural network that performs binary classification involves making two simple changes: Add an activation function – specifically, the sigmoid activation function – to the output layer. Sigmoid reduces the output to a value from 0.0 to 1.0 representing a probability.

Which neural network is used for binary classification? ›

The one-node technique for neural network binary classification is shown in the bottom diagram in Figure 2. Here, male is encoded as 0 and female is encoded as 1 in the training data. The value of the single output node is 0.6493.

What is the best model for binary classification? ›

Common Classification Models
  • Logistic Regression. Even though the word “regression” is in the name, logistic regression is used for binary classification problems (those where the data has only two classes). ...
  • Naive Bayes. ...
  • k-Nearest Neighbor. ...
  • Decision Trees. ...
  • Support Vector Machine. ...
  • Neural Networks.

Can I use SoftMax in binary classification? ›

Sigmoid is used for binary classification methods where we only have 2 classes, while SoftMax applies to multiclass problems. In fact, the SoftMax function is an extension of the Sigmoid function.

Can we use CNN for binary classification? ›

Yes. Binary classification problems can be solved to a fairly high degree of accuracy with neural networks (deep learning models), including the Convolutional Neural Network (CNN), a class of neural network that has proven very effective in image recognition, processing, and classification.

How do you create a neural network in Python? ›

How To Create a Neural Network In Python – With And Without Keras
  1. Import the libraries. ...
  2. Define/create input data. ...
  3. Add weights and bias (if applicable) to input features. ...
  4. Train the network against known, good data in order to find the correct values for the weights and biases.
Jul 12, 2022

Is naive Bayes used for binary classification? ›

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

What is binary classification method? ›

Binary classification is the task of classifying the elements of a set into two groups (each called class) on the basis of a classification rule.

Can I use XGBoost for binary classification? ›

First of all, XGBoost can be used in regression, binary classification, and multi-class classification (One-vs-all).

What regression model would you use for binary classification? ›

Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the probability of a certain class or event. It is used when the data is linearly separable and the outcome is binary or dichotomous in nature. That means Logistic regression is usually used for Binary classification problems.

Can we use decision tree for binary classification? ›

Decision trees are a common model type used for binary classification tasks. The natural structure of a binary tree, which is traversed sequentially by evaluating the truth of each logical statement until the final prediction outcome is reached, lends itself well to predicting a “yes” or “no” target.

Can tanh be used for binary classification? ›

Tanh can be used for binary classification between two classes; when using tanh, remember to label the data accordingly with [-1, 1]. The sigmoid is another logistic-type function like tanh, but its output lies in the range (0, 1) for any real input, which is why it is more commonly paired with 0/1 labels.

Can we use linear regression for binary classification? ›

Linear regression can be pressed into service for classification too: convert the two nominal classes to the numeric values 0 and 1 (in Weka, for example, with the NominalToBinary filter) and apply linear regression. The result is a predicted number for each instance, roughly between 0 and 1, which can be thresholded to pick a class.

Which Neural Network is best for binary classification Why? ›

The use of a single Sigmoid/Logistic neuron in the output layer is the mainstay of a binary classification neural network. This is because the output of a Sigmoid/Logistic function can be conveniently interpreted as the estimated probability(p̂, pronounced p-hat) that the given input belongs to the “positive” class.

Why use CNN for binary classification? ›

A Convolutional Neural Network is often used as the binary classifier because it is capable of feature learning, so hand-crafted features are not required. A typical architecture consists of Convolutional, Rectified Linear Unit (ReLU), Pooling, and Fully Connected layers.

Can logistic regression be used for binary classification? ›

Logistic regression is one of the most popular algorithms for binary classification. Given a set of examples with features, the goal of logistic regression is to output values between 0 and 1, which can be interpreted as the probabilities of each example belonging to a particular class.

What programming language is used for neural networks? ›

Python. Python is the go-to language for machine learning, NLP, and neural networks. It can be used even if you are new to AI development, since it is flexible and comes with pre-existing libraries like Pandas, SciPy, and nltk, and it is lauded for its simple syntax and minimal code.

Why is Python used for neural networks? ›

Python is easy to understand, which makes it easy to create machine learning models. It is intuitive and well suited to collaborative development, and, being a general-purpose language, it allows for fast prototyping and product testing.

Is Python good for neural networks? ›

If you're just starting out in the artificial intelligence (AI) world, then Python is a great language to learn since most of the tools are built using it. Deep learning is a technique used to make predictions using data, and it heavily relies on neural networks.

What type of Naive Bayes model uses a binary data type? ›

Bernoulli Naive Bayes

This is used when the features are binary: instead of word frequencies, the features are 1s and 0s representing the presence or absence of a word (or any other feature). In that case, Bernoulli Naive Bayes is the appropriate variant.

Which machine learning method is used for Naive Bayes classification? ›

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems. It is mainly used in text classification that includes a high-dimensional training dataset.

Is binary classification supervised or unsupervised? ›

Binary classification is typically achieved by supervised learning methods. Nevertheless, it is also possible using unsupervised schemes.

What is binary classification in Python? ›

What is Binary Classification? In machine learning, binary classification is a supervised learning algorithm that categorizes new observations into one of two classes. The following are a few binary classification applications, where the 0 and 1 columns are two possible classes for each observation: Application.

What is an example of binary classification in science? ›

An example is medical diagnosis for a single medical condition (say disease vs. no disease) based on a battery of tests. The action space has two elements (say a1: 'diagnosis of disease' and a0: 'diagnosis of no disease'), and the loss function assigns a cost to each kind of misdiagnosis.

What is the importance of binary classification? ›

Binary classifiers play an important role in virtually every project, so understanding them constitutes a critical part in anyone's professional development in predictive analytics, data science, and data mining.

Can you use SVM for binary classification? ›

Support Vector Machine (SVM) is a classification algorithm based on the linear model. It allows for binary or multi-class classification (applying the one-vs-rest technique). In this article, I will guide you on a full hands-on tutorial to implement the SVM model in both binary and multi-class data.

Is SVM binary classification? ›

SVMs (linear or otherwise) inherently do binary classification. However, there are various procedures for extending them to multiclass problems. The most common methods involve transforming the problem into a set of binary classification problems, by one of two strategies: One vs.

Can gradient boosting be used for binary classification? ›

Yes. A gradient boosting classifier is used when the target column is binary. All the steps used for a gradient boosting regressor apply here as well; the only difference is the loss function.

Is KNN a binary classification? ›

Yes. K-nearest neighbors (KNN) is one of the simplest nonparametric classifiers and is commonly used for binary classification, although in high-dimensional settings its accuracy can suffer from nuisance features; variants such as "K important neighbors" (KIN) have been proposed to address this.

Which type of regression is best for a binary dependent variable? ›

Logistic regression is the statistical technique used to predict the relationship between predictors (our independent variables) and a predicted variable (the dependent variable) where the dependent variable is binary (e.g., sex , response , score , etc…).

What is the difference between binary tree and decision tree? ›

Binary search trees store data conveniently for efficient searching later, and bounds on worst-case searching and sorting can be derived from them. A decision tree, by contrast, is a tree in which internal nodes represent actions (tests), arcs represent the outcomes of an action, and leaves represent final outcomes.

What applications use binary trees? ›

Applications of Binary Tree
  • Binary trees are used as the basic data structure in Microsoft Excel and spreadsheets in general.
  • Binary trees are used to implement the indexing of segmented databases.
  • The splay tree (a binary tree variant) is used to implement efficient caches in hardware and software systems.

Which algorithm can be used to obtain the elements in a binary tree? ›

Binary tree sort: a sorting algorithm in which a binary search tree is built from the elements to be sorted, and an in-order traversal of the BST then yields the elements in sorted order.

Why use tanh instead of sigmoid? ›

We observe that the gradient of tanh is four times greater than the gradient of the sigmoid function. This means that using the tanh activation function results in higher values of gradient during training and higher updates in the weights of the network.

Which activation function is commonly used for binary classification? ›

In a binary classifier, we use the sigmoid activation function with one node. In a multiclass classification problem, we use the softmax activation function with one node per class. In a multilabel classification problem, we use the sigmoid activation function with one node per class.

Can binary classification be implemented using softmax regression? ›

The answer is not always a yes. You can always formulate the binary classification problem in such a way that both sigmoid and softmax will work. However you should be careful to use the right formulation. Sigmoid can be used when your last dense layer has a single neuron and outputs a single number which is a score.

Why linear regression fails in binary classification? ›

There are two things that explain why Linear Regression is not suitable for classification. The first one is that Linear Regression deals with continuous values whereas classification problems mandate discrete values. The second problem is regarding the shift in threshold value when new data points are added.

Why linear regression is not appropriate when Modelling binary dependent variables? ›

With binary data the variance is a function of the mean, and in particular is not constant as the mean changes. This violates one of the standard linear regression assumptions that the variance of the residual errors is constant.

What is the difference between linear and binary regression? ›

Variable type: linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups), while binary logistic regression requires the dependent variable to be binary, with two categories only (0/1).

Which activation function is best for binary classification neural network? ›

For binary classification, the logistic function (a sigmoid) and softmax will perform equally well, but the logistic function is mathematically simpler and hence the natural choice.

How do you train a binary classifier? ›

Binary Classification Using PyTorch: Training
  1. Prepare the training and test data.
  2. Implement a Dataset object to serve up the data.
  3. Design and implement a neural network.
  4. Write code to train the network.
  5. Write code to evaluate the model (the trained network)
Nov 4, 2020
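
A compressed, hedged sketch of those steps in PyTorch (synthetic data stands in for a real dataset; the architecture and hyperparameters are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 4)                            # 100 examples, 4 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)   # synthetic 0/1 labels

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))  # outputs a raw logit
loss_fn = nn.BCEWithLogitsLoss()                   # sigmoid + binary cross-entropy in one op
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):                           # train the network
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():                              # evaluate the trained network
    preds = (torch.sigmoid(model(X)) >= 0.5).float()
    print("train accuracy:", (preds == y).float().mean().item())
```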

Can RNN be used for binary classification? ›

Yes, but an RNN usually works best with time-series data, where past information needs to be incorporated. If classification is the sole goal and the data is not a time series, a simpler algorithm such as logistic regression should suffice and will reduce implementation complexity.

Which function do we have to use for binary classification? ›

It uses the sigmoid activation function in order to produce a probability output in the range of 0 to 1 that can easily and automatically be converted to crisp class values.

Which neural network is best for data classification? ›

Convolutional Neural Networks
  • Convolutional Neural Networks (CNNs) are the most popular neural network model used for image classification problems. ...
  • Consider a 256 x 256 image. ...
  • A convolution is a weighted sum of the pixel values of the image, computed as a window slides across the whole image.

Which algorithms are used for binary classification? ›

Popular algorithms that can be used for binary classification include: Logistic Regression. k-Nearest Neighbors. Decision Trees.

How do you make a binary classifier in Python? ›

To perform binary classification using logistic regression with sklearn, we must accomplish the following steps.
  1. Step 1: Define explanatory and target variables. ...
  2. Step 2: Split the dataset into training and testing sets. ...
  3. Step 3: Normalize the data for numerical stability.
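
Put together with scikit-learn, those steps look roughly like this (the breast-cancer dataset is just an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: define explanatory (X) and target (y) variables from a built-in binary dataset
X, y = load_breast_cancer(return_X_y=True)

# Step 2: split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: normalize the data for numerical stability
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit the binary classifier and check its accuracy on the held-out test set
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```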

Is KNN good for binary classification? ›

Yes, you certainly can use KNN with both binary and continuous data, but there are some important considerations you should be aware of when doing so.

Can LSTM be used for binary classification? ›

Yes. For example, an LSTM can be used to predict the next day's stock movement (up or down) from a sequence of previous days, which is a binary classification task on time-series data.

Is linear regression used for binary classification? ›

Linear regression is used for predicting continuous values, whereas logistic regression is used in the binary classification of values. In this article, we will have a look at how the two are different from each other.
