Neural Networks
Neurons as structural constituents of the brain [Ramón y Cajal, 1911];
Five to six orders of magnitude slower than silicon logic gates ;
Events in a silicon chip happen in the nanosecond range, versus the millisecond range for neural events;
A truly staggering number of neurons (nerve cells) with massive interconnections between them ;
Neural Networks
Each neuron receives input from other units and decides whether or not to fire;
Approximately 10 billion neurons in the human cortex, and 60 trillion synapses or connections [Shepherd and Koch, 1990] ;
The energy efficiency of the brain is approximately $10^{-16}$ joules per operation per second, versus approximately $10^{-8}$ in a computer;
Neurons
Receives input signals through its dendrites;
Sends output signals along its (single) axon;
Three major types of neurons: sensory neurons, motor neurons, and interneurons;
Neurons
How do they work?
Control the influence from one neuron on another:
Excitatory when weight is positive; or
Inhibitory when weight is negative;
The cell body (soma) is responsible for summing the incoming signals;
If the sum is above some threshold, then fire!
Neurons
Artificial Neuron
Neural Networks
It appears that one reason why the human brain is so powerful is the sheer complexity of the connections between neurons;
The brain exhibits a huge degree of parallelism;
Artificial Neural Networks
Model each part of the neuron and interactions;
Interact multiplicatively (e.g. $w_0x_0$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $w_0$ );
Learn the synaptic strengths;
Artificial Neural Networks
Function Approximation Machines
Datasets as composite functions: $y=f^{*}(x)$
Maps $x$ input to a category (or a value) $y$;
Learn the synaptic weights and approximate $y$ with $\hat{y}$:
$\hat{y} = f(x;w)$
Learn the $w$ parameters;
Artificial Neural Networks
Can be seen as a directed graph with units (or neurons) situated at the vertices;
Some units (input units) receive signals from the outside world;
The remaining ones are named computation units;
Each unit produces an output, which is transmitted to other units along the arcs of the directed graph;
Artificial Neural Networks
Input , Output , and Hidden layers;
Hidden as in “their desired outputs are not defined by the training data”;
Artificial Neural Networks
Motivation Example (taken from Jay Alammar blog post )
Imagine that you want to forecast the price of houses at your neighborhood;
After some research you found that 3 people sold houses for the following values:
| Area (sq ft) ($x$) | Price ($y$) |
|---|---|
| 2,104 | $\$399,900$ |
| 1,600 | $\$329,900$ |
| 2,400 | $\$369,000$ |
Artificial Neural Networks
Motivation Example (taken from Jay Alammar blog post )
If you want to sell a 2,000 sq ft house, how much should you ask for it?
How about finding the average price per square feet ?
$\$180$ per sq ft.
Artificial Neural Networks
Motivation Example (taken from Jay Alammar blog post )
Our very first neural network looks like this:
Artificial Neural Networks
Motivation Example (taken from Jay Alammar blog post )
Multiplying $2,000$ sq ft by $180$ gives us $\$360,000$.
Calculating the prediction is simple multiplication.
We needed to think about the weight we multiply by;
That is what training means! (See the code sketch after the table below.)
| Area (sq ft) ($x$) | Price ($y$) | Estimated Price ($\hat{y}$) |
|---|---|---|
| 2,104 | $\$399,900$ | $\$378,720$ |
| 1,600 | $\$329,900$ | $\$288,000$ |
| 2,400 | $\$369,000$ | $\$432,000$ |
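A minimal sketch of this prediction rule in Python (the $\$180$ weight and the areas come from the example above; the variable names are illustrative):

# Hypothetical sketch: predict prices with a single weight (no bias yet)
areas = [2104, 1600, 2400]         # x, in square feet
prices = [399900, 329900, 369000]  # y, in dollars
W = 180                            # dollars per square foot, chosen "manually"
predictions = [W * x for x in areas]
print(predictions)                 # [378720, 288000, 432000]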
Artificial Neural Networks
Motivation Example (taken from Jay Alammar blog post )
How bad is our model?
Calculate the Error ;
A better model is one that has less error;
Mean Squared Error: $2,058$ (computed in the sketch after the table below);
| Area (sq ft) ($x$) | Price ($y$) | Estimated Price ($\hat{y}$) | $y-\hat{y}$ (thousands of \$) | $(y-\hat{y})^2$ |
|---|---|---|---|---|
| 2,104 | $\$399,900$ | $\$378,720$ | $21$ | $449$ |
| 1,600 | $\$329,900$ | $\$288,000$ | $42$ | $1756$ |
| 2,400 | $\$369,000$ | $\$432,000$ | $-63$ | $3969$ |
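The same error computation as a small sketch (the error columns above are expressed in thousands of dollars):

# Hypothetical sketch: mean squared error, with errors in thousands of dollars
prices      = [399900, 329900, 369000]   # y
predictions = [378720, 288000, 432000]   # y_hat = 180 * area
errors = [(y - y_hat) / 1000 for y, y_hat in zip(prices, predictions)]
mse = sum(e ** 2 for e in errors) / len(errors)
print(round(mse))                        # ~2058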
Artificial Neural Networks
Fitting the line to our data:
Follows the equation: $\hat{y} = W * x$
Artificial Neural Networks
How about adding the Intercept?
$\hat{y}=Wx + b$
Artificial Neural Networks
The Bias
Artificial Neural Networks
Try to train it manually:
Artificial Neural Networks
How to discover the correct weights?
Gradient Descent:
Finding the minimum of a function ;
Look for the best weights values, minimizing the error ;
Takes steps proportional to the negative of the gradient of the function at the current point.
The gradient is a vector that points in the direction of the greatest increase of the function.
Artificial Neural Networks
Gradient Descent
In mathematics, the gradient is the vector of partial derivatives with respect to every input variable of the function;
The negative gradient is a vector pointing in the direction of the greatest decrease of the function;
Minimize a function by iteratively moving a little bit in the direction of the negative gradient (see the sketch below);
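A minimal sketch of gradient descent on the house-price line $\hat{y} = Wx + b$ (the data come from the example above; the scaling, learning rate, and number of steps are arbitrary choices for illustration):

import numpy as np

# Hypothetical sketch: gradient descent on y_hat = W*x + b for the house data.
# Areas are in thousands of sq ft and prices in thousands of dollars to keep
# the gradients well behaved.
x = np.array([2.104, 1.600, 2.400])   # area (thousands of sq ft)
y = np.array([399.9, 329.9, 369.0])   # price (thousands of dollars)

W, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    y_hat = W * x + b
    error = y_hat - y
    dW = 2 * np.mean(error * x)   # partial derivative of the MSE w.r.t. W
    db = 2 * np.mean(error)       # partial derivative of the MSE w.r.t. b
    W -= learning_rate * dW       # step against the gradient
    b -= learning_rate * db

print(W, b)  # slope and intercept of the fitted line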
Artificial Neural Networks
Gradient Descent
Artificial Neural Networks
Gradient Descent
Artificial Neural Networks
Perceptron
In 1958, Frank Rosenblatt proposed an algorithm for training the perceptron.
Simplest form of Neural Network;
One unique neuron;
Adjustable Synaptic weights
Artificial Neural Networks
Perceptron
Classification of observations into two classes:
Artificial Neural Networks
Perceptron
Classification of observations into two classes:
Artificial Neural Networks
Perceptron
Find the $w_i$ values that could solve the OR problem.
Artificial Neural Networks
Perceptron
Artificial Neural Networks
Perceptron
One possible solution $w_0=-1$, $w_1=1.1$, $w_2=1.1$:
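A quick check of this solution on the OR truth table (a minimal sketch; the step activation fires when the weighted sum is greater than or equal to zero):

# Hypothetical sketch: the proposed perceptron weights on the OR inputs
def perceptron(x1, x2, w0=-1.0, w1=1.1, w2=1.1):
    s = w0 * 1 + w1 * x1 + w2 * x2   # w0 multiplies a constant input of 1 (the bias)
    return 1 if s >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 1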
Artificial Neural Networks
Keras: a high-level neural networks API;
Capable of running on top of TensorFlow , CNTK , or Theano ;
Focus on enabling fast experimentation ;
Go from idea to result with the least possible delay ;
Runs seamlessly on CPU and GPU ;
Compatible with: Python 2.7-3.6 ;
Artificial Neural Networks
Artificial Neural Networks
Dense means a fully connected layer.
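A minimal sketch of such a model in Keras for the OR data (the optimizer, loss, and number of epochs are assumptions, not necessarily the configuration that produced the outputs shown later; the variable names match the evaluation snippet below):

import numpy as np
from tensorflow import keras

# Hypothetical sketch: a single Dense unit (one neuron) on the OR inputs
train_data_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
train_data_Y = np.array([0, 1, 1, 1])

model = keras.Sequential([
    keras.layers.Dense(1, input_shape=(2,))   # fully connected to the 2 inputs
])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(train_data_X, train_data_Y, epochs=500, verbose=0)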
Artificial Neural Networks
Computational Graphs:
Nodes represent both inputs and operations;
Even relatively “simple” deep neural networks have hundreds of thousands of nodes and edges;
Lots of operations can run in parallel;
Example: $(x*y)+(w*z)$
Makes it easier to create an automatic differentiation strategy (see the sketch below);
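A minimal sketch of this computational graph in TensorFlow (the input values are arbitrary):

import tensorflow as tf

# Hypothetical sketch: the graph for (x*y) + (w*z).
# The two multiplications are independent nodes, so they could run in parallel;
# automatic differentiation walks the same graph backwards.
x, y, w, z = (tf.Variable(float(v)) for v in (2, 3, 4, 5))

with tf.GradientTape() as tape:
    out = (x * y) + (w * z)

grads = tape.gradient(out, [x, y, w, z])
print(out.numpy())                  # 26.0
print([g.numpy() for g in grads])   # [3.0, 2.0, 5.0, 4.0]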
Artificial Neural Networks
Evaluate the quality of the model:
# Use evaluate function to get the loss and other metrics that the framework
# makes available
loss_and_metrics = model.evaluate(train_data_X, train_data_Y)
print(loss_and_metrics)
#0.4043288230895996
# Do a prediction using the trained model
prediction = model.predict(train_data_X)
print(prediction)
# [[-0.25007164]
# [ 0.24998784]
# [ 0.24999022]
# [ 0.7500497 ]]
Artificial Neural Networks
Exercise:
Run the example in the Jupyter notebook:
Perceptron - OR
Artificial Neural Networks
Perceptron
Exercise:
What about the AND function?
| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Artificial Neural Networks
Activation Functions
Artificial Neural Networks
Activation Functions
Multiply the inputs by their weights, add the bias, and apply the activation;
Sigmoid, Hyperbolic Tangent, Rectified Linear Unit;
Differentiable function instead of the step function;
With this modification, a multi-layered network of perceptrons becomes differentiable. Hence gradient descent can be applied to minimize the network’s error, and the chain rule can “back-propagate” proper error derivatives to update the weights of every layer of the network.
At the moment, one of the most efficient ways to train a multi-layer neural network is by using gradient descent with backpropagation. A requirement for the backpropagation algorithm is a differentiable activation function. However, the Heaviside step function is non-differentiable at $x = 0$ and has a derivative of 0 elsewhere. This means that gradient descent won’t be able to make progress in updating the weights.
The main objective of the neural network is to learn the values of the weights and biases so that the model produces a prediction as close as possible to the real value. In order to do this, as in many optimisation problems, we’d like a small change in a weight or bias to cause only a small corresponding change in the output of the network. That way, we can continuously tweak the values of the weights and biases towards the best approximation. A function that can only output either 0 or 1 (yes or no) won’t help us achieve this objective.
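A minimal sketch of the differentiable activations mentioned above and their derivatives (in NumPy):

import numpy as np

# Hypothetical sketch: sigmoid, hyperbolic tangent, and ReLU with their derivatives
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # np.tanh itself is the activation

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.linspace(-3, 3, 7)
print(sigmoid(x), np.tanh(x), relu(x))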
Artificial Neural Networks
The Bias
Artificial Neural Networks
The Bias
Artificial Neural Networks
Perceptron - What it can’t do !
Artificial Neural Networks
Perceptron - Solving the XOR problem
A 3D example of the solution learned for the OR function:
Using the sigmoid function;
Artificial Neural Networks
Perceptron - Solving the XOR problem
Maybe there is a combination of functions that could create hyperplanes that separate the XOR classes:
By increasing the number of layers we increase the complexity of the function represented by the ANN:
Now there are 2 hyperplanes that, when put together, can perfectly separate the classes;
Artificial Neural Networks
Perceptron - Solving the XOR problem
The combination of the layers:
That is what people mean when they say we don’t know how deep neural networks work: we know that the model is a composition of functions, but the shape of that composition remains hard to characterize;
Yesterday we saw polynomial transformations of features; there, too, we changed the shape of the regression line being built;
Artificial Neural Networks
Perceptron - Solving the XOR problem
Artificial Neural Networks
Multilayer Perceptrons - Increasing the model power
Artificial Neural Networks
Multilayer Perceptrons - Increasing the model power
Information flows from $x$, through the intermediate computations, and finally to the output $y$.
No feedback connections, which is why these are called feedforward networks! (A small sketch of such a model follows.)
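A minimal sketch of a multilayer perceptron for XOR in Keras (the hidden-layer size, activations, optimizer, and loss are assumptions; the variable names match the training snippet used later):

import numpy as np
from tensorflow import keras

# Hypothetical sketch: a small MLP for the XOR problem
X_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_data = np.array([0, 1, 1, 0])

model = keras.Sequential([
    keras.layers.Dense(4, activation='sigmoid', input_shape=(2,)),  # hidden layer
    keras.layers.Dense(1, activation='sigmoid')                     # output layer
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])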
Artificial Neural Networks
Understanding the training
Plot the architecture of the network:
tf.keras.utils.plot_model(model, show_shapes=True, show_layer_names=False)
Artificial Neural Networks
Understanding the training
Plotting the training progress of the XOR ANN:
history = model.fit(x=X_data, y=Y_data, epochs=2500, verbose=0)
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.title('Model Training Progression')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Loss'], loc='upper left')
plt.show()
This is called the learning curve;
In the case of the XOR, what is wrong with that curve?
Artificial Neural Networks
Problems with the training procedure:
Saddle points:
No matter how long you train your model, the error remains (almost) constant!
That happens because of a limitation of the optimization procedure;
Imagine that you could add momentum to gradient descent: it could probably keep updating past the flat region;
Artificial Neural Networks
Optimization alternatives
Gradient Descent is not always the best option:
It only updates the weights after computing the gradient over the whole dataset;
It can take a long time to find the minimum point;
Artificial Neural Networks
Optimization alternatives
For large datasets, the vectorized data may not fit into memory.
Artificial Neural Networks
Optimization alternatives
Minibatch:
* Updates are less noisy compared to SGD, which leads to better convergence.
* More updates per epoch compared to batch GD, so fewer epochs are required for large datasets.
* Batches fit well into processor memory, which makes computing faster.
Artificial Neural Networks
Optimization alternatives
Adaptive Learning Rates:
For Adagrad:
Parameters that receive small/infrequent updates (sparse features) get a high learning rate, whereas parameters that receive large/frequent updates (dense features) get a low learning rate; the rates are adapted at every update;
The learning rate decays very aggressively;
RMSProp: keeps a moving average of the squared gradients instead, which damps the large oscillations caused by a high learning rate or large gradients (see the sketch below);
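A minimal sketch of these optimizer choices in Keras (the hyperparameter values are illustrative defaults, not recommendations):

from tensorflow import keras

# Hypothetical sketch: a few optimizers available in Keras
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # momentum helps cross flat regions
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)         # per-parameter adaptive rates
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)        # moving average of squared gradients

# Any of them can be passed to compile, e.g.:
# model.compile(optimizer=rmsprop, loss='binary_crossentropy')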
Artificial Neural Networks
Multilayer Perceptron - XOR
Artificial Neural Networks
Predicting probabilities
Imagine that we have more than 2 classes to output;
One of the most popular uses of ANNs;
Artificial Neural Networks
Predicting probabilities
Softmax: a function that takes as input a vector of $K$ real numbers and normalizes it into a probability distribution;
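A minimal sketch of the softmax function in NumPy (subtracting the maximum is a standard numerical-stability trick; the scores are arbitrary):

import numpy as np

# Hypothetical sketch: softmax over a vector of K real numbers
def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # approx. [0.659, 0.242, 0.099]
print(softmax(scores).sum())    # 1.0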
Artificial Neural Networks
Loss functions
For regression problems
Mean squared error is not always the best one to go with;
What if we have a three-class problem?
Alternatives: mean_absolute_error, mean_squared_logarithmic_error;
The logarithm changes the scale, since the error can grow really fast;
Artificial Neural Networks
Loss functions
Cross Entropy loss:
Default loss function to use for binary classification problems.
Measures the performance of a model whose output is a probability value between 0 and 1;
Loss increases as the predicted probability diverges from the actual label;
A perfect model would have a log loss of 0;
As the predicted probability for the correct class decreases, the log loss increases rapidly:
For instance, when the model should answer 1 but does so with a very low probability (see the sketch below);
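A minimal sketch of the binary cross-entropy (log loss) for a single prediction (the probabilities are arbitrary examples):

import numpy as np

# Hypothetical sketch: binary cross-entropy for one example
def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_cross_entropy(1, 0.99))   # ~0.01: confident and correct, small loss
print(binary_cross_entropy(1, 0.01))   # ~4.6: true label is 1 but predicted with low probability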
Artificial Neural Networks
Dealing with overfitting
Dropout layers:
Randomly disable some of the neurons during the training passes;
Artificial Neural Networks
Dealing with overfitting
“Drops out” a random set of activations in that layer by setting them to zero;
Forces the network to be redundant;
The network should be able to provide the right classification for a specific example even if some of the activations are dropped out (a usage sketch follows);
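A minimal sketch of Dropout layers in a Keras model (the layer sizes, input shape, and the 0.2 drop rate are illustrative choices):

from tensorflow import keras

# Hypothetical sketch: Dense layers interleaved with Dropout
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    keras.layers.Dropout(0.2),   # randomly zeroes 20% of the activations during training
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid')
])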
Artificial Neural Networks
Larger Example
The MNIST dataset: database of handwritten digits;
Dataset included in Keras;
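A minimal sketch of loading the dataset from Keras and preparing it for an MLP (the flattening and scaling steps are common practice, not necessarily what the course notebook does):

from tensorflow import keras

# Hypothetical sketch: loading the MNIST digits shipped with Keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)

# Flatten the 28x28 images and scale pixel values to [0, 1] for an MLP
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0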
Artificial Neural Networks
The MNIST MLP
Try to improve the classification results using this notebook:
Things to try:
Increase the number of neurons at the first layer;
Change the optimizer and the loss function;
Try the categorical_crossentropy loss and the rmsprop optimizer;
Try adding some extra layers;
Artificial Neural Networks
The MNIST MLP
Artificial Neural Networks
The Exercise
Artificial Neural Networks
The Exercise
Artificial Neural Networks
The Exercise
Quantum Chromodynamics
Artificial Neural Networks
Signal VS Background
Artificial Neural Networks
Signal VS Background
Run this Jupyter Notebook for performing the Jet Classification.