
13 Fully Connected Neural Networks
13.1 Introduction
As we first saw in Section 11.2.3, artificial neural networks, unlike polynomials and other fixed-shape approximators, have internal parameters that allow each of their units to take on a variety of shapes. In this chapter we expand on that introduction extensively, discussing general multi-layer neural networks, also referred to as fully connected networks, multi-layer perceptrons, and deep feed-forward neural networks.
13.2 Fully Connected Neural Networks
In this section we describe general fully connected neural networks, which are recursively built generalizations of the sort of units we first saw throughout Chapter 11. As this is an often confused subject we describe fully connected networks progressively (and with some redundancy that will hopefully benefit the reader), layer by layer, beginning with single-hidden-layer units first described in Section 11.2.3, providing algebraic, graphical, and computational perspectives on their construction. Afterwards, we briefly touch on the biological plausibility of fully connected networks, and end this section with an in-depth description of how to efficiently implement them in Python.
13.2.1 Single-hidden-layer units
The general algebraic representation (i.e., the formula) of a single-hidden-layer unit, also called a single-layer unit for short, is something we first saw in Equation (11.12) and is quite simple: a linear combination of input passed through a nonlinear activation function, which is typically an elementary mathematical function (e.g., tanh). Here we will denote such units in general as
f^{(1)}(x) = a\left( w_0^{(1)} + \sum_{n=1}^{N} w_n^{(1)} x_n \right)        (13.1)
where a(·) denotes an activation function, and the superscripts on f and w_0 through w_N indicate they represent a single- (i.e., one-) layer unit and its internal weights, respectively.
Because we will want to extend the single-layer idea to create multi-layer networks, it is helpful to pull apart the sequence of two operations used to construct a single-layer unit: the linear combination of input, and the passage through a nonlinear activation function. We refer to this manner of writing out the unit as the recursive recipe for creating single-layer neural network units, and summarize it below.
Recursive recipe for single-layer units
1. Choose an activation function a(·).
2. Compute the linear combination v = w_0^{(1)} + \sum_{n=1}^{N} w_n^{(1)} x_n.
3. Pass the result through the activation, forming a(v).
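As a concrete illustration of this recipe, the short NumPy sketch below evaluates a single-layer unit at an input point; the weight values, the choice of tanh activation, and the function name single_layer_unit are assumptions made purely for illustration, not notation taken from the text.

```python
import numpy as np

def single_layer_unit(x, w, a=np.tanh):
    """
    Evaluate a single-layer unit a(w_0 + sum_n w_n * x_n).

    x : (N,) input point
    w : (N+1,) weights, where w[0] is the bias w_0
    a : elementary activation function (tanh by default)
    """
    v = w[0] + np.dot(w[1:], x)   # step 2: linear combination of the input
    return a(v)                   # step 3: pass through the nonlinear activation

# example: a unit over scalar input (N = 1) with randomly set internal weights
w = np.random.randn(2)
print(single_layer_unit(np.array([0.5]), w))
```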
Example 13.1 Illustrating the capacity of single-layer units
In the top panels of Figure 13.1 we plot four instances of a single-layer unit using tanh as our nonlinear activation function. These four nonlinear units take the form
f^{(1)}(x) = \tanh\left( w_0^{(1)} + w_1^{(1)} x \right)        (13.2)
where in each instance the internal parameters of the unit (i.e., w_0^{(1)} and w_1^{(1)}) have been set randomly, giving each instance a distinct shape. This roughly illustrates the capacity of each single-layer unit. Capacity (a concept first introduced in Section 11.2.2) refers to the range of shapes a function like this can take, given all the different settings of its internal parameters.
In the bottom row of Figure 13.1 we swap out tanh for the ReLU1 activation function, forming a single-layer unit of the form
f^{(1)}(x) = \max\left( 0,\, w_0^{(1)} + w_1^{(1)} x \right).        (13.3)
Once again the internal parameters of this unit allow it to take on a variety of shapes (distinct from those created by tanh activation), four instances of which are illustrated in the bottom panels of the figure.
1 The Rectified Linear Unit or ReLU function was first introduced in the context of two-class classification and the Perceptron cost in Section 6.4.2.

Figure 13.1 Figure associated with Example 13.1. Four instances of a single-layer neural network unit with tanh (top row) and ReLU activation (bottom row). See text for further details.
If we form a general nonlinear model using B = U1 such single-layer units as

\text{model}(x, \Theta) = w_0 + f_1^{(1)}(x)\, w_1 + \cdots + f_{U_1}^{(1)}(x)\, w_{U_1}        (13.4)
whose jth unit takes the form
f_j^{(1)}(x) = a\left( w_{0,j}^{(1)} + \sum_{n=1}^{N} w_{n,j}^{(1)} x_n \right)        (13.5)
then the parameter set Θ contains not only the weights of the final linear combination, w_0 through w_{U_1}, but all parameters internal to each f_j^{(1)} as well. This is precisely the sort of model we used in the neural network examples throughout Chapter 11.
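As a rough sketch of how such a model is assembled from its individual units, the code below loops over B = U1 single-layer units and sums their weighted outputs; the helper names, the tanh activation, and the random weights are hypothetical choices made only for this example.

```python
import numpy as np

def unit(x, w_unit, a=np.tanh):
    """One single-layer unit a(w_0 + sum_n w_n x_n); w_unit has length N+1."""
    return a(w_unit[0] + np.dot(w_unit[1:], x))

def single_layer_model_loop(x, unit_weights, final_weights, a=np.tanh):
    """
    Evaluate model(x, Theta) = w_0 + f_1(x) w_1 + ... + f_U1(x) w_U1,
    looping over the U1 units one at a time.

    unit_weights  : list of U1 arrays, each of length N+1 (internal weights)
    final_weights : array of length U1+1 (final linear-combination weights)
    """
    out = final_weights[0]                                # bias w_0
    for j, w_unit in enumerate(unit_weights):
        out += unit(x, w_unit, a) * final_weights[j + 1]  # f_j(x) * w_j
    return out

# example: N = 2 inputs, U1 = 3 units, all weights set randomly
N, U1 = 2, 3
unit_weights = [np.random.randn(N + 1) for _ in range(U1)]
final_weights = np.random.randn(U1 + 1)
print(single_layer_model_loop(np.array([0.1, -0.7]), unit_weights, final_weights))
```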
Figure 13.2 (left panel) Graphical representation of a single-layer neural network model, given in Equation (13.4), which is composed of U1 single-layer units. (top-right panel) A condensed graphical representation of a single-layer neural network. (bottom-right panel) This network can be represented even more compactly, illustrating in a simple diagram all of the computation performed by a single-layer neural network model. See text for further details.

The left panel of Figure 13.2 shows a common graphical representation of the single-layer model in Equation (13.4), and visually unravels the individual algebraic operations performed by such a model. A visual representation like this is often referred to as a neural network architecture or just an architecture. Here the bias and input of each single-layer unit composing the model are shown as a sequence of dots all the way on the left of the diagram. This layer is "visible" to us since this is where we inject the input data to our network, which we ourselves can "see," and is also often referred to as the first or input layer of the
network. The linear combination of input leading to each unit is then shown visually by edges connecting the input to a hollow circle (summation unit), with the nonlinear activation then shown as a larger blue circle (activation unit). In the middle of this visual depiction (where the blue circles representing all U1 activations align) is the hidden layer of this architecture. This layer is called "hidden" because it contains internally processed versions of our input that we do not "see." While the name hidden is not entirely accurate (as we can visualize the internal state of these units if we so desire) it is a commonly used convention, hence the name single-hidden-layer unit. The output of these U1 units is then collected in a linear combination, and once again visualized by edges connecting each unit to a final summation shown as a hollow circle. This is the final output of the model, and is often called the final or output layer of the network, which is again "visible" to us (not hidden).
Compact representation of single-layer neural networks
Because we will soon wish to add more hidden layers to our rudimentary model in detailing multi-layer networks, the sort of visual depiction in the left panel of Figure 13.2 quickly becomes unwieldy. Thus, in order to keep ourselves organized and better prepared to understand deeper neural network units, it is quite helpful to compactify this visualization. We can do so by first using a more compact notation to represent our model algebraically, beginning by more concisely representing our input, placing a 1 at the top of our input vector x, which we denote by placing a hollow ring symbol over x as2
\mathring{x} = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_N \end{bmatrix}.        (13.6)

2 This notation was introduced and employed previously in Section 5.2.

Next we collect all of the internal parameters of our U1 single-layer units. Examining the algebraic form for the jth unit in Equation (13.5) we can see that it has N + 1 such internal parameters. Taking these parameters we form a column vector, starting with the bias w_{0,j}^{(1)} and then the input-touching weights w_{1,j}^{(1)} through w_{N,j}^{(1)}, and place them into the jth column of an (N + 1) × U_1 matrix

W_1 = \begin{bmatrix} w_{0,1}^{(1)} & w_{0,2}^{(1)} & \cdots & w_{0,U_1}^{(1)} \\ w_{1,1}^{(1)} & w_{1,2}^{(1)} & \cdots & w_{1,U_1}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N,1}^{(1)} & w_{N,2}^{(1)} & \cdots & w_{N,U_1}^{(1)} \end{bmatrix}.        (13.7)
With this notation note how the matrix-vector product W_1^T x̊ contains every linear combination internal to our U1 nonlinear units. In other words, W_1^T x̊ has dimension U_1 × 1, and its jth entry is precisely the linear combination of the input data internal to the jth unit
\left[ W_1^T \mathring{x} \right]_j = w_{0,j}^{(1)} + \sum_{n=1}^{N} w_{n,j}^{(1)} x_n \qquad j = 1, 2, \ldots, U_1.        (13.8)
Next, we extend our notation for the arbitrary activation function a(·) to handle such a vector. More specifically, we define a(·) as the vector function that takes in a general d × 1 vector v and returns – as output – a vector of the same dimension containing the activation of each of its input's entries, as
a(v) = \begin{bmatrix} a(v_1) \\ \vdots \\ a(v_d) \end{bmatrix}.        (13.9)

Notice how with this notation the vector activation a\left( W_1^T \mathring{x} \right) becomes a U_1 × 1 vector containing all U_1 single-layer units, the jth of which is given as
\left[ a\left( W_1^T \mathring{x} \right) \right]_j = a\left( w_{0,j}^{(1)} + \sum_{n=1}^{N} w_{n,j}^{(1)} x_n \right) \qquad j = 1, 2, \ldots, U_1.        (13.10)
Using another compact notation to denote the weights of the final linear combination

w_2 = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{U_1} \end{bmatrix}        (13.11)

and extending our vector a by tacking a 1 on top of it – denoting the resulting (U_1 + 1) × 1 vector \mathring{a} – we can finally write out the model in Equation (13.4) quite compactly as
\text{model}(x, \Theta) = w_2^T\, \mathring{a}\left( W_1^T \mathring{x} \right).        (13.12)
This more compact algebraic formulation lends itself to much more easily digestible visual depictions. In the top-right panel of Figure 13.2 we show a slightly condensed version of our original graph in the left panel, where the linear weights attached to each input are now shown more compactly as the set of crisscrossing line segments connecting the input to each unit, with the matrix W1 jointly representing all U1 of the weighted combinations. In the bottom-right panel of Figure 13.2 we compactify our original visual representation even further. In this more compact representation we can more easily visualize the computation performed by the general single-layer neural network model in Equation (13.12), where we depict the scalars, vectors, and matrices in this formula symbolically as circles, diamonds, and squares, respectively.
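The compact formula in Equation (13.12) translates almost directly into NumPy, as in the minimal sketch below; the dimensions, random weight initialization, and function names are assumptions made for the example rather than a prescribed implementation.

```python
import numpy as np

def single_layer_model(x, W1, w2, a=np.tanh):
    """
    Compute model(x, Theta) = w2^T a-ring(W1^T x-ring) for a single input x.

    x  : (N,) input point
    W1 : (N+1, U1) internal weights, row 0 holds the biases
    w2 : (U1+1,) final linear-combination weights, w2[0] is the bias
    a  : elementary activation applied element-wise
    """
    x_ring = np.concatenate(([1.0], x))       # x-ring: place a 1 on top of x
    hidden = a(W1.T @ x_ring)                 # a(W1^T x-ring): all U1 units at once
    a_ring = np.concatenate(([1.0], hidden))  # a-ring: place a 1 on top of the activations
    return w2 @ a_ring                        # final linear combination

# example with N = 2 inputs and U1 = 3 single-layer units, weights set randomly
N, U1 = 2, 3
W1 = np.random.randn(N + 1, U1)
w2 = np.random.randn(U1 + 1)
print(single_layer_model(np.array([0.4, -1.2]), W1, w2))
```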
13.2.2 Two-hidden-layer units
To create a two-hidden-layer neural network unit, or a two-layer unit for short, we recurse on the idea of the single-layer unit detailed in the previous section.

We do this by first constructing a set of U1 single-layer units and treating them as input to another nonlinear unit. That is, we take their linear combination and pass the result through a nonlinear activation.
The algebraic form of a general two-layer unit is given as
f^{(2)}(x) = a\left( w_0^{(2)} + \sum_{i=1}^{U_1} w_i^{(2)} f_i^{(1)}(x) \right)        (13.13)
which reflects the recursive nature of constructing two-layer units using single-layer ones. This recursive nature can also be seen in the recursive recipe given below for building two-layer units.
Recursive recipe for two-layer units
1. Choose an activation function a(·).
2. Construct U_1 single-layer units f_i^{(1)}(x) for i = 1, 2, ..., U_1.
3. Compute the linear combination v = w_0^{(2)} + \sum_{i=1}^{U_1} w_i^{(2)} f_i^{(1)}(x).
4. Pass the result through the activation, forming a(v).
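To make the recursion concrete, the sketch below builds one two-layer unit directly on top of a collection of single-layer units, mirroring the recipe above; the helper names, the tanh activation, and the weight values are illustrative assumptions.

```python
import numpy as np

def single_layer_units(x, W1, a=np.tanh):
    """Evaluate all U1 single-layer units a(W1^T x-ring) at the input x."""
    x_ring = np.concatenate(([1.0], x))   # input with a 1 on top
    return a(W1.T @ x_ring)               # (U1,) vector of first-layer units

def two_layer_unit(x, W1, w2, a=np.tanh):
    """
    Evaluate one two-layer unit
    f^(2)(x) = a( w2[0] + sum_i w2[i] * f_i^(1)(x) ),
    where the f_i^(1) are the single-layer units defined by W1.
    """
    f1 = single_layer_units(x, W1, a)     # step 2: the U1 single-layer units
    v = w2[0] + np.dot(w2[1:], f1)        # step 3: linear combination
    return a(v)                           # step 4: nonlinear activation

# example: N = 1 input dimension, U1 = 4 single-layer units, random internal weights
N, U1 = 1, 4
W1 = np.random.randn(N + 1, U1)
w2 = np.random.randn(U1 + 1)
print(two_layer_unit(np.array([0.3]), W1, w2))
```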
Example 13.2 Illustrating the capacity of two-layer units
In the top row of Figure 13.3 we plot four instances of a two-layer neural network unit – using tanh activation – of the form
f^{(2)}(x) = \tanh\left( w_0^{(2)} + w_1^{(2)} f^{(1)}(x) \right)
f^{(1)}(x) = \tanh\left( w_0^{(1)} + w_1^{(1)} x \right).
The wider variety of shapes taken on by instances of this unit, as shown in the figure, reflects the increased capacity of two-layer units over their single-layer analogs shown in Figure 13.1.
In the bottom row of the figure we show four exemplars of the same sort of unit, only now we use ReLU activation instead of tanh in each layer.

Figure 13.3 Figure associated with Example 13.2. Four instances of a two-layer neural network unit with tanh (top row) and ReLU activation (bottom row). See text for further details.
In general, if we wish to create a model using B = U2 two-layer neural network units we write
\text{model}(x, \Theta) = w_0 + f_1^{(2)}(x)\, w_1 + \cdots + f_{U_2}^{(2)}(x)\, w_{U_2}        (13.16)
and where the parameter set Θ, as always, contains those (superscripted) weights
f_j^{(2)}(x) = a\left( w_{0,j}^{(2)} + \sum_{i=1}^{U_1} w_{i,j}^{(2)} f_i^{(1)}(x) \right) \qquad j = 1, 2, \ldots, U_2        (13.17)
internal to the neural network units as well as the final linear combination
weights. Importantly, note that while each two-layer unit f_j^{(2)} in Equation (13.17) has unique internal parameters – denoted by w_{i,j}^{(2)}, where i ranges from 0 to U_1 – the weights internal to each single-layer unit f_i^{(1)} are the same across all the two-layer units themselves.
Figure 13.4 (left panel) Graphical representation of a two-layer neural network model, given in Equation (13.16), which is composed of U2 two-layer units. (top-right panel) A condensed graphical representation of a two-layer neural network. (bottom-right panel) This network can be represented more compactly, providing a simpler depiction of the computation performed by a two-layer neural network model. See text for further details.

Figure 13.4 shows a graphical representation (or architecture) of a generic two-layer neural network model whose algebraic form is given in Equation (13.16). In the left panel we illustrate each input single-layer unit precisely as shown previously in the top-right panel of Figure 13.2. The input layer, all the way to the left, is first fed into each of our U1 single-layer units (which still constitutes
the first hidden layer of the network). A linear combination of these single-layer units is then fed into each of the U2 two-layer units, referred to by convention as the second hidden layer, since its computation is also not immediately "visible" to us. Here we can also see why this sort of architecture is referred to as fully connected: every dimension of input is connected to every unit in the first hidden layer, and each unit of the first hidden layer is connected to every unit of the second hidden layer. At last, all the way to the right of this panel, we see a linear combination of the U2 two-layer units which produces the final (visible) layer of the network: the output of our two-layer model.

Compact representation of two-layer neural networks
As with single-layer models, here it is also helpful to compactify both our notation and the corresponding visualization of a two-layer model in order to simplify our understanding and make the concept easier to wield. Using the same notation introduced in Section 13.2.1, we can compactly designate the output of our U1 single-layer units as
output of first hidden layer:  \mathring{a}\left( W_1^T \mathring{x} \right).        (13.18)
Following the same pattern as before we can then condense all internal weights of the U2 units in the second layer column-wise into a (U1 + 1) × U2 matrix of the form
W_2 = \begin{bmatrix} w_{0,1}^{(2)} & w_{0,2}^{(2)} & \cdots & w_{0,U_2}^{(2)} \\ w_{1,1}^{(2)} & w_{1,2}^{(2)} & \cdots & w_{1,U_2}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{U_1,1}^{(2)} & w_{U_1,2}^{(2)} & \cdots & w_{U_1,U_2}^{(2)} \end{bmatrix}        (13.19)
which mirrors precisely how we defined the (N + 1) × U1 internal weight matrix W1 for our single-layer units in Equation (13.7). This allows us to likewise express the output of our U2 two-layer units compactly as
output of second hidden layer:  \mathring{a}\left( W_2^T\, \mathring{a}\left( W_1^T \mathring{x} \right) \right).        (13.20)
The recursive nature of two-layer units is on full display here. Remember that we use the notation a ̊ (·) somewhat loosely as a vector-valued function in the sense that it simply represents taking the nonlinear activation a (·), element-wise, of whatever vector is input into it as shown in Equation (13.9), with a 1 appended to the top of the result.
Concatenating the final linear combination weights into a single vector as

w_3 = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{U_2} \end{bmatrix}        (13.21)

allows us to write the full two-layer neural network model as
\text{model}(x, \Theta) = w_3^T\, \mathring{a}\left( W_2^T\, \mathring{a}\left( W_1^T \mathring{x} \right) \right).        (13.22)
As with its single-layer analog, this compact algebraic formulation of a two-layer neural network lends itself to much more easily digestible visual depictions. In the top-right panel of Figure 13.4 we show a slightly condensed version of the original graph in the left panel, where the redundancy of showing every single-layer unit has been reduced to a single visual representation. In doing so we remove all the weights assigned to crisscrossing edges connecting the first and second hidden layers, and place them in the matrix W2 defined in Equation (13.19) to avoid cluttering the visualization. In the bottom-right panel of Figure 13.4 we condense this two-layer depiction even further, where scalars, vectors, and matrices are depicted symbolically as circles, diamonds, and squares, respectively. This greatly compacted graph provides a simplified visual representation of the total computation performed by a general two-layer neural network model.
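As a rough translation of Equation (13.22) into code, the sketch below chains the two matrix multiplications and vector activations together; the particular dimensions, the tanh activation, and the random weights are assumptions used only for illustration.

```python
import numpy as np

def a_ring(v, a=np.tanh):
    """Apply the activation element-wise and append a 1 on top."""
    return np.concatenate(([1.0], a(v)))

def two_layer_model(x, W1, W2, w3):
    """
    Compute model(x, Theta) = w3^T a-ring(W2^T a-ring(W1^T x-ring)).

    W1 : (N+1, U1)   first hidden layer weights
    W2 : (U1+1, U2)  second hidden layer weights
    w3 : (U2+1,)     final linear-combination weights
    """
    x_ring = np.concatenate(([1.0], x))   # input with a 1 on top
    layer1 = a_ring(W1.T @ x_ring)        # first hidden layer output, shape (U1+1,)
    layer2 = a_ring(W2.T @ layer1)        # second hidden layer output, shape (U2+1,)
    return w3 @ layer2                    # final linear combination

# example with N = 2, U1 = 5, U2 = 3 and randomly initialized weights
N, U1, U2 = 2, 5, 3
W1 = np.random.randn(N + 1, U1)
W2 = np.random.randn(U1 + 1, U2)
w3 = np.random.randn(U2 + 1)
print(two_layer_model(np.array([1.0, -0.5]), W1, W2, w3))
```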
13.2.3 General multi-hidden-layer units
Following the same pattern we have seen thus far in describing single- and two-layer units, we can construct general fully connected neural network units with an arbitrary number of hidden layers. With each hidden layer added we increase the capacity of a neural network unit (as we have seen previously in graduating from single-layer units to two-layer units), as well as that of any model built using such units.
To construct a general L-hidden-layer neural network unit, or L-layer unit for short, we simply recurse on the pattern we have established previously L - 1 times, with the resulting L-layer unit taking as input a number U_{L-1} of (L - 1)-layer units, as
f^{(L)}(x) = a\left( w_0^{(L)} + \sum_{i=1}^{U_{L-1}} w_i^{(L)} f_i^{(L-1)}(x) \right).        (13.23)
As was the case with single- and two-layer units, this formula is perhaps easier to digest if we think about it in terms of the recursive recipe given below.
Recursive recipe for L-layer units
1. Choose an activation function a(·).
2. Construct U_{L-1} (L - 1)-layer units f_i^{(L-1)}(x) for i = 1, 2, ..., U_{L-1}.
3. Compute the linear combination v = w_0^{(L)} + \sum_{i=1}^{U_{L-1}} w_i^{(L)} f_i^{(L-1)}(x).
4. Pass the result through the activation, forming a(v).
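The recursion in this recipe can be written down almost verbatim in code. The sketch below computes the vector of all layer-l units by recursing on the layer beneath it; the list-of-weight-matrices representation, the tanh activation, and the function name layer_units are assumptions chosen for illustration.

```python
import numpy as np

def layer_units(x, W_list, a=np.tanh):
    """
    Recursively evaluate all units of the deepest layer described by W_list.

    W_list : list [W_1, ..., W_l] of weight matrices, where W_k has shape
             (U_{k-1}+1, U_k) and U_0 is the input dimension N.
    Returns the (U_l,) vector of layer-l unit values at the input x.
    """
    if len(W_list) == 1:
        below = x                               # base case: the raw input
    else:
        below = layer_units(x, W_list[:-1], a)  # the (l-1)-layer units
    below_ring = np.concatenate(([1.0], below)) # append a 1 for the bias
    return a(W_list[-1].T @ below_ring)         # linear combination + activation

# example: N = 2 inputs and three hidden layers of sizes 4, 4, and 3
sizes = [2, 4, 4, 3]
W_list = [np.random.randn(sizes[k] + 1, sizes[k + 1]) for k in range(3)]
print(layer_units(np.array([0.5, -0.5]), W_list))
```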

Note that while in principle the same activation function need not be used for all hidden layers of an L-layer unit, for the sake of simplicity, a single kind of activation is almost always used.
Example 13.3 Illustrating the capacity of three-layer units
In the top row of Figure 13.5 we show four instances of a three-layer unit with tanh activation. The greater variety of shapes shown here, as compared to single- and two-layer analogs in Examples 13.1 and 13.2, reflects the increased capacity of these units. In the bottom row we repeat the same experiment, only using the ReLU activation function instead of tanh.
Figure 13.5 Figure associated with Example 13.3. Four instances of a three-layer network unit with tanh (top row) and ReLU activation (bottom row). See text for further details.
In general we can produce a model consisting of B = UL such L-layer units as
\text{model}(x, \Theta) = w_0 + f_1^{(L)}(x)\, w_1 + \cdots + f_{U_L}^{(L)}(x)\, w_{U_L}        (13.24)

whose jth unit takes the form

f_j^{(L)}(x) = a\left( w_{0,j}^{(L)} + \sum_{i=1}^{U_{L-1}} w_{i,j}^{(L)} f_i^{(L-1)}(x) \right) \qquad j = 1, 2, \ldots, U_L        (13.25)
and where the parameter set Θ contains both the weights internal to the neural network units and the final linear combination weights.
Figure 13.6 shows an unraveled graphical representation of this model, which is a direct generalization of the kinds of visualizations we have seen previously with single- and two-layer networks. From left to right we can see the input layer to the network, its L hidden layers, and the output. Often models built using three or more hidden layers are referred to as deep neural networks in the jargon of machine learning.
Figure 13.6 Graphical representation of an L-layer neural network model, given in Equation (13.24), which is composed of U_L L-layer units.
Compact representation of multi-layer neural networks
To simplify our understanding of a general multi-layer neural network architecture we can use precisely the same compact notation and visualizations we have introduced in the simpler contexts of single- and two-layer neural networks. In complete analogy to the way we compactly represented two-layer neural networks, we denote the output of the Lth hidden layer compactly as
output of Lth hidden layer:  \mathring{a}\left( W_L^T\, \mathring{a}\left( W_{L-1}^T\, \mathring{a}\left( \cdots\, \mathring{a}\left( W_1^T \mathring{x} \right) \cdots \right) \right) \right).

Denoting the weights of the final linear combination as
w_{L+1} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{U_L} \end{bmatrix}
we can express the L-layer neural network model compactly as

\text{model}(x, \Theta) = w_{L+1}^T\, \mathring{a}\left( W_L^T\, \mathring{a}\left( W_{L-1}^T\, \mathring{a}\left( \cdots\, \mathring{a}\left( W_1^T \mathring{x} \right) \cdots \right) \right) \right).
Once again this compact algebraic formulation of an L-layer neural network lends itself to much more easily digestible visual depictions. The top panel of Figure 13.7 shows a condensed version of the original graph in Figure 13.6, where the redundancy of showing every (L - 1)-layer unit has been reduced to a single visualization. The succinct visual depiction shown in the bottom panel of Figure 13.7 represents this network architecture even more compactly, with scalars, vectors, and matrices shown symbolically as circles, diamonds, and squares, respectively.

Figure 13.7 (top panel) A condensed graphical representation of an L-layer neural network model shown in Figure 13.6. (bottom panel) A more compact version, succinctly describing the computation performed by a general L-layer neural network model. See text for further details.
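To tie the compact notation together, the following sketch evaluates the full L-layer model by iterating over the hidden layers one at a time before applying the final linear combination; the layer sizes, the tanh activation, and the random weight initialization are assumptions for illustration only, and this is a sketch rather than the implementation developed in Section 13.2.6.

```python
import numpy as np

def multilayer_model(x, W_list, w_final, a=np.tanh):
    """
    Compute model(x, Theta) = w_{L+1}^T a-ring(W_L^T a-ring( ... a-ring(W_1^T x-ring) ... )).

    W_list  : list [W_1, ..., W_L], where W_l has shape (U_{l-1}+1, U_l)
              (with U_0 = N, the input dimension)
    w_final : (U_L+1,) final linear-combination weights w_{L+1}
    """
    h = np.concatenate(([1.0], x))               # x-ring: input with a 1 on top
    for W in W_list:                             # one pass per hidden layer
        h = np.concatenate(([1.0], a(W.T @ h)))  # activate, then append a 1
    return w_final @ h                           # final linear combination

# example: N = 3 inputs and L = 3 hidden layers of sizes 10, 10, 5
layer_sizes = [3, 10, 10, 5]
W_list = [np.random.randn(layer_sizes[l] + 1, layer_sizes[l + 1])
          for l in range(len(layer_sizes) - 1)]
w_final = np.random.randn(layer_sizes[-1] + 1)
print(multilayer_model(np.random.randn(3), W_list, w_final))
```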
13.2.4 Selecting the right network architecture
We have now seen a general and recursive method of constructing arbitrarily "deep" neural networks, but many curiosities and technical issues remain to be addressed that are the subject of subsequent sections in this chapter. These include the choice of activation function, popular cross-validation methods for models employing neural network units, as well as a variety of optimization-related issues such as the notions of backpropagation and batch normalization.
However, one fundamental question can (at least in general) be addressed now, which is how we choose the "right" number of units and layers for a neural network architecture. As with the choice of proper universal approximator in general (as detailed in Section 11.8), typically we do not know a priori what sort of architecture will work best for a given dataset (in terms of number of hidden layers and number of units per hidden layer). In principle, to determine the best architecture for use with a given dataset we must cross-validate an array
of choices. In doing so we note that, generally speaking, the capacity gained by adding new individual units to a neural network model is typically much smaller than the capacity gained by the addition of new hidden layers. This is because appending an additional layer to an architecture grafts an additional recursion onto the computation involved in each unit, which significantly increases their capacity and that of any corresponding model, as we have seen in Examples 13.1, 13.2, and 13.3. In practice, performing model search across a variety of neural network architectures can be expensive, hence compromises must be made that aim at determining a high-quality model using minimal computation. To this end, early stopping based regularization (see Sections 11.6.2 and 13.7) is commonly employed with neural network models.
13.2.5 Neural networks: the biological perspective
The human brain contains roughly 10^11 biological neurons which work together in concert when we perform cognitive tasks. Even when performing relatively minor tasks we employ a sizable series of interconnected neurons – called biological neural networks – to perform the task properly. For example, somewhere between 10^5 and 10^6 neurons are required to render a realistic image of our visual surroundings. Most of the basic jargon and modeling principles of neural network universal approximators we have seen thus far originated as a (very) rough mathematical model of such biological neural networks.
An individual biological neuron (shown in the top-left panel of Figure 13.8) consists of three main parts: dendrites (the neuron's receivers), soma (the cell body), and axon (the neuron's transmitter). Starting around the 1940s psychologists and neuroscientists, with a shared desire to understand the human brain better, became interested in modeling neurons mathematically. These early models, later dubbed artificial neurons (a basic exemplar of which is shown in the top-right panel of Figure 13.8), culminated in the introduction of the Perceptron model in 1957 [59]. Closely mimicking a biological neuron's structure, an artificial neuron comprises a set of dendrite-like edges that connect it to other neurons, each taking an input and multiplying it by a (synaptic) weight associated with that edge. These weighted inputs are summed up after going through a summation unit (shown by a small hollow circle). The result is subsequently fed to an activation unit (shown by a large blue circle) whose output is then transmitted to the outside via an axon-like projection. From a biological perspective, neurons are believed to remain inactive until the net input to the cell body (soma) reaches a certain threshold, at which point the neuron gets activated and fires an electro-chemical signal, hence the name activation function.
Stringing together large sets of such artificial neurons in layers creates a more mathematically complex model of a biological neural network, which still remains a very simple approximation to what goes on in the brain. The bottom panel of Figure 13.8 shows the kind of graphical representation (here of a two-layer network) used when thinking about neural networks from this biological perspective. In this rather complex visual depiction every multiplicative operation of the architecture is shown as an edge, creating a mesh of intersecting lines connecting each layer. In describing fully connected neural networks in Sections 13.2.1 through 13.2.3 we preferred simpler, more compact visualizations that avoid this sort of complex visual mesh.

Figure 13.8 (top-left panel) A typical biological neuron. (top-right panel) An artificial neuron, that is, a simplistic mathematical model of the biological neuron, consisting of: (i) weighted edges that represent the individual multiplications (of 1 by w_0, x_1 by w_1, etc.), (ii) a summation unit, shown as a small hollow circle, representing the sum w_0 + w_1 x_1 + · · · + w_N x_N, and (iii) an activation unit, shown as a larger blue circle, representing this sum evaluated by the nonlinear activation function a(·). (bottom panel) An example of a fully connected two-layer neural network as commonly illustrated when detailing neural networks from a biological perspective.
13.2.6 Python implementation
In this section we show how to implement a generic model consisting of L-layer neural network units in Python using NumPy. Because such models can have a large number of parameters, and because associated cost functions employing them can be highly nonconvex (as further discussed in Section 13.5), first-order
local optimization methods (e.g., gradient descent and its variants detailed in Chapter 3 and Appendix A) are the most common schemes used to tune the parameters of general neural network models. Moreover, because computing derivatives of a model employing neural network units "by hand" is extremely tedious (see S