History of artificial neural networks

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. While the computational implementations of ANNs relate to earlier discoveries in mathematics, their creation was inspired by biological neural circuitry. The first implementation of ANNs was the perceptron by Frank Rosenblatt.^{[‡ 1]} Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling this period an "AI winter".^[1]

Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (i.e., one with many layers) called AlexNet.^{[‡ 2]} It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring.^[2] The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language,^{[‡ 3]} and is the predominant architecture used by large language models such as GPT-4. Diffusion models were first described in 2015, and became the basis of image generation models such as DALL-E in the 2020s.

Mathematical foundations

Jürgen Schmidhuber suggests that the first neural network was the method of linear regression by least squares, first published by Adrien-Marie Legendre in 1805^[3] and independently developed by Carl Friedrich Gauss (who claimed use since 1795)^[4] and Robert Adrain (1808),^[5]^[6] as it is mathematically equivalent to a two-layer neural network without activation functions.^[7]^[a] The chain rule, developed by Gottfried Wilhelm Leibniz in 1676, and gradient descent, independently proposed by Augustin-Louis Cauchy in 1847 and Jacques Hadamard in 1907,^{[citation needed]} are also central to the development of neural networks, as they form the basis of techniques for updating weights within complex networks.^[7]^{[clarification needed]}
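As a minimal sketch of this equivalence, the following Python code fits a single linear unit (inputs connected to one output through weights, with no activation function) by gradient descent on the mean squared error, using the chain rule to obtain the gradients. The function name `fit_linear`, the learning rate, the number of steps, and the toy data are illustrative assumptions and are not taken from any of the cited sources.

```python
def fit_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Chain rule: d/dw (w*x + b - y)^2 = 2*(w*x + b - y)*x,
        # and d/db (w*x + b - y)^2 = 2*(w*x + b - y).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Gradient descent: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear(xs, ys)
print(round(w, 4), round(b, 4))
```

On this toy data the fitted weight and bias approach the least-squares solution, which is the same line a closed-form regression would return.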

Biological and computational models

Artificial neural network development of the 20th century was primarily based on research into the function of biological neurons.^[8]^[9] The neuron's place as the primary functional unit of the nervous system was first recognized in the late 19th century through the work of Santiago Ramón y Cajal, notably through his 1888 paper presenting staining of axons in the cerebellum of birds.^[10] Alexander Bain's Mind and Body (1873) proposed that thoughts and bodily activity result from neuronal processes, with each thought corresponding to a distinct neural grouping.^[11] William James's The Principles of Psychology (1890) advanced two principles on a quasi-neurological basis: first, that when two brain processes are active together, one tends to propagate excitement into the other, and second, that activity at any brain point is the sum of tendencies from all other points discharging into it.^[11]^[12]^[b]

Warren McCulloch and Walter Pitts's 1943 paper "A Logical Calculus of the Ideas Immanent in Nervous Activity" studied several abstract models for neural networks, using the symbolic logic of Rudolf Carnap and Principia Mathematica. The paper argued that several abstract models of neural networks (some learning, some not) have the same computational power as Turing machines.^{[‡ 4]}^[13] This model paved the way for research to split into two approaches: one focused on biological processes, while the other focused on the application of neural networks to artificial intelligence.^{[citation needed]} This also led to work on nerve networks and their link to finite automata.^[14]^{[importance?]} Some^[who?] consider McCulloch and Pitts to be the founders of connectionism, a theory of mind in opposition to classical computationalism.^[15]^[11]^{[clarification needed]}

In his 1948 report "Intelligent Machinery", published posthumously in 1969, Alan Turing proposed randomly connected networks of neuron-like nodes trainable through "education", defining A-type machines with random networks of NAND gates and B-type machines with modifiable connections.^[16]

In 1949, the psychologist Donald O. Hebb published The Organization of Behavior, which proposed a learning hypothesis based on the mechanism of neural plasticity which became known as Hebbian learning, summarized as "neurons that fire together, wire together".^{[‡ 5]}^[17] Similar observations were made by Jerzy Konorski in 1948.^[18] The concept was used in many early neural networks, such as Rosenblatt's perceptron and the Hopfield network.^{[citation needed]} This evolved into models for long-term potentiation.^{[citation needed]}

Belmont Farley and Wesley A. Clark (1954) were the first to use computational machines to simulate a Hebbian network.^{[‡ 6]}^[19] Other neural network computational machines were simulated by Nathaniel Rochester, John Holland, Lois Haibt and William Duda (1956).^{[‡ 7]}^[20]^: 31

In 1959, a biological model was proposed by David H. Hubel and Torsten Wiesel based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.^[21]^{[importance?]}

Perceptrons and other early neural networks

Frank Rosenblatt created the perceptron in 1957 while working at the Cornell Aeronautical Laboratory and published the details the following year.^{[‡ 1]} The perceptron was designed to classify objects into two categories, updating based on error feedback.^[22]^{[clarification needed]} He initially simulated the perceptron on an IBM 704, later designing the Mark I Perceptron, the first hardware neural net.^[23]^{[better source needed]} In 1958, Rosenblatt proposed the multilayer perceptron (MLP) model, consisting of an input layer, a hidden non-learning layer with randomized weights, and an output layer with learnable connections. He published the book Principles of Neurodynamics in 1962, which also introduced variants and computer experiments, including a version (developed alongside Henry David Block and Bruce Knight) with four-layer perceptrons where the last two layers have learned weights.^{[‡ 8]}^{[‡ 9]}
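The idea of classifying into two categories and updating on error feedback can be sketched with the textbook error-correction rule shown below. This is an illustrative reconstruction rather than Rosenblatt's original procedure or hardware; the names `train_perceptron` and `predict`, the learning rate, the epoch count, and the logical AND task are all assumptions chosen for brevity.

```python
def predict(w, b, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust weights only when the thresholded output is wrong."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)  # -1, 0, or +1
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b


# Logical AND is linearly separable, so the rule settles on a solution.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print([predict(w, b, x) for x, _ in AND])
```

Because the error is computed after the threshold, the weights stop changing as soon as every training example is classified correctly.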

Bernard Widrow and his doctoral student Marcian Hoff developed ADALINE (Adaptive Linear Neuron) in 1960. Unlike Rosenblatt's perceptron, ADALINE adjusted weights based on their least mean squares (LMS) algorithm before applying the threshold function.^[11]^[24] MADALINE, the multilayer extension, was used to eliminate echo on phone lines, likely the first artificial neural network applied to a real‑world engineering problem.^[11]^[25]
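The difference from the perceptron rule can be sketched as follows: the error is measured on the linear sum before thresholding, so the weights keep moving toward a least-squares fit even when the thresholded output is already correct. This is a minimal illustration only; the names `train_adaline` and `classify`, the use of targets of -1 and +1, the learning rate, and the epoch count are assumptions and do not reproduce Widrow and Hoff's original setup.

```python
def train_adaline(samples, lr=0.05, epochs=500):
    """LMS rule: update on the error of the linear output, not the thresholded one."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            net = w[0] * x[0] + w[1] * x[1] + b  # linear activation, no threshold yet
            error = target - net
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b


def classify(w, b, x):
    """Apply the threshold only when a class label is needed."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1


# Logical AND with targets encoded as -1 and +1 (an illustrative choice).
AND_PM = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_adaline(AND_PM)
print([classify(w, b, x) for x, _ in AND_PM])
```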

Multi-layer learning

Group method of data handling, a method to train arbitrarily deep neural networks, was published by Alexey Ivakhnenko and Valentin Lapa in 1965; they regarded it as a form of polynomial regression^{[‡ 10]} or a generalization of Rosenblatt's perceptron.^{[‡ 11]} A 1971 paper described a deep network with the equivalent of eight layers trained by this method.^{[‡ 12]}^[17]

The first deep learning multilayer perceptron trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari.^{[‡ 13]} According to Amari, in computer experiments conducted by his student Saito, a five-layer MLP with two modifiable layers learned internal representations to classify non-linearly separable pattern classes.^[7] Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.^{[citation needed]}

Backlash and AI winter

In 1969, Marvin Minsky and Seymour Papert published Perceptrons with the goal of showing the limitations of neural network systems.^[20]^{: 97, 110} They proved single-layer perceptrons cannot compute non-linearly separable functions such as XOR. They also demonstrated that certain problems (e.g. determining parity) are impossible for single-layer networks under restriction of "conjunctive locality",^{[clarification needed]} and the maximum number of connections required to compute others (e.g. connectedness) grows arbitrarily large with input size.^{[‡ 14]}^[20]^{: 141–150}
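The XOR limitation can be illustrated, though not proved, by a brute-force search: the sketch below tries every weight and bias combination on a coarse grid for a single threshold unit and reports whether any of them reproduces a given truth table. The function name `separable_on_grid` and the grid itself are illustrative assumptions; a finite search of this kind only demonstrates the behaviour that the formal proof establishes in general.

```python
from itertools import product


def separable_on_grid(truth_table, grid):
    """Return True if some (w1, w2, b) on the grid reproduces the truth table."""
    for w1, w2, b in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == target
               for (x1, x2), target in truth_table):
            return True
    return False


grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(separable_on_grid(AND, grid))  # a separating line exists for AND
print(separable_on_grid(XOR, grid))  # no single line separates XOR
```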

Although these conclusions were limited in scope (for example, to networks with at most two layers), funding from American and British agencies for neural network projects subsequently decreased^[26]^[27] in favor of symbolic AI,^{[citation needed]} and the number of computer scientists working in the field fell.^[28]

Backpropagation

Up until the 1970s, neural networks were limited in their capacity to learn and update their neurons. The "learning rule" used by Rosenblatt for the perceptron only allowed for training a single layer of a neural network. The terminology "back-propagating errors" was introduced by Rosenblatt in 1962 to describe a (hypothetical) multilayer generalization of his perceptron learning algorithm.^{[‡ 15]}^[29] The aforementioned least mean squares (LMS) algorithm, also known as the Widrow–Hoff learning rule or the Delta rule, was more general but still limited to single layers.^{[citation needed]}

Backpropagation is an efficient application of the chain rule derived by Gottfried Wilhelm Leibniz in 1673 to networks of differentiable nodes.^{[‡ 16]} Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory,^{[‡ 17]} also discovered by Arthur E. Bryson independently around the same time.^{[‡ 18]} He presented a form of gradient descent to solve problems where neurons have continuous output, as opposed to the discrete (binary) output of existing neural networks.^[30]^[31]

The modern form of backpropagation was developed multiple times in the early 1970s. The earliest published instance was Seppo Linnainmaa's 1970 master's thesis.^{[‡ 19]}^[32] His FORTRAN code efficiently computed the derivatives of nested, differentiable functions by caching intermediate steps,^[33] used to calculate arithmetic rounding errors for the results of complex expressions.^[32] He published some of his results in English in 1976.^{[‡ 20]}^[32] Paul Werbos developed it independently in 1971 or 1972,^[34]^: 342 published in his PhD thesis in 1974.^[35]^[36] In 1982, he became the first person to apply backpropagation to neural networks.^{[‡ 21]}^[33] In 1986, David E. Rumelhart et al. popularized backpropagation.^{[‡ 22]}^[37]^: 11
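The core idea of computing derivatives of nested functions by caching intermediate results can be sketched with a very small reverse-mode differentiation routine. This is an illustrative toy in Python, not a reconstruction of Linnainmaa's FORTRAN code or of any cited implementation; the class `Var` and the functions `tanh` and `backward` are hypothetical names introduced here.

```python
import math


class Var:
    """A value plus the local derivatives cached during the forward pass."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def tanh(v):
    t = math.tanh(v.value)
    return Var(t, ((v, 1.0 - t * t),))


def backward(output):
    """Propagate d(output)/d(node) from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):  # depth-first topological ordering
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule, reusing cached values


# One neuron: y = tanh(w*x + b), with illustrative values.
x, w, b = Var(0.5), Var(2.0), Var(-1.0)
y = tanh(w * x + b)
backward(y)
print(w.grad, x.grad, b.grad)
```

Each local derivative is computed once in the forward pass and reused in the backward pass, so the cost of obtaining all gradients is comparable to the cost of evaluating the function itself.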

Recurrent network architectures

One origin of the recurrent neural network (RNN) was statistical mechanics. The Ising model was developed by Wilhelm Lenz^{[‡ 23]} and Ernst Ising^{[‡ 24]} in the 1920s^[38] as a simple statistical mechanical model of magnets at equilibrium. Glauber in 1963 studied the Ising model evolving in time, as a process towards equilibrium (Glauber dynamics), adding in the component of time.^{[‡ 25]} Shun'ichi Amari in 1972 proposed to modify the weights of an Ising model using a Hebbian learning rule as a model of associative memory, adding in the component of learning.^{[‡ 26]} This was popularized as the Hopfield network (1982).^{[‡ 27]}
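A minimal sketch of this kind of associative memory is given below: units take values of -1 or +1 as in the Ising model, weights are set by a Hebbian rule from a stored pattern, and repeated updates pull a corrupted state back to the stored one. The function names `train_hopfield` and `recall`, the single eight-unit pattern, and the deterministic update order are illustrative assumptions rather than details from the cited papers.

```python
def train_hopfield(patterns):
    """Hebbian weights: units that are active together get a positive coupling."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, sweeps=5):
    """Update each unit in turn to agree with the sign of its local field."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


stored = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hopfield([stored])

noisy = list(stored)
noisy[0], noisy[3] = -1, 1  # corrupt two of the eight units
print(recall(w, noisy) == stored)
```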

Another origin of RNN was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Santiago Ramón y Cajal observed "recurrent semicircles" in the cerebellar cortex.^[39] In 1933, Rafael Lorente de Nó discovered "recurrent, reciprocal connections" by Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex.^{[‡ 28]}^[40] Hebb considered "reverberating circuits" as an explanation for short-term memory.^[41]^{[better source needed]} McCulloch and Pitts (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.^[42]

Two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNNs to the study of cognitive psychology.^{[citation needed]} In 1993, a "neural history compressor" system by Schmidhuber solved a task that required more than 1000 subsequent layers in an RNN unfolded in time.^{[clarification needed]}^{[‡ 29]}^[43]^[44]^: 5

Long short-term memory

Sepp Hochreiter's diploma thesis (1991)^{[‡ 30]} identified and analyzed the vanishing gradient problem.^{[‡ 30]}^[45]^{[clarification needed]} Hochreiter suggested recurrent residual connections to solve the problem, leading to the publication of long short-term memory (LSTM) in 1995, which set accuracy records in multiple application domains.^[46]^{[verification needed]} LSTM can learn "very deep learning" tasks with long credit assignment paths that require memories of events that happened thousands of discrete time steps before.^[37]^: 17,19 This early LSTM was not yet the modern architecture, which requires the "forget gate" introduced in 1999^{[‡ 31]} and later became the standard RNN architecture.^{[citation needed]}
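How a gated cell state can carry information across many time steps may be sketched with a scalar LSTM cell that includes a forget gate. This follows the commonly presented textbook form and is not taken from the cited papers; the function `lstm_step`, the parameter names in `params`, and the hand-picked parameter values (forget gate held open, input gate held closed) are assumptions chosen to make the effect visible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell with a forget gate."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # hidden state exposed to the rest of the network
    return h, c


# Hand-picked (not learned) parameters: remember everything, write nothing.
params = {"wf": 0.0, "uf": 0.0, "bf": 20.0,
          "wi": 0.0, "ui": 0.0, "bi": -20.0,
          "wo": 0.0, "uo": 0.0, "bo": 0.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0  # the cell starts out holding the value 1.0
for _ in range(1000):
    h, c = lstm_step(0.5, h, c, params)
print(c)  # still close to 1.0 after 1000 steps

# Closing the forget gate instead erases the stored value in one step.
_, c_erased = lstm_step(0.5, 0.0, 1.0, dict(params, bf=-20.0))
print(c_erased)
```

Because the cell state is updated additively and scaled by a gate close to one, the stored value (and, during training, its gradient) does not shrink at each step in the way it does in a plain recurrent unit.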

Around 2006, LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications.^{[‡ 32]}^{[‡ 33]} LSTM also improved large-vocabulary speech recognition^{[‡ 34]}^{[‡ 35]} and text-to-speech synthesis^{[‡ 36]} and was used in Google voice search, and dictation on Android devices.^{[‡ 37]}

LSTM broke records for improved machine translation,^{[‡ 38]} language modeling,^[47] and multilingual language processing.^{[‡ 39]} LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.^{[‡ 40]}

Convolutional neural networks

Kunihiko Fukushima introduced the neocognitron in 1980.^[48]^{[‡ 41]}^[49] It was inspired by the work of David H. Hubel and Torsten Wiesel in the 1950s and 1960s, which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters.^{[citation needed]} Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. Downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.^{[citation needed]}
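The two layer types can be sketched in a few lines of Python: a convolutional layer slides one shared filter over every patch of the input, and a downsampling layer averages non-overlapping patches of the result. The function names `conv2d` and `avg_pool`, the 5×5 toy image, and the hand-written edge filter are illustrative assumptions; the neocognitron itself used different units and learned its filters.

```python
def conv2d(image, kernel):
    """Apply one shared filter to every patch of the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def avg_pool(fmap, size=2):
    """Downsample by averaging non-overlapping size x size patches."""
    return [[sum(fmap[r + i][c + j]
                 for i in range(size) for j in range(size)) / (size * size)
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# A 5x5 image that is dark on the left and bright on the right.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# A 2x2 filter that responds to a dark-to-bright vertical edge.
edge_filter = [[-1.0, 1.0],
               [-1.0, 1.0]]

feature_map = conv2d(image, edge_filter)  # 4x4 map, strongest at the edge
pooled = avg_pool(feature_map)            # 2x2 summary
print(pooled)
```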

In 1969, Fukushima introduced the ReLU (rectified linear unit) activation function.^{[‡ 42]}^[7] The rectifier is the most popular activation function for CNNs and deep neural networks in general.^[50]

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.^{[‡ 43]}^{[clarification needed]} It did so by sharing weights in combination with backpropagation training.^{[‡ 44]} Thus, while using a pyramidal structure as in the neocognitron, it optimized weights globally instead of locally.^{[‡ 43]}

In 1988, Wei et al. applied backpropagation to a CNN (a simplified neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They proposed a CNN for an optical computing system.^{[‡ 45]}^{[‡ 46]}

Max pooling appears in Fukushima's 1982 publication on the neocognitron.^{[‡ 47]} In 1989, Yann LeCun et al. trained a max pooling CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.^{[‡ 48]}^{[‡ 49]} Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei et al. modified their model by removing the last fully connected layer. They applied it for medical image object segmentation in 1991^{[‡ 50]} and breast cancer detection in mammograms in 1994.^{[‡ 51]}

In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, Weng et al. used max-pooling where a downsampling unit computes the maximum (rather than the average) of the activations in its patch.^{[‡ 52]}^{[‡ 53]}^{[‡ 54]}^{[‡ 55]}
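The difference between the two downsampling rules can be sketched as follows. The helper names `pool` and `mean` and the 4×4 example values are illustrative assumptions, not details of the cresceptron.

```python
def pool(fmap, size, reduce_fn):
    """Downsample non-overlapping size x size patches with the given reduction."""
    return [[reduce_fn([fmap[r + i][c + j]
                        for i in range(size) for j in range(size)])
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def mean(values):
    return sum(values) / len(values)


fmap = [[0.0, 9.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [2.0, 2.0, 0.0, 0.0],
        [2.0, 2.0, 0.0, 4.0]]

max_pooled = pool(fmap, 2, max)   # keeps the strongest activation per patch
avg_pooled = pool(fmap, 2, mean)  # spreads a single strong activation out
print(max_pooled)
print(avg_pooled)
```

In the top-left patch a single strong activation of 9.0 survives max pooling unchanged but is diluted to 2.25 by averaging.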

LeNet-5, a 7-level CNN by LeCun et al. in 1998^{[‡ 56]} that classifies digits, was applied by banks to recognize hand-written numbers on checks digitized in 32x32 pixel images. Processing higher-resolution images requires larger and more numerous CNN layers.

In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.^{[‡ 57]}

Rprop (short for "resilient backpropagation") is a first-order optimization algorithm. It was created by Martin Riedmiller and Heinrich Braun (1992).^{[‡ 58]}^{[‡ 59]} Sven Behnke (2003) relied on only the sign of the gradient (Rprop)^{[‡ 60]} on problems such as image reconstruction and face localization.
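Relying only on the sign of the gradient can be sketched as below. This is a simplified variant without the weight backtracking of the original algorithm, and it is not taken from the cited papers; the function name `rprop_minimize`, the step-size bounds, the initial step, and the toy quadratic objective are assumptions, while the increase and decrease factors of 1.2 and 0.5 are the values commonly used for Rprop.

```python
def rprop_minimize(grad, x0, steps=200, step0=0.1,
                   inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """Simplified Rprop: per-parameter step sizes driven only by gradient signs."""
    x = list(x0)
    step = [step0] * len(x)
    prev = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            if g[i] * prev[i] > 0:    # same sign as last time: speed up
                step[i] = min(step[i] * inc, step_max)
            elif g[i] * prev[i] < 0:  # sign flipped: a minimum was overshot
                step[i] = max(step[i] * dec, step_min)
            sign = (g[i] > 0) - (g[i] < 0)
            x[i] -= sign * step[i]    # the gradient magnitude is ignored
        prev = g
    return x


def grad(p):
    """Gradient of f(x, y) = (x - 3)^2 + 10*(y + 1)^2, a toy objective."""
    return [2.0 * (p[0] - 3.0), 20.0 * (p[1] + 1.0)]


solution = rprop_minimize(grad, [0.0, 0.0])
print([round(v, 3) for v in solution])
```

Since only signs are used, the badly scaled second coordinate (whose gradient is ten times larger) is handled in the same way as the first.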

Deep learning

The deep learning revolution started around CNN- and GPU-based computer vision. Although CNNs trained by backpropagation had been around for decades and GPU implementations of NNs for years,^{[‡ 61]} including CNNs,^{[‡ 62]} faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning.^[51]

A key driver of the deep learning revolution was progress in hardware, especially GPUs. Some early work dated back to 2004.^{[‡ 61]}^{[‡ 62]} In 2009, Rajat Raina, Anand Madhavan, and Andrew Ng reported a 100-million-parameter deep belief network trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training.^{[‡ 63]}

In 2011, a CNN named DanNet^{[‡ 64]}^{[‡ 65]} by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3.^[37] It then won more contests.^{[‡ 66]}^{[‡ 67]} They also showed how max-pooling CNNs on GPU improved performance significantly.^{[‡ 68]}

Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU, used by Fukushima in 1969,^{[‡ 42]} worked better than widely used activation functions prior to 2011.^{[citation needed]}

In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the large-scale ImageNet competition by a significant margin over shallow machine learning methods.^{[‡ 69]} Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman^{[‡ 70]} and Google's Inceptionv3.^{[‡ 71]}

The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.^{[‡ 40]}^{[‡ 72]}^{[‡ 73]}

In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers.^{[‡ 70]} Stacking too many layers led to a steep reduction in training accuracy,^[52] known as the "degradation" problem.^[53] In 2015, two techniques were developed concurrently to train very deep networks: the highway network^{[‡ 74]} and the residual neural network (ResNet).^{[‡ 75]} The ResNet research team empirically tested various tricks for training deeper networks until they discovered the deep residual network architecture.^[54]
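The residual idea can be sketched as a block that adds its input back to the output of its layer, so that a block whose weights are zero simply passes the signal through. The code below is a simplified illustration with a single fully connected layer per block; the names `residual_block` and `deep_stack` and the zero-weight setup are assumptions, and actual ResNets use convolutional layers with normalization.

```python
def relu(z):
    return max(0.0, z)


def residual_block(x, weights, biases):
    """Return x + F(x), where F is one linear layer followed by ReLU."""
    fx = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(weights, biases)]
    return [xi + fi for xi, fi in zip(x, fx)]


def deep_stack(x, blocks):
    for weights, biases in blocks:
        x = residual_block(x, weights, biases)
    return x


dim = 3
zero_block = ([[0.0] * dim for _ in range(dim)], [0.0] * dim)
signal = [1.0, -2.0, 0.5]

# Fifty blocks with zero weights leave the signal unchanged.
output = deep_stack(signal, [zero_block] * 50)
print(output)
```

With the skip connection, extra blocks only need to learn a correction to the identity mapping, which is one common explanation of why added depth does not have to hurt training accuracy.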

Generative adversarial networks

In 1991, Jürgen Schmidhuber published "artificial curiosity", neural networks in a zero-sum game.^{[‡ 76]} The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set.^[55] It was extended to "predictability minimization" to create disentangled representations of input patterns.^{[‡ 77]}^{[‡ 78]}

Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo.^{[‡ 79]}^{[importance?]} This idea was never implemented and did not involve stochasticity in the generator and thus was not a generative model. It is now known as a conditional GAN or cGAN.^{[citation needed]} An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013.^{[‡ 80]}

Another inspiration for GANs was noise-contrastive estimation,^{[‡ 81]} which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014.^{[citation needed]}

Generative adversarial networks (GANs), as introduced by Ian Goodfellow et al. in 2014,^{[‡ 82]} became state of the art in generative modeling during the 2014–2018 period. Excellent^[opinion] image quality was achieved by Nvidia's StyleGAN (2018),^[56] based on the Progressive GAN by Tero Karras et al.^{[‡ 83]} Here the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GANs reached popular success and provoked discussions concerning deepfakes.^[57] Diffusion models (2015)^{[‡ 84]} have since eclipsed GANs in generative modeling, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).

Attention mechanism and transformers

Human selective attention has been studied in both neuroscience and cognitive psychology.^[58] Selective attention of auditory inputs was studied by Colin Cherry in 1953, who first defined and named the cocktail party effect.^{[‡ 85]} Donald Broadbent proposed the filter model of attention in 1958.^{[‡ 86]} Selective attention of vision was studied in the 1960s by George Sperling using the partial report paradigm. Saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve all of the visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.^{[‡ 87]}^{[importance?]}

These research studies inspired algorithms such as a variant of the neocognitron.^[59]^{[failed verification]}^{[‡ 88]} Developments in neural networks have inspired circuit models of biological visual attention.^[60]^[61]

A key aspect of the attention mechanism is the use of multiplicative operations, which have been studied under the names of higher-order neural networks,^{[‡ 89]} multiplication units,^{[‡ 90]} sigma-pi units,^[62] fast weight controllers,^{[‡ 91]} and hyper-networks.^{[‡ 92]}

Recurrent attention

During the deep learning era, the attention mechanism was developed to address problems in sequence encoding and decoding.^[63]^{[incomprehensible]}

The idea of encoder-decoder sequence transduction had been developed in the early 2010s. Two papers from 2014 are most commonly cited as the originators of seq2seq.^{[‡ 93]}^{[‡ 94]} The seq2seq architecture employs two RNNs, typically LSTMs, an "encoder" and a "decoder", for sequence transduction, such as machine translation. Seq2seq became state-of-the-art in machine translation and was instrumental in the development of the attention mechanism and transformer.

An image captioning model that would encode an input image into a fixed-length vector was proposed in 2015, citing inspiration from the seq2seq model.^{[‡ 40]} In 2015, Kelvin Xu et al. applied the attention mechanism as used in the seq2seq model to image captioning,^[64] citing Bahdanau et al. 2014.^[65]

Transformer

One problem with seq2seq models was their use of recurrent neural networks, which cannot be parallelized, as both the encoder and the decoder process the sequence token-by-token. Decomposable attention attempted to solve this problem by processing the input sequence in parallel, before computing a "soft alignment matrix"; "alignment" is the terminology used by Bahdanau et al. 2014.^{[‡ 95]} This allowed parallel processing.^{[citation needed]}

The idea of using the attention mechanism for self-attention, rather than only between an encoder and a decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers and neural Turing machines.^{[‡ 96]} Using an attention mechanism for self-attention was termed intra-attention by Jianpeng Cheng et al.^{[‡ 97]} Intra-attention occurs where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were combined in the transformer architecture, published in "Attention Is All You Need" in 2017. Subsequently, attention mechanisms were extended within the framework of the transformer architecture.

Seq2seq models with attention still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied an attention mechanism to a feedforward network, which is easy to parallelize.^{[‡ 98]} One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, thus the title "Attention Is All You Need".^[66]

In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention Is All You Need" paper. The focus of that research was on improving seq2seq for machine translation by removing its recurrence, so that all tokens are processed in parallel, while preserving its dot-product attention mechanism to maintain text processing performance,^{[‡ 3]} which were important factors in its widespread use in large neural networks.^[67]
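Dot-product attention without recurrence can be sketched as below: each query is compared with every key by a dot product, the scores are normalized with a softmax, and the output is the corresponding weighted average of the values. The function names `softmax` and `attention`, the scaling by the square root of the key dimension, and the tiny hand-written vectors are illustrative choices; the sketch omits the learned projections and multiple heads of the full transformer.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists."""
    d = len(keys[0])
    outputs = []
    for q in queries:  # each query is handled independently of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]    # identical keys receive equal weight
values = [[2.0, 0.0], [4.0, 2.0]]
result = attention(queries, keys, values)
print(result)
```

No output depends on the result for a previous position, so all positions can be computed at the same time, in contrast to the token-by-token processing of a recurrent network.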

Unsupervised and self-supervised learning

Self-organizing maps

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982.^[68]^{[‡ 99]} SOMs are neurophysiologically inspired^[69]^{[better source needed]} artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.
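Competitive learning in a SOM can be sketched as follows: for each input, the unit with the closest weight vector wins, and the winner and its neighbors on the map are moved toward that input. The code uses a one-dimensional chain of units for two-dimensional inputs; the names `best_matching_unit` and `train_som`, the fixed learning rate and neighborhood radius, and the toy data are assumptions, and practical SOMs usually shrink both quantities during training.

```python
def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to the input."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def train_som(data, weights, epochs=20, lr=0.3, radius=1):
    """Competitive learning on a one-dimensional chain of units."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in data:
            winner = best_matching_unit(weights, x)
            lo = max(0, winner - radius)
            hi = min(len(weights), winner + radius + 1)
            for k in range(lo, hi):  # the winner and its map neighbors
                weights[k] = [wi + lr * (xi - wi)
                              for wi, xi in zip(weights[k], x)]
    return weights


# Two-dimensional inputs mapped onto a chain of four units.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
initial = [(0.2, 0.8), (0.4, 0.6), (0.6, 0.4), (0.8, 0.2)]
trained = train_som(data, initial)
print(best_matching_unit(trained, (0.0, 0.0)))
```

Because neighboring units are updated together, units that are adjacent on the map tend to end up representing nearby regions of the input space.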

SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.

Boltzmann machines

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, and others, including the Boltzmann machine,^{[‡ 100]} restricted Boltzmann machine,^[70] Helmholtz machine,^{[‡ 101]} and the wake-sleep algorithm.^{[‡ 102]} These were designed for unsupervised learning of deep generative models. However, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986.^[71]^: 112

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine (RBM)^[72] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.^{[‡ 103]}^[73]

Deep learning

In 2012, Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.^{[‡ 104]}^{[importance?]}

Other aspects

Knowledge distillation

Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration.^[74] In 1992, several papers studied the statistical mechanics of teacher-student network configuration, where both networks are committee machines^{[‡ 105]}^{[‡ 106]} or both are parity machines.^{[‡ 107]}

Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction, and it was solved by two RNNs. One of them (the "atomizer") predicted the sequence, and the other (the "chunker") predicted the errors of the atomizer. Simultaneously, the atomizer predicted the internal states of the chunker. Once the atomizer could predict the chunker's internal states well, it would start fixing the errors, and soon the chunker became obsolete, leaving just one RNN in the end.^{[‡ 108]}

A related methodology was model compression or pruning, where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s, via methods such as Biased Weight Decay^{[‡ 109]} and Optimal Brain Damage.^{[‡ 110]}

Hardware-based designs

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.^[75]

Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices^[76] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).^[77]

Notes

^ The simplest feedforward neural network consists of a single layer of output nodes without any nonlinear activation functions; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated at each node. The mean squared errors between these calculated outputs and the given target values are minimized by creating an adjustment to the weights.
^ Neurons generate an action potential—the release of neurotransmitters that are chemical inputs to other neurons—based on the sum of their incoming chemical inputs.

References

^ Crevier, Daniel (1993). AI: The Tumultuous Search for Artificial Intelligence. New York, NY: BasicBooks. ISBN 0-465-02997-3.
^ Gershgorn, Dave (26 July 2017). "The data that transformed AI research—and possibly the world". Quartz.
^ Merriman, Mansfield. A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes. Vol. 4. Academy, 1877.
^ Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474. Bibcode:1981AnSta...945451S. doi:10.1214/aos/1176345451.
^ Dutka, Jacques (1990). "Robert Adrain and the Method of Least Squares". Archive for History of Exact Sciences. 41 (2). Springer Nature: 171–184. doi:10.1007/BF00411864. JSTOR 41133885.
^ White, Halbert (1990). "Least Squares". Time Series and Statistics. Palgrave Macmillan UK. pp. 118–125. doi:10.1007/978-1-349-20865-4_15. ISBN 978-0-333-49551-3.
^ ^a ^b ^c ^d Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
^ Roheen Qamar, Baqar Ali Zardari (2 August 2023). "Artificial Neural Networks: An Overview". Mesopotamian Journal of Computer Science. 2023. Mesopotamian Academic Press: 124–133. doi:10.58496/mjcsc/2023/015.
^ Macukow, Bohdan (2016). Neural Networks – State of Art, Brief History, Basic Models and Architecture. Computer Information Systems and Industrial Management. Springer International Publishing. doi:10.1007/978-3-319-45378-1_1.
^ López-Muñoz F, Boya J, Alamo C (October 2006). "Neuron theory, the cornerstone of neuroscience, on the centenary of the Nobel Prize award to Santiago Ramón y Cajal". Brain Research Bulletin. 70 (4–6): 391–405. doi:10.1016/j.brainresbull.2006.07.010. PMID 17027775. S2CID 11273256.
^ ^a ^b ^c ^d ^e Bishop, John Mark (2014), History and Philosophy of Neural Networks (PDF)
^ Eberhart, R.C.; Dobbins, R.W. (1990). "Early neural network development history: the age of Camelot". IEEE Engineering in Medicine and Biology Magazine. 9 (3): 15–18. Bibcode:1990IEMBM...9c..15E. doi:10.1109/51.59207. ISSN 0739-5175. PMID 18238341.
^ Piccinini, Gualtiero (August 2004). "The First Computational Theory of Mind and Brain: A Close Look at McCulloch and Pitts's 'Logical Calculus of Ideas Immanent in Nervous Activity'". Synthese. 141 (2): 175–215. doi:10.1023/B:SYNT.0000043018.52445.3e. ISSN 0039-7857.
^ Kleene, S. C. (1956-12-31), Shannon, C. E.; McCarthy, J. (eds.), "Representation of Events in Nerve Nets and Finite Automata", Automata Studies (AM-34), Princeton University Press, pp. 3–42, doi:10.1515/9781400882618-002, ISBN 978-1-4008-8261-8, retrieved 2024-10-14
^ Buckner, Cameron; Garson, James (2025). "Connectionism". In Zalta, Edward N.; Nodelman, Uri (eds.). The Stanford Encyclopedia of Philosophy.
^ Webster, Craig S. (2012). "Alan Turing's unorganized machines and artificial neural networks: his remarkable early work and future possibilities". Evolutionary Intelligence. 5 (1): 35–43. doi:10.1007/s12065-011-0060-5. ISSN 1864-5909. Retrieved 2026-05-05.
^ ^a ^b Ulhaq, Anwaar (2021-05-07), Deep learning, past present and future: An odyssey, doi:10.31224/osf.io/vrmk4, retrieved 2026-05-05
^ Cios, Krzysztof J. (2017), Deep Neural Networks - A Brief History, arXiv:1701.05549
^ "History of artificial intelligence (AI)". Encyclopedia Britannica. Retrieved 2026-05-05.
^ ^a ^b ^c Olazaran Rodriguez, Jose Miguel (1991). A Historical Sociology of Neural Network Research (PDF) (Thesis). The University of Edinburgh. Retrieved 2026-05-05.
^ Hubel, David H.; Wiesel, Torsten N. (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6.
^ Wythoff, Barry J. (1993-02-01). "Backpropagation neural networks: A tutorial" (PDF). Chemometrics and Intelligent Laboratory Systems. 18 (2): 115–155. doi:10.1016/0169-7439(93)80052-J. ISSN 0169-7439.
^ Wu, Lei, Neural Networks: Architecture Design (PDF)
^ Widrow, Bernard; Lehr, Michael A. (September 1990). "30 years of adaptive neural networks: perceptron, Madaline, and backpropagation" (PDF). Proceedings of the IEEE. 78 (9): 1415–1442. Bibcode:1990IEEEP..78.1415W. doi:10.1109/5.58323.
^ "Neural Networks - History". Stanford University. Retrieved 2026-05-05.
^ "From boom to bust: the AI winter". OpenLearn - The Open University. 2024-01-29. Retrieved 2026-05-05.
^ Franklin, Stan (2014). "History, motivations, and core themes". In Frankish, Keith; Ramsey, William M. (eds.). The Cambridge Handbook of Artificial Intelligence. Cambridge: Cambridge University Press. pp. 15–33. ISBN 978-0-521-87142-6.
^ Munro, Paul W. (8 January 2003). Theory and Application of Backpropagation: A Handbook (PDF).
^ Rumelhart, David E.; Durbin, Richard; Golden, Richard; Chauvin, Yves (1995). "Backpropagation: The Basic Theory" (PDF). In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications.
^ Dreyfus, Stuart E. (1990). "Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure" (PDF). Journal of Guidance, Control, and Dynamics. 13 (5): 926–928. Bibcode:1990JGCD...13..926D. doi:10.2514/3.25422. ISSN 0731-5090. Retrieved 15 May 2026.
^ Mizutani, Eiji; Dreyfus, Stuart E.; Nishio, Kenichi (2000). "On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application". Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium. Vol. 2. IEEE. pp. 167–172. doi:10.1109/IJCNN.2000.857892. ISBN 978-0-7695-0619-7.
^ ^a ^b ^c Griewank, Andreas (1 January 2012), "Who invented the reverse mode of differentiation?", in Grötschel, Martin (ed.), Optimization Stories, Documenta Mathematica Series, vol. 6 (1 ed.), EMS Press, pp. 389–400, doi:10.4171/dms/6/38, ISBN 978-3-936609-58-5
^ ^a ^b Schmidhuber, Juergen (2015). "Deep Learning". Scholarpedia. 10 (11): 32832. Bibcode:2015SchpJ..1032832S. doi:10.4249/scholarpedia.32832. ISSN 1941-6016.
^ Anderson, James A.; Rosenfeld, Edward, eds. (2000). Talking Nets: An Oral History of Neural Networks. The MIT Press. doi:10.7551/mitpress/6626.003.0016. ISBN 978-0-262-26715-1.
^ Werbos, Paul (1974). Beyond regression: new tools for prediction and analysis in the behavioral sciences (PDF) (Thesis). Harvard University.
^ Werbos, P.J. (1990). "Backpropagation through time: what it does and how to do it" (PDF). Proceedings of the IEEE. 78 (10): 1550–1560. Bibcode:1990IEEEP..78.1550W. doi:10.1109/5.58337. Retrieved 18 May 2026.
^ ^a ^b ^c Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. Bibcode:2015NN.....61...85S. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
^ Brush, Stephen G. (1967). "History of the Lenz-Ising Model". Reviews of Modern Physics. 39 (4): 883–893. Bibcode:1967RvMP...39..883B. doi:10.1103/RevModPhys.39.883.
^ Espinosa-Sanchez, Juan Manuel; Gomez-Marin, Alex; de Castro, Fernando (2023-07-05). "The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics". The Neuroscientist. 31 (1): 14–30. doi:10.1177/10738584231179932. hdl:10261/348372. ISSN 1073-8584. PMID 37403768.
^ Larriva-Sahd, Jorge A. (2014-12-03). "Some predictions of Rafael Lorente de Nó 80 years later". Frontiers in Neuroanatomy. 8: 147. doi:10.3389/fnana.2014.00147. ISSN 1662-5129. PMC 4253658. PMID 25520630.
^ "reverberating circuit". Oxford Reference. Retrieved 2024-07-27.
^ Magnuson, James S. (2024-07-24). "Recurrent Neural Networks". Open Encyclopedia of Cognitive Science. doi:10.21428/e2759450.9e968b77.
^ Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991)" (PDF). Neural Computation. 4 (2): 234–242. Bibcode:1992NeCom...4..234S. doi:10.1162/neco.1992.4.2.234. S2CID 18271205. Archived from the original (PDF) on 2017-07-06.
^ Groumpos, Peter P. (2023-07-24). "A Critical Historic Overview of Artificial Intelligence: Issues, Challenges, Opportunities, and Threats". Artificial Intelligence and Applications. 1 (4): 181–197. doi:10.47852/bonviewAIA3202689. ISSN 2811-0854.
^ Hochreiter, S.; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks (PDF). John Wiley & Sons. ISBN 978-0-7803-5369-5.
^ Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
^ Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (2016-02-07). "Exploring the Limits of Language Modeling". arXiv:1602.02410 [cs.CL].
^ Fukushima, Kunihiko (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717. ISSN 1941-6016.
^ LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning" (PDF). Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
^ Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
^ Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017). "Efficient Processing of Deep Neural Networks: A Tutorial and Survey". arXiv:1703.09039 [cs.CV].
^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
^ Linn, Allison (2015-12-10). "Microsoft researchers win ImageNet computer vision challenge". The AI Blog. Microsoft. Retrieved 2024-06-29.
^ Schmidhuber, Jürgen (2020). "Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991)". Neural Networks. 127: 58–66. arXiv:1906.04493. doi:10.1016/j.neunet.2020.04.008. PMID 32334341. S2CID 216056336.
^ "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
^ "Prepare, Don't Panic: Synthetic Media and Deepfakes". WITNESS Media Lab. Archived from the original on 2 December 2020. Retrieved 25 November 2020.
^ Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN 978-0-19-530572-2.
^ Ba, Jimmy; Mnih, Volodymyr; Kavukcuoglu, Koray (2015-04-23). "Multiple Object Recognition with Visual Attention". arXiv:1412.7755 [cs.LG].
^ Koch, Christof; Ullman, Shimon (1987), Vaina, Lucia M. (ed.), "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience, Dordrecht: Springer Netherlands, pp. 115–141, doi:10.1007/978-94-009-3833-5_5, ISBN 978-94-009-3833-5
^ Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. arXiv:2204.13154. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
^ Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (1987-07-29). "A General Framework for Parallel Distributed Processing". Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations (PDF). Vol. 1. Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.
^ Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
^ Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhutdinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2048–2057. arXiv:1502.03044.
^ Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
^ Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Archived from the original on 20 March 2024. Retrieved 2024-08-06.
^ Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 [cs.CL].
^ Kohonen, Teuvo; Honkela, Timo (2007). "Kohonen Network". Scholarpedia. 2 (1): 1568. Bibcode:2007SchpJ...2.1568K. doi:10.4249/scholarpedia.1568.
^ Von der Malsburg, C (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.
^ Smolensky, Paul (1986). "Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory" (PDF). In Rumelhart, David E.; McClelland, James L. (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. pp. 194–281. ISBN 0-262-68053-X.
^ Sejnowski, Terrence J. (2018). The deep learning revolution. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03803-4.
^ Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory.". In D. E. Rumelhart; J. L. McClelland; PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press. pp. 194–281. ISBN 978-0-262-68053-0.
^ Hinton, Geoffrey (2009-05-31). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947. ISSN 1941-6016.
^ Watkin, Timothy L. H.; Rau, Albrecht; Biehl, Michael (1993-04-01). "The statistical mechanics of learning a rule". Reviews of Modern Physics. 65 (2): 499–556. Bibcode:1993RvMP...65..499W. doi:10.1103/RevModPhys.65.499. hdl:11370/02b0cd15-dfc5-4acb-9566-4ab937ee0d13.
^ Mead, Carver A.; Ismail, Mohammed (8 May 1989). Analog VLSI Implementation of Neural Systems (PDF). The Kluwer International Series in Engineering and Computer Science. Vol. 80. Norwell, MA: Kluwer Academic Publishers. doi:10.1007/978-1-4613-1639-8. ISBN 978-1-4613-1639-8.
^ Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nat. Nanotechnol. 3 (7): 429–433. Bibcode:2008NatNa...3..429Y. doi:10.1038/nnano.2008.160. PMID 18654568.
^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The missing memristor found". Nature. 453 (7191): 80–83. Bibcode:2008Natur.453...80S. doi:10.1038/nature06932. PMID 18451858. S2CID 4367148.

Primary sources

In the text, these references are preceded by a double dagger (‡):

^ ^a ^b Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review. 65 (6): 386–408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029. S2CID 12781225.
^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.
^ ^a ^b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
^ McCulloch, Warren; Pitts, Walter (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. Bibcode:1943BMaB....5..115M. doi:10.1007/BF02478259.
^ Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. ISBN 978-1-135-63190-1.
^ Farley, B.G.; Clark, W.A. (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. Bibcode:1954TIPIT...4...76F. doi:10.1109/TIT.1954.1057468.
^ Rochester, Nathaniel; Holland, J.H.; Habit, L.H.; Duda, W.L. (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. Bibcode:1956IRTIT...2...80R. doi:10.1109/TIT.1956.1056810.
^ Rosenblatt, Frank (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books. Retrieved 2026-05-05.
^ Block, H. D.; Knight, B. W. Jr.; Rosenblatt, F. (1962). "Analysis of a four-layer series-coupled perceptron. II". Reviews of Modern Physics. 34: 135.
^ Ivakhnenko, A. G.; Lapa, V. G. (1967). Cybernetics and Forecasting Techniques. American Elsevier Publishing Co. ISBN 978-0-444-00020-0.
^ Ivakhnenko, A.G. (March 1970). "Heuristic self-organization in problems of engineering cybernetics". Automatica. 6 (2): 207–219. Bibcode:1970Autom...6..207I. doi:10.1016/0005-1098(70)90092-0.
^ Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems" (PDF). IEEE Transactions on Systems, Man, and Cybernetics. SMC-1 (4): 364–378. Bibcode:1971ITSMC...1..364I. doi:10.1109/TSMC.1971.4308320. Archived (PDF) from the original on 2017-08-29. Retrieved 2019-11-05.
^ Amari, Shun'ichi (1967). "A theory of adaptive pattern classifier". IEEE Transactions. EC (16): 279–307.
^ Minsky, Marvin; Papert, Seymour (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.
^ Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.
^ Leibniz, Gottfried Wilhelm Freiherr von (1920). Child, J. M. (ed.). The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes. Open Court Publishing Company. ISBN 978-0-598-81846-1.
^ Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.
^ Bryson, Arthur E. (April 1961). A gradient method for optimizing multi-stage allocation processes. Proceedings of the Harvard Univ. Symposium on digital computers and their applications. Vol. 72. p. 22.
^ Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (PDF) (Masters) (in Finnish). University of Helsinki. pp. 6–7. Archived from the original (PDF) on 2013-12-06.
^ Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.
^ Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.
^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. ISSN 1476-4687.
^ Lenz, W. (1920). "Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern" (PDF). Physikalische Zeitschrift (in German). 21: 613–615.
^ Ising, E. (1925). "Beitrag zur Theorie des Ferromagnetismus". Zeitschrift für Physik (in German). 31 (1): 253–258. Bibcode:1925ZPhy...31..253I. doi:10.1007/BF02980577. S2CID 122157319.
^ Glauber, Roy J. (February 1963). "Time-Dependent Statistics of the Ising Model". Journal of Mathematical Physics. 4 (2): 294–307. doi:10.1063/1.1703954. Retrieved 2021-03-21.
^ Amari, Shun'ichi (November 1972). "Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements". IEEE Transactions on Computers. C-21 (11): 1197–1206. Bibcode:1972ITCmp.100.1197A. doi:10.1109/T-C.1972.223477. ISSN 0018-9340.
^ Hopfield, John Joseph (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.
^ de Nó, Rafael Lorente (1933-08-01). "Vestibulo-Ocular Reflex Arc". Archives of Neurology and Psychiatry. 30 (2): 245. doi:10.1001/archneurpsyc.1933.02240140009001. ISSN 0096-6754.
^ Schmidhuber, Jürgen (15 April 1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel (PDF) (Dr. rer. nat. habil. thesis). Archived from the original (PDF) on 6 July 2017. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.^{[verification needed]}
^ ^a ^b Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diploma thesis) (in German). Advisor: J. Schmidhuber. Institut f. Informatik, Technische Univ. Munich. Archived from the original (PDF) on 2015-03-06.
^ Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
^ Graves, Alex; Schmidhuber, Jürgen (2005-07-01). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks. IJCNN 2005. 18 (5): 602–610. CiteSeerX 10.1.1.331.5800. doi:10.1016/j.neunet.2005.06.042. PMID 16112549. S2CID 1856462.
^ Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "An Application of Recurrent Neural Networks to Discriminative Keyword Spotting". Proceedings of the 17th International Conference on Artificial Neural Networks. ICANN'07. Berlin, Heidelberg: Springer-Verlag. pp. 220–229. ISBN 978-3-540-74693-5.
^ Sak, Haşim; Senior, Andrew; Beaufays, Françoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling" (PDF). Google Research.
^ Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition". arXiv:1410.4281 [cs.CL].
^ Fan, Bo; Wang, Lijuan; Soong, Frank K.; Xie, Lei (2015). "Photo-Real Talking Head with Deep Bidirectional LSTM". Proceedings of ICASSP 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4884–8. doi:10.1109/ICASSP.2015.7178899. ISBN 978-1-4673-6997-8.
^ Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 2015). "Google voice search: faster and more accurate". Google Research.
^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks" (PDF). Electronic Proceedings of the Neural Information Processing Systems Conference. 27: 5346. arXiv:1409.3215. Bibcode:2014arXiv1409.3215S.
^ Gillick, Dan; Brunk, Cliff; Vinyals, Oriol; Subramanya, Amarnag (2015-11-30). "Multilingual Language Processing From Bytes". arXiv:1512.00103 [cs.CL].
^ ^a ^b ^c Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2014). "Show and Tell: A Neural Image Caption Generator". arXiv:1411.4555 [cs.CV].
^ Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Archived from the original (PDF) on 3 June 2014. Retrieved 16 November 2013.
^ ^a ^b Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. Bibcode:1969ITSSC...5..322F. doi:10.1109/TSSC.1969.300225.
^ ^a ^b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks (PDF). Meeting of the Institute of Electronics, Information and Communication Engineers (IEICE). Tokyo, Japan.
^ Waibel, Alexander; Hanazawa, Toshiyuki; Hinton, Geoffrey; Shikano, Kiyohiro; Lang, Kevin J. (March 1989). "Phoneme recognition using time-delay neural networks" (PDF). IEEE Transactions on Acoustics, Speech, and Signal Processing. 37 (3): 328–339. Bibcode:1989ITASS..37..328W. doi:10.1109/29.21701.
^ Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
^ Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
^ Fukushima, Kunihiko; Miyake, Sei (1982-01-01). "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position". Pattern Recognition. 15 (6): 455–469. Bibcode:1982PatRe..15..455F. doi:10.1016/0031-3203(82)90024-3. ISSN 0031-3203.
^ LeCun, Y.; et al. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation. 1: 541–551.
^ LeCun, Yann; Boser, Bernhard; Denker, John; Henderson, Donnie; Howard, R.; Hubbard, Wayne; Jackel, Lawrence (1989). "Handwritten Digit Recognition with a Back-Propagation Network". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
^ Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.
^ Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.
^ Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (June 1992). "Cresceptron: a self-organizing neural network which grows adaptively" (PDF). Proceedings of the International Joint Conference on Neural Networks. Vol. 1. Baltimore, MD, USA: IEEE. pp. 576–581. doi:10.1109/IJCNN.1992.287150.
^ Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (May 1993). "Learning recognition and segmentation of 3-D objects from 2-D images" (PDF). Proceedings of the 4th International Conference on Computer Vision. Berlin, Germany: IEEE. pp. 121–128. doi:10.1109/ICCV.1993.378228.
^ Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (November 1997). "Learning recognition and segmentation using the Cresceptron" (PDF). International Journal of Computer Vision. 25 (2): 105–139. doi:10.1023/A:1007967800668.
^ Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". 1993 (4th) International Conference on Computer Vision. pp. 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.
^ LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. Bibcode:1998IEEEP..86.2278L. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
^ Scherer, Dominik; Müller, Andreas; Behnke, Sven (2010). "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition" (PDF). In Diamantaras, Konstantinos; Duch, Wlodek; Iliadis, Lazaros S. (eds.). Artificial Neural Networks – ICANN 2010. Lecture Notes in Computer Science. Vol. 6354. Berlin, Heidelberg: Springer. pp. 92–101. doi:10.1007/978-3-642-15825-4_10. ISBN 978-3-642-15825-4.
^ Riedmiller, Martin; Braun, Heinrich (November 1992). "Rprop – A Fast Adaptive Learning Algorithm". Proceedings of the International Symposium on Computer and Information Science VII.
^ Riedmiller, M.; Braun, H. (1993). "A direct adaptive method for faster backpropagation learning: The RPROP algorithm". IEEE International Conference on Neural Networks. IEEE. pp. 586–591. doi:10.1109/icnn.1993.298623. ISBN 0-7803-0999-5.
^ Behnke, Sven (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
^ ^a ^b Oh, K.-S.; Jung, K. (2004). "GPU implementation of neural networks". Pattern Recognition. 37 (6): 1311–1314. Bibcode:2004PatRe..37.1311O. doi:10.1016/j.patcog.2004.01.013.
^ ^a ^b Chellapilla, Kumar; Puri, Sidd; Simard, Patrice (2006), High performance convolutional neural networks for document processing, archived from the original on 2020-05-18, retrieved 2021-02-14
^ Raina, Rajat; Madhavan, Anand; Ng, Andrew Y. (2009-06-14). "Large-scale deep unsupervised learning using graphics processors". Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. New York, NY, USA: Association for Computing Machinery. pp. 873–880. doi:10.1145/1553374.1553486. ISBN 978-1-60558-516-1.
^ Cireşan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jürgen (21 September 2010). "Deep, Big, Simple Neural Nets for Handwritten Digit Recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. Bibcode:2010NeCom..22.3207C. doi:10.1162/neco_a_00052. ISSN 0899-7667. PMID 20858131. S2CID 1918673.
^ Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L.M.; Schmidhuber, J. (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). International Joint Conference on Artificial Intelligence. doi:10.5591/978-1-57735-516-8/ijcai11-210. Archived (PDF) from the original on 2014-09-29. Retrieved 2017-06-13.
^ Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Jürgen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851. Archived (PDF) from the original on 2017-08-09. Retrieved 2017-06-13.
^ Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. (2013). "Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks". Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Lecture Notes in Computer Science. Vol. 7908. pp. 411–418. doi:10.1007/978-3-642-40763-5_51. ISBN 978-3-642-38708-1. PMID 24579167.
^ Ciresan, D.; Meier, U.; Schmidhuber, J. (2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3642–3649. arXiv:1202.2745. doi:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-1228-8. S2CID 2161592.
^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Archived (PDF) from the original on 2017-01-10. Retrieved 2017-05-24.
^ ^a ^b Simonyan, Karen; Zisserman, Andrew (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].
^ Szegedy, Christian (2015). "Going deeper with convolutions" (PDF). CVPR 2015. arXiv:1409.4842.
^ Fang, Hao; Gupta, Saurabh; Iandola, Forrest; Srivastava, Rupesh; Deng, Li; Dollár, Piotr; Gao, Jianfeng; He, Xiaodong; Mitchell, Margaret; Platt, John C; Lawrence Zitnick, C; Zweig, Geoffrey (2014). "From Captions to Visual Concepts and Back". arXiv:1411.4952 [cs.CV].
^ Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Richard S (2014). "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models". arXiv:1411.2539 [cs.LG].
^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
^ Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
^ Schmidhuber, Jürgen (November 1992). "Learning Factorial Codes by Predictability Minimization". Neural Computation. 4 (6): 863–879. doi:10.1162/neco.1992.4.6.863. S2CID 42023620.
^ Schmidhuber, Jürgen; Eldracher, Martin; Foltin, Bernhard (1996). "Semilinear predictability minimization produces well-known feature detectors". Neural Computation. 8 (4): 773–786. doi:10.1162/neco.1996.8.4.773. S2CID 16154391.
^ Niemitalo, Olli (February 24, 2010). "A method for training artificial neural networks to generate missing data within a variable context". Internet Archive (Wayback Machine). Archived from the original on March 12, 2012. Retrieved February 22, 2019.
^ Li, Wei; Gauci, Melvin; Gross, Roderich (July 6, 2013). "A coevolutionary approach to learn animal behavior through controlled interaction". Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013). Amsterdam, the Netherlands: ACM. pp. 223–230. doi:10.1145/2463372.2465801. ISBN 978-1-4503-1963-8.
^ Gutmann, Michael; Hyvärinen, Aapo. "Noise-Contrastive Estimation" (PDF). International Conference on AI and Statistics.
^ Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
^ Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. (26 February 2018). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196 [cs.NE].
^ Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265. arXiv:1503.03585.
^ Cherry EC (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears" (PDF). The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3. ISSN 0001-4966.
^ Broadbent, Donald E. (1958). Perception and Communication (PDF). London: Pergamon Press.
^ Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN 0042-6989. PMID 7660596.
^ Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985–4992. Bibcode:1987ApOpt..26.4985F. doi:10.1364/AO.26.004985. ISSN 0003-6935. PMID 20523477.
^ Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978. doi:10.1364/AO.26.004972. ISSN 0003-6935. PMID 20523475.
^ Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
^ Schmidhuber, Jürgen (January 1992). "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. ISSN 0899-7667.
^ Ha, David; Dai, Andrew; Le, Quoc V. (2016-12-01). "HyperNetworks". arXiv:1609.09106 [cs.LG].
^ Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 1724–1734. arXiv:1406.1078. doi:10.3115/v1/D14-1179.
^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 Dec 2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].
^ Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
^ Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014-12-10). "Neural Turing Machines". arXiv:1410.5401 [cs.NE].
^ Cheng, Jianpeng; Dong, Li; Lapata, Mirella (2016-09-20). "Long Short-Term Memory-Networks for Machine Reading". arXiv:1601.06733 [cs.CL].
^ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016-09-25). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933 [cs.CL].
^ Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics. 43 (1): 59–69. doi:10.1007/bf00337288. S2CID 206775459.
^ Ackley, David H.; Hinton, Geoffrey E.; Sejnowski, Terrence J. (1985-01-01). "A learning algorithm for boltzmann machines". Cognitive Science. 9 (1): 147–169. doi:10.1016/S0364-0213(85)80012-4. ISSN 0364-0213.
^ Dayan, Peter; Hinton, Geoffrey E.; Neal, Radford M.; Zemel, Richard S. (1995). "The Helmholtz machine". Neural Computation. 7 (5): 889–904. doi:10.1162/neco.1995.7.5.889. hdl:21.11116/0000-0002-D6D3-E. PMID 7584891. S2CID 1890561.
^ Hinton, Geoffrey E.; Dayan, Peter; Frey, Brendan J.; Neal, Radford (1995-05-26). "The wake-sleep algorithm for unsupervised neural networks". Science. 268 (5214): 1158–1161. Bibcode:1995Sci...268.1158H. doi:10.1126/science.7761831. PMID 7761831. S2CID 871473.
^ Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
^ Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning". arXiv:1112.6209 [cs.LG].
^ Schwarze, H; Hertz, J (1992-10-15). "Generalization in a Large Committee Machine". Europhysics Letters. 20 (4): 375–380. Bibcode:1992EL.....20..375S. doi:10.1209/0295-5075/20/4/015. ISSN 0295-5075.
^ Mato, G; Parga, N (1992-10-07). "Generalization properties of multilayered neural networks". Journal of Physics A: Mathematical and General. 25 (19): 5047–5054. Bibcode:1992JPhA...25.5047M. doi:10.1088/0305-4470/25/19/017. ISSN 0305-4470.
^ Hansel, D; Mato, G; Meunier, C (1992-11-01). "Memorization Without Generalization in a Multilayered Neural Network". Europhysics Letters. 20 (5): 471–476. Bibcode:1992EL.....20..471H. doi:10.1209/0295-5075/20/5/015. ISSN 0295-5075.
^ Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. Bibcode:1992NeCom...4..234S. doi:10.1162/neco.1992.4.2.234. S2CID 18271205. Archived from the original (PDF) on 2017-07-06.
^ Hanson, Stephen; Pratt, Lorien (1988). "Comparing Biases for Minimal Network Construction with Back-Propagation". Advances in Neural Information Processing Systems. 1. Morgan-Kaufmann.
^ LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.

External links

"Lecun 2019-7-11 ACM Tech Talk". Google Docs. Retrieved 2020-02-13.

[11] The simplest feedforward neural network consists of a single layer of output nodes without any nonlinear activation functions; the inputs are fed directly to the outputs via a series of weights. Each node computes the sum of the products of its weights and the inputs. The weights are then adjusted to minimize the mean squared error between these computed outputs and the given target values.

[17] Neurons generate an action potential based on the sum of their incoming chemical inputs; the action potential triggers the release of neurotransmitters, which act as chemical inputs to other neurons.

[2] Crevier, Daniel (1993). AI: The Tumultuous Search for Artificial Intelligence. New York, NY: BasicBooks. ISBN 0-465-02997-3.

[:1-4] Gershgorn, Dave (26 July 2017). "The data that transformed AI research—and possibly the world". Quartz.

[legendre18052-6] Merriman, Mansfield. A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes. Vol. 4. Academy, 1877.

[gauss17952-7] Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474. Bibcode:1981AnSta...945451S. doi:10.1214/aos/1176345451.

[adrain-8] Dutka, Jacques (1990). "Robert Adrain and the Method of Least Squares". Archive for History of Exact Sciences. 41 (2). Springer Nature: 171–184. doi:10.1007/BF00411864. JSTOR 41133885.

[timeseries-9] White, Halbert (1990). "Least Squares". Time Series and Statistics. Palgrave Macmillan UK. pp. 118–125. doi:10.1007/978-1-349-20865-4_15. ISBN 978-0-333-49551-3.

[schmidhuber-annotated-history-10] Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].

[qamar2023-12] Roheen Qamar, Baqar Ali Zardari (2 August 2023). "Artificial Neural Networks: An Overview". Mesopotamian Journal of Computer Science. 2023. Mesopotamian Academic Press: 124–133. doi:10.58496/mjcsc/2023/015.

[macukow2016-13] Macukow, Bohdan (2016). Neural Networks – State of Art, Brief History, Basic Models and Architecture. Computer Information Systems and Industrial Management. Springer International Publishing. doi:10.1007/978-3-319-45378-1_1.

[López-Muñoz-14] López-Muñoz F, Boya J, Alamo C (October 2006). "Neuron theory, the cornerstone of neuroscience, on the centenary of the Nobel Prize award to Santiago Ramón y Cajal". Brain Research Bulletin. 70 (4–6): 391–405. doi:10.1016/j.brainresbull.2006.07.010. PMID 17027775. S2CID 11273256.

[bishop2014-15] Bishop, John Mark (2014), History and Philosophy of Neural Networks (PDF)

[eberhart1990-16] Eberhart, R.C.; Dobbins, R.W. (1990). "Early neural network development history: the age of Camelot". IEEE Engineering in Medicine and Biology Magazine. 9 (3): 15–18. Bibcode:1990IEMBM...9c..15E. doi:10.1109/51.59207. ISSN 0739-5175. PMID 18238341.

[19] Piccinini, Gualtiero (August 2004). "The First Computational Theory of Mind and Brain: A Close Look at McCulloch and Pitts's 'Logical Calculus of Ideas Immanent in Nervous Activity'". Synthese. 141 (2): 175–215. doi:10.1023/B:SYNT.0000043018.52445.3e. ISSN 0039-7857.

[20] Kleene, S. C. (1956-12-31), Shannon, C. E.; McCarthy, J. (eds.), "Representation of Events in Nerve Nets and Finite Automata", Automata Studies. (AM-34), Princeton University Press, pp. 3–42, doi:10.1515/9781400882618-002, ISBN 978-1-4008-8261-8, retrieved 2024-10-14

[sep-ctm-21] Buckner, Cameron; Garson, James (2025). "Connectionism". In Zalta, Edward N.; Nodelman, Uri (eds.). The Stanford Encyclopedia of Philosophy.

[webster2011-22] Webster, Craig S. (2012). "Alan Turing's unorganized machines and artificial neural networks: his remarkable early work and future possibilities". Evolutionary Intelligence. 5 (1): 35–43. doi:10.1007/s12065-011-0060-5. ISSN 1864-5909. Retrieved 2026-05-05.

[ulhaq-24] Ulhaq, Anwaar (2021-05-07), Deep learning, past present and future: An odyssey, doi:10.31224/osf.io/vrmk4, retrieved 2026-05-05

[cios-25] Cios, Krzysztof J. (2017), Deep Neural Networks - A Brief History, arXiv:1701.05549

[27] "History of artificial intelligence (AI)". Encyclopedia Britannica. Retrieved 2026-05-05.

[olazaran1991-29] Olazaran Rodriguez, Jose Miguel (1991). A Historical Sociology of Neural Network Research (PDF) (Thesis). The University of Edinburgh. Retrieved 2026-05-05.

[30] David H. Hubel and Torsten N. Wiesel (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6.

[31] Wythoff, Barry J. (1993-02-01). "Backpropagation neural networks: A tutorial" (PDF). Chemometrics and Intelligent Laboratory Systems. 18 (2): 115–155. doi:10.1016/0169-7439(93)80052-J. ISSN 0169-7439.

[wu2025slides-32] Wu, Lei, Neural Networks: Architecture Design (PDF)

[35] Widrow, Bernard; Lehr, Michael A. (September 1990). "30 years of adaptive neural networks: perceptron, Madaline, and backpropagation" (PDF). Proceedings of the IEEE. 78 (9): 1415–1442. Bibcode:1990IEEEP..78.1415W. doi:10.1109/5.58323.

[nnhistory-36] "Neural Networks - History". Stanford University. Retrieved 2026-05-05.

[42] "From boom to bust: the AI winter". OpenLearn - The Open University. 2024-01-29. Retrieved 2026-05-05.

[43] Franklin, Stan (2014). "History, motivations, and core themes". In Frankish, Keith; Ramsey, William M. (eds.). The Cambridge Handbook of Artificial Intelligence. Cambridge: Cambridge University Press. pp. 15–33. ISBN 978-0-521-87142-6.

[44] Munro, Paul W. (8 January 2003). Theory and Application of Backpropagation: A Handbook (PDF).

[bptheory-46] Rumelhart, David E.; Durbin, Richard; Golden, Richard; Chauvin, Yves (1995). "Backpropagation: The Basic Theory" (PDF). In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications.

[50] Dreyfus, Stuart E. (1990). "Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure" (PDF). Journal of Guidance, Control, and Dynamics. 13 (5): 926–928. Bibcode:1990JGCD...13..926D. doi:10.2514/3.25422. ISSN 0731-5090. Retrieved 15 May 2026.

[51] Mizutani, Eiji; Dreyfus, Stuart E.; Nishio, Kenichi (2000). "On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application". Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium. Vol. 2. IEEE. pp. 167–172. doi:10.1109/IJCNN.2000.857892. ISBN 978-0-7695-0619-7.

[griewank2012-53] Griewank, Andreas (1 January 2012), "Who invented the reverse mode of differentiation?", in Grötschel, Martin (ed.), Optimization Stories, Documenta Mathematica Series, vol. 6 (1 ed.), EMS Press, pp. 389–400, doi:10.4171/dms/6/38, ISBN 978-3-936609-58-5

[scholarpedia-deeplearning-54] Schmidhuber, Juergen (2015). "Deep Learning". Scholarpedia. 10 (11): 32832. Bibcode:2015SchpJ..1032832S. doi:10.4249/scholarpedia.32832. ISSN 1941-6016.

[:14-56] Anderson, James A.; Rosenfeld, Edward, eds. (2000). Talking Nets: An Oral History of Neural Networks. The MIT Press. doi:10.7551/mitpress/6626.003.0016. ISBN 978-0-262-26715-1.

[57] Werbos, Paul (1974). Beyond regression: new tools for prediction and analysis in the behavioral sciences (PDF) (Thesis). Harvard University.

[58] Werbos, P.J. (1990). "Backpropagation through time: what it does and how to do it" (PDF). Proceedings of the IEEE. 78 (10): 1550–1560. Bibcode:1990IEEEP..78.1550W. doi:10.1109/5.58337. Retrieved 18 May 2026.

[SCHIDHUB3-61] Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. Bibcode:2015NN.....61...85S. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.

[64] Brush, Stephen G. (1967). "History of the Lenz-Ising Model". Reviews of Modern Physics. 39 (4): 883–893. Bibcode:1967RvMP...39..883B. doi:10.1103/RevModPhys.39.883.

[68] Espinosa-Sanchez, Juan Manuel; Gomez-Marin, Alex; de Castro, Fernando (2023-07-05). "The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics". The Neuroscientist. 31 (1): 14–30. doi:10.1177/10738584231179932. hdl:10261/348372. ISSN 1073-8584. PMID 37403768.

[70] Larriva-Sahd, Jorge A. (2014-12-03). "Some predictions of Rafael Lorente de Nó 80 years later". Frontiers in Neuroanatomy. 8: 147. doi:10.3389/fnana.2014.00147. ISSN 1662-5129. PMC 4253658. PMID 25520630.

[71] "reverberating circuit". Oxford Reference. Retrieved 2024-07-27.

[72] Magnuson, James S. (2024-07-24). "Recurrent Neural Networks". Open Encyclopedia of Cognitive Science. doi:10.21428/e2759450.9e968b77.

[schmidhuber19923-74] Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991)" (PDF). Neural Computation. 4 (2): 234–242. Bibcode:1992NeCom...4..234S. doi:10.1162/neco.1992.4.2.234. S2CID 18271205. Archived from the original (PDF) on 2017-07-06.

[75] Groumpos, Peter P. (2023-07-24). "A Critical Historic Overview of Artificial Intelligence: Issues, Challenges, Opportunities, and Threats". Artificial Intelligence and Applications. 1 (4): 181–197. doi:10.47852/bonviewAIA3202689. ISSN 2811-0854.

[HOCH20012-77] Hochreiter, S.; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks (PDF). John Wiley & Sons. ISBN 978-0-7803-5369-5.

[lstm2-78] Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014. Q98967430.

[vinyals20162-87] Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (2016-02-07). "Exploring the Limits of Language Modeling". arXiv:1602.02410 [cs.CL].

[fukuneoscholar-90] Fukushima, Kunihiko (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717. ISSN 1941-6016.

[92] LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning" (PDF). Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.

[94] Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].

[sze2017-115] Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017). "Efficient Processing of Deep Neural Networks: A Tutorial and Survey". arXiv:1703.09039 [cs.CV].

[prelu2-127] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].

[resnet2-128] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.

[131] Linn, Allison (2015-12-10). "Microsoft researchers win ImageNet computer vision challenge". The AI Blog. Microsoft. Retrieved 2024-06-29.

[gancurpm20202-133] Schmidhuber, Jürgen (2020). "Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991)". Neural Networks. 127: 58–66. arXiv:1906.04493. doi:10.1016/j.neunet.2020.04.008. PMID 32334341. S2CID 216056336.

[SyncedReview20182-140] "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.

[142] "Prepare, Don't Panic: Synthetic Media and Deepfakes". WITNESS Media Lab. Archived from the original on 2 December 2020. Retrieved 25 November 2020.

[144] Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "1 Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN 978-0-19-530572-2.

[148] Ba, Jimmy; Mnih, Volodymyr; Kavukcuoglu, Koray (2015-04-23). "Multiple Object Recognition with Visual Attention". arXiv:1412.7755 [cs.LG].

[150] Koch, Christof; Ullman, Shimon (1987), Vaina, Lucia M. (ed.), "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience, Dordrecht: Springer Netherlands, pp. 115–141, doi:10.1007/978-94-009-3833-5_5, ISBN 978-94-009-3833-5

[:12-151] Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. arXiv:2204.13154. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.

[PDP-154] Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (1987-07-29). "A General Framework for Parallel Distributed Computing". Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations (PDF). Vol. 1. Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.

[:03-157] Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.

[160] Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhutdinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2048–2057. arXiv:1502.03044.

[:23-161] Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].

[166] Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Archived from the original on 20 March 2024. Retrieved 2024-08-06.

[167] Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 [cs.CL].

[KohonenMap-168] Kohonen, Teuvo; Honkela, Timo (2007). "Kohonen Network". Scholarpedia. 2 (1): 1568. Bibcode:2007SchpJ...2.1568K. doi:10.4249/scholarpedia.1568.

[170] Von der Malsburg, C (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.

[172] Smolensky, Paul (1986). "Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory" (PDF). In Rumelhart, David E.; McClelland, James L. (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. pp. 194–281. ISBN 0-262-68053-X.

[175] Sejnowski, Terrence J. (2018). The deep learning revolution. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03803-4.

[smolensky1986-176] Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory". In D. E. Rumelhart; J. L. McClelland; PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press. pp. 194–281. ISBN 978-0-262-68053-0.

[hinton2009-178] Hinton, Geoffrey (2009-05-31). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947. ISSN 1941-6016.

[180] Watkin, Timothy L. H.; Rau, Albrecht; Biehl, Michael (1993-04-01). "The statistical mechanics of learning a rule". Reviews of Modern Physics. 65 (2): 499–556. Bibcode:1993RvMP...65..499W. doi:10.1103/RevModPhys.65.499. hdl:11370/02b0cd15-dfc5-4acb-9566-4ab937ee0d13.

[Mead-187] Mead, Carver A.; Ismail, Mohammed (8 May 1989). Analog VLSI Implementation of Neural Systems (PDF). The Kluwer International Series in Engineering and Computer Science. Vol. 80. Norwell, MA: Kluwer Academic Publishers. doi:10.1007/978-1-4613-1639-8. ISBN 978-1-4613-1639-8.

[188] Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nat. Nanotechnol. 3 (7): 429–433. Bibcode:2008NatNa...3..429Y. doi:10.1038/nnano.2008.160. PMID 18654568.

[189] Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The missing memristor found". Nature. 453 (7191): 80–83. Bibcode:2008Natur.453...80S. doi:10.1038/nature06932. PMID 18451858. S2CID 4367148.

[rosenblatt-1959-1] Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review. 65 (6): 386–408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029. S2CID 12781225.

[3] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.

[2017_Attention_Is_All_You_Need-5] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.

[18] McCulloch, Warren; Pitts, Walter (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. Bibcode:1943BMaB....5..115M. doi:10.1007/BF02478259.

[23] Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. ISBN 978-1-135-63190-1.

[26] Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. Bibcode:1954TIPIT...4...76F. doi:10.1109/TIT.1954.1057468.

[28] Rochester, Nathaniel; Holland, J.H.; Habit, L.H.; Duda, W.L. (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. Bibcode:1956IRTIT...2...80R. doi:10.1109/TIT.1956.1056810.

[33] Rosenblatt, Frank (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books. Retrieved 2026-05-05.

[block1962-34] Block, H. D.; Knight, B. W. Jr.; Rosenblatt, F. (1962). "Analysis of a four-layer series-coupled perceptron. II". Reviews of Modern Physics. 34: 135.

[ivak19652-37] Ivakhnenko, A. G.; Lapa, V. G. (1967). Cybernetics and Forecasting Techniques. American Elsevier Publishing Co. ISBN 978-0-444-00020-0.

[38] Ivakhnenko, A.G. (March 1970). "Heuristic self-organization in problems of engineering cybernetics". Automatica. 6 (2): 207–219. Bibcode:1970Autom...6..207I. doi:10.1016/0005-1098(70)90092-0.

[ivak1971-39] Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems" (PDF). IEEE Transactions on Systems, Man, and Cybernetics. SMC-1 (4): 364–378. Bibcode:1971ITSMC...1..364I. doi:10.1109/TSMC.1971.4308320. Archived (PDF) from the original on 2017-08-29. Retrieved 2019-11-05.

[Amari19672-40] Amari, Shun'ichi (1967). "A theory of adaptive pattern classifier". IEEE Transactions. EC (16): 279–307.

[41] Minsky, Marvin; Papert, Seymour (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.

[rosenblatt19622-45] Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.

[leibniz16762-47] Leibniz, Gottfried Wilhelm Freiherr von (1920). Child, J. M. (ed.). The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes. Open Court Publishing Company. ISBN 978-0-598-81846-1.

[kelley19602-48] Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.

[49] Bryson, Arthur E. (April 1961). A gradient method for optimizing multi-stage allocation processes. Proceedings of the Harvard Univ. Symposium on digital computers and their applications. Vol. 72. p. 22.

[lin19703-52] Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (PDF) (Masters) (in Finnish). University of Helsinki. pp. 6–7. Archived from the original (PDF) on 2013-12-06.

[lin19763-55] Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.

[werbos19823-59] Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.

[60] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. ISSN 1476-4687.

[lenz1920-62] Lenz, W. (1920). "Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern" (PDF). Physikalische Zeitschrift (in German). 21: 613–615.

[ising1925-63] Ising, E. (1925). "Beitrag zur Theorie des Ferromagnetismus". Zeitschrift für Physik (in German). 31 (1): 253–258. Bibcode:1925ZPhy...31..253I. doi:10.1007/BF02980577. S2CID 122157319.

[:22-65] Glauber, Roy J. (February 1963). "Time-Dependent Statistics of the Ising Model". Journal of Mathematical Physics. 4 (2): 294–307. doi:10.1063/1.1703954. Retrieved 2021-03-21.

[66] Amari, Shun'ichi (November 1972). "Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements". IEEE Transactions on Computers. C-21 (11): 1197–1206. Bibcode:1972ITCmp.100.1197A. doi:10.1109/T-C.1972.223477. ISSN 0018-9340.

[Hopfield19822-67] Hopfield, John Joseph (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.

[69] Nó, Rafael Lorente (1933-08-01). "Vestibulo-Ocular Reflex Arc". Archives of Neurology and Psychiatry. 30 (2): 245. doi:10.1001/archneurpsyc.1933.02240140009001. ISSN 0096-6754.

[schmidhuber19933-73] Schmidhuber, Jürgen (15 April 1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel (PDF) (Dr. rer. nat. habil. thesis). Archived from the original (PDF) on 6 July 2017. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.^{[verification needed]}

[HOCH19912-76] Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diploma thesis) (in German). Advisor: J. Schmidhuber. Institut f. Informatik, Technische Univ. Munich. Archived from the original (PDF) on 2015-03-06.

[lstm19992-79] Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.

[80] Graves, Alex; Schmidhuber, Jürgen (2005-07-01). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks. IJCNN 2005. 18 (5): 602–610. CiteSeerX 10.1.1.331.5800. doi:10.1016/j.neunet.2005.06.042. PMID 16112549. S2CID 1856462.

[fernandez2007keyword2-81] Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "An Application of Recurrent Neural Networks to Discriminative Keyword Spotting". Proceedings of the 17th International Conference on Artificial Neural Networks. ICANN'07. Berlin, Heidelberg: Springer-Verlag. pp. 220–229. ISBN 978-3-540-74693-5.

[sak20142-82] Sak, Haşim; Senior, Andrew; Beaufays, Françoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling" (PDF). Google Research.

[liwu20152-83] Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition". arXiv:1410.4281 [cs.CL].

[fan20152-84] Fan, Bo; Wang, Lijuan; Soong, Frank K.; Xie, Lei (2015). "Photo-Real Talking Head with Deep Bidirectional LSTM". Proceedings of ICASSP 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4884–8. doi:10.1109/ICASSP.2015.7178899. ISBN 978-1-4673-6997-8.

[sak20152-85] Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 2015). "Google voice search: faster and more accurate". Google Research.

[sutskever20142-86] Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks" (PDF). Electronic Proceedings of the Neural Information Processing Systems Conference. 27: 5346. arXiv:1409.3215. Bibcode:2014arXiv1409.3215S.

[gillick20152-88] Gillick, Dan; Brunk, Cliff; Vinyals, Oriol; Subramanya, Amarnag (2015-11-30). "Multilingual Language Processing From Bytes". arXiv:1512.00103 [cs.CL].

[1411.4555-89] Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2014). "Show and Tell: A Neural Image Caption Generator". arXiv:1411.4555 [cs.CV].

[intro-91] Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Archived from the original (PDF) on 3 June 2014. Retrieved 16 November 2013.

[Fukushima1969-93] Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. Bibcode:1969ITSSC...5..322F. doi:10.1109/TSSC.1969.300225.

[Waibel1987-95] Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks (PDF). Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.

[speechsignal-96] Waibel, Alexander; Hanazawa, Toshiyuki; Hinton, Geoffrey; Shikano, Kiyohiro; Lang, Kevin J. (March 1989). "Phoneme recognition using time-delay neural networks" (PDF). IEEE Transactions on Acoustics, Speech, and Signal Processing. 37 (3): 328–339. Bibcode:1989ITASS..37..328W. doi:10.1109/29.21701.

[wz1988-97] Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.

[wz1990-98] Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.

[99] Fukushima, Kunihiko; Miyake, Sei (1982-01-01). "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position". Pattern Recognition. 15 (6): 455–469. Bibcode:1982PatRe..15..455F. doi:10.1016/0031-3203(82)90024-3. ISSN 0031-3203.

[LECUN1989-100] LeCun, Yann; et al. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation. 1: 541–551.

[101] LeCun, Yann; Boser, Bernhard; Denker, John; Henderson, Donnie; Howard, R.; Hubbard, Wayne; Jackel, Lawrence (1989). "Handwritten Digit Recognition with a Back-Propagation Network". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.

[102] Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.

[103] Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.

[Weng1992-104] Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (June 1992). "Cresceptron: a self-organizing neural network which grows adaptively" (PDF). Proceedings of the International Joint Conference on Neural Networks. Vol. 1. Baltimore, MD, USA: IEEE. pp. 576–581. doi:10.1109/IJCNN.1992.287150.

[Weng19932-105] Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (May 1993). "Learning recognition and segmentation of 3-D objects from 2-D images" (PDF). Proceedings of the 4th International Conference on Computer Vision. Berlin, Germany: IEEE. pp. 121–128. doi:10.1109/ICCV.1993.378228.

[Weng1997-106] Weng, Juyang; Ahuja, Narendra; Huang, Thomas S. (November 1997). "Learning recognition and segmentation using the Cresceptron" (PDF). International Journal of Computer Vision. 25 (2): 105–139. doi:10.1023/A:1007967800668.

[weng1993-107] Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". 1993 (4th) International Conference on Computer Vision. pp. 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.

[lecun98-108] LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. Bibcode:1998IEEEP..86.2278L. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.

[Scherer2010-109] Scherer, Dominik; Müller, Andreas; Behnke, Sven (2010). "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition" (PDF). In Diamantaras, Konstantinos; Duch, Wlodek; Iliadis, Lazaros S. (eds.). Artificial Neural Networks – ICANN 2010. Lecture Notes in Computer Science. Vol. 6354. Berlin, Heidelberg: Springer. pp. 92–101. doi:10.1007/978-3-642-15825-4_10. ISBN 978-3-642-15825-4.

[110] Riedmiller, Martin; Braun, Heinrich (November 1992). "Rprop – A Fast Adaptive Learning Algorithm". Proceedings of the International Symposium on Computer and Information Science VII.

[riedmiller1992-111] Riedmiller, M.; Braun, H. (1993). "A direct adaptive method for faster backpropagation learning: The RPROP algorithm". IEEE International Conference on Neural Networks. IEEE. pp. 586–591. doi:10.1109/icnn.1993.298623. ISBN 0-7803-0999-5.

[112] Behnke, Sven (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.

[jung2004-113] Oh, K.-S.; Jung, K. (2004). "GPU implementation of neural networks". Pattern Recognition. 37 (6): 1311–1314. Bibcode:2004PatRe..37.1311O. doi:10.1016/j.patcog.2004.01.013.

[chellapilla2006-114] Chellapilla, Kumar; Puri, Sidd; Simard, Patrice (2006), High performance convolutional neural networks for document processing, archived from the original on 2020-05-18, retrieved 2021-02-14

[116] Raina, Rajat; Madhavan, Anand; Ng, Andrew Y. (2009-06-14). "Large-scale deep unsupervised learning using graphics processors". Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. New York, NY, USA: Association for Computing Machinery. pp. 873–880. doi:10.1145/1553374.1553486. ISBN 978-1-60558-516-1.

[:32-117] Cireşan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jürgen (21 September 2010). "Deep, Big, Simple Neural Nets for Handwritten Digit Recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. Bibcode:2010NeCom..22.3207C. doi:10.1162/neco_a_00052. ISSN 0899-7667. PMID 20858131. S2CID 1918673.

[:62-118] Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L.M.; Schmidhuber, J. (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). International Joint Conference on Artificial Intelligence. doi:10.5591/978-1-57735-516-8/ijcai11-210. Archived (PDF) from the original on 2014-09-29. Retrieved 2017-06-13.

[:82-119] Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Jürgen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851. Archived (PDF) from the original on 2017-08-09. Retrieved 2017-06-13.

[ciresan2013miccai-120] Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. (2013). "Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks". Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Lecture Notes in Computer Science. Vol. 7908. pp. 411–418. doi:10.1007/978-3-642-40763-5_51. ISBN 978-3-642-38708-1. PMID 24579167.

[:9-121] Ciresan, D.; Meier, U.; Schmidhuber, J. (2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3642–3649. arXiv:1202.2745. doi:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-1228-8. S2CID 2161592.

[krizhevsky20122-122] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Archived (PDF) from the original on 2017-01-10. Retrieved 2017-05-24.

[VGG-123] Simonyan, Karen; Zisserman, Andrew (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].

[szegedy-124] Szegedy, Christian (2015). "Going deeper with convolutions" (PDF). CVPR 2015. arXiv:1409.4842.

[1411.4952-125] Fang, Hao; Gupta, Saurabh; Iandola, Forrest; Srivastava, Rupesh; Deng, Li; Dollár, Piotr; Gao, Jianfeng; He, Xiaodong; Mitchell, Margaret; Platt, John C.; Zitnick, C. Lawrence; Zweig, Geoffrey (2014). "From Captions to Visual Concepts and Back". arXiv:1411.4952 [cs.CV].

[1411.2539-126] Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Richard S. (2014). "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models". arXiv:1411.2539 [cs.LG].

[highway20153-129] Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].

[resnet20153-130] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.

[curiosity19912-132] Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.

[pm1992-134] Schmidhuber, Jürgen (November 1992). "Learning Factorial Codes by Predictability Minimization". Neural Computation. 4 (6): 863–879. doi:10.1162/neco.1992.4.6.863. S2CID 42023620.

[pm1996-135] Schmidhuber, Jürgen; Eldracher, Martin; Foltin, Bernhard (1996). "Semilinear predictability minimization produces well-known feature detectors". Neural Computation. 8 (4): 773–786. doi:10.1162/neco.1996.8.4.773. S2CID 16154391.

[olli2010-136] Niemitalo, Olli (February 24, 2010). "A method for training artificial neural networks to generate missing data within a variable context". Internet Archive (Wayback Machine). Archived from the original on March 12, 2012. Retrieved February 22, 2019.

[Li-etal-GECCO2013-137] Li, Wei; Gauci, Melvin; Gross, Roderich (July 6, 2013). "A coevolutionary approach to learn animal behavior through controlled interaction". Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013). Amsterdam, the Netherlands: ACM. pp. 223–230. doi:10.1145/2463372.2465801. ISBN 978-1-4503-1963-8.

[138] Gutmann, Michael; Hyvärinen, Aapo. "Noise-Contrastive Estimation" (PDF). International Conference on AI and Statistics.

[GANnips2-139] Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Networks" (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.

[progressiveGAN20172-141] Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. (26 February 2018). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196 [cs.NE].

[143] Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265. arXiv:1503.03585.

[Cherry_1953-145] Cherry EC (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears" (PDF). The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3. ISSN 0001-4966.

[Broadbent-146] Broadbent, Donald E. (1958). Perception and Communication (PDF). London: Pergamon Press.

[147] Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN 0042-6989. PMID 7660596.

[149] Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985–4992. Bibcode:1987ApOpt..26.4985F. doi:10.1364/AO.26.004985. ISSN 0003-6935. PMID 20523477.

[152] Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978. doi:10.1364/AO.26.004972. ISSN 0003-6935. PMID 20523475.

[153] Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.

[155] Schmidhuber, Jürgen (January 1992). "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. ISSN 0899-7667.

[156] Ha, David; Dai, Andrew; Le, Quoc V. (2016-12-01). "HyperNetworks". arXiv:1609.09106 [cs.LG].

[:2-158] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 1724–1734. arXiv:1406.1078. doi:10.3115/v1/D14-1179.

[sequence-159] Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 Dec 2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].

[:23-162] Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].

[163] Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014-12-10). "Neural Turing Machines". arXiv:1410.5401 [cs.NE].

[cheng2016-164] Cheng, Jianpeng; Dong, Li; Lapata, Mirella (2016-09-20). "Long Short-Term Memory-Networks for Machine Reading". arXiv:1601.06733 [cs.CL].

[165] Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016-09-25). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933 [cs.CL].

[169] Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics. 43 (1): 59–69. doi:10.1007/bf00337288. S2CID 206775459.

[171] Ackley, David H.; Hinton, Geoffrey E.; Sejnowski, Terrence J. (1985-01-01). "A learning algorithm for Boltzmann machines". Cognitive Science. 9 (1): 147–169. doi:10.1016/S0364-0213(85)80012-4. ISSN 0364-0213.

[173] Dayan, Peter; Hinton, Geoffrey E.; Neal, Radford M.; Zemel, Richard S. (1995). "The Helmholtz machine". Neural Computation. 7 (5): 889–904. doi:10.1162/neco.1995.7.5.889. hdl:21.11116/0000-0002-D6D3-E. PMID 7584891. S2CID 1890561.

[:13-174] Hinton, Geoffrey E.; Dayan, Peter; Frey, Brendan J.; Neal, Radford (1995-05-26). "The wake-sleep algorithm for unsupervised neural networks". Science. 268 (5214): 1158–1161. Bibcode:1995Sci...268.1158H. doi:10.1126/science.7761831. PMID 7761831. S2CID 871473.

[hinton2006-177] Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.

[ng2012-179] Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning". arXiv:1112.6209 [cs.LG].

[181] Schwarze, H; Hertz, J (1992-10-15). "Generalization in a Large Committee Machine". Europhysics Letters. 20 (4): 375–380. Bibcode:1992EL.....20..375S. doi:10.1209/0295-5075/20/4/015. ISSN 0295-5075.

[182] Mato, G; Parga, N (1992-10-07). "Generalization properties of multilayered neural networks". Journal of Physics A: Mathematical and General. 25 (19): 5047–5054. Bibcode:1992JPhA...25.5047M. doi:10.1088/0305-4470/25/19/017. ISSN 0305-4470.

[183] Hansel, D; Mato, G; Meunier, C (1992-11-01). "Memorization Without Generalization in a Multilayered Neural Network". Europhysics Letters. 20 (5): 471–476. Bibcode:1992EL.....20..471H. doi:10.1209/0295-5075/20/5/015. ISSN 0295-5075.

[schmidhuber19922-184] Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. Bibcode:1992NeCom...4..234S. doi:10.1162/neco.1992.4.2.234. S2CID 18271205. Archived from the original (PDF) on 2017-07-06.

[185] Hanson, Stephen; Pratt, Lorien (1988). "Comparing Biases for Minimal Network Construction with Back-Propagation". Advances in Neural Information Processing Systems. 1. Morgan-Kaufmann.

[186] LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
