The learning rate is how quickly a network abandons old beliefs for new ones.
If a child sees 10 examples of cats and all of them have orange fur, she will think that cats have orange fur and will look for orange fur when trying to identify a cat. Now she sees a black cat and her parents tell her it's a cat (supervised learning). With a large “learning rate”, she will quickly realize that “orange fur” is not the most
important feature of cats. With a small learning rate, she will think that this black cat is an outlier and that cats are still orange.
If the learning rate is too high, she might start to think that all cats are black, even though she has seen more orange cats than black ones.
In general, you want to find a learning rate that is low enough that the network converges to something useful, but high enough that you don't have to spend years training it.
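To make this concrete, here is a tiny gradient-descent sketch in Python. The quadratic loss, the step count, and the three learning rates are illustrative choices of mine, not anything prescribed above; the point is only how the update w -= lr * grad(w) behaves at different learning rates.

    # Gradient descent on a 1-D quadratic loss with minimum at w = 3.
    # Loss, step count, and learning rates are illustrative, not canonical.
    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    for lr in (0.01, 0.1, 1.1):      # too small, reasonable, too large
        w = 0.0
        for _ in range(50):
            w -= lr * grad(w)        # new belief = old belief - lr * gradient
        print(f"lr={lr}: w={w:.2f}, loss={loss(w):.2f}")

With lr=0.01 the weight crawls toward 3 and is still far away after 50 steps; with lr=0.1 it converges; with lr=1.1 every step overshoots the minimum and the weight diverges, the numerical analogue of deciding that all cats are black.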
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward and backward passes are not counted as two separate passes).
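For example (numbers of my own choosing, just to show the arithmetic): with 1,000 training examples and a batch size of 250, one epoch takes 4 iterations.

    import math

    # Illustrative numbers, not from the definitions above.
    num_examples = 1000
    batch_size = 250
    num_epochs = 10

    iterations_per_epoch = math.ceil(num_examples / batch_size)  # 4 passes per epoch
    total_iterations = iterations_per_epoch * num_epochs         # 40 forward+backward passes

    print(iterations_per_epoch, total_iterations)  # -> 4 40

The ceil only matters when the batch size does not divide the dataset evenly; the last, smaller batch still counts as one iteration.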
Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase. By “ignoring”, I mean these units are not considered during
a particular forward or backward pass.
More technically: at each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing
edges to a dropped-out node are also removed.
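Here is a minimal sketch of that in Python with NumPy. The keep probability p matches the description above; the division by p is my addition, the common “inverted dropout” convention, so that activations keep the same expected scale and nothing needs rescaling at test time.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p=0.8, training=True):
        # Keep each unit with probability p, drop it with probability 1-p.
        if not training:
            return activations                     # at test time, use all units
        mask = rng.random(activations.shape) < p   # True = kept, False = dropped
        return activations * mask / p              # dropped units contribute nothing

    h = rng.standard_normal(5)                     # one layer's activations
    print(dropout(h, p=0.8))                       # roughly 1 in 5 entries zeroed out

Zeroing a unit's output has the same effect as deleting its outgoing edges for that pass, and in an autodiff framework the gradient flows through the same mask on the backward pass, so the dropped unit's incoming weights receive no update either.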