<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>CapsNets &#8211; NN in XL</title>
	<atom:link href="https://www.richardmaddison.com/tag/capsnets/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.richardmaddison.com</link>
	<description>Richard Maddison</description>
	<lastBuildDate>Mon, 28 Jan 2019 08:58:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.7</generator>
	<item>
		<title>Building a Capsule Net in Excel</title>
		<link>https://www.richardmaddison.com/2019/01/13/building-a-capsule-net-in-excel/</link>
					<comments>https://www.richardmaddison.com/2019/01/13/building-a-capsule-net-in-excel/#comments</comments>
		
		<dc:creator><![CDATA[Richard Maddison]]></dc:creator>
		<pubDate>Sun, 13 Jan 2019 18:02:31 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AI in Excel]]></category>
		<category><![CDATA[CapsNets]]></category>
		<category><![CDATA[Capsule Nets]]></category>
		<category><![CDATA[Capsule Networks]]></category>
		<category><![CDATA[Capsules]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Neural Networks in Excel]]></category>
		<guid isPermaLink="false">https://www.richardmaddison.com/?p=21515</guid>

					<description><![CDATA[<p>Capsule networks are possibly the biggest advance in neural network design in the last decade. They appear to mimic the human brain far more than convolutional neural networks and move us significantly closer to artificial general intelligence. As a step towards demystifying these new algorithms...</p>
<p>The post <a rel="nofollow" href="https://www.richardmaddison.com/2019/01/13/building-a-capsule-net-in-excel/">Building a Capsule Net in Excel</a> appeared first on <a rel="nofollow" href="https://www.richardmaddison.com">NN in XL</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Capsule networks are possibly the biggest advance in neural network design in the last decade. They appear to mimic the human brain far more than convolutional neural networks and move us significantly closer to artificial general intelligence. As a step towards demystifying these new algorithms I’ve built one on-sheet in Excel.</p>



<ul class="wp-block-gallery columns-3 is-cropped"><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_225324.gif" alt="" data-id="21540" data-link="https://www.richardmaddison.com/?attachment_id=21540" class="wp-image-21540"/></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_225517.gif" alt="" data-id="21541" data-link="https://www.richardmaddison.com/?attachment_id=21541" class="wp-image-21541"/></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_225759.gif" alt="" data-id="21542" data-link="https://www.richardmaddison.com/?attachment_id=21542" class="wp-image-21542"/></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_230143.gif" alt="" data-id="21543" data-link="https://www.richardmaddison.com/?attachment_id=21543" class="wp-image-21543"/></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_230334-1.gif" alt="" data-id="21544" data-link="https://www.richardmaddison.com/?attachment_id=21544" class="wp-image-21544"/></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" width="396" height="514" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/20190113_230839.gif" alt="" data-id="21545" data-link="https://www.richardmaddison.com/?attachment_id=21545" class="wp-image-21545"/></figure></li></ul>



<p style="text-align:center"><em>Fig: These GIFs stride back and forth along each of the 8 dimensions of the linear manifold of various digits in turn, holding the other dimensions constant. They are from a Capsule Net built on-sheet in Excel which learns the linear manifold of the 10 MNIST handwritten digits and then uses these for categorisation of new digits.</em></p>



<p>Since a turning point in 2012, neural networks have become dominant in the fields of machine learning and artificial intelligence (AI). They are so named because they loosely model the structure of neurons in the brain. Nowadays, they pretty much form the default approach for computer vision, translation, speech recognition etc. Convolutional Neural Networks (ConvNets) are a sub-class of neural networks and our most powerful tool for image analysis, but only now, after 20 years of incremental improvements and optimization. Capsule Networks (CapsNets) are new, different and significant. In their first incarnation in 2017, they hit or beat state-of-the-art performance benchmarks in several areas. I, for one, think they represent the next step in humanity’s march toward Artificial General Intelligence (AGI). </p>



<p>CapsNets appear to behave more like the brain than ConvNets. Their inventor Geoff Hinton talks about these characteristics in this presentation <a href="https://youtu.be/rTawFwUvnLE">https://youtu.be/rTawFwUvnLE</a>, given shortly after he released the paper in late 2017. Mimicking the human brain is a promising route to understanding and developing a theory of intelligence. An analogy is the development of a theory of aerodynamics by initially studying birds. The current fruits of that initial approach are aircraft that can circle the earth in 6 hours or carry 500 passengers across the Atlantic. With a similar trajectory, we can only wonder what a theory of intelligence will yield.</p>



<p>Over the last two years I’ve been trying to master and implement neural nets in my work. To help me get up to speed, I’ve been building them in Microsoft Excel. This is slow but gives me a different and intuitive way to see how they work, and given that some of these neural networks were considered almost magical, and certainly state-of-the-art, quite recently, building them in Excel is quite demystifying. I’ve built several relatively large neural networks and posted quite a few online. This batch-run ConvNet with Adam optimization hits recent benchmarks for recognizing human handwriting <a href="https://youtu.be/OP7wi2MoSeM">https://youtu.be/OP7wi2MoSeM</a> and should give you a flavour.</p>



<p>CapsNets are exciting and look to me like a massive development in AI that brings us closer to understanding how the human brain works or at least some of the key maths that play a part in human learning and intelligence. I say this because they learn, fail and succeed in far more human ways than do ConvNets.</p>






<ul><li>Visual twists, shifts, squeezes and expansions of objects (affine transforms) put a CapsNet off the scent far less than a ConvNet. Think of those verification “captchas” that websites present you with to prove you’re human. We are far better at recognizing distorted digit captchas than ConvNets are. For a ConvNet to do the same, it would have had to be trained on some similar distortion before, whereas a CapsNet can extrapolate along those transformations more easily. This also makes CapsNets far better at recognizing 3D objects from different viewpoints.</li><li>Humans learn patterns that represent canonical objects with very few examples, or rather instructions. You don’t need to subject a child to 60,000 examples of handwritten digits before they know the numbers 0 to 9. CapsNets have been trained with as few as 25 interventions. We still need to show them a load of handwritten characters but not actually tell them what these are. The advantages they offer for unsupervised learning are already phenomenal.</li><li>CapsNets effectively take bets on what they are seeing and seek information to confirm this. This is what the core routing-by-agreement algorithm does. As proof builds up, they instantaneously prune densely connected layers to sparsely connected layers linking lower-level features to specific higher-level features. This is analogous to our making an assumption about what we see, imposing a reference frame and seeking information to fill in the rest. This only becomes apparent when we get it wrong, as we generally do with trick images like this shadow face: <a href="https://www.youtube.com/watch?v=sKa0eaKsdA0">https://www.youtube.com/watch?v=sKa0eaKsdA0</a></li><li>CapsNets suffer from the human visual problem known as crowding. This is where too many examples of the same object occur close together and simply confuse our minds: counting the separate lines in IIIIIIII is far harder than reading the word seven.</li></ul>






<p>CapsNets are effectively a vectorized version of ConvNets. In ConvNets, each neuron in a layer gives the probability of the presence of a feature defined by its kernel. CapsNets do more: they convey not only the presence but the “pose” of the feature. By pose we could mean the scale, skew, rotation, viewpoint etc. The routing-by-agreement algorithm in a CapsNet assesses the match between the “pose” of lower-level features and the features in the level above, say bits of digits to a whole digit or components of a face to a whole face. When these agree, e.g. the eyes, mouth and nose elements all correspond to a face of size X looking left, we get an indication that the higher-level feature is present. When the poses of many lower-level features match a single higher-level feature, we can be very certain that the higher-level feature exists. If this is hard to digest, have a look at the Hinton or Géron videos I reference later. These may help.</p>
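<p>For readers who want to see the mechanics, the routing-by-agreement loop described above can be sketched in a few lines of numpy. This is a minimal illustration of the published algorithm rather than my Excel wiring, and the function names and shapes are my own.</p>

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # vector nonlinearity: keeps the direction, maps the length into [0, 1)
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1 + n2)) * s / np.sqrt(n2 + eps)

def route(u_hat, n_iter=3):
    # u_hat: (n_lower, n_upper, dim) prediction vectors ("votes") from each
    # lower-level capsule about each higher-level capsule's pose
    b = np.zeros(u_hat.shape[:2])                  # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # couplings
        s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted vote sum
        v = squash(s)                              # upper capsule outputs
        b += np.einsum('ijd,jd->ij', u_hat, v)     # reward agreement
    return v
```

<p>Each iteration shifts coupling toward whichever higher-level capsule a lower capsule’s vote already agrees with, which is the pruning from dense to sparse connectivity described above.</p>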



<p>This blog post is about building a full capsule net in Excel for handwritten digit classification using MNIST data. MNIST is a data set of handwritten digits provided by Yann LeCun and a staple in data science. Given the novelty of this new algorithm, there is not much information available on the net, but the best I came across was Geoff Hinton’s original paper <a href="https://arxiv.org/pdf/1710.09829v2.pdf">https://arxiv.org/pdf/1710.09829v2.pdf</a> and Aurélien Géron’s Keras code and associated videos <a href="https://youtu.be/pPN8d0E3900">https://youtu.be/pPN8d0E3900</a>. I also found this talk by Dr Charles Martin helpful: <a href="https://youtu.be/YqazfBLLV4U">https://youtu.be/YqazfBLLV4U</a>. </p>



<p>&#8212;&#8212;- The remainder of this blog post improves but is probably only of interest to nerds and insomniacs. &#8212;&#8212;- </p>



<p>CapsNets are exciting and potentially far more powerful than standard convolutional neural nets because: </p>






<ul><li>They don’t lose information via subsampling or max pooling, which is the ConvNet way to introduce some invariance; CapsNet weights encode viewpoint-invariant knowledge and are equivariant.</li><li>Through the above approach, they know the “pose” of parts and the whole, which allows them to extrapolate their understanding of geometric relationships to radically new viewpoints (equivariance).</li><li>They have built-in knowledge of the relationship of parts to a whole.</li><li>They contain the notion of entities, and those GIFs at the top of the blog represent movement along one dimension of the linear manifold of the top-level entity that represents an eight.</li></ul>






<p>The Hinton &amp; Géron resources were superb for the forward model, i.e. the algorithm that identifies categories based on trained parameters, which is the true innovation of the CapsNet. However, information on the backward model, the mechanism by which it learns, was sparse and scattered. This is not surprising because back-propagation of the gradient of the loss function with respect to the parameters of each layer is mathematically so straightforward that the deep learning frameworks of TensorFlow and Keras do this automatically. However, wiring the chain rule backwards through the twists and turns of an Excel-based CapsNet architecture was a challenge. This was largely because, instead of reading through the theory first, I guessed, fiddled and played until, in exasperation, I looked harder for the “right” approach. I certainly learned a lot about how not to do it and it’s possible that I turned up some novel ideas, but above all, when I eventually “got it right”, the theory sunk in and meant a lot more to me.</p>



<p><strong>How a capsule net works</strong></p>



<p>I&#8217;ve uploaded a video walkthrough of the Excel model here: <a href="https://youtu.be/4uiFJZjw6fU">https://youtu.be/4uiFJZjw6fU</a>. It&#8217;s probably not for the casual reader but is a more visual way to see what&#8217;s happening and also covers a lot of the issues I&#8217;ve written about in this blog.</p>



<p>A big difference between CapsNets and standard neural networks is that CapsNets contain the notion of entities with pose parameters i.e. the network identifies component parts (lower level capsules) and determines if their pose parameters match those of the higher-level capsules where these parts are combined. Capsules require multiple dimensions to convey their pose and the diagram below shows where the additional dimensions appear:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="731" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-1024x731.jpg" alt="" class="wp-image-21525" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-1024x731.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-300x214.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-768x548.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-700x500.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1-1100x785.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Schematic-shwoing-8D-Capsule-Transition-1.jpg 1471w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: This schematic shows the transition from scalar values by neuron at layer 2 to 8-dimensional vectors that represent the capsules in Layer 3.</figcaption></figure></div>



<p>The CapsNet I built is like the structure in
Hinton’s paper but quite a bit smaller with 5&#215;5 kernels for 4 &amp; 8 channels
in the first two layers and 8-dimensional Digit Capsules. This gives only 100 x
8D primary capsules and 10 x 8D digit capsules. The results for this are still
impressive i.e. 98.7% accuracy rather than the 99.5% accuracy that we see with
1152 x 8D primary and 10 x 16D digit capsules in the paper. I chose this
reduction after paring down the Keras model to a size that would be manageable
in Excel without too much Excel build optimization. The structure of my forward
CapsNet, or rather a screenshot of the actual CapsNet as it appears in my Excel
spreadsheet, is below.<br></p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="574" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-1024x574.jpg" alt="" class="wp-image-21536" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-1024x574.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-300x168.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-768x430.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-700x392.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1-1100x617.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Forward-Net-1.jpg 1504w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: View of the forward CapsNet built on-sheet in Excel</figcaption></figure></div>



<p>The backpropagation or learning mechanism is much bigger. In the figure below, I’ve put together several screen-shots of the entire spreadsheet model. This covers 1000 rows and 7500 columns. The bulk of the area relates to the decoder sections with their 784 neurons in the final layer and the Adam optimization I used to get the learning speed up. I’ve highlighted the big blue collection of layer 3 transform matrices on this to give you an indication of the size w.r.t. the above forward model, additional calculations, and complexity required for the backward pass.</p>
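<p>For reference, the Adam update that each of those optimization blocks computes on-sheet is the standard one. A numpy sketch of a single step (function name my own):</p>

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update for a parameter array theta at step t (t starts at 1);
    # m and v are the running first and second moments of the gradient
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

<p>On-sheet this means two extra blocks (m and v) per parameter block, which is much of the reason the backward model dwarfs the forward one.</p>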



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-1024x182.jpg" alt="" class="wp-image-21527" width="580" height="103" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-1024x182.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-300x53.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-768x137.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-700x125.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net-1100x196.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Excel-View-of-Full-Net.jpg 1506w" sizes="(max-width: 580px) 100vw, 580px" /><figcaption>Fig: Excel Screenshot of the full Capsule Net spreadsheet with the lengthy decoder network and multiple parameter optimization blocks.</figcaption></figure></div>



<h2>The Process of Building</h2>



<p>Now that I know what I’m doing, I could probably mechanically build this in a couple of days, or modify the size of a layer in an hour or so. However, if I built it in Keras it would take a couple of hours, and modifying a layer would take seconds. Now I understand what I’m building, but the initial build took several months, and even understanding Aurélien Géron’s Keras code took days. </p>



<p>My initial approach was to build only the forward model and feed this with pre-trained parameters from a modified version of Aurélien Géron’s code. My reduced spec (L1: 5x5K, 4c, L2: 5&#215;5, 8c s2 L3: 8Dx10caps, Decoder 50, 50, 784) took the parameter count down from the 44,489,744 of the original paper that Aurélien had replicated to a more Excel-manageable 127,943. Keras trials on this gave an MNIST test result of 98.69%, higher than I could regularly obtain with my Excel ConvNets but way below the 99.43% that the bigger model achieves. </p>



<p>Another important modification I made to get a clear comparison was to initially train and test the reduced-size Keras model only on the 10k MNIST data set. This reached a 100% overfit after about 34 epochs. What I mean by this is that the model was able to learn the 10k data set to 100% accuracy, i.e. the model could store sufficient information in its parameters to categorize all 10k MNIST digits correctly. This is useless as a generalizable model but gave me an easy test to see if the Keras-trained parameters, when transferred to Excel, would deliver the same result on the same test set. </p>



<p>I was on several steep learning curves throughout this process and was delighted when I eventually got a perfect match. However, as I added the backward model and learned from these already overfit parameters, the model’s precision collapsed to 30% or so and only then began to learn. I saw more failure modes than I can recall and, given the slow speed at which the Excel model learned, had plenty of time to hypothesize the causes.  </p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="731" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-1024x731.jpg" alt="" class="wp-image-21528" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-1024x731.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-300x214.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-768x548.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-700x500.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves-1100x785.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Examples-of-Learning-Curves.jpg 1471w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: A small selection of the model’s precision curves as the bugs dropped out.</figcaption></figure></div>



<p>I began this process in August of 2018 and eventually confirmed that I had a working CapsNet in Excel on 31-December 2018. The closing stages of using this odd approach of matching to a 100% overfit are summarized below.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="731" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-1024x731.jpg" alt="" class="wp-image-21529" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-1024x731.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-300x214.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-768x548.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-700x500.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit-1100x785.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Test-By-Learn-from-Overfit.jpg 1471w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: The figure above was sufficient proof to ensure that my Excel backward pass wiring was functionally the same as Aurélien Géron’s Keras code.</figcaption></figure></div>



<p>Once I had confirmation that it was working, adding the full 60k-digit training data, loading 50-epoch Keras-trained parameters and running a 10k test in Excel that matched the 98.69% Keras test result took no time. I headed out happy that night to celebrate the new year.  </p>



<h2>Interesting Learning</h2>



<p><strong>Triple Axel</strong></p>



<p>One challenge that I faced was wiring the backpropagation of the convolutions. Though I now know this to be straightforward, I went through the process without really thinking through the maths or approach and miraculously ended up with the right answer. On further research I found that this obscure transpose and flip of the kernel over its anti-diagonal is apparently called a Triple Axel, named after a 19<sup>th</sup>-century figure skater. This is according to user1551 on math.stackexchange; though I can’t find any other evidence, I love the idea and am happy to propagate the meme. </p>



<p>In TensorFlow and, as I understand it, basically all other code, the same approach is handled with a sparse weight-sharing matrix, such that the reverse path through the matrix multiplication can be accomplished simply by transposing this matrix to get the same connectivity. </p>



<p>In Excel, the Triple Axel transformation of the kernel is much easier to code, use and audit, so it makes for a nice approach.</p>
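<p>As a concrete check of the claim above, here is a small numpy sketch (function names my own) showing that back-propagating a valid cross-correlation to its input amounts to a full cross-correlation with the 180-degree-rotated kernel, verified against a numerical gradient.</p>

```python
import numpy as np

def xcorr2d(x, k):
    # "valid" cross-correlation: the forward op in a ConvNet layer
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def backprop_input(dy, k):
    # dL/dx is the "full" cross-correlation of dy with the kernel
    # rotated 180 degrees (flipped on both axes): the "Triple Axel"
    k_rot = np.flip(k)
    ph, pw = k.shape[0] - 1, k.shape[1] - 1
    return xcorr2d(np.pad(dy, ((ph, ph), (pw, pw))), k_rot)

# sanity check against a numerical gradient for L = sum(y)
rng = np.random.default_rng(0)
x, k = rng.normal(size=(5, 5)), rng.normal(size=(3, 3))
dx = backprop_input(np.ones((3, 3)), k)
eps = 1e-6
for i in range(5):
    for j in range(5):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += eps
        xm[i, j] -= eps
        num = (xcorr2d(xp, k).sum() - xcorr2d(xm, k).sum()) / (2 * eps)
        assert abs(dx[i, j] - num) < 1e-5
```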



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="618" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Triple-Axel-B-1024x618.png" alt="" class="wp-image-21530" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Triple-Axel-B.png 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Triple-Axel-B-300x181.png 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Triple-Axel-B-768x464.png 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Triple-Axel-B-700x422.png 700w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: The above figure shows the different approach to convolutional matrix multiplication on the forward and backward path for Excel v Python.</figcaption></figure></div>



<p><strong>Backpropagation through a stride</strong></p>



<p>I made many attempts at getting this to work before I began looking for a proper explanation. The best I came across was “A guide to convolution arithmetic for deep learning” by Vincent Dumoulin and Francesco Visin from the Institut des algorithmes d’apprentissage de Montréal. I would check their paper out if you’re confused.</p>



<p>Ultimately the wiring in Excel for this is very
straightforward and simply requires interspersing zeros in the post-stride
channel to bring the size of the channel up to the pre-stride size as shown
below. The re-shaping from the 8D capsule gradients was also straightforward
and the figure below shows how I unrolled these capsules into the convolutional
channels. Again, I tried all sorts of approaches to this simple unfurling of
the channels and arrived at the correct one by chance.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Backprop-through-a-stride-B.png" alt="" class="wp-image-21531" width="392" height="257" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Backprop-through-a-stride-B.png 781w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Backprop-through-a-stride-B-300x197.png 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Backprop-through-a-stride-B-768x504.png 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Backprop-through-a-stride-B-700x460.png 700w" sizes="(max-width: 392px) 100vw, 392px" /><figcaption>Fig: The figure above shows the path of backpropagation through a stride and from the 8D capsules to one of the 8 channels of my convolutional layer 2.</figcaption></figure></div>



<p><strong>Backprop through the affine transforms</strong></p>



<p>I spent some time working through ways to build the derivative of the layer 2 output function dZ<sup>[2]</sup>. This comprises a sum of the matrix products of each transformation matrix by dZ<sup>[3]</sup>, routed via the derivative of the layer 2 activation function, i.e. the gradient only passes back where U<sub>i</sub> is greater than zero. </p>
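<p>In numpy terms, and ignoring the routing coefficients for brevity, the sum-and-gate step described above might look like this. The shapes follow my reduced model and the variable names are my own; this is a sketch, not the workbook’s exact wiring.</p>

```python
import numpy as np

# reduced model: 100 primary capsules, 10 digit capsules, 8 dimensions
rng = np.random.default_rng(0)
n_prim, n_digit, d = 100, 10, 8
W = rng.normal(size=(n_prim, n_digit, d, d))  # affine transform matrices
U = rng.normal(size=(n_prim, d))              # layer 2 pre-activations
dZ3 = rng.normal(size=(n_digit, d))           # gradient at the digit capsules

# sum each transposed transform matrix applied to dZ3, then gate by the
# derivative of the layer 2 activation: pass the gradient only where U > 0
dZ2 = np.einsum('pjoi,jo->pi', W, dZ3) * (U > 0)
```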



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="257" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-1024x257.png" alt="" class="wp-image-21533" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-1024x257.png 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-300x75.png 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-768x193.png 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-700x175.png 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation-1100x276.png 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/dZ-2-creation.png 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: This is a view of the wiring for the derivative of the gradient of the layer 2 output function for the second primary capsule of 100.</figcaption></figure></div>



<p><strong>Margin loss &amp; “Brim Loss?”</strong></p>



<p>Generating a dZ<sup>[3]</sup> with the right dimension (8D) to pass back through the affine transforms also caused me some issues. I tried various approaches, but the one that mimicked TensorFlow, and that I therefore assumed to be correct, was simply to multiply the derivative of the loss function by the final digit capsule vectors, i.e. after the vector nonlinearity or squash function. </p>



<p>I used the margin loss quoted in the paper but made some silly mistakes in calculating its derivative that negated the use of the max function: instead of ignoring gradients for activations greater than 90% and less than 10%, it actually penalized high certainty above and below the thresholds. This effectively optimized for uncertainty, or specifically a 90% certainty of true and a 10% certainty of false. An interesting result of this was that the model trained up to the 100% overfit benchmark I was using faster. This approach also potentially introduces additional regularization at little cost in time and code.</p>
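<p>To make the difference concrete, here are the margin loss from the paper and the gradient of my accidental variant, sketched in numpy (function names my own):</p>

```python
import numpy as np

def margin_loss(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # margin loss from the CapsNet paper; v_norm holds the digit-capsule
    # lengths and T is the one-hot target vector
    return np.sum(T * np.maximum(0, m_pos - v_norm) ** 2
                  + lam * (1 - T) * np.maximum(0, v_norm - m_neg) ** 2)

def brim_grad(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # the accidental "Brim" variant: the max() is dropped from the
    # derivative, so certainty beyond the thresholds is also penalized
    # and the optimum sits at exactly 0.9 (true) and 0.1 (false)
    return -2 * T * (m_pos - v_norm) + 2 * lam * (1 - T) * (v_norm - m_neg)
```

<p>With a confident, correct prediction the margin loss is exactly zero, while the Brim gradient still pulls the true-class length back toward 0.9.</p>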



<p>Because Excel is so slow I’ve stuck with this approach and, until I find the correct name for it, am calling it a “Brim” loss because the resulting loss curve looks like the brim of a hat. I explain this further in the figure below. </p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" width="1024" height="731" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-1024x731.jpg" alt="" class="wp-image-21552" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-1024x731.jpg 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-300x214.jpg 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-768x548.jpg 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-700x500.jpg 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss-1100x785.jpg 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-Loss.jpg 1471w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: The figure above shows the Margin loss from Geoff Hinton’s paper alongside my made-up “Brim” loss that optimized for uncertainty </figcaption></figure></div>



<p>I ran 20 learning trials over 10 epochs for both the Margin loss and the Brim loss, each with differing seeds for initialization. Multiple trials are the only way to get a rough measure of the advantage that the Brim loss may offer over the Margin loss in this case. The trials below show the learning curves (as precision rather than loss) and the improvement is quite substantial. These were, of course, run in Python, as the process would have taken weeks in Excel.</p>



<figure class="wp-block-image"><img loading="lazy" width="1024" height="556" src="https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-1024x556.png" alt="" class="wp-image-21578" srcset="https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-1024x556.png 1024w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-300x163.png 300w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-768x417.png 768w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-700x380.png 700w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1-1100x597.png 1100w, https://www.richardmaddison.com/wp-content/uploads/2019/01/Brim-and-Margin-Trails-1.png 1152w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Fig: 20 trials of Brim v Margin loss learning, with the mean precision by epoch in heavy solid, 1-standard-deviation lines dotted and individual 10-epoch trials in thin solid. I added a copy of the Brim mean to the Margin chart for comparison.</figcaption></figure>



<p><strong>The Next Steps </strong></p>



<p>If I carry this further in Excel, I think the next step will be to introduce an innate graphics model along the lines of “Extracting pose information by using a domain specific decoder” by Navdeep Jaitly &amp; Tijmen Tieleman. This will allow the model to run unsupervised learning to go from pixels to entities with poses and opens the ability to train on MNIST with only a handful of supervised inputs. </p>



<p>I’m also keen to explore Matrix capsules, EM routing, running on the SmallNORB data set and, of course, optimizing Excel to run more quickly, perhaps making use of the iterative functions in Excel.</p>

<p>I’ll update this blog as I make progress but would welcome any encouragement, tips and corrections.</p>
<p>The post <a rel="nofollow" href="https://www.richardmaddison.com/2019/01/13/building-a-capsule-net-in-excel/">Building a Capsule Net in Excel</a> appeared first on <a rel="nofollow" href="https://www.richardmaddison.com">NN in XL</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.richardmaddison.com/2019/01/13/building-a-capsule-net-in-excel/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
	</channel>
</rss>
