
Gradient calculation in paper #27

Open
vb123er951 opened this issue May 15, 2020 · 11 comments

Comments

@vb123er951

Hi,
I have recently become interested in CSPNet and am reading the paper: https://arxiv.org/pdf/1911.11929.pdf.
I have a question about the gradient calculation on page 4. In the paper, the weight updating is written as

$w_1' = f(w_1, g_0)$
$w_2' = f(w_2, g_0, g_1)$
...
$w_k' = f(w_k, g_0, g_1, g_2, \ldots, g_{k-1})$

Shouldn't this part be calculated as follows?

$w_1' = f(w_1, g_0, g_1, g_2, \ldots, g_k)$
$w_2' = f(w_2, g_1, g_2, \ldots, g_k)$
...
$w_k' = f(w_k, g_k)$

I also want to confirm the definition of $g_i$: is it the partial derivative of the error with respect to the weight, i.e. $g_i = \partial E / \partial w_i$?

I am very confused about this part; I hope you can help me.
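
To make the notation concrete, here is a minimal sketch (a hypothetical toy construction of my own, not code from this repository or the paper) of a three-layer dense block in the spirit of equation (1): each layer transforms the concatenation of all preceding outputs, so the gradient that autograd stores in `w[i].grad` corresponds to the gradient information involved in updating $w_{i+1}$ in the formulas above.

```python
# A toy dense block (assumptions mine): layer i maps the concatenation of
# [x0, x1, ..., x_{i-1}] through a weight w_i, mirroring equation (1).
import torch

torch.manual_seed(0)
c = 4                                   # channels produced by every layer
x0 = torch.randn(1, c)                  # input features (flattened for simplicity)

# one weight matrix per dense layer; layer i sees i*c input channels
w = [torch.randn(c * (i + 1), c, requires_grad=True) for i in range(3)]

feats = [x0]
for wi in w:
    xi = torch.cat(feats, dim=1) @ wi   # concatenate everything so far, then transform
    feats.append(torch.relu(xi))

loss = feats[-1].sum()
loss.backward()

for i, wi in enumerate(w):
    print(f"dL/dw{i+1} shape:", wi.grad.shape)
```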

@WongKinYiu
Owner

[image: explanation posted as a screenshot]

@baopmessi

baopmessi commented Jan 12, 2021

I am still confused about this part (what $g_i$ means). Does it mean:
[images: equations]
If so, why does the paper say "We can find that large amount of gradient information are reused for updating weights of different dense layers. This will result in different dense layers repeatedly learn copied gradient information."?
Can you help me? Thank you.

@Pcyslist

If
[images: equations]
then how is the gradient of the weights of layer 0 expressed? And what is the meaning of
[image: expression]?

@Pcyslist

If you define
[image: definition of $g_0$]
then you would have to define the $g_0$ of the $(k+1)$-th layer as follows:
[images: equations]
so the $g_0$ terms in the red rectangle will not be the same thing. Your explanation of $g_0$ contradicts the repeated $g_0$ in your paper.

@WongKinYiu
Owner

WongKinYiu commented Oct 17, 2021

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer.

The equation with the red rectangle contains only one timestamp of the full weight update, so it shows the case of the out-degrees of the $k$-th layer. The full weight-updating process accumulates the gradients over all timestamps.

It is too complicated to show the timestamps $t_{j}$ in one equation. If you want to add the gradient information of the $(k+1)$-th layer to this equation, it means adding the gradient generated at the $(t-1)$-th timestamp. In that case, the $g$ terms have to carry timestamp annotations too, for example $g^{t}_{0}$ and $g^{t-1}_{0}$. To understand more details about timestamps and the partitioning of the gradient, you may want to see Figures 5 and 6 of the PRN paper. Edit: for the general case, you have to note from_where, to_where, and the timestamp of all gradients.
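
As a side note, a minimal sketch (a single toy linear layer of my own choosing, not the author's formulation) of what a timestamp annotation such as $g^{t}_{0}$ versus $g^{t-1}_{0}$ can refer to: the gradient reaching a given layer differs between two consecutive weight updates, because the weights change in between.

```python
# g_0 at update step t-1 versus step t for a toy linear layer (assumptions mine).
import torch

torch.manual_seed(0)
x0 = torch.randn(1, 4)
w1 = torch.randn(4, 4, requires_grad=True)

g0_history = []
for t in range(2):                           # two consecutive timestamps
    x0_t = x0.clone().requires_grad_(True)
    loss = (x0_t @ w1).sum()
    loss.backward()
    g0_history.append(x0_t.grad.clone())     # gradient propagated to layer 0
    with torch.no_grad():
        w1 -= 0.1 * w1.grad                  # one weight update
        w1.grad.zero_()

print("g_0 at timestamp t-1:", g0_history[0])
print("g_0 at timestamp t  :", g0_history[1])
```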

@Pcyslist

Thanks for your reply, @WongKinYiu.
As you said, the gradient of the $(k+1)$-th layer is generated at the $(t-1)$-th timestamp, which is the general rule of the back-propagation weight-update algorithm. So in equation (6), when calculating the gradient information generated by the $k$-th layer, we only need the already-generated gradients of layers $k+1$ to $k+n$ and the weight information of layers $1$ to $k-1$, but not the gradient information of layers $1$ to $k-1$, because those gradients $g_i$ have not been generated yet. So why do you update $w_k$ in the formula with $g_i$ ($1 \le i \le k-1$) that has not been generated yet? Maybe I mean you should replace $g_i$ with $w_i$. Your explanation of $g_i$ is that it is the gradient propagated to the $i$-th layer, but what does the update of $w_k$ have to do with $g_i$? Maybe $w_i$ would be OK?
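
A hedged illustration of the dependency this comment refers to, using a single toy linear layer of my own choosing (not the paper's equation (6)): standard backprop computes a layer's weight gradient from that layer's forward input (which is built from earlier layers' outputs) and the gradient arriving at the layer's output.

```python
# dL/dw_k = input_k^T @ grad_out_k for one toy linear layer (assumptions mine).
import torch

torch.manual_seed(0)
inp = torch.randn(1, 8)                    # concatenation of earlier layers' outputs
wk = torch.randn(8, 4, requires_grad=True)

out = inp @ wk
grad_out = torch.ones_like(out)            # gradient arriving from later layers
out.backward(grad_out)

# manual computation matches autograd's result
manual = inp.t() @ grad_out
print(torch.allclose(wk.grad, manual))     # True
```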

@WongKinYiu
Owner

WongKinYiu commented Oct 17, 2021

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer. It means that in the equation, $g_{0}$, $g_{1}$, ... are all generated by the $k$-th layer at timestamp $t$, and then propagated to the $0$-th, $1$-st, ... layers. In your description, $g_{i}$ still means the gradient generated by the $i$-th layer, which is not the same as our definition.

At a specific timestamp $t$, the gradient will propagate to all layers that have a shortcut connection to the current layer. Since DenseNet has shortcut connections to all previous layers, the gradient used to update the $k$-th layer will also propagate to all of the $0$-th, $1$-st, ..., $(k-1)$-th layers. And because the architecture has concatenations, the equations become (1) and (2). From (1), you can see that the input of the $k$-th layer is the concatenation of the outputs of all previous layers. Obviously, the gradient for updating the weights of the $k$-th layer will propagate to all previous layers according to their channel dimensions.

Just take a glance at the figure and you will see which layers' weights each of $g_{0}$, $g_{1}$, ... is used to update.
[image: figure showing how the gradients flow to the dense layers]
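
To illustrate the channel-wise split described above with a hypothetical toy example of my own (not code from this repository): the gradient computed at the $k$-th layer is partitioned along the channel dimension of its concatenated input, and each slice is the piece that propagates back to the corresponding earlier layer (the $g_0$, $g_1$, ... of the discussion).

```python
# Channel-wise gradient split through a concatenation (assumptions mine).
import torch

torch.manual_seed(0)
c = 4
x0 = torch.randn(1, c, requires_grad=True)   # output of layer 0
x1 = torch.randn(1, c, requires_grad=True)   # output of layer 1
w2 = torch.randn(2 * c, c, requires_grad=True)

# layer 2 consumes the concatenation [x0, x1]
cat = torch.cat([x0, x1], dim=1)
out = cat @ w2
out.sum().backward()

# the single gradient computed at layer 2 splits by channels into the pieces
# propagated back to layer 0 and layer 1 respectively
print("piece propagated to layer 0 (g_0):", x0.grad)
print("piece propagated to layer 1 (g_1):", x1.grad)
```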

@Pcyslist

@WongKinYiu Thanks very much for your patient reply. Good luck to you.


@NeoZng

NeoZng commented Feb 6, 2022

@WongKinYiu Dear author, I still don't understand why we should use $g_{0}$ to update $w_1$. In your description, $g_{0}$ equals
[image: expression]
but if we are going to update $w_1$, we should use
[image: expression]
in order to calculate
[image: expression]
using the chain rule.
That is why I think there is no connection between
[image: expression]
and
[image: expression]
Only when the variable $x_k$ changes can it affect the weights in previous layers, while the weights in a later layer do nothing to the previous ones.

And another question:
[image: quote from the paper]
What do you mean by "truncated"?

@JianjianSha

After reading the author's interpretation above, why do I think the gradient
[image: expression]
should be the one propagated to the $i$-th layer?
