Format problem in the Google Drive Training Data #1

Open
PosoSAgapo opened this issue Nov 8, 2020 · 15 comments

PosoSAgapo commented Nov 8, 2020

The paper's work on temporal commonsense is great. However, I have some questions about the formatted training data that you provide on Google Drive. Here is an example:

[CLS] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [SEP] [MASK] [MASK] liquid eye ##liner and leopard - print under [MASK] ##ments , her [MASK] [MASK] steel ##y [MASK] [MASK] thin [MASK] like [MASK] [MASK] [MASK] result of [MASK] 20 ##- [MASK] ##und [MASK] [MASK] [MASK] she [unused500] curls [MASK] [SEP] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [SEP] [unused500] [unused502] [unused45] [SEP] -1 43 44 45 46 47 48 49 50 51 120 121 122 7.670003601399247e-05 0.010977975780067789 0.17749198395637333 0.3423587734906385 0.26762063340149095 0.1613272650199883 0.03558053856215351 0.004304815288057253 0.00026131446521643803 0.0 0.0 0.0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 4218 2014 6381 -1 -1 -1 -1 -1 -1 -1 6843 8163 1010 -1 9128 2003 -1 -1 1010 2014 -1 2608 -1 14607 1010 1996 -1 -1 1996 -1 -1 6873 -1 3347 17327 2015 -1 -1 -1 1012 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

1. I guess you masked the event in the sentence I quoted here, but can TacoLM really predict so many masked words correctly? I don't think any human could guess the [MASK] words in this sentence with so little information given.

2. The 'unused' tokens described in your paper are used to construct a 1-to-1 mapping to the new dictionary. But how can I know what each 'unused' token actually means?

3. Why is there always a number attached to the end of the sentence, like the '-1' right after the last '[SEP]' token? In other examples this number can be 79, 43, and so on. What does it actually mean?

4. After the '-1', several more numbers follow; judging by the spacing, I don't think they belong to the same group as the '-1' mentioned in Q3. What do these numbers mean?

5. What do the floating-point numbers that come after these mean?

6. The final run of numbers (-1 -1 -1 ...): I guess these are attention tokens? But they do not match HuggingFace's attention-mask encoding, which uses 1 for attended positions and 0 for unattended ones.

7. How does the whole training data form the (event, value, dimension) tuple in this case?

Slash0BZ (Member) commented Nov 8, 2020

Hi, thanks for your interest in the work!

1. That's right, the model cannot recover all the masks, and neither could a human. The goal here is to give the model a strong prior on how words are associated with temporal information, without the effect of surrounding contexts. Please refer to the paper for more intuition.

2. We will add something to clarify this, sorry about the confusion. You can search for these tokens in this file to see how they are mapped. In short, 500 is the separator, 501 is duration, 502 is frequency, 503 is typical time, 504 is ordering, and 505 is the boundary (see the sketch after this list).

3-6. We will clarify this as well. You can use the file in the above bullet point to see how the sequence is generated. The -1 right after the sequence is the soft-loss target for a single token; the instance you showed has -1 because it is not meant for the model to learn any value directly, only an MLM loss. The next sequence of numbers represents the vocabulary labels in that dimension, and the sequence after that contains the target values of those vocab entries. The last sequence of numbers holds the MLM targets of the input sequence.

7. The training data is a mix of different dimensions, and each instance either predicts a value in a dimension or recovers tokens given a dimension and a value (which is what the instance you showed above is doing). There are other instances in the file with fewer [MASK]s, whose purpose is to predict the values given the events/contexts.
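
As a rough sketch of that mapping (my own summary of the points above; the file mentioned in point 2 is the authoritative source, and the variable name below is only illustrative):

# Markers for the separator and the temporal dimensions, as described above.
DIMENSION_MARKERS = {
    "[unused500]": "separator",
    "[unused501]": "duration",
    "[unused502]": "frequency",
    "[unused503]": "typical time",
    "[unused504]": "ordering",
    "[unused505]": "boundary",
}

# In the example instance above, "[unused500] [unused502] [unused45]" would then
# read as: separator, frequency dimension, and (presumably) the token encoding
# the given value in that dimension.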

Please let me know how I can further assist.

PosoSAgapo (Author) commented:

Sorry for my late reply, and sincere thanks for your detailed explanation! ;)

PosoSAgapo (Author) commented Aug 21, 2021

Hi, I still have a question about your paper. The pre-calculated soft_target for duration/frequency/boundary is in pattern_extraction.py, and it is calculated as described in Section 3.4 of the paper. However, I cannot reproduce your pre-calculated soft_target with a softmax, following the paper's description:

import math
import torch

# log-seconds for: second, minute, hour, day, week, month, year, decade, century
log_sec = [torch.log(torch.tensor(s)) for s in
           [1.0, 60.0, 3600.0, 86400.0, 604800.0, 2592000.0, 31104000.0, 311040000.0, 3110400000.0]]
sigma = 4
target = torch.log(torch.tensor(60.0))  # center the distribution on "minutes"
value = []
for n in log_sec:
    # Gaussian PDF at each unit, sigma = 4 in log-seconds
    value.append(torch.exp(-(n - target) ** 2 / (2 * sigma ** 2)) / (sigma * torch.sqrt(torch.tensor(2 * math.pi))))
value = torch.stack(value)
print(value.softmax(dim=0))

This code does not reproduce your pre-calculated 'minutes' soft_target in pattern_extraction.py, which is

 [0.1966797321503831, 0.5851870686462161, 0.1966797321503831, 0.018764436502991182, 0.0023274596334087274, 0.0003541235975250493, 7.345990219746334e-06, 1.0063714849200163e-07, 6.917246071508814e-10]

So how can I get the same result as your code? This matters to me because I am also trying to derive a temporal distribution from my own data and train with yours, but I cannot reproduce the softmax result in your pattern_extraction.py.

Slash0BZ (Member) commented:

What are the output values in your version? I will try to find how I computed it exactly. Nonetheless, these are hard-coded distributions and they should not "cheat" anything. Small variations should not have much impact.

PosoSAgapo (Author) commented:

tensor([0.1146, 0.1194, 0.1146, 0.1101, 0.1088, 0.1083, 0.1081, 0.1080, 0.1080])
The above is my calculated value; there is a large gap between your result and mine. The core of my problem is that I want to compute other time-related distributions and make my calculation consistent with yours, but the results do not come out close.

PosoSAgapo (Author) commented:

The time dimensions that are far from the target have probability values close to 0, which causes the corresponding softmax to be 1 (their numerator is essentially e^0 = 1). Additionally, the target label has a probability less than one, so its softmax numerator is less than e, and that causes this problem. I tried multiplying the output by a large constant before the softmax, but the result is still far from your code's.

Slash0BZ (Member) commented:

It seems that the standard deviations are different: mine is much smaller, so the peak value has a larger probability (0.585) and it converges to 0 much faster (a distance of 2 units has probability 0.02). Without going into the calculations, I just want to make sure you know that these distributions are rather "magic numbers" whose sole function (as labels) is to encourage the predicted distribution to be closer to a "thin" Gaussian. So may I ask why you cannot use the exact same distribution?

I don't quite understand what you mean by "corresponding softmax to be 1". Could you elaborate?

PosoSAgapo (Author) commented:

I do know that the target is to make the prediction closer to the Gaussian. However, my goal is to replace the thin Gaussian with another distribution that I believe models the time distribution more accurately. Your probabilities are still useful: I treat them as a prior, and my calculated distribution would be the posterior. Therefore I have to compute a distribution on my own, which I tried to implement based on the paper's description, but the output is not the same.

PosoSAgapo (Author) commented:

To elaborate: labels far from the target get a probability very close to 0. The softmax is p(x_n) = e^{x_n} / (e^{x_1} + e^{x_2} + ...). Since the Gaussian probabilities are all smaller than 1, a very small probability and a large one do not differ much after exponentiation: e^0 = 1 and e^1 = e, where 0 and 1 stand for the probabilities. Even a probability of 1 ends up only about e times larger than e^0, so the softmax output is nearly uniform and my code following the paper cannot converge to your values.
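
A toy example of this flattening effect (numbers made up purely for illustration):

import torch

# peaked, Gaussian-like values, all between 0 and 1
probs = torch.tensor([0.40, 0.10, 0.01, 0.0001])

print(probs / probs.sum())   # plain normalization keeps the peak: ~[0.78, 0.20, 0.02, 0.0002]
print(probs.softmax(dim=0))  # softmax nearly flattens it: ~[0.32, 0.24, 0.22, 0.22]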

Slash0BZ (Member) commented:

I see the issue now. I will try to find what exactly happened during that computation and get back to you. Right now I suspect that some kind of scaling was done to avoid the softmax issue.

PosoSAgapo (Author) commented:

Thank you for your reply ;)

Slash0BZ (Member) commented:

Okay, as it turned out, I used an [x / sum(X)] normalization instead of a softmax. The ratios in the final probability vectors are one-to-one proportional to those of the PDF values. It was a mistake in the writing, apologies.
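
In other words, each target probability is just the PDF value divided by the sum over the nine units: p_i = pdf(d_i) / sum_j pdf(d_j), not exp(pdf(d_i)) / sum_j exp(pdf(d_j)).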

PosoSAgapo (Author) commented:

OK, but a quick modification still doesn't seem to give the same result... Does your log use e as its base? I will also check my implementation.

import torch

log_sec = [torch.log(torch.tensor(s)) for s in
           [1.0, 60.0, 3600.0, 86400.0, 604800.0, 2592000.0, 31104000.0, 311040000.0, 3110400000.0]]
sigma = 4
target = torch.log(torch.tensor(60.0))
value = []
for n in log_sec:
    # unnormalized Gaussian values
    value.append(torch.exp(-(n - target) ** 2 / (2 * sigma ** 2)))
value = torch.stack(value)
print(value / value.sum())

This outputs:
tensor([2.3882e-01, 4.0326e-01, 2.3882e-01, 7.7235e-02, 2.8334e-02, 1.1466e-02,
        1.8018e-03, 2.2979e-04, 2.1041e-05])

Slash0BZ (Member) commented:

Yes, here is a quick re-implementation; please let me know if you spot anything out of the ordinary:

from scipy.stats import norm
import math

convert_map = {
    "seconds": 1.0,
    "minutes": 60.0,
    "hours": 60.0 * 60.0,
    "days": 24.0 * 60.0 * 60.0,
    "weeks": 7.0 * 24.0 * 60.0 * 60.0,
    "months": 30.0 * 24.0 * 60.0 * 60.0,
    "years": 365.0 * 24.0 * 60.0 * 60.0,
    "decades": 10.0 * 365.0 * 24.0 * 60.0 * 60.0,
    "centuries": 100.0 * 365.0 * 24.0 * 60.0 * 60.0,
}

# note: log base 2, not the natural log
means = [math.log(x[1], 2.0) for x in convert_map.items()]

for mean in means:
    # Gaussian PDF values centered on this unit, sigma = 4 (in log2-seconds)
    all_vals = []
    for m in means:
        all_vals.append(norm.pdf(m, mean, 4.0))
    print(all_vals)
    # normalize by the sum rather than applying a softmax
    s = sum(all_vals)
    for i in range(len(all_vals)):
        all_vals[i] /= s
    print(all_vals)
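
Running the "minutes" row of this should give approximately [0.197, 0.585, 0.197, 0.019, 0.0023, ...], matching the hard-coded soft_target quoted earlier; note that, besides the normalization, the log here is base 2 rather than base e, which also differs from your snippet above.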

PosoSAgapo (Author) commented:

OK, thank you for your implementation; I will take a brief look and try it. Sincere thanks for your kind help.
