assert_packing_loss.py Invalid for deepseek-v2-lite #266

Open
bao-xiaoyi opened this issue Sep 13, 2024 · 13 comments

Comments

@bao-xiaoyi

Running assert_packing_loss.py on deepseek-v2-lite fails with:

RuntimeError: CUDA error: an illegal memory access was encountered

Looking forward to an answer from the experts.

@khai-meetkai
Collaborator

Hi @bao-xiaoyi, can you send me the command you used to run assert_packing_loss.py?

@bao-xiaoyi
Author

python assert_packing_loss.py /kas/kas_workspace/open_llm/DeepSeek-Coder-V2-Lite-Instruct

@bao-xiaoyi
Author

Hi @bao-xiaoyi, can you send me the command you used to run assert_packing_loss.py?

Additionally, when I test with StarCoder2, an error is also reported:

assert (
    original_token_count == mk_token_count
), f"number of tokens for computing loss is different: original_token_count = {original_token_count}, mk_token_count={mk_token_count}"

@bao-xiaoyi
Author

Hi @bao-xiaoyi, can you send me the command you used to run assert_packing_loss.py?

When I use StarCoder2, original_token_count = 147277 and mk_token_count = 4014.

@khai-meetkai
Collaborator

khai-meetkai commented Sep 13, 2024

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do the following:

  • Copy all .py files from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct/tree/main and save them to a folder, for example remote_deepseek (inside the packing folder). Then directly replace the function _get_unpad_data in that copy with the monkey-patched code below (this is equivalent to the monkey patch: modeling_deepseek._get_unpad_data = get_unpad_data). You can also download the remote_deepseek.zip I attached to this post.
import torch
import torch.nn.functional as F


def get_max_seqlen_in_batch(attention_mask):
    max_num = torch.max(attention_mask)
    # attention_mask: B x N
    counts = []
    for i in range(1, max_num + 1):
        counts.append(
            torch.sum(attention_mask == i, axis=-1)
        )  # shape: B, count length of data point masked with i
    result = torch.stack(counts, axis=1)
    result = result.flatten()
    return result[result.nonzero()].squeeze(-1).to(dtype=torch.int32)


def _get_unpad_data(attention_mask):
    print("monkey-patched")
    seqlens_in_batch = get_max_seqlen_in_batch(
        attention_mask
    )  # instead of attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(
        torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)
    )
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )
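
For clarity, a minimal sketch of the equivalent monkey patch applied to the local copy; the remote_deepseek import path is an assumption based on the folder layout suggested above:

# minimal sketch: patch the local copy of the remote code instead of editing the file
# (assumes the .py files were saved under packing/remote_deepseek with an __init__.py)
from remote_deepseek import modeling_deepseek

modeling_deepseek._get_unpad_data = _get_unpad_data  # the packing-aware version defined above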

About assert_packing_loss.py, you can change it as follows:

  • when computing the loss on the original data, load the model using transformers.AutoModelForCausalLM
  • when computing the loss on the packed data, load the model using DeepseekV2ForCausalLM (from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM); a rough sketch of this is shown after the attachments below
    You will see that the loss results are almost the same; the difference is only 0.0021%.
    You can also download the assert_packing_loss.py I provided in this post.

remote_deepseek.zip
assert_packing_loss.py.zip
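
For reference, a rough sketch of the two loading paths described above; this is not the exact content of the attached script, and the dtype and attn_implementation arguments are assumptions for illustration:

import torch
import transformers
from remote_deepseek.modeling_deepseek import DeepseekV2ForCausalLM

pretrained_path = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

# loss on the original (unpacked) data: load via AutoModelForCausalLM with the remote code
original_model = transformers.AutoModelForCausalLM.from_pretrained(
    pretrained_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# loss on the packed data: load the local copy whose _get_unpad_data has been replaced
packed_model = DeepseekV2ForCausalLM.from_pretrained(
    pretrained_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # packing relies on the flash-attention unpad path
)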

@khai-meetkai
Collaborator

khai-meetkai commented Sep 13, 2024

@bao-xiaoyi for starcoder, which base_model did you use? I tested the following command and it works:

python assert_packing_loss.py bigcode/starcoder2-7b

@bao-xiaoyi
Author

@bao-xiaoyi for starcoder, which base_model did you use? I tested the following command and it works:

python assert_packing_loss.py bigcode/starcoder2-7b

I chose the 15b model, and the difference in the average loss is a bit large.

@bao-xiaoyi
Author

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do the following: […]

I don't quite understand why local code has to be used when packing, while remote code can be used when not packing. And why doesn't modeling_deepseek._get_unpad_data = get_unpad_data work?

@bao-xiaoyi
Author

Hi @bao-xiaoyi, I think the reason for this error is that this model uses remote code (i.e., it is using modeling_deepseek.py). So you can do the following: […]

Moreover, the difference in time consumption does not seem as dramatic as shown in the README. I tested DeepSeek using the code you modified, and the time comparison is 18.712671 vs 7.400667, or 9.163215 vs 6.737796.

@khai-meetkai
Collaborator

@bao-xiaoyi I think directly monkey-patching remote code (trust_remote_code=True) doesn't work; to find out the reason we would have to dig deeper into how transformers implements this feature, and I haven't investigated it yet.
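
One plausible explanation, stated here as an assumption rather than something verified in this thread: with trust_remote_code=True, transformers imports its own copy of modeling_deepseek.py under the dynamically created transformers_modules package, so patching a separately imported modeling_deepseek never touches the module the model actually runs. A minimal sketch of patching the dynamically loaded module instead:

# sketch under the assumption above: patch the module the loaded model actually uses,
# which for trust_remote_code=True lives under the dynamically created transformers_modules package
import sys

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True
)
remote_module = sys.modules[type(model).__module__]  # the dynamically imported modeling_deepseek
remote_module._get_unpad_data = _get_unpad_data      # apply the packing-aware replacement after loading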

@khai-meetkai
Collaborator

By the way, I have just run:

python original_assert.py bigcode/starcoder2-15b
No errors were found; the difference between the losses is only 0.0011%.

@bao-xiaoyi
Author

@bao-xiaoyi I think directly monkey-patching remote code (trust_remote_code=True) doesn't work; to find out the reason we would have to dig deeper into how transformers implements this feature, and I haven't investigated it yet.

Can you provide the timing results from your tests on DeepSeek? Thank you very much.

@khai-meetkai
Collaborator

khai-meetkai commented Sep 13, 2024

Running python assert_packing_loss.py deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct with the assert_packing_loss.py I sent you above:
time for computing the loss without packing: 9.336643
time for computing the loss with packing: 2.348312
