Fix bugs for bias (#3)

Commit: ffd4f7c02c8029728f7479c1163163362ffcf11d

Files changed:
- README.md (+20 -15)
- model.safetensors (+1 -1)

README.md (CHANGED)
@@ -10,6 +10,11 @@ language:
 The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
+## What's New
+
+- November 2024 (`v1.0.1`): Bug fix for the model parameters.
+  - The up_proj's bias was initialized with the gate projection's bias. This bug was fixed.
+
 ## Model Details
 
 ### Model Description
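The fix above can be spot-checked from Python: load the updated checkpoint and confirm that each up_proj bias no longer equals the corresponding gate bias. This is a minimal sketch rather than code from the model card; the parameter-name patterns (`up_proj.bias`, `gate`) are assumptions about the custom RetrievaBERT modules and may need adjusting.

```python
# Minimal sketch: check whether up_proj biases are tied to the gate biases.
# Parameter-name substrings below are assumptions about the custom implementation.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "retrieva-jp/bert-1.3b", trust_remote_code=True
)

params = dict(model.named_parameters())
up_biases = sorted(n for n in params if n.endswith("up_proj.bias"))
gate_biases = sorted(n for n in params if "gate" in n and n.endswith(".bias"))

# In the buggy release, up_proj biases were reportedly copies of the gate biases;
# after v1.0.1 the pairs below should generally differ.
for up_name, gate_name in zip(up_biases, gate_biases):
    same = torch.equal(params[up_name].data, params[gate_name].data)
    print(f"{up_name} == {gate_name}: {same}")
```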
@@ -19,12 +24,12 @@ The **RetrievaBERT** is the pre-trained Transformer Encoder using Megatron-LM.
 It is designed for use in Japanese.
 
 This model offers several advanced features compared to traditional BERT models:
-- **PreNorm**: Improved stability during training.
-- **SwiGLU**: Enhanced activation function for better performance.
-- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
-- **Max Sequence Length**: 2048 tokens, allowing for longer context.
-- **Parameters**: 1.3 billion parameters.
-- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
+- **PreNorm**: Improved stability during training.
+- **SwiGLU**: Enhanced activation function for better performance.
+- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
+- **Max Sequence Length**: 2048 tokens, allowing for longer context.
+- **Parameters**: 1.3 billion parameters.
+- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
 - **Token Type IDs**: Not used in this model.
 
 ### Model Sources
@@ -44,9 +49,9 @@ Depending on your use case, follow the appropriate section below.
 
 This model is pre-trained using Masked Language Modeling.
 The mask token used is `<MASK|LLM-jp>`.
-Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
-
-Example code for direct use:
+Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
+
+Example code for direct use:
 
 ```python
 from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
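The README's example continues past this hunk boundary. For reference, a minimal fill-mask sketch in the same spirit (this is not the model card's exact snippet, and the input sentence is only illustrative):

```python
# Minimal fill-mask sketch; the model card's full example may differ.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
# trust_remote_code is required because RetrievaBERT uses a custom model implementation.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# The mask token for this model is <MASK|LLM-jp>.
print(fill_mask("こんにちは。私は<MASK|LLM-jp>です。"))
```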
@@ -98,7 +103,7 @@ The model was trained on the following hyperparameters.
 - Floating point expression: BF16
 
 ## Evaluation
-We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
+We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
 We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
 
 | Model                            | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
@@ -106,7 +111,7 @@ We adjusted the learning rate and training epochs for each model and task in acc
 | tohoku-nlp/bert-base-japanese-v3 | 0.957       | 0.914        | 0.876         | 0.906    | 0.878     | 0.946     | 0.849      |
 | tohoku-nlp/bert-large-japanese-v2| 0.959       | 0.916        | 0.877         | 0.901    | 0.884     | 0.951     | 0.867      |
 | ku-nlp/deberta-v3-base-japanese  | 0.958       | 0.925        | 0.890         | 0.902    | 0.925     | 0.910     | 0.882      |
-| retrieva-jp/bert-1.3b            | 0.
+| retrieva-jp/bert-1.3b            | 0.959       | 0.917        | 0.881         | 0.898    | 0.875     | 0.874     | 0.827      |
 
 
 ## Technical Specifications
@@ -121,9 +126,9 @@ The RetrievaBERT model is based on BERT with the following hyperparameters:
 - Maximum length of position embeddings: 2048
 
 As mentioned earlier, the main differences from the original BERT are:
-- PreNorm: Improved stability during training.
-- SwiGLU: Enhanced activation function for better performance.
-- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
+- PreNorm: Improved stability during training.
+- SwiGLU: Enhanced activation function for better performance.
+- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
 
 
 ### Compute Infrastructure
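Both feature lists in this diff single out SwiGLU. A generic SwiGLU feed-forward sketch (not the repository's actual implementation; names are placeholders) shows why the gate and up projections each carry their own bias vector, the two parameters involved in the bias fix above:

```python
# Illustrative SwiGLU feed-forward block (not the repository's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        # gate_proj and up_proj each have their own, distinct bias vector.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate(x)) multiplied elementwise with up(x), then projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```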
@@ -145,4 +150,4 @@ https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
 Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
 
 ## Model Card Contact
-pr@retrieva.jp
+pr@retrieva.jp
model.safetensors (CHANGED)

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
 size 2602880000
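The `oid` in the new LFS pointer is the SHA-256 digest of the weights file, so a downloaded copy can be checked against it. A minimal sketch (the local file path is an assumption):

```python
# Minimal sketch: verify a downloaded model.safetensors against the LFS pointer above.
import hashlib

path = "model.safetensors"  # assumed local path to the downloaded weights
digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        digest.update(chunk)

print(digest.hexdigest())
# Expected for this commit:
# 994bd099f4bb0c9bab36ed16e1a8271f46f637de6b06e32fa1f29643d7b528c9
```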

