<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# BORT

## Overview
The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the
authors refer to as "Bort".
The abstract from the paper is the following:

*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
Tips:

- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
  model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples. A short loading sketch is shown after this list.
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology),
  which sadly is not open-sourced yet. It would be very useful for the community if someone implemented the
  algorithm so that BORT fine-tuning works.
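
Because BORT combines BERT's model classes with RoBERTa's tokenizer, a checkpoint can be loaded with the standard
`BertModel` and `RobertaTokenizer` classes. The snippet below is a minimal sketch; the checkpoint name `"amazon/bort"`
is an assumption and should be replaced with the checkpoint you actually use.

```python
from transformers import BertModel, RobertaTokenizer

# BORT reuses BERT's model classes and RoBERTa's tokenizer.
# "amazon/bort" is an assumed checkpoint identifier; adjust as needed.
tokenizer = RobertaTokenizer.from_pretrained("amazon/bort")
model = BertModel.from_pretrained("amazon/bort")

inputs = tokenizer("BORT is a compressed variant of BERT.", return_tensors="pt")
outputs = model(**inputs)

# The hidden states can feed a downstream head, keeping in mind that
# full fine-tuning would require the (not yet open-sourced) Agora algorithm.
print(outputs.last_hidden_state.shape)
```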

This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).