loodvanniekerkginkgo committed
Commit 27f3da5 · 1 Parent(s): 672339b

Commit before claude code

about.py CHANGED
@@ -19,13 +19,25 @@ Antibodies have to be manufacturable, stable in high concentrations, and have lo
 Properties such as these can often hinder the progression of an antibody to the clinic, and are collectively referred to as 'developability'.
 Here we invite the community to develop and submit better predictors, which will be tested on a heldout private set to assess model generalization.
 
+#### 🧬 Developability properties in this competition
+
+1. 💧 Hydrophobicity
+2. 🎯 Polyreactivity
+3. 🧲 Self-association
+4. 🌡️ Thermostability
+5. 🧪 Titer
+
 #### 🏆 Prizes
 
 For each of the 5 properties in the competition, there is a prize for the model with the highest performance for that property on the private test set.
 There is also an 'open-source' prize for the best model trained on the GDPa1 dataset of monoclonal antibodies (reporting cross-validation results) and assessed on the private test set, where authors provide all training code and data.
-For each of these 6 prizes, participants have the choice between **$10k in data generation credits** with [Ginkgo Datapoints](https://datapoints.ginkgo.bio/) or a **cash prize** with a value of **$2000**.
+For each of these 6 prizes, participants have the choice between
+- **$10 000 in data generation credits** with [Ginkgo Datapoints](https://datapoints.ginkgo.bio/), or
+- A **$2000 cash prize**.
 
 See the "{FAQ_TAB_NAME}" tab above (you are currently on the "{ABOUT_TAB_NAME}" tab) or the [competition terms]({TERMS_URL}) for more details.
+
+---
 """
 
 ABOUT_TEXT = f"""

@@ -34,13 +46,15 @@ ABOUT_TEXT = f"""
 
 1. **Create a Hugging Face account** [here](https://huggingface.co/join) if you don't have one yet (this is used to track unique submissions and to access the GDPa1 dataset).
 2. **Register your team** on the [Competition Registration](https://datapoints.ginkgo.bio/ai-competitions/2025-abdev-competition) page.
-3. **Build a model** or validate it on the [GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1) dataset.
-4. **Complete the "Qualifying Exam"**. Before you can submit to the final test set, you must first get a score on the public leaderboard. Choose one of the two tracks:
-   - Track 1 (Benchmark an existing model): Submit predictions for the `GDPa1` dataset.
-   - Track 2 (Train from scratch): Train a model using cross-validation on the `GDPa1` dataset and submit cross-validation predictions by selecting `GDPa1_cross_validation`.
-5. **Submit to the "Final Exam"**. Once you have submitted predictions on the validation set, download the private test set sequences from the {SUBMIT_TAB_NAME} tab and submit your final predictions. Your performance on this private set will determine the winners.
+3. **Build a model** using cross-validation on the [GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1) dataset, splitting it into folds with the `hierarchical_cluster_IgG_isotype_stratified_fold` column, and write out all cross-validation predictions to a CSV file (see the sketch below).
+4. **Use your model to make predictions** on the private test set (download the 80 private test set sequences from the {SUBMIT_TAB_NAME} tab).
+5. **Submit your training and test set predictions** on the {SUBMIT_TAB_NAME} tab by uploading both your cross-validation and private test set CSV files.
 
-Submissions close on **1 November 2025**.
+Check out our introductory tutorial on training an antibody developability prediction model with cross-validation [here]({TUTORIAL_URL}).
+
+⏰ Submissions close on **1 November 2025**.
+
+---
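
To make step 3 concrete, here is a minimal sketch of the cross-validation loop, assuming `df` is the GDPa1 table loaded with pandas and `X` is a NumPy feature matrix aligned to its rows (how you featurize the sequences is up to you). The fold column name comes from the instructions above; `Ridge`, the `"HIC"` target, and the output file name are illustrative placeholders, not the competition's reference implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

FOLD_COL = "hierarchical_cluster_IgG_isotype_stratified_fold"

def cross_validate(df: pd.DataFrame, X: np.ndarray, assay: str = "HIC") -> pd.DataFrame:
    """Train on the other folds, predict the held-out fold, for each fold in turn."""
    preds = pd.Series(index=df.index, dtype=float)
    for fold in sorted(df[FOLD_COL].unique()):
        test = df[FOLD_COL] == fold
        train = ~test & df[assay].notna()  # some assay values can be missing
        model = Ridge().fit(X[train.to_numpy()], df.loc[train, assay])
        preds[test] = model.predict(X[test.to_numpy()])
    out = df[["antibody_name", FOLD_COL]].copy()
    out[assay] = preds
    return out

# cross_validate(df, X).to_csv("gdpa1_cv_predictions.csv", index=False)
```

Fitting only on the other folds, never on the held-out fold, is what keeps the reported cross-validation correlations honest with respect to the eventual private test set.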
 
 #### Acknowledgements
 

@@ -53,6 +67,8 @@ We gratefully acknowledge [Tamarind Bio](https://www.tamarind.bio/)'s help in ru
 
 We're working on getting more public models added, so that participants have more precomputed features to use for modeling.
 
+---
+
 #### How to contribute?
 
 We'd like to add some more existing developability models to the leaderboard. Some examples of models we'd like to add:

@@ -62,6 +78,8 @@ We'd like to add some more existing developability models to the leaderboard. So
 
 If you would like to form a team or discuss ideas, join the [Slack community]({SLACK_URL}) co-hosted by Bits in Bio.
 """
+# TODO(Lood): Add "📊 The first test set results will be released on October 13th, ahead of the final submission deadline on November 1st."
+
 
 # Note(Lood): Significance: Add another note of "many models are trained on different datasets, and differing train/test splits, so this is a consistent way of comparing for a heldout set"
 FAQS = {

@@ -98,7 +116,7 @@ FAQS = {
     ),
     "How exactly can I evaluate my model?": (
         "You can easily calculate the Spearman correlation coefficient on the GDPa1 dataset yourself before uploading to the leaderboard. "
-        "Simply use the `spearmanr(predictions, targets, nan_policy='omit')` function from `scipy.stats`. "
+        "Simply use the `spearmanr(predictions, targets, nan_policy='omit')` function from `scipy.stats` to calculate the Spearman correlation coefficient for each of the 5 folds, and then take the average. "
         "For the heldout private set, we will calculate these Spearman correlations privately at the end of the competition (and possibly at other points throughout the competition) - but there will not be 'rolling results' on the private test set to prevent test set leakage."
     ),
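
To make the evaluation answer above concrete, here is a minimal sketch of the per-fold metric, assuming `cv` is your cross-validation predictions table; `pred_col` and `target_col` are whatever your CSV calls the prediction and assay columns (illustrative names), and the fold column is the one from the dataset:

```python
import pandas as pd
from scipy.stats import spearmanr

FOLD_COL = "hierarchical_cluster_IgG_isotype_stratified_fold"

def mean_fold_spearman(cv: pd.DataFrame, pred_col: str, target_col: str) -> float:
    """Spearman rho per fold, averaged across folds (the averaging the FAQ describes)."""
    rhos = []
    for _, fold_df in cv.groupby(FOLD_COL):
        rho, _ = spearmanr(fold_df[pred_col], fold_df[target_col], nan_policy="omit")
        rhos.append(rho)
    return float(sum(rhos) / len(rhos))
```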
     "How often does the leaderboard update?": (

@@ -114,7 +132,7 @@ FAQS = {
         "We reserve the right to award the open-source prize to a predictor with competitive results for a subset of properties (e.g. a top polyreactivity model)."
     ),
     "How does the open-source prize work?": (
-        "Participants who open-source their code and methods will be eligible for the open-source prize (as well as the other prizes)."
+        "Participants who open-source their training code and methods will be eligible for the open-source prize (as well as the other prizes)."
     ),
     "What do I need to submit?": (
         'There is a tab on the Hugging Face competition page to upload predictions for datasets - for each dataset participants need to submit a CSV containing a column for each property they would like to predict (e.g. called "HIC"), '

@@ -124,11 +142,8 @@ FAQS = {
     "Can I submit predictions for only one property?": (
         "Yes. You do not need to predict all 5 properties to participate. Each property has its own leaderboard and prize, so you may submit models for a subset of the assays if you wish."
     ),
-    "Can I switch between Track 1 and Track 2 during the competition?": (
-        "Yes. You may submit to both tracks. For example, you can benchmark an existing model on the GDPa1 dataset (Track 1) and later also train and submit a cross-validation model on GDPa1 (Track 2)."
-    ),
     "Are participants required to use the provided cross-validation splits?": (
-        "Yes, if submitting cross-validation results, to ensure fair comparison. The results will be calculated by taking the average Spearman correlation coefficient across all folds."
+        "Yes, to ensure fair comparison between different trained models. The results will be calculated by taking the average Spearman correlation coefficient across all folds."
     ),
     "Are there any country restrictions for prize eligibility?": (
         "Yes. Due to applicable laws, prizes cannot be awarded to participants from countries under U.S. sanctions. See the competition terms for details."

@@ -141,8 +156,6 @@ FAQS = {
 
 SUBMIT_INTRUCTIONS = f"""
 # Antibody Developability Submission
-Upload CSV files to get your scores!
-List of valid property names: `{', '.join(ASSAY_LIST)}`.
 
 You do **not** need to predict all 5 properties — each property has its own leaderboard and prize.
 

@@ -151,15 +164,16 @@ You do **not** need to predict all 5 properties — each property has its own le
    - **GDPa1 Cross-Validation predictions** (using cross-validation folds)
    - **Private Test Set predictions** (final test submission)
 2. Each CSV should contain `antibody_name` + one column per property you are predicting (e.g. `"antibody_name,Titer,PR_CHO"` if your model predicts Titer and Polyreactivity).
+   - List of valid property names: `{', '.join(ASSAY_LIST)}`.
 
-The GDPa1 results should appear on the leaderboard within a minute, and can also be calculated manually offline. The **private test set results will not appear on the leaderboards**, and will be used to determine the winners at the close of the competition.
+The GDPa1 results should appear on the leaderboard within a minute, and can also be calculated manually using Spearman rank correlation. The **private test set results will not appear on the leaderboards at first**, and will be used to determine the winners at the close of the competition.
 We may release private test set results at intermediate points during the competition.
 
 ## Cross-validation
 
 For the GDPa1 cross-validation predictions, use the `"hierarchical_cluster_IgG_isotype_stratified_fold"` column to split the dataset into folds and make predictions for each of the folds.
 Submit a CSV file in the same format but also containing the `"hierarchical_cluster_IgG_isotype_stratified_fold"` column.
-Check out our tutorial on making an antibody developability prediction model [here]({TUTORIAL_URL}).
+Check out our tutorial on training an antibody developability prediction model with cross-validation [here]({TUTORIAL_URL}).
 
 Submissions close on **1 November 2025**.
 """
app.py CHANGED
@@ -50,7 +50,6 @@ def get_leaderboard_object(assay: str | None = None):
     filter_columns = ["dataset"]
     if assay is None:
         filter_columns.append("property")
-    # TODO how to sort filter columns alphabetically?
     # Bug: Can't leave search_columns empty because then it says "Column None not found in headers"
     # Note(Lood): Would be nice to make it clear that the Search Column is searching on model name
     current_dataframe = pd.read_csv("debug-current-results.csv")

@@ -101,11 +100,6 @@ async def periodic_data_fetch(app):
     event.set()
     t.join(3)
 
-
-# Lood: Two problems currently:
-# 1. The data_version state value isn't being incremented, it seems (even though it's triggering the dataframe change correctly)
-# 2. The global current_dataframe is being shared across all sessions
-
 # Make font size bigger using gradio theme
 with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
     timer = gr.Timer(3)  # Run every 3 seconds when page is focused

@@ -131,6 +125,7 @@ with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
                 show_label=False,
                 show_download_button=False,
                 show_share_button=False,
+                show_fullscreen_button=False,
                 width="25vw",  # Take up the width of the column (2/8 = 1/4)
             )
 

@@ -138,30 +133,34 @@ with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
         with gr.TabItem(ABOUT_TAB_NAME, elem_id="abdev-benchmark-tab-table"):
             gr.Markdown(ABOUT_INTRO)
             gr.Image(
-                value="./assets/prediction_explainer.png",
+                value="./assets/prediction_explainer_cv.png",
                 show_label=False,
                 show_download_button=False,
                 show_share_button=False,
-                width="50vw",
+                show_fullscreen_button=False,
+                width="30vw",
             )
             gr.Markdown(ABOUT_TEXT)
+
+            # Sequence download buttons
+            gr.Markdown(
+                """### 📥 Download Sequences
+The GDPa1 dataset (with assay data and sequences) is available on Hugging Face [here](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1),
+but we provide this and the private test set for convenience.""")
+            with gr.Row():
+                with gr.Column():
+                    download_button_cv_about = gr.DownloadButton(
+                        label="📥 Download GDPa1 sequences",
+                        value=SEQUENCES_FILE_DICT["GDPa1_cross_validation"],
+                        variant="secondary",
+                    )
+                with gr.Column():
+                    download_button_test_about = gr.DownloadButton(
+                        label="📥 Download Private Test Set sequences",
+                        value=SEQUENCES_FILE_DICT["Heldout Test Set"],
+                        variant="secondary",
+                    )
 
-            # Procedurally make these 5 tabs
-            # for i, assay in enumerate(ASSAY_LIST):
-            #     with gr.TabItem(
-            #         f"{ASSAY_EMOJIS[assay]} {ASSAY_RENAME[assay]}",
-            #         elem_id="abdev-benchmark-tab-table",
-            #     ) as tab_item:
-            #         gr.Markdown(f"# {ASSAY_DESCRIPTION[assay]}")
-            #         lb = get_leaderboard_object(assay=assay)
-
-            #         def refresh_leaderboard(assay=assay):
-            #             return format_leaderboard_table(df_results=current_dataframe, assay=assay)
-
-            #         # Refresh when data version changes
-            #         data_version.change(fn=refresh_leaderboard, outputs=lb)
-
-            # Note(Lood): Trying out just one leaderboard. We could also have a dropdown here that shows different leaderboards for each property, but that's just the same as the filters
         with gr.TabItem(
             "🏆 Leaderboard", elem_id="abdev-benchmark-tab-table"
         ) as leaderboard_tab:

@@ -171,18 +170,13 @@ with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
                 Each property has its own prize, and participants can submit models for any combination of properties.
 
                 **Note**: It is *easy to overfit* the public GDPa1 dataset, which results in artificially high Spearman correlations.
-                We would suggest training using cross-validation a limited number of times to give a better indication of the model's performance on the eventual private test set.
+                We would suggest training using cross-validation to give a better indication of the model's performance on the eventual private test set.
                 """
             )
             lb = get_leaderboard_object()
             timer.tick(fn=refresh_overall_leaderboard, outputs=lb)
             demo.load(fn=refresh_overall_leaderboard, outputs=lb)
 
-            # At the bottom of the leaderboard, we can keep as NaN and explain missing test set results
-            # gr.Markdown(
-            #     "_ℹ️ Results for the private test set will not be shown here and will be used for final judging at the close of the competition._"
-            # )
-
         with gr.TabItem(SUBMIT_TAB_NAME, elem_id="boundary-benchmark-tab-table"):
             gr.Markdown(SUBMIT_INTRUCTIONS)
 

@@ -218,9 +212,6 @@ with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
 
                 with gr.Column():
                     gr.Markdown("### Upload Both Submission Files")
-                    gr.Markdown(
-                        "**Both CSV files are required** - you cannot submit without uploading both files."
-                    )
 
                     # GDPa1 Cross-validation file
                     gr.Markdown("**GDPa1 Cross-Validation Predictions:**")

@@ -281,5 +272,5 @@ with gr.Blocks(theme=gr.themes.Default(text_size=sizes.text_lg)) as demo:
 
 if __name__ == "__main__":
     demo.launch(
-        ssr_mode=False, share=True, app_kwargs={"lifespan": periodic_data_fetch}
+        ssr_mode=False, app_kwargs={"lifespan": periodic_data_fetch}
     )
assets/prediction_explainer.png CHANGED

Git LFS Details (before)

  • SHA256: d9ad3ddc3e4da7261b6b1383315023753fcc3de5ec25d681bbfd0bef14d5ad96
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB

Git LFS Details (after)

  • SHA256: 1b164ae8a4b29fee8e18382922c5331ba6c71504e3acbac1341bfb228ebdcc28
  • Pointer size: 131 Bytes
  • Size of remote file: 138 kB
assets/prediction_explainer_cv.png ADDED

Git LFS Details

  • SHA256: 1028b5a4034bbeb403b6a015f831dd5715baaca4698ced2b4fff85da00116297
  • Pointer size: 130 Bytes
  • Size of remote file: 79.6 kB