Updated README

Files changed:
- README.md +191 -5
- app.py +1 -1
- test_model.py +6 -6
README.md CHANGED

Removed from the previous version:

@@ -11,13 +11,192 @@ pinned: false
-Hello! This is a project for CS-UY 4613: Artificial Intelligence. I'm providing a step-by-step instruction on finetuning language models for detecting toxic tweets.
-#
-
-Link to

@@ -121,4 +300,11 @@ trainer.push_to_hub()
-

The updated README content (new lines 11-199 and 300-310):

# AI Project: Finetuning Language Models - Toxic Tweets

Hello! This is a project for CS-UY 4613: Artificial Intelligence. I'm providing step-by-step instructions on finetuning language models for detecting toxic tweets. All code is well commented.

# Everything you need to know

Link to the HuggingFace space: https://huggingface.co/spaces/andyqin18/sentiment-analysis-app

- Code behind the app: [app.py](app.py)

Link to the finetuned model: https://huggingface.co/andyqin18/finetuned-bert-uncased

- Code for finetuning the language model: [finetune.ipynb](milestone3/finetune.ipynb)

The performance of the model, measured with [test_model.py](test_model.py), is shown below. The result is generated on 2000 randomly selected samples from [train.csv](milestone3/comp/train.csv):

```
{'label_accuracy': 0.9821666666666666,
'prediction_accuracy': 0.9195,
'precision': 0.8263888888888888,
'recall': 0.719758064516129}
```

Now let's walk through the details :)

# Milestone 1 - Setup

This milestone covers setting up Docker and creating a development environment on Windows 11.

## 1. Enable WSL2 feature

The Windows Subsystem for Linux (WSL) lets developers install a Linux distribution on Windows.

```
wsl --install
```

Ubuntu is the default distribution installed and WSL2 is the default version.
After creating a Linux username and password, Ubuntu shows up in Windows Terminal.
Details can be found [here](https://learn.microsoft.com/en-us/windows/wsl/install).



## 2. Download and install the Linux kernel update package

The package needs to be downloaded before installing Docker Desktop.
However, this error might occur:

`Error: wsl_update_x64.msi unable to run because "This update only applies to machines with the Windows Subsystem for Linux"`

Solution: open Windows Features and enable "Windows Subsystem for Linux".
The update [package](https://docs.microsoft.com/windows/wsl/wsl2-kernel) then runs successfully.



## 3. Download Docker Desktop

After downloading and installing the [Docker app](https://www.docker.com/products/docker-desktop/), the WSL2-based engine is enabled automatically.
If not, follow [this link](https://docs.docker.com/desktop/windows/wsl/) for steps to turn on the WSL2 backend.
Open the app and run `docker version` in the Terminal to check that the server is running.


Docker is ready to go.

## 4. Create project container and image

First we download the Ubuntu image from Docker's library with:
```
docker pull ubuntu
```
We can check the available images with:
```
docker image ls
```
We can create a container named *AI_project* based on the Ubuntu image with:
```
docker run -it --name=AI_project ubuntu
```
The `-it` options launch the container in interactive mode with a terminal attached.
After this, a shell is spawned and we land in a Linux terminal inside the container.
`root` is the currently logged-in user (with the highest privileges), and `249cf37645b4` is the container ID.



## 5. Hello World!

Now we can set up the container by installing the Python and pip needed for the project.
First we update and upgrade packages (`apt` is the Advanced Packaging Tool):
```
apt update && apt upgrade
```
Then we install Python and pip with:
```
apt install python3 pip
```
We can confirm a successful installation by checking the current versions of Python and pip.
Then create a script file *hello_world.py* under the `root` directory and run it.
You should see the following in VS Code and the Terminal.




## 6. Commit changes to a new image specifically for the project

After setting up the container, we can commit the changes to a dedicated project image with the tag *milestone1*:
```
docker commit [CONTAINER] [NEW_IMAGE]:[TAG]
```
Now if we check the available images, there should be a new image for the project. If we list all containers, we should be able to identify the one we were working on through its container ID.



The Docker Desktop app should match the image list we see in the Terminal.



# Milestone 2 - Sentiment Analysis App w/ Pretrained Model

This milestone covers creating a Streamlit app on HuggingFace for sentiment analysis.

## 1. Space setup

After creating a HuggingFace account, we can create our app as a Space and choose Streamlit as the Space SDK.



Then we go back to our GitHub repo and create the following files.
For the Space to run properly, there must be at least three files in the root directory:
[README.md](README.md), [app.py](app.py), and [requirements.txt](requirements.txt).

Make sure the following metadata is at the top of **README.md** so HuggingFace can identify the Space:
```
---
title: Sentiment Analysis App
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: streamlit
sdk_version: 1.17.0
app_file: app.py
pinned: false
---
```

The **app.py** file is the main code of the app, and **requirements.txt** should list all the libraries the code uses. HuggingFace installs the listed libraries in the virtual environment before running the app.
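
The repo's requirements.txt itself is not shown in this diff; for an app like this one, a plausible minimal version (an assumption, not the actual file) would be:

```
streamlit
transformers
torch
```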


## 2. Connect and sync to HuggingFace

Then we create an access token on HuggingFace and add it to the settings of the GitHub repo as a secret named `HF_TOKEN`, so the workflow can push to the new HuggingFace space.




Next, we need to set up a workflow in GitHub Actions. Click "set up a workflow yourself" and replace all the code in `main.yaml` with the following (replace `HF_USERNAME` and `SPACE_NAME` with our own):

```
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push --force https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

The repo is now connected and synced with the HuggingFace space!

## 3. Create the app

Modify [app.py](app.py) so that it takes in a text and generates an analysis using one of the provided models. Details are explained in the comments. The app should look like this:



# Milestone 3 - Finetuning Language Models

In this milestone we finetune our own language model on HuggingFace for sentiment analysis.

Here's the setup block that includes all modules:

[... the setup code block and the rest of the Milestone 3 walkthrough (README lines 203-299) are unchanged in this commit and omitted from the diff ...]
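
The setup block itself lives in [finetune.ipynb](milestone3/finetune.ipynb) and is not part of this diff. For orientation only, a typical setup for finetuning a BERT-style classifier with the HuggingFace `Trainer` (the second hunk header above references `trainer.push_to_hub()`, so the notebook presumably uses `Trainer`) might import modules like these; treat the exact list as an assumption:

```
# Illustrative setup imports for finetuning (assumed, not the notebook's actual block)
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
```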

Modify [app.py](app.py) so that it takes in a text and generates an analysis using one of the provided models. Details are explained in the comments. The app should look like this:



## Reference

For connecting GitHub with HuggingFace, check this [video](https://www.youtube.com/watch?v=8hOzsFETm4I).

For creating the app, check this [video](https://www.youtube.com/watch?v=GSt00_-0ncQ).

The HuggingFace documentation is [here](https://huggingface.co/docs), and the Streamlit API reference is [here](https://docs.streamlit.io/library/api-reference).

app.py CHANGED

@@ -18,7 +18,7 @@ def analyze(model_name: str, text: str, top_k=1) -> dict:
    return classifier(text)

# App title
-st.title("Sentiment Analysis App")
+st.title("Toxic Tweet Detection and Sentiment Analysis App")
st.write("This app is to analyze the sentiments behind a text.")
st.write("You can choose to use my fine-tuned model or pre-trained models.")

test_model.py CHANGED

@@ -6,8 +6,8 @@ from tqdm import tqdm

# Global var
-TEST_SIZE =
-FINE_TUNED_MODEL = "andyqin18/
+TEST_SIZE = 2000
+FINE_TUNED_MODEL = "andyqin18/finetuned-bert-uncased"


# Define analyze function

@@ -77,8 +77,8 @@ for comment_idx in tqdm(range(TEST_SIZE), desc="Analyzing..."):

# Calculate performance
performance = {}
-performance["label_accuracy"] = total_true/(len(labels) * TEST_SIZE)
-performance["prediction_accuracy"] = total_success/TEST_SIZE
-performance["precision"] = TP / (TP + FP)
-performance["recall"] = TP / (TP + FN)
+performance["label_accuracy"] = total_true/(len(labels) * TEST_SIZE)  # Success prediction of each label
+performance["prediction_accuracy"] = total_success/TEST_SIZE  # Success prediction of all 6 labels for 1 sample
+performance["precision"] = TP / (TP + FP)  # Label precision
+performance["recall"] = TP / (TP + FN)  # Label recall
print(performance)
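
For context, here is a minimal sketch of how the counters behind these formulas might be accumulated per sample. This is an illustration, not the repo's exact code; it assumes `true` and `pred` are binary vectors over the 6 toxicity labels:

```
# Illustrative accumulation of TP/FP/FN and the two accuracy counters
# (assumed logic, not the repo's exact implementation).
TP = FP = FN = 0     # label-level true positives, false positives, false negatives
total_true = 0       # number of individual labels predicted correctly
total_success = 0    # number of samples with all 6 labels predicted correctly

def update_counters(true, pred):
    global TP, FP, FN, total_true, total_success
    for t, p in zip(true, pred):
        if t == 1 and p == 1:
            TP += 1
        elif t == 0 and p == 1:
            FP += 1
        elif t == 1 and p == 0:
            FN += 1
        total_true += int(t == p)
    total_success += int(all(t == p for t, p in zip(true, pred)))

# Example: a sample labeled toxic + insult, where the model only flags toxic.
update_counters(true=[1, 0, 0, 0, 1, 0], pred=[1, 0, 0, 0, 0, 0])
```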