Update README architecture
README.md
CHANGED
@@ -160,8 +160,10 @@ The following environment variables can be set to customize the behavior of the
 
 
 ### Architecture
-The input text is first preprocessed and tokenized using `spaCy` where:
--
+The input text is first preprocessed and tokenized using `re` and `spaCy` where:
+- The text is cleaned up by removing any HTML tags and converting emojis to text
+- Stop words and punctuation are removed
+- URLs, email addresses and numbers are removed
 - Words are converted to lowercase
 - Lemmatization is performed (words are converted to their base form based on the surrounding context)
 
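
The preprocessing steps listed in the updated Architecture section could be sketched roughly as below. This is a standard-library-only illustration and not the repository's code: the function name `preprocess`, the regular expressions, the tiny `STOP_WORDS` set, and the `EMOJI_NAMES` table are all assumptions made for the example. The lemmatization step, which the README attributes to `spaCy`, is represented here by a pluggable `lemmatize` callable that defaults to returning the token unchanged.

```python
import re

# Hypothetical stand-ins: the README says spaCy supplies stop words and
# lemmas, and the diff does not show how emojis are converted to text.
STOP_WORDS = {"a", "an", "and", "or", "the", "is", "are", "at", "to", "of"}
EMOJI_NAMES = {"\U0001F600": "grinning face", "\u2764": "red heart"}

# Assumed patterns; the project's actual expressions are not in the diff.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")
TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters, so punctuation is dropped


def preprocess(text, lemmatize=lambda token: token):
    """Return cleaned, lowercased tokens following the README's bullet list."""
    # Clean up: strip HTML tags and replace known emojis with their names.
    text = HTML_TAG.sub(" ", text)
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, f" {name} ")
    # Remove URLs, email addresses, and numbers.
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = NUMBER.sub(" ", text)
    # Lowercase, tokenize (which discards punctuation), and drop stop words.
    tokens = [t for t in TOKEN.findall(text.lower()) if t not in STOP_WORDS]
    # Lemmatization is delegated to the caller-supplied function.
    return [lemmatize(t) for t in tokens]


if __name__ == "__main__":
    sample = "<p>Visit https://example.com, the 3 Cats are here \U0001F600!</p>"
    toy_lemmas = {"cats": "cat"}  # toy lookup standing in for a real lemmatizer
    print(preprocess(sample, lemmatize=lambda t: toy_lemmas.get(t, t)))
```

In a `spaCy`-based version, the stop-word, punctuation, and lemma decisions would more likely come from token attributes such as `token.is_stop`, `token.is_punct`, and `token.lemma_` on a loaded pipeline, with `re` handling the HTML, URL, and email cleanup beforehand. That split is an inference from the README wording, not something the diff shows.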