AI USAGE REPORT - Cloudzy AI Challenge
========================================

PROJECT OVERVIEW:
FastAPI-based photo management system with semantic search, AI image analysis, and text-to-image generation.

WHERE & HOW AI WAS USED:
1. Image Analysis - Structured Metadata (cloudzy/agents/image_analyzer.py)
   - Tool: Qwen/Qwen3-VL-8B-Instruct model via HuggingFace API
   - Function: Auto-generate tags, descriptions, and captions for uploaded photos
   
2. Image Analysis - Aesthetic Descriptions (cloudzy/agents/image_analyzer_2.py)
   - Tool: Gemini-2.0-Flash model via smolagents + OpenAI-compatible API
   - Function: Generate aesthetic image descriptions for inspiration-based generation
   
3. Text-to-Image Generation (cloudzy/inference_models/text_to_image.py)
   - Tool: FLUX.1-dev model via HuggingFace Inference API
   - Function: Generate images from text prompts
   
4. Semantic Search (cloudzy/search_engine.py + cloudzy/routes/search.py)
   - Tool: FAISS (vector database) with embeddings from Qwen/Qwen3-Embedding-8B (4096-dimensional)
   - Function: Find visually similar photos via L2-normalized embedding vectors
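
The search step in item 4 can be sketched without FAISS. This is a minimal illustration, assuming the index compares L2-normalized vectors by squared Euclidean distance (which is what FAISS's IndexFlatL2 reports). The helper names l2_normalize, squared_l2, and search are hypothetical, and the 3-dimensional toy vectors stand in for the 4096-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance; for unit vectors this equals 2 - 2*cos(a, b)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(
    query: List[float],
    photos: Dict[str, List[float]],
    threshold: float = 1.5,  # assumed to apply to squared distances
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return up to top_k (photo_id, distance) pairs within the threshold, nearest first."""
    q = l2_normalize(query)
    scored = [(pid, squared_l2(q, l2_normalize(v))) for pid, v in photos.items()]
    hits = [(pid, d) for pid, d in scored if d <= threshold]
    return sorted(hits, key=lambda h: h[1])[:top_k]


# Toy embeddings (hypothetical data, not from the project).
photos = {
    "beach": [1.0, 0.1, 0.0],
    "sunset": [0.9, 0.3, 0.1],
    "invoice": [-1.0, 0.0, 0.2],
}
results = search([1.0, 0.2, 0.0], photos)
```

Under that assumption, a squared-distance threshold of 1.5 on unit vectors corresponds to a cosine similarity of at least 0.25, since 2 - 2*cos = 1.5 gives cos = 0.25.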

PROMPTS & MODEL INPUTS:
Image Analysis Prompt #1 - Structured Metadata (image_analyzer.py):
"Describe this image in the following exact format: result: {tags: [...], description: '...', caption: '...'}"
- Input: Image URL sent to vision model
- The prompt requests a fixed structure so the response can be parsed into tags, description, and caption
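
Parsing such a response might look like the sketch below. It assumes the model returns either a dict or text containing a JSON object, optionally wrapped in a "result" key. The function name parse_analysis and the empty-result fallback are hypothetical and not taken from image_analyzer.py.

```python
import json
import re
from typing import Any, Dict


def _empty() -> Dict[str, Any]:
    """Fresh fallback record so callers never share a mutable list."""
    return {"tags": [], "description": "", "caption": ""}


def _coerce(data: Any) -> Dict[str, Any]:
    """Unwrap an optional 'result' key and fill in any missing fields."""
    if isinstance(data, dict) and isinstance(data.get("result"), dict):
        data = data["result"]
    if not isinstance(data, dict):
        return _empty()
    return {**_empty(), **data}


def parse_analysis(response: Any) -> Dict[str, Any]:
    """Extract tags/description/caption from a model response."""
    # Some clients already hand back a dict; skip text parsing in that case.
    if isinstance(response, dict):
        return _coerce(response)

    # Otherwise take the outermost {...} span from the text.
    match = re.search(r"\{.*\}", str(response), flags=re.DOTALL)
    if not match:
        return _empty()
    try:
        return _coerce(json.loads(match.group(0)))
    except json.JSONDecodeError:
        # Output that is not valid JSON falls back to empty metadata.
        return _empty()


sample = 'result: {"tags": ["dog", "park"], "description": "A dog in a park.", "caption": "Dog"}'
parsed = parse_analysis(sample)
```

Note that a response following the prompt's format literally (unquoted keys, single-quoted strings) is not valid JSON, so this sketch would return the empty fallback for it.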

Image Analysis Prompt #2 - Generative Inspiration (image_analyzer_2.py, Gemini via smolagents):
"Describe this image in a way that could be used as a prompt for generating a new image inspired by it.
Focus on the main subjects, composition, style, mood, and colors.
Avoid mentioning specific names or exact details - instead, describe the overall aesthetic and atmosphere so the result feels similar but not identical."
- Input: Local image file sent to Gemini-2.0-Flash model
- Designed for generating aesthetic descriptions usable as prompts for image generation

Search Queries:
- User text → converted to embeddings → matched against photo database
- Album creation: Groups similar photos by distance threshold (photo order is shuffled on each call, so groupings can differ)
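
Album creation as described here could be sketched as a greedy grouping pass. This is an illustration under stated assumptions, not the project's create_albums(): it assumes squared Euclidean distance on already-normalized vectors, and the optional rng parameter is a hypothetical addition for reproducible tests.

```python
import random
from typing import Dict, List, Optional


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def create_albums(
    embeddings: Dict[str, List[float]],
    threshold: float = 1.5,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Greedily group photos whose distance to an album's seed photo is within threshold."""
    rng = rng or random.Random()
    ids = list(embeddings)
    rng.shuffle(ids)  # shuffled order changes which photo seeds each album

    albums: List[List[str]] = []
    assigned = set()
    for seed in ids:
        if seed in assigned:
            continue
        album = [seed]
        assigned.add(seed)
        for other in ids:
            if other in assigned:
                continue
            if squared_l2(embeddings[seed], embeddings[other]) <= threshold:
                album.append(other)
                assigned.add(other)
        albums.append(album)
    return albums


# Two well-separated toy clusters (hypothetical data).
toy = {
    "a1": [1.0, 0.0],
    "a2": [0.99, 0.14],
    "b1": [-1.0, 0.0],
    "b2": [-0.99, 0.14],
}
albums = create_albums(toy, rng=random.Random(0))
```

With well-separated clusters the membership is the same on every call and only the ordering varies; with overlapping photos, the shuffled seed order can change which album a borderline photo joins.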

MODEL OUTPUTS REFINED:
✓ JSON parsing: Extracted structured data from model text response (with dict type-check for Gemini responses)
✓ Embedding model upgrade: Migrated from multilingual-e5-large (1024-d) to Qwen3-Embedding-8B (4096-d)
✓ Album randomization: Added random.shuffle() to prevent deterministic groupings
✓ Error handling: Wrapped API calls so that failures fall back gracefully
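
One way to wrap a model call with a fallback is sketched below. The helper with_fallback and the failing flaky_caption function are hypothetical examples, not code from the project.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Run call(); on any exception, log a warning and return the fallback value."""
    try:
        return call()
    except Exception as exc:  # broad on purpose: any API failure degrades to the fallback
        logger.warning("Model call failed, using fallback: %s", exc)
        return fallback


def flaky_caption() -> str:
    """Stand-in for a remote inference call that times out."""
    raise TimeoutError("inference endpoint timed out")


caption = with_fallback(flaky_caption, fallback="")
```

Returning a neutral value lets an upload succeed even when analysis fails, at the cost of storing a photo without metadata.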

MANUAL VS AI-GENERATED (BREAKDOWN):
AI-Generated (65%):
- Model integration boilerplate (API clients, token management)
- FAISS index structure and search logic
- Vision model prompt formatting
- Default model selections (Qwen3-VL-8B, FLUX.1-dev)

Manual Refinements (35%):
- Database schema design (Photo, embeddings storage)
- FastAPI route structure and error handling
- Album clustering algorithm and break conditions
- Distance threshold validation and tuning
- File upload validation and storage management
- CORS middleware configuration

KEY TECHNICAL DECISIONS:
1. Embedding model: Qwen3-Embedding-8B (4096-d) for better semantic understanding than smaller models
2. Distance thresholds: search() ≤ 1.5, create_albums() ≤ 1.5 (tuned for normalized embeddings)
3. Model choice: Qwen3-VL for balanced speed/quality in image analysis
4. FLUX.1-dev: Prioritizes image quality over generation speed
5. Random album creation: Ensures different groupings per request
6. HuggingFace Hub: Used pretrained models instead of training custom ones