I Built a Smart Gemini API Key Manager Because Rate Limits Were Driving Me Crazy
The Problem
If you've ever built something with the Gemini API on the free tier, you know this error intimately:
```
429 RESOURCE_EXHAUSTED: You exceeded your current quota.
```
Here's what makes it frustrating: Google actually lets you create multiple Cloud projects, each with its own independent API key and quota. So technically you can have 8 keys with 8 independent quotas. But managing them manually? Painful.
Most developers do one of three things:
- Give up and upgrade to paid
- Write a hacky round-robin script that rotates every 30 seconds
- Hit rate limits constantly and just deal with it
I did the third one for a while. Then I got fed up and built gemini-flux.
What Is gemini-flux?
gemini-flux is a smart Gemini API key management microservice. You give it N keys and it handles everything else: rotation, cooldowns, model fallbacks, daily resets, key validation.
The key word is smart. This isn't a dumb round-robin rotator. It's a token-aware sliding window scheduler.
GitHub: https://github.com/malikasana/gemini-flux
The Math Behind It
Gemini's free tier has a limit of 250,000 tokens per minute (TPM) per project.
Most rotation tools ignore this completely. They just rotate every X seconds regardless of what's happening. That's wasteful and broken for large requests.
The real question is:
"When is the EARLIEST I can send the next request without getting rate limited?"
The answer is pure math:
```
cooldown_minutes = token_count / tokens_per_minute
```

- 500k-token request: 500,000 / 250,000 = 2 minutes
- 100k-token request: 100,000 / 250,000 = 0.4 minutes (24 seconds)
- 10k-token request: 10,000 / 250,000 = 0.04 minutes (2.4 seconds)
With 8 keys:
```
worst_case_interval = cooldown / n_keys
```

- 1M-token request (240 s cooldown): 240 / 8 = 30 seconds between requests
- 10k-token request (2.4 s cooldown): 2.4 / 8 = 0.3 seconds, nearly instant
The system adapts automatically based on actual token usage. Light requests are nearly instant. Heavy requests get smart cooldowns.
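To make that arithmetic concrete, here's a minimal sketch of the scheduling math. The function names are my own illustration, not gemini-flux's API:

```python
TPM_LIMIT = 250_000  # free-tier tokens per minute, per project/key

def cooldown_seconds(token_count: int, tpm: int = TPM_LIMIT) -> float:
    """Earliest safe delay before one key can absorb this many tokens again."""
    return token_count / tpm * 60

def worst_case_interval(token_count: int, n_keys: int) -> float:
    """Effective delay between requests when n_keys share the load."""
    return cooldown_seconds(token_count) / n_keys

print(cooldown_seconds(500_000))          # 120.0 -> 2 minutes on a single key
print(worst_case_interval(1_000_000, 8))  # 30.0  -> 30 s across 8 keys
print(worst_case_interval(10_000, 8))     # 0.3   -> nearly instant
```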
How It Works
1. Token Counting (FREE)
Before every request, gemini-flux counts tokens using Google's free count_tokens API, which doesn't cost a single quota unit.
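For reference, a pre-flight count with the google-generativeai SDK looks roughly like this; a sketch of the general technique, not gemini-flux's internal code:

```python
import google.generativeai as genai

genai.configure(api_key="AIza...")  # any one of your keys
model = genai.GenerativeModel("gemini-2.5-flash")

# count_tokens returns the prompt's token count without running generation
count = model.count_tokens("Translate this transcript to Spanish...")
print(count.total_tokens)
```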
2. Sliding Window Per Key
Each key maintains a 60-second sliding window of token usage. When you ask "can this key handle a 400k token request right now?", it looks at actual usage in the last 60 seconds and answers precisely.
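A sliding window like that fits in a few lines: a deque of (timestamp, tokens) pairs pruned to the last 60 seconds. A minimal sketch with hypothetical names:

```python
import time
from collections import deque

class TokenWindow:
    """60-second sliding window of token usage for a single key."""

    def __init__(self, tpm_limit: int = 250_000, window: float = 60.0):
        self.tpm_limit = tpm_limit
        self.window = window
        self.events = deque()  # (timestamp, token_count) pairs

    def _prune(self):
        # Drop entries older than the window; their tokens no longer count.
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def used(self) -> int:
        self._prune()
        return sum(tokens for _, tokens in self.events)

    def can_handle(self, token_count: int) -> bool:
        return self.used() + token_count <= self.tpm_limit

    def record(self, token_count: int):
        self.events.append((time.monotonic(), token_count))
```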
3. Pick the Best Key
For each incoming request:
- Find a key with enough token capacity RIGHT NOW and send immediately
- If no key is ready, calculate the exact wait time for the soonest available key and wait precisely that long
No blind rotation. No unnecessary waiting.
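Continuing the TokenWindow sketch above, key selection reduces to a scan plus an exact wait; again, the names are mine:

```python
import time

def pick_key(windows, token_count):
    """Index of a key that can take the request right now; otherwise sleep
    exactly until the soonest key frees capacity, then re-check.
    (Assumes token_count fits within a single key's TPM limit.)"""
    while True:
        for i, w in enumerate(windows):
            if w.can_handle(token_count):
                return i  # send immediately on this key
        # No key ready: the earliest capacity is released when the oldest
        # window entry across all keys expires.
        soonest = min(w.events[0][0] + w.window for w in windows if w.events)
        time.sleep(max(0.0, soonest - time.monotonic()))
```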
4. Model Exhaustion Chain
When a model's daily RPD quota is hit on a key, it moves to the next model automatically, not because it failed, but because it's exhausted for the day:
1. gemini-2.5-pro (100 RPD)
2. gemini-2.5-flash (250 RPD), the main workhorse
3. gemini-2.5-flash-lite (1,000 RPD)
4. gemini-3.1-pro-preview, newest pro
5. gemini-3-flash-preview, newest flash
6. gemini-3.1-flash-lite-preview, newest lite
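Conceptually the chain is just an ordered list with per-model daily counters. A minimal sketch (RPD numbers from the list above; preview models omitted since their budgets vary):

```python
# Ordered fallback chain: (model, requests-per-day budget per key)
MODEL_CHAIN = [
    ("gemini-2.5-pro", 100),
    ("gemini-2.5-flash", 250),
    ("gemini-2.5-flash-lite", 1000),
]

def next_available_model(requests_today: dict) -> str | None:
    """First model whose daily budget is not yet spent on this key."""
    for model, rpd in MODEL_CHAIN:
        if requests_today.get(model, 0) < rpd:
            return model
    return None  # everything exhausted on this key until the daily reset
```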
5. Smart Policy Fetcher
On startup, gemini-flux spends one request asking Gemini about its own free-tier limits. It parses the response and uses those numbers for all internal math, cached for 7 days. If Google changes its limits tomorrow, the system catches it automatically.
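The caching half of that is the easy part; here is a sketch of a 7-day TTL wrapper, where the cache file name and the fetch callable are my own placeholders:

```python
import json
import time
from pathlib import Path

CACHE = Path("policy_cache.json")
TTL = 7 * 24 * 3600  # seven days, matching the post

def get_policy(fetch_fresh) -> dict:
    """Return cached limits if fresh enough, else re-fetch and cache.

    fetch_fresh: a callable that spends one request asking Gemini for
    its current free-tier limits and returns them as a dict.
    """
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL:
        return json.loads(CACHE.read_text())
    policy = fetch_fresh()
    CACHE.write_text(json.dumps(policy))
    return policy
```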
Total Free Capacity With 8 Keys
| Model | RPD/key | × 8 keys | Daily total |
|---|---|---|---|
| gemini-2.5-pro | 100 | × 8 | 800/day |
| gemini-2.5-flash | 250 | × 8 | 2,000/day |
| gemini-2.5-flash-lite | 1,000 | × 8 | 8,000/day |
| Preview models | varies | × 8 | bonus |
| TOTAL | | | 10,800+/day |
All free. No credit card.
Using It
Direct Python (works in Kaggle too):
```python
from core import GeminiFlux

flux = GeminiFlux(
    keys=["key1", "key2", ..., "key8"],
    mode="both",
    log=True,
)

response = flux.generate("Translate this transcript to Spanish...")
print(response["response"])

# response is a dict like:
# {
#     "response": "...",
#     "key_used": 3,
#     "model_used": "gemini-2.5-flash",
#     "tokens_used": 45231,
#     "wait_applied": 1.8,
#     "retried": False
# }
```
Docker Microservice:
```bash
docker build -t gemini-flux .
docker run -p 8000:8000 --env-file .env gemini-flux
```

```python
from client.client import GeminiFluxClient

client = GeminiFluxClient(base_url="http://localhost:8000")
response = client.generate("your prompt here")
```
Keys via .env (no hardcoding):
```
GEMINI_KEY_1=AIza...
GEMINI_KEY_2=AIza...
...
GEMINI_KEY_8=AIza...
GEMINI_MODE=both
GEMINI_LOG=true
```
Runtime Controls
```python
flux.set_mode("flash_only")  # change mode anytime
flux.disable_key(3)          # disable a specific key
flux.enable_key(3)           # re-enable it
flux.refresh_policy()        # force a re-fetch of Gemini limits
flux.status()                # see all key statuses + usage
```
Why I Built This
I'm building a dubbing application that continuously sends large requests: a system prompt with instructions plus a full transcript chunk. Each request can easily be 100k-500k tokens.
With a single key, you hit cooldowns constantly. With 8 keys and dumb rotation, you still waste time waiting unnecessarily. What I needed was something that knew exactly when each key could accept exactly how many tokens, and scheduled accordingly.
That's gemini-flux.
What's Next
- Async support for parallel requests
- Per-key usage dashboard
- Support for other providers (OpenAI, Anthropic) with the same scheduling logic
Try It
GitHub: https://github.com/malikasana/gemini-flux
```bash
git clone https://github.com/malikasana/gemini-flux
cd gemini-flux
pip install -r requirements.txt
cp .env.example .env
# add your keys to .env
python test.py
```
If you find it useful, drop a star ⭐. It helps a lot!
Built by Muhammad Ali, [email protected]