I Built a Smart Gemini API Key Manager Because Rate Limits Were Driving Me Crazy
The Problem
If you've ever built something with the Gemini API on the free tier, you know this error intimately:
```
429 RESOURCE_EXHAUSTED: You exceeded your current quota.
```
Here's what makes it frustrating: Google actually lets you create multiple Cloud projects, each with its own independent API key and quota. So technically you can have 8 keys with 8 independent quotas. But managing them manually? Painful.
Most developers do one of three things:
- Give up and upgrade to paid
- Write a hacky round-robin script that rotates every 30 seconds
- Hit rate limits constantly and just deal with it
I did the third one for a while. Then I got fed up and built gemini-flux.
What Is gemini-flux?
gemini-flux is a smart Gemini API key management microservice. You give it N keys and it handles everything else: rotation, cooldowns, model fallbacks, daily resets, key validation.
The key word is smart. This isn't a dumb round-robin rotator. It's a token-aware sliding window scheduler.
GitHub: https://github.com/malikasana/gemini-flux
The Math Behind It
Gemini's free tier has a limit of 250,000 tokens per minute (TPM) per project.
Most rotation tools ignore this completely. They just rotate every X seconds regardless of what's happening. That's wasteful and broken for large requests.
The real question is:
"When is the EARLIEST I can send the next request without getting rate limited?"
The answer is pure math:
```
cooldown_minutes = token_count / tokens_per_minute
```

- 500k-token request: 500,000 / 250,000 = 2 minutes
- 100k-token request: 100,000 / 250,000 = 0.4 minutes (24 seconds)
- 10k-token request: 10,000 / 250,000 = 0.04 minutes (2.4 seconds)
With 8 keys:
```
worst_case_interval = cooldown / n_keys
```

- 1M-token request (240 s cooldown): 240 / 8 = 30 seconds between requests
- 10k-token request (2.4 s cooldown): 2.4 / 8 = 0.3 seconds, nearly instant
The system adapts automatically based on actual token usage. Light requests are nearly instant. Heavy requests get smart cooldowns.
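To make that arithmetic concrete, here's a minimal sketch of the scheduling math. The function names are my own illustration, not gemini-flux's API:

```python
TPM_LIMIT = 250_000  # free-tier tokens per minute, per project/key

def cooldown_seconds(token_count: int, tpm: int = TPM_LIMIT) -> float:
    """Earliest safe delay before one key can absorb this many tokens again."""
    return token_count / tpm * 60

def worst_case_interval(token_count: int, n_keys: int) -> float:
    """Effective delay between requests when n_keys share the load."""
    return cooldown_seconds(token_count) / n_keys

print(cooldown_seconds(500_000))          # 120.0 -> 2 minutes on a single key
print(worst_case_interval(1_000_000, 8))  # 30.0  -> 30 s across 8 keys
print(worst_case_interval(10_000, 8))     # 0.3   -> nearly instant
```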
How It Works
1. Token Counting (FREE)
Before every request, gemini-flux counts tokens using Google's free count_tokens API, which doesn't cost a single quota unit.
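For reference, a pre-flight count with the google-generativeai SDK looks roughly like this; a sketch of the general technique, not gemini-flux's internal code:

```python
import google.generativeai as genai

genai.configure(api_key="AIza...")  # any one of your keys
model = genai.GenerativeModel("gemini-2.5-flash")

# count_tokens returns the prompt's token count without running generation
count = model.count_tokens("Translate this transcript to Spanish...")
print(count.total_tokens)
```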
2. Sliding Window Per Key
Each key maintains a 60-second sliding window of token usage. When you ask "can this key handle a 400k token request right now?", it looks at actual usage in the last 60 seconds and answers precisely.
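A sliding window like that fits in a few lines: a deque of (timestamp, tokens) pairs pruned to the last 60 seconds. A minimal sketch with hypothetical names:

```python
import time
from collections import deque

class TokenWindow:
    """60-second sliding window of token usage for a single key."""

    def __init__(self, tpm_limit: int = 250_000, window: float = 60.0):
        self.tpm_limit = tpm_limit
        self.window = window
        self.events = deque()  # (timestamp, token_count) pairs

    def _prune(self):
        # Drop entries older than the window; their tokens no longer count.
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def used(self) -> int:
        self._prune()
        return sum(tokens for _, tokens in self.events)

    def can_handle(self, token_count: int) -> bool:
        return self.used() + token_count <= self.tpm_limit

    def record(self, token_count: int):
        self.events.append((time.monotonic(), token_count))
```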
3. Pick the Best Key
For each incoming request:
- Find a key with enough token capacity RIGHT NOW and send immediately
- If no key is ready, calculate the exact wait time for the soonest available key and wait precisely that long
No blind rotation. No unnecessary waiting.
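Continuing the TokenWindow sketch above, key selection reduces to a scan plus an exact wait; again, the names are mine:

```python
import time

def pick_key(windows, token_count):
    """Index of a key that can take the request right now; otherwise sleep
    exactly until the soonest key frees capacity, then re-check.
    (Assumes token_count fits within a single key's TPM limit.)"""
    while True:
        for i, w in enumerate(windows):
            if w.can_handle(token_count):
                return i  # send immediately on this key
        # No key ready: the earliest capacity is released when the oldest
        # window entry across all keys expires.
        soonest = min(w.events[0][0] + w.window for w in windows if w.events)
        time.sleep(max(0.0, soonest - time.monotonic()))
```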
4. Model Exhaustion Chain
When a model's daily RPD quota is hit on a key, it moves to the next model automatically, not because it failed, but because it's exhausted for the day:
1. gemini-2.5-pro (100 RPD)
2. gemini-2.5-flash (250 RPD), the main workhorse
3. gemini-2.5-flash-lite (1,000 RPD)
4. gemini-3.1-pro-preview, newest pro
5. gemini-3-flash-preview, newest flash
6. gemini-3.1-flash-lite-preview, newest lite
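Conceptually the chain is just an ordered list with per-model daily counters. A minimal sketch (RPD numbers from the list above; preview models omitted since their budgets vary):

```python
# Ordered fallback chain: (model, requests-per-day budget per key)
MODEL_CHAIN = [
    ("gemini-2.5-pro", 100),
    ("gemini-2.5-flash", 250),
    ("gemini-2.5-flash-lite", 1000),
]

def next_available_model(requests_today: dict) -> str | None:
    """First model whose daily budget is not yet spent on this key."""
    for model, rpd in MODEL_CHAIN:
        if requests_today.get(model, 0) < rpd:
            return model
    return None  # everything exhausted on this key until the daily reset
```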
5. Smart Policy Fetcher
On startup, gemini-flux spends one request asking Gemini about its own free-tier limits. It parses the response and uses those numbers for all internal math, cached for 7 days. If Google changes its limits tomorrow, the system catches it automatically.
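The caching half of that is the easy part; here is a sketch of a 7-day TTL wrapper, where the cache file name and the fetch callable are my own placeholders:

```python
import json
import time
from pathlib import Path

CACHE = Path("policy_cache.json")
TTL = 7 * 24 * 3600  # seven days, matching the post

def get_policy(fetch_fresh) -> dict:
    """Return cached limits if fresh enough, else re-fetch and cache.

    fetch_fresh: a callable that spends one request asking Gemini for
    its current free-tier limits and returns them as a dict.
    """
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL:
        return json.loads(CACHE.read_text())
    policy = fetch_fresh()
    CACHE.write_text(json.dumps(policy))
    return policy
```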
Total Free Capacity With 8 Keys
| Model | RPD/key | × 8 keys | Daily total |
|---|---|---|---|
| gemini-2.5-pro | 100 | × 8 | 800/day |
| gemini-2.5-flash | 250 | × 8 | 2,000/day |
| gemini-2.5-flash-lite | 1,000 | × 8 | 8,000/day |
| Preview models | varies | × 8 | bonus |
| TOTAL | | | 10,800+/day |
All free. No credit card.
Using It
Direct Python (works in Kaggle too):
```python
from core import GeminiFlux

flux = GeminiFlux(
    keys=["key1", "key2", ..., "key8"],
    mode="both",
    log=True,
)

response = flux.generate("Translate this transcript to Spanish...")
print(response["response"])

# response is a dict like:
# {
#     "response": "...",
#     "key_used": 3,
#     "model_used": "gemini-2.5-flash",
#     "tokens_used": 45231,
#     "wait_applied": 1.8,
#     "retried": False
# }
```
Docker Microservice:
```bash
docker build -t gemini-flux .
docker run -p 8000:8000 --env-file .env gemini-flux
```

```python
from client.client import GeminiFluxClient

client = GeminiFluxClient(base_url="http://localhost:8000")
response = client.generate("your prompt here")
```
Keys via .env (no hardcoding):
```
GEMINI_KEY_1=AIza...
GEMINI_KEY_2=AIza...
...
GEMINI_KEY_8=AIza...
GEMINI_MODE=both
GEMINI_LOG=true
```
Runtime Controls
```python
flux.set_mode("flash_only")  # change mode anytime
flux.disable_key(3)          # disable a specific key
flux.enable_key(3)           # re-enable it
flux.refresh_policy()        # force a re-fetch of Gemini limits
flux.status()                # see all key statuses + usage
```
Why I Built This
I'm building a dubbing application that continuously sends large requests: a system prompt with instructions plus a full transcript chunk. Each request can easily be 100k-500k tokens.
With a single key, you hit cooldowns constantly. With 8 keys and dumb rotation, you still waste time waiting unnecessarily. What I needed was something that knew exactly when each key could accept exactly how many tokens, and scheduled accordingly.
That's gemini-flux.
What's Next
- Async support for parallel requests
- Per-key usage dashboard
- Support for other providers (OpenAI, Anthropic) with the same scheduling logic
Try It
GitHub: https://github.com/malikasana/gemini-flux
```bash
git clone https://github.com/malikasana/gemini-flux
cd gemini-flux
pip install -r requirements.txt
cp .env.example .env
# add your keys to .env
python test.py
```
If you find it useful, drop a star ⭐. It helps a lot!
Built by Muhammad Ali, [email protected]