stream=True).
If you’re building a chat interface or real-time agent, you need streaming. Historically, though, streaming has made it hard to get exact token counts, because the response arrives in chunks rather than as a single object.
TokenSense handles this transparently.
How It Works
When you wrap an LLM call with observe() and set stream=True:
- Exact Usage Extraction: TokenSense intercepts the generator and scans the chunks as they pass through. For providers that support returning usage in the stream (like Anthropic, or OpenAI/Groq with stream_options), TokenSense extracts the exact token counts reported by the provider.
- Auto-Injection: If you forget to pass stream_options={"include_usage": True} to an OpenAI or Groq call, TokenSense automatically injects it for you. This doesn’t change your billing or behavior; it just ensures the final chunk contains the exact usage data.
- Zero Overhead: The chunk extraction is completely synchronous and non-blocking. The latency added to your first chunk is typically <0.5ms.
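The pass-through idea can be sketched in a few lines. This is an illustration of the technique only, not TokenSense's actual implementation: the helper names ensure_usage_option and pass_through, the event dict, and the simulated chunk objects are all hypothetical, and the sketch assumes OpenAI-style chunks whose final item carries a usage attribute.

```python
from types import SimpleNamespace


def ensure_usage_option(kwargs):
    """Return a copy of the call kwargs with include_usage enabled for streams."""
    patched = dict(kwargs)
    if patched.get("stream"):
        options = dict(patched.get("stream_options") or {})
        # setdefault keeps an explicit value the caller already chose.
        options.setdefault("include_usage", True)
        patched["stream_options"] = options
    return patched


def pass_through(stream, event):
    """Yield every chunk unchanged, recording usage when a chunk carries it."""
    for chunk in stream:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            event["prompt_tokens"] = usage.prompt_tokens
            event["completion_tokens"] = usage.completion_tokens
        yield chunk


# Simulated stream (assumption): two content chunks, then a usage-only chunk.
fake_stream = [
    SimpleNamespace(text="Hel", usage=None),
    SimpleNamespace(text="lo", usage=None),
    SimpleNamespace(
        text="",
        usage=SimpleNamespace(prompt_tokens=12, completion_tokens=2),
    ),
]

event = {}
text = "".join(chunk.text for chunk in pass_through(iter(fake_stream), event))
```

The caller still sees every chunk in order; the usage numbers are collected as a side effect once the final chunk passes through.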
Graceful Early Termination
What happens if a user closes their browser tab mid-stream? Your application usually breaks out of the for chunk in response loop, and when the abandoned generator is then closed, Python raises a GeneratorExit exception inside it.
TokenSense catches this cleanly. If a stream is aborted early:
- TokenSense marks the resulting event with partial=True.
- It estimates the tokens and cost from exactly what was consumed up to the break point, so your cost tracking stays accurate even on interrupted streams.
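A minimal sketch of how a wrapper generator can notice early termination, assuming a plain dict stands in for the tracked event. The tracked function and the chunks_seen field are hypothetical names for illustration; only the partial flag comes from the description above.

```python
def tracked(stream, event):
    """Pass chunks through; flag the event as partial if the consumer stops early."""
    event["partial"] = False
    event["chunks_seen"] = 0
    try:
        for chunk in stream:
            event["chunks_seen"] += 1
            yield chunk
    except GeneratorExit:
        # Raised at the paused `yield` when the generator is closed early.
        event["partial"] = True
        raise


event = {}
gen = tracked(iter(["a", "b", "c", "d"]), event)
for chunk in gen:
    if chunk == "b":
        break  # simulates the client disconnecting mid-stream
gen.close()  # explicit close; CPython also closes it on garbage collection

full_event = {}
full = list(tracked(iter(["a", "b"]), full_event))
```

Re-raising GeneratorExit after recording the flag lets the generator shut down normally, so the bookkeeping does not interfere with cleanup.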
Sync and Async
TokenSense natively supports both synchronous generators (for chunk in response) and asynchronous generators (async for chunk in response). You don’t need to change anything about how you call your client; TokenSense automatically wraps the appropriate type.
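One way such type-based wrapping could work is to check whether the stream is an async iterator and return a matching wrapper. This is a sketch under that assumption, not TokenSense's code; wrap, on_chunk, and the fake streams are hypothetical.

```python
import asyncio


def wrap(stream, on_chunk):
    """Return a wrapper that matches the stream's type (async or sync)."""
    if hasattr(stream, "__aiter__"):
        async def async_wrapper():
            async for chunk in stream:
                on_chunk(chunk)
                yield chunk
        return async_wrapper()

    def sync_wrapper():
        for chunk in stream:
            on_chunk(chunk)
            yield chunk
    return sync_wrapper()


async def fake_async_stream():
    # Stand-in for an async client response (assumption).
    for piece in ("x", "y"):
        yield piece


async def consume_async(seen):
    return [chunk async for chunk in wrap(fake_async_stream(), seen.append)]


sync_seen, async_seen = [], []
sync_out = list(wrap(iter(["a", "b"]), sync_seen.append))
async_out = asyncio.run(consume_async(async_seen))
```

In both cases the calling code keeps its original loop shape (for or async for); only the object being iterated changes.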
