[kakao monthly newsletter, September issue] Building a Real-time Voice Agent with Gemini and Google ADK

💡 Reference link: Build a real-time voice agent with Gemini & ADK

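Building advanced conversational AI has moved well beyond text.

We can now use AI to build real-time, voice-driven agents, but such systems require low-latency two-way communication, real-time information retrieval, and the ability to handle complex tasks.

This guide shows how to build such an agent with Gemini and the Google ADK (Agent Development Kit). You will learn how to create an intelligent, responsive voice agent.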

Basic agent

First, create an agent that has a persona but no access to external tools.

# In app/server/streaming_service.py

from google.adk.agents import Agent
from core_utils import MODEL, SYSTEM_INSTRUCTION

self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    # The 'tools' list is omitted for now.
)

This agent can hold a conversation, but it cannot access external information.

Advanced agent

To make the agent useful, add tools. They give the agent access to real-time data and services.

In streaming_service.py, give the agent access to Google Search and Google Maps.

# In app/server/streaming_service.py

import os

from google.adk.agents import Agent
from google.adk.tools import google_search
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters
from core_utils import MODEL, SYSTEM_INSTRUCTION

# The Google Maps MCP server reads its key from GOOGLE_MAPS_API_KEY.
google_maps_api_key = os.environ.get("GOOGLE_MAPS_API_KEY")
self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    tools=[
        # Prebuilt ADK tool for Google Search.
        google_search,
        # MCP server for Google Maps, launched as a subprocess via npx.
        MCPToolset(
            connection_params=StdioServerParameters(
                command='npx',
                args=["-y", "@modelcontextprotocol/server-google-maps"],
                env={"GOOGLE_MAPS_API_KEY": google_maps_api_key}
            ),
        )
    ],
)

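A closer look at the tools

Google Search: This prebuilt ADK tool lets the agent run Google searches so it can answer questions about current events and real-time information.

MCP Toolset for Google Maps: This uses the Model Context Protocol (MCP) to connect the agent to a specialized server, in this case one that understands the Google Maps API. The main agent acts as an orchestrator and delegates tasks it cannot handle itself to the specialized tool.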
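
Because the Maps server runs as a separate subprocess, a missing API key may only show up later as a failed tool call. The following is a minimal sketch of a fail-fast check that could run before the agent is constructed. The helper name build_maps_env is hypothetical and not part of ADK, and the default variable name GOOGLE_MAPS_API_KEY is an assumption about what the Maps MCP server reads.

```python
import os
from typing import Mapping, Optional


def build_maps_env(
    var_name: str = "GOOGLE_MAPS_API_KEY",  # assumed variable name
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return the env dict for the Maps subprocess, or raise if the key is unset.

    Hypothetical helper for illustration; it is not part of ADK.
    """
    source = os.environ if environ is None else environ
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; the Maps tool cannot start.")
    return {var_name: key}
```

The returned dict could then be passed as the env argument of StdioServerParameters, so a configuration mistake stops the server at startup instead of in the middle of a conversation.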

Designing a natural conversation

The RunConfig object defines how the agent communicates. It controls aspects such as voice selection and streaming mode.

# In app/server/streaming_service.py (inside the handle_stream method)

from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types
from core_utils import VOICE_NAME

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name=VOICE_NAME
            )
        )
    ),
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
    input_audio_transcription=types.AudioTranscriptionConfig(),
)

StreamingMode.BIDI (bidirectional) lets the user interrupt the agent mid-response, which makes the conversation feel more natural.
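
What an interruption means for audio delivery can be sketched as follows: chunks that are still waiting to be played are dropped as soon as the user starts speaking. This is an illustration only. The PlaybackBuffer class and its methods are hypothetical and not part of ADK, and how the real service signals an interruption is outside the scope of this sketch.

```python
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """Hypothetical queue of agent audio chunks awaiting playback."""

    def __init__(self) -> None:
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        """Add a chunk of agent audio to the end of the queue."""
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the oldest pending chunk, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def interrupt(self) -> int:
        """Drop all pending audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

Without a step like interrupt(), the agent would keep talking from its backlog even after the user has cut in.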

Asynchronous core

Real-time voice chat has to handle several activities at once: listening, thinking, and speaking. Python's asyncio and TaskGroup manage these concurrent tasks.

# In app/server/streaming_service.py (inside the handle_stream method)
import asyncio
async with asyncio.TaskGroup() as tg:
    # Task 1: Receive audio from the user's browser.
    tg.create_task(receive_client_messages(), name="ClientMessageReceiver")
    # Task 2: Forward the audio to the Gemini service.
    tg.create_task(send_audio_to_service(), name="AudioSender")
    # Task 3: Receive responses from Gemini.
    tg.create_task(receive_service_responses(), name="ServiceResponseReceiver")

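Translating the agent's voice

The receive_service_responses task processes the agent's output before sending it to the user. This output contains audio and a text transcript.

Audio handling

Audio is Base64-encoded, which turns the binary data into a text string that can be sent inside a JSON message.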

# --- Inside receive_service_responses ---
import base64
import json
# Handling Audio Response
if hasattr(part, "inline_data") and part.inline_data:
    # Encode the raw audio bytes into a Base64 text string.
    b64_audio = base64.b64encode(part.inline_data.data).decode("utf-8")
    # Package it in a JSON message, typed as "audio".
    await websocket.send(json.dumps({"type": "audio", "data": b64_audio}))
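
The receiving side has to reverse this encoding before playback. The following round-trip sketch uses only the standard library. The function names encode_audio_message and decode_audio_message are hypothetical, while the message shape ({"type": "audio", "data": ...}) follows the snippet above.

```python
import base64
import json


def encode_audio_message(raw_audio: bytes) -> str:
    """Wrap raw audio bytes in a JSON message, as the server snippet does."""
    b64_audio = base64.b64encode(raw_audio).decode("utf-8")
    return json.dumps({"type": "audio", "data": b64_audio})


def decode_audio_message(message: str) -> bytes:
    """Recover the raw audio bytes from a JSON message of type "audio"."""
    payload = json.loads(message)
    if payload.get("type") != "audio":
        raise ValueError(f"expected an audio message, got {payload.get('type')!r}")
    return base64.b64decode(payload["data"])
```

Base64 emits 4 characters for every 3 input bytes, so each audio payload grows by about one third. That is the price of carrying binary data in a text-only JSON message.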

Text handling

The text transcript is streamed to the client for real-time feedback.

# --- Inside receive_service_responses ---
# Handling Text Response
if hasattr(part, "text") and part.text:
    # Forward only streaming, partial text chunks.
    if getattr(event, "partial", False):
        # Send the chunk for real-time display on the client.
        await websocket.send(json.dumps({"type": "text", "data": part.text}))
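
On the client side, these partial messages have to be stitched back into a running transcript. The following is a minimal sketch, assuming each partial message carries only the newly generated piece of text rather than the full text so far. The function name assemble_transcript is hypothetical.

```python
import json


def assemble_transcript(messages: list) -> str:
    """Concatenate the text pieces from a stream of JSON messages.

    Assumes each "text" message holds an incremental piece of the transcript;
    messages of other types (for example "audio") are skipped.
    """
    pieces = []
    for message in messages:
        payload = json.loads(message)
        if payload.get("type") == "text":
            pieces.append(payload["data"])
    return "".join(pieces)
```

If the service instead resends the full text on every update, the client would replace the displayed transcript with the latest message rather than appending to it.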

References

Gemini native audio: https://blog.google/technology/google-deepmind/gemini-2-5-native-audio/
GitHub project source code: https://github.com/Ashwinikumar1/NavigoAI_Voice_Agent_ADk