
Deploy secure agentic AI: Protocols and performance tuning

Welcome to the final installment of our series on building a conversational analyst using Red Hat OpenShift AI and EnterpriseDB. So far, we have deployed the quickstart in Part 1, explored agentic AI concepts in Part 2, and examined the LLM orchestration architecture in Part 3. Now, we will examine how the Model Context Protocol (MCP) standardizes tool integration, how event streaming improves the user experience, and the vital security and performance tuning practices required to safely deploy stochastic reasoning engines in your enterprise.

Model Context Protocol (MCP)

While the vLLM inference server uses the configured tool call parser and chat message template to resolve the model-specific message format differences we discussed in Part 3, MCP abstracts the orchestrator’s interaction with the MCP server. This allows the orchestrator to add or swap MCP servers with minimal disruption to its code base. Let’s look more closely at what MCP provides as a message format and as a protocol.

As a message format, MCP extends another standard, JSON-RPC 2.0, by providing a specific semantic model. JSON-RPC is a general-purpose remote procedure call (RPC) standard that forms the basis of many other standards (e.g., Bitcoin RPC and Transmission RPC). By itself, JSON-RPC lacks sufficient details and semantics for our purposes. For example, while JSON-RPC specifies a required result field in a successful response to a remote procedure call like tools/list (the request is shown in the following snippet), it doesn’t specify what should be contained in the result field or the meaning of its contents.

    { "jsonrpc": "2.0", "method": "tools/list", "params": {}, "id": 2 }

This is the job of the MCP and JSON Schema standards, which together specify that the result field should contain a list of tools, each described by fields such as name, description, inputSchema, and outputSchema.
    {
      "jsonrpc": "2.0",
      "id": 2,
      "result": {
        "tools": [
          { "name": "list_schemas", "description": "...", "inputSchema": {}, "outputSchema": {} },
          { "name": "list_objects", "description": "...", "inputSchema": {}, "outputSchema": {} }
        ]
      }
    }

Note in particular that the inputSchema and outputSchema elements (whose contents are omitted above for brevity) describe the structure of each tool’s expected input and output in great detail, especially in the context of SQL and its complex data types. PG Airman MCP handles these details for us.

While this message format is a useful contract between the orchestrator and MCP server, a different, briefer format is used to communicate the MCP server’s tools to the LLM. For most open-weight LLMs, this format follows the OpenAI convention shown below.

    {
      "type": "function",
      "function": {
        "name": "list_objects",
        "description": "List objects in a schema with comments",
        "parameters": {
          "properties": {
            "schema_name": {
              "description": "Schema name",
              "title": "Schema Name",
              "type": "string"
            },
            "object_type": {
              "default": "table",
              "description": "Object type: 'table', 'view', 'sequence', 'function', 'stored procedure', or 'extension'",
              "title": "Object Type",
              "type": "string"
            }
          },
          "required": [ "schema_name" ],
          "title": "List_objectsArguments",
          "type": "object"
        }
      }
    }

It is the orchestrator’s responsibility to construct the tool descriptions in this OpenAI format. This is one area that Red Hat’s integrated Llama Stack operator simplifies for you. The key idea here is that the LLM uses this information, specifically the semantic description of each function and its formal parameters, to determine which tools are relevant to answer the user’s query and how to invoke them with proper arguments.

As a protocol, MCP requires specific sequences of message exchanges for each type of interaction, such as the requirement for the client to call the initialize method and use the resulting session identifier in subsequent calls.
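To make the translation between the two formats concrete, here is a minimal sketch of how an orchestrator might map a tools/list result to OpenAI-style tool descriptions. The function names `mcp_tool_to_openai` and `tools_from_list_response` are hypothetical and are not taken from the quickstart or from Llama Stack; the sketch only assumes the field names shown in the JSON samples above.

```python
from typing import Any


def mcp_tool_to_openai(tool: dict[str, Any]) -> dict[str, Any]:
    """Map one entry of an MCP tools/list result to an OpenAI-style tool description."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # The MCP inputSchema is already a JSON Schema object, so it is passed
            # through as "parameters". An empty or missing schema falls back to an
            # object with no properties. outputSchema has no slot in this format.
            "parameters": tool.get("inputSchema") or {"type": "object", "properties": {}},
        },
    }


def tools_from_list_response(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert a whole tools/list JSON-RPC response (hypothetical helper)."""
    return [mcp_tool_to_openai(tool) for tool in response["result"]["tools"]]


# Illustrative response, shaped like the sample above.
sample_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "tools": [
            {
                "name": "list_objects",
                "description": "List objects in a schema with comments",
                "inputSchema": {
                    "type": "object",
                    "properties": {"schema_name": {"type": "string"}},
                    "required": ["schema_name"],
                },
                "outputSchema": {},
            },
            {"name": "list_schemas", "description": "...", "inputSchema": {}},
        ]
    },
}

openai_tools = tools_from_list_response(sample_response)
```

The resulting list is what would be attached to the chat request so that the LLM can choose among the advertised tools.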
The initialization sequence also provides an opportunity for the client to discover the server’s capabilities. The following code block shows the contents of a sample initialization request.

    {
      "jsonrpc": "2.0",
      "method": "initialize",
      "params": {
        "protocolVersion": "2025-11-25",
        "capabilities": {},
        "clientInfo": { "name": "jupyter-experiment", "version": "1.0.0" }
      },
      "id": 1
    }

This is the response from the PG Airman MCP server.

    {
      "jsonrpc": "2.0",
      "id": 1,
      "result": {
        "protocolVersion": "2025-11-25",
        "capabilities": {
          "experimental": {},
          "prompts": { "listChanged": false },
          "resources": { "subscribe": false, "listChanged": false },
          "tools": { "listChanged": false }
        },
        "serverInfo": { "name": "pg-airman-mcp", "version": "1.26.6" }
      }
    }

A summary of MCP’s concepts and contributions:

- Supports multiple communication protocols: The protocol supports stdio, server-sent events (SSE), and streamable HTTP. Streamable HTTP is recommended and used by the copilot quickstart for full-duplex, out-of-process communication using a single HTTP endpoint between the MCP client and server.
- JSON-RPC method interface and message format: Provides a set of methods and message formats for interacting with tools, resources, and prompts.
- Handshake protocol: Enables session initialization and capabilities discovery. The protocol is stateful, requiring clients to initiate long-lived sessions to address unexpected connection drops, and includes methods for clients and servers to discover their mutual capabilities.
- Sampling: Allows MCP servers to submit queries to the client’s LLM, effectively inverting the standard role between the MCP server and client.

Compound AI systems

Agentic AI applications are increasingly compound AI systems that orchestrate multiple components from different providers across Red Hat’s global ecosystem. This is particularly true in multi-agent settings that integrate several role-specific LLMs and traditional ML models.
The key catalyst for compound systems is the widespread adoption of open-source standards like Model Context Protocol (MCP) and others overseen by the Agentic AI Foundation (AAIF), of which Red Hat is a founding member. These and other open-source standards supported and implemented by OpenShift AI drive innovation and help enterprises avoid vendor or LLM lock-in.

Event-streaming architecture

A key aspect of the copilot’s architecture is the use of server-sent events (SSE) while processing user queries. SSE enables Python code running in a server (e.g., a REST server) to send notifications to a browser client during processing. The use of SSE in agentic applications provides several benefits. It improves the user’s perception of performance and, more importantly, keeps the user engaged with the fine-grained steps the LLM and orchestrator follow to answer a query. This allows the user to spot and correct potential errors, for example by refining prompts to correct mistaken assumptions the LLM has made.

The key events of the sequence diagram in Figure 2 in Part 3 are raised by the MCP-Direct and Llama Stack orchestrators to the user interface. The user interface in turn uses the event information to create a more responsive, transparent user experience. The SSE event definitions sent by the copilot orchestrator to the user interface are as follows:

- query_start: Query processing begins.
- iteration_start: A new agentic iteration begins.
- llm_thinking: LLM reasoning process (if enabled for the Nemotron model).
- llm_content_delta: Streaming response text from the LLM.
- tool_call: Tool execution starts (if requested by the LLM).
- tool_result: Tool execution completes.
- timing_summary: The backend generates performance metrics.
- final_summary: The LLM's final answer is returned.
- error: An error occurred.

When it comes to event-streaming architectures, Python generators and SSE are a match made in heaven.
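As a minimal sketch of that pairing, the generator below yields one SSE frame per step of a made-up query. The event names come from the list above; the payload fields, the `sse_frame` helper, and the `answer_query` function are hypothetical illustrations rather than the copilot's actual code.

```python
import json
from typing import Any, Iterator


def sse_frame(event: str, data: dict[str, Any]) -> str:
    """Serialize one event in the text/event-stream wire format."""
    # An SSE frame is an "event:" line, a "data:" line, and a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def answer_query(query: str) -> Iterator[str]:
    """Hypothetical agentic loop that yields an SSE frame at each step."""
    yield sse_frame("query_start", {"query": query})
    yield sse_frame("iteration_start", {"iteration": 1})
    yield sse_frame("tool_call", {"tool": "list_schemas", "arguments": {}})
    yield sse_frame("tool_result", {"tool": "list_schemas", "result": ["public"]})
    # Stream the answer in small pieces, as llm_content_delta would.
    for chunk in ("There is ", "one schema."):
        yield sse_frame("llm_content_delta", {"text": chunk})
    yield sse_frame("final_summary", {"answer": "There is one schema."})


frames = list(answer_query("Which schemas exist?"))
print(frames[0], end="")
```

In a real service, the generator would typically be handed to the web framework's streaming response with the text/event-stream content type, so each frame reaches the browser as soon as it is yielded.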
Use of the yield keyword in Python maps directly to the SSE events. See the ChatInterface.svelte file to learn more about how the user interface reacts to these events.

Both orchestrators (MCP-Direct and Llama Stack) emit an identical set of standardized SSE events, enabling the frontend to remain orchestrator-agnostic. The MCP-Direct orchestrator generates events natively within its agentic loop, while the Llama Stack orchestrator maps Llama Stack's native event stream to those previously listed. This abstraction allows a smoother transition to different orchestrator technologies in the future.

Note: Don’t confuse the use of SSE here with the choice between SSE or Streamable HTTP for the PG Airman MCP server. The orchestrator uses Streamable HTTP when communicating with the PG Airman MCP server, not SSE.

Agentic patterns

While our implementation of an agentic loop is successful, there are many variations to consider for your specific enterprise. Authors have identified several taxonomies of agentic AI patterns. J. E. Joyce and S. Maheshwari, in "A Design-Driven Taxonomy of AI Agentic Patterns" (2025 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT)), identify seven groups of agentic AI architectural patterns, from simple, stateless reactive patterns to learning-based agents that utilize reinforcement learning, neural networks, and evolutionary (genetic) algorithms. Among these seven groups is the Foundation Model-Based (FMB) group, which is our focus. These patterns leverage LLMs as the core of their reasoning and decision-making capability. Yue Liu and others, in "Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents" (Journal of Systems and Software 220 (2025)), further analyze FMB patterns to reveal 18 separate compositional patterns, including a decision model for selecting the right patterns for your organization.
You can find similar research by Qinghua Lu and others in "Towards Responsible Generative AI: A Reference Architecture for Designing Foundation Model based Agents" (2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C)).

Our quickstart employs several of these agentic design patterns:

- Agent Adapter: Provides an interface to connect the agent to external tools, such as the PG Airman MCP.
- Retrieval Augmented Generation (RAG): Enhances the LLM’s parametric knowledge with in-context information obtained from external or private documents. In this copilot, this pattern is used to augment the model with a data governance policy.
- Single-Path Plan Generator: Performs linear orchestration of intermediate steps that leads to a final solution.
- Passive Goal Creator: Supports dialog-based interaction with users who submit queries to the system explicitly.

Note that the terminology above follows the aforementioned research papers, but other names are often used to represent similar ideas (e.g., linear orchestration, ReAct pattern, etc.).

Through coding changes to the orchestrator, these design patterns can be extended or replaced with more sophisticated approaches. For example, the multi-path plan generator pattern can be used to generate multiple strategies to respond to a user query. This approach is usually combined with a reflection design pattern that evaluates each plan, potentially incorporating user feedback or input from other agents.

This multi-path approach may lead to improved insight generation for complex analytical tasks that can be decomposed in multiple valid ways. This is an extremely common occurrence for analysts and data owners working together and often results in refining or broadening the original analytic task. For example, the answer to the question "Which customers have the highest lifetime value and are they still active?" could follow different query sequences.
One sequence ranks customers by lifetime value (LTV) first and then marks the active ones; another filters active customers first and then calculates their LTV; a third segments customers by value tiers to spot churn patterns. Each path may surface different but valid insights. For example, the first approach may identify top-tier customers that have churned to an inactive status, while adding the results of the second approach may reveal a decline in spending by your active top-tier customers. A reflection step can combine the most valuable findings from multiple approaches.

SQL exception handling

When SQL execution fails within PG Airman MCP, the exception is caught and returned to the copilot orchestrator, which forwards the error message to the LLM as context for the next iteration. The LLM can then analyze the error (e.g., syntax issues, invalid column references, type mismatches) and generate corrected SQL in its subsequent response. This self-correction capability is a key advantage of agentic systems over single-shot chat interactions. To prevent runaway execution, the agentic loop enforces a limit of 100 iterations per query. Additionally, SQL errors are surfaced to users as real-time server-sent events in the UI, allowing users to monitor SQL exceptions and step in to resolve an issue.

Security

Modern agentic AI architectures represent a paradigm shift in application logic, transitioning from deterministic, rule-based control flows to non-deterministic, stochastic reasoning engines powered by LLMs. This underscores the need to implement a defense-in-depth security posture to address multiple security threats, including privilege escalation, data exfiltration, and denial of service attacks. Security is an in-depth topic and we recommend readers review a dedicated security guide.
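One of the simplest guards against runaway execution is the iteration cap mentioned in the SQL exception handling discussion. The sketch below shows a bounded self-correction loop under stated assumptions: `generate_sql` stands in for the LLM call and `execute_sql` for the MCP tool call, and both are supplied by the caller. The function names and signatures are hypothetical and do not reflect the copilot's actual orchestrator code.

```python
from typing import Callable

MAX_ITERATIONS = 100  # the per-query limit stated above


def run_with_self_correction(
    question: str,
    generate_sql: Callable[[str, list[str]], str],
    execute_sql: Callable[[str], list[tuple]],
    max_iterations: int = MAX_ITERATIONS,
) -> list[tuple]:
    """Retry SQL generation, feeding each failure back as context."""
    errors: list[str] = []
    for _ in range(max_iterations):
        sql = generate_sql(question, errors)
        try:
            return execute_sql(sql)
        except Exception as exc:
            # The error text becomes context for the next generation attempt.
            errors.append(f"{sql!r} failed: {exc}")
    raise RuntimeError(f"gave up after {max_iterations} iterations")


# Stubs standing in for the LLM and the database (illustrative only).
def fake_llm(question: str, errors: list[str]) -> str:
    return "SELECT name FROM customers" if errors else "SELECT nme FROM customers"


def fake_db(sql: str) -> list[tuple]:
    if "nme" in sql:
        raise ValueError('column "nme" does not exist')
    return [("Ada",), ("Grace",)]


rows = run_with_self_correction("List customer names", fake_llm, fake_db)
```

Here the first attempt fails on a misspelled column, the error is passed back, and the second attempt succeeds. A query that never succeeds ends with an exception once the cap is reached instead of looping indefinitely.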
Basic principle of LLMs and security

Developer instructions via system prompts are ineffective for enforcing security constraints since LLMs process both developer and user instructions probabilistically in a single token stream. This means that security constraints provided by the developer in the system prompt don’t hold a special, inviolable status compared to user input. The LLM sees both as one continuous instruction stream. Agentic AI addresses the security implications of this design in several ways.

MCP tool registration

MCP is a structured protocol that allows only named tools to be executed. This is critical since we cannot rely on the LLM to limit its tool requests to the vetted list advertised by the MCP server. If the LLM requests execution of a method that exists in the Python server but was not intended for external use, the ToolManager in the FastMCP SDK will reject it, preventing MCP from serving as an arbitrary tool execution platform for the LLM. As an additional safeguard, the copilot orchestrator inspects each tool request using Pydantic models to ensure it is valid and on the explicitly allowed list, rejecting any others even if they are advertised by the MCP server. This forces developers to review and approve changes to the MCP server’s tool list during upgrades.

Application-level checks

Since the PG Airman MCP server provides a method (execute_sql) that accepts SQL statements, it opens the door to general SQL code execution and hence must provide application-level checks as a safeguard against dangerous SQL commands, like TRUNCATE and DROP. When deployed in restricted mode, PG Airman MCP performs deep SQL syntax parsing with the pglast Python library to ensure only the SQL commands in the allowed list are executed. Figure 1 highlights an example that leverages Postgres’ ability to use CTEs to update data.
The WITH statement shown on the left of Figure 1 is valid in Postgres and would otherwise parse and update the table despite its top-level appearance as a SELECT statement. PG Airman MCP’s use of deep syntax analysis reliably detects and rejects the nested use of the UPDATE statement.

The PG Airman MCP server includes other safeguards to detect unsafe SQL statements. In restricted mode (used by the copilot), every query runs inside a BEGIN TRANSACTION READ ONLY block and ends with an unconditional ROLLBACK, so that any inadvertent update that slips past the parser is blocked by PostgreSQL and never committed. Timeouts are also enforced to prevent runaway SQL statements that exhaust database resources.

Principle of least privilege

Our quickstart employs a defense-in-depth strategy. The database user the MCP server connects with possesses only the permissions it requires. During deployment, the quickstart creates a dedicated mcp_readonly account that is limited to read access for the objects in the public schema.

Normally, with the ALLOW_COMMENT_IN_RESTRICTED=TRUE setting, the MCP server is able to update comments on tables and views even in restricted mode. However, enabling comment updates to a database object requires granting the MCP database user object ownership access. Since our quickstart does not allow users to update object metadata, it sets ALLOW_COMMENT_IN_RESTRICTED to false and deploys the MCP server with a read-only database user for greater security. If you wish to update object metadata, you may turn this feature on by updating the MCP deployment.yaml file and granting ownership rights to the mcp_readonly account (see the pgvector/scripts.create_readonly_user.sql file).

Redundant layers of security present potential adversaries with additional challenges should one layer fail due to misconfiguration or unexpected issues.
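As one illustration of such a redundant layer, the orchestrator-side allow-list check described under MCP tool registration could look roughly like the sketch below. The copilot uses Pydantic models for this; to stay self-contained, this sketch substitutes a standard-library dataclass, and the names `ALLOWED_TOOLS`, `ToolRequest`, and `validate_tool_request` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any

# Explicitly approved tools; anything else is rejected even if the MCP
# server advertises it. The three names are tools mentioned in this article.
ALLOWED_TOOLS = frozenset({"list_schemas", "list_objects", "execute_sql"})


@dataclass(frozen=True)
class ToolRequest:
    name: str
    arguments: dict[str, Any]


def validate_tool_request(
    raw: dict[str, Any], allowed: frozenset = ALLOWED_TOOLS
) -> ToolRequest:
    """Validate the shape of an LLM tool request and enforce the allow-list."""
    name = raw.get("name")
    arguments = raw.get("arguments", {})
    if not isinstance(name, str) or name not in allowed:
        raise PermissionError(f"tool {name!r} is not on the allow-list")
    if not isinstance(arguments, dict):
        raise ValueError("tool arguments must be a JSON object")
    return ToolRequest(name=name, arguments=arguments)


ok = validate_tool_request(
    {"name": "list_objects", "arguments": {"schema_name": "public"}}
)

try:
    validate_tool_request({"name": "drop_everything", "arguments": {}})
    rejected = False
except PermissionError:
    rejected = True
```

Because the allowed set lives in the orchestrator rather than being derived from the server's tools/list result, a newly advertised tool stays unusable until a developer adds it on purpose.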
Additional basic security measures our quickstart employs:

- Explicitly set pod- and container-level security constraints
  - Implementation: Configured in the template/spec/securityContext section of deployment.yaml in each Helm chart; blocks dangerous syscalls (like reboot) and capabilities.
  - Threat/Attack Vector: Container breakout/escape, privilege escalation via vulnerabilities, and SUID/SGID binary exploitation.
  - Layer: Pod + container
  - Verification: See the VERIFY_SECCOMP.md file.
- Place CPU, memory, and ephemeral storage limits on pods
  - Implementation: Defined in the values.yaml file in each Helm chart.
  - Threat/Attack Vector: Resource exhaustion attacks, such as log flooding and infinite loops.
  - Layer: Container
  - Verification: See the VERIFY_RESOURCES.md file.
- Create a highly available architecture using container replicas
  - Implementation: Utilizes values.yaml and deployment.yaml, retry logic in mcp_direct and retryFetch.ts, and basic health checks.
  - Threat/Attack Vector: Denial of service attacks triggered by the LLM generating complex queries.
  - Layer: Pod
  - Verification: Stress/load testing and direct pod deletion tests.

Additional steps

To simplify deployment and evaluation, the copilot quickstart does not include an authentication or authorization scheme to protect the user interface or backend API. This work typically requires deeper integration with your organization’s chosen security scheme (e.g., OIDC). You can also learn more about PG Airman’s OAuth 2.0 support to protect tool access. Finally, you can explore different guard technologies that use multiple filtering techniques, including regular expressions, traditional machine learning, and smaller LLMs, to detect adversarial input designed to fool LLMs into generating harmful output.
You can also use these measures to detect second-order attacks (i.e., harmful input that makes its way to the LLM via an external process that updates the database).

Performance

In this section, we will discuss several factors that impact the performance of the copilot and other agentic conversational analysts, such as temperature and sampling parameters.

Temperature

Decoder LLMs like Qwen3 and Nemotron generate text one token at a time by sampling from a probability distribution conditioned on the user's query and all previously generated tokens. Figure 2 shows GPT-4o mini selecting the "for" token based on the displayed distribution. Two parameters control this sampling process: temperature and sampling strategy.

[Figure 2: ... amount token (source: https://vibes.sqlgene.com/logprobs-visualizer/)]

The final layer of the decoder assigns a raw score (logit) to each token in the LLM’s vocabulary. Before these logits are converted to a probability distribution, the scores are divided by a temperature parameter. Values less than 1 have a multiplicative effect on the logits and exaggerate their differences, leading to a more peaked distribution, while values greater than 1 dampen their differences and result in a more uniform (random) distribution of nearly equally probable choices.

The table below lists the commonly suggested temperature ranges. The copilot uses a temperature value of 0.1, but this can be changed in the values.yaml file in the copilot_backend Helm chart. A slightly higher value may lead to improved results on insight generation.
| Temperature Range | Effects | Use Case |
| --- | --- | --- |
| 0.1 - 0.4 | Peaked vocabulary distribution; greater determinism with fewer errors/hallucinations; more repetition across conversations | Code generation |
| 0.5 - 0.8 | Middle ground that avoids repetition across conversations and errors/hallucinations | General use |
| 1.0 - 2.0 | Creates a uniform (random) vocabulary distribution; LLM can take many paths | Creative writing and insight generation |

Sampling parameters

With a scaled probability distribution in hand, we need a way to pick the next token. If we just picked the token with the largest probability, there’d be no point in converting from logits and applying scaling since we would have just picked the largest logit. Hence, we need a way to sample from the resulting distribution.

Top-k sampling

The simplest technique spins a weighted wheel on the top k probabilities, but this ignores their actual values. It’s like picking the winners of a race as those with the three fastest times even when there were ten others that finished milliseconds behind them. This may be how things work in sports, but we can do better with LLMs.

Min-P

The sampling technique our quickstart uses is Min-P, which adjusts our cutoff based on a configured percentage of the highest-probability token. Tokens whose probability falls below this value are excluded from consideration. For example, if the threshold is 0.1, only those tokens whose probability is at least 10% of the top token’s probability are kept as candidates for the final spin of the wheel. This works well for two reasons:

- It broadens the list of candidates when many tokens are roughly equally probable, whereas a hard cutoff like top k would arbitrarily eliminate many equally valid tokens from consideration.
- It eliminates low-probability tokens when there exist only one or two tokens at the top of the list with large probability scores, whereas top k would blindly include these low-probability tokens, as we see in Figure 2.
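The temperature scaling and Min-P filtering described above can be sketched in a few lines of standard-library Python. The logits below are invented for illustration (only the token strings "for", "of", and "sold" come from the Figure 2 discussion), and the helper names are hypothetical; an inference server such as vLLM applies these steps internally over the full vocabulary.

```python
import math
import random


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide logits by the temperature, then convert them to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens regardless of their actual values."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Spin the weighted wheel over the remaining candidates."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical logits for a confident prediction.
logits = {"for": 6.0, "of": 2.0, "sold": 1.5, "in": 1.0}

sharp = softmax_with_temperature(logits, 0.1)  # peaked distribution
flat = softmax_with_temperature(logits, 2.0)   # flatter distribution

base = softmax_with_temperature(logits, 1.0)
top_k_candidates = top_k_filter(base, k=3)        # still contains "of" and "sold"
min_p_candidates = min_p_filter(base, min_p=0.1)  # only "for" survives
next_token = sample(min_p_candidates, random.Random(0))
```

With these numbers, top k with k=3 keeps two unlikely tokens in play, while Min-P with a 0.1 threshold leaves only the confident choice. If the logits were close together instead, Min-P would keep all of them.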
In Figure 2, the model is very confident the next token should be "for". Including low-probability tokens like "sold" and "of" as candidates during sampling makes it possible (though improbable) for the LLM to take those paths, leading to errors or poor-quality output.

Prompt engineering

One of the most significant factors that impacts the performance of the copilot is prompt engineering: the ability to craft user queries that lead to better LLM responses. There are several techniques in use and readers are encouraged to learn more, but you can improve the quality of the copilot’s output significantly by remembering two basic principles: provide examples in your query of what you expect, and provide the chain of thought you would follow to generate the correct results.

For example, if you would like the copilot to format or transform your results in a certain way, provide a few concrete examples in your query using real data. If your queries require a procedural approach (e.g., you are querying a financial database that lists various agents and corporations and you’d like to determine if any of the agents have a potential conflict of interest), cue the LLM with the step-by-step reasoning you might follow to answer the query. Refer to this guide to learn more about these techniques. In general, the more you think of the LLM as a talented new hire that needs a little orientation to get started (versus a well-versed colleague that always seems to know what you’re thinking), the better your results will be.

Red Hat vLLM inference server

The most significant factor that affects the performance of your agentic applications is the use of a shared inference server, like Red Hat’s vLLM inference server. LLMs on their own are static collections of floating point numbers that represent the patterns they’ve learned over their training data.
vLLM takes these models and converts them into scalable, secure services that drive generative and other AI tasks across your entire cluster using intelligent queueing techniques, like continuous batching, to ensure reliable and even performance across all clients. Though techniques like prompt engineering can help you get the best performance out of your locally deployed models, to obtain state-of-the-art performance, you may need to deploy larger models with higher-end hardware requirements. Red Hat’s integrated vLLM server ensures your hardware investment is used efficiently across all your AI clients versus being tied to a specific application. Our quickstart’s Helm charts provide everything needed to deploy either the Qwen3 or Nemotron model to Red Hat’s vLLM inference server.

Context length and agent memory

The number of tokens submitted to the LLM increases rapidly as the size of the tool descriptions, tool call results, the system prompt (including the data governance policy), and conversation length increase. This can be especially pronounced for agentic applications that interact with relational data since tool calls have the potential to return thousands of database rows.

LLMs are stateless. Once they respond to a query, they forget about it. Hence, for every turn of a running conversation, all the components in Figure 1 in Part 3 must be resubmitted to the LLM, quickly approaching the maximum number of tokens an LLM can process on given hardware for any given conversation turn.

There is an emerging standard pattern of managing agent context and memory via a dedicated retrieval system that holds all of the agent’s session context. For each subsequent LLM inference, a semantic search is conducted over that store and only the most relevant context is retrieved and added to the LLM prompt, which reduces the required tokens very effectively.
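The retrieval pattern described above can be sketched as follows. This is only an illustration under loud assumptions: bag-of-words cosine similarity stands in for a real embedding-based semantic search, a whitespace word count stands in for a real tokenizer, and `select_context` is a hypothetical helper rather than part of the copilot or any EDB product.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_context(query: str, memory: list[str], token_budget: int) -> list[str]:
    """Pick the most relevant memory entries that fit within token_budget."""
    q = _vector(query)
    ranked = sorted(memory, key=lambda entry: _cosine(q, _vector(entry)), reverse=True)
    selected: list[str] = []
    used = 0
    for entry in ranked:
        cost = len(entry.split())  # crude stand-in for a tokenizer count
        if used + cost <= token_budget:
            selected.append(entry)
            used += cost
    return selected


# Illustrative session memory (invented data).
memory = [
    "tool_result list_schemas: public, sales",
    "tool_result list_objects in sales: customers, orders, invoices",
    "user asked about the weather policy for office closures",
    "tool_result execute_sql: top customers by lifetime value are Acme and Globex",
]

context = select_context(
    "Which customers have the highest lifetime value?", memory, token_budget=20
)
```

Only the two entries most related to the question fit the budget here, so the prompt carries 18 words of history instead of all 31.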
The copilot does not yet include these context optimization and agent memory techniques, but agent memory and context optimization are an active area of research at EDB, and we invite you to reach out to them to learn more.

Potential enhancements

We can enhance the copilot in several ways, including providing conversation persistence. The copilot currently provides an in-memory store for conversations, which is not robust to application restarts. A persistent storage backend should be provided for production environments. Sample prompts can also be saved in persistent storage to allow experts to create and share effective or recurring prompts with a broader team.

The next chapter in two of technology’s biggest success stories

The integration of agentic AI with relational databases marks a pivotal shift in how enterprises interact with their most valuable asset: data. Throughout this four-part series, we’ve explored how the leap from brittle, schema-specific text-to-SQL methods to modern LLM-powered conversational analysts is not just a magic trick but a robust, scalable solution available to your enterprise today.

By giving the LLM the agency to safely navigate schemas, understand object-level metadata, and adhere to broad organizational governance policies, conversational analytics can eliminate the technical barriers that hamper your enterprise’s data analysis workflow. Our quickstart combines the power of Red Hat OpenShift AI with EnterpriseDB’s PG Airman MCP server to provide organizations with a blueprint for building secure, agentic applications that run entirely within their controlled networks. This architecture delivers several critical enterprise advantages:

- Data Sovereignty and Security: Processing queries and data on a secure platform within your custody ensures sensitive PII and proprietary schemas are never exposed to third-party APIs.
- Built-in Governance: Integrating high-level data policies with database-level metadata keeps LLM-generated analysis accurate and compliant.
- Open Standards and Flexibility: Utilizing the Model Context Protocol (MCP) and open-weight LLMs mitigates vendor lock-in, allowing you to seamlessly swap models and orchestration technologies as the AI landscape evolves.

If you haven’t already, we invite you to clone the Data Governance Copilot repository and deploy it on your OpenShift cluster.
