Parsing and Formatting Citations from Perplexity API JSON Responses

Building modern interfaces for Large Language Models (LLMs) requires more than just streaming text. As developers integrate engines like Perplexity, they encounter a specific friction point: converting raw text citations into interactive UI elements.

The Perplexity API returns a response where the generated text contains static markers (e.g., [1][2]) and a separate citations array containing the source URLs. To create a professional user experience, you must parse these markers and replace them with interactive components without breaking the React render cycle or introducing security vulnerabilities.

This guide details the root cause of the rendering challenge and provides a production-ready React implementation to parse, map, and render interactive citations.

Understanding the Data Structure

Before writing the parser, we must understand the raw shape of the data. When querying the Perplexity API (specifically models like sonar-medium-online or pplx-7b-online), the JSON response typically looks like this:

{
  "id": "5f3a2b...",
  "model": "sonar-medium-online",
  "created": 17098234,
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "React Server Components allow you to render components on the server [1]. This reduces the bundle size sent to the client [2]."
      }
    }
  ],
  "citations": [
    "https://react.dev/reference/rsc/server-components",
    "https://nextjs.org/docs/app/building-your-application/rendering/server-components"
  ]
}

The challenge is evident: the content string refers to indices (1-based) that correspond to the citations array (0-based).
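As a minimal sketch, the two fields the renderer needs can be typed and extracted as follows. The interface covers only the fields shown in the sample above. The helper name extractAnswer and the choice to treat citations as optional are assumptions made for illustration, not part of the API.

```typescript
// Minimal shape of the sample response above; only the fields used here are typed.
interface PerplexityChoice {
  message: { role: string; content: string };
}

interface PerplexityResponse {
  choices: PerplexityChoice[];
  citations?: string[]; // assumption: treated as optional to be defensive
}

// Hypothetical helper: pull out the generated text and the source URLs.
function extractAnswer(
  response: PerplexityResponse
): { text: string; citations: string[] } {
  return {
    text: response.choices[0]?.message.content ?? "",
    citations: response.citations ?? [],
  };
}

const sample: PerplexityResponse = {
  choices: [
    { message: { role: "assistant", content: "RSC render on the server [1]." } },
  ],
  citations: ["https://react.dev/reference/rsc/server-components"],
};

const answer = extractAnswer(sample);
// The marker [1] in answer.text refers to answer.citations[0].
console.log(answer.text, answer.citations[0]);
```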

The Core Challenge: String Injection vs. Component Rendering

A common mistake is attempting to use String.prototype.replace() with raw HTML strings and injecting them via dangerouslySetInnerHTML.

Why dangerouslySetInnerHTML Fails Here

  1. XSS Vulnerabilities: Injecting un-sanitized HTML from an external API is a security risk.
  2. Loss of React Context: If you inject a standard <a> tag string, you lose the ability to use React components (like <Tooltip> or <Popover>) or Next.js <Link> components for internal routing.
  3. Event Handling: You cannot attach React event handlers (onClick, onMouseEnter) to string-injected HTML.

The correct approach requires transforming the string into an array of React Nodes.

The Solution: Regex-Driven Splitting

To solve this, we utilize a powerful but often overlooked feature of JavaScript's String.prototype.split(). If the regular expression used in split contains capturing parentheses, the matched results are included in the output array.

We will use this behavior to deconstruct the text stream into linear segments, identifying which segments are citations and which are plain text.

The Parsing Logic

We need a regex that identifies the pattern [n], where n is a number.

/(\[\d+\])/g
  1. \[ and \]: Escaped brackets to match literal characters.
  2. \d+: Matches one or more digits.
  3. (...): The capturing group ensures the delimiter itself is returned in the array.
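The regex can be tried outside of React first. The sketch below splits a string on the markers and tags each piece as either text or a citation. The Segment type and the tokenizeCitations name are hypothetical, introduced only for this example.

```typescript
// Hypothetical intermediate representation for one piece of the parsed text.
type Segment =
  | { kind: "text"; value: string }
  | { kind: "citation"; number: number };

function tokenizeCitations(text: string): Segment[] {
  // The capturing group keeps each marker as its own array entry.
  return text.split(/(\[\d+\])/).map((part): Segment => {
    const match = part.match(/^\[(\d+)\]$/);
    return match
      ? { kind: "citation", number: parseInt(match[1], 10) } // 1-based, as written in the text
      : { kind: "text", value: part };
  });
}

console.log(tokenizeCitations("Text [1] End"));
// [
//   { kind: "text", value: "Text " },
//   { kind: "citation", number: 1 },
//   { kind: "text", value: " End" }
// ]
```

Keeping the parse separate from the rendering like this makes the logic easy to unit test without mounting a component.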

Implementation: The CitationRenderer Component

Below is a complete, TypeScript-typed React component. It accepts the raw text and the citations list, then handles the parsing and rendering safely.

This component handles the offset math (converting [1] to index 0) and validates bounds, so a marker whose index the LLM hallucinated falls back to plain text instead of rendering a link with no URL.

import React, { useMemo } from 'react';

interface CitationRendererProps {
  text: string;
  citations: string[];
}

export const CitationRenderer: React.FC<CitationRendererProps> = ({ 
  text, 
  citations 
}) => {
  // Memoize the parsing to prevent expensive regex operations on every re-render
  const elements = useMemo(() => {
    // Regex to split by citation markers like [1], [2], etc.
    // The capturing group (\[\d+\]) ensures the marker is included in the parts array.
    const parts = text.split(/(\[\d+\])/g);

    return parts.map((part, index) => {
      // Check if the current part is a citation marker
      const citationMatch = part.match(/^\[(\d+)\]$/);

      if (citationMatch) {
        // Convert the captured 1-based number (e.g. "1") to a 0-based array index
        const citationIndex = parseInt(citationMatch[1], 10) - 1;
        const url = citations[citationIndex];

        // Safety check: ensure the citation exists in the provided array
        if (url) {
          return (
            <CitationChip 
              key={`${index}-${citationIndex}`} 
              index={citationIndex + 1} 
              url={url} 
            />
          );
        }
      }

      // Return regular text nodes for non-citation parts
      return <span key={index}>{part}</span>;
    });
  }, [text, citations]);

  return (
    <div className="leading-7 text-gray-800 dark:text-gray-200">
      {elements}
    </div>
  );
};

// Sub-component for the interactive citation
const CitationChip = ({ index, url }: { index: number; url: string }) => {
  return (
    <a
      href={url}
      target="_blank"
      rel="noopener noreferrer"
      className="
        inline-flex items-center justify-center 
        align-baseline mx-0.5 px-1.5 py-0.5
        text-[10px] font-bold text-blue-600 bg-blue-50 
        rounded-full cursor-pointer 
        hover:bg-blue-100 hover:text-blue-700 
        transition-colors duration-200
        border border-blue-200
        no-underline translate-y-[-2px]
      "
      aria-label={`Citation ${index}`}
    >
      {index}
    </a>
  );
};

Deep Dive: How It Works

1. The Split Technique

When text.split(/(\[\d+\])/g) runs on "Text [1] End", the resulting array is: ["Text ", "[1]", " End"]

Without the capturing parentheses in the regex, the output would simply be ["Text ", " End"], and the separator would be lost. By capturing it, we maintain the sequence of the content while isolating the markers we need to replace.

2. Zero-Based Index Mapping

Perplexity (and most academic standards) use 1-based indexing for display ([1]). Arrays in JavaScript are 0-based. The line const citationIndex = parseInt(citationMatch[1], 10) - 1; performs this translation. It extracts the digits captured by the inner group (\d+) and subtracts one to access the correct URL in the citations array.

3. Rendering Safety

The condition if (url) is critical. LLMs are non-deterministic, so the model may generate text containing [5] even if it provided only 3 citation URLs. Without this check, the chip would render as a link with no destination. With it, the unmatched marker falls through to the plain-text branch and is shown as literal text.
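The index translation and the bounds check can also be isolated in a small pure function. This is a sketch of a slightly stricter variant that checks the range explicitly. The name resolveCitationUrl is hypothetical.

```typescript
// Hypothetical helper: map a marker such as "[2]" to its URL,
// or return null when the marker cannot be resolved.
function resolveCitationUrl(marker: string, citations: string[]): string | null {
  const match = marker.match(/^\[(\d+)\]$/);
  if (!match) return null; // not a citation marker at all

  const zeroBased = parseInt(match[1], 10) - 1; // "[1]" -> index 0
  if (zeroBased < 0 || zeroBased >= citations.length) return null; // e.g. "[0]" or "[5]"

  return citations[zeroBased] || null; // an empty string is treated as missing
}

const urls = ["https://react.dev", "https://nextjs.org"];
console.log(resolveCitationUrl("[1]", urls)); // "https://react.dev"
console.log(resolveCitationUrl("[5]", urls)); // null
console.log(resolveCitationUrl("[0]", urls)); // null
```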

Styling and UX Considerations

In the code above, we used Tailwind CSS for styling. There are specific choices made for usability:

  • translate-y-[-2px]: This gives the citation a "superscript" feel without actually breaking the line-height rhythm of the paragraph, which often happens with the native <sup> tag.
  • Hit Area: We added padding (px-1.5 py-0.5) so the clickable area extends beyond the small text (text-[10px]). Test this on touch devices and increase the padding if the chip is hard to tap.
  • Accessibility: The aria-label ensures screen readers announce "Citation 1" rather than just reading the number "one" mid-sentence.

Handling Edge Cases

When implementing this in a production environment, consider these edge cases:

Consecutive Citations

The text might contain [1][2] without spaces.

  • Result: Our regex split handles this without any changes.
  • Output: ["...", "[1]", "", "[2]", "..."]. The empty string in the middle renders as an empty <span>, which is invisible, and the two chips appear side-by-side.
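The behavior with adjacent markers can be checked directly. If you would rather not emit empty text nodes, one option (an optional refinement, not something the component above requires) is to filter out empty strings before mapping the parts to elements:

```typescript
const rawParts = "See [1][2] for details".split(/(\[\d+\])/);
console.log(rawParts);
// ["See ", "[1]", "", "[2]", " for details"]

// Optional: drop the empty strings produced between adjacent markers.
const nonEmptyParts = rawParts.filter((part) => part.length > 0);
console.log(nonEmptyParts);
// ["See ", "[1]", "[2]", " for details"]
```

Filtering before calling map keeps the array index usable as a React key, since the indices are assigned after the empty entries are gone.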

Streaming Responses

If you are streaming the response token-by-token:

  1. The text prop will update frequently.
  2. A chunk may end halfway through a marker (e.g., ...content [1). The incomplete marker does not match the regex, so it is rendered as plain text until the closing bracket arrives.
  3. Fix: Because text changes with every chunk, useMemo recomputes the parse each time, and you might see the brackets "flicker" before converting to a chip. This is generally acceptable. For a smoother experience, you can implement a buffer that only updates the rendered output when a token completes a sentence or a citation marker.
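One way to sketch such a buffer is to hold back a trailing, unfinished marker and render only the text before it. The helper name splitPendingMarker and its return shape are assumptions for illustration. It covers only the citation-marker case, not sentence boundaries.

```typescript
// Hypothetical helper: separate text that is safe to render from a
// trailing marker that has not been closed yet.
function splitPendingMarker(streamed: string): { stable: string; pending: string } {
  // Matches "[" followed by zero or more digits at the very end, e.g. "[" or "[12".
  const match = streamed.match(/\[\d*$/);
  if (!match || match.index === undefined) {
    return { stable: streamed, pending: "" };
  }
  return {
    stable: streamed.slice(0, match.index),
    pending: streamed.slice(match.index),
  };
}

console.log(splitPendingMarker("content [1"));
// { stable: "content ", pending: "[1" }
console.log(splitPendingMarker("content [1]"));
// { stable: "content [1]", pending: "" }
```

Passing only stable to the renderer means a half-received marker never appears on screen. Once the stream ends, pass the full text so that a literal trailing bracket is not dropped.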

Conclusion

Parsing unstructured text from APIs like Perplexity into structured React components is a necessary step for building polished AI interfaces. By moving away from string replacement and embracing regex-based array splitting, you ensure your application remains secure, performant, and capable of rendering rich, interactive citations.