ShareGPT-4o-Image & Janus-4o: Open-Source Models Reaching GPT-4o Output Quality

July 24 · Published in AI Models

ShareGPT-4o-Image is a high-quality dataset built from GPT-4o’s image generation outputs. It contains roughly 92,000 samples and is designed to help open-source multimodal models approach GPT-4o’s image generation quality. (Note: the dataset captures the visual quality of GPT-4o’s outputs; it does not replicate the underlying model architecture.)

The dataset is divided into two primary categories:

  • Text-to-Image: 45,717 samples, each pairing a written prompt with a generated image.
  • Text-and-Image-to-Image: 46,539 samples, each combining a text prompt with an existing image to produce a new, modified image.
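
As a quick sanity check on the figures above, the following sketch adds up the two category counts and computes each category's share of the dataset. The dictionary and variable names are illustrative only and are not part of any dataset tooling.

```python
# Sample counts as stated in the list above.
sample_counts = {
    "text_to_image": 45_717,
    "text_and_image_to_image": 46_539,
}

# Total size and the fraction contributed by each category.
total = sum(sample_counts.values())
shares = {name: count / total for name, count in sample_counts.items()}

print(f"total samples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The two counts sum to 92,256, which is consistent with the rounded figure of about 92,000 quoted earlier, and the split between the two tasks is close to even.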

Janus-4o Model

Overview

Janus-4o is a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. It is built on the Janus-Pro architecture and fine-tuned on the ShareGPT-4o-Image dataset.

After fine-tuning, Janus-4o shows measurable improvements in image quality and gains the ability to process combined text and image inputs. However, its overall performance remains slightly below GPT-4o’s native image generation.

How to Use Janus-4o

1. Setup

Clone the official Janus repository and install the necessary dependencies:

git clone https://github.com/deepseek-ai/Janus.git
cd Janus
pip install -e .

2. Run Inference

Text-to-Image Generation

Load the model and processor, then define a generation function that accepts parameters such as temperature and parallel size. Call the function with your prompt and output path.

# Model and processor loading is omitted here (partial code shown).
# vl_chat_processor, vl_gpt, and text_to_image_generate must already be defined.
prompt = "A stunning princess from Kabul in red and white traditional clothing, blue eyes, brown hair"
image_output_path = "./test.png"  # destination for the generated output
text_to_image_generate(prompt, image_output_path, vl_chat_processor, vl_gpt, parallel_size=2)
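
To illustrate what temperature and parallel size typically control in autoregressive sampling, here is a minimal sketch using only the standard library. It is not Janus-4o's implementation: the function names, the toy logits, and the reading of parallel size as "number of independent samples drawn from the same distribution" are assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_parallel(logits, temperature=1.0, parallel_size=2, seed=0):
    """Draw `parallel_size` independent token ids from the same distribution."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=parallel_size)


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
samples = sample_parallel(toy_logits, temperature=1.0, parallel_size=2)
```

In this toy setting, a lower temperature concentrates probability on the highest-scoring token, while a higher one spreads it out, which is why temperature is commonly used to trade fidelity against variety.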

Text-and-Image-to-Image Generation

After loading the model and processor, call the generation function with the prompt, the source image path, and the output path.

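# Assumes vl_chat_processor, vl_gpt, and text_and_image_to_image_generate are already defined.
prompt = "Turn the image into a nighttime scene."
input_image_path = "./test_input.png"    # existing image to modify
image_output_path = "./test_output.png"  # destination for the edited output
text_and_image_to_image_generate(prompt, input_image_path, image_output_path, vl_chat_processor, vl_gpt, parallel_size=2)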

A Gradio-based web interface is also available for easier testing:

pip install -e ".[gradio]"
python demo/app_janus4o.py

Training

To reproduce the Janus-4o results, use the provided training script. It starts from Janus-Pro and fine-tunes the model on the ShareGPT-4o-Image dataset for both generation tasks.

accelerate launch --config_file configs/sft.yaml \
    --num_processes 8  \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard train_janus.py \
    --model_path deepseek-ai/Janus-Pro-7B \
    --data_path [FreedomIntelligence/ShareGPT-4o-Image] \
    --n_epochs 3 \
    --train_bsz_per_gpu 1 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8
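
The flags above imply an effective batch size that can be worked out by hand. The sketch below does that arithmetic under stated assumptions: each of the 8 processes drives one GPU, train_janus.py uses ordinary data-parallel training with gradient accumulation, and every sample in both dataset categories is used once per epoch. The variable names are illustrative, and the step counts are rough estimates, not figures reported by the authors.

```python
import math

# Values copied from the training command above.
num_processes = 8                # --num_processes (assumed: one process per GPU)
train_bsz_per_gpu = 1            # --train_bsz_per_gpu
gradient_accumulation_steps = 8  # --gradient_accumulation_steps
n_epochs = 3                     # --n_epochs

# Samples contributing to one optimizer update.
effective_batch_size = num_processes * train_bsz_per_gpu * gradient_accumulation_steps

# Assumption: all text-to-image and text-and-image-to-image samples are used.
num_samples = 45_717 + 46_539
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * n_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch}")
print(f"approx. optimizer steps in total: {total_steps}")
```

Changing the number of GPUs without adjusting the accumulation steps changes the effective batch size, so the learning rate may need retuning if the hardware differs from this setup.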