Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Abstract

Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer both stylistic diversity and fine-grained controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations.

We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations.

To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations.

Meanwhile, we develop a practical set of evaluation metrics covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results establish MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework to open new avenues of research in 3D city generation.

MajutsuCity Overview

Figure 1: MajutsuCity is a language–driven, aesthetic-adaptive system that unifies controllable urban scene generation and interactive editing within a single framework. Conditioned on textual instructions, the framework synthesizes a complete stylized city through layout–height creation, asset instantiation, and terrain/material generation, and further enables iterative refinement through five atomic editing operations. This paradigm forms the core contribution of MajutsuCity, empowering users to create and continuously modify large-scale, stylistically diverse urban scenes through natural language.

Method

To enable controllable generation of object-level urban scenes directly from natural language, we propose the MajutsuCity framework, which bridges high-level textual intent and structured 3D scene composition.

Methodology Pipeline

Figure 2: Overview of the proposed MajutsuCity framework. MajutsuCity is an aesthetic-adaptive generative framework that enables controllable, object-level 3D urban scene generation from natural language descriptions. It consists of Scene Design, Layout Generation, Assets & Materials Generation, and Scene Generation.

As shown in Figure 2, the framework consists of four major stages:

Scene Design

Converting textual requirements into structured and consistent design guidance.

Layout Generation

Synthesizing spatially coherent urban layouts and height maps under semantic and topological constraints.

Assets & Materials Generation

Producing high-fidelity building-level 3D assets, material maps, and skyboxes.

Scene Generation

Composing assets and environmental layers into a coherent and renderable 3D city.

Furthermore, we develop MajutsuAgent, an interactive editing agent that enables fine-grained, human-in-the-loop scene manipulation with high controllability and consistency; a minimal sketch of the overall pipeline and editing loop follows.
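For concreteness, the sketch below outlines how the four stages and the agent's edit loop could be orchestrated in Python. Every name, field, and operation verb here is an illustrative placeholder rather than the released MajutsuCity API; in particular, the five edit verbs are assumptions, since this page does not enumerate them.

from dataclasses import dataclass, field

# Illustrative placeholders only, not the released MajutsuCity API.
@dataclass
class CityScene:
    layout: list = field(default_factory=list)     # semantic layout map (stub)
    height: list = field(default_factory=list)     # height map (stub)
    assets: list = field(default_factory=list)     # instantiated building assets
    materials: dict = field(default_factory=dict)  # PBR materials and skybox

def design_scene(prompt: str) -> dict:
    """Stage 1: Scene Design -- distill the prompt into structured guidance."""
    return {"style": "cyberpunk" if "neon" in prompt else "generic"}

def generate_layout(spec: dict):
    """Stage 2: Layout Generation -- layout and height map under constraints."""
    return [["road", "building"]], [[0.0, 24.0]]

def generate_assets(spec: dict, layout):
    """Stage 3: Assets & Materials -- building assets, materials, skybox."""
    return [{"kind": "tower", "style": spec["style"]}], {"skybox": "dusk.hdr"}

def compose_scene(layout, height, assets, materials) -> CityScene:
    """Stage 4: Scene Generation -- assemble a renderable city."""
    return CityScene(layout, height, assets, materials)

def generate_city(prompt: str) -> CityScene:
    spec = design_scene(prompt)
    layout, height = generate_layout(spec)
    assets, materials = generate_assets(spec, layout)
    return compose_scene(layout, height, assets, materials)

# MajutsuAgent supports five object-level edits; these verbs are guesses.
EDIT_OPS = {"add", "remove", "replace", "move", "restyle"}

def edit_scene(scene: CityScene, op: str, target: int, params: dict) -> CityScene:
    if op not in EDIT_OPS:
        raise ValueError(f"unsupported edit operation: {op}")
    if op == "restyle":  # example handler: retexture one asset, keep the rest
        scene.assets[target] = {**scene.assets[target], **params}
    return scene

scene = generate_city("a neon-lit city at dusk")
scene = edit_scene(scene, "restyle", 0, {"style": "art deco"})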

Dataset

To enhance the realism and controllability of 3D scene generation, we introduce MajutsuDataset, a high-quality multimodal dataset designed for text-guided 3D scene synthesis. As illustrated in Figure 3, the dataset comprises three major components: Layout/Elevation, 3D Building Models, and Material Assets.

MajutsuDataset

Figure 3: Overview of MajutsuDataset, a high-quality multimodal dataset designed for text-guided 3D urban scene generation. (a) The OSM-based Layout/Elevation subset provides paired semantic layout maps, height maps, and detailed textual descriptions generated by GPT-5-mini. (b) The 3D Building Models subset includes 1,000 assets covering diverse architectural styles. (c) The Texture Map subset contains a large-scale library of seamlessly tileable PBR materials and HDR skybox maps.
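To make the dataset's organization concrete, here is a hypothetical manifest sketch in Python covering the three subsets described above; the field names and file paths are assumptions for illustration and do not reflect the released format.

import json

# Hypothetical manifest schema for MajutsuDataset; field names and file
# paths are illustrative assumptions, not the released layout.
manifest = {
    "layout_elevation": [{
        "layout_map": "layouts/0001_semantic.png",  # OSM-derived semantic layout
        "height_map": "layouts/0001_height.png",    # paired elevation map
        "caption": "dense riverside blocks ...",    # GPT-5-mini description
    }],
    "building_models": [{                           # 1,000 assets in total
        "mesh": "buildings/0420.glb",
        "style": "gothic",
    }],
    "materials": [{
        "albedo": "pbr/brick_03/albedo.png",        # seamlessly tileable PBR set
        "normal": "pbr/brick_03/normal.png",
        "roughness": "pbr/brick_03/roughness.png",
    }],
    "skyboxes": ["skyboxes/overcast_noon.hdr"],     # HDR environment maps
}

print(json.dumps(manifest, indent=2))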

Results

Table 1: Quantitative comparison of Layout Generation.

Layout table
Layout Result

Figure 4: Qualitative comparison of city layout generation. Our method yields more realistic and coherent urban layouts than the prior InfinityGAN, CityDreamer, and CityCraft.
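The relative FID reductions quoted in the abstract (83.7% over CityDreamer, 20.1% over CityCraft) follow the standard relative-improvement formula; the sketch below makes the arithmetic explicit using placeholder numbers rather than the paper's actual FID values.

def relative_fid_reduction(fid_ours: float, fid_baseline: float) -> float:
    """Percent reduction in layout FID relative to a baseline (lower is better)."""
    return 100.0 * (fid_baseline - fid_ours) / fid_baseline

# Illustrative values only: a baseline FID of 100.0 and an ours FID of 16.3
# would correspond to the reported 83.7% reduction.
print(relative_fid_reduction(16.3, 100.0))  # -> 83.7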

Table 2: Absolute Quantitative Scoring (AQS) and Relative Dimension Ranking (RDR) for city scene generation. For each metric, we report both GPT-based and user-based scores.

Scene table
Scene Result

Figure 5: Qualitative comparison of city scene generation. We compare our method with CityDreamer, GaussianCity, UrbanWorld and CityCraft across two representative scenes. Our approach produces scenes with higher geometric fidelity, better multi-view consistency, and richer stylistic diversity than all baselines.
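For readers who want comparable aggregates, the sketch below shows one plausible way to compute AQS- and RDR-style summaries from per-dimension ratings collected from either GPT-based or user-based judges; the paper's exact scoring protocol may differ, and the ratings shown are illustrative.

from statistics import mean

def aqs(scores: dict, method: str) -> float:
    """Absolute Quantitative Scoring: mean absolute rating across dimensions."""
    return mean(scores[method].values())

def rdr(scores: dict, method: str) -> float:
    """Relative Dimension Ranking: mean rank across dimensions (1 = best)."""
    dims = next(iter(scores.values())).keys()
    ranks = []
    for d in dims:
        ordered = sorted(scores, key=lambda m: scores[m][d], reverse=True)
        ranks.append(ordered.index(method) + 1)
    return mean(ranks)

scores = {  # illustrative per-dimension ratings, not the paper's data
    "Ours":        {"geometry": 9.1, "material": 8.8, "lighting": 8.9},
    "CityDreamer": {"geometry": 7.2, "material": 6.9, "lighting": 7.0},
}
print(aqs(scores, "Ours"), rdr(scores, "Ours"))  # -> 8.93..., 1.0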

Four Styles Result

Figure 6: Style-driven city generation results. Four city scenes generated by MajutsuCity in different well-known styles exhibit high fidelity and strong intra-style consistency.

Citation

@article{huang2025majutsucity,
  title={MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts},
  author={Huang, Zilong and He, Jun and Huang, Xiaobin and Xiong, Ziyi and Luo, Yang and Ye, Junyan and Li, Weijia and Chen, Yiping and Han, Ting},
  journal={arXiv preprint arXiv:2511.20415},
  year={2025}
}