Training with aspect ratio bucketing can greatly improve the output quality of image generation models (and we personally don’t want another base model trained with center crops), so we have decided to release the NovelAI bucketing code under a permissive MIT license.
https://github.com/NovelAI/novelai-aspect-ratio-bucketing
The repository provides an implementation of aspect ratio bucketing for training generative image models, as described in our previous blog post:
Aspect Ratio Bucketing
One common issue with existing image generation models is that they are prone to producing images with unnatural crops. This is because these models are trained to produce square images, while most photos and artworks are not square. A model can only operate on images of the same size at the same time, and during training it is common practice to process multiple training samples at once to make efficient use of the GPUs. As a compromise, square images are chosen, and during training only the center of each image is cropped out and shown to the image generation model as a training example.
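For concreteness, here is a minimal sketch of that conventional square center-crop preprocessing. This is not the NovelAI code; the function name and parameters are our own illustration, using Pillow:

```python
from PIL import Image

def center_crop_square(image: Image.Image, size: int = 512) -> Image.Image:
    """Crop the largest centered square from an image and resize it."""
    width, height = image.size
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    square = image.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.LANCZOS)

# A 512x1024 portrait loses its top and bottom quarters here, which is
# exactly how feet, heads, and crowns disappear from the training data.
```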

Because the model only ever sees these cropped images, it learns to reproduce similar crops in its own outputs. For example, humans are often generated without feet or heads, and swords consist of only a blade, with the hilt and point lying outside the frame.
As we are creating an image generation model to accompany our storytelling experience, it is important that our model is able to produce proper, uncropped characters, and generated knights should not be holding a metallic-looking straight line extending to infinity.

Another issue with training on cropped images is that it can lead to a mismatch between the text and the image.
For example, an image with a `crown` tag will often no longer contain a crown after a center crop is applied and the monarch has thereby been decapitated.
We found that using random crops instead of center crops only slightly improves these issues.
Using Stable Diffusion with variable image sizes is possible, although we noticed that going too far beyond the native resolution of 512x512 tends to introduce repeated image elements, while very low resolutions produce indiscernible images.
Still, this indicated to us that training the model on variable-sized images should be possible. Training on single, variable-sized samples would be trivial, but it would also be extremely slow and more prone to training instability due to the lack of regularization provided by mini-batches.
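Bucketing resolves this tension: images are grouped into a set of shared resolutions with roughly the same pixel count as 512x512, and each mini-batch is then drawn from a single bucket, so every sample in a batch shares a resolution without being square-cropped. The following is a rough sketch of that idea, not the released implementation; the dimension range, step size, and assignment rule here are simplified assumptions:

```python
import math

MAX_PIXELS = 512 * 512  # keep every bucket near the native pixel budget

def make_buckets(step: int = 64, min_dim: int = 256, max_dim: int = 1024):
    """Enumerate (width, height) buckets whose area stays within MAX_PIXELS."""
    buckets = set()
    width = min_dim
    while width <= max_dim:
        # Largest multiple of `step` that keeps the area within budget.
        height = min(max_dim, (MAX_PIXELS // width) // step * step)
        if height >= min_dim:
            buckets.add((width, height))
            buckets.add((height, width))  # portrait counterpart
        width += step
    return sorted(buckets)

def assign_bucket(width: int, height: int, buckets):
    """Pick the bucket whose aspect ratio is closest in log space."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))

buckets = make_buckets()
print(assign_bucket(1200, 800, buckets))  # -> (640, 384) with these parameters
```

Comparing aspect ratios in log space treats a bucket that is twice as wide and one that is twice as tall as equally distant, which keeps the assignment symmetric between landscape and portrait images.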
We hope to see many non-cropped images in the future!