First, a convolutional variational autoencoder (VAE) was trained on the 3D voxel volumes to produce a decoder model that could take in latent space vectors and produce a design. The encoder half was also used to generate the database of latent vectors for the text model.

Second, a text encoder model was used to generate latent space vectors from the text descriptions. It was trained to directly predict the encoded latent vectors of known models from the descriptions generated earlier.

These two models were trained separately and then combined into a single model that takes the initial text description to a latent space vector and passes it through the decoder to generate a 3D design.
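As a rough sketch, chaining the two trained pieces is only a few lines of Keras. The file names below are placeholders, and the input shape assumes the fixed-length embedded text sequence described in the Tech Specs further down:

```python
import tensorflow as tf

# Hypothetical file names -- stand-ins for the two separately trained models.
text_encoder = tf.keras.models.load_model("text_encoder.h5")
shape_decoder = tf.keras.models.load_model("shape_decoder.h5")

# Chain the frozen models: embedded text -> 128-dim latent -> voxel volume.
text_input = tf.keras.Input(shape=(50, 300))  # 50 GloVe-embedded tokens
latent = text_encoder(text_input)             # (batch, 128)
voxels = shape_decoder(latent)                # (batch, D, H, W, 1)
text_to_shape = tf.keras.Model(text_input, voxels)
```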

Shape Autoencoder

The shape autoencoder was highly successful at generating and interpolating between many different kinds of objects. Below is a t-SNE map of the latent space vectors, colorized by category. Most of the clusters are clearly segmented, with some overlap between similar designs such as tall round lamps and bottles. While there are a few out-of-place samples, like some tables in the chair region, inspecting them manually shows that these models are indeed quite odd and in fact closer in shape to the surrounding models than to their parent category.

Tech Specs: The model used five 3D convolutional layers for both the encoder and the decoder and had a 128-dimensional latent space vector. Dropout and L2 regularization were used to make the model more generalizable. In total, the model had ~3.2 million parameters and took ~30 hours to train on a single Nvidia V100 GPU.
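For concreteness, here is a minimal Keras sketch of this kind of architecture. Only the five-layer structure and the 128-dim latent come from the specs above; the voxel resolution, filter counts, dropout rate, and regularization weight are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

LATENT_DIM = 128
reg = regularizers.l2(1e-4)  # assumed weight; the post doesn't give one

def conv_block(x, filters):
    x = layers.Conv3D(filters, 3, strides=2, padding="same",
                      activation="relu", kernel_regularizer=reg)(x)
    return layers.Dropout(0.2)(x)  # assumed dropout rate

# Encoder: five strided 3D convolutions down to the latent parameters.
inp = tf.keras.Input(shape=(32, 32, 32, 1))  # assumed voxel resolution
x = inp
for f in (8, 16, 32, 64, 128):
    x = conv_block(x, f)
x = layers.Flatten()(x)
z_mean = layers.Dense(LATENT_DIM)(x)
z_log_var = layers.Dense(LATENT_DIM)(x)

# Reparameterization trick: sample z = mu + sigma * epsilon.
def sample(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])
encoder = tf.keras.Model(inp, [z_mean, z_log_var, z])

# Decoder: five transposed 3D convolutions back up to the voxel grid.
latent_in = tf.keras.Input(shape=(LATENT_DIM,))
x = layers.Dense(128, activation="relu")(latent_in)
x = layers.Reshape((1, 1, 1, 128))(x)
for f in (128, 64, 32, 16, 8):
    x = layers.Conv3DTranspose(f, 3, strides=2, padding="same",
                               activation="relu", kernel_regularizer=reg)(x)
out = layers.Conv3D(1, 3, padding="same", activation="sigmoid")(x)
decoder = tf.keras.Model(latent_in, out)
```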

t-SNE map of the latent space vectors colored according to object category.
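The map itself can be produced with scikit-learn's t-SNE. Here `latents.npy` and `categories.npy` are hypothetical stand-ins for the encoded database and its class labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed files: the (N, 128) array of encoded shapes and matching labels.
latents = np.load("latents.npy")
categories = np.load("categories.npy")

# Project the 128-dim latent vectors down to 2D for plotting.
coords = TSNE(n_components=2, perplexity=30).fit_transform(latents)
plt.scatter(coords[:, 0], coords[:, 1], c=categories, cmap="tab10", s=4)
plt.title("t-SNE of shape latent vectors by category")
plt.show()
```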

The GIFs below show random walks between encoded shape vectors of different designs. They demonstrate that the model has learned to smoothly interpolate between disparate geometries.

Interpolating between different types of swivel chairs (many more GIF examples on my GitHub; these can also be generated in real time on the app here)
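These walks amount to decoding points along a path between latent vectors. A minimal sketch of one linear segment, reusing the `encoder` and `decoder` from the VAE sketch above (`voxels_a` and `voxels_b` are two assumed input designs):

```python
import numpy as np

# encoder.predict returns [z_mean, z_log_var, z]; use the mean vectors.
z_a = encoder.predict(voxels_a[None, ...])[0][0]
z_b = encoder.predict(voxels_b[None, ...])[0][0]

# Decode evenly spaced points along the segment between the two latents;
# thresholding the sigmoid output yields one binary voxel frame per GIF step.
frames = [
    decoder.predict(((1 - t) * z_a + t * z_b)[None, :])[0] > 0.5
    for t in np.linspace(0.0, 1.0, 30)
]
```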

Text Encoder

The text encoder model was moderately successful at predicting the latent space vectors from an input description. Predicting 128 separate continuous values is a difficult task, and the model effectively had to reverse-engineer how the 3D encoder worked on top of interpreting the text. This difficulty is compounded by the fact that a large variety of models can match a given text description, especially when the descriptions are short, e.g. ‘a regular chair’ or ‘a wide bed.’

Tech Specs: spaCy (GloVe) word embeddings were used to encode the text tokens as vectors, with a maximum sequence length of 50 words. The model then uses three bidirectional LSTM layers and four fully connected dense layers to generate the latent space vectors from the text descriptions. Spatial 1D dropout was applied to the word embeddings, along with regular dropout and L2 regularization on all subsequent layers. The total parameter count was ~3.1 million, and the model took ~25 hours to fully train.
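A minimal sketch of this architecture in Keras, using spaCy's pretrained GloVe vectors. The layer widths, dropout rates, and regularization weight are assumptions; the three-BiLSTM/four-dense structure and 50-token limit come from the specs above:

```python
import numpy as np
import spacy
import tensorflow as tf
from tensorflow.keras import layers, regularizers

MAX_LEN, EMB_DIM, LATENT_DIM = 50, 300, 128  # spaCy's md vectors are 300-d
reg = regularizers.l2(1e-4)  # assumed weight

# Embed a description with spaCy's GloVe vectors, zero-padded to 50 tokens.
nlp = spacy.load("en_core_web_md")
def embed(text):
    vecs = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
    for i, tok in enumerate(nlp(text)[:MAX_LEN]):
        vecs[i] = tok.vector
    return vecs

# Three bidirectional LSTMs followed by four dense layers, regressing
# directly onto the 128-dim shape latent.
inp = tf.keras.Input(shape=(MAX_LEN, EMB_DIM))
x = layers.SpatialDropout1D(0.2)(inp)
for _ in range(2):
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True,
                                         kernel_regularizer=reg))(x)
x = layers.Bidirectional(layers.LSTM(128, kernel_regularizer=reg))(x)
for units in (256, 256, 128):
    x = layers.Dense(units, activation="relu", kernel_regularizer=reg)(x)
    x = layers.Dropout(0.2)(x)
out = layers.Dense(LATENT_DIM)(x)  # fourth dense layer: the latent itself
text_encoder = tf.keras.Model(inp, out)
text_encoder.compile(optimizer="adam", loss="mse")
```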

Tech Stack For Training Models

To train the TensorFlow-based models, I used AWS EC2 (Elastic Compute Cloud) spot instances with EBS (Elastic Block Store) drives storing all data. Spot instances allowed me to use p3.2xlarge instance types at ~1/3 the cost of on-demand, which enabled significantly more training runs. The p3.2xlarge instances accelerated training by ~6x while costing only ~3x as much as p2.xlarge instances, so they were both significantly faster and more cost-effective. Additionally, I created a custom AMI (Amazon Machine Image) and launch template to drastically reduce setup time for new instances.

To store the ~90 GB of training data and quickly recover from spot instance terminations, I used several dedicated EBS drives with the same data layout, so I could simply attach them to new spot instances as required. To download the training data and set up the EBS drives, I used a persistent m4.large instance instead of the more expensive GPU-accelerated instances.
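A minimal boto3 sketch of this launch-and-attach pattern. The region, launch template name, and volume ID are placeholders, not the project's actual resources:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Launch a p3.2xlarge spot instance from the prebuilt AMI/launch template.
resp = ec2.run_instances(
    MinCount=1, MaxCount=1,
    LaunchTemplate={"LaunchTemplateName": "training-template"},
    InstanceType="p3.2xlarge",
    InstanceMarketOptions={"MarketType": "spot"},
)
instance_id = resp["Instances"][0]["InstanceId"]

# Attach one of the prepared EBS data drives once the instance is up.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ec2.attach_volume(VolumeId="vol-0123456789abcdef0",
                  InstanceId=instance_id, Device="/dev/sdf")
```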

To deploy code and synchronize data from training runs on multiple instances, I used rsync. I also developed a logging class that organized the training data for each run into a consistent folder structure.
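The logging class itself isn't shown in the post; here is a minimal sketch of the kind of per-run layout it might enforce (all names are assumptions):

```python
import datetime
import json
import pathlib

class RunLogger:
    """Sketch of a per-run folder layout (structure assumed, not the original)."""

    def __init__(self, root="runs"):
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        self.dir = pathlib.Path(root) / stamp
        for sub in ("checkpoints", "logs", "samples"):
            (self.dir / sub).mkdir(parents=True, exist_ok=True)

    def save_config(self, config: dict):
        (self.dir / "config.json").write_text(json.dumps(config, indent=2))
```

A consistent layout like this is what makes a single rsync invocation enough to pull every run's results back from any instance.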

Putting It All Together - Final Results

To deploy the final model, I made a Streamlit app where the user can type in a text description and interactively view the generated 3D model. The app uses Streamlit for Teams and reads directly from my GitHub repo, which makes managing and deploying code smooth. It also lets the user explore the encoded design space via an interactive version of the t-SNE map shown earlier.
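The actual app code lives in the repo; below is a stripped-down sketch of the core loop. The model path is a placeholder, and the 2D projection stands in for the app's real interactive 3D viewer:

```python
import numpy as np
import spacy
import streamlit as st
import tensorflow as tf

# Hypothetical path: the combined text-to-voxel model sketched earlier.
model = tf.keras.models.load_model("text_to_shape.h5")
nlp = spacy.load("en_core_web_md")

def embed(text, max_len=50):
    # GloVe-embed the description, zero-padded to a fixed length.
    vecs = np.zeros((max_len, 300), dtype=np.float32)
    for i, tok in enumerate(nlp(text)[:max_len]):
        vecs[i] = tok.vector
    return vecs

st.title("Text to 3D Shape")
description = st.text_input("Describe an object", "a wide chair")
if description:
    seq = embed(description)
    voxels = model.predict(seq[None, ...])[0, ..., 0] > 0.5
    # Placeholder: show a 2D projection instead of the app's 3D view.
    st.image(voxels.max(axis=0).astype(float), width=300)
```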

Try it out here: datanexus.xyz

Output models for various simple descriptions, showing how the generated design changes.

Input was: ‘a chair that looks like a lamp’.

Results Discussion

The final model is easy to interact with and shows clear signs of understanding the text input across a wide variety of descriptions. In particular, shape descriptors like ‘wide’ or ‘tall’ are interpreted well and have a reasonable effect on the output. Even some odd descriptions outside the intended scope produce suitable results, such as ‘a chair that looks like a lamp’ (shown above).

Where the text encoder model struggles is with very short or one-word descriptions, because the descriptions in the training set averaged ~13 words. These short descriptions are so vague that the model has a hard time finding an appropriate average across all the possible designs that could match them. The model also occasionally fails to pick up on small details in the description that should lead to a large change in the output, such as whether a mug is full or empty. It seems this information can sometimes be lost as the LSTM moves along the text sequence.

Perhaps the most notable constraint with this approach, however, is that it isn’t able to generate models that are outside the scope of the training data. There were no chairs with ten legs in the training set, so it isn’t able to extrapolate and generate something with ten legs. It’s likely that an entirely different approach would be required to achieve this. However, my overall goal in this project was to encode the design space in an intuitive way to enable rapid exploration of that design space, and for that purpose, generating entirely novel designs was less critical.

Conclusion

Ultimately, this project successfully demonstrated a novel way of applying ML to accelerate the 3D design process for simple objects. Using unsupervised learning to encode a large database of prior knowledge, and then using supervised learning to build a natural language model that can interact with that encoded database, allows the user to sample from and explore a knowledge base quickly and easily. While the models generated by this project were relatively simple, there are many ways to extend the idea to more complex models, as explained below. This approach takes the first steps toward a future where designers work together with ML algorithms, focusing on the creative aspects and iterating rapidly on designs to achieve results at an unprecedented scale.