Abstract: Objective To develop multimodal joint cognitive representations for the study of visual cognitive activities in the brain, enhance the classification performance of visual information cognitive representations, predict brain electroencephalogram (EEG) responses from visual image features, and decode visual images from EEG signals. Methods An architecture combining a multimodal variational autoencoder network using the Mixture-of-Products-of-Experts (MoPoE) approach with a style-based generative adversarial network with adaptive discriminator augmentation (StyleGAN2-ADA) was used to facilitate the learning of cognitive representations and the encoding and decoding of EEG signals. This framework not only supported classification tasks but also enabled cross-modal generation of images and EEG data. Results The present study integrated features from different modalities, enhancing the classification accuracy of cognitive representations of visual information. By aligning the feature spaces of the different modalities into a shared latent space, cross-modal generation tasks became possible. The cross-modal generation results for EEG and images, derived from this unified latent space, outperformed the one-way mapping methods used in previous research, which translate from one modality to the other. Conclusion This study effectively integrates and aligns information from multiple modalities, achieving classification performance of joint cognitive representations beyond that of any single modality. Moreover, it demonstrates superior outcomes in cross-modal generation tasks compared with modality-specific unidirectional mappings, which is expected to offer a new line of thought for unified encoding and decoding models of visual cognitive information in the brain.
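For readers unfamiliar with the MoPoE fusion step referenced in the Methods, the following is a minimal sketch (not the authors' code) of how modality-specific Gaussian posteriors from an image encoder and an EEG encoder can be combined into a shared latent space: a product-of-experts posterior is formed for every non-empty subset of modalities, and the joint posterior is a uniform mixture over these subsets. All names, dimensions, and the assumption of Gaussian encoders are illustrative.

```python
# Minimal MoPoE fusion sketch for two modalities (image features and EEG),
# assuming Gaussian encoder outputs. Encoder/decoder networks are omitted.
import itertools
import torch


def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse Gaussian experts: summed precisions, precision-weighted mean."""
    precisions = [torch.exp(-lv) for lv in logvars]  # 1 / sigma^2 per expert
    total_precision = sum(precisions) + eps
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * sum(p * m for p, m in zip(precisions, mus))
    return joint_mu, torch.log(joint_var)


def mopoe_posteriors(modality_params):
    """Return PoE parameters for every non-empty subset of modalities.

    `modality_params` maps a modality name to its (mu, logvar) pair.
    The MoPoE joint posterior is a uniform mixture over these subsets.
    """
    names = list(modality_params)
    subsets = []
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(names, r):
            mus = [modality_params[n][0] for n in combo]
            logvars = [modality_params[n][1] for n in combo]
            subsets.append((combo, product_of_experts(mus, logvars)))
    return subsets


# Illustrative usage with random "encoder outputs" (latent_dim = 64 is an assumption).
latent_dim, batch = 64, 8
params = {
    "image": (torch.randn(batch, latent_dim), torch.zeros(batch, latent_dim)),
    "eeg":   (torch.randn(batch, latent_dim), torch.zeros(batch, latent_dim)),
}
for subset, (mu, logvar) in mopoe_posteriors(params):
    # Reparameterized sample from each subset posterior; a sample drawn from
    # the EEG-only subset can be decoded into an image (and vice versa),
    # which is what enables the cross-modal generation described above.
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    print(subset, z.shape)
```

In this kind of setup, an image decoder (e.g., a StyleGAN2-ADA generator conditioned on the latent code) and an EEG decoder would both be trained against samples from the shared latent space, so that either modality alone can drive generation of the other.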