The Chinese government has released a dataset for training language models that reflect its political views. It is another example of Beijing's efforts to control generative AI.
The Artificial Intelligence Security Governance Professional Committee of the Cyberspace Administration of China (CAC) announced a public dataset of roughly 100 million data points totaling 50 billion tokens. The dataset has been officially approved by the government and is aligned with its policies.
For comparison: the filtered version of the Common Crawl dataset used to train GPT-3 contains approximately 410 billion tokens, and Meta's Llama 2 models were pre-trained on 2 trillion tokens.
The CAC dataset is therefore relatively small and on its own probably insufficient to train a large, capable language model. But it could form part of a pre-training data mix or be used to align an LLM during fine-tuning.
Those interested can download the dataset from the CAC website after registration and authentication.
The CCP’s struggle for control where control is difficult
The dataset announcement is noteworthy because it shows the Chinese government continuing its attempt to reconcile the language and image capabilities of large AI models, along with their inherent unpredictability, with its strict political discourse.
China released guidelines for generative AI services this past summer. Among other requirements, organizations offering AI systems to the public must undergo a safety review that checks for alignment with the CCP's political positions. Generative AI services must adhere to the "core values of socialism" and must not be used to attempt to overthrow state power or the socialist system.
Baidu's ERNIE Bot, the Chinese counterpart to ChatGPT, shows what this looks like in practice: in a recent CNN test, ERNIE refused to answer questions about the Tiananmen massacre or Xi Jinping's lifting of presidential term limits. After several such inquiries, CNN's account was suspended.
Baidu's image AI had previously blocked image generation for politically sensitive prompts such as "Tiananmen Square," the site of the massacre.