Many Chinese characters (I believe around 5%) have a \xad byte in their GB2312 encoding, so removing every \xad breaks those characters. Imagine seeing one broken ideograph in every sentence :(
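
To make this concrete, here is a rough Python sketch of the problem. It assumes Python's built-in `gb2312` codec, and the helper names `find_hanzi_with_byte` and `strip_byte_and_decode` are made up for this illustration. It scans the GB2312 hanzi rows for characters whose encoding contains 0xAD, then shows what happens to one of them when that byte is stripped from the raw bytes.

```python
def find_hanzi_with_byte(target=0xAD, encoding="gb2312"):
    """Return (hits, total): hanzi whose encoded bytes contain `target`,
    and the number of hanzi scanned (GB2312 rows 16-87)."""
    hits, total = [], 0
    for lead in range(0xB0, 0xF8):        # lead bytes of the hanzi rows
        for trail in range(0xA1, 0xFF):   # valid trail bytes
            raw = bytes([lead, trail])
            try:
                ch = raw.decode(encoding)
            except UnicodeDecodeError:
                continue                  # unassigned code point
            total += 1
            if target in raw:
                hits.append(ch)
    return hits, total


def strip_byte_and_decode(text, target=b"\xad", encoding="gb2312"):
    """Encode `text`, drop every `target` byte, and decode what is left."""
    raw = text.encode(encoding)
    return raw.replace(target, b"").decode(encoding, errors="replace")


if __name__ == "__main__":
    hits, total = find_hanzi_with_byte()
    print(f"{len(hits)} of {total} hanzi contain 0xAD ({len(hits) / total:.2%})")
    sample = hits[0] + "好"               # an affected character plus a safe one
    print(repr(sample), "->", repr(strip_byte_and_decode(sample)))
```

Running it prints the actual share of affected characters, so the estimate above can be checked, and shows that losing one byte also shifts the pairing of every byte after it until the decoder resynchronizes.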