Making of a Chinese Characters dataset

Peter Burkimsher
Jun 25, 2018 · 10 min read

Last year, I watched the videos of Stanford’s CS231n class, the seminal course in machine learning by Fei-Fei Li, Justin Johnson, and Andrej Karpathy.

The course spends most of its time covering image recognition, with reference to Google Image Search. The homework assignments use standard datasets, such as notMNIST characters. I wasn’t satisfied to just do all the same exercises as everyone else — I wanted to do my own thing. I’m trying to learn Chinese, and built Pingtype to help: it adds spaces between words and shows pinyin, literal, and parallel translations. Why not build a better Chinese OCR?

Here’s the final result: 15 million 28x28 PNG files of 52,835 characters. It’s 9.98 GB compressed; 13.48 GB uncompressed.

The rest of this article describes how I made it, why it took a year, and what I learned along the way.

The scripts I used are available here:

Step 1: Get some fonts

Date: 2017–06–19

Script: Download Chinese Fonts.scpt

There’s a great website called Chinese Font Design that has a large collection of fonts. I wrote a scraper to run wget and download them all. After about 5 hours, I had a 15GB folder full of RAR, ZIP, and 7z files.
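The bulk download itself needs nothing fancier than wget. Here is a minimal sketch of that part, assuming the scraper has already written the archive links into a file called fonturls.txt (that file name is illustrative, not from the original script):

# Download every archive listed in fonturls.txt into a Fonts folder,
# skipping anything that has already been fetched.
wget --no-clobber --directory-prefix=Fonts --input-file=fonturls.txt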

Step 2: Extract the archives

Date: 2017–06–20

Archive Utility doesn’t like different character sets, and will give names like “Œƒ∂¶≥¨∫⁄.ttf”. Instead, I used The Unarchiver and specified the character set (usually BIG5 or GB2312). That gives the correct Chinese filenames, such as “恅隋閉窪.ttf”. The extracted data takes up another 15 GB.
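The Unarchiver also ships a command-line tool, unar, which can be told which filename encoding to assume. A rough sketch (the archive name is a placeholder, and the exact encoding names unar accepts may need a little experimentation):

# Extract with BIG5 filenames; use GB2312 for mainland-Chinese archives.
unar -encoding BIG5 -output-directory Extracted some-font-pack.zip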

Step 3: Convert to TTF

Date: 2017–06–21

Some of the fonts were TTF, some were TTC, and others were OTF. I used fontforge to convert everything to TTF.
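FontForge’s scripting interface can do the conversion without opening the GUI. Something like this handles a single OTF (a sketch only; TTC collections contain several faces, so those need a face picked out first):

# Convert one font to TTF; $1 is the input path, $2 the output path.
fontforge -lang=ff -c 'Open($1); Generate($2)' input.otf output.ttf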

I then had 11.82 GB of 1304 TTF files. That seemed hard to handle, so I split it into 6 subfolders of ~220 TTF files or 2.16GB each.

Step 4: Extract the characters

Date: 2017–06–21

Which characters do I need? This is made more complicated because of radicals and two-byte characters. I used the Pingtype dictionary to get 75617 characters that are actually in my data sources, making chinese75617.txt.
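The idea is simply to keep every distinct character that appears in the source texts. As a rough sketch (corpus.txt stands in for the Pingtype data sources, and the result still contains Latin letters and punctuation that would need filtering out):

# Split a UTF-8 corpus into single characters, one per line, and deduplicate.
LC_ALL=en_US.UTF-8 grep -o . corpus.txt | sort -u > chinese75617.txt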

Then I used hb-view from HarfBuzz to render a PNG for each of those characters. I wrote a script that reads each character from the file and exports it into a folder for that font.
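The per-font loop looks roughly like this, assuming chinese75617.txt has one character per line and $font points at a TTF file (the variable and folder names are illustrative):

# Render every character in the list as a PNG into a folder named after the font.
fontName=$(basename "$font" .ttf)
mkdir -p "Glyphs/$fontName"
while IFS= read -r char; do
  hb-view --output-format=png --font-size=256 \
    --output-file="Glyphs/$fontName/$char.png" "$font" "$char"
done < chinese75617.txt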

My internal SSD was full, so I used the office MacBook Pro 2012 and my external 2 TB drive. Unfortunately that computer only has USB 2, so writing files was slow. It took 3 months.

Extracting continued until 2017–09–28.

The PNGs use over 300 MB per font, for a total of 450 GB.

Opening a font folder in Finder takes 19 seconds, even when connected over USB 3 to my regular laptop, a MacBook Pro Retina 2014. I needed to manually check each folder to remove blank glyphs. This was too slow.

Step 5: Buy a bigger SSD

I realised that I needed better hardware if I was ever going to sort out this data. My laptop’s AppleCare was coming up for expiry in March 2018, so I decided to wait. Just before the warranty ran out, I sent it away for a new battery (error PPT004) and screen replacement (a few stuck pixels), and got a Chinese keyboard as a bonus because the battery is glued onto the top case. This is helpful when local friends need to type.

The Retina needed to be sent away for a week, so I decided to buy a spare MacBook Air 2013, with a 256GB SSD and a compatible slot to take my Retina’s hand-me-down. That cost half a month’s salary, but I need a computer for my job.

I also ordered a 2 TB SSD for my Retina from penguinestore on eBay (elliotxp on Taobao), which cost over a month’s salary.

With all my shiny new gear, I started copying the files to my internal SSD. Because the data is so fragmented into lots of small files, copying each folder of Glyphs1-Glyphs6 took all day. Or at least that was the plan.

While copying Glyphs3, I went off to relieve myself, and when I came back my computer was at the login screen. The mouse wasn’t moving. I tried to SSH in from my spare laptop, but it couldn’t connect. After waiting 10 minutes, I decided to power-cycle. It booted to a flashing folder icon. The SSD had fried.

I’d owned it for less than a month and hadn’t written eBay feedback yet, so I sent that SSD back to Shanghai by FedEx the next day and got a refund. The seller couldn’t fix it either. There’s no more stock, though, so I’m back on my old 512GB drive. Having already erased my old SSD, I had plenty of free space, so I decided to finish this project before restoring the backup of my files.

Step 6: Remove duplicates

For each font, most of the 75,617 PNG files are placeholders: usually 40–50k of them show the same filler square, and only the remaining ~30k are useful glyphs.

Those placeholder squares are not all equal, though! meiryo uses a 288x416 image, Microsoft JhengHei uses a 126x373 image. And they aren’t truly blank: the font draws a placeholder picture of a square wherever a glyph is missing.

Instead of deleting the many duplicate placeholder files, it’s faster to move the much smaller number of non-duplicates into a NotDuplicates subfolder.

moveDuplicateGlyphs.sh is a script that uses find, stat, and uniq to get the most common file size in the folder, and then moves all the files with a different size to a subfolder. I tried comparing md5 hashes, but it was too slow. Matching by file size alone might lose a few genuine characters, but that’s an acceptable trade-off for the speed. The commands are concise and useful enough that I’ll copy them here.

# Find the most common file size in the folder; that's the placeholder glyph.
mostCommonSize=$(find "$thisFolder" -maxdepth 1 -type f -exec stat -f "%z" {} + | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

# Move every file with a different size (the real glyphs) into NotDuplicates/.
mkdir -p "$thisFolder/NotDuplicates"
find "$thisFolder" -maxdepth 1 -type f ! -size "$mostCommonSize"c -exec mv '{}' "$thisFolder/NotDuplicates/" \;

The data is now only 140 GB.

Step 7: Manually clean up other duplicates

Date: 2018–05–28

Sometimes there are other duplicate glyphs, such as “?” icons as well as blank glyphs. I could re-run moveDuplicateGlyphs.sh, but I risk losing good data, so I decided to do this manually.

I opened each font folder, sorted by size, and looked for files that shared the same picture and size; when I found duplicates, I selected them and deleted them. Moving 50,000 items to the trash takes several minutes. Emptying the trash also struggles once there are more than about 80,000 items in it (it works, it just takes several hours instead of minutes). To work around this, I recommend relaunching Finder after every 2 or 3 fonts.

The manual pass also helped me find 33 fonts where HarfBuzz hadn’t cropped the output correctly, so the glyphs were cut in half. If this dataset is useful, please tell me, and I could try to do further work to extract the data from these fonts.

From the fonts that I’m keeping, there are 15,867,622 glyph files.

Step 8: Move glyphs to character folders

Date: 2018–05–30

Instead of 6 folders Glyphs1 to Glyphs6, with subfolders for each font and filenames representing the character, it’s more useful to have one folder for each character, with filenames representing the font.

I used ls and a few regex find-replace commands to make 6 scripts to move the images to character folders.

moveNotDuplicates1ToCharacters.sh to moveNotDuplicates6ToCharacters.sh
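The generated scripts boil down to a rename-and-move per file. Done directly in bash, it would look something like this (assuming the kept glyphs sit at Glyphs1/<font>/NotDuplicates/<character>.png, which is the layout the earlier steps suggest):

# Reorganise Glyphs1/<font>/NotDuplicates/<char>.png into Characters/<char>/<font>.png
for fontDir in Glyphs1/*/NotDuplicates; do
  font=$(basename "$(dirname "$fontDir")")
  for png in "$fontDir"/*.png; do
    char=$(basename "$png" .png)
    mkdir -p "Characters/$char"
    mv "$png" "Characters/$char/$font.png"
  done
done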

Some folders were empty, meaning that none of the fonts have those characters. Only 52834 characters are available, not 75617.

Step 9: Trim whitespace

Date: 2018–06–04 to 2018–06–08

Many of the PNGs have a lot of whitespace around the character. The image dimensions are often the same for every character in a font, but the goal is to standardise every glyph to 28x28 pixels, so it’s better to trim the whitespace first and keep the character itself as large as possible before scaling down.

trimPngs.sh is a script that uses imagemagick’s convert command to trim.
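The core of it is ImageMagick’s -trim operator, applied in place. A sketch, assuming the character folders live under Characters/ (the small -fuzz is an extra to absorb antialiased near-white edges):

# Trim surrounding whitespace from every glyph, modifying the files in place.
find Characters -type f -name '*.png' -exec mogrify -fuzz 5% -trim +repage {} +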

Running trimPngs.sh took 4 days on the spare laptop, and I needed to delete processed glyphs halfway through, because the 256GB SSD was getting full.

The data is now 133 GB.

Step 10: Pad whitespace to make squares

Date: 2018–06–11 to 2018–06–16

If the character is already square, that’s perfect. If not, we should add whitespace to make it square. To do so, I wrote another script.

padPngs.sh uses ImageMagick to find out whether the width or the height is larger, and pads the smaller dimension to match.
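In ImageMagick terms that is a -gravity center -extent to the larger of the two dimensions. A per-folder sketch (slow for 15 million files, but it shows the operation; it assumes black-on-white glyphs, which is what hb-view produces by default, and 中 is just an example character folder):

# Pad each glyph with white so that it becomes square, keeping the character centred.
for png in Characters/中/*.png; do
  w=$(identify -format '%w' "$png")
  h=$(identify -format '%h' "$png")
  side=$(( w > h ? w : h ))
  convert "$png" -gravity center -background white -extent "${side}x${side}" "$png"
done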

padPngs.sh took 5 days to run, and also needed space freeing halfway.

The data is another 133 GB.

Step 11: Scale down to 28x28

Date: 2018–06–19 to 2018–06–22.

Uploading 133 GB isn’t easy. Even Mega only offers 50 GB as a free tier, and this whole project is run on zero budget. I also don’t need the full-size glyphs for machine learning, only 28x28 PNGs like the notMNIST dataset. It took 3 days to run this step.

resizePngs.sh uses imagemagick to scale the PNGs.
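Since every image is square by now, a plain -resize is enough; the ! simply forces 28x28 even if a stray non-square file slipped through. A sketch, again assuming the Characters/ layout:

# Scale every glyph down to exactly 28x28 pixels, in place.
find Characters -type f -name '*.png' -exec mogrify -resize '28x28!' {} +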

The data is now a mere 16.32 GB.

Step 12: Compress and upload

Finally, with a quick tar czf command, I had a dataset!
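For completeness, that command is just the usual one-liner (the folder and archive names here are placeholders):

# Compress the character folders into a single gzipped tarball for upload.
tar czf chinese-glyphs-28x28.tar.gz Characters/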

MacOS says it’s 9.98 GB, although Google Drive thinks it’s only 9.2 GB. There are 15,867,622 glyph files.

The total size of all steps in this process:

15 GB (Downloaded Archives)
15 GB (Extracted font files)
11.82 GB (TTFs)
450 GB (Glyphs)
140 GB (NotDuplicates)
133 GB (Trim)
133 GB (Pad)
16.32 GB (Scale)
9.98 GB (Archive to upload)

Total: 924.12 GB.

Coding, manual checking time: 3 days.
Script runtime: 3 months, 19 days.

Here are a couple of graphs about the data to give you an idea of the scale.

Most fonts have 20,000 or 8,000 characters
Most characters have over 400 glyphs.

Although I exported the fonts, I didn’t install them all on my computer (disk space!), so the characters on the right side of the graph can’t be displayed.

Trying to do machine learning on this data will be interesting, because there are many categories (50k), yet relatively few examples for each category (usually 400–1200 PNG files). Data augmentation could make more examples, but that will require even more disk space.
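To give an idea of why the disk cost grows: each augmented copy is a new PNG. A rough sketch of one simple augmentation, small rotations of a single glyph with ImageMagick (the file names are placeholders):

# Write four slightly rotated copies of one 28x28 glyph, filling the corners
# with white and cropping back to 28x28 around the centre.
for angle in -10 -5 5 10; do
  convert some-font.png -background white -rotate "$angle" \
    -gravity center -extent 28x28 "some-font_rot${angle}.png"
done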

Conclusion

Big data is not only for people with supercomputers. Amateur users can build datasets too, with enough patience.

Would I do it again? Probably not. Losing the 2 TB internal SSD was painful, and I think it was related to the fragmentation of having so many small files.

I hope the dataset is useful to others. Apple are using machine learning for their Chinese handwriting recognition, but they cover only 30k characters, compared with the 52k characters in this dataset.

Please let me know if you’re using this data, or if you’re learning Chinese — I’d like to talk more about the ways I’ve tried and failed to study. Music and comics have been much more helpful to me than textbooks or flashcards; the “spaced repetition” learning method is called Shuffle on my iPod. I’m also starting to learn 台語/Minnan/Taiwanese/Hokkien.

Soon I plan to delete all the data from the partial steps in order to free up space on my hard drive. I’ll keep the original fonts and the final dataset, but the duplicates/trim/pad are not important to me. Please tell me if you think I should save that data.

My next step for this project will be to use TensorFlow to train a classifier and add my own handwriting recognition drawing area to Pingtype. Then I hope to experiment with text detection and OCR until I can read text from videos.

About the author

Peter Burkimsher has been in Kaohsiung, Taiwan for 4 years, working for InfoFab and OSE, a microSD card manufacturer.

Email, Github, LinkedIn

I like coding and making graphs about international topics such as localisation and Facebook Likes. I studied Electronic Systems Engineering, but have had more work experience in software. I’m currently looking for a job in New Zealand or Canada, perhaps Australia, so I can immigrate and settle down. Please contact me if your company has any open positions!

