Convert a Google doc to a doc site

Ming
4 min read · Jun 12, 2022

I have a huge Google doc:

Unwieldy! I want to make it as approachable as documentation sites like this:

To be specific, I want:

  • a full-text search box,
  • a sidebar for navigating around the Table of Contents, and
  • each chapter on its own page.

How do I do that?

We exploit the fact that an ePub file is essentially a zipped static website.

First, I download the Google doc as a Microsoft Word document:

Then, I convert the docx file into an ePub. (Note: Don’t download the ePub directly from Google Docs; it doesn’t split the document by chapter.) I use pandoc, a universal document converter, for this:

# Replace this with the path to the file you downloaded.
FILE="The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll.docx"
# Convert the docx file to epub:
pandoc -s "$FILE" -t epub --epub-chapter-level=2 -o all.epub
# epub is just a zipped version of a static website:
unzip -q "all.epub" -d . && rm "all.epub"

Now, I have this nice structure in the EPUB/ directory:

This is essentially a website already. If you fire up a simple HTTP server and navigate to the nav.xhtml file, it looks like this:

Not the most aesthetic website, don’t you agree? Let’s make that a Jekyll website:

jekyll new EPUB --force
cd EPUB
# Install a good theme:
echo 'gem "just-the-docs"' >> Gemfile
# Tell Jekyll to use this theme:
sed -i '' 's/^theme: minima/theme: "just-the-docs"/g' _config.yml
# Install the packages:
bundle install

(We use --force here so that we can write Jekyll files to this existing folder.)
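
A note on portability: sed -i '' is the BSD/macOS spelling of in-place editing. GNU sed, the default on most Linux systems, wants any backup suffix attached directly to -i, so the sed commands in this post may fail there as written. Here is a minimal sketch of a wrapper that should behave the same on both. The helper name sed_inplace is hypothetical, and the _config.yml below is a throwaway file created just for the demonstration:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow above): edit files in place
# with a backup suffix, which both BSD and GNU sed accept, then drop the backup.
sed_inplace() {
  local script=$1 file
  shift
  for file in "$@"; do
    sed -i.bak "$script" "$file" && rm -f "$file.bak"
  done
}

# Demonstration on a scratch file containing the line that matters.
demo_dir=$(mktemp -d)
printf 'title: My site\ntheme: minima\n' > "$demo_dir/_config.yml"
sed_inplace 's/^theme: minima/theme: "just-the-docs"/' "$demo_dir/_config.yml"
theme_line=$(grep '^theme:' "$demo_dir/_config.yml")
echo "$theme_line"
```

With a helper like this, each sed -i '' '...' files call could be written as sed_inplace '...' files instead.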

Next, we convert xhtml-formatted chapters to Markdown files, so that Jekyll can recognize them. Let’s fire up pandoc again:

for i in text/*.xhtml
do
  # Quote the paths so that file names with spaces survive.
  pandoc -s "${i}" -o "${i%.xhtml}.md"
done
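
The ${i%.xhtml} expansion strips a trailing .xhtml from the variable, which is how each output name is derived. A quick way to preview the mapping before converting anything, using made-up file names in place of the real chapter files:

```shell
# Made-up file names standing in for the chapters under text/.
for i in text/ch001.xhtml text/ch002.xhtml; do
  echo "$i -> ${i%.xhtml}.md"
done
# After the loop, $i still holds the last name that was processed.
last_output=${i%.xhtml}.md
```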

Now, if you serve the site (using bundle exec jekyll serve), you can see that the index page correctly picks up all the chapter files, although each chapter page itself still looks bare-bones:

Let’s make each chapter pretty:

sed -i '' 's/^generator: pandoc/layout: default/g' text/*.md

Serve again, and you’ll see:

Cool, but what are those ::: {#... }? Let’s remove some Pandoc artifacts:

# Remove bookmarks:
sed -i '' 's/{#.*}$//g' text/*.md
# Remove lines that start with `:::`:
sed -i '' '/^:::/d' text/*.md
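
To see what these two expressions do without touching real files, they can be run as a filter over a small sample. The sample text below is made up to resemble a converted chapter, and the bookmark pattern is a slightly tightened variation (an assumption, not the exact expression above) so that it also drops the space before the {#...}:

```shell
# Made-up sample resembling a converted chapter.
sample='::: {#chapter-i .section .level2}
## Chapter I {#chapter-i}

Alice was beginning to get very tired.
:::'

# Strip a trailing {#...} bookmark (and the spaces before it),
# then delete every line that starts with ":::".
cleaned=$(printf '%s\n' "$sample" | sed -e 's/ *{#[^}]*}$//' -e '/^:::/d')
printf '%s\n' "$cleaned"
```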

Also, chapter links are still just numbers. Let’s replace them with the actual titles of the chapters. For this task, we use ripgrep, because it conveniently lists file paths alongside the matched lines (when piped). Remember that we used --epub-chapter-level=2 when converting to ePub, so we look for the first line that starts with two hashes in each file:

cd text
rg "^##" -m 1 *.md | while IFS= read -r line; do
  # Each line looks like "<file>:## <title>".
  file=${line%:## *}
  title=${line#*:## }
  # Replace the auto-generated title with the title found.
  sed -i '' "s/^title: .*\.xhtml$/title: ${title}/g" "$file"
  # (Optional) Rename the file:
  mv "$file" "$title.md"
done
cd ..
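
If ripgrep isn’t installed, plain grep with -H (always print the file name) and -m 1 (stop after the first match in each file) produces the same file:line output. Below is a rough sketch of the same idea on two made-up chapter files in a scratch directory. The safe_title escaping step is an extra precaution I'm assuming is wanted, so that a title containing /, &, or \ doesn’t break the sed replacement, and the rename is left out:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Two made-up chapters carrying auto-generated titles.
printf -- '---\ntitle: ch001.xhtml\nlayout: default\n---\n\n## Down the Rabbit-Hole\n\nText.\n' > ch001.md
printf -- '---\ntitle: ch002.xhtml\nlayout: default\n---\n\n## Tea & Cake\n\nText.\n' > ch002.md

grep -H -m 1 '^## ' ch*.md | while IFS= read -r line; do
  file=${line%%:## *}
  title=${line#*:## }
  # Escape characters that are special in a sed replacement: \ / &
  safe_title=$(printf '%s\n' "$title" | sed 's/[\/&]/\\&/g')
  # -i.bak works with both BSD and GNU sed; remove the backup afterwards.
  sed -i.bak "s/^title: .*\.xhtml\$/title: ${safe_title}/" "$file" && rm -f "$file.bak"
done

title_1=$(grep '^title:' ch001.md)
title_2=$(grep '^title:' ch002.md)
echo "$title_1"
echo "$title_2"
```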

After removing some empty files (using find text/ -name "*.md" -type f -size -3k -delete served my purpose, but your mileage may vary), my sidebar looks perfect!
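
Since -delete is irreversible, it may be worth previewing the matches first by swapping it for -print. Keep in mind that the exact cutoff for -size -3k can differ between find implementations, so check the list before deleting. A small sketch with made-up files in a scratch directory:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/text"
# A near-empty stub (a few bytes) and a "real" chapter (about 5 KB).
printf -- '---\ntitle: stub\n---\n' > "$scratch/text/stub.md"
head -c 5000 /dev/zero | tr '\0' 'x' > "$scratch/text/chapter.md"

# Dry run: list what would be deleted, without deleting anything.
matches=$(find "$scratch/text" -name "*.md" -type f -size -3k -print)
echo "$matches"
```

Once the list looks right, replacing -print with -delete removes exactly those files.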

Even the search box works out of the box:

With some extra sed-fu, you can also ensure internal links, images, and anchor links work. You get the idea.
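
As one example of that kind of fix: cross-chapter links in the converted Markdown are likely to still point at the old .xhtml files, while Jekyll publishes each Markdown page as .html. Here is a hedged sketch of rewriting those link targets, assuming the chapter files kept their original names (if you renamed them to their titles, the targets would need mapping as well) and using a made-up sample line:

```shell
sample_line='See [the next chapter](ch002.xhtml#the-pool-of-tears) and [the notes](ch010.xhtml).'

# Rewrite "(name.xhtml" to "(name.html" when it is followed by "#" or ")".
fixed_line=$(printf '%s\n' "$sample_line" | sed 's/(\([^)]*\)\.xhtml\([#)]\)/(\1.html\2/g')
echo "$fixed_line"
```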

The only remaining task is to push this folder to a GitHub repo, serve it via GitHub pages, and share with your friends/colleagues what a beautiful documentation website you’ve just built.

Isn’t it nice to have most of the work automated, rather than manually copying & pasting the content of every chapter to a separate Adobe Dreamweaver page?
