Epubs and html - Richard's Blog

Motivation

I knew that the epub format is actually a zip file containing a whole bunch of html files. You can open the individual html files in a browser if you want, and if you can find the contents page, you can jump to any page in the book via the table of contents (toc) links. I actually used this strategy to read some books for a while, but having to hunt for the toc page each time proved annoying.

There are plugins that let you open and read epubs directly in the browser. However, these are usually javascript code that runs an epub application inside the browser. Though this is convenient, the browser itself is already a highly performant html rendering machine, so I can just skip the javascript stuff and open the html files directly.

The solution to this issue is generating a single-page html from an epub.

Figure 1: Look at the simplicity of a plain html book!

Using Calibre

My initial solution was just to use Calibre. Calibre has a “Convert Books” function where you set epub as the input format and htmlz as the output format. This creates a .htmlz file, which you can unzip to find a single-page html inside. This worked surprisingly well on many of my epubs.

If you don’t want to use the GUI, you can just use the following:

$ ebook-convert input.epub output.htmlz
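Since the .htmlz output is just a zip archive, the unzip step can also be scripted. Here is a minimal sketch using only the Python standard library. The function name `extract_htmlz` and the output folder are my own choices, and I'm assuming the archive holds the html file next to its stylesheet; the exact file names inside may differ.

```python
import zipfile
from pathlib import Path


def extract_htmlz(htmlz_path, out_dir):
    """Unzip a .htmlz archive and return the html files found inside.

    Hypothetical helper: a .htmlz file is treated as a plain zip archive.
    """
    out_dir = Path(out_dir)
    with zipfile.ZipFile(htmlz_path) as archive:
        # Extract everything (html, stylesheet, images) into out_dir
        archive.extractall(out_dir)
    # Collect whatever html files ended up in the output folder
    return sorted(out_dir.rglob("*.html"))
```

Calling `extract_htmlz("output.htmlz", "output")` should leave the single-page html in the `output` folder, ready to open in the browser.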

However, there was one epub where the resulting single-page html did not have working links. I tried to remedy the issue by reading the Calibre code, and wow, is it complicated. Take a look at the entry main function for the ebook-convert python program. Because of the plugin system, there is a lot of indirection, and this requires more analysis than a weekend of reading. The system is actually pretty impressive. You can read more of the documentation here.

Using BeautifulSoup

But I don’t need the flexibility to convert between any two formats. My goal is simple: obtain a single-page html. And sometimes reducing your requirements can simplify the required code greatly.

The problem isn’t merging the html files, as it is trivial to do in the terminal.

$ cat a.txt b.txt > c.txt
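Before anything can be merged, the html files have to come out of the epub in a sensible order. Here is a minimal sketch that lists them straight from the zip. The function name `list_html_files` is made up, and sorting by file name is an assumption that happens to suit names like part0001.html; the real reading order of an epub is defined by the spine in its package (.opf) file, which this sketch ignores.

```python
import zipfile


def list_html_files(epub_path):
    """Return the names of the html/xhtml files inside an epub, sorted by name.

    Assumption: file names sort in reading order (e.g. part0001.html,
    part0002.html). The epub's spine is the authoritative order.
    """
    with zipfile.ZipFile(epub_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        )
```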

The problem is that the hrefs originally point to the individual html files, so if you just concatenate the files without adjusting the links, clicking on a link leads nowhere. The hrefs have the following syntax: “part00XX.html#some-random-string”. That means that when you click a link, the browser does not go to the section with the corresponding id; it tries to open a non-existent html document instead.

The solution is just to remove everything before the ‘#’:

# Strip the .html file name from each href and keep only the '#id' fragment,
# so the link points to the correct section within the merged document
for link in output_doc.find_all('a'):
    href = link.get('href')
    # In my epubs the internal links started with either 'text' or 'part'
    if href and href.startswith(('text', 'part')):
        index = href.find('#')
        # Leave links without a '#' untouched (find returns -1 in that case)
        if index != -1:
            link['href'] = href[index:]
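The rewrite rule itself can be pulled out into a small pure function, which makes it easy to try on a few hrefs without loading a whole book. This is only a sketch: the name `to_fragment` and the `prefixes` parameter are hypothetical, and the 'text' and 'part' prefixes are simply the ones that showed up in my epubs.

```python
def to_fragment(href, prefixes=("text", "part")):
    """Turn 'part0003.html#abc' into '#abc'; leave every other href untouched."""
    # Only touch internal links that point at one of the epub's html files
    if not href or not href.startswith(prefixes):
        return href
    _, sep, fragment = href.partition("#")
    if not sep:
        # No '#': there is no id to jump to inside the merged page
        return href
    return "#" + fragment
```

Inside the loop, this would be used as `link['href'] = to_fragment(href)`.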

In each of the html files, the format was as such:

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en-US" xml:lang="en-US">
  <head>
    <title>...</title>
    <link href="../stylesheet.css" rel="stylesheet" type="text/css"/>
  </head>
  <body id="uMbqldEtLexUca2edkuNPBH" class="calibre">
  <div class="some-attribute">some content</div>
  </body>
</html>

Notice that the id is by itself (there is no prepended html file string). This is good. That means that if I combined the html files by their body, then I retain the original id’s, which I can refer to using the hrefs. However trying the following in BeautifulSoup did not work too well:

for file in html_files: # html_files contain the path names for every html file in directory
    with open(file, 'r') as html_file:
        soup = BeautifulSoup(html_file, "html.parser")
        body_content = soup.find('body')
        output_doc.append(body_content.extract())

The resulting output file had only one body tag. Apparently it merges the content of every body into a single body tag. That means the ids of all the individual body tags were lost.

A hack to get around this is to hide the body in a div:

for file in html_files:
    with open(file, 'r') as html_file:
        soup = BeautifulSoup(html_file, "html.parser")
        body_content = soup.find('body')
        # the trick to preserve the id is to hide the body in a div
        div = output_doc.new_tag('div', id=body_content.get('id'))
        div.append(body_content.extract())
        output_doc.append(div)

Doing so preserves the body tags. Ids are supposed to be unique within an html document, so reusing the same id on both the div and the body is technically invalid, but it gets the job done. Now my document has defined ids.
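If you are curious how many ids end up repeated in the merged file, a quick check is easy to write. Here is a minimal sketch using the standard library's `html.parser` instead of BeautifulSoup; the names `IdCollector` and `duplicate_ids` are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Count how often each id attribute appears in a document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html_text):
    """Return the sorted list of ids that occur more than once."""
    collector = IdCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(i for i, count in collector.ids.items() if count > 1)
```

Running `duplicate_ids` on the merged html text returns the ids shared by a wrapper div and its body, along with any ids that genuinely collide between chapters.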

Now the document should contain links that are defined and functional. The code required to do all the above and more can be found in this repo. The nbviewer view of the notebook can be found here.

Styling

Epubs generally include a style sheet, so it’s useful to add the same stylesheet to your html document.

# Create a link tag for the stylesheet
link_tag = output_doc.new_tag('link', rel='stylesheet', type='text/css', href='style.css')
if output_doc.head:
    # Reuse the existing head tag instead of adding a second one
    output_doc.head.append(link_tag)
else:
    # Create a new head tag and insert it at the top of the HTML document
    head_tag = output_doc.new_tag('head')
    head_tag.append(link_tag)
    output_doc.insert(0, head_tag)

In your stylesheet, you can set the width of your document to 700px so that it doesn’t span the whole screen. I also like to center it by setting auto margins. As for which element to apply the css to, I found that applying it to “body” or “.calibre” works for Calibre-exported epubs.

.calibre {
  display: block;
  font-size: 1em;
  margin: auto;
  width: 700px;
  padding-left: 0;
  padding-right: 0;
  text-align: justify;
}

If you want to be even more fancy, you can set the background color to something like sepia:

body {
    background-color: #e4ded3;
}

Or even dark mode:

body {
  background-color: #000000; /* Black background color */
  color: #FFFFFF; /* White text color */
}
/* Normal link color */
a {
  color: #1E90FF; /* Dodger blue */
}
/* Visited link color */
a:visited {
  color: #8A2BE2; /* Blue violet */
}
/* Hover link color */
a:hover {
  color: #FFD700; /* Gold */
}
/* Active link color */
a:active {
  color: #FFA500; /* Orange */
}

The result:

Concluding thoughts

I really enjoy reading a lot more now that I have a single page html. It is really convenient to have the book in a tab that restores when you re-open the browser. I find myself reading online pages really quickly, so having a book on standby in one of the tabs means that I can just jump back to reading a book just as quickly as reading social media posts.

Also, I really love html. It is probably one of the best portable text formats there is. Just some simple lines of css and my reading experience is enhanced. If you haven’t tried out this form of reading yet, I highly encourage you to give it a shot!