jkt's blog: Tagged pointers, and saving memory in Trojita

Tagged pointers, and saving memory in Trojita

2014-03-26

One of the improvements which were mentioned in the recent announcement of Trojitá, a fast Qt e-mail client, were substantial memory savings and speed improvements. In the rest of this post, I would like to explain what exactly we have done and how it matters. This is going to be a technical post, so if you are not interested in C++ or software engineering, you might want to skip this article.

Planting Trees

At the core of Trojitá's IMAP implementation is the TreeItem, an abstract class whose basic layout will be familiar to anyone who has worked with a custom QAbstractItemModel reimplementation. In short, the purpose of this class is to serve as a node in the tree of items which represent all the data stored on a remote IMAP server.

The structure is tree-shaped because that's what fits both the QAbstractItemModel's and the IMAP way of working. At the top, there's a list of mailboxes. Children of these mailboxes are either other, nested mailboxes, or lists of messages. Below the lists of messages, one can find individual e-mails, and within these e-mails, individual body parts as per the recursive nature of the MIME encapsulation. (This is what enables messages with pictures attached, e-mail forwarding, and similar stuff. MIME is fun.) This tree of items is used by the QAbstractItemModel for keeping track of what is where, and for issuing the QModelIndex instances which are used by the rest of the application for accessing, requesting and manipulating the data.

When a QModelIndex is used and passed to the IMAP Model, what matters most is its internalPointer(), a void * which, within Trojitá, always points to an instance of some TreeItem subclass. Everything else, like the row() and column(), are actually not important; the pointer itself is enough to determine everything about the index in question.

Each TreeItem has to store a couple of interesting properties. Besides the usual Qt-mandated stuff like pointer to the parent item and a list of children, there are also application-specific items which enable the code to, well, actually do useful things like printing e-mail subjects or downloading mail attachments. For a mailbox, this crucial information might be the mailbox name. For a message, the UID of the message along with a pointer to the mailbox is enough to uniquely identify everything which is needed.

Lazy Loading

Enter the lazy loading. Many people confirm that Trojitá is fast, and plenty of them are not afraid to say that it is blazingly fast. This speed is enabled by the fact that Trojitá will only do the smallest amount of work required to bring the data over the network (or from disk, for that matter). If you open a huge mailbox with half a million messages, perhaps the GMail's "All messages" account, or one's LKML archive, Trojitá will not start loading half a million of subjects. Instead, the in-memory TreeItem nodes are created in a special state "no data has been requested yet". Trojitá still creates half a million items in memory, but these items are rather lightweight and only contain the absolute minimum of data they need for proper operation.

Some of these "empty" nodes are, eventually, consulted and used for item display -- perhaps because a view is attached to this model, and the view wants to show the recent mail to the user. In Qt, this usually happens via the data() method of the QAbstractItemModel, but other methods like rowCount() have a very similar effect. Whenever more data are needed, the state of the tree node changes from the initial "no data have been requested" to "loading stuff", and an asynchronous request for these data is dispatched. An important part of the tale is that the request is indeed completely asynchronous, so you won't see any blocking whatsoever in the GUI. The QTreeView will show an animation while a subtree is expanded, the message viewer might display a spinner, and the mail listing shows greyed-out "Loading..." placeholder instead of the usual message subjects.

After a short while, the data arrive and the tree node is updated with the extracted contents -- be it e-mail subject, or perhaps the attached image of dancing pigs. As the requested data are now here, the status of the tree node is updated from the previous "loading stuff" into "done". At the same time, an appropriate signal, like dataChanged or rowsInserted, is emitted. Requesting the same data again via the classic MVC API will not result in network requests, but everything will be accommodated from the local cache.

What we see now is that there is just a handful of item states, yet the typical layout of the TreeItem looks roughly like this:

enum class FetchingStatus {
    INITIAL_NOTHING_REQUESTED_YET,
    LOADING,
    DONE,
    FAILED
};
class TreeItem {
    TreeItem *m_parent;
    QList<TreeItem*> m_children;
    FetchingStatus m_status;
};

On a 64bit system, this translates to at least three 64bit words being used -- one for the painter to the parent item, one (or much more) for storage of the list of children, and one more for storing the enum FetchingStatus. That's a lot of space, given we have just created half a million of these items.

Tagged Pointers

An interesting property of a modern CPU is that the data structures must be aligned properly. A very common rule is that e.g. a 32bit integer can only start at memory offset which is a multiple of four. In hex, this means that an address, or a pointer value, could end with 0x0, or 0x4, or 0x8, or 0xc. The detailed rules are platform-specific and depend on the exact data structure which we are pointing to, but the important message is that at least some of the low bits in the pointer address are always going to be zero. Perhaps we could encode some information in there?

Turns out this is exactly what pointer tagging is about. Instead of having two members, one TreeItem * and one FetchingStatus, these are squashed into a single pointer-sized value. The CPU can no longer use the pointer value directly, all accesses have to go via an inlined function which simply masks away the lowest bits which do bring a very minor performance hit, but the memory conservation is real.

For a real-world example, see this commit in Trojitá.

Using Memory Only When Needed

Back to our example of a mailbox with 500k messages. Surely a user is only going to see a small subset of them at once, right?

That is indeed the case. We still have to at least reserve space for 500k items for technical reasons, but there is certainly no need to reserve space for heavy stuff like subjects and other headers. Indeed, in Trojitá, we track the From/To/Cc/Bcc headers, the subjects, various kinds of timestamps, other envelope items and similar stuff, and this totals a couple hundred bytes per each message. A couple hundred bytes is not much (pun intended), but "a couple hundred bytes" times "half a million" is a ton of memory.

This got implemented here. One particular benchmark which tests how fast Trojitá resynchronizes a mailbox with 100k of messages showed immediate reduction in memory usage from previous 45 MB to 25 MB. The change, again, does come with a cost; one now has to follow one more pointer redirection, and one has to perform one more dynamic allocation for each message which is actually visible. That, however, proves to be negligible during typical usage.

Measure, Don't Guess

As usual with optimizing, the real results might sometimes be surprising. A careful reader and an experienced Qt programmer might have noticed the QList above and shuddered in horror. In fact, Trojitá now uses QVector in its place, but when I was changing the code, using std::vector sounded like a no-brainer. Who needs the copy-on-write semantics here anyway, so why should I pay its price in this context? These data (list of children of an item) are not copied that often, and copying a contiguous list of pointers is pretty cheap anyway (it surely is dwarfed by dynamic allocation overhead). So we should just stick with std::vector, right?

Well, not really. It turned out that plenty of these lists are empty most of the time. If we are looking at the list of messages in our huge mailbox, chances are that most of these messages were not loaded yet, and therefore the list of children, i.e. something which represents their inner MIME structure, is likely empty. This is where the QVector really shines. Instead of using three pointers per vector, like the GCC's std::vector does, QVector is happy with a single pointer pointing to a shared null instance, something which is empty.

Now, factor of three on an item which is used half a million times, this is something which is going to hurt. That's why Trojitá eventually settled on using QVector for the m_children member. The important lesson here is "don't assume, measure".

Wrapping up

Thanks to these optimization (and a couple more, see the git log), one particular test case now runs ten times faster while simultaneously using 38% less memory -- comparing the v0.4 with v0.3.96. Trojitá was pretty fast even before, but now it really flies. The sources of memory diet were described in today's blog post; the explanation on how the time was cut is something which will have to wait for another day.

Got a comment? Feel free to mail them to me!

Tags: gentoo, kde, qt, trojita.

jkt's blog

Blößmrt