One of the improvements which were mentioned in the recent announcement of Trojitá, a fast Qt e-mail client, were substantial memory savings and speed
improvements. In the rest of this post, I would like to explain what exactly we have done and how it matters. This is
going to be a technical post, so if you are not interested in C++ or software engineering, you might want to skip this
article.
Planting Trees
At the core of Trojitá's IMAP implementation is the TreeItem
, an abstract class whose basic layout will
be familiar to anyone who has worked with a custom QAbstractItemModel
reimplementation. In short, the
purpose of this class is to serve as a node in the tree of items which represent all the data stored on a remote IMAP
server.
The structure is tree-shaped because that's what fits both the QAbstractItemModel
's and the IMAP way of
working. At the top, there's a list of mailboxes. Children of these mailboxes are either other, nested mailboxes, or
lists of messages. Below the lists of messages, one can find individual e-mails, and within these e-mails, individual
body parts as per the recursive nature of the MIME encapsulation. (This is what enables messages with pictures attached,
e-mail forwarding, and similar stuff. MIME is fun.) This tree of items is used by the QAbstractItemModel
for keeping track of what is where, and for issuing the QModelIndex
instances which are used by the rest of
the application for accessing, requesting and manipulating the data.
When a QModelIndex
is used and passed to the IMAP Model
, what matters most is its
internalPointer()
, a void *
which, within Trojitá, always points to an instance of some
TreeItem
subclass. Everything else, like the row()
and column()
, are actually not
important; the pointer itself is enough to determine everything about the index in question.
Each TreeItem
has to store a couple of interesting properties. Besides the usual Qt-mandated stuff like
pointer to the parent item and a list of children, there are also application-specific items which enable the code to,
well, actually do useful things like printing e-mail subjects or downloading mail attachments. For a mailbox, this
crucial information might be the mailbox name. For a message, the UID of the message along with a pointer
to the mailbox is enough to uniquely identify everything which is needed.
Lazy Loading
Enter the lazy loading. Many people confirm that Trojitá is fast, and plenty of them are not afraid to say
that it is blazingly fast. This speed is enabled by the fact that Trojitá will only do the smallest amount of
work required to bring the data over the network (or from disk, for that matter). If you open a huge mailbox with half a
million messages, perhaps the GMail's "All messages" account, or one's LKML archive, Trojitá will not start
loading half a million of subjects. Instead, the in-memory TreeItem
nodes are created in a special state
"no data has been requested yet". Trojitá still creates half a million items in memory, but these items are rather
lightweight and only contain the absolute minimum of data they need for proper operation.
Some of these "empty" nodes are, eventually, consulted and used for item display -- perhaps because a view is
attached to this model, and the view wants to show the recent mail to the user. In Qt, this usually happens via the
data()
method of the QAbstractItemModel
, but other methods like rowCount()
have a
very similar effect. Whenever more data are needed, the state of the tree node changes from the initial "no data have
been requested" to "loading stuff", and an asynchronous request for these data is dispatched. An important part of the
tale is that the request is indeed completely asynchronous, so you won't see any blocking whatsoever in the GUI. The
QTreeView
will show an animation while a subtree is expanded, the message viewer might display a spinner,
and the mail listing shows greyed-out "Loading..." placeholder instead of the usual message subjects.
After a short while, the data arrive and the tree node is updated with the extracted contents -- be it e-mail
subject, or perhaps the attached image of dancing pigs. As the requested data are now here, the status of the tree node
is updated from the previous "loading stuff" into "done". At the same time, an appropriate signal, like
dataChanged
or rowsInserted
, is emitted. Requesting the same data again via the classic MVC
API will not result in network requests, but everything will be accommodated from the local cache.
What we see now is that there is just a handful of item states, yet the typical layout of the TreeItem looks roughly
like this:
enum class FetchingStatus {
INITIAL_NOTHING_REQUESTED_YET,
LOADING,
DONE,
FAILED
};
class TreeItem {
TreeItem *m_parent;
QList<TreeItem*> m_children;
FetchingStatus m_status;
};
On a 64bit system, this translates to at least three 64bit words being used -- one for the painter to the parent
item, one (or much more) for storage of the list of children, and one more for storing the enum
FetchingStatus
. That's a lot of space, given we have just created half a million of these items.
Tagged Pointers
An interesting property of a modern CPU is that the data structures must be aligned properly. A very common
rule is that e.g. a 32bit integer can only start at memory offset which is a multiple of four. In hex, this means that
an address, or a pointer value, could end with 0x0
, or 0x4
, or 0x8
, or
0xc
. The detailed rules are platform-specific and depend on the exact data structure which we are pointing
to, but the important message is that at least some of the low bits in the pointer address are always going to be zero.
Perhaps we could encode some information in there?
Turns out this is exactly what pointer tagging is about.
Instead of having two members, one TreeItem *
and one FetchingStatus
, these are squashed into
a single pointer-sized value. The CPU can no longer use the pointer value directly, all accesses have to go via an
inlined function which simply masks away the lowest bits which do bring a very minor performance hit, but the memory
conservation is real.
For a real-world example, see this commit in
Trojitá.
Using Memory Only When Needed
Back to our example of a mailbox with 500k messages. Surely a user is only going to see a small subset of them at
once, right?
That is indeed the case. We still have to at least reserve space for 500k items for technical reasons, but there is
certainly no need to reserve space for heavy stuff like subjects and other headers. Indeed, in Trojitá, we track the
From/To/Cc/Bcc headers, the subjects, various kinds of timestamps, other envelope items and similar stuff, and this
totals a couple hundred bytes per each message. A couple hundred bytes is not much (pun intended), but "a couple
hundred bytes" times "half a million" is a ton of memory.
This got implemented here. One
particular benchmark which tests how fast Trojitá resynchronizes a mailbox with 100k of messages showed immediate
reduction in memory usage from previous 45 MB to 25 MB. The change, again, does come with a cost; one now has
to follow one more pointer redirection, and one has to perform one more dynamic allocation for each message which is
actually visible. That, however, proves to be negligible during typical usage.
Measure, Don't Guess
As usual with optimizing, the real results might sometimes be surprising. A careful reader and an experienced Qt
programmer might have noticed the QList
above and shuddered in horror. In fact, Trojitá now uses
QVector
in its place, but when I was changing the code, using std::vector
sounded like a
no-brainer. Who needs the copy-on-write semantics here anyway, so why should I pay its price in this context? These data
(list of children of an item) are not copied that often, and copying a contiguous list of pointers is pretty cheap
anyway (it surely is dwarfed by dynamic allocation overhead). So we should just stick with std::vector
,
right?
Well, not really. It turned out that plenty of these lists are empty most of the time. If we are looking at
the list of messages in our huge mailbox, chances are that most of these messages were not loaded yet, and therefore the
list of children, i.e. something which represents their inner MIME structure, is likely empty. This is where the
QVector
really shines. Instead of using three pointers per vector, like the GCC's std::vector
does, QVector
is happy with a single pointer pointing to a shared null instance, something which is
empty.
Now, factor of three on an item which is used half a million times, this is something which is going to hurt.
That's why Trojitá eventually settled on
using QVector
for the m_children
member. The important lesson here is "don't assume,
measure".
Wrapping up
Thanks to these optimization (and a couple more, see the git log), one particular test case now runs ten times faster
while simultaneously using 38% less memory -- comparing the v0.4 with v0.3.96. Trojitá was pretty fast even before, but
now it really flies. The sources of memory diet were described in today's blog post; the explanation on how the time
was cut is something which will have to wait for another day.