The AppKiDo database

From an MVC point of view, by far the biggest challenge in AppKiDo, as measured by hours of coding, re-coding, head-scratching, and forehead-slapping, has been the model, which I call the "database". The views are standard Cocoa views, and although I don't use Cocoa controllers, my home-rolled controller classes are mostly straightforward. The database, however, is a hairy object graph that I've spent a lot of time figuring out how to organize and populate, and it's still not right. It's where massive changes in 0.981 occurred, and where the biggest changes will occur in the next major release.

In this post I'll talk about what the database is and how it gets populated in AppKiDo 0.981.

What's in the database

The elements of the AppKiDo database are called nodes. Roughly speaking, each database node represents a Cocoa programming construct such as a class, method, or function. You could say the whole purpose of AppKiDo is to help you navigate through the universe of database nodes.

Database nodes are represented by the class AKDatabaseNode and its descendants. Each node contains the following information:

  • Name. If the node represents a class, the node name is the class name. If it represents a method, the node name is the method name. And so on, for the most part. In the latest AppKiDo code, I've started using the term "token" to refer to names of API constructs. Tokens map roughly to node names, but sometimes nodes represent groups of tokens. For example, there is a node with the name "Global Variables" that contains names of global variables and #defines in the AppKit framework. The individual variable names (NSCalibratedWhiteColorSpace, NSTokenSize, etc.) are tokens; there isn't a separate database node for each one.
  • Kind. Each node knows what sort of programming construct it represents. This is generally expressed by the class of the node: AKClassNode, AKProtocolNode, AKMethodNode, etc.
  • Relationships. Nodes often have relationships to other nodes. For example, a class node is related to nodes that represent the class's superclass, its subclasses, any protocols it implements, its properties, its class methods, its instance methods, its delegate methods, and its notifications.
  • Framework. Every node belongs to a main framework such as Foundation, AppKit, or Core Data. In addition, if a class has methods that come from multiple frameworks, the class is considered to belong to all those frameworks. For example, NSString's main framework is Foundation, but it is also considered to be an AppKit class because AppKit adds a category on NSString containing methods like -drawInRect:withAttributes:. The reason for noting NSString's dual citizenship, so to speak, is so that it will be included when you select either "Classes in Foundation" or "Classes in AppKit" in the Quicklist drawer. (Note: This is broken in version 0.981 — NSString does not appear in the AppKit list if you're using Xcode 3.)
  • Documentation. Each node knows which section of which HTML file contains the documentation for the programming construct it represents.
  • Header. Class nodes and protocol nodes know the location of the .h file where the class or protocol is declared.
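
To make the list above concrete, here is a rough sketch of what such a node graph could look like. It is written in Python purely for illustration (AppKiDo itself is a Cocoa app), and every class and field name below is a hypothetical stand-in, not the actual AKDatabaseNode API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatabaseNode:
    """Hypothetical stand-in for AKDatabaseNode; all field names are illustrative."""
    name: str
    owning_frameworks: list = field(default_factory=list)  # main framework first
    doc_file: Optional[str] = None     # HTML file holding the documentation
    doc_range: Optional[tuple] = None  # (start, end) byte offsets within doc_file

    @property
    def main_framework(self):
        return self.owning_frameworks[0] if self.owning_frameworks else None


@dataclass
class MethodNode(DatabaseNode):
    """The 'kind' of a node is expressed by its class."""


@dataclass
class ClassNode(DatabaseNode):
    header_path: Optional[str] = None
    # repr/compare are disabled to avoid infinite recursion through the cycle
    # superclass -> subclasses -> superclass.
    superclass: Optional["ClassNode"] = field(default=None, repr=False, compare=False)
    subclasses: list = field(default_factory=list)
    instance_methods: dict = field(default_factory=dict)

    def add_subclass(self, child):
        child.superclass = self
        self.subclasses.append(child)

    def add_instance_method(self, method, framework):
        """Record a method; a method from another framework widens class membership."""
        self.instance_methods[method.name] = method
        if framework not in self.owning_frameworks:
            self.owning_frameworks.append(framework)


ns_object = ClassNode("NSObject", ["Foundation"])
ns_string = ClassNode("NSString", ["Foundation"])
ns_object.add_subclass(ns_string)
ns_string.add_instance_method(
    MethodNode("drawInRect:withAttributes:", ["AppKit"]), "AppKit")
```

In this sketch NSString ends up belonging to both Foundation and AppKit while keeping Foundation as its main framework, which is the "dual citizenship" described in the Framework bullet.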

Populating the database

Almost all the information in the database comes from parsing HTML documentation files. I divide each file into sections that contain documentation for individual tokens. How I do this depends on the kind of file it is. For example, I look at the documentation for a class and I figure out what the class's methods, properties, and so on are. (By the way, I treat notifications and delegate methods very much like regular methods.) I associate each of the tokens I find with the byte range within the file that contains documentation for just that token. I also figure out what part of the file contains the class's "Overview" documentation.
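
The byte-range idea can be sketched in a few lines. This is a simplified illustration, not AppKiDo's parser: it assumes (hypothetically) that each token's documentation starts at an `<a name="...">` anchor and runs until the next anchor or the end of the file. The real HTML needs more careful, file-type-specific handling.

```python
import re

# Assumption for this sketch: each documented token begins at <a name="...">.
ANCHOR_RE = re.compile(rb'<a\s+name="([^"]+)"', re.IGNORECASE)


def token_byte_ranges(html_bytes):
    """Map each anchor name to the (start, end) byte offsets of its section."""
    matches = list(ANCHOR_RE.finditer(html_bytes))
    ranges = {}
    for i, match in enumerate(matches):
        # A section ends where the next anchor starts, or at end of file.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(html_bytes)
        ranges[match.group(1).decode("ascii", "replace")] = (match.start(), end)
    return ranges


sample = (b'<a name="overview"></a><p>Overview text.</p>'
          b'<a name="length"></a><h3>length</h3><p>Returns the length.</p>')
ranges = token_byte_ranges(sample)
start, end = ranges["length"]
section = sample[start:end]  # just the bytes documenting "length"
```

Working on bytes rather than decoded text keeps the offsets valid for seeking directly into the file later, whatever its encoding.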

I do similar scraping of the HTML pages that document protocols, functions, globals, etc.

I get a little additional information by parsing header files — specifically, header files for classes and protocols. For one thing, I see which file each class and protocol is declared in, so AppKiDo can offer the option to view that header. Also, I use headers to figure out whether a protocol is formal or informal: if there is a @protocol declaration for it somewhere, it's formal; otherwise, informal. I think the more recent documentation has started putting the word "Informal" in the title of the documentation when the protocol is informal, so in the future I may not need to use the headers to make this distinction.
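
The formal-versus-informal test amounts to a text search over headers. Here is a minimal sketch, assuming a plain regular-expression scan is good enough; it does not strip comments, so a commented-out @protocol would fool it. The function names are made up for this example.

```python
import re

# Matches "@protocol Foo", "@protocol Foo <NSObject>", and "@protocol Foo, Bar;".
PROTOCOL_RE = re.compile(
    r'@protocol\s+([A-Za-z_]\w*(?:\s*,\s*[A-Za-z_]\w*)*)')


def formal_protocol_names(header_text):
    """Return the set of protocol names that appear after @protocol."""
    names = set()
    for match in PROTOCOL_RE.finditer(header_text):
        names.update(part.strip() for part in match.group(1).split(","))
    return names


def is_formal(protocol_name, header_texts):
    """A protocol counts as formal if any header declares it with @protocol."""
    return any(protocol_name in formal_protocol_names(text)
               for text in header_texts)


headers = [
    "@protocol NSCopying\n- (id)copyWithZone:(NSZone *)zone;\n@end\n",
    "@interface NSObject (NSKeyValueCoding)\n"
    "- (id)valueForKey:(NSString *)key;\n@end\n",
]
copying_is_formal = is_formal("NSCopying", headers)
kvc_is_formal = is_formal("NSKeyValueCoding", headers)
```

In the sample, NSCopying is found via its @protocol declaration, while NSKeyValueCoding appears only as a category on NSObject and so is classified as informal.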

Which HTML files?

Since the information from the headers is relatively minor, populating the database essentially boils down to parsing a bunch of HTML files. Where do we find those files?

This question is closely tied to the question of which frameworks AppKiDo supports. For one thing, it only supports Objective-C frameworks. This was an early design decision; I didn't think searching documentation for C functions was a significant challenge compared to browsing a class hierarchy. But I've gotten more than one request to support Core Foundation, so I may add that in the future. (Historical note: AppKiDo was originally going to support only AppKit and Foundation, since that's where most of my lookup needs were. The name "AppKiDo" is short for "AppKit Documentation".)

Another constraint is that AppKiDo only supports a framework if its documentation is in the exact format that AppKiDo knows how to parse. It used to be that not all frameworks were documented exactly the same; one or two frameworks contained variations in the HTML structure. I believe the docs have gotten a lot more consistent in recent iterations, and I suspect this is related to major improvements that Apple has been making in their documentation machinery.

Before Xcode 3, the directory layout of the documentation was organized more or less by framework. The Foundation docs were in one directory, the AppKit docs were in another, the Core Data docs were in another, and so on. I knew by trial and error which frameworks were AppKiDo-compatible, and I listed them in a plist file called FrameworkInfo.plist, which specified the root documentation directory for each framework as well as the directory containing headers for that framework.

Xcode 3 and docsets

With Xcode 3 came good news and bad news. The bad news was that for some of the new classes introduced by Leopard, the documentation wasn't in the main documentation directory for their frameworks; it was scattered in directories a level or two up. As a consequence, those classes either did not show up in AppKiDo or they showed up with empty documentation.

There was similar bad news around header files. Headers from multiple frameworks were sometimes commingled in the same directory, which confused things.

The good news was that in Xcode 3 Apple organized all their developer documentation into docsets. Docsets are bundles that contain documentation files plus metadata files. The Xcode documentation window uses docsets to support search and navigation.

One metadata file, called docSet.dsidx, turns out to be a SQLite Core Data store containing information on every API token in the documentation, including:

  • what framework it belongs to,
  • what type of token it is (class name, method name, etc.),
  • what HTML file its documentation is in,
  • what header file it is declared in, and
  • a unique "anchor" that is used in URLs throughout the documentation to link to that token.

A wealth of information — a dream come true!

Well, almost. The docset index doesn't know about inheritance relationships. It doesn't distinguish delegate methods from regular methods. For that matter, it doesn't know which class or protocol a given method belongs to — only that it's a method. To get all that missing information, and to calculate the byte ranges for individual chunks of documentation, I still need to parse HTML files. But at least I have a much more precise way of knowing which HTML files to parse, and which header files as well. Also, the unique anchors will come in handy in the future; I've already used them to fix some bugs in traversing hyperlinks.

Furthermore, it seems that any remaining inconsistencies in the structure of the HTML have been removed, and that AppKiDo can parse the docs for any Objective-C framework. No more need for FrameworkInfo.plist, at least under Xcode 3.

In order to support versions of Xcode before 3.0, I start by checking for the presence of the docset index. If it's not there, I assume the documentation is organized the old way, and I fall back on using FrameworkInfo.plist. If the docset index is there, I use Gus Mueller's fmdb library to query it using raw SQL queries. Ideally I'd have been able to use Core Data, but I don't have the .mom for the docset index. My queries are based on totally undocumented, reverse-engineered assumptions about the structure of the SQLite tables. [UPDATE: I don't know exactly when, but at some point Apple started including the .mom along with the docset index.]
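
The check-then-fall-back logic, plus a raw SQL query, can be sketched as follows. This uses Python's sqlite3 module rather than fmdb, and the table and column names (`token`, `framework`, `html_path`) are invented for the example; they are NOT the real layout of docSet.dsidx. The sketch also assumes the index file sits directly in the directory it is given.

```python
import os
import sqlite3
import tempfile


def choose_doc_source(docset_dir):
    """Return ("docset", index_path) if an index exists, else ("plist", None)."""
    index_path = os.path.join(docset_dir, "docSet.dsidx")
    if os.path.exists(index_path):
        return ("docset", index_path)
    return ("plist", None)  # caller falls back to FrameworkInfo.plist


def html_files_for_framework(index_path, framework):
    """List the HTML files to parse for one framework (hypothetical schema)."""
    conn = sqlite3.connect(index_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT html_path FROM token "
            "WHERE framework = ? ORDER BY html_path",
            (framework,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    before = choose_doc_source(tmp)  # no index yet, so the plist path is chosen

    # Build a toy index with the invented schema described above.
    conn = sqlite3.connect(os.path.join(tmp, "docSet.dsidx"))
    conn.execute(
        "CREATE TABLE token (name TEXT, kind TEXT, framework TEXT, html_path TEXT)")
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", [
        ("NSString", "class", "Foundation", "NSString.html"),
        ("length", "method", "Foundation", "NSString.html"),
        ("NSView", "class", "AppKit", "NSView.html"),
    ])
    conn.commit()
    conn.close()

    source_kind, index_path = choose_doc_source(tmp)
    foundation_files = html_files_for_framework(index_path, "Foundation")
```

The point of the index in this flow is only to say which files to parse; the per-token byte ranges and the class relationships still come from the HTML itself.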

(Note: if you decide to poke around in the dsidx file yourself, working only on a copy, of course, be aware that "node" doesn't mean the same thing there as it does in AppKiDo. "Token" does, though.)

Probably 90% of the work in AppKiDo 0.981 was to add support for docsets, along with related changes like support for Dev Tools locations other than /Developer, and support for the iPhone docs.

Changes to come

As I said at the beginning, the database is still not quite right. I want to take better advantage of the token anchors that are in the docset index. I want to get rid of the kludges that I put in to support navigation of hyperlinks.

Most importantly for the next major release, I want to cut down that long startup phase where AppKiDo reparses everything from scratch every time. I want to make the parsing itself faster and more intelligent (I hope I can do both). I want to save the database in a Core Data store; right now the whole thing is in memory.

To support all these improvements, AppKiDo may have to require Xcode 3.

I have plenty of other items on my to-do list, but those are the main ones related to the database.

Apple dropping iPhone NDA

The announcement.

This means I should be able to release AppKiDo for iPhone, as well as the AppKiDo source code, in the pretty near future.

I'll also blog about the internal nuts and bolts of AppKiDo some time soon. I've been meaning to do it for ages, just haven't gotten to it. I think it would be useful for me to think out loud about what I'm doing, just as long as I keep the "thinking" part balanced with the "doing" part.

AppKiDo 0.98 almost in the can

Over the weekend I took care of a bunch of the items on my list for version 0.98. I'm basically down to the release-management phase now.

I decided to release "AppKiDo-for-iPhone" as a separate app with a horrible name rather than try to integrate it all into one app. I gave AppKiDo-for-iPhone the same icon as regular AppKiDo but colored slightly differently. I'm dying to replace the app icon altogether (all the icons, in fact), but I want to dedicate some time to it, and maybe hire a professional. I like what Daniel Jalkut did using NodeBox for his FlexTime icons. I fooled around for two minutes with NodeBox, and it looks like a fun and fascinating way to learn Python and create some nice (though admittedly amateur) graphics at the same time.

I haven't heard back from Apple about whether it's okay to release AppKiDo-for-iPhone, so I think I'll release the Mac-SDK-only version instead of waiting any longer. Since there are a lot of changes under the hood, I may send a sneaky-peek to a few people as a sanity check before doing a full public release.

I feel bad about holding back the iPhone version, but I'm paranoid about violating the NDA. I went so far as to #ifdef out any iPhone-specific strings so they can't be found by someone running 'strings' on the binary.