Aug 29, 2006

Notes on installing and configuring Ubuntu on VMWare

I seem to be installing Ubuntu under VMWare too many times. Since I do the same things every time, I wrote myself a small cheat sheet.

Category:  — Permalink

Aug 22, 2006

Deeply nested if statements

I dislike deeply nested C/C++ code, i.e. code of the form:

if (foo) {
  if (bar) {
    if (anotherVariable) {
    }
  }
}

It’s a simplified example that doesn’t really show the problem with such code - when the logic within an if is long, such code becomes hard to read because it becomes difficult to figure out what logic condition is handled by a given part.

To my surprise I seem to be in the minority. Most people, especially the Windows GUI kind, nest logic conditions like there’s no tomorrow. I find it hard to argue about the superiority of avoiding nesting because saying “this is hard to read” is too vague. But I’m relieved to find out that I’m not the only one who dislikes unnecessary nesting. Stepanov, inventor and implementor of STL, has a great example of simplifying nested logic in his “Professionalism in Programming” paper (PowerPoint, PDF, from his website).

The way to avoid deep nesting is to exit early, as soon as possible. The trivial example could be rewritten as:

  if (!foo) return;
  if (!bar) return;
  if (!anotherVariable) return;
  … and this is the logic

Of course trivial examples teach us nothing - this only shows the forest so that you don’t get lost in details. For a real-life example see how Stepanov refactors deeply nested logic expression (pages 4-13 in the paper above).
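As a slightly less trivial illustration, here is a minimal sketch of the same function written both ways. The Request type, its fields and the function names are made up for this example; they are not taken from Stepanov's paper.

```cpp
#include <cassert>
#include <string>

// Hypothetical request type, used only for illustration.
struct Request {
    bool loggedIn;
    bool hasPermission;
    std::string payload;
};

// Nested version: the real work is buried three levels deep.
std::string handleNested(const Request &r) {
    std::string result = "rejected";
    if (r.loggedIn) {
        if (r.hasPermission) {
            if (!r.payload.empty()) {
                result = "processed " + r.payload;
            }
        }
    }
    return result;
}

// Early-exit version: each precondition is checked and dismissed up front.
std::string handleFlat(const Request &r) {
    if (!r.loggedIn) return "rejected";
    if (!r.hasPermission) return "rejected";
    if (r.payload.empty()) return "rejected";
    return "processed " + r.payload;
}

// Sanity check: both versions must agree for a given input.
void checkSame(const Request &r) {
    assert(handleNested(r) == handleFlat(r));
}
```

In the flat version the condition that guards each exit sits right next to it, so a reader never has to scroll up to find out which if a closing brace belongs to.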


Aug 20, 2006

What I love about Google open-source project hosting

I love how easy it is to create a new project, especially compared to the protracted SourceForge process. It’s so easy that even though I have my own server that already hosts several of my Subversion repositories, I’ve already switched a couple of repositories to Google. It’s just so much simpler to click a few buttons on a web page than to log in to my server, create a repository, update the permissions file etc.

And it’s fast. Fast is good. Slow is bad. svn update being interrupted in the middle due to a server issue is bad. SourceForge is bad.

Now about things that I don’t like.

Their search has outright bugs. I have a project called “qemacs-kjk” but I can’t find it if I search for “qemacs-kjk”. I can find it if I search for “qemacs”. If I were the king of search, I would feel embarrassed right about now.

It’s hard to get to a list of your projects (i.e. your profile page). There’s no obvious link anywhere, you can’t get there through “My Account”. So far the only way I found is to search for one of your projects and use a link from project summary. That’s why browsers will have bookmarks in years to come.

The list of projects in user profile page just begs to have the short description of the project listed next to project name.

Project hosting launched only recently and I’m sure that the team is working hard on adding new features. Here are the features that I would like to see.

Much better repository browsing. Right now it’s the most basic thing you can imagine. Syntax coloring would be nice. Even nicer would be if they ported GeSHi from PHP to Python, used it for syntax coloring on their site and released the code.

Personally I’m a big fan of timeline view (as implemented by e.g. CVSTrac and copied by Trac). I think that in open-source, collaborative projects knowing what happened recently is the most important information. A view like that would be really nice. Most SVN/CVS web-based repository browsers don’t have that, though.

Speaking of CVSTrac/Trac, I think Mr. Hipp had the right idea with issue tracking/wiki/source control integration. Being able to automatically link issues with checkins is good. Very, very good.

Google already has issue tracking. An integrated system for providing the face of the project (i.e. a wiki or something like that) would be nice too.

Personally, I think that for discussions the jos-style forum is the best for projects that don’t generate too much discussion (and that’s a majority; for those that do generate a lot of discussion regular mailing list could be better).

Even if Google Code developers shared my opinion, I’ve worked at too many companies to believe that they wouldn’t be pressured to push for Google Groups integration. I just wish that Groups team would fix their RSS feeds so that:

  • they have full text in the body, so that I don’t have to go back to a web page to read every message. Kind of defeats the purpose of RSS if I have to do that.
  • they actually work in Bloglines. Currently they update as frequently and erratically as the U.S. starts new wars. They should update within 24 hours, to keep the conversation fluid.

Google Code is a very promising alternative to SourceForge. Yes, they don’t claim to compete with SourceForge, but they do. And it’s a good thing because SourceForge was never good and hasn’t improved significantly in years. Google Code is already better (at least for the kinds of things I want to do) and could be a really ass-kicking service in the future.

And if you’d like to see my projects, here they are.


Aug 17, 2006

A simple captcha scheme

Captchas are those blurry and transformed images that recently became so popular on many websites that accept user-contributed content. They are an inevitable consequence of spammers becoming more sophisticated.

My forum for Sumatra PDF recently received an annoyingly high amount of spam, so I decided to put an end to it. Or at least die trying.

I extended the FruitShow forum software with a captcha scheme I’ve stolen from CVSTrac software. It’s very simple: instead of showing blurry images, it asks people to enter a result of a very simple arithmetic expression, like 1+3.
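The idea fits in a few lines. The sketch below is in C++ rather than the PHP that FruitShow is written in, and the Captcha struct and both function names are invented for this illustration; it is not the actual FruitShow or CVSTrac code.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical representation of one challenge: two small numbers to add.
struct Captcha {
    int a;
    int b;
};

// Text shown next to the form field, e.g. "1+3".
std::string captchaQuestion(const Captcha &c) {
    assert(c.a >= 0 && c.b >= 0);
    return std::to_string(c.a) + "+" + std::to_string(c.b);
}

// True if the submitted text parses as a number equal to the sum,
// with no trailing characters after the digits.
bool captchaAccepts(const Captcha &c, const std::string &answer) {
    if (answer.empty()) return false;
    char *end = nullptr;
    long value = std::strtol(answer.c_str(), &end, 10);
    return *end == '\0' && value == c.a + c.b;
}
```

A real forum would pick the two numbers at random for every form it renders and check the answer on the server when the post is submitted.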

It seems to work for CVSTrac so maybe it’ll stop spam on my forum as well (so far it’s been a couple of days without a single spam).

Technically, it’s not hard to defeat - all the data needed for a correct response is in the html and I’m not even trying to do anything fancy like hiding the numbers in JavaScript (so that the spam bot would need a full JavaScript evaluation engine).

I’m counting more on the obscurity of the method. While it would be easy to manually modify the spam bot to defeat this particular captcha, I’m hoping that no-one will bother to put in the effort just so that they can spam one website.

FruitShow, by the way, rocks. It took me an hour and just a few lines of PHP to add this. Too bad it doesn’t seem to be developed anymore (3 months of checkin silence).


Aug 16, 2006

Paradox of bad comments

Bad comments are worse than having no comments at all. Given that writing comments (good or bad) takes time, you would think that obviously bad comments would be very rare.

In my experience, they’re not. Hence a paradox of bad comments.

I often see comments stating blindingly obvious things, e.g. comments of the kind:

class Foo {
    // constructor for Foo
    Foo();
};

or:

/* returns width */
int getWidth();

So why do people write such useless comments if they could get their job done more quickly without spending time to write them?

My theory is: guilt.

It’s not that those programmers don’t know that useless comments are, well, useless, or that they couldn’t, given enough time to reflect, classify such comments as useless.

Programmers know that writing good comments is important. However, writing good comments is hard. By nature, good comments only explain tricky, unexpected behaviour of the code and those things are hard to explain well.
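For contrast, here is a sketch of a comment that earns its keep. The midpoint function is a made-up example, not taken from any code discussed here; the point is only that the comment explains something the code alone does not.

```cpp
#include <cassert>

// Returns the value halfway between lo and hi (rounded down).
// Requires 0 <= lo <= hi.
//
// Deliberately computed as lo + (hi - lo) / 2 and not (lo + hi) / 2:
// the second form overflows int when lo and hi are both large,
// even though the result itself would fit.
int midpoint(int lo, int hi) {
    assert(0 <= lo && lo <= hi);
    return lo + (hi - lo) / 2;
}
```

A comment like "returns the midpoint" would have been quicker to write, and it would not have stopped anyone from "simplifying" the expression back into the overflowing form.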

On top of that, writing comments often has to be postponed until code has been written and tested at which point there’s little incentive to add them.

Writing good comments is hard (which is why they’re rarely written) but programmers feel guilty when programs have no comments at all, so they kill that guilty feeling by writing the easy, but useless, comments.


Aug 15, 2006

Order of #include headers in C/C++

One thing I’ve learned is that maintaining a good #include hierarchy (which is closely related to good design, i.e. good partitioning of code into independent modules) requires eternal vigilance. It’s easy to slack off and end up with a mess (circular dependencies, or files that compile only because some other file happened to have been included somewhere in the #include chain).

This mess is not a theoretical problem: it becomes very real when you modify the code and suddenly it doesn’t compile because of wrong #include dependencies that are hard to track down and fix.

One big project I’ve worked on had this problem and a running joke was that every couple of months some developer would get determined to fix it once and for all by cleaning up headers. After all, how hard can it be? Turns out it was very hard and no-one succeeded.

For that reason I cringe every time I see #include "stdafx.h" - it’s a free ticket to future dependency hell.

A trick I recently settled upon helps to keep a clean #include hierarchy. In the past (for no reason I can remember) I would put the #include lines for system headers first in my *.c files. These days the first #include in module foo.c is for “foo.h”.

Why?

The golden rule for #include files is that if a module bar.c uses foo.c, everything needed to use foo.c should be declared in foo.h. Chances are that foo.h itself uses definitions from system includes. If all places that #include “foo.h” also happen to include those system includes before foo.h, things will compile just fine, but only by accident.

Which is not a problem until you forget to #include those system includes and are faced with weird (“it used to work just fine”) compiler errors.

When a module includes its own header as the very first thing, those mistakes show up early: if foo.h is missing an #include, foo.c itself fails to compile.
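A minimal sketch of the layout, with both files shown in one listing so that it compiles as a single unit. The function foo_count_spaces is hypothetical and exists only for this example; in a real project the first part would live in foo.h and the second part would start with #include "foo.h".

```cpp
// ---- foo.h (hypothetical header) ----
#ifndef FOO_H
#define FOO_H

#include <stddef.h>  // foo.h uses size_t, so foo.h itself includes its definition

size_t foo_count_spaces(const char *s);

#endif

// ---- foo.c (hypothetical module) ----
// In the real file the first line would be:
//   #include "foo.h"
// and only then the system headers the implementation needs.
#include <assert.h>
#include <string.h>

size_t foo_count_spaces(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0, len = strlen(s); i < len; i++) {
        if (s[i] == ' ') n++;
    }
    return n;
}
```

If foo.h forgot to include stddef.h, this ordering would make the compiler complain about size_t right away, instead of months later in some unrelated bar.c.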


Aug 14, 2006

Performance optimization story

The story you’re about to read makes these major points about optimizing software for speed:

  • it’s good to read other people’s sources. You will learn new tricks.
  • performance work is driven by data. Don’t guess what is slow, measure it.
  • a good profiler is extremely helpful in getting the data
  • lots of allocations of small objects aren’t good in a C/C++ program

When working on my Sumatra PDF viewer for Windows, I decided to take a look at the performance. I profiled the code that parses a rather large (~8MB) PDF. I found a rather surprising thing: a lot of time was spent inside malloc()/free() (they were in the top 10 most expensive functions in the profile) and a large portion of those allocations/frees was for strings. The code in question has its own, simple GooString class.

To get more data I instrumented the GooString destructor to find out what the typical sizes of the strings are. An allocation histogram told me that about 90% of them are 16 bytes or less.

Then I looked at the implementation. GooString is a very typical implementation. It keeps track of the size of string and a pointer to allocated string i.e. (to paraphrase):

class GooString {
    int length;
    char *str;
};

It does have an interesting trick. Most typical implementations allocate more memory than strictly needed for the string, which avoids frequent re-allocation when you add data to the string. So they also have to keep track of how big the actual allocated area is, e.g.:

class DumberString {
    int allocated; /* the real size of 'str' buffer */
    int length;
    char *str;
};

GooString gets rid of ‘allocated’ variable by using a rounding function based on size e.g.:

static inline int rounded_size(int len) {
    int delta;

    delta = len < 256 ? 7 : 255;
    return ((len + 1) + delta) & ~delta;
}

That way GooString saves 4 bytes per object. Not that it usually matters, as we’ll find out very shortly, but it illustrates that reading other people’s code is useful. I’ve seen a couple of string implementations but this is the first time I noticed that particular trick, and I would probably never have come up with it by myself.
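To see why the rounding function makes the 'allocated' field unnecessary, it helps to plug in a few lengths. In the sketch below rounded_size is the function from above; needsRealloc is a hypothetical helper added for illustration and is not part of GooString.

```cpp
#include <cassert>

// Same rounding rule as in the text: buffers grow in steps of 8 bytes
// for short strings and in steps of 256 bytes for longer ones.
static inline int rounded_size(int len) {
    int delta = len < 256 ? 7 : 255;
    return ((len + 1) + delta) & ~delta;
}

// Hypothetical helper: the buffer size is a pure function of the length,
// so comparing the rounded sizes tells us whether the buffer must change.
bool needsRealloc(int oldLen, int newLen) {
    assert(oldLen >= 0 && newLen >= 0);
    return rounded_size(oldLen) != rounded_size(newLen);
}
```

Lengths 0 through 7 all map to an 8-byte buffer, 8 through 15 map to 16 bytes, and a 256-character string gets 512 bytes. Growing a string from 3 to 7 characters therefore needs no reallocation, while growing it from 7 to 8 does.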

The problem with GooString is that creating an instance causes 2 allocations: one for the object and another for the str pointer.

You might think that the amount of memory taken from the system for a 1-byte string (an empty string that only contains terminating zero) would be sizeof(GooString) (8) + 1 i.e. 9 bytes.

This is not so. First, most systems round allocations. You can find out the rounding on your system with printf("rounding: %d\n", -(int)((char*)malloc(1)-(char*)malloc(1))). On my Ubuntu Linux this turns out to be 16. So allocating 1 byte or 16 bytes takes the same amount of memory from the system: 16 bytes. So suddenly one instance of GooString actually costs us 32 bytes.

But that’s not all. The OS has to somehow keep track of each allocation. How it’s done and what the exact overhead is are highly implementation dependent, but we can safely assume at least 8 bytes per allocation (that’s just 2 32-bit pointers). So the real cost of allocating a 1-byte GooString is at least 48 bytes. And we thought it was 9.
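The arithmetic can be spelled out as a small sketch. The constants are the assumptions from the text (16-byte rounding as measured on one Ubuntu machine, an assumed 8 bytes of bookkeeping per allocation, an 8-byte GooString object on a 32-bit system); the function names are made up for this example and real allocators will differ.

```cpp
#include <cassert>

// Rounds n up to the next multiple of granularity (must be a power of two).
int roundUp(int n, int granularity) {
    assert(granularity > 0 && (granularity & (granularity - 1)) == 0);
    return (n + granularity - 1) & ~(granularity - 1);
}

// Memory taken from the system by one malloc(n) under the assumptions above.
int allocationCost(int n) {
    const int kRounding = 16;  // assumed allocation granularity
    const int kOverhead = 8;   // assumed bookkeeping, two 32-bit pointers
    return roundUp(n, kRounding) + kOverhead;
}

// One GooString holding an empty string: the object plus a 1-byte buffer.
int emptyGooStringCost() {
    const int kSizeofGooString = 8;  // int + pointer on a 32-bit system
    return allocationCost(kSizeofGooString) + allocationCost(1);
}
```

Each of the two allocations costs 16 + 8 = 24 bytes, which gives 48 bytes in total.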

There is a better way. A trick used in the dynamic string implementation of the venerable Tcl language is a static buffer that is part of the string object:

#define STR_STATIC_SIZE 16
class BetterString {
    char sStatic[STR_STATIC_SIZE];
    int length;
    char *str;
};

If the size of the string is less than STR_STATIC_SIZE, 'str' points to 'sStatic'. If it’s bigger, we allocate the string as before. That way for strings smaller than STR_STATIC_SIZE we don’t have to allocate memory for the buffer (halving the number of allocations). It doesn’t even cost us more memory in most cases, since for small strings we avoid the minimum 24-byte cost of allocating at least 1 byte, and for larger strings the overhead is small compared to the total size.
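A minimal sketch of how such a class could look. This is an illustration written for this post's layout, not the actual GooString or Tcl code; the constructor, the accessors and the decision to forbid copying are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

const int STR_STATIC_SIZE = 16;

class BetterString {
public:
    explicit BetterString(const char *s) {
        assert(s != nullptr);
        length = static_cast<int>(std::strlen(s));
        if (length + 1 <= STR_STATIC_SIZE) {
            // Short string (terminating zero included): no allocation at all.
            str = sStatic;
        } else {
            str = static_cast<char *>(std::malloc(length + 1));
            assert(str != nullptr);
        }
        std::memcpy(str, s, length + 1);
    }

    ~BetterString() {
        // Only free what was actually taken from the heap.
        if (str != sStatic) std::free(str);
    }

    // Copying is disabled to keep the sketch short: a copy would have to
    // re-point str at its own sStatic buffer.
    BetterString(const BetterString &) = delete;
    BetterString &operator=(const BetterString &) = delete;

    const char *c_str() const { return str; }
    int size() const { return length; }
    bool usesStaticBuffer() const { return str == sStatic; }

private:
    char sStatic[STR_STATIC_SIZE];
    int length;
    char *str;
};
```

The one thing that is easy to get wrong is the destructor: it must never call free() on the embedded buffer.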

You can tweak STR_STATIC_SIZE. The bigger it is, the faster we’ll be (fewer cases where we need to allocate additional storage) but the more memory we’ll use.

In my particular case, implementing this trick reduced allocations due to strings by 45% (since 90% of strings were smaller than STR_STATIC_SIZE and each of them now needs one allocation instead of two), which improved loading time by 10%. And that was a very simple change.

So let’s recap the things we can learn from this story.

The only way to know what is slow is to get data, i.e. profile the app. The PDF parser and renderer I use is a complex piece of code. It would be pointless for me to try to guess which part of it is slow.

A good profiler is essential for getting the right data. An hour spent profiling and reading the results pointed me in the right direction.

It’s important for a programmer to read other people’s source code. I’ve learned new tricks from reading the source of GooString. I’ve learned new tricks from reading Tcl’s implementation. In the end it’s much cheaper than trying to come up with those ideas on my own.

And finally, as you can see, allocating small objects in C/C++ has a huge overhead, so try not to do it. Unfortunately, naive implementations of common data structures (strings, nodes in trees or lists) require lots of small allocations. A good answer to this problem is a custom allocator that pre-allocates a large number of objects of a given type and uses a bitmap to keep track of which ones are used (1 bit of overhead per object as opposed to 8 bytes + whatever rounding to 16 takes). And, if done right, it should be faster than a standard OS allocator. But that’s a story for another day.
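To make the bitmap idea concrete, here is a minimal sketch of such an allocator with a fixed 64 slots. The class name, the slot count and the linear scan for a free bit are choices made for this illustration only; a version meant to beat the system allocator would need a smarter search and a way to grow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size pool: storage for 64 objects of type T,
// tracked by a single 64-bit bitmap (1 bit of overhead per object).
template <typename T>
class BitmapPool {
public:
    static const int kSlots = 64;

    BitmapPool() : used(0) {}

    // Returns raw storage for one T, or nullptr when all slots are taken.
    void *alloc() {
        for (int i = 0; i < kSlots; i++) {
            std::uint64_t bit = std::uint64_t(1) << i;
            if ((used & bit) == 0) {
                used |= bit;
                return slots[i].bytes;
            }
        }
        return nullptr;
    }

    // Marks the slot that p points to as free again.
    void release(void *p) {
        Slot *s = static_cast<Slot *>(p);
        std::ptrdiff_t i = s - slots;
        assert(i >= 0 && i < kSlots);
        std::uint64_t bit = std::uint64_t(1) << i;
        assert((used & bit) != 0);  // catches double release
        used &= ~bit;
    }

    // Number of slots currently handed out.
    int usedCount() const {
        int n = 0;
        for (int i = 0; i < kSlots; i++) {
            if (used & (std::uint64_t(1) << i)) n++;
        }
        return n;
    }

private:
    struct Slot {
        alignas(T) unsigned char bytes[sizeof(T)];
    };
    Slot slots[kSlots];
    std::uint64_t used;
};
```

The caller is expected to construct objects in the returned storage (for example with placement new) and to destroy them before calling release().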


Aug 12, 2006

Where do bugs come from and how to avoid them.

It’s frustrating that we, programmers, write bugs in our code.

One way to figure out how to avoid writing bugs (or, more realistically, write fewer bugs) is to understand where bugs come from.

Recently I realized that there are only two sources of software bugs:

  • ignorance
  • carelessness

Probably not the greatest of insights, but I haven’t seen it spelled out this way before.

Since it seemed like a topic bigger than a blog post, I’ve written up a longer article about where bugs come from and how to avoid them.


Aug 07, 2006

The missing msvcr80.dll story.

It all started with a complaint that Sumatra PDF v0.2 (a yellow PDF viewer that I’ve just released) doesn’t run.

After some psychic debugging and a little help from my friends I figured out that the reason is that it won’t run on machines that lack msvcr80.dll.

Programs usually use C library calls, and those usually reside in a shared library so that as much code as possible is shared between different apps running at the same time.

For ages (which means up to Visual Studio 6) you could pretty much rely on that DLL being present on people’s machines.

Visual Studio 2002/2003/2005 each use their own version of that library: msvcr70.dll/msvcr71.dll/msvcr80.dll. And that is a problem. Two problems, actually.

The first problem is that you can’t expect msvcr80.dll to be available on people’s machines. This is nasty because a developer, by virtue of installing Visual Studio 2005, has msvcr80.dll installed, so he doesn’t see the problem. Only when he gives the binary to someone to run on a machine without msvcr80.dll does it mysteriously fail. I’ve seen it a couple of times in the past and still make this mistake. You can diagnose this (and other DLL-loading related problems) using Dependency Walker.

What’s the solution to problem one?

Distribute msvcr80.dll (and other DLLs you depend on that are not guaranteed to be installed everywhere) with your app. MSDN has articles on that e.g. this one and this one.

The solution I’ve chosen is static linking. I changed the C++\Code Generation\Runtime Library setting from multi-threaded DLL (/MD) to just multi-threaded (/MT). This caused Visual Studio to complain about a conflict with libcmtd.lib. I don’t understand why, or whether it’s serious or not, but just in case I’ve added libcmtd.lib to the “Linker\Ignore Specific Library” list.

Problem number two is worse. Due to how the C library is implemented, you can’t safely use different versions of the C library at the same time. The need for that arises surprisingly often, especially in these days of abundant open-source code. Imagine your app wants to use libxml2 and you downloaded a libxml2 library and header files that were compiled with Visual Studio 6. libxml2 opens files so it had to link to the VS 6 version of the C library. Imagine that you’re compiling your app with Visual Studio 2005 (e.g. because you can no longer buy VS 6) and you also need to use the C library. It’s not possible to link from Visual Studio 2005 to the older C library so you have to link to msvcr80.dll. Now you’re in trouble.

When you call both the old C library and the new one, things might work but they can also break mysteriously. I once debugged a problem caused by exactly that. A Python app compiled with Visual Studio 2005 was calling into an extension compiled with VS 6. As it happened, the code compiled with VS 2005 was closing a file descriptor (i.e. calling close() inside msvcr80.dll) that was allocated by the extension (using an open() call inside the VS 6 version of the C lib dll). And the app would die right there. Turns out that the mapping from file descriptors (which are just integers) to Windows file handles is part of the C lib. The same file descriptor in one dll maps to a completely different file handle in the other dll.
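A toy model shows why this cannot work. The class below is purely illustrative: it only mimics the fact that every copy of the C library keeps its own private descriptor table, and it has nothing to do with how the real runtime stores that table.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for one copy of the C runtime and its private fd table.
class ToyRuntime {
public:
    // "Opens" a file: remembers the OS handle and hands back a small integer.
    int open(int osHandle) {
        assert(osHandle >= 0);
        handles.push_back(osHandle);
        return static_cast<int>(handles.size()) - 1;
    }

    // Closes fd only if this particular runtime handed it out and it is still open.
    bool close(int fd) {
        if (fd < 0 || fd >= static_cast<int>(handles.size())) return false;
        if (handles[fd] == kClosed) return false;
        handles[fd] = kClosed;
        return true;
    }

private:
    enum { kClosed = -1 };
    std::vector<int> handles;
};
```

With two instances standing in for the VS 6 and VS 2005 runtimes, a descriptor returned by one is meaningless to the other: the second instance either does not know the number at all or, worse, has a different file of its own behind the same number.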

There is no easy fix for this problem. If you can, your best choice is to re-compile everything with the same compiler. For many open-source projects it’s easier said than done. Because of their strong Unix heritage they often care little about the state of the Windows build (if it compiles under Cygwin, it’s clearly good enough) so a simple act of compilation can turn into a multiple-day endeavour of creating a build system and fixing small but annoying portability problems.


Aug 06, 2006

php_mysql.dll not loading in PHP 5.1.4 and Apache 2.2

Installing software is always fun. A recent purchase of a new computer forced me to redo the setup of my test web development environment (i.e. Apache + PHP + MySQL) under Windows XP. Something I did at least 3 times in the past.

Murphy, as always, won, and things that could go wrong, went.

But this is an educational tale, so let’s get to the point.

The official Windows installer for the latest Apache 2.0.x is so bad that after a vanilla installation with no changes at all, Apache refused to run. I didn’t feel like debugging problems caused by a broken installer, so I tried the latest 2.2.x installer. That worked out of the box (although the Apache monitor that they start by default and stick into the tray seems pointless to me).

The second problem was that the latest version of Apache (2.2.x) isn’t binary-compatible with 2.0 and the latest PHP 5.1.4 only comes with a 2.0 adapter, so I had to hunt down a binary of the 2.2 adapter. Thank god for Apache Lounge.

The third problem was that even after setting the correct extension_dir in php.ini and uncommenting extension=php_mysql.dll, I was getting a mysterious error:

PHP Warning:  PHP Startup: Unable to load dynamic library ‘C:\\php-514\\ext\\php_mysql.dll’ - The specified module could not be found.\r\n in Unknown on line 0

It’s mysterious because the file clearly is there. Lucky for me that I misspent part of my life doing Windows programming, which includes intricacies of DLL loading, and I know that a DLL might not get loaded even if it exists when it refers to another DLL that is not present in %PATH%.

After some googling I found that others had the same problem and the solution proposed was to copy libmysql.dll to c:\windows or c:\windows\system32. Polluting system directories is a gross solution but at the time I would be happy if things just worked. However, copying libmysql.dll to c:\windows didn’t solve the problem.

In a flash of inspiration I decided to run Dependency Walker on php_mysql.dll to see what else it might depend on. Turns out it also needs php5ts.dll and probably in some cases msjava.dll (that is nowhere to be found and fortunately doesn’t seem necessary). I copied php5ts.dll to c:\windows as well and voila, things work.

I, however, am left with a bad taste in my mouth. More on that later, now let’s recap wisdom gained:

  • Apache logs are essential for diagnosing Apache-related problems
  • Dependency Walker is one of those tools that you need once a year, but when you do, it saves you. Learn to use it and learn about DLLs in Windows. Apparently installing PHP requires that.

Now about bad taste in my mouth.

Apache and PHP are amongst the most popular open-source projects and yet the Windows experience of using them is awful.

Managing dependencies is a nightmare (3 versions of Apache + a dozen versions of PHP, good luck figuring out what’s compatible with what).

The open-source world, in general, is not good about maintaining binary compatibility. Apache 2.0 was supposed to be the big rewrite, so what was so important to add to Apache 2.2 to justify a 3rd, incompatible binary interface?

Forcing people to copy DLLs to system folders to make things work is bad and easily fixable. PHP could easily figure out the path of its main DLL and add it to the path so that DLLs that it knows live there could be loaded by the system.

Let’s not forget misleading error messages. Could not be found? But it’s there! Windows is partly to blame, because that’s probably the error that it returns to PHP, but given that it seems to be a common problem, PHP could be smart enough to stat the file to see whether it’s really not there or it’s not loading for some other reason.
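The check being asked for is small. The sketch below is a hypothetical helper, not PHP's actual code: it simply tries to open the file in order to tell the two cases apart and words the message accordingly.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: called after loading the library at `path` has
// failed, it produces a message that says which kind of failure it was.
std::string describeLoadFailure(const std::string &path) {
    assert(!path.empty());
    std::ifstream probe(path.c_str(), std::ios::binary);
    if (!probe.is_open()) {
        return "Unable to load '" + path +
               "': the file does not exist or cannot be read.";
    }
    return "Unable to load '" + path +
           "': the file exists, so a library it depends on is probably missing.";
}
```

The second message would have pointed straight at the php5ts.dll dependency instead of suggesting that php_mysql.dll itself was missing.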

Given those problems it’s only a minor issue that the Apache Foundation seems to change its mind about where Apache gets installed about once a year.
