Scalable System Design

Building scalable systems is becoming an increasingly hot topic. As more and more people use computers these days, both the transaction volume and users' performance expectations have grown tremendously.

This post covers general considerations. I have other blog posts with more specific coverage of DB scalability and Web site scalability.

General Principles

"Scalability" is not equivalent to "Raw Performance"
  • Scalability is about reducing the adverse impact of growth on performance, cost, maintainability and many other aspects
  • e.g. Running every component in one box gives higher performance when the load is small, but it is not scalable because performance drops drastically once the load grows beyond the machine's capacity

Understand the workload conditions that the system is designed for
  • Dimension of growth and growth rate: e.g. Number of users, Transaction volume, Data volume
  • Measurement and their target: e.g. Response time, Throughput
Understand who your priority customers are
  • Rank the importance of traffic so you know what to sacrifice in case you cannot handle all of it
Scale out, not up
  • Scale the system horizontally (add more cheap machines) rather than vertically (upgrade to a more powerful machine)
Keep your code modular and simple
  • The ability to swap out old code for new code without worrying about breaking other parts of the system lets you experiment with different optimizations quickly
  • Never sacrifice code modularity for any reason, including performance
Don't guess the bottleneck, Measure it
  • Bottlenecks are slow code paths that are executed frequently. Don't optimize slow code that is rarely executed
  • Write performance unit tests so you can collect fine-grained performance data at the component level
  • Set up a performance lab so you can conduct end-to-end performance measurements easily
Plan for growth
  • Do regular capacity planning: collect usage statistics and predict the growth rate


Common Techniques

Server Farm (real time access)
  • If there is a large number of independent (potentially concurrent) requests, you can use a server farm, which is basically a set of identically configured machines fronted by a load balancer.
  • The application itself needs to be stateless so that requests can be dispatched purely based on load conditions and not other factors.
  • Incoming requests will be dispatched by the load balancer to different machines and hence the workload is spread and shared across the servers in the farm.
  • The architecture allows horizontal growth so when the workload increases, you can just add more server instances into the farm.
  • This strategy is even more effective when combined with cloud computing, as adding more VM instances to the farm is just an API call (a minimal dispatch sketch follows below).
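
As a rough sketch of the dispatching idea (not tied to any particular load balancer product; the server addresses are made up), a C++ round-robin dispatcher could look like this. Because the servers are stateless and identically configured, picking the next one is just a shared counter.

#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Minimal round-robin dispatcher sketch (hypothetical server addresses).
// Stateless application servers mean any of them can take the next request.
class RoundRobinBalancer {
public:
    explicit RoundRobinBalancer(std::vector<std::string> servers)
        : servers_(std::move(servers)) {}

    // Address of the server that should handle the next incoming request.
    const std::string& next() {
        std::size_t i = counter_.fetch_add(1, std::memory_order_relaxed);
        return servers_[i % servers_.size()];
    }

private:
    std::vector<std::string> servers_;
    std::atomic<std::size_t> counter_{0};
};

A real load balancer also tracks health and per-server load, but the stateless-dispatch idea is the same.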
Data Partitioning
  • Spread your data across multiple DBs so that the data access workload can be distributed across multiple servers
  • By nature, data is stateful, so there must be a deterministic mechanism to dispatch data requests to the server that hosts the data
  • The data partitioning mechanism also needs to take the data access pattern into consideration. Data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the access pattern shifts.
  • Most distributed key/value stores do this (a hash-partitioning sketch follows below)
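
A minimal sketch of the deterministic dispatch idea, assuming plain hash partitioning by key (not any particular store's scheme):

#include <cstddef>
#include <functional>
#include <string>

// Hash partitioning sketch: the same key always maps to the same shard,
// so requests for that key can be routed to the server hosting it.
std::size_t shard_for(const std::string& key, std::size_t shard_count) {
    return std::hash<std::string>{}(key) % shard_count;
}

// e.g. shard_for("user:12345", 8) always returns the same shard index.

Plain modulo hashing reshuffles most keys whenever the shard count changes; consistent hashing is the usual refinement when shards are added and removed frequently.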
Map / Reduce (Batch Parallel Processing)
  • The algorithm itself needs to be parallelizable. This usually means the steps of execution should be relatively independent of each other.
  • Google's Map/Reduce is a good framework for this model, and Hadoop is an open source Java framework in the same spirit (a tiny map/reduce-style sketch follows below).
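
To make the parallelizable-steps idea concrete, here is a tiny single-machine sketch in the same spirit (using C++ std::async, not Hadoop or Google's distributed framework): each chunk is processed independently (map), and the partial results are then combined (reduce).

#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Map/reduce-style sketch: sum a large vector by summing chunks in parallel.
long parallel_sum(const std::vector<int>& data, std::size_t chunks) {
    std::vector<std::future<long>> partials;
    std::size_t chunk_size = (data.size() + chunks - 1) / chunks;
    for (std::size_t start = 0; start < data.size(); start += chunk_size) {
        std::size_t end = std::min(start + chunk_size, data.size());
        partials.push_back(std::async(std::launch::async, [&data, start, end] {
            // Map step: each chunk is summed independently.
            return std::accumulate(data.begin() + start, data.begin() + end, 0L);
        }));
    }
    // Reduce step: combine the partial results.
    long total = 0;
    for (auto& f : partials) total += f.get();
    return total;
}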
Content Delivery Network (Static Cache)
  • This is common for static media content. The idea is to create many copies of the content, distributed geographically across servers.
  • User requests are routed to the server replica in closest proximity.
Cache Engine (Dynamic Cache)
  • This is a time vs. space tradeoff. Some executions may use the same set of input parameters over and over again. Therefore, instead of redoing the same execution for the same input parameters, we can remember the result of the previous execution.
  • This is typically implemented as a lookup cache.
  • Memcached and EHCache are popular caching packages (a lookup-cache sketch follows below)
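
A minimal in-process lookup-cache sketch (just the idea, not the Memcached or EHCache API): compute once per distinct key and remember the result.

#include <functional>
#include <unordered_map>

// Lookup cache (memoization) sketch: wraps an expensive computation and
// remembers its result keyed by the input.
template <typename Key, typename Value>
class LookupCache {
public:
    explicit LookupCache(std::function<Value(const Key&)> compute)
        : compute_(std::move(compute)) {}

    const Value& get(const Key& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // Cache miss: compute once and remember the result.
            it = cache_.emplace(key, compute_(key)).first;
        }
        return it->second;
    }

private:
    std::function<Value(const Key&)> compute_;
    std::unordered_map<Key, Value> cache_;
};

A real cache engine would add eviction (LRU, TTL) and possibly network access, but lookup-or-compute is the core pattern.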
Resource Pool
  • DB sessions and TCP connections are expensive to create, so reuse them across multiple requests (a minimal pool sketch follows below)
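
A minimal pooling sketch, with a hypothetical Connection type standing in for a DB session or TCP connection:

#include <memory>
#include <mutex>
#include <vector>

// Resource pool sketch: expensive objects are created once and handed out
// repeatedly instead of being rebuilt for every request.
template <typename Connection>
class ResourcePool {
public:
    // Return a connection to the pool so it can be reused later.
    void put(std::unique_ptr<Connection> conn) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(conn));
    }

    // Returns an idle connection, or nullptr if the pool is empty and the
    // caller should create (and later put back) a new one.
    std::unique_ptr<Connection> get() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) return nullptr;
        auto conn = std::move(idle_.back());
        idle_.pop_back();
        return conn;
    }

private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<Connection>> idle_;
};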
Calculate an approximate result
  • Instead of calculating an exact answer, see if you can trade some accuracy for speed.
  • In real life, some degree of inaccuracy is usually tolerable.
Filtering at the source
  • Try to do more processing upstream (where the data gets generated) than downstream, because it reduces the amount of data being propagated.
Asynchronous Processing
  • You make a call that returns a result, but you don't need the result until a much later stage of your process. Therefore, you don't have to wait right after making the call; instead, you can proceed to do other things until you reach the point where the result is needed.
  • In addition, a waiting thread is idle but still consumes system resources. For a high transaction volume, the number of idle threads is roughly (arrival_rate * processing_time), which can be very large when the arrival rate is high, leaving the system running in a very inefficient mode.
  • The service call in this example is better handled using an asynchronous processing model. This is typically done in two ways: callback and polling.
  • In callback mode, the caller needs to provide a response handler when making the call. The call itself returns immediately, before the actual work is done on the server side. When the work is done later, the response comes back on a separate thread, which executes the previously registered response handler. Some coordination may be required between the calling thread and the callback thread.
  • In polling mode, the call itself returns a "future" handle immediately. The caller can go off and do other things, and later poll the "future" handle to see if the response is ready. In this model, the caller creates no extra callback thread, so no extra thread coordination is needed (a polling sketch follows below).
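
Here is a minimal sketch of the polling model using C++'s std::future; the service call and its timing are made up for illustration.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Simulated slow remote service call.
int slow_service_call() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    return 42;
}

int main() {
    // The call returns a future handle immediately.
    std::future<int> handle = std::async(std::launch::async, slow_service_call);

    // ... go off and do other work that does not need the result ...

    // Poll the handle to see whether the response is ready yet.
    if (handle.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
        std::cout << "response already available\n";

    // Block only at the point where the result is actually required.
    std::cout << "result = " << handle.get() << "\n";
}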
Implementation design considerations
  • Use efficient algorithms and data structures. Analyze the time (CPU) and space (memory) complexity of logic that is executed frequently (i.e. hot spots). For example, carefully decide whether a hash table or a binary tree should be used for lookups.
  • Analyze your concurrent access scenarios when multiple threads access shared data. Carefully analyze the synchronization and make sure the locking is fine-grained enough. Also watch for any possibility of deadlock and how you would detect or prevent it. A wrong concurrent access model can have a huge impact on your system's scalability. Also consider using lock-free data structures (e.g. Java's concurrent package has a couple of them).
  • Analyze the memory usage patterns in your logic. Determine where new objects are created and where they become eligible for garbage collection. Be wary of creating a lot of short-lived temporary objects, as they put a high load on the garbage collector.
  • However, never trade off code readability for performance (e.g. don't try to bundle too much logic into a single method); let the VM optimize the execution for you.

iPhone SDK delayed?

It's beginning to look as though the iPhone SDK may not ship this month since Apple has been silent about it. It was supposed to ship by the end of February. Of course this doesn't really have any impact on Run BASIC since it will not use the SDK, but it will produce apps that run in the iPhone's Safari web browser.

We haven't talked much about our plans for iPhone support in Run BASIC. We should have more to say in the next month or so.

BASIC is bad because it's too easy?

A response to a recent post read, "BASIC is good, but I think it should be avoided as a first language because it pampers the programmer too much."

I won't be the one to say that BASIC is perfect. There is no perfect language. However, to recommend that a first language should not "pamper" the beginning programmer seems to me to miss the mark completely.

I'm guessing that this means that languages should force the beginner to be aware of low level details such as the type of numeric value (int, float, byte, etc.) or that the beginner should be made to manage the allocation and deallocation of memory. What do these sorts of things have to do with the essence of programming? Since there are many languages which do not have these kinds of features, I can only submit that they aren't essential to programming. Therefore they are not necessary ideas to teach the beginner.

The nitty gritty details of how a computer works ARE important. These things should be taught to any serious student of computers, but they do not need to be the first thing taught. People who do not fancy themselves experts do not need to be bothered to learn them.

Easy is the quality that BASIC has, especially in how quickly it can be learned. I agree this is almost to a fault, in the sense that a slightly more general and abstract language might be a little harder to warm up to but better in the long haul. However, for the person who programs for fun, or who needs a light language for writing utilities or small personal applications, I think BASIC hits the mark pretty well.

Let Us Grow Our Community

Click Here to Submit Your Article


I guess many sites/blogs don't talk about this too often, but I do. ;-)


Number of Articles I've Posted Month After Month


These are the numbers of articles that I've posted in the respective months; it's clear that for 4-5 months I haven't been able to post much. There is something to be said for quality over quantity, but I don't think that works as an excuse. Does it?


Amazingly though, the number of visitors and page views has increased month after month.



In the span of about 9 months with 112 articles, we have somewhat formed
our own community.


I know that many of our visitors are good programmers and have the ability
to share their knowledge and expertise.


If you feel like you have or know something that you'd like to share with thousands of other people in our community, then please submit your articles to us. We'll post them here and give you the platform.


Along with each article, a short note about the author (including the author's website and e-mail address) will also be published, so it's also a nice way to build links and traffic to your site.





So what are you waiting for? Write about anything you know (related to programming); our community needs it.


P.S. You may write an article on just about anything, e.g. an example program, an algorithm, or problem solving, in any programming language (C, C++, Java, C#, etc.).

Thank You


Click Here to Submit Your Article

Zero-sized arrays

It is valid to do the below:

    T * ptr = new T[0];

0 is a perfectly valid input to the array form of the new expression. What is the use of it, you may ask. No use as far as the array is concerned. What use is an array that does not have any members? What can you possibly do with it? Nothing! Right, you cannot do anything with it, but the fact that the statement above is valid is at least a little bit helpful.

It allows you to have code like this:

void f(size_t n)
{
     int * ptr = new int[n];
     //...
     delete[] ptr;
}

instead of:

void f(size_t n)
{
     if (n>0)
     {
         int * ptr = new int[n];
         //...
         delete[] ptr;
     }
}

No other use though, as said earlier; since it has zero elements, you cannot do anything with the array itself.

This reminds me of another peculiar extension related to zero-sized/unsized arrays. VC++ has an extension which allows an unsized array as the last member of a structure. For example, see this:

#include <cstdlib>
#include <iostream>

struct test
{
     int cb;
     int buf[];   // unsized array as the last member (VC++ extension)
};

int main()
{
     int length = 10;
     // allocate room for the struct plus 'length' ints for buf
     test* bb = (test*) malloc(sizeof(test) + sizeof(int) * length);
     bb->cb = length;
     // fill buf with 'length' int items (or copy from another array)
     for (int i = 0; i < length; ++i)
          bb->buf[i] = 10;
     free(bb);
     // OR: rely on the extension to initialize the unsized array in place
     test aa = {10, {1, 10, 22, 32}};
     std::cout << aa.buf[2];
}

Since the unsized member is in there, there are a few restrictions as well. You cannot create an array of the test struct. You can mimic that, though, like this:

    //create an array of length 10 of test pointers
    test * ptr[10];
    //set each of the pointers to individual elements created as above

Generally, you would be better off keeping a pointer member, but it is worth noting the presence of such an extension. For details, see the MSDN documentation: Unsized arrays.

The rationale, as per MSDN, is saving a runtime pointer dereference, and that is not specific to C. How big that overhead is for a particular application, and whether it justifies choosing this extension, is a different story altogether. You could even implement std::vector&lt;T&gt; using this extension rather than keeping a pointer to type T! That is, the std::vector&lt;T&gt; implementation shipped with VC++ could use this extension, but it doesn't. :-)

Interesting one, this one... isn't it? :-)

const in C and C++

Globals are usually bad, but not always. Consider C code that is to be migrated to C++ and has a bunch of global constants. As soon as that code is compiled as C++, those const globals will cause the build to start failing with "undefined reference" or similar linker errors.

This is because of a fundamental difference between how const is treated in C and C++ in the context of linkage.

In C, apart from the fact that const variables are non-modifiable, they behave like any other variables. In C++, const affects linkage as well: namespace-scope const objects/variables have internal linkage by default, and hence you get an unresolved symbol error when one compilation unit tries to access something with internal linkage that is defined in another compilation unit.

The solution is to give the constants external linkage: either place an extern declaration (the same one you put in the other files) right next to the definition, or simply define the variable as extern const instead of just const.
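
A minimal illustration (file names and variable names are hypothetical):

// globals.cpp
const int buffer_size = 512;        // C: external linkage; C++: internal linkage
extern const int max_users = 100;   // extern const: external linkage in C++ too

// other.cpp
extern const int max_users;         // OK when built as C++: finds the definition above
extern const int buffer_size;       // links as C, but "undefined reference" as C++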

#pragma once standardized?

I have seen in a few places that people have misheard or misread that #pragma once has been standardized for C++0x. It has not.

Even gcc has it labelled as an obsolete feature (for a couple of years now) - Obsolete once-only headers.

Since it is not standard in the first place, one should avoid using it. Standard include guards ("#ifndef") should be the choice; with good compiler support they are as efficient as what #pragma once promises, and they are guaranteed to be portable.
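
For reference, the portable form looks like this (the header name is hypothetical):

// my_header.h
#ifndef MY_HEADER_H
#define MY_HEADER_H

// ... declarations ...

#endif // MY_HEADER_H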

Reading file all at once

Sometimes you want to read a whole file in one go, not line by line and not by a fixed buffer size. Here's a way to get that done (using &lt;fstream&gt; and &lt;sstream&gt;):

    std::ifstream ifs("filename.txt");
    if (ifs)
    {
         std::stringstream oss;
         oss << ifs.rdbuf();   // the whole file is now in oss.str()
    }

As simple as that. The contents of the file are redirected into the stringstream and are available via oss.str(). The same approach can be used to make copies of files; you just need to replace the stringstream object with an output file stream.
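
For example, a minimal copy sketch (file names are placeholders):

#include <fstream>

int main()
{
     std::ifstream in("input.txt", std::ios::binary);
     std::ofstream out("copy.txt", std::ios::binary);
     if (in && out)
          out << in.rdbuf();   // copies the entire file in one statement
}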

Taking the Arc Challenge

Paul Graham, a very famous Lisp programmer who is well known for his continuation-based web framework, has invented a language called Arc (a dialect of Lisp). He came up with a challenge to see how others would implement a trivial web program in their favorite language.

Here's the Arc program:

(defop said req
  (aform [w/link (pr "you said: " (arg _ "foo"))
           (pr "click here")]
    (input "foo")
    (submit)))

Follow this link to see other submissions in different languages, and scroll down to the bottom to see the Run BASIC example: http://arclanguage.org/item?id=722

Now tell me which language you'd rather develop web apps in. BASIC is the one. ;-)

Web programming in... Java?

I mean, why? Of course that's how I feel about any kind of programming in Java. I recently found a blog where someone reviewed Wallace Wang's Beginning Programming book, which teaches Liberty BASIC. The blogger seemed to enjoy Liberty BASIC and then went on to say that he was planning to learn Java. I wish that those presidential candidates would promise to reform our educational system by banning Java from K-12 schools. People who are just learning programming need something that gives them a tighter feedback loop, like BASIC does. If the learning experience isn't at least a little bit of fun it won't benefit the kids. Period.

People who trash BASIC haven't tried modern versions. Even the old DOS QBasic has everything the beginner needs.

Knocked on the head with BASIC

A close friend of mine schooled in C++, Smalltalk, Java and Groovy told me the other day that he has written his first BASIC program by using Run BASIC. He had never written code in BASIC before, but he took my word for it that BASIC is a great language for throwing together solutions quickly. He said he was struck by the lightness of the BASIC language and that he enjoyed working in it. This is my paraphrase of what he said, because I don't remember his exact words.

I know there are a lot of modern BASIC implementations that force you to declare all your variables and give them types and sizes. Some of them keep the core keywords but add Pascal syntax, and some make BASIC look more like Java. I know there are some benefits to the way these other languages work, but BASIC is really meant to be very light and simple. In my humble opinion any language claiming to be BASIC which forces the programmer to dot too many i's and cross too many t's is not BASIC, but an imposter.

BASIC is a small language without too many rules.

iPhone development - activation experience

Since we have decided to create some iPhone development features for Run BASIC I bought an iPhone for this purpose. I thought it would be interesting to post my personal iPhone experience.

So, I went to the Apple store. I said that I wanted to buy an iPhone, so the youthful Apple employee grabbed me a small black box from behind the counter and handed it to me. "Big day," he said with a certain air of importance. I thought that was a little over the top. I mean, it's a phone. I wasn't having a baby. My daughter put it well, "Maybe if he was going to give you the $400 phone for free it would have been a big day." Ah well. I suppose Apple store employees can be forgiven for drinking the corporate koolaid. ;-)

I told him that I was buying the iPhone because I am working on easy development tools for it. He didn't quite get it right away that I wasn't working on iPhone apps for people to consume, but a really easy way for anyone to create their own iPhone apps that run in the Safari browser. I explained more carefully and got a gratifying 'Ahhhh' response from him. Gotta work on that marketing message.

So I took the phone home and unpacked it. There's no manual at all. There really should be for that price.

Alright, I understand that you activate the iPhone using iTunes. I am a Mac user (and a PC user) so I thought I would activate the phone from my Mac. That didn't work. I needed a newer version of iTunes. No big deal. I downloaded and installed the latest. Still no good. Why? Because then I discovered that I needed OS X v10.4.x or better. What then dawned on me made me a bit angry. I was going to need to activate the phone using iTunes on Windows. I think that qualifies as mistreatment by Apple of its customers.

Okay, so now I upgraded to the latest version of iTunes on my Vista box. I plugged the phone in and activated it. It went smoothly from there.

I'll post more about the iPhone and our work to support it using Run BASIC in the days to come.

Classification via Decision Tree

A decision tree is another model-based learning approach where the learned model is a tree. Here, we are given a set of data records with structure [x1, x2, ..., y], where y is the output. The learning algorithm learns (from the training set) a decision tree and uses it to predict the output y for future unseen data.

x1, x2, ... can be either numeric or categorical; y is categorical.

In the decision tree, a leaf node contains relatively "pure" data represented by a histogram {class_j => count}. An intermediate node is called a "decision node" and contains a test comparing an input attribute (e.g. x2) to a constant value. Two branches come out of it according to whether the test evaluates to true or false. If x2 is categorical, the test is an equality test (e.g. if "weather" equals "sunny"). If x2 is numeric, the test is a greater-than / less-than test (e.g. if "age" >= 40).


Building the Decision Tree

We start at the root node, which contains all the training samples, and need to figure out what our first test should be. Our strategy is to pick the test that divides the training samples into the two groups with the highest combined "purity".

A set of data records is "pure" if its outcomes gravitate towards one particular value; otherwise it is impure. Purity can be measured by the entropy or Gini impurity functions.

Gini impurity is measured by calculating the probability of picking two records at random from the set such that their outcomes are different.

Entropy is measured by calculating the following:
entropy = -sum_over_j( P(class_j) * log(P(class_j)) )

Note that the term -P * log(P) is close to zero both when P is close to zero and when P is close to one; for a two-class problem, the entropy is largest when P is about 0.5. The higher the entropy, the lower the purity.
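
A minimal sketch of both measures over a class histogram (C++ here purely for illustration; the map type and labels are assumptions):

#include <cmath>
#include <map>
#include <string>

// Entropy of a class histogram {class label -> count}.
double entropy(const std::map<std::string, int>& counts) {
    int total = 0;
    for (const auto& kv : counts) total += kv.second;
    double e = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / total;
        if (p > 0) e -= p * std::log2(p);   // -sum_j P(class_j) * log P(class_j)
    }
    return e;
}

// Gini impurity of the same histogram.
double gini(const std::map<std::string, int>& counts) {
    int total = 0;
    for (const auto& kv : counts) total += kv.second;
    double sum_sq = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;   // probability that two random draws differ in class
}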

Keep doing the following until the overall purity no longer improves:
  1. Try all combinations of attributes x1, x2, ... and candidate split values value1a, value1b, value2a, value2b, ...
  2. Pick the combination for which the divided data set has the best combined purity (the weighted sum of the two groups' purities, weighted by their relative sizes)
  3. To avoid the decision tree overfitting the training data, split a node only when the purity gain exceeds a threshold value (called pre-pruning)


def build_tree(set, func, threshold)
  # calculate_purity returns a purity score (higher = purer)
  orig_purity = calculate_purity(func, set)
  purity_gain = 0
  best_attribute = nil
  best_test_value = nil
  best_split_left_set = nil
  best_split_right_set = nil

  for x in each attribute
    for x_value in each possible value of x
      left_set, right_set = split(x, x_value, set)
      left_freq = left_set.size / set.size.to_f
      right_freq = right_set.size / set.size.to_f
      left_purity = calculate_purity(func, left_set)
      right_purity = calculate_purity(func, right_set)
      split_purity = left_freq * left_purity + right_freq * right_purity
      improvement = split_purity - orig_purity
      if improvement > purity_gain
        purity_gain = improvement
        best_attribute = x
        best_test_value = x_value
        best_split_left_set = left_set
        best_split_right_set = right_set
      end
    end
  end

  if purity_gain > threshold
    node = DecisionNode.new(best_attribute, best_test_value)
    node.left = build_tree(best_split_left_set, func, threshold)
    node.right = build_tree(best_split_right_set, func, threshold)
  else
    # pre-pruning: gain too small, so make a leaf holding the class histogram
    node = DecisionNode.new
    node.result = calculate_outcome_freq(set)
  end
  return node
end


Over Fitting and Tree Pruning

Since the training data set is not exactly the same as the actual population, we may be biased towards characteristics of the training data set that don't exist in the actual population. Such bias is called overfitting, and we want to avoid it.

In the above algorithm, we create an intermediate node only when the purity gain is significant. We can also do "post-pruning": build the full tree first and then work backward, merging leaf nodes whenever the resulting decrease in purity is small.

We can also set aside 1/4 of the training data as validation data: use the other 3/4 to build the tree first, and then use the 1/4 validation data to prune it. This approach reduces the effective training data size, so it is only practical when the training data set is large.


Classification and missing data handling

When a new query point arrives, we traverse the decision tree from its root, answering the question at each decision node, until we reach a leaf node; from there we pick the class with the highest instance count.

However, what if the values of some attributes (say x3 and x5) are missing? When we reach a decision node where x3 needs to be tested, we cannot proceed.

One simple solution is to fill in the most likely value of x3 based on the probability distribution of x3 within the training set. We can also roll a die according to that distribution, or use a three-way branch by adding "missing data" as one of the conditions.

Another, more sophisticated solution is to walk down both paths whenever the attribute under test is missing. We know the probability distribution at that junction point, which is used to calculate the weight of each branch; the final result is then aggregated according to these weights.
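
A minimal sketch of this weighted traversal, assuming a simple hypothetical node layout with numeric less-than tests (not taken from any particular library):

#include <map>
#include <memory>
#include <string>

struct Node {
    bool is_leaf = false;
    std::map<std::string, double> histogram;       // leaf: class label -> count
    std::string attribute;                         // decision node: attribute tested
    double threshold = 0.0;                        // numeric test: value < threshold goes left
    double left_weight = 0.5, right_weight = 0.5;  // fraction of training rows on each branch
    std::unique_ptr<Node> left, right;
};

// Returns a class histogram for the query; walks both branches (weighted)
// whenever the tested attribute is missing from the query.
std::map<std::string, double> classify(const Node& node,
                                       const std::map<std::string, double>& query) {
    if (node.is_leaf) return node.histogram;
    auto it = query.find(node.attribute);
    if (it == query.end()) {
        // Attribute missing: aggregate both subtrees according to the weights.
        auto l = classify(*node.left, query);
        auto r = classify(*node.right, query);
        std::map<std::string, double> combined;
        for (const auto& kv : l) combined[kv.first] += node.left_weight * kv.second;
        for (const auto& kv : r) combined[kv.first] += node.right_weight * kv.second;
        return combined;
    }
    return classify(it->second < node.threshold ? *node.left : *node.right, query);
}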
