Notes on Memcached

Some notes about Memcached and its architecture.


How it works
Memcached is organized as a farm of N servers. The storage model can be thought of as one huge hash table partitioned among these N servers.

Every API request takes a "key" parameter. There is a 2-step process in the client lib ...
  • Given the key, locate the server
  • Forward the request to that server
The server receiving the request does a local lookup for that key. The servers within the farm don't gossip with each other at all. Each server uses asynchronous, non-blocking I/O, so one thread can handle a large number of incoming TCP sockets. (Actually a thread pool is used, but the number of threads is independent of the number of incoming sockets.) This architecture scales well to a large number of incoming network connections.
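
To make the 2-step process concrete, here is a minimal Python sketch (not any real client lib; the servers list and its lookup method are hypothetical) of the classic "key mod N" routing:

def locate_server(servers, key):
    # step 1: given the key, locate the server
    return servers[hash(key) % len(servers)]

def get(servers, key):
    # step 2: forward the request to that server
    return locate_server(servers, key).lookup(key)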

API
Memcached provides a hash-table-like interface, so it has ...
  • get(key)
  • set(key, value)
Memcached also provides a richer "multi-get" so that one read request can retrieve values for multiple keys. The client library issues separate requests to multiple servers and does the lookups in parallel.
  • get_multi(["k1", "k2", "k3"])
Some client libs offer a "master-key" concept such that a key contains 2 parts, a master-key prefix and a suffix key. In this model, the client lib only uses the prefix to locate the server (rather than hashing the whole key) and then passes the suffix key to that server. So users can group entries to be stored on the same server by using the same prefix key.
  • get_multi(["user1:k1", "user1:k2", "user1:k3"]) -- This request goes to just the server hosting all keys of "user1:*"

For updating data, Memcached provides a number of variations (a client-library sketch follows the list below).
  • set(key, value, expiration) -- Memcached guarantees the item will no longer be served once the expiration time is reached. (Note that the item may also be evicted before expiration if the cache is full.)
  • add(key, value, expiration) -- Succeeds only when no entry for the key exists.
  • replace(key, value, expiration) -- Succeeds only when an entry for the key already exists.
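
As a concrete illustration of these calls, here is a hedged sketch using the python-memcached client (assuming a server on 127.0.0.1:11211; other client libs offer similar methods):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

mc.set("k", "v1", time=60)        # unconditional write, expires in 60 seconds
mc.add("k", "v2", time=60)        # fails: an entry for "k" already exists
mc.replace("k", "v3", time=60)    # succeeds: an entry for "k" exists
mc.replace("nope", "v", time=60)  # fails: no entry for "nope" exists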

Server Crashes
When one of the servers crashes, all entries owned by that server are lost. Higher resilience can be achieved by storing redundant copies of data on different servers. Memcached has no support for data replication; this has to be taken care of by the application (or client lib).

Note that the default server hashing algorithm doesn't handle growth or shrinkage of the server pool very well. When the number of servers changes, the ownership equation (key mod N) changes for nearly every key. In other words, if the crashed server needs to be taken out of the pool, the total number of servers decreases by one and all existing entries need to be redistributed to different servers. Effectively, the whole cache (across all servers) is invalidated even when just one server crashes.
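
A quick Python sketch (integer keys for simplicity) shows how badly "key mod N" behaves when the pool shrinks from 10 to 9 servers:

keys = range(10000)
unchanged = sum(1 for k in keys if k % 10 == k % 9)
print(unchanged / 10000.0)  # 0.1008 -- about 90% of the keys change owner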

So one approach to address this problem is to retain the number of Memcached servers across crashes. We can have a monitor that detects the heartbeat of every Memcached server and, when a crash is detected, starts a new server with the same IP address as the dead one. In this case, although the new server starts empty and has to repopulate the cache, the ownership of the keys is unchanged and data on the surviving nodes doesn't need to be redistributed.

Another approach is to run logical servers on a farm of physical machines. When a physical machine crashes, its logical servers are restarted on the surviving physical machines. In other words, the number of logical servers is unchanged even when crashes happen. This logical-server approach is also good when the underlying physical machines have different memory capacities: we can start more Memcached processes on the machines with more memory and spread the cache proportionally to memory capacity.

We can also use a more sophisticated technique called "consistent hashing", which localizes the ownership change to just the neighbor of the crashed server. Under this scheme, each server is assigned an id in the same key space as the keys. The owner of a key is the server whose id is the first one encountered when walking in the anti-clockwise direction. When a server crashes, its immediate upstream neighbor (continuing in the anti-clockwise direction) adopts the key ownership of the dead server, while the key ranges owned by all other servers are unchanged.
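
Here is a minimal Python sketch of such a ring (one point per server, no virtual nodes; whether you walk the ring clockwise or anti-clockwise is just a convention):

import bisect, hashlib

def h(s):
    # servers and keys live in the same key space
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def owner(self, key):
        # first server id encountered walking up from the key's hash
        hashes = [p[0] for p in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

Removing a crashed server from the ring hands its range to its immediate neighbor only; every other server keeps exactly the same key range.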


Atomicity

Each request to Memcached is atomic by itself, but there is no direct support for atomicity across multiple requests. However, the app can implement its own locking mechanism using the "add()" operation provided by Memcached, as follows ...
success = add("lock", null, 5.seconds)
if success
  set("key1", value1)
  set("key2", value2)
  delete("lock")
else
  raise UpdateException.new("fail to get lock")
end

Memcached also supports a "check-and-set" mechanism that can be used for optimistic concurrency control. The basic idea is to obtain a version stamp when getting an object and to pass that stamp to the set method. The server verifies the version stamp to make sure the entry hasn't been modified by someone else; otherwise it fails the update.
data, stamp = get_stamp(key)
...
check_and_set(key, value1, stamp)
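
For example, the python-memcached client exposes check-and-set as gets()/cas(); a hedged sketch (cache_cas and the method names are that library's, other clients differ):

import memcache

mc = memcache.Client(["127.0.0.1:11211"], cache_cas=True)
mc.set("counter", 1)

value = mc.gets("counter")            # fetch the value plus its version stamp
if not mc.cas("counter", value + 1):  # fails if someone updated in between
    pass  # reload with gets() and retry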

What Memcached doesn't do
Memcached's design goal is centered on performance and scalability. By design, it doesn't deal with the following concerns.
  • Authentication for client request
  • Data replication between servers for fault resilience
  • Keys > 250 chars
  • Large objects > 1 MB
  • Storing collection objects

Data Replication DIY
First of all, think carefully about whether you really need data replication at the cache level, given that cache data can always be recreated from the original source (although at a higher cost).

The main purpose of using a cache is performance. If your system cannot tolerate data loss at the cache level, rethink your design!

Although Memcached doesn't provide data replication, it can easily be done by the client lib or at the application level, based on the idea described below.

On the client side, we can use multiple keys to represent different copies of the same data. A monotonically increasing version number is attached to the data; this version number identifies the most up-to-date copy and is incremented on each update.

When doing an update, we update all the copies of the same data via their different keys.
def reliable_set(key, versioned_value)
  key_array = [key + ':1', key + ':2', key + ':3']
  new_value = versioned_value.value
  new_version = versioned_value.version + 1
  new_versioned_value = combine(new_value, new_version)

  for k in key_array
    set(k, new_versioned_value)
  end
end
For reading the data from the cache, use "multi-get" on the multiple keys (one per copy) and return the copy with the latest version. If any discrepancy is detected (i.e. some copies have a lagging version, or some copies are missing), start a background thread to write the latest version back to all copies.
def reliable_get(key)
  key_array = [key + ':1', key + ':2', key + ':3']
  value_array = get_multi(key_array)

  latest_version = 0
  latest_value = nil
  need_fix = false

  for v in value_array
    if v.nil?                       # a copy is missing
      need_fix = true
    elsif v.version > latest_version
      need_fix = true if latest_version > 0   # an older copy was seen first
      latest_version = v.version
      latest_value = v.value
    elsif v.version < latest_version
      need_fix = true               # this copy is lagging
    end
  end

  versioned_value = combine(latest_value, latest_version)

  if need_fix
    Thread.new do
      reliable_set(key, versioned_value)
    end
  end

  return versioned_value
end
When we delete the data, we don't actually remove it. Instead, we mark the data as deleted, keep it in the cache, and let it expire.
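
A Python sketch of such a delete under the same scheme, reusing the combine() and set() pseudo-operations from above (the tombstone marker itself is my assumption):

TOMBSTONE = "__deleted__"  # assumed sentinel; real values must never equal it

def reliable_delete(key, versioned_value, grace_period=30):
    # write a deletion marker with a bumped version to every copy,
    # then simply let the entries expire after the grace period
    for k in [key + ':1', key + ':2', key + ':3']:
        set(k, combine(TOMBSTONE, versioned_value.version + 1), grace_period)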

User Throttling
An interesting use case other than caching is throttling users who are too active; basically you want to reject requests from a user who issues them too frequently.
user = get(userId)
if user != null
  # an entry still exists, so the user made a request within the inactive period
  disallow request and warn user
else
  add(userId, anything, inactive_period)
  handle request
end

Machine Learning: Association Rule

A typical example is the "market-basket" problem. Let's say a supermarket keeps track of all purchase transactions. Each purchase transaction is a subset of all the items available in the store, e.g. {beer, milk, diaper, butter}.

The problem is: by analyzing a large set of transactions, can we discover correlations between subsets? i.e. people buying milk and butter have a high tendency to buy diapers, or people buying diapers tend to buy soda and ice cream.

Such correlation is called an association rule, which has the following form:
A => B where A, B are disjoint subsets of U (a universal set)

This rule can be interpreted as: from the transaction statistics, people buying all items in set A tend to also buy all items in set B.

Note that "people buying both set A AND set B" is denoted by (A union B) rather than (A intersect B).

There are two concepts that need to be defined here ...

"Support" is defined with respect to a subset X as the percentage of total transactions that contain X. This can be written as P(contains X). e.g. the support of {beer, diaper} is P(contains {beer, diaper}), which means: if we randomly pick a transaction, how likely is it to contain both beer and diaper?

The "support" of an association rule A => B is defined as the support of (A union B).

"Confidence" is defined with respect to a rule (A => B) as: given that we know a transaction contains A, how likely is it that it also contains B?

P(contains B | contains A) = P(contains A union B) / P(contains A), which is the same as Support(A union B) / Support(A)
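
Both definitions are easy to compute directly; a small Python sketch over a toy transaction list:

transactions = [
    {"beer", "diaper"},
    {"milk", "butter", "diaper"},
    {"beer", "milk", "diaper", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item in the set
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(a, b):
    # confidence(A => B) = support(A union B) / support(A)
    return support(set(a) | set(b)) / support(a)

print(support({"milk", "butter"}))                 # 0.75
print(confidence({"milk", "butter"}, {"diaper"}))  # 0.666...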

Mining Association Rules

The problem is how we can discover the association rules that have high enough "support" and "confidence". First of all, thresholds for "support" and "confidence" are set according to domain-specific concerns. There are two phases.

1) Extract all subsets X where support(X) > thresholdOfSupport
2) For each extracted subset X, discover rules A => B where A is a subset of X and B is (X - A)

Step 1) is also known as the "finding frequent subsets" problem. A naive implementation generates all possible subsets and checks their support values, which has exponential complexity O(2^N).

Apriori Algorithm to find frequent subsets

This algorithm exploits the fact that if X is not a frequent subset, then (X union anything) can never be a frequent subset. So it starts by scanning small subsets and throws away those that don't have high enough support. In other words, it prunes the tree as it grows. (A compact Python sketch follows the round-by-round walkthrough below.)

Let's say the universal set is {I1, I2, I3, I4, I5, I6}

First round:
  • Generate the candidate 1-item subsets, i.e. {I1}, {I2}, ... {I6}
  • Compute the support of each candidate, i.e. support({I1}), support({I2}), ... support({I6})
  • Filter out those whose support < supportThreshold
Second round:
  • From the surviving 1-item subsets, generate the candidate 2-item subsets, i.e. {I1, I2}, {I1, I4} ... Note that we can skip any subset containing an item filtered out in the first round (say I3 was out).
  • Compute the support of each 2-item candidate.
  • Filter out those whose support < supportThreshold

Round k: (repeat until no (k-1)-item subsets survive)
  • From the surviving (k-1)-item subsets, generate the candidate k-item subsets by adding one more item that is not already in the (k-1)-item subset. Skip any k-item subset that contains a (k-1)-item candidate thrown away in the previous round.
  • Calculate the support of k-item candidates
  • Throw away those whose support value < supportThreshold
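
Here is a compact Python sketch of these rounds (toy scale: it operates on a list of item sets and recounts support on every pass rather than optimizing the scans):

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n

    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result = list(frequent)

    while frequent:
        # grow k-item candidates from the surviving (k-1)-item subsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        # skip any candidate containing a thrown-away (k-1)-item subset
        survivors = set(frequent)
        candidates = {c for c in candidates
                      if all(frozenset(s) in survivors
                             for s in combinations(c, len(c) - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result += frequent
    return result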

Find association rules from frequent subsets

After finding a frequent k-item subset X, we want to find its subsets A such that the confidence of the rule (A => X-A) is higher than the confidence threshold.

Note that confidence = support(X) / support(A)

Since support(X) is fixed for a given X, we start by looking for the lowest support(A), because that gives the highest confidence. A superset can only have lower (or equal) support than its subsets, so we do the reverse process: start from the largest subsets A within X and shrink A in each round.

Notice that if support(A) is not low enough (i.e. the confidence is below threshold), there is no need to try any subset A' of A, because support(A') can only be higher. Therefore we only spend effort on subsets of the surviving A. (A sketch follows the round descriptions below.)

First round:
  • Within X, generate the candidate (k-1)-item subsets A.
  • Compute the confidence of each candidate.
  • Filter out those whose confidence < confidenceThreshold
  • For each surviving (k-1)-item subset A, record the rule (A => X-A)
Round j: (repeat until no (j-1)-item subsets survive)
  • From the surviving j-item subsets A, generate the candidate (j-1)-item subsets.
  • Compute the confidence of each candidate.
  • Filter out those whose confidence < confidenceThreshold
  • For each surviving (j-1)-item subset A', record the rule (A' => X-A')
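
A Python sketch of this reverse process (support here is a function, e.g. the one from the Apriori sketch above):

from itertools import combinations

def rules_from(X, support, min_confidence):
    X = frozenset(X)
    rules = []
    if len(X) < 2:
        return rules  # need at least two items to split into A and X-A
    antecedents = {frozenset(c) for c in combinations(X, len(X) - 1)}
    while antecedents:
        surviving = []
        for A in antecedents:
            if support(X) / support(A) >= min_confidence:
                rules.append((set(A), set(X - A)))  # record the rule A => X-A
                surviving.append(A)
        # shrink only the surviving antecedents: a subset of a failed A
        # has higher support, hence even lower confidence
        antecedents = {frozenset(c) for A in surviving if len(A) > 1
                       for c in combinations(A, len(A) - 1)}
    return rules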

Miscellaneous

Note that confidence is absolute, not relative.
When (A => B) has confidence = 75%, it is also possible that (!A => B) has confidence = 90%. In other words, some rules may contradict each other, and usually the one with higher support and confidence wins.

C++ Variable : Declaration and Assignment

What is a variable?

A variable reserves space in memory to store data. You can change the data stored in a variable; since the data can vary, it is called a variable.
When declaring a variable you specify what type of data it will store, which determines how much memory to reserve. The following table shows each data type, the memory it reserves, and the range of values it can store.


Type                     Size      Value
bool                     1 byte    true or false
unsigned short int       2 bytes   0 to 65,535
short int                2 bytes   -32,768 to 32,767
unsigned long int        4 bytes   0 to 4,294,967,295
long int                 4 bytes   -2,147,483,648 to 2,147,483,647
int (16 bit)             2 bytes   -32,768 to 32,767
int (32 bit)             4 bytes   -2,147,483,648 to 2,147,483,647
unsigned int (16 bit)    2 bytes   0 to 65,535
unsigned int (32 bit)    4 bytes   0 to 4,294,967,295
char                     1 byte    256 character values
float                    4 bytes   1.2e-38 to 3.4e38
double                   8 bytes   2.2e-308 to 1.8e308


Note : The size of some variable types differs between 16-bit and 32-bit processors.
Tip : You can use "short" instead of "short int" and "long" instead of "long int".

On a 32-bit processor the value of a short ranges from -32,768 to 32,767, but if you declare it unsigned short the range is 0 to 65,535. You cannot store a negative value in an unsigned type, so never declare a variable unsigned if there is any possibility it will hold a negative value.

How to declare a variable?

To declare a variable, first write the variable type (also known as the data type), followed by the variable name, and then terminate the statement with a semicolon.
For example:
int number;
In the above example, int is the data type and number is the name of the variable. You can give the variable any name except reserved keywords. A variable name can contain letters, numbers, or underscores, but the first character must be either a letter or an underscore.

How to assign a value to a variable?

There are two ways to assign a value to a variable:
1) Assign the value directly in the program.
2) Ask the user to input a value and then assign that value.

A C++ program example that assigns values directly in the program.


#include <iostream>

int main ()
{

    using std::cout;
    using std::endl;

    int a = 10, b = 20;
    int sum = a + b;

    cout << "Addition is : " << sum;

    return 0;
}


A C++ program example that asks the user to input values and then assigns them.


#include <iostream>
#include <iomanip>

int main ()
{

    using namespace std;

    int a, b, sum;

    cout << "Enter two numbers for addition." << endl;

    cout << "First" << setw (3) << ": ";
    cin >> a;

    cout << "Second" << setw (1) << ": ";
    cin >> b;

    sum = a + b;
    cout << "Addition is : " << sum;

    return 0;
}


A C++ program example that demonstrates the declaration and assignment of various data types.


#include <iostream>
#include <iomanip>
#include <string>   // needed for the std::string used below

int main ()
{

    using namespace std;

    short myShort = 2000;
    int myInt = 200000;
    long myLong = 5000000;
    float myFloat = 1255.549f;   // f suffix: a float literal, not a double
    double myDouble = 78079.3;
    char myChar = 'a';
    char myString1[20] = "I love ";
    string myString2 = "C++ Programming";

    cout << "myChar" << setw (5) << ": " << myChar << endl;
    cout << "myShort" << setw (4) << ": " << myShort << endl;
    cout << "myInt" << setw (6) << ": " << myInt << endl;
    cout << "myLong" << setw (5) << ": " << myLong << endl;
    cout << "myFloat" << setw (4) << ": " << myFloat << endl;
    cout << "myDouble" << setw (3) << ": " << myDouble << endl;
    cout << "myString1" << setw (1) << ": " << myString1 << endl;
    cout << "myString2" << setw (1) << ": " << myString2 << endl;

    return 0;
}


C++ Notes:

(1) As we know, in the English language a statement ends with a dot (full stop). In programming languages like C, C++, and Java, a statement ends with a semicolon.
Thus "int age;" is called a statement in C++, and since it declares a variable, it is called a declaration statement.
(2) "age = 20;" is also a statement, as it ends with a semicolon. This statement assigns a value to the variable, hence it is called an assignment statement.
(3) The statement "int age = 20;" does two jobs: it declares the variable and assigns a value to it. It is a compound of a declaration statement and an assignment statement.
(4) In C++, (=) is called the assignment operator, not "equal to" as in mathematics. It is used to assign a value. The equality operator in C++ is (==).
(5) As discussed in an earlier lesson, cout in C++ is used to display output on the console. Similarly, cin is used to take input (a value) from the user and assign it to a variable.

C++ Comments : Types and Uses

Why use comments?

You write a program so that it performs a given set of instructions. As you grow the code by adding instructions, it may become confusing. You can use comments to clarify it by noting how or what a given piece of code does. If you give your code to another programmer, it will be easier for them to understand if you have used comments wisely. It will also be helpful for you: with comments you keep a better idea of your program even after it becomes large.

Types of comments

There are two types of comments used in C++.
(1) Single-line comments :
A single-line comment starts with a double slash (//). Everything following the double slash until the end of the line is ignored by the compiler. This is referred to as a C++-style comment, as it was originally part of C++.

(2) Multi-line comments :
A multi-line comment starts with a forward slash followed by an asterisk (/*) and ends with an asterisk followed by a forward slash (*/). Everything between (/*) and (*/) is ignored by the compiler, whether it spans one line or many. This is referred to as a C-style comment, as it was introduced in C.

A C++ program example that demonstrates the syntax and use of C++ comments.


/*
Program : C++ Comments
Source code written by : Mohammed Homam
Last modified date : 8:09 PM 10/21/2009
*/

#include <iostream>

int main ()
{

    using std::cout;
    using std::endl;

    cout << "C++ Programming" << endl;    // It will display
                                          // C++ Programming
    return 0;
}

The challenging 'Infinite loop' problem

Picked this one up from The C++ blog.

Consider the following program:



#include <iostream>

int main()
{
    int i;
    int array[4];
    for (i = 0; i <= 8; i++)
    {
        array[i] = 0;
    }

    return 0;
}


Most people, including myself, expect this program to crash. This is not the case. The reason is the way everything is stored on the stack: in a simple representation, the loop counter i sits on the stack just past the end of array, so the out-of-bounds writes eventually land on i itself.

So when i = 6 (in this particular stack layout), array[i] = 0 overwrites i, the value of i is reset to 0, and the loop continues forever.

After digging up some information on the stack, heap, etc., I found some useful info in C++ in 24 Hours:

Programmers generally deal with five areas of memory:
  • Global name space
  • The free store (a.k.a. Heap)
  • Registers
  • Code space
  • The stack

Local variables are on the stack, along with function parameters. Code is in code space, of course, and global variables are in global name space. The registers are used for internal housekeeping functions, such as keeping track of the top of the stack and the instruction pointer. Just about all remaining memory is given over to the free store, which is sometimes referred to as the heap.

The problem with local variables is that they don't persist. When the function returns, the local variables are thrown away. Global variables solve that problem at the cost of unrestricted access throughout the program, which leads to the creation of code that is difficult to understand and maintain. Putting data in the free store solves both of these problems.

You can think of the free store as a massive section of memory in which thousands of sequentially numbered cubbyholes lie waiting for your data. You can't label these cubbyholes, though, as you can with the stack. You must ask for the address of the cubbyhole that you reserve and then stash that address away in a pointer.

The stack is cleaned automatically when a function returns. All the local variables go out of scope, and they are removed from the stack. The free store is not cleaned until your program ends, and it is your responsibility to free any memory that you've reserved when you are done with it.

The advantage to the free store is that the memory you reserve remains available until you explicitly free it. If you reserve memory on the free store while in a function, the memory is still available when the function returns.

The advantage of accessing memory in this way, rather than using global variables, is that only functions with access to the pointer have access to the data. This provides a tightly controlled interface to that data, and it eliminates the problem of one function changing that data in unexpected and unanticipated ways.

For this to work, you must be able to create a pointer to an area on the free store and to pass that pointer among functions. The pointer is created using new and once you are done with it you can free the memory using delete.

Reading string via cin and flushing input buffers

The following program shows three different approaches to reading a string via cin. It also shows two different approaches to flushing the input buffer.


//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//Program to read input string and to flush input buffer
#include <iostream>
#include <string>
#include <istream>
#include <limits>

using namespace std;

int main()
{
    string s1;
    cout << "\nAPPROACH 1" << endl;
    cout << "Input a string : ";
    cin >> s1;
    cout << "Entered string = " << s1.c_str() << endl;

    // Lets flush the buffer from Approach 1 here
    istream& in(cin);
    in.ignore(numeric_limits<std::streamsize>::max(), '\n');

    cout << "\nAPPROACH 2" << endl;
    cout << "Input a string : ";
    istreambuf_iterator<char> eos;                // end-of-range iterator
    istreambuf_iterator<char> iit(cin.rdbuf());   // stdin iterator
    string s2;
    while (iit != eos && *iit != '\n') s2 += *iit++;
    cout << "Entered string = " << s2 << endl;

    // Another approach for flushing the input
    int ch;
    while ((ch = cin.get()) != '\n' && ch != EOF);

    cout << "\nAPPROACH 3" << endl;
    cout << "Input a string : ";
    char buf[BUFSIZ];
    if (cin.getline(buf, sizeof(buf)))
    {
        cout << "Entered string = " << buf << endl;
    }

    cout << endl;
    return 0;
}


BASIC and Space Flight!

The same fellow who created an OpenGL flight simulator in Liberty BASIC has taken it a step further by creating a spaceflight simulator with a map editor. Very neat!

Check it out

Using #pragma message directive

Sometimes you would like to remind yourself (or the user) of something at compile time; one option is the #pragma message directive. See the following program:



//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//Program to show how to remember things using #pragma message
#include <iostream>

using namespace std;

int someFunction(int a = 0, int b = 0)
{
#pragma message("TODO: Coding of someFunction()")
    return 0;
}

int main()
{
    someFunction();
    return 0;
}

In this case the user wants to remind himself that someFunction() needs to be coded at a later time, and hence he uses a preprocessor message statement. The result is that when compilation is done, the message "TODO: Coding of someFunction()" appears in the compiler output.

Re-engineer Legacy Architecture

"Legacy system" may sound like a negative term, but in fact a legacy system is the result of a successful system. Most legacy systems lie at the core business operations of many enterprises. In fact, it is precisely because of their criticality to the business that no one dares to make changes, because a small bug can have a huge financial impact.

Most of these legacy systems started with a clean design, as the original problem they tried to solve was smaller and well-defined. However, as business competition and organizational evolution continuously demand new enhancements/features within a relatively short timeframe, these new features typically get implemented in completely separate modules so that existing working code won't be touched. Common functionality coded in the original system gets copied into the new module. This is a very common syndrome of what I call "reuse via copy and paste".

Here is an earlier blog post about some common mindsets that cause the formation of bad code.

Basically, the idea of "not touching existing working code" encourages a "copy and paste code" culture which, over time, causes a lot of code duplication across many places. Once a bugfix is developed, you need to make sure the fix is applied to all the copies. Once you enhance a feature, you need to roll it into all the copies. When there are 20 different places in the code doing the same thing (but slightly differently), you start to lose visibility into which piece takes ultimate responsibility for a certain bit of logic. Now the code is very hard to maintain and also hard to understand.

Because you cannot understand it, you are more scared of making changes to existing code (since it is working). This further encourages you to put new features in completely separate modules, which worsens the situation. The cycle repeats.

Over a period of time, the code is so unmaintainable that adding any new feature takes a long time and usually breaks existing code in many places; the development team doesn't feel productive and works with low morale. In my past career, I was brought in to help with this situation.

At a high level, here are the key steps ...

1. Identify your target architecture

Define a "to-be" architecture that can support the business objectives for the next 5 years. It is important to purposely ignore the current legacy system at this stage, because otherwise you won't be able to think "outside the box".

Be cautious not to give the impression that this exercise is going to suggest throwing the existing system away and starting everything from scratch. It is important to understand that the "to-be" architecture is primarily a thought exercise to define our target, and we should clearly separate our "vision" from the "execution", which we shouldn't be worrying about at this stage.

The long-term architecture establishes a vision of where we want the ultimate architecture to be and serves as our long-term target. A core vs non-core analysis is also necessary to decide which components should be built vs bought.

It is also important to get a sense of possible future changes and build enough flexibility into the architecture that it can adapt to those changes when they happen. Knowing what you don't know is very important.

A top-down approach is typically used to design the to-be architecture. The level of detail is determined by how well the requirements are known and how likely they are to change in future. Since the to-be architecture mainly serves as a guiding target, I usually won't get too deep into implementation details at this stage.

The next step is to get on the ground and understand where you are now.


2. Understand your existing system

To get up to speed quickly, my first step is to talk to the people who understand the current code base as well as the pain points. I'd also skim through existing documents and presentation slides to get a basic understanding of the existing architecture.

If the people who are knowledgeable about how the legacy system works are still available, conducting a formal architecture review can be a very efficient way to start understanding the legacy system.

If these people have already left, a different reverse-engineering process is needed.


3. Define your action plan

At this point, you have a clear picture of where you are and where you want to be. The next step is to figure out how to move from here to there. In fact this stage is the hardest, because many factors need to be taken into consideration.
  • Business priorities and important milestone dates
  • Risk factors and opportunity costs
  • Organization skill set distribution and culture
The next step is to construct an execution plan that optimizes business opportunities and minimizes cost and risk. Each risk also needs an associated contingency plan (plan B). In my experience, the action plan usually takes one of the following forms.

Parallel development of a green-field project
A small team of the best developers works in parallel (alongside the legacy system) to create the architecture from scratch. The latest, best-of-breed technologies are typically used, so that most of the infrastructure pieces can either be bought or fulfilled by open-source technologies. The team focuses on rewriting just the core business logic. The green-field system is typically easier to understand and more efficient.

After the green-field system is sufficiently tested, the existing legacy system is swapped out (or kept as a contingent backup). Careful planning of data migration and traffic migration is important to make sure the transition is smooth.

One problem with this approach is development cost, because during the transition period you need to maintain two teams of developers working on two systems. New feature requirements may come in continuously, and you may need to do the same work twice in both systems.

Another problem is the morale of the developers who maintain the legacy system. They know the system is going away, so they may start looking for another job. You may end up losing the people who are knowledgeable about your legacy system even faster.

Refactoring
Another approach is to refactor the current code base and incrementally bring it back into good shape. This involves repartitioning the responsibilities of existing components and breaking down complex components and long methods.

Sometimes I run into code that I am not able to understand: it may be dead code that never gets exercised, or logic hidden behind many levels of indirection. What I typically do is add trace statements to the code and rerun the system to see when this code executes and who is calling it (by observing the stack trace). I will also put a wrapper around the code that I don't understand and shrink the wrapper's perimeter to the point where I can safely swap out the component.

It is also quite common that a legacy system lacks unit tests, so a fair amount of effort may need to be spent writing unit tests around the components.

One problem with the refactoring approach is that it is not easy to get management buy-in, because they don't see any new features coming out of the engineering effort being spent.

Architecture Review Process

If you are lucky enough to have kept the engineers who are knowledgeable about the current system, then conducting a formal "Architecture Review Process" is probably the most efficient way to understand how the existing system works.

However, if these people have already left the company, then a different reverse-engineering effort needs to be undertaken instead.

Participants
It usually involves a number of key persons:
  • A facilitator who orchestrates the whole review process and controls the level of depth at different stages
  • A recorder who documents key points, ideas, observations, and outstanding issues throughout the review process
  • Key architects and developers who collectively understand the details of the legacy system and are able to get down to any level of detail (even a code walkthrough) if necessary
  • A domain expert who understands every detail of how people use the system, their current pain points, and which features would help them most. The domain expert also helps set business priorities between conflicting goals.
Process

The architecture review process has the following steps.

Use Cases and Actors
With the business domain expert, we focus on the "actors" (who use the system) as well as the "use cases" (how they use it). We also look at the key metrics for measuring the efficiency of their tasks (e.g. how long it takes to complete a use case).


Activity Analysis
Drilling down from each use case, we identify the activities each actor performs. We look at what data needs to be captured in each activity and how actors interact with each other as well as with the system.

At this point, we should have a good external view of the system. Now we dig into its internals ...

Technology Stack
The purpose here is to understand the building blocks underlying the system and get a good sense of whether the build-vs-buy mix is right. Things like the programming language (Java vs .NET), the app server, the DB, ORM vs direct SQL, XML vs JSON, the IoC or AOP container, the messaging framework, etc. need to be discussed. We also need to distinguish the core features (which you mostly want to build) from the non-core features (where you mostly want to leverage 3rd-party code). By the end of this exercise, we'll have a very good understanding of the foundation on which we write our code, and perhaps we can also identify certain areas where we can swap in 3rd-party code.

Component Analysis
This is where most of the time is spent. Here we dissect the whole system into components. It starts with the architect highlighting a list of the major components of the current system. For each component, we look at:
  • The responsibility of the component
  • The persistent data owned by the component and the life cycle of maintaining this data
  • The interface of the component
  • The threading model executing the component's logic (i.e. caller thread vs listening thread vs a newly spawned thread), as well as any concurrent-access implications
  • The potential bottlenecks of this component, and how we can remove them when they occur
  • How the component scales along different growth dimensions (e.g. more users, more data, higher traffic rates)
  • The impact if this component crashes, and how recovery happens
It is important to know whether the component communicates across VM boundaries. If so:
  • What is the authentication and authorization mechanism?
  • What is the message format being communicated?
  • Is the data transferred in clear text or encrypted? And where is the secret key stored?
  • What is the handshaking sequence (protocol)?
  • Is the component stateful or stateless?
Since we are already diving deep into the architecture, I usually take a further step and drill into the code by asking the developers to walk me through some of the key components. This usually gives me a good sense of the code quality: whether the code is easy to read, whether the logic is easy to follow, whether methods run too many lines, whether duplicated logic is scattered around, etc., and, more importantly, whether there are sufficient unit tests.

Maintainability Analysis
This focuses on the ongoing maintenance of the architecture, things like ...
  • Is the system sufficiently instrumented for monitoring purposes?
  • When a problem happens, is there enough tracing to quickly identify what went wrong?
  • Can the system continue working (with some tolerable degradation) when some components fail?
Extensibility Analysis
Understand the parameters that affect the behavior of the system. When different change scenarios happen, how much code needs to change to accommodate them? Or can the system cope by just changing configurable parameters? For example, does the system hard-code business rules or use some kind of rule engine? Does it hard-code the business flow or use some kind of workflow system? What if the system needs to serve a different UI device (like mobile devices)?

Output

The output of an architecture review process is typically:
  • A set of documents/diagrams of the key abstractions that give a better view of the overall system. These documents should help a newcomer get up to speed quicker, and help communicate the architecture to a broader audience.
  • A set of recommended action plans on what can be done to improve the current architecture.

Data Encryption Decryption using AES Algorithm, Key and Salt with Java Cryptography Extension

In this tutorial we will implement a full data encryption/decryption cycle in Java (only data, not file encryption): encrypt some data using a secret key, a salt, and iterations, and decrypt using the same parameters. We are using the Java Cryptography Extension (JCE) for the encryption/decryption operations. This extension ships with Java 1.4.2 and above; you will have to download it manually for older versions (here). Java supports a number of encryption algorithms; however, we will demonstrate only the use of the AES algorithm (the Advanced Encryption Standard).

Why should we encrypt data?

Encryption and decryption are highly important security steps; there is software for both file and data encryption. Data encryption is the mechanism of converting a message (plain text) into another text (called ciphertext) so that readers cannot understand the original message; an authorized party can recover the original message using the reverse method, called decryption. As many software applications store sensitive personal data in databases, encryption has become a must. Ideally nobody (including the software developers) should be able to view this user-specific data.

Simple Data Encryption/Decryption Example with AES

For encryption we must use a secret key along with an algorithm. In the following example we use the AES algorithm with a 128-bit key, taking the bytes of the word "ThisIsASecretKey" as the secret key (the best secret key we found in this world). AES can use a key of 128 bits (16 bytes * 8), so we selected that key.

package org.kamal.crypto;

import java.security.*;
import java.security.spec.InvalidKeySpecException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import sun.misc.*;

public class SimpleProtector {

    private static final String ALGORITHM = "AES";
    private static final byte[] keyValue =
            new byte[] { 'T', 'h', 'i', 's', 'I', 's', 'A', 'S', 'e', 'c', 'r', 'e', 't', 'K', 'e', 'y' };

    public static String encrypt(String valueToEnc) throws Exception {
        Key key = generateKey();
        Cipher c = Cipher.getInstance(ALGORITHM);
        c.init(Cipher.ENCRYPT_MODE, key);
        byte[] encValue = c.doFinal(valueToEnc.getBytes());
        String encryptedValue = new BASE64Encoder().encode(encValue);
        return encryptedValue;
    }

    public static String decrypt(String encryptedValue) throws Exception {
        Key key = generateKey();
        Cipher c = Cipher.getInstance(ALGORITHM);
        c.init(Cipher.DECRYPT_MODE, key);
        byte[] decodedValue = new BASE64Decoder().decodeBuffer(encryptedValue);
        byte[] decValue = c.doFinal(decodedValue);
        String decryptedValue = new String(decValue);
        return decryptedValue;
    }

    private static Key generateKey() throws Exception {
        Key key = new SecretKeySpec(keyValue, ALGORITHM);
        // SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(ALGORITHM);
        // key = keyFactory.generateSecret(new DESKeySpec(keyValue));
        return key;
    }
}

We use the generateKey() method to generate a secret key for the AES algorithm from the given key bytes. You can change the algorithm used by changing this key generation; the commented-out code shows the use of the DES algorithm (Data Encryption Standard). Following is a simple class to test the above implementation.

package org.kamal.crypto;

public class TestSimpleProtector {

    public static void main(String[] args) throws Exception {
        String password = "mypassword";
        String passwordEnc = SimpleProtector.encrypt(password);
        String passwordDec = SimpleProtector.decrypt(passwordEnc);

        System.out.println("Plain Text : " + password);
        System.out.println("Encrypted : " + passwordEnc);
        System.out.println("Decrypted : " + passwordDec);
    }
}

Following is the output we got from the above test; you can clearly see that the original text is reproduced by the decryption operation.

Plain Text : mypassword
Encrypted : sBhCap4urE50a/dGuhNgrw==
Decrypted : mypassword

The encrypted value is not obviously related to the plain text, so it provides some security for the password.

The Risk with Simple Protector

When you use the above method to encrypt all the passwords in your database, you make an attacker's task easier. Since the same key is used for every entry, if an attacker finds a way to recover one plain text, he can use the same method to recover all the other plain texts in the database in minutes.

Use Salt and iterations to improve

To make the attacker's job harder we can use two methods: adding a salt and using iterations.
A salt is additional plain text appended to the given plain text before the encrypted value is generated. We use a unique salt per plain text so that the value being encrypted is much stronger, making it harder for an attacker to guess with a brute-force or dictionary attack. The following class shows the improved version of the SimpleProtector class.

package org.kamal.crypto;

import java.security.*;
import java.security.spec.InvalidKeySpecException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import sun.misc.*;

public class Protector {

    private static final String ALGORITHM = "AES";
    private static final int ITERATIONS = 2;
    private static final byte[] keyValue =
            new byte[] { 'T', 'h', 'i', 's', 'I', 's', 'A', 'S', 'e', 'c', 'r', 'e', 't', 'K', 'e', 'y' };

    public static String encrypt(String value, String salt) throws Exception {
        Key key = generateKey();
        Cipher c = Cipher.getInstance(ALGORITHM);
        c.init(Cipher.ENCRYPT_MODE, key);

        String valueToEnc = null;
        String eValue = value;
        for (int i = 0; i < ITERATIONS; i++) {
            valueToEnc = salt + eValue;
            byte[] encValue = c.doFinal(valueToEnc.getBytes());
            eValue = new BASE64Encoder().encode(encValue);
        }
        return eValue;
    }

    public static String decrypt(String value, String salt) throws Exception {
        Key key = generateKey();
        Cipher c = Cipher.getInstance(ALGORITHM);
        c.init(Cipher.DECRYPT_MODE, key);

        String dValue = null;
        String valueToDecrypt = value;
        for (int i = 0; i < ITERATIONS; i++) {
            byte[] decodedValue = new BASE64Decoder().decodeBuffer(valueToDecrypt);
            byte[] decValue = c.doFinal(decodedValue);
            dValue = new String(decValue).substring(salt.length());
            valueToDecrypt = dValue;
        }
        return dValue;
    }

    private static Key generateKey() throws Exception {
        Key key = new SecretKeySpec(keyValue, ALGORITHM);
        // SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(ALGORITHM);
        // key = keyFactory.generateSecret(new DESKeySpec(keyValue));
        return key;
    }
}

You need to store the clear salt alongside the stored encrypted value, since the salt is required for the decrypt operation. We just used a sentence as the salt here, but you should use random letters/numbers to avoid dictionary attacks.

package org.kamal.crypto;

public class TestProtector {

    public static void main(String[] args) throws Exception {
        String password = "mypassword";
        String salt = "this is a simple clear salt";
        String passwordEnc = Protector.encrypt(password, salt);
        String passwordDec = Protector.decrypt(passwordEnc, salt);

        System.out.println("Salt Text : " + salt);
        System.out.println("Plain Text : " + password);
        System.out.println("Encrypted : " + passwordEnc);
        System.out.println("Decrypted : " + passwordDec);
    }
}

The output from the above class looks as follows.

Salt Text : this is a simple clear salt
Plain Text : mypassword
Encrypted : GX8jEr04o6PC+IE+f6DXq+zuRiZ2nsSL+UYxYRh9vK28xt/GrLqnbrP3hkXSW5S3jNgzXkrLSmXm
fwRw+eBMWQJO+8tlSEE9D2Y9qEzowX18s4W81U/D9JL3JX7uJtwp
Decrypted : mypassword

When using a salt, you can store each (clear text) salt in the database alongside each encrypted text. It would also be better to pass the secret key in while initializing or instantiating the (Protector) encryption class rather than hard-coding it in the class itself (for simplicity we used a hard-coded key in this tutorial).

Using Preprocessor Statements #ifdef, #elif, #else and #endif

Sometimes it becomes necessary to write two (or more) different pieces of behaviour in common code; depending on its mode/place/time/version, the behaviour has to change. This can be done with simple preprocessor statements. Consider the following program:


//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//Program to show use of preprocessor macros for dynamic behaviour

#include <iostream>
#include <string>

using namespace std;

#define USE_CAR //Comment or uncomment this if using the preprocessor option below

int main()
{
    int someNum = 3;

#ifdef USE_CAR
    string modeOfTransport("Use a Car");
#elif defined(USE_BUS)
    string modeOfTransport("Use a Bus");
#elif defined(USE_TRAIN)
    string modeOfTransport("Use a Train");
#else
    string modeOfTransport("Use any mode of transport");
#endif

    cout << "Mode of Transport = " << modeOfTransport << " and someNum = " << someNum << endl;

    return 0;
}



The output is as follows:

Mode of Transport = Use a Car and someNum = 3

In the program above, the statement #define USE_CAR at the start says that the code guarded by USE_CAR should be compiled in. If USE_BUS is defined instead, that piece of code will be used. If nothing is defined, then the #else branch is used.

There is another way to use these preprocessor directives: define the macro in the project properties (the compiler's preprocessor definitions, e.g. /D USE_BUS in Visual Studio). If it is defined this way, then the #define statement should be omitted from the main program.



For more information see this.
