Deleting vector contents to avoid memory leaks

It's interesting how deleting vectors can sometimes be a problem: people tend to make simple mistakes that cause crashes and waste hours of debugging.

The following code shows my approach to deleting vectors. I am sure there are better approaches; if you know one, please share.




//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//Program to demonstrate a simple way to delete a vector and the memory
//associated with it.

#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <iostream>
#include <crtdbg.h>
#include <vector>

#ifdef _DEBUG
#define DEBUG_NEW new(_NORMAL_BLOCK, __FILE__, __LINE__)
#define new DEBUG_NEW
#endif

using namespace std;

int main()
{
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    vector<int *> someVector;
    for (int i = 0; i < 5; i++)
    {
        int* x = new int(i * 15);
        someVector.push_back(x);
    }

    //now we need to delete the elements of the vector else it will generate memory leaks
    //This is one simple and safe approach
    while (!someVector.empty())
    {
        vector<int *>::iterator it = someVector.begin();
        if (*it) //Additional safety in case a NULL pointer was stored
            delete (*it); //because *it is an int* - only the pointed-to int is deleted
        someVector.erase(it);
    }

    return 0;
}





You can check my old post on how to see whether memory leaks are present. This program shows another approach, but the results are the same.

The program itself produces no interesting output; any leak report appears in the debug output window when run under the debugger.

You can read more about this approach to detecting memory leaks on the MSDN site here.
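
An alternative worth mentioning (a minimal sketch of my own, for the same raw-pointer vector as above): delete all the elements first, then clear the vector in one call, instead of erasing one element at a time. Note that delete on a NULL pointer is a safe no-op, so no extra check is needed.

//A minimal sketch of an alternative: delete the elements first, then
//clear the vector in one call. Assumes the vector owns its pointers.
#include <vector>

using namespace std;

int main()
{
    vector<int *> someVector;
    for (int i = 0; i < 5; i++)
        someVector.push_back(new int(i * 15));

    //Delete the pointed-to objects (delete on NULL is a safe no-op)...
    for (vector<int *>::iterator it = someVector.begin();
         it != someVector.end(); ++it)
    {
        delete *it;
    }
    //...then drop all the (now dangling) pointers at once
    someVector.clear();

    return 0;
}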

Extending enums

There might come a scenario where you want to include one enum in another, with the first being a subset of the second.

#include <cstdio>
struct enum_a {
     enum constant {
         a, b, c, d,
         end_of_enum_a = d
     };
};
struct enum_b : public enum_a {
     enum constant {
         e = end_of_enum_a + 1,
         f, g, h
     };
};
int main() {
     std::printf("enum_a: a(%d), b(%d), c(%d), d(%d)\n",
         enum_b::a, enum_b::b, enum_b::c, enum_b::d);
     std::printf("enum_b: e(%d), f(%d), g(%d), h(%d)\n",
         enum_b::e, enum_b::f, enum_b::g, enum_b::h);
     return 0;
}

Inheritance to the rescue. :-)

Case-Insensitive String comparison

It is a common problem: we ask someone to input a string for comparison, but they type it in a different case and the comparison fails. For example, if we are expecting the string 'true', the user may input 'True' or 'TRUE', and a simple comparison will fail.

The following is a simple program to convert the input string to lower case, which then allows a case-insensitive comparison.



//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
#include <iostream>
#include <string>
#include <cctype>

using namespace std;

//First approach for converting into lower case
string toLowerString(const string& input)
{
    int length = input.length();
    string output;
    for (int i = 0; i < length; i++)
    {
        output += tolower(input[i]);
    }
    return output;
}

//Second approach for converting into lower case
string toLowerString(const string& input, bool approach2) //Overloaded method
{
    int length = input.length();
    string output;
    for (int i = 0; i < length; i++)
    {
        if (input[i] >= 0x41 && input[i] <= 0x5A) //ASCII for 'A'-'Z'
            output += (input[i] + 0x20); //ASCII for 'a'-'z'
        else
            output += input[i];
    }
    return output;
}

int main()
{
    string s1 = "true";
    string s2("TRUE");
    string s3;
    s3 = "True";

    //Check if s1 == s2
    if (s1 == s2)
        cout << "s1 == s2" << endl;
    else
        cout << "s1 != s2" << endl;

    //Check if s1 == s2 with the first approach
    if (s1 == toLowerString(s2))
        cout << "s1 == Approach1::toLowerString(s2)" << endl;
    else
        cout << "s1 != Approach1::toLowerString(s2)" << endl;

    //Check if s1 == s2 with the second approach
    if (s1 == toLowerString(s2, true))
        cout << "s1 == Approach2::toLowerString(s2)" << endl;
    else
        cout << "s1 != Approach2::toLowerString(s2)" << endl;

    //Check if s1 == s3
    if (s1 == s3)
        cout << "s1 == s3" << endl;
    else
        cout << "s1 != s3" << endl;

    //Check if s1 == s3 with the second approach
    if (s1 == toLowerString(s3, true))
        cout << "s1 == Approach2::toLowerString(s3)" << endl;
    else
        cout << "s1 != Approach2::toLowerString(s3)" << endl;

    //Check if s1 == s3 with the first approach
    if (s1 == toLowerString(s3))
        cout << "s1 == Approach1::toLowerString(s3)" << endl;
    else
        cout << "s1 != Approach1::toLowerString(s3)" << endl;

    return 0;
}





The output is as follows:


I am sure much better approaches are possible. If you have a better example, please share.
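
For completeness, here is a third sketch using std::transform from <algorithm> with tolower; like the approaches above, it assumes plain ASCII input:

//A third approach using std::transform - a sketch, assumes ASCII input
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

using namespace std;

string toLowerCopy(string input) //takes a copy on purpose
{
    transform(input.begin(), input.end(), input.begin(), ::tolower);
    return input;
}

int main()
{
    string s1 = "true";
    string s2 = "TRUE";

    if (s1 == toLowerCopy(s2))
        cout << "s1 == toLowerCopy(s2)" << endl;
    else
        cout << "s1 != toLowerCopy(s2)" << endl;

    return 0;
}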

Example of Stringstreams

Here is an example of a stringstream that I picked up from here and modified slightly.




//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//Simple example of C++ stringstreams

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    string input;

    cout << "Please enter some numbers separated by spaces and press enter once done: ";
    getline(cin, input);

    cout << "input = " << input << endl;

    stringstream memString(input, ios_base::in); //stringstream( string s, openmode mode )

    cout << "memString = " << memString << endl; //prints the stream's void* conversion, not its contents
    cout << "memString (contents) = " << memString.str() << endl;

    double summ = 0;
    double temp;
    while (memString >> temp)
    {
        summ += temp;
    }

    cout << "summ = " << summ << endl;

    return 0;
}








The output is as follows:


See Also:
C++ Reference on stringstreams
Using Stringstreams in C++
String Stream example
String Streams Constructors
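
As a small aside before moving on, stringstreams are also a handy (if not the fastest) way to convert between numbers and strings; a minimal sketch of my own, not from the referenced pages:

//Sketch: using a stringstream to convert between numbers and strings
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    stringstream ss;
    ss << 42 << " " << 3.14;       //numbers in...
    string asText = ss.str();      //...string out
    cout << "asText = " << asText << endl;

    int i;
    double d;
    ss >> i >> d;                  //and back again
    cout << "i = " << i << ", d = " << d << endl;

    return 0;
}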

Maximum and minimum values of numeric data types

In C++, the numeric data types are short, int, long, float, double and long double. Every data type has limits: a minimum and a maximum value it can hold. If a variable is assigned a value greater than its maximum or smaller than its minimum, the result is wrong output rather than a compiler or runtime error. Hence we should be careful when choosing the data type for a variable. Below is a C++ program that determines the maximum and minimum values each data type can hold.

A C++ program to determine the maximum and minimum values of numeric data types:


#include <iostream>
#include <limits>

using std::cout;
using std::endl;
using std::numeric_limits;

int main ()
{
    cout << "The values for data type short ranges from: "
            << numeric_limits<short>::min () << " to "
            << numeric_limits<short>::max () << endl;

    cout << "The values for the data type int ranges from: "
            << numeric_limits<int>::min () << " to "
            << numeric_limits<int>::max () << endl;

    cout << "The values for the data type long ranges from: "
            << numeric_limits<long>::min () << " to "
            << numeric_limits<long>::max () << endl;

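    // Note: for floating point types, min() returns the smallest positive
    // normalized value, not the most negative value (use -max() for that).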
    cout << "The values for the data type float ranges from: "
            << numeric_limits<float>::min () << " to "
            << numeric_limits<float>::max () << endl;

    cout << "The values for the data type double ranges from: "
            << numeric_limits<double>::min () << " to "
            << numeric_limits<double>::max () << endl;

    cout << "The values for the data type long double ranges from: "
            << numeric_limits<long double>::min () << " to "
            << numeric_limits<long double>::max () << endl;

    return 0;
}

Using the sizeof keyword in C++

Determining the size of a data type using the sizeof keyword

When we compile a C++ program, the compiler translates it into the machine language of the machine on which it is compiled. The size of a data type may therefore vary between, say, a 16-bit and a 32-bit processor, in which case you might have to modify your source code for your program to run properly. This is not the case in Java: Java does not translate the program to machine language at compile time but to bytecode, which the Java interpreter, i.e. the Java virtual machine (JVM) or Java run-time environment, interprets, so no modification is required. In C++, however, we can find the amount of memory allocated to each data type using the sizeof keyword.

Below is a program that shows how much memory a particular data type requires on the machine on which it is executed. For example, on Windows XP running on a Pentium 4 processor, the sizes of int and long are the same: 4 bytes.

A C++ program example to find the size of each data type using the sizeof keyword:


#include <iostream>
#include <iomanip>

using namespace std;

int main ()
{

    cout << "The size of bool is" << setw (9) << ": "
            << sizeof (bool) << " byte" << endl;

    cout << "The size of char is" << setw (9) << ": "
            << sizeof (char) << " byte" << endl;

    cout << "The size of short is" << setw (8) << ": "
            << sizeof (short) << " byte" << endl;

    cout << "The size of int is" << setw (10) << ": "
            << sizeof (int) << " byte" << endl;

    cout << "The size of long is" << setw (9) << ": "
            << sizeof (long) << " byte" << endl;

    cout << "The size of float is" << setw (8) << ": "
            << sizeof (float) << " byte" << endl;

    cout << "The size of double is" << setw (7) << ": "
            << sizeof (double) << " byte" << endl;

    cout << "The size of long double is" << setw (2) << ": "
            << sizeof (long double) << " byte" << endl;

    return 0;
}

Fixed form, Scientific form and Precision setting for floating point numbers

Sometimes we want to output floating point numbers in fixed or scientific form, and sometimes with a fixed precision so that all the output lines up. The following program demonstrates this:



//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//This example shows how to output floating point numbers in fixed and scientific form
#include <iostream>

using namespace std;

int main()
{
    float number = 123.456f;

    cout << "number in default form = " << number << endl; //default is neither fixed nor scientific
    cout << "number in scientific form = " << scientific << number << endl;
    cout.precision(2);
    cout << "number in fixed form with precision 2 = " << fixed << number << endl; //format was scientific until here
    cout.precision(3);
    cout << "number in fixed form with precision 3 = " << number << endl;
    cout.precision(4);
    cout << "number in fixed form with precision 4 = " << number << endl;
    cout.precision(5);
    cout << "number in fixed form with precision 5 = " << number << endl;

    return 0;
}






The output is as follows:

Printing out leading 0's (or any other characters)

This is a simple example of printing leading zeroes (or any other fill character) when printing numbers. The main manipulator needed is setw, which sets the width of the field; setfill sets the character used to pad it.



//Program tested on Microsoft Visual Studio 2008 - Zahid Ghadialy
//This example shows how to print leading 0's in the output

#include <iostream>
#include <iomanip>

using namespace std;

int main()
{
    int number = 123;

    cout << "number = " << number << endl;

    cout << "number = " << setw(2) << setfill('0') << number << endl;
    cout << "number = " << setw(3) << setfill('0') << number << endl;
    cout << "number = " << setw(4) << setfill('0') << number << endl;
    cout << "number = " << setw(5) << setfill('0') << number << endl;
    cout << "number = " << setw(10) << setfill('0') << number << endl;

    cout << endl;
    cout << "number = " << setw(5) << setfill('x') << number << endl;

    cout << endl;
    number = 0;
    cout << "number = " << setw(5) << setfill('x') << number << endl;

    return 0;
}







The output is as follows:

Two approaches on Multi-tenancy in Cloud

Continuing from my previous blog post on how multi-tenancy relates to cloud computing.

My thoughts have changed: I now think the Amazon approach (hypervisor isolation) and the Salesforce approach (DB isolation) are both valid, but attract different sets of audiences.

First of all, increased efficiency through sharing is a fundamental value proposition underlying all cloud computing initiatives; there is no debate that ...
  • We should "share resource" to increase utilization and hence improve efficiency
  • We should accommodate highly dynamic growth and shrinkage requirements rapidly and smoothly
  • We should "isolate" the tenants so there is no leakage of sensitive information
But at which layer should we facilitate that? Hypervisor level or DB level?

Hypervisor level Isolation

A hypervisor is a very low-level layer of software that maps the physical machine to a virtualized machine on which a regular OS runs. When the regular OS issues system calls to the VM, they are intercepted by the hypervisor, which maps them to the underlying hardware. The hypervisor also provides some traditional OS functions, such as process scheduling to determine which VM to run. A hypervisor can be considered a very lean OS that sits very close to the bare hardware.

Depending on the specific implementation, a hypervisor introduces an extra layer of indirection and hence incurs a certain percentage of overhead. If we need a VM with less capacity than a physical machine, the hypervisor allows us to partition the hardware at a finer granularity and hence improve efficiency by running more tenants on the same physical machine. For light-usage tenants, this gain in efficiency should offset the loss from the overhead.

Since the hypervisor deals in low-level system primitives, it provides the cleanest separation and hence lessens security concerns. At the same time, by intercepting at the lowest layer, the hypervisor retains the familiar machine model that existing system/network admins know. Since the application is completely agnostic to the presence of the hypervisor, this minimizes the changes required to move existing apps into the cloud and makes cloud adoption easier.

Of course, the downside is that virtualization introduces a certain percentage of overhead. And the tenant still needs to pay for the smallest VM even when none of its users is using it.

DB level Isolation

Here is another school of thought: if tenants are running the same kind of application, the only difference is the data each tenant stores. Why can't we just introduce an extra attribute "tenantId" in every table and then append a "where tenantId = $thisTenantId" to every query? In other words, add some hidden columns and modify each submitted query.
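
As a rough illustration of that query rewrite (a sketch only; the function and table names here are hypothetical, not any particular provider's API):

//Illustrative sketch of tenant-scoped query rewriting. All names are
//hypothetical; a real implementation would parse the query properly.
#include <iostream>
#include <string>

std::string scopeToTenant(const std::string &query, const std::string &tenantId)
{
    //Naive rewrite: assumes the submitted query has no WHERE clause yet;
    //otherwise the predicate would have to be AND-ed in.
    return query + " WHERE tenantId = '" + tenantId + "'";
}

int main()
{
    std::cout << scopeToTenant("SELECT * FROM invoices", "tenant-42")
              << std::endl;
    return 0;
}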

In addition, the cloud provider usually needs to re-architect the underlying data layer and move to a distributed, partitioned DB. Some of the more sophisticated providers also invest in developing intelligent data-placement algorithms based on workload patterns.

In this approach, the degree of isolation is only as good as the rewritten query. In my opinion this doesn't seem hard, although it is less proven than the hypervisor approach.

The advantages of DB-level isolation are that there is no VM overhead and no minimum charge to the tenant.

However, we should compare these two approaches not just from a resource utilization / efficiency perspective, but from other perspectives as well, such as ...

Freedom of choice on technology stack

Hypervisor isolation gives its tenants maximum freedom in the underlying technology stack. Each tenant can choose the stack that best fits its application's needs and in-house IT skills. Tenants are also free to move to the latest technologies as they evolve.

This freedom of choice comes with a cost, though: the tenant needs to hire system administrators to configure and maintain the technology stack.

In DB-level isolation, tenants live within a set of predefined data schemas and application flows, so their degree of freedom is limited to whatever parameters the cloud provider exposes. The tenants' applications are also "locked in" to the cloud provider's framework, creating tight coupling and dependency between tenant and cloud provider.

Of course, the advantage is that no administration of the technology stack is needed.

Reuse of Domain Specific Logic

Since it focuses on the lowest layer of resource sharing, hypervisor isolation provides no reuse at the application logic level. Tenants need to build their own technology stack from the ground up and write their own application logic.

In the DB isolation approach, the cloud provider predefines a set of templates for DB schemas and application flow logic based on its domain expertise (it is important that the cloud provider be a recognized expert in that field). The tenant can leverage the cloud provider's domain expertise and focus purely on business operation.

Conclusion

I think each approach will attract a very different (and clearly disjoint) set of audiences.

Notice that DB-level isolation commoditizes everything and makes it very hard to create product feature differentiation. If I am a technology startup trying to develop a killer product, my core value is my domain expertise. In that case I won't go with DB-level isolation, which imposes too many constraints on me to distinguish my product from anyone else's. Hypervisor-level isolation is much better, because I can outsource the infrastructure layer and focus on my core value.

On the other hand, if I am operating a business rather than building a product, I would like to outsource all supporting functions, including my applications. In that case I would pick the best application framework provided by the market leader and follow their best practices (and be very willing to live with their constraints); DB-level isolation is more compelling here.

Search Engine Basics

I have received the question "how does search work?" a couple of times recently, so I'll try to document the whole process. This is intended to highlight the key concepts, not specific implementation details, which are much more complicated and sophisticated than shown here.

A very basic search engine includes a number of processing phases.
  • Crawling: to discover the web pages on the internet
  • Indexing: to build an index to facilitate query processing
  • Query Processing: extract the most relevant pages based on the user's query terms
  • Ranking: order the results based on relevancy


Notice that each element in the above diagram reflects a logical function unit, not its physical boundary. For example, the processing unit in each orange box is in fact executed across many machines in parallel. Similarly, each data store element is spread physically across many machines based on key partitioning.


Vector Space Model

Here we use the "Vector Space Model", where each document is modeled as a multi-dimensional vector (each word represents a dimension). If we put all documents together, we form a matrix where the rows are documents and the columns are words, and each cell contains the TF/IDF value of the word within the document.


To determine the similarity between two documents, we can take the dot product of their vectors; the result represents the degree of similarity.
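
As a rough sketch of that computation (dense word-count vectors standing in for sparse TF/IDF-weighted ones), the normalized dot product, i.e. cosine similarity, looks like this:

//Sketch: cosine similarity between two document vectors.
//Vectors are dense here for simplicity; real indexes are sparse.
#include <cmath>
#include <iostream>
#include <vector>

double cosineSimilarity(const std::vector<double> &a,
                        const std::vector<double> &b)
{
    double dot = 0, normA = 0, normB = 0;
    for (size_t i = 0; i < a.size(); ++i)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

int main()
{
    //Each position is one word dimension; values would be TF/IDF weights
    double v1[] = {1, 0, 2, 0};
    double v2[] = {1, 1, 1, 0};
    std::vector<double> doc1(v1, v1 + 4), doc2(v2, v2 + 4);

    std::cout << "similarity = " << cosineSimilarity(doc1, doc2) << std::endl;
    return 0;
}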


Crawler

The crawler's job is to collect web pages from the internet. It is typically done by a farm of crawlers, which do the following.

Start from a set of seed URLs, then repeat the following ...
  1. Pick the URL that has the highest traversal priority.
  2. Download the page content from the URL to the content repository (which can be a distributed file system, or a DHT), and update the entry in the doc index
  3. Discover new URL links from the downloaded pages. Add the link relationships to the link index and add these links to the traversal candidates
  4. Prioritize the traversal candidates
The content repository can be any distributed file system; here let's say it is a DHT.

There are a number of considerations.
  • How do we make sure different crawlers work on different sets of content (rather than crawling the same page twice)? When a crawler detects overlap (the URL already exists in the page repository with a fairly recent timestamp), it skips this URL and picks the next best URL to crawl.
  • How does the crawler determine the next candidate to crawl? We can use a heuristic algorithm based on some utility function (e.g. pick the URL candidate with the highest page rank score). A skeleton of the crawl loop is sketched after this list.
  • How frequently do we re-crawl? We can track the rate of change of a page to determine its crawling frequency.
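
In skeleton form, that crawl loop might look like the following (illustrative only: downloadPage and extractLinks are hypothetical stubs, and the priority function is elided):

//Skeleton of the crawl loop described above - illustrative only.
//downloadPage and extractLinks are hypothetical stubs, not a real library.
#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <vector>

std::string downloadPage(const std::string &url)
{
    return "<html>content of " + url + "</html>"; //stub: a real crawler fetches over HTTP
}

std::vector<std::string> extractLinks(const std::string &)
{
    return std::vector<std::string>(); //stub: a real crawler parses hrefs
}

int main()
{
    std::priority_queue<std::string> candidates; //a real crawler orders by utility score
    std::set<std::string> seen;                  //to skip already-crawled URLs
    candidates.push("http://example.com/");      //seed URL

    while (!candidates.empty())
    {
        std::string url = candidates.top();
        candidates.pop();
        if (!seen.insert(url).second)
            continue; //overlap detected - pick the next best URL

        std::string page = downloadPage(url); //store in the content repository
        std::cout << "crawled " << url << std::endl;

        std::vector<std::string> links = extractLinks(page);
        for (size_t i = 0; i < links.size(); ++i)
            candidates.push(links[i]); //new traversal candidates
    }
    return 0;
}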

Indexer


The indexer's job is to build the inverted index that the query processor uses to serve online search requests.

First the indexer builds the "forward index":
  1. The indexer will parse the documents from the content repository into a token stream.
  2. Build up a "hit list" which describes each occurrence of the token within the document (e.g. position in the doc, font size, whether it is a title or anchor text ... etc.).
  3. Apply various "filters" to the token stream (like a stop-word filter to remove words such as "a" and "the", or a stemming filter to normalize "happy", "happily" and "happier" into "happy").
  4. Compute the term frequency within the document.
From the forward index, the indexer proceeds to build the inverted index (typically through a Map/Reduce mechanism). The result is keyed by word and stored in a DHT. A toy version of this step is sketched below.
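
A toy, single-machine version of this forward-then-invert step (a plain in-memory map stands in for the Map/Reduce pass and the DHT):

//Toy inverted index: word -> list of (docId, term frequency).
//A single in-memory map stands in for the Map/Reduce pass and the DHT.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    const char *docs[] = {"the cat sat", "the dog sat down"};
    std::map<std::string, std::vector<std::pair<int, int> > > index;

    for (int docId = 0; docId < 2; ++docId)
    {
        //Forward pass: tokenize and count term frequency per document
        std::map<std::string, int> tf;
        std::istringstream tokens(docs[docId]);
        std::string word;
        while (tokens >> word)
            ++tf[word]; //a real indexer also records a hit list per occurrence

        //Invert: key the postings by word
        for (std::map<std::string, int>::iterator it = tf.begin();
             it != tf.end(); ++it)
            index[it->first].push_back(std::make_pair(docId, it->second));
    }

    //Print the postings for one word
    std::vector<std::pair<int, int> > &postings = index["sat"];
    for (size_t i = 0; i < postings.size(); ++i)
        std::cout << "sat -> doc " << postings[i].first
                  << " (tf " << postings[i].second << ")" << std::endl;
    return 0;
}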


Ranker


The ranker's job is to compute the rank of a document based on how many in-links point to the document, as well as the rank of the referrers (hence a recursive definition). Two popular ranking algorithms are "PageRank" and "HITS".
  • Page Rank Algorithm
Page rank is a global rank mechanism: it is precomputed upfront and is independent of the query (a bare-bones sketch follows at the end of this section).

  • HITS Algorithm
In HITS, every page plays a dual role: a "hub" role and an "authority" role, with a corresponding rank for each. The hub rank measures the quality of a page's outlinks: a good hub is one that points to many good authorities. The authority rank measures the quality of a page's content: a good authority is one that many good hubs point to.

Notice that HITS doesn't precompute the hub and authority scores. Instead it invokes a regular search engine (which only does TF/IDF matching, not ranking) to get a set of initial results (typically of a predefined fixed size) and then expands this result set by tracing the outlinks into the expanded result set. It also incorporates a fixed-size sample of the inlinks to the initial result set into the expanded set. After this expansion, it runs an iterative algorithm to compute the authority and hub ranks, and uses a combination of these two ranks to calculate the ultimate rank of each page; usually a high hub rank weighs more than a high authority rank.

Notice that the HITS algorithm is performed at query time, not precomputed upfront. The advantage of HITS is that it is sensitive to the query (unlike PageRank). The disadvantage is that it performs ranking per query and is hence expensive.
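
To make the PageRank side concrete, here is a bare-bones power-iteration sketch over a tiny hand-made link graph (0.85 is the conventional damping factor; a real implementation runs this as a distributed batch job over the link index):

//Bare-bones PageRank power iteration over a tiny link graph.
#include <iostream>
#include <vector>

int main()
{
    const int n = 3;
    //links[i] lists the pages that page i points to: 0->1, 0->2, 1->2, 2->0
    std::vector<std::vector<int> > links(n);
    links[0].push_back(1); links[0].push_back(2);
    links[1].push_back(2);
    links[2].push_back(0);

    const double d = 0.85; //damping factor
    std::vector<double> rank(n, 1.0 / n);

    for (int iter = 0; iter < 50; ++iter)
    {
        std::vector<double> next(n, (1.0 - d) / n);
        for (int i = 0; i < n; ++i)
            for (size_t j = 0; j < links[i].size(); ++j)
                next[links[i][j]] += d * rank[i] / links[i].size();
        rank = next; //repeat until the ranks converge
    }

    for (int i = 0; i < n; ++i)
        std::cout << "page " << i << " rank = " << rank[i] << std::endl;
    return 0;
}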


Query Processor

When a user inputs a search query (containing multiple words), the query is treated as a "query document". Relevancy is computed and combined with the rank of each document to return an ordered list of results.

There are many ways to compute relevancy. We can consider only the documents that contain all the terms in the query. In this model, we look up, for each term in the query, a list of document ids and then intersect them. If we order each document list by document id, the intersection can be computed quite efficiently, as the sketch below shows.
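
Since both posting lists are sorted by document id, the intersection is a simple linear merge; a minimal sketch:

//Intersect two posting lists that are sorted by document id.
#include <iostream>
#include <vector>

std::vector<int> intersect(const std::vector<int> &a, const std::vector<int> &b)
{
    std::vector<int> result;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size())
    {
        if (a[i] == b[j]) { result.push_back(a[i]); ++i; ++j; }
        else if (a[i] < b[j]) ++i; //advance the list with the smaller id
        else ++j;
    }
    return result;
}

int main()
{
    int pa[] = {1, 4, 7, 9};  //documents containing term A
    int pb[] = {2, 4, 9, 12}; //documents containing term B
    std::vector<int> a(pa, pa + 4), b(pb, pb + 4);

    std::vector<int> both = intersect(a, b);
    for (size_t k = 0; k < both.size(); ++k)
        std::cout << "doc " << both[k] << std::endl; //prints 4 and 9
    return 0;
}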

Alternatively, we can return the union (instead of the intersection) of all documents and order them by a combination of the page rank and the TF/IDF score. Documents that share more terms with the query will have a higher TF/IDF score.

In some cases, an automatic query-result feedback loop can be used to improve relevancy.
  1. In the first round, the search engine performs a search (as described above) based on the user's query
  2. Construct a second-round query by expanding the original query with additional terms found in the highly ranked documents returned in the first round
  3. Perform the second round of the query and return the results.

Outstanding Issues


Fighting spammers is a continuous battle for search engines. Because of the financial value of showing up on the first page of search results, many spammers try to manipulate their pages. An early technique was to modify a page to repeat certain terms many times (trying to increase the TF/IDF score). The evolution of page rank has mitigated this to some degree, because page rank is based on "out-of-page" information that the site owner finds much harder to manipulate.

But people use link farms to game the page rank algorithms. The idea is to trade links between different domains. There is active research in this area on how to catch these patterns and discount their ranks.
