Inline specializations and multiple inclusions

Recently, I came across a problem someone was facing on the forums. He had code like this:

//file: header.h
#ifndef HEADER_H
#define HEADER_H

#include <iostream>

template<class T>
void foo(T val){
    std::cout << "foo<T>("<<val<<")\n";
}
template<>
inline void foo(double val){
    std::cout << "foo<double>(" << val <<")\n";
}

#endif

//file: main.cpp
#include "header.h"

int main(){
    double x=4;
    foo(x);
    return 0;
}

//file: other.cpp
#include "header.h"
void somecode(){}

To summarize the above code: there is a header file (header.h) that contains a function template and its specialization for the type 'double', with the specialization declared inline. That header is included in two implementation files (main.cpp and other.cpp).

This code compiled and linked fine for him, but as soon as he removed the inline keyword, it started giving him a linker error for multiple definitions of the foo<double> function. We might start wondering what it is between templates and inline that causes this. Something and nothing. Here is what actually happens:

The linker error seen here is not specific to templates or specializations. One would face the same problem if foo were a non-template function. A non-inline function with external linkage must be defined in exactly one translation unit (the one-definition rule), although it may be declared any number of times. When we provide the definition in the header file instead of an implementation file, and that header is included in multiple implementation files (main.cpp and other.cpp), we end up with multiple definitions and hence, rightly, the linker error. Putting the implementation into the header is what we are forced to do with a template function, because the definition must accompany the declaration ('export' never worked in practice except with the Comeau compiler). But an explicit specialization is a different case from the primary template. We don't face such issues with regular templates (as opposed to their explicit specializations) because the instantiation rules guarantee that only one copy of each instantiated function ends up in the program. An explicit specialization, however, is an ordinary function as far as linkage is concerned, so including its definition in multiple implementation files gives us multiple definitions.
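To make this concrete, here is a minimal non-template sketch (the file names bar.h, a.cpp and b.cpp are made up for illustration) that produces the same multiple-definition error:

//file: bar.h
#ifndef BAR_H
#define BAR_H

// A definition (not just a declaration) of a non-inline function
// with external linkage, placed in a header.
void bar(){}

#endif

//file: a.cpp
#include "bar.h"

//file: b.cpp
#include "bar.h"

// Linking a.cpp and b.cpp together fails with "multiple definition of bar()",
// exactly like the non-inline foo<double> specialization above.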

Inline functions are the exception: we provide their definition in the header file so that it is visible at every point of use, and the language explicitly permits the same inline definition to appear in multiple translation units. Whether the calls actually get inlined in the final executable is irrelevant; as long as you follow the rules for defining inline functions, there will be only one copy of them in the program.

Coming to the solutions for the above linker error, here are the two simplest ways:

1. Make the specialization inline (as in the original code above, in case someone facing this issue didn't have it in the first place).

2. Don't write it as a specialization at all: declare it as a plain overloaded foo function in the header (header.h) and provide its implementation in exactly one implementation file (main.cpp, other.cpp, or a dedicated .cpp for header.h's overloads, just making sure it isn't defined twice). A sketch of this option is shown below.
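For reference, here is a minimal sketch of the second option, reusing the file names from the original code; the extra file foo_double.cpp is hypothetical, and any single implementation file would do just as well:

//file: header.h
#ifndef HEADER_H
#define HEADER_H

#include <iostream>

template<class T>
void foo(T val){
    std::cout << "foo<T>("<<val<<")\n";
}

// Plain overload instead of a specialization: only declared here.
void foo(double val);

#endif

//file: foo_double.cpp
#include "header.h"

// The one and only definition of the overload.
void foo(double val){
    std::cout << "foo(double)(" << val << ")\n";
}

Overload resolution prefers the non-template foo(double) for a double argument, so main.cpp behaves exactly as before.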

References:

1. Codeguru thread where I came across this - "inline" and linking errors
2. C++ FAQ Lite on Inline Functions

Solving TF-IDF using Map-Reduce

TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term.

"Term" is a generalized element contains within a document. A "term" is a generalized idea of what a document contains. (e.g. a term can be a word, a phrase, or a concept).

Intuitively, the relevancy of a document to a term can be calculated from how frequently that term shows up in the document (i.e. the count of the term in that document divided by the total number of terms in it). We call this the "term frequency".

On the other hand, if this is a very common term that appears in many other documents, then its relevancy should be reduced (i.e. the count of documents containing this term divided by the total number of documents). We call this the "document frequency".

The overall relevancy of a document with respect to a term can be computed using both the term frequency and document frequency.

relevancy = term frequency * log (1 / document frequency)

This is called tf-idf. A "document" can be considered a multi-dimensional vector where each dimension represents a term, with the tf-idf as its value.
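As a quick sanity check of the formula, here is a small standalone sketch with made-up numbers (not taken from the data files below); it computes tf-idf the same way the Java UDF at the end of this post does:

#include <cmath>
#include <iostream>

int main(){
    // Hypothetical numbers: the term appears 3 times in a 100-word document,
    // and 2 out of the 10 documents in the corpus contain it.
    double wordCount = 3, wordCountPerDoc = 100;
    double docCountPerWord = 2, totalDocs = 10;

    double tf  = wordCount / wordCountPerDoc;            // 0.03
    double idf = std::log(totalDocs / docCountPerWord);  // log(5) ~ 1.61
    std::cout << "tf-idf = " << tf * idf << "\n";        // ~0.048
    return 0;
}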

Compute TF-IDF using Map/Reduce

To extract the terms from a document, the following process is common (a minimal sketch of the first two steps follows the list):
  • Extract words by tokenizing the input stream
  • Make the words case-insensitive (e.g. transform to all lower case)
  • Apply n-grams to extract phrases (e.g. a statistically frequent n-gram is likely a phrase)
  • Filter out stop words
  • Stemming (e.g. transform cat, cats, kittens to cat)
To keep the terms simple, each word itself is a term in our example below.
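Here is that minimal sketch of the first two steps, tokenizing and lower-casing; it is independent of the PIG pipeline below and only meant to illustrate the idea:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

int main(){
    std::string line = "The cat sat on the Mat";
    std::istringstream in(line);
    std::string word;
    while (in >> word) {  // tokenize on whitespace
        // lower-case each word so that "The" and "the" become the same term
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c){ return static_cast<char>(std::tolower(c)); });
        std::cout << word << "\n";
    }
    return 0;
}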

We use multiple rounds of Map/Reduce to gradually compute …
  1. the word count per word/doc combination
  2. the total number of words per doc
  3. the total number of docs per word, and finally compute the TF-IDF



Implementation in Apache PIG

There are many ways to implement the Map/Reduce paradigm above. Apache Hadoop is a pretty popular approach, using Java or other programming languages (i.e. via Hadoop Streaming).

Apache PIG is another approach, based on a higher-level language with parallel processing constructs built in. Here are the 3 rounds of map/reduce logic implemented in a PIG script:
REGISTER rickyudf.jar

/* Build up the input data stream */
A1 = LOAD 'dirdir/data.txt' AS (words:chararray);
DocWordStream1 =
    FOREACH A1 GENERATE
        'data.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

A2 = LOAD 'dirdir/data2.txt' AS (words:chararray);
DocWordStream2 =
    FOREACH A2 GENERATE
        'data2.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

A3 = LOAD 'dirdir/data3.txt' AS (words:chararray);
DocWordStream3 =
    FOREACH A3 GENERATE
        'data3.txt' AS docId,
        FLATTEN(TOKENIZE(words)) AS word;

InStream = UNION DocWordStream1,
                 DocWordStream2,
                 DocWordStream3;

/* Round 1: word count per word/doc combination */
B = GROUP InStream BY (word, docId);
Round1 = FOREACH B GENERATE
             group AS wordDoc,
             COUNT(InStream) AS wordCount;

/* Round 2: total word count per doc */
C = GROUP Round1 BY wordDoc.docId;
WW = GROUP C ALL;
C2 = FOREACH WW GENERATE
         FLATTEN(C),
         COUNT(C) AS totalDocs;
Round2 = FOREACH C2 GENERATE
             FLATTEN(Round1),
             SUM(Round1.wordCount) AS wordCountPerDoc,
             totalDocs;

/* Round 3: Compute the total doc count per word */
D = GROUP Round2 BY wordDoc.word;
D2 = FOREACH D GENERATE
         FLATTEN(Round2),
         COUNT(Round2) AS docCountPerWord;
Round3 = FOREACH D2 GENERATE
             $0.word AS word,
             $0.docId AS docId,
             com.ricky.TFIDF(wordCount,
                             wordCountPerDoc,
                             totalDocs,
                             docCountPerWord) AS tfidf;

/* Order the output by relevancy */
ORDERRound3 = ORDER Round3 BY word ASC,
                              tfidf DESC;
DUMP ORDERRound3;



Here is the corresponding User Defined Function in Java (contained in rickyudf.jar):
package com.ricky;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TFIDF extends EvalFunc<Double> {

    @Override
    public Double exec(Tuple input) throws IOException {
        long wordCount = (Long) input.get(0);
        long wordCountPerDoc = (Long) input.get(1);
        long totalDocs = (Long) input.get(2);
        long docCountPerWord = (Long) input.get(3);

        double tf = (wordCount * 1.0) / wordCountPerDoc;
        double idf = Math.log((totalDocs * 1.0) / docCountPerWord);

        return tf * idf;
    }
}

Kids Learning Resources

While chatting with a friend, I thought it would be useful to summarize a list of good kids' learning resources that I've used in the past.

Learning in this generation is very different, given the wealth of resources out on the internet. It is important to teach kids to figure out knowledge themselves.

General tools for knowledge acquisition

Geography and Sightseeing

  • Google Map for navigating to different places
  • Using the Street View feature by dragging the little person icon (at the top left corner) onto the map, we can visit different countries virtually. Here is the Eiffel Tower in Paris, France. Google Street View is extending its coverage, so expect more areas of the world to be covered in the future.
  • Google Earth for an even more sophisticated 3D map navigation. Watch their introduction video.

History and Ancient Empire


Human Body


Money Management


Developing a good economic sense and planning to use money wisely is important, especially as kids are commonly spoiled by their parents these days.

Online Games


I normally don't recommend exposing kids to video or online games unless they are carefully chosen for educational purposes. Kids get addicted to games so easily. But if you can channel that addiction appropriately, it is a good way to motivate them to learn, and it is fun too.

Here are some exceptionally good ones from Lego

Game Designing


It is also good to teach them how a game is designed because this gives them an opportunity to switch roles (from a player to a designer) and establish a good overall picture. There are some very good tools for designing games.

Kids Programming

Programming is all about planning the steps to achieve a goal, and debugging is all about using a systematic way to find out why something doesn't work as planned. Even if you are not planning for your kids to become programmers or computer professionals, teaching them how to program can develop very good planning and analytical skills, which are very useful in their daily life.
  • The famous LOGO green turtle is a good start at age 6-7.
  • Lego Mindstorms is a very good one because kids are exposed not just to programming skills but to other engineering skills (such as mechanics) as well. It costs some money, but I think it is well worth it.
  • A great link for robot fans.
  • At around age 10, I found that a professional programming language can be taught. I chose Ruby, given its expressiveness and simplicity.
  • One of my colleagues suggested BlueJ (basically Java), but I personally haven't tried it yet (given I am a Ruby fan).

Materials for all subjects and all levels

Design Pattern for Eventual Consistency

"Eventual Consistency" and "BASE" are the modern architectural approaches to achieve highly scalable web applications. Some of my previous colleagues, Randy Shoup and Dan Pritchett has already written up some very good articles about this.

I recently have some good conversation from the cloud computing Google groups on the "eventual consistency" model. There is a general design pattern that I want to capture here.


The definition of data integrity

"Data integrity" is defined by the business people. It needs to hold at specific stages of the business operation cycle. Here we use a book seller example to illustrate. The data integrity is defined as ...

Sufficient books must be available in stock before shipment is made to fulfil an order.

The application designer, who designs the application to support the business, may transform the "data integrity" into ...

Total_books_in_stock > sum_of_all_orders

The app designer is confident that if the above data integrity is observed, then the business integrity will automatically be observed. Note here that "data integrity" may be transformed by the application designer into a more restrictive form.

Now, the designer proceeds with the application implementation ... There are 2 approaches to choose from, with different implications for scalability.


The ACID approach

One way of ensuring the above integrity is observed is to use the traditional ACID approach. Here, a counter "no_of_books_in_stock" is used to keep track of available books in the most up-to-date fashion. Whenever a new order enters, the transaction checks that the data integrity still holds.

In this approach, "no_of_books_in_stock" becomes a shared state that every concurrent transaction needs to access. Therefore, transactions take turns locking the shared state sequentially in order to achieve serializability. A choke point is created, and the system is not very scalable.
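As a rough sketch of this style (with hypothetical names and a simple in-memory counter standing in for a real database transaction), every order has to serialize on the shared counter before it can commit:

#include <iostream>
#include <mutex>
#include <stdexcept>

struct Inventory {
    long no_of_books_in_stock = 1000;  // the shared, always up-to-date counter
    long sum_of_all_orders = 0;
    std::mutex m;                      // every order serializes on this shared state
};

// Each new order locks the shared counter and re-checks the invariant
// Total_books_in_stock > sum_of_all_orders before it is allowed to commit.
void placeOrder(Inventory& inv, long quantity) {
    std::lock_guard<std::mutex> lock(inv.m);
    if (inv.sum_of_all_orders + quantity >= inv.no_of_books_in_stock)
        throw std::runtime_error("order rejected: not enough books in stock");
    inv.sum_of_all_orders += quantity;  // commit
}

int main(){
    Inventory inv;
    placeOrder(inv, 3);        // accepted
    try {
        placeOrder(inv, 998);  // would break the invariant, so it is rejected
    } catch (const std::exception& e) {
        std::cout << e.what() << "\n";
    }
    return 0;
}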

Hence the second approach ...


The BASE approach

Instead of checking against a "shared state" at order-taking time, the order-taking transaction may just get a general idea of how many books are available in stock at the beginning of the day and only check against that. Note that this is not a shared state, and hence there is no choke point. Of course, since every transaction assumes no other transaction is proceeding, there is a possibility of over-booking.

At the end of the day, a reconciliation process takes the sum of all orders taken during the day and checks it against the number of books in stock. Data integrity checking is done here, but in batch mode, which is much more efficient. In case the data integrity is not maintained, the reconciliation process fires up compensating actions, such as printing more books or refunding some orders, in order to reestablish the data integrity. A rough sketch of this reconciliation step follows.
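Here is that rough sketch of the end-of-day reconciliation, again with hypothetical names and in-memory data; a real system would read these figures from its order and inventory stores:

#include <iostream>
#include <numeric>
#include <vector>

// Compare the day's orders against the actual stock level and fire
// compensating actions if the data integrity check fails.
void reconcile(const std::vector<long>& ordersTakenToday, long booksInStock) {
    long totalOrdered = std::accumulate(ordersTakenToday.begin(),
                                        ordersTakenToday.end(), 0L);
    if (totalOrdered <= booksInStock) {
        std::cout << "data integrity holds; nothing to do\n";
        return;
    }
    // Compensating actions: print more books and/or refund some orders.
    long shortfall = totalOrdered - booksInStock;
    std::cout << "over-booked by " << shortfall
              << " books; print more books or refund some orders\n";
}

int main(){
    std::vector<long> ordersTakenToday = {120, 340, 600};  // made-up figures
    reconcile(ordersTakenToday, 1000);                     // 1060 ordered vs 1000 in stock
    return 0;
}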


Generalizing the Design pattern of BASE

"Eventual consistency" is based on the notion that every action is revokable by executing a "compensating action". However, all compensating actions different costs involved, which is the sum of this individual compensation plus all compensating actions that it triggers.

In BASE, the design goal is to reduce the probability of doing high-cost revokation. Note that BASE is a probablistic model while ACID is a binary, all-or-none model.

In BASE, data is organized into two kinds of "states": "Provisional state" and "Real state".

All online transactions should only read/write the provisional state, which is a local, non-shared view among these transactions. Since there is no shared state or choke point involved, online transactions can proceed very efficiently.

Real state reflects the actual state of the business (e.g. how many books are actually in stock), which always lags behind the provisional state. The "real state" and "provisional state" are brought together periodically via a batch-oriented reconciliation process.

Data integrity checking is deferred until the reconciliation process executes. High efficiency is achieved because of its batch nature. When the data integrity does not hold, the reconciliation process will fire up compensating actions in order to reestablish the data integrity. At the end of the reconciliation, the provisional state and the real state are brought in sync with each other.

To minimize the cost of compensating actions, we need to confine the deviation between the provisional state and the real state. This may mean that we need to do the reconciliation more frequently. We should also minimize the chaining effect of compensating actions.

Easter Eggs?

Here's a link to a fun article about how Microsoft added Easter Eggs to their original BASIC implementations.


It never even occurred to me to do this.  ;-)
