Mcnought's Random Musing and Mutterings: February 2014

For the second time in two months I have been asked about "Big-O" functions when talking about software efficency.
You may have seen people quote a metric for the efficency of a piece of code as O(n) or O(n^2) etc. where 'n' is the amount of data you have got.
But can you recall waht it means ... have you seen it sicne your comuting course?
Well I hadn't and it is along time since then (don't ask)!

Due to experience I know that a nested set of loops is going to be O(n*m) but this stuff realy counts now that we are increasinly seeing huge data sets.

I have regularly seen processes rescently that sill take tens of minutes to run and that is despite multiple processors and fast ones.

So what is going on?

Well lets look at an over view of the basics of Big O notation (plagarised from the Perlmonks)

Common Orders of Growth
O(1) is the no-growth curve. An O(1) algorithm's performance is conceptually independent of the size of the data set on which it operates. Array element access is O(1), if you ignore implementation details like virtual memory and page faults. Ignoring the data set entirely and returning undef is also O(1), though this is rarely useful.
O(N) says that the algorithm's performance is directly proportional to the size of the data set being processed. Scanning an array or linked list takes O(N) time. Probing an array is still O(N), even if statistically you only have to scan half the array to find a value. Because computer scientists are only interested in the shape of the growth curve at this level, when you see O(2N) or O(10 + 5N), someone is blending implementation details into the conceptual ones.
Depending on the algorithm used, searching a hash is O(N) in the worst case. Insertion is also O(N) in the worst case, but considerably more efficient in the general case.
O(N+M) is just a way of saying that two data sets are involved, and that their combined size determines performance.
O(N²) says that the algorithm's performance is proportional to the square of the data set size. This happens when the algorithm processes each element of a set, and that processing requires another pass through the set. The infamous Bubble Sort is O(N²).
O(N•M) indicates that two data sets are involved, and the processing of each element of one involves processing the second set. If the two set sizes are roughly equivalent, some people get sloppy and say O(N²) instead. While technically incorrect, O(N²) still conveys useful information.
"I've got this list of regular expressions, and I need to apply all of them to this chunk of text" is potentially O(N•M), depending on the regexes.
O(N³) and beyond are what you would expect. Lots of inner loops.
O(2^N) means you have an algorithm with exponential time (or space, if someone says space) behavior. In the 2 case, time or space double for each new element in data set. There's also O(10^N), etc. In practice, you don't need to worry about scalability with exponential algorithms, since you can't scale very far unless you have a very big hardware budget.
O(log N) and O(N log N) might seem a bit scary, but they're really not. These generally mean that the algorithm deals with a data set that is iteratively partitioned, like a balanced binary tree. (Unbalanced binary trees are O(N²) to build, and O(N) to probe.) Generally, but not always, log N implies log₂N, which means, roughly, the number of times you can partition a set in half, then partition the halves, and so on, while still having non-empty sets. Think powers of 2, but worked backwards.

2¹⁰ = 1024
log₂1024 = 10
The key thing to note is that log₂N grows slowly. Doubling N has a relatively small effect. Logarithmic curves flatten out nicely. It takes O(log N) time to probe a balanced binary tree, but building the tree is more expensive. If you're going to be probing a data set a lot, it pays to take the hit on construction to get fast probe time.
Quite often, when an algorithm's growth rate is characterized by some mix of orders, the dominant order is shown, and the rest are dropped. O(N²) might really mean O(N² + N).

This is quite nice as you can see it gives you an idea of the efficency from just the SHAPE of the curve an not the absolute values.

Now those nested loops are understandable in context and can be expressed as a O(n^2) or O(n^3) curve.

But what if your program is using a set of code from the Java collections package for example?
You don't want to go digging through it to just get an idea of the efficency.
Fortunatly there is a cheet sheet that you can wander over to and pull out the efficencies of those off the shelf fuctions you are using.

See: http://bigocheatsheet.com/.

For example you have a HashSet with 'n' elements that is compared to 'm' data itesm s what is the efficency?

The worst case search for a hashset is O(n) and if you are having to search it 'm' times you will ahave an efficency of m * O(n) or ... O(m*n).

Easy!

So how do you improve your effiency?

Apart from having a good look to see if you can improve your algoarithm it is now possible to get things done quicker if they are done in parallel.
If this is done by multiple machines it is known as "Scaling Out" or if it is done by increasing the power of your machine then it is known as "Scaling up" (also known respectivly as Scaling Horizontally or Scalling Vertically).

But if you add twice the threads doers it get twice as fast?

If you read the article on WIKIPEDIA (http://en.wikipedia.org/wiki/Scalability), you will reach the part on Amdahl's law.

This gives a nice equasion:

Where α (Alpha) is the percentage of the process that can be done in parralel,
While P is the number of processors.

The equation has the following properties:

As α tends to ZERO ... so nothing can be done in prallel ... the function tends to 1.
As α tends to ONE ... so it can all be done in prallel ... the function tends to P.

In the first case no number of extra processors will help you ... while in the other it is scalable and adding more processors will increase efficency.

The worked example on Wikipedia gives an case where 30% of the code ca be done in parallel and compares 4 vs 8 processors increasing efficency by 20% only!

The conclusion to all this is unless software devlopers improve their code efficency we are all just going to have to throw more 'tin' at the problem and only get a minor improvement for out money.

Mcnought's Random Musing and Mutterings

Wednesday, 26 February 2014

Cool tool: IKVM.NET

Tuesday, 18 February 2014

Cool Tip of the day: Java regex tester

Friday, 7 February 2014

Scalability

Vampirestat domain and site appraisal!