Tag Archive: learn

The Reinvigorated Programmer (Mike Taylor) quotes from Jon Bentley’s 1986 classic Programming Pearls:

Only 10% of programmers can write a binary search

Shocking indeed

Better yet, he has a coding challenge for you to take part in (post answers as comments):

Here are the rules:

  1. Use whatever programming language you like.
  2. No cutting, pasting or otherwise copying code.  Don’t even look at other binary search code until you’re done.
  3. I need hardly say, no calling bsearch(), or otherwise cheating :-)
  4. Take as long as you like — you might finish, and feel confident in your code, after five minutes; or you’re welcome to take eight hours if you want (if you have the time to spare).
  5. You’re allowed to use your compiler to shake out mechanical bugs such as syntax errors or failure to initialise variables, but …
  6. NO TESTING until after you’ve decided your program is correct.
  7. Finally, the most important one: if you decide to begin this exercise, then you must report — either to say that you succeeded, failed or abandoned the attempt.  Otherwise the figures will be skewed towards success.

If you haven’t yet, try it. Read just the article, but do not look at the others’ comments just yet.

After you have attempted this, there is his redux, on why coding exercises like this matter; and his redux of the redux, on why you should not test while attempting exercises like this. They are all worth a good read.

In the previous posts, on socket programming and JDBC, the focus has been on programming for the server. This post will focus on client-side programming: applets and web start – with particular focus on the latter.
Continue reading

How to connect to a relational database from a Java program using Java Database Connectivity (JDBC); the JDBC API; some basic SQL; Object-Relational Mapping -all put together with a code example
Continue reading

Here is a quick introduction to socket programming in Java – using a simple demonstration of client-server communication. For the purposes of the demonstration, both the client and the server will be on the same physical machine (“localhost”). However, this can be easily changed later on.
Continue reading

How do we compare the relative performance of several data mining models? Previously, we discussed some basic model evaluation methods and metrics. Now we delve into more of them: ROC curves, Kappa statistic, mean square error, relative squared error, mean absolute error, and relative absolute error are the various metrics used, discussed below.
Continue reading

Building data mining models

In order to build the models used in data mining, one will need a set of data for training the learning algorithm, and then another set to evaluate the model built by the learning algorithm. Some basic evaluation methods and metrics are explored. Additionally, advanced techniques such as boosting and bagging may be applied to improve accuracy.
Continue reading

The transformation process

An operational database is transformed into a data warehouse through the following process:

Identify the facts – What are the facts that need to be obtained; the wanted results from the queries

Identify the dimensions – What are the dimensions (criteria) along which the facts should be filtered

Draw a star schema – Design a new database which consists of a central fact table which has a many-to-one relationship with each of the dimension tables

Identify the dimensions’ attributes – Each dimension should have more than one possible value; what are the characteristics (attributes) that make them different

SQL – Write the structured query language that will perform the actual transformation planned for above

Operational DB vs. Data warehouse

The operational database is the traditional relational database, with tables related to each other by use of foreign keys, and usually normalised until redundant data is minimised. This is useful when one wants to record “live” information such as transactions or logs. However, when one wants to mine the data for analytical purposes – for example, which time window of the day are the most sales made – and you want to do this over an extremely large data set – how should this be done? It is possible to construct a somewhat complex query to get the answer, but in cases such as these, the way in which the data warehouse is structured is optimised to obtain such information. However, before that is done, the operational data first needs to be transformed and put into a data warehouse.

Transformation SQL

I will not go into detail about the planning steps here, and instead will get right into the steps necessary to achieve the transformation.

1. Create the dimension tables

There are 3 ways to create dimension tables:

i) Copy a table wholesale from the operational database. This can be done for some tables, which fit the bill of the dimension tables – they usually have the same name as the identified dimension, plus the attributes identified for the dimension are the columns in this table in the operational database.

ii) Create tables from scratch, manually inserting values. This is done for dimensions which do not exist in the operational database at all. For example a time of day dimension for data that should be sliced according to morning, afternoon or night. In most operational databases, the tables will not have columns that have this information about a time/ date value (at least in normalised ones).

Create Table periodDim --create a period dimension table
(period_id Number,
period_desc Varchar(20));
Insert Into periodDim Values (1, 'Summer');
Insert Into periodDim Values (2, 'Autumn');
Insert Into periodDim Values (3, 'Winter');
Insert Into periodDim Values (4, 'Spring');

iii) Extract values from a operational database table using `Select Distinct`. This is done when the data exists in some form in the operational data, however, cannot be used as-is, and must be extracted in a special manner. This occurs in the converse situation as in (ii), where data could have been potentially further normalised, however values were used directly.

Create Table source_dim
As (Select Distinct
Dense_Rank() Over (Order By o.order_source) As source_id,
o.order_source As source_desc
From op_db.order o);

Note that here the `Dense_Rank() Over`, which is normally used for OLAP purposes, is used here to perform the simple function of assigning unique primary keys for this dimension table. Why `Dense_Rank` instead of `Rank`? You could really use either without issue, just that `Dense_Rank` will give running numbers without gaps, whereas, `Rank` would result in gaps, as a result of being used together with `Distinct`.

2. Create a temporary fact table

There are two main steps to this: First create the actual table by joining the necessary tables from the operational database. Secondly, create the links from the fact table to the dimension tables.

i) The first step is a simple join. Select all the attributes that are needed to obtain the attributes in the fact table. This would include the values that are needed to determine how it relates to each dimension, plus any of the “fact” attributes, which are usually aggregate values. “Aggregate” used here refers to a single figure that represents are larger set of figures: sum, count, average, minimum, maximum, etc.

ii) The next step is to add columns for the primary keys of the dimension tables. For example, the temp table may already have a “purchase date” column, and your dimension table has four entries, one for each season of the year. There are two ways to do this:

--manually insert the period IDs
Update temp Set period_id=1 --summer
Where (to_char(order_date, 'MON') In ('DEC', 'JAN', 'FEB'));
Update temp Set period_id=2 --autumn
Where (to_char(order_date, 'MON') In ('MAR', 'APR', 'MAY'));
Update temp Set period_id=3 --winter
Where (to_char(order_date, 'MON') In ('JUN', 'JUL', 'AUG'));
Update temp Set period_id=4 --spring
Where (to_char(order_date, 'MON') In ('SEP', 'OCT', 'NOV'));
--automatically parse period IDs
Update temp t
Set t.period_id=(
Select pd.period_id
From periodDim pd
Where ((to_date(to_char(t.order_date, 'MMDD'), 'MMDD') <= pd.end_date) And
(to_date(to_char(t.order_date, 'MMDD'), 'MMDD') >= pd.start_date)

Note that the latter involves an additional step where the “start_date” and “end_date” columns are defined for the period dimension table. In this case, since there are a finite number of seasons (four), there is a choice – either method will work fine. However, when there is a theoretically infinite number of possible entries in a dimension table, or if there is a finite but unmanageably large number of such entries, then the latter method is the only practical way to do this.

Also, note that I have converted a date type to a string, and then back again to a date type – this has been done on purpose to scrub the date values such that only their month and day values are used. Otherwise, you would be comparing apples to oranges.

3. Create the fact table from the template

This is the easiest step of them all (including the planning steps), because if everything has been done right up until now, then this is trivial. However, if something has not quite gone right in the previous steps, this is probably where you will notice and then have to backtrack. We just copy the fact table, but instead of select all columns, we select only the relevant ones (identified in the star schema). These should be the primary keys of the dimension tables, plus the aggregate values that are the facts being presented by the fact table.

Create Table fact
As (
Select Count(t.cust_id) As num_cust, --aggregate values
Count(t.order_id) As num_order,
Sum(t.item_price * t.order_qty) As order_cost,
t.source_id, t.period_id, t.locationi_d -- the PK's of the dimension table
From temp t
Group By t.source_id, t.period_id, t.location_id -- group by ALL of the non-aggregate values

A note on PKs and FKs

Primary keys and foreign keys are necessary in operational database because, in addition to them being normalised, they need to support all four types of operations (Create, Read, Update and Delete). Whereas, in a data warehouse, that is not the case – after creation, a data warehouse should only be used for Read operations.