SiteKickr Web Development

Constructing complex SQL queries

As a web programmer, not a day goes by without needing to touch SQL in some form. Some days you're lucky enough to be modeling a schema, but other days leave you struggling to piece together a query that's over 100 lines long, just to return one number! There are an unlimited number of resources on creating quality, normalized database schemas, but unfortunately, not near as many dealing with breaking down and creating a complex query.

The software development process has a very extensive lifecycle, starting with project definition & requirements, then stepping through architecture and high-level design. Eventually, you get to the pseudocode phase, which happens right before you start coding. Constructing a single, complex SQL query can follow a somewhat streamlined version of that process.

Definition

In most cases, this step is simple, and the definition is provided to you. A few days ago, I had a client of mine ask me to create a report that determines customer value. There it is, the query's meaning is defined. I'll use this customer value example for the remaining steps.
 

Requirements

The requirements get a little more detailed. While the query definition will typically be only a sentence long, the requirements will provide more detail, including:

 

High-Level Design / Pseudocode

In the case of a database query, at this point, you should be in good shape to start creating some pseudocode. It will no doubt be a waste of your time to step through every piece of the software lifecycle for a single query. The software cycle was designed for large projects. Individual components of that project don't require a reiteration through the entire process. Designing a query is a lot like creating a function or method. Once you know that function's purpose, you can jump into the pseudocode process. Creating pseudocode for a query may not be a concept you're used to, but why should a query be any different? To start, you can define the basic components that make up the query.

  1. Determine what fields from which tables need to be returned (selected)
  2. Add FROM statement, using table with most selected fields first
  3. Add JOINs for remaining tables that have returned fields
  4. Create GROUP clause (if needed)
  5. Create WHERE clause
  6. Examine the WHERE clause for tables which have not yet been JOINED, then JOIN them as necessary
  7. Add LIMIT and ORDER clauses last, as they are least coupled with the other clauses

Let's use the pieces identified above to write some pseudocode for the requirements defined above:

SELECT customer first name, customer last name, phone number, email address, total orders, years since registered, orders per year
FROM customers, orders
GROUP BY customer
HAVING order count > 1
ORDER BY order per year DESC

Everything here looks straightforward, except orders per year. Writing pseudocode often reveals areas that need further definition. So, we decide to define orders per year as follows:

Years = The difference, in days, between today and the day the customer signed up to the site, divided by 365.
Orders Per Year = Total Number of Orders, divided by Years

 

The Moment of Truth

It's time to write the query. I use "Moment of Truth", because this is where we find out how well our high-level design / pseudocode was thought out.

If you're working from a command-line, good luck. You may be a hardcore, life-time Linux administrator, but the command prompt is no place to test multi-line queries (or anything multi-line for that matter). Open up your favorite database workbench/query tool. I personally use PHPMyAdmin. I like it because it sits on a remote server, so I can access it from any PC. If you do take this approach, make sure to secure your access to PHPMyAdmin via SSL (create a self-signed SSL certificate if you have to), and consider restricting access only to your IP address (use Apache Allow, Deny directives).

SELECT u.userid, u.firstname, u.lastname, u.email, count(1) as orders,
       (DateDiff(now(), u.ts_created) / 365) as years,
       Round((count(1) / (DateDiff(now(), u.ts_created) / 365)), 1) as orders_per_year
FROM users u
INNER JOIN orders o
        ON o.userid = u.userid
GROUP BY u.userid
HAVING count(1) > 1
ORDER BY orders_per_year DESC
LIMIT 100