Welcome to ciysys blog

Analyze data with MSSQL SELECT statement

Published on: 31th May 2021

Updated on 31th May 2021

Overview

SELECT statement is very powerful which allows you to retrieve all records with proper filters and even analyze data.

To explain the SELECT statement, let's start with developing a simple system that is keeping track of attendance for members.

Table structure

The following table holds the attendance information for the member. To simplify the query, we will not include the member table. Let's focus on the attendance.

create table tb_attendance (
    attendance_id uniqueidentifier not null primary key
    , profile_id uniqueidentifier not null      -- links to the member table.
    , check_in_on datetime not null             -- the check in time.
    , check_out_on datetime                     -- the check out time. Fill in upon check out.
)

Sample data

insert into tb_attendance (attendance_id, profile_id, check_in_on, check_out_on) values
('7B7511A7-C4BC-40C2-BB06-961AAF5C263E', '3829BE4D-CCF1-4389-A024-336C06B53509', '2007-07-13 20:14:43.000', '2007-07-13 22:14:03.000')
, ('AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B', '2D0F3DDB-9131-4313-891E-A11DD8FC0298', '2007-07-13 14:47:42.000', '2007-07-13 19:04:54.000')
, ('5419770E-D7CD-482C-B3D6-FD89FCCA9789', 'FC5E926D-2417-46C0-BD06-86A2E3D5F257', '2007-07-13 14:50:37.000', '2007-07-13 21:15:05.000')
, ('873683FD-93AD-45C6-BC6D-DA547F71CCE9', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-13 18:58:23.000', '2007-07-13 19:04:49.000')
, ('08780963-BAB6-402C-92C1-0CBC3EFC0B70', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-13 19:32:49.000', '2007-07-13 20:25:46.000')
, ('B3532C0F-DF5E-4BEF-8247-8CE054839FC3', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-14 18:05:33.000', '2007-07-14 19:14:01.000')
, ('B755D38D-8C23-4A68-8CDA-D495DD43E0ED', 'FC5E926D-2417-46C0-BD06-86A2E3D5F257', '2007-07-14 17:23:08.000', '2007-07-14 19:11:15.000')    

Notes: in the above test data, the member of 65A1251F-A18D-402D-AA51-84466452A93B is having two attendances on 13th July. This is because the member has checked out for dinner and re-check in.

Queries

Show all records in the table

In any system, showing the full records to the user is a mandatory feature. This can be done easily with SELECT statement,

select *
from tb_attendance
where
    check_in_on between '2007-07-13' and '2007-07-13 23:59'
order by check_in_on

The query shows all the attendance for 13th July 2007 and the records are sorted by check in time. The * (asterisk symbol) means to retrieve all fields from the given table.

User requirement #1

Sounds simple when we learned the SELECT statement until we started facing the real user requirement which is challenging. Here's the requirement

Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the visit count for each date.

It does not sound complicated. So, let's try out this first query,

select
    date = convert(nvarchar, check_in_on, 102)
    , visit_count = count(*)
from tb_attendance
group by
    convert(nvarchar, check_in_on, 102)

The above query extracts the date value from the check_in_on field (in fact this field stores the date + time) and then counts the record.

date visit_count
2007.07.13 5
2007.07.14 2

But, the visit_count value on 13th July is wrong. The correct value should be 4 instead of 5. This is because the above query does not treat the multiple entries that happened on the same date as the same member as one visit.

Try the following new query looks which count by the distinct profile_id (i.e., member),

select
    date = convert(nvarchar, check_in_on, 102)
    , visit_count = count(distinct profile_id)
from tb_attendance
group by
    convert(nvarchar, check_in_on, 102)

And yes. This query works perfectly as it shows 4 visits on 13th July.

date visit_count
2007.07.13 4
2007.07.14 2

User requirement #2

Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the first attendance record for 13th July (ignore the re-entry).

To meet the above requirement, we need the help with ROW_NUMBER() OVER (..) keyword which is able to assign a running number to each group. Let's try from the simplest form of ROW_NUMBER().

First, the query will look like this:

select
    attendance_id
    , profile_id
    , row_idx = row_number()
                over (
                    order by profile_id, convert(nvarchar, check_in_on, 102)
                )
from tb_attendance
where
    check_in_on between '2007-07-13' and '2007-07-13 23:59'

Basically, the ROW_NUMBER() keyword assigns a running number for each record and we keep the running number in the row_idx field. The result will look like this,

attendance_id profile_id row_idx
7B7511A7-C4BC-40C2-BB06-961AAF5C263E 3829BE4D-CCF1-4389-A024-336C06B53509 1
873683FD-93AD-45C6-BC6D-DA547F71CCE9 65A1251F-A18D-402D-AA51-84466452A93B 2
08780963-BAB6-402C-92C1-0CBC3EFC0B70 65A1251F-A18D-402D-AA51-84466452A93B 3
5419770E-D7CD-482C-B3D6-FD89FCCA9789 FC5E926D-2417-46C0-BD06-86A2E3D5F257 4
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B 2D0F3DDB-9131-4313-891E-A11DD8FC0298 5

But it is far from meeting the user requirement.

Let's try to add the PARTITION BY keyword into the ROW_NUMBER() section. The record will be partitioned based on profile_id and check in date. This keyword indicates that the running number will be reset to 1 whenever profile_id and/or check in date changed.

select
    attendance_id
    , profile_id
    , row_idx = row_number()
                over (
                    partition by profile_id, convert(nvarchar, check_in_on, 102)
                    order by profile_id, convert(nvarchar, check_in_on, 102)
                )
from tb_attendance
where
    check_in_on between '2007-07-13' and '2007-07-13 23:59'

For the member of '65A1251F-A18D-402D-AA51-84466452A93B' who has two attendance records. The row_idx value for the first record is 1 and the second record is 2. The result is very close to the user requirement except that we need to find a way to get rid of the row_idx=2.

attendance_id profile_id row_idx
7B7511A7-C4BC-40C2-BB06-961AAF5C263E 3829BE4D-CCF1-4389-A024-336C06B53509 1
873683FD-93AD-45C6-BC6D-DA547F71CCE9 65A1251F-A18D-402D-AA51-84466452A93B 1
08780963-BAB6-402C-92C1-0CBC3EFC0B70 65A1251F-A18D-402D-AA51-84466452A93B 2
5419770E-D7CD-482C-B3D6-FD89FCCA9789 FC5E926D-2417-46C0-BD06-86A2E3D5F257 1
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B 2D0F3DDB-9131-4313-891E-A11DD8FC0298 1

That's easy. Basically, the user requirements means that we should return the record if row_idx=1. We made the previous query as a sub-query and added a filter in the main query with row_idx=1.

select *
from (
    select
        attendance_id
        , profile_id
        , row_idx = row_number()
                    over (
                        partition by profile_id, convert(nvarchar, check_in_on, 102)
                        order by profile_id, convert(nvarchar, check_in_on, 102)
                    )
    from tb_attendance
    where
        check_in_on between '2007-07-13' and '2007-07-13 23:59'
) as a
where
    a.row_idx = 1

Now, the result has 4 records and the member of '65A1251F-A18D-402D-AA51-84466452A93B' is having 1 record only.

attendance_id profile_id row_idx
7B7511A7-C4BC-40C2-BB06-961AAF5C263E 3829BE4D-CCF1-4389-A024-336C06B53509 1
873683FD-93AD-45C6-BC6D-DA547F71CCE9 65A1251F-A18D-402D-AA51-84466452A93B 1
5419770E-D7CD-482C-B3D6-FD89FCCA9789 FC5E926D-2417-46C0-BD06-86A2E3D5F257 1
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B 2D0F3DDB-9131-4313-891E-A11DD8FC0298 1

So, we solved the second user requirement.

User requirement #3

Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the number of members during the business operation hours for the given date.

This requirement sounds easy but there is a catch! It was asking for the number of members in the facility in operation hour but not the check in hour.

If you wrote the query like the one below, then, you have answered the question wrongly. The following query is showing the number of members who checked in at a particular hour.

select
    hour = datepart(hour, check_in_on)
    , check_in_count = count(*)
from tb_attendance
where
    check_in_on between '2007-07-13' and '2007-07-13 23:59'
group by
    datepart(hour, check_in_on)

The result will look like this. It is not what the user wants!

hour check_in_count
14 2
18 1
19 1
20 1

The correct query that shows the number of members in the facility during the operation hours. Let's say, the operation is between 10am to 9pm.

-- declare a 'timezone' CTE
; with tz (hour)
as (
    select hour = 10
    union all
    select hour + 1
    from tz
    where
        hour + 1 < 21
),

-- declare the attendance hours
att (in_hour, out_hour)
as (
    select
        in_hour = datepart(hour, check_in_on)
        , out_hour = datepart(hour, check_out_on)
    from tb_attendance
    where
        check_in_on between '2007-07-13' and '2007-07-13 23:59'
)
select
    tz.hour
    -- count the number of member who is in the facility
    , member_at_gym = (
        select count(*)
        from att
        where
            tz.hour between att.in_hour and att.out_hour
    )
from tz

Where

User requirement #4

Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: show the number of visits for each member.

If the rule does not exist, we may use the following query to retrieve the result.

select
    profile_id
    , visit_count = count(*)
from tb_attendance
group by
    profile_id

and the result is,

profile_id visit_count
3829BE4D-CCF1-4389-A024-336C06B53509 1
65A1251F-A18D-402D-AA51-84466452A93B 3
FC5E926D-2417-46C0-BD06-86A2E3D5F257 2
2D0F3DDB-9131-4313-891E-A11DD8FC0298 1

Since, there is a rule said that multiple entries on the same day will be count as one, then, the correct query should look like the following:

select
    profile_id
    , visit_count = count(*)
from (
    select distinct
        profile_id
        , check_in_date =  convert(nvarchar, check_in_on, 102)
    from tb_attendance
) as a
group by
    profile_id

First, we retrieve the distinct record that contains the profile_id field and extract the date value (without time value) from the check_in_on field in the sub-query. After that, we return the summary based on the result of the sub-query. Then, the result will look like this,

profile_id visit_count
3829BE4D-CCF1-4389-A024-336C06B53509 1
65A1251F-A18D-402D-AA51-84466452A93B 2
FC5E926D-2417-46C0-BD06-86A2E3D5F257 2
2D0F3DDB-9131-4313-891E-A11DD8FC0298 1

Related posts

Jump to #MSSQL blog

Author

Lau Hon Wan, software developer.