SQL Basics: Relational Databases
A relational database is a database that stores related information across multiple tables and allows you to query information in more than one table at the same time.
It's easier to understand how this works by thinking through an example. Imagine you're a business and you want to keep track of your sales information. You could set up a spreadsheet in Excel with all of the information you want to keep track of as separate columns: Order number, date, amount due, shipment tracking number, customer name, customer address, and customer phone number.
This setup would work fine for tracking the information you need to begin with, but as you start to get repeat orders from the same customer you'll find that their name, address and phone number gets stored in multiple rows of your spreadsheet.
As your business grows and the number of orders you're tracking increases, this redundant data will take up unnecessary space and generally decrease the efficiency of your sales tracking system. You might also run into issues with data integrity. There's no guarantee, for example, that every field will be populated with the correct data type or that the name and address will be entered exactly the same way every time.
With a relational database, like the one in the above diagram, you avoid all of these issues. You could set up two tables, one for orders and one for customers. The 'customers' table would include a unique ID number for each customer, along with the name, address and phone number we were already tracking. The 'orders' table would include your order number, date, amount due, tracking number and, instead of a separate field for each item of customer data, it would have a column for the customer ID.
This enables us to pull up all of the customer info for any given order, but we only have to store it once in our database rather than listing it out again for every single order.
Our Data Set
Let's start by taking a look at our database. The database has two tables, trips
and stations
. To begin with, we'll just look at the trips
table. It contains the following columns:
id
— A unique integer that serves as a reference for each trip
duration
— The duration of the trip, measured in seconds
start_date
— The date and time the trip began
start_station
— An integer that corresponds to the id
column in the stations
table for the station the trip started at
end_date
— The date and time the trip ended
end_station
— The 'id' of the station the trip ended at
bike_number
— Hubway's unique identifier for the bike used on the trip
sub_type
— The subscription type of the user. "Registered"
for users with a membership, "Casual"
for users without a membership
zip_code
— The zip code of the user (only available for registered members)
birth_date
— The birth year of the user (only available for registered members)
gender
— The gender of the user (only available for registered members)
Our Analysis
With this information and the SQL commands we'll learn shortly, here are some questions that we'll try to answer over the course of this post:
- What was the duration of the longest trip?
- How many trips were taken by 'registered' users?
- What was the average trip duration?
- Do registered or casual users take longer trips?
- Which bike was used for the most trips?
- What is the average duration of trips by users over the age of 30?
The SQL commands we'll use to answer these questions are:
SELECT
WHERE
LIMIT
ORDER BY
GROUP BY
AND
OR
MIN
MAX
AVG
SUM
COUNT
Comments
Post a Comment