Introduction

This project uses R and exploratory data analysis techniques to explore a publicly available dataset about Wine Quality. The paper covering the dataset is available at Elsevier, and a short description of the available variables and their meanings is found in this description file.

The dataset contains several physicochemical attributes of red variants of the Portuguese “Vinho Verde” wine and a sensory classification made by wine experts.

Data analysis and exploration

Univariate Plots Section

First things first. Let's have a glimpse at the data.

## Observations: 1,599
## Variables: 12
## $ fixed.acidity        (dbl) 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     (dbl) 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          (dbl) 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       (dbl) 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            (dbl) 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  (dbl) 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide (dbl) 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              (dbl) 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   (dbl) 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            (dbl) 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              (dbl) 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              (fctr) 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5...

There are 12 variables and 1,599 observations. All variables are numerical except for the quality score, which is represented as an ordered factor.
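As a minimal sketch of how the quality score could be turned into an ordered factor, the snippet below rebuilds the first six rows shown in the glimpse output (a subset of columns only). The file name in the comment is an assumption, not something taken from the original analysis.

```r
# First six rows of a few columns, copied from the glimpse output above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  quality          = c(5, 5, 5, 6, 5, 5)
)
# Assumption: the full data would instead be read from disk, e.g.
# wines <- read.csv("wineQualityReds.csv")  # hypothetical file name

# Scores observed in the dataset range from 3 to 8, so those are the levels.
wines$quality <- factor(wines$quality, levels = 3:8, ordered = TRUE)

str(wines)
```

Declaring the levels explicitly keeps all six quality classes in tables and plots, even when a subset of the data contains only some of them.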

Quality

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The distribution of quality appears roughly normal, with most wines at average quality (5-6) and fewer wines at low and high quality. There are no wines with a quality lower than 3 and none with a quality higher than 8.
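To put numbers on this, the counts from the table above can be converted into percentages. This is only a sketch that re-enters the counts by hand; with the full data frame (assumed here to be called `wines`), `table(wines$quality)` would give the same counts directly.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)

# Percentage of wines in each quality class.
share <- round(100 * quality_counts / total, 1)
print(share)

# Fraction of wines rated 5 or 6, the two most frequent scores.
middle_share <- sum(quality_counts[c("5", "6")]) / total
print(middle_share)
```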

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The median fixed acidity of the wines in the dataset is 7.90 \(g/dm^3\). Half of the wines have a fixed acidity between 7.10 and 9.20 \(g/dm^3\) (the first and third quartiles). The distribution of fixed acidity is slightly right skewed. There are some outliers in the higher range (above roughly 15 \(g/dm^3\)).
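The skew and the outliers can be checked roughly from the summary values alone. The sketch below uses Tukey's 1.5 × IQR rule, which is one common convention and not necessarily the criterion behind the original plots; `upper_fence` is a hypothetical helper name.

```r
# Summary values copied from the output above.
fixed_acidity <- c(min = 4.60, q1 = 7.10, median = 7.90,
                   mean = 8.32, q3 = 9.20, max = 15.90)

# Hypothetical helper: upper outlier fence, Q3 + 1.5 * IQR (Tukey's rule).
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

fence <- upper_fence(fixed_acidity[["q1"]], fixed_acidity[["q3"]])

# A mean above the median is a hint (not proof) of right skew.
mean_above_median <- fixed_acidity[["mean"]] > fixed_acidity[["median"]]

# Does the largest value lie beyond the fence?
max_is_outlier <- fixed_acidity[["max"]] > fence

print(c(fence = fence))
print(c(mean_above_median = mean_above_median, max_is_outlier = max_is_outlier))
```

The same helper could be applied to the summaries of the other variables, since they are all reported in the same form.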

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile acidity is asymmetric and bimodal, with two peaks at 0.4 and 0.6. The median value is 0.52. The middle half of the observations fall in the range 0.39 - 0.64, and outliers are visible on the higher end of the scale.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The most common citric acid value is 0 \(g/dm^3\). This acid is always found in very small quantities. The distribution is right skewed with some ups and downs. We can see peaks at 0.250 and 0.500, which may hint at some bimodal behavior. A single wine appears far away on the right side with 1 \(g/dm^3\) of citric acid.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual sugar has a median value of 2.2 \(g/dm^3\). The distribution is right skewed with a long tail on the right side. There are many small bars on the right side of the main peak.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The amount of chlorides in the wines has a median value of 0.079 \(g/dm^3\). The distribution looks normal around its main peak but has a very long right tail, with small counts of wines with values up to 0.611 \(g/dm^3\).

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free sulfur dioxide concentrations is right skewed. The median value is 14 \(mg/dm^3\). The right tail extends to a maximum of 72 \(mg/dm^3\), with a gap between 57 and 66.

Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution of total sulfur dioxide is right skewed with a median value of 38 \(mg/dm^3\). On the right tail we can see a local maximum near 80. There’s a gap between 165 and 278 with only two wines with a concentration greater than or equal to 278.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density of the wines varies little, with the middle half of the values between 0.9956 and 0.9978. The distribution is almost symmetric and has a median value of 0.9968 \(g/cm^3\). The density is close to the density of water (1 \(g/cm^3\) at 4 \(^\circ C\)).

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

All wines have low pH. This makes sense, since acids are produced through the fermentation process. The distribution seems symmetrical, or could also be considered bimodal with both peaks very close to each other. There seems to be a local maximum at around 3.2 and then another one at 3.35. The median value is 3.31, and the middle half of the wines have a pH between 3.21 and 3.4.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is slightly right skewed. Some outliers on the right tail have around 2 \(g/dm^3\) of sulphates. The median value of sulphates is 0.62 \(g/dm^3\), and the middle half of the wines have a concentration between 0.55 and 0.73 \(g/dm^3\).

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol concentration distribution is right skewed. There seems to be a natural border on the left side. Maybe a minimum amount of alcohol needs to be present for the drink to be considered a wine? The highest peak of the distribution is at 9.5% alcohol and the median value is 10.20%. The maximum alcohol concentration in the dataset is 14.90%.

Univariate Analysis

What is the structure of your dataset?

The dataset has 12 variables and 1,599 observations. Each observation corresponds to a red wine sample. Eleven variables are the results of physicochemical tests and one variable (quality) is the result of a sensory panel rating.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality rating.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think all the physicochemical test results may help support the investigation. All of them are related to characteristics which may affect the flavor of the wine. They correspond to concentration of molecules which may have an impact on taste. Density is a physical property which will depend on the percentage of alcohol and sugar content, which will also affect taste.

Some variables may be strongly correlated with each other. For instance, the pH will depend on the amount of acid molecules, while total sulfur dioxide may follow a distribution similar to that of free sulfur dioxide.
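A quick way to check such a relationship is the Pearson correlation coefficient. The sketch below uses only the first ten values of each variable as they appear in the glimpse output, so the result illustrates the call and is not a conclusion about the whole dataset.

```r
# First ten values of each variable, copied from the glimpse output.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation between the two concentrations.
# Assumption: with the full data frame this would be
# cor(wines$free.sulfur.dioxide, wines$total.sulfur.dioxide)
r <- cor(free_so2, total_so2)
print(r)
```

The same call with `method = "spearman"` would measure a monotonic rather than a strictly linear relationship, which may suit the skewed distributions seen above.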

Did you create any new variables from existing variables in the dataset?

No new variables were created in the dataset.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were no unusual distributions, no missing values and no need to adjust the data. The dataset is already tidy, which makes it an ideal dataset for a learning project such as this one.
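A check for missing values can be sketched as follows. The data frame here holds only the first six rows of three columns from the glimpse output; on the full data the same two calls would apply unchanged.

```r
# Small stand-in for the full data frame (first six rows, three columns).
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(wines))
print(missing_per_column)

# TRUE if any value anywhere in the data frame is missing.
any_missing <- any(is.na(wines))
print(any_missing)
```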

Bivariate Plots Section

Fixed Acidity vs. Quality

## [1] "Median of fixed.acidity by quality:"
## wines$quality: 3
## [1] 7.5
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 7.5
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 7.8
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 7.9
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 8.8
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 8.25

We see a very slight upward trend of higher quality with higher fixed acidity. However, the extreme quality classes (3 and 8) have fewer observations than the middle ones, which may make their median values less reliable. We also see a drop in acidity from quality class 7 to quality class 8. Additionally, the acidity values are widely dispersed within each quality class. This may be an indicator that quality cannot be predicted from the value of acidity alone and is the result of a combination of more variables.
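Per-class medians like the ones above can be computed with `by()`, or with `tapply()` when a plain named vector is more convenient. The sketch below runs on the first six rows from the glimpse output only, so its medians differ from those of the full dataset.

```r
# First six rows of fixed acidity and quality from the glimpse output.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  quality       = factor(c(5, 5, 5, 6, 5, 5))
)

# by() prints one block per quality class.
print(by(wines$fixed.acidity, wines$quality, median))

# tapply() returns the same medians as a named vector.
medians <- tapply(wines$fixed.acidity, wines$quality, median)
print(medians)

# Class sizes, useful for judging how reliable each median is.
counts <- table(wines$quality)
print(counts)
```

Printing the class sizes next to the medians makes the caveat about the sparsely populated extreme classes easy to verify.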

Volatile Acidity vs. Quality

## [1] "Median of volatile.acidity by quality:"
## wines$quality: 3
## [1] 0.845
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.67
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.58
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.49
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.37
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.37

Keeping in mind the same limitations noted for fixed acidity (extreme classes with fewer observations and variability within each quality class), we can see a more obvious trend. Lower volatile acidity seems to mean higher wine quality.
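The difference between the two trends can be made explicit from the medians reported above for fixed and volatile acidity. This is a rough sketch: it uses six medians per variable and ignores both the class sizes and the spread inside each class. The helper names are hypothetical.

```r
quality <- 3:8

# Medians by quality class, copied from the outputs above.
fixed_medians    <- c(7.5, 7.5, 7.8, 7.9, 8.8, 8.25)
volatile_medians <- c(0.845, 0.67, 0.58, 0.49, 0.37, 0.37)

# Hypothetical helpers: does a sequence never go up / never go down?
never_increases <- function(x) all(diff(x) <= 0)
never_decreases <- function(x) all(diff(x) >= 0)

print(never_decreases(fixed_medians))     # fixed acidity drops from class 7 to 8
print(never_increases(volatile_medians))  # volatile acidity never rises

# Spearman rank correlation between quality class and median value.
rho_fixed    <- cor(quality, fixed_medians, method = "spearman")
rho_volatile <- cor(quality, volatile_medians, method = "spearman")
print(c(fixed = rho_fixed, volatile = rho_volatile))
```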

Citric Acid vs. Quality

## [1] "Median of citric.acid by quality:"
## wines$quality: 3
## [1] 0.035
## -------------------------------------------------------- 
## wines$quality: 4
## [1] 0.09
## -------------------------------------------------------- 
## wines$quality: 5
## [1] 0.23
## -------------------------------------------------------- 
## wines$quality: 6
## [1] 0.26
## -------------------------------------------------------- 
## wines$quality: 7
## [1] 0.4
## -------------------------------------------------------- 
## wines$quality: 8
## [1] 0.42