Reading Matrix Rows and Columns From a CSV File in R
Working with data in a matrix
Loading data
Our example data is quality measurements (particle size) on PVC plastic production, using eight different resin batches and three different machine operators.
The data set is stored in comma-separated value (CSV) format. Each row is a resin batch, and each column is an operator. In RStudio, open pvc.csv and have a look at what it contains.
read.csv("data/intro-r/pvc.csv", row.names=1)
We have called read.csv with two arguments: the name of the file we want to read, and which column contains the row names. The filename needs to be a character string, so we put it in quotes. Setting the second argument, row.names, to 1 indicates that the data file has row names, and which column number they are stored in. If we don't specify row.names, the result will not have row names.
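To make the effect of row.names concrete, here is a minimal sketch that reads a tiny two-row CSV from a string rather than from pvc.csv. The header name "resin" for the first column is an assumption made for illustration; the real file's header is not shown here.

```r
# A tiny stand-in for pvc.csv; the "resin" header name is hypothetical.
csv_text <- "resin,Alice,Bob\nResin1,36.25,35.40\nResin2,35.15,35.35"

# With row.names = 1 the first column becomes the row names.
with_names <- read.csv(text = csv_text, row.names = 1)

# Without row.names the first column stays an ordinary column
# and R numbers the rows instead.
without_names <- read.csv(text = csv_text)

rownames(with_names)
rownames(without_names)
ncol(with_names)
ncol(without_names)
```

In the first case the resin labels are available for indexing by name; in the second they are just another column of data.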
dat <- read.csv("data/intro-r/pvc.csv", row.names=1)
dat
##        Alice   Bob  Carl
## Resin1 36.25 35.40 35.30
## Resin2 35.15 35.35 33.35
## Resin3 30.70 29.65 29.20
## Resin4 29.70 30.05 28.65
## Resin5 31.85 31.40 29.30
## Resin6 30.20 30.65 29.75
## Resin7 32.90 32.50 32.80
## Resin8 36.80 36.45 33.15
class(dat)
## [1] "data.frame"
str(dat)
## 'data.frame': 8 obs. of 3 variables:
##  $ Alice: num 36.2 35.1 30.7 29.7 31.9 ...
##  $ Bob  : num 35.4 35.4 29.6 30.1 31.4 ...
##  $ Carl : num 35.3 33.4 29.2 28.6 29.3 ...
read.csv has loaded the data as a data frame. A data frame contains a collection of "things" (rows), each with a set of properties (columns) of different types.
Actually this data is better thought of as a matrix. In a data frame the columns contain different types of data, but in a matrix all the elements are the same type of data. A matrix in R is like a mathematical matrix, containing all the same type of thing (usually numbers).
R often, but not always, lets these be used interchangeably. It's also helpful when thinking about data to distinguish between a data frame and a matrix. Different operations make sense for data frames and matrices.
Data frames are very central to R, and mastering R is very much about thinking in data frames. However, when we get to RNA-Seq we will be using matrices of read counts, so it will be worth our time to learn to use matrices as well.
Let us insist to R that what we have is a matrix. as.matrix "casts" our data to have matrix type.
mat <- as.matrix(dat)
class(mat)
## [1] "matrix"
str(mat)
##  num [1:8, 1:3] 36.2 35.1 30.7 29.7 31.9 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "Resin1" "Resin2" "Resin3" "Resin4" ...
##   ..$ : chr [1:3] "Alice" "Bob" "Carl"
Much better.
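The "all elements are the same type" rule is easiest to see by casting a data frame whose columns differ in type. The sketch below uses two small made-up data frames (the names mixed and numeric_only are illustrative, not part of the lesson data).

```r
# A data frame may mix column types: here one character and one numeric column.
mixed <- data.frame(operator = c("Alice", "Bob"), size = c(36.25, 35.40))
sapply(mixed, class)

# Casting to a matrix forces a single type, so the numbers become text.
mixed_mat <- as.matrix(mixed)
typeof(mixed_mat)

# When every column is numeric, the matrix stays numeric.
numeric_only <- data.frame(Alice = c(36.25, 35.15), Bob = c(35.40, 35.35))
numeric_mat <- as.matrix(numeric_only)
typeof(numeric_mat)
```

This is why as.matrix works cleanly on the PVC data: every column holds particle sizes, so nothing has to be converted to text.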
Indexing matrices
We can check the size of the matrix with the functions nrow and ncol:
nrow(mat)
## [1] 8
ncol(mat)
## [1] 3
This tells us that our matrix, mat, has 8 rows and 3 columns.
If we want to get a single value from the matrix, we can provide a row and column index in square brackets:
# first value in mat
mat[1, 1]
## [1] 36.25
# a middle value in mat
mat[4, 2]
## [1] 30.05
If our matrix has row names and column names, we can also refer to rows and columns by name.
mat["Resin4", "Bob"]
## [1] 30.05
An index like [4, 2] selects a single element of a matrix, but we can select whole sections as well. For example, we can select the first two operators (columns) of values for the first four resins (rows) like this:
mat[1:4, 1:2]
##        Alice   Bob
## Resin1 36.25 35.40
## Resin2 35.15 35.35
## Resin3 30.70 29.65
## Resin4 29.70 30.05
The slice 1:4 means the numbers from 1 to 4. It's the same as c(1,2,3,4), and doesn't need to be used within [ ].
1:4
## [1] 1 2 3 4
The slice does not need to start at 1, e.g. the line below selects rows five through eight:
mat[5:8, 1:2]
##        Alice   Bob
## Resin5 31.85 31.40
## Resin6 30.20 30.65
## Resin7 32.90 32.50
## Resin8 36.80 36.45
We can use vectors created with c to select non-contiguous values:
mat[c(1,3,5), c(1,3)]
##        Alice Carl
## Resin1 36.25 35.3
## Resin3 30.70 29.2
## Resin5 31.85 29.3
We also don't have to provide an index for either the rows or the columns. If we don't include an index for the rows, R returns all the rows; if we don't include an index for the columns, R returns all the columns. If we don't provide an index for either rows or columns, e.g. mat[, ], R returns the full matrix.
# All columns from row 5
mat[5, ]
## Alice   Bob  Carl
## 31.85 31.40 29.30
# All rows from column 2
mat[, 2]
## Resin1 Resin2 Resin3 Resin4 Resin5 Resin6 Resin7 Resin8
##  35.40  35.35  29.65  30.05  31.40  30.65  32.50  36.45
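For readers without pvc.csv to hand, the indexing forms above can be tried on a matrix built directly from the values printed earlier. This is only a sketch: constructing mat with matrix() is a substitute for reading the file, and the final drop = FALSE line is an aside that the lesson itself does not cover.

```r
# Rebuild the PVC matrix from the printed values (stand-in for reading pvc.csv).
mat <- matrix(
  c(36.25, 35.40, 35.30,
    35.15, 35.35, 33.35,
    30.70, 29.65, 29.20,
    29.70, 30.05, 28.65,
    31.85, 31.40, 29.30,
    30.20, 30.65, 29.75,
    32.90, 32.50, 32.80,
    36.80, 36.45, 33.15),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

mat[4, 2]                  # single element by position
mat["Resin4", "Bob"]       # the same element by name
mat[1:4, 1:2]              # a contiguous block
mat[c(1, 3, 5), c(1, 3)]   # non-contiguous rows and columns
mat[5, ]                   # a whole row, returned as a named vector

# Aside: drop = FALSE keeps a one-row selection as a matrix.
row_5 <- mat[5, , drop = FALSE]
is.matrix(row_5)
```

Indexing by name tends to be more robust than indexing by position when rows or columns might be reordered later.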
Summary functions
Now let's perform some common mathematical operations to learn about our data. When analyzing data we often want to look at partial statistics, such as the maximum value per resin or the average value per operator. One way to do this is to select the data we want into a new temporary vector (or matrix, or data frame), and then perform the calculation on this subset:
# first row, all of the columns
resin_1 <- mat[1, ]
# max particle size for resin 1
max(resin_1)
## [1] 36.25
We don't actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:
# max particle size for resin 2
max(mat[2, ])
## [1] 35.35
R also has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:
# minimum particle size for operator 3
min(mat[, 3])
## [1] 28.65
# mean for operator 3
mean(mat[, 3])
## [1] 31.4375
# median for operator 3
median(mat[, 3])
## [1] 31.275
# standard deviation for operator 3
sd(mat[, 3])
## [1] 2.49453
Summarizing matrices
What if we need the maximum particle size for all resins, or the average for each operator? In both cases we want to perform the operation across a margin of the matrix.
To support this, we can use the apply function.
apply allows us to repeat a function on all of the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix. We can think of apply as collapsing the matrix down to just the dimension specified by MARGIN, with rows being dimension 1 and columns dimension 2 (recall that when indexing the matrix we give the row first and the column second).
Thus, to obtain the average particle size of each resin we will need to calculate the mean of all of the rows (MARGIN = 1) of the matrix.
avg_resin <- apply(mat, 1, mean)
And to obtain the average particle size for each operator we will need to calculate the mean of all of the columns (MARGIN = 2) of the matrix.
avg_operator <- apply(mat, 2, mean)
Since the second argument to apply is MARGIN, the above command is equivalent to apply(mat, MARGIN = 2, mean).
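A small sketch of how MARGIN changes the shape of the result, using only the first two resins from the data above (the name small is illustrative). The comparison with rowMeans and colMeans is an addition: these are base R shortcuts for this particular case and are not part of the lesson.

```r
# First two resins only, taken from the printed data.
small <- matrix(
  c(36.25, 35.40, 35.30,
    35.15, 35.35, 33.35),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Resin1", "Resin2"), c("Alice", "Bob", "Carl"))
)

# MARGIN = 1: one result per row (per resin).
avg_resin <- apply(small, 1, mean)

# MARGIN = 2: one result per column (per operator).
avg_operator <- apply(small, MARGIN = 2, FUN = mean)

length(avg_resin)     # as many values as rows
length(avg_operator)  # as many values as columns

# For means specifically, base R also offers dedicated shortcuts.
all.equal(avg_resin, rowMeans(small))
all.equal(avg_operator, colMeans(small))
```

apply remains the more general tool, since it accepts any function (max, sd, or one you write yourself), whereas rowMeans and colMeans only compute means.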
Challenge - Slicing (subsetting) data
We can take slices of character vectors as well:
phrase <- c("I", "don't", "know", "I", "know")
# first three words
phrase[1:3]
## [1] "I" "don't" "know"
# last three words
phrase[3:5]
## [1] "know" "I" "know"
- If the first four words are selected using the slice phrase[1:4], how can we obtain the first four words in reverse order?
- What is phrase[-2]? What is phrase[-5]? Given those answers, explain what phrase[-1:-3] does.
- Use a slice of phrase to create a new character vector that forms the phrase "I know I don't", i.e. c("I", "know", "I", "don't").
Challenge - Subsetting data 2
Suppose you want to determine the maximum particle size for resin 5 across operators 2 and 3. To do this you would extract the relevant slice from the matrix and calculate the maximum value. Which of the following lines of R code gives the correct answer?
- max(dat[5, ])
- max(dat[2:3, 5])
- max(dat[5, 2:3])
- max(dat[5, 2, 3])
t test
R has many statistical tests built in. One of the most commonly used tests is the t test. Do the means of two vectors differ significantly?
mat[1,]
## Alice   Bob  Carl
## 36.25 35.40 35.30
mat[2,]
## Alice   Bob  Carl
## 35.15 35.35 33.35
t.test(mat[1,], mat[2,])
##
##  Welch Two Sample t-test
##
## data:  mat[1, ] and mat[2, ]
## t = 1.4683, df = 2.8552, p-value = 0.2427
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.271985  3.338652
## sample estimates:
## mean of x mean of y
##  35.65000  34.61667
Actually, this can be considered a paired sample t-test, since the values can be paired up by operator. By default t.test performs an unpaired t test. We see in the documentation (?t.test) that we can give paired=TRUE as an argument in order to perform a paired t-test.
t.test(mat[1,], mat[2,], paired=TRUE)
##
##  Paired t-test
##
## data:  mat[1, ] and mat[2, ]
## t = 1.8805, df = 2, p-value = 0.2008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.330952  3.397618
## sample estimates:
## mean of the differences
##                1.033333
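One way to see what "paired" means is that a paired t-test is the same calculation as a one-sample t-test on the per-operator differences. The sketch below checks this for resins 1 and 2, with the two rows typed in by hand from the output above (the variable names are illustrative).

```r
# Rows 1 and 2 of the matrix, copied from the printed values.
resin_1 <- c(Alice = 36.25, Bob = 35.40, Carl = 35.30)
resin_2 <- c(Alice = 35.15, Bob = 35.35, Carl = 33.35)

# Paired test on the two rows.
paired <- t.test(resin_1, resin_2, paired = TRUE)

# One-sample test on the differences, operator by operator.
on_differences <- t.test(resin_1 - resin_2)

paired$p.value
on_differences$p.value
all.equal(paired$p.value, on_differences$p.value)
```

Because each operator acts as their own control, operator-to-operator variation is removed from the comparison, which is why the paired test can give a different p-value from the unpaired one.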
Challenge - using t.test
Can you find a significant difference between any two resins?
When we call t.test it returns an object that behaves like a list. Recall that in R a list is a miscellaneous collection of values.
result <- t.test(mat[1,], mat[2,], paired=TRUE)
names(result)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
## [6] "null.value"  "alternative" "method"      "data.name"
result$p.value
## [1] 0.2007814
This means we can write software that uses the various results from t.test, for example performing a whole series of t tests and reporting the significant results.
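As a rough sketch of such a series of tests, the code below runs a paired t-test for every pair of resins and collects the p-values in a data frame. The matrix is rebuilt from the printed values so the example runs on its own; the 0.05 cutoff and the names resin_pairs and results are choices made for illustration, not part of the lesson.

```r
# Rebuild the PVC matrix from the printed values (stand-in for reading pvc.csv).
mat <- matrix(
  c(36.25, 35.40, 35.30,
    35.15, 35.35, 33.35,
    30.70, 29.65, 29.20,
    29.70, 30.05, 28.65,
    31.85, 31.40, 29.30,
    30.20, 30.65, 29.75,
    32.90, 32.50, 32.80,
    36.80, 36.45, 33.15),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

# Every unordered pair of resin names: a 2 x 28 character matrix.
resin_pairs <- combn(rownames(mat), 2)

# Run a paired t-test per pair and keep only the p-value from each result.
p_values <- apply(resin_pairs, 2, function(pair) {
  t.test(mat[pair[1], ], mat[pair[2], ], paired = TRUE)$p.value
})

results <- data.frame(
  resin_a = resin_pairs[1, ],
  resin_b = resin_pairs[2, ],
  p_value = p_values
)

# Report the pairs below an (assumed) 0.05 significance cutoff.
results[results$p_value < 0.05, ]
```

The only part of each t.test result used here is $p.value, but the same pattern works for any of the names listed above, such as $conf.int or $estimate.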
Plotting
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R's plotting features.
Let's take a look at the average particle size per resin. Recall that we already calculated these values above using apply(mat, 1, mean) and saved them in the variable avg_resin. Plotting the values is done with the function plot.
plot(avg_resin)
Above, we gave the function plot a vector of numbers corresponding to the average per resin across all operators. plot created a scatter plot where the y-axis is the average particle size and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 8 resins.
plot can accept many different arguments to change the appearance of the output. Here is a plot with some extra arguments:
plot(avg_resin, xlab="Resin", ylab="Particle size", main="Average particle size per resin", type="b")
Let's have a look at two other statistics: the maximum and minimum particle size per resin. Additional points or lines can be added to a plot with points or lines.
max_resin <- apply(mat, 1, max)
min_resin <- apply(mat, 1, min)
plot(avg_resin, type="b", ylim=c(25,40))
lines(max_resin)
lines(min_resin)
R doesn't know to adjust the y limits if we add new data outside the original limits, so we needed to specify ylim manually. This is R's base graphics system. If there is time today, we will look at a more advanced graphics package called "ggplot2" that handles this kind of issue more intelligently.
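Rather than guessing limits such as c(25, 40), one option is to compute them from the data before plotting. The sketch below does this with range(); it rebuilds the matrix from the printed values so it runs on its own, and the dashed line type (lty = 2) is just an illustrative styling choice.

```r
# Rebuild the PVC matrix from the printed values (stand-in for reading pvc.csv).
mat <- matrix(
  c(36.25, 35.40, 35.30,
    35.15, 35.35, 33.35,
    30.70, 29.65, 29.20,
    29.70, 30.05, 28.65,
    31.85, 31.40, 29.30,
    30.20, 30.65, 29.75,
    32.90, 32.50, 32.80,
    36.80, 36.45, 33.15),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("Resin", 1:8), c("Alice", "Bob", "Carl"))
)

avg_resin <- apply(mat, 1, mean)
max_resin <- apply(mat, 1, max)
min_resin <- apply(mat, 1, min)

# Smallest and largest value across everything we intend to draw.
y_limits <- range(min_resin, max_resin)

plot(avg_resin, type = "b", ylim = y_limits,
     xlab = "Resin", ylab = "Particle size")
lines(max_resin, lty = 2)  # dashed line for the maxima
lines(min_resin, lty = 2)  # dashed line for the minima
```

Computing the limits this way means the plot still fits if the data change, at the cost of one extra line of code.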
Challenge - Plotting data
Create a plot showing the standard deviation for each resin.
Advanced: Create a plot showing +/- 2 standard deviations about the mean.
Extension: Create similar plots for operator. Which dimension (resin or operator) is the major source of variation in this data?
Saving plots
It's possible to save a plot as a .PNG or .PDF from the RStudio interface with the "Export" button. However, if we want to keep a complete record of exactly how we create each plot, we prefer to do this with R code.
Plotting in R is sent to a "device". By default, this device is RStudio. However, we can temporarily send plots to a different device, such as a .PNG file (png("filename.png")) or .PDF file (pdf("filename.pdf")).
pdf("test.pdf")
plot(avg_resin)
dev.off()
dev.off() is very important. It tells R to stop outputting to the pdf device and return to using the default device. If you forget, your interactive plots will stop appearing as expected!
The file you created should appear in the file manager pane of RStudio; you can view it by clicking on it.
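A small sketch of the open, plot, close pattern that also confirms the file was written. It saves into R's temporary directory so it does not clutter the working directory; the file name avg_resin.pdf and the three stand-in averages are illustrative values, not the lesson's full data.

```r
# Stand-in data: averages of the first three resins, as computed from the
# printed values (illustrative subset only).
avg_resin <- c(Resin1 = 35.65, Resin2 = 34.62, Resin3 = 29.85)

# Hypothetical output location inside the session's temporary directory.
out_file <- file.path(tempdir(), "avg_resin.pdf")

pdf(out_file)                # open the PDF device
plot(avg_resin, type = "b")  # drawing now goes to the file, not the screen
dev.off()                    # close the device so the file is completed

file.exists(out_file)
```

Keeping the three calls together in a script makes it hard to forget dev.off(), and checking file.exists() afterwards is a cheap way to catch a mistyped path.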
Source: http://monashbioinformaticsplatform.github.io/2015-11-30-intro-r/matrices.html