1
00:00:00,390 --> 00:00:05,930
Hi, everyone.
2
00:00:05,930 --> 00:00:09,380
Welcome back to the Heterogeneous Parallel
Programming class.
3
00:00:09,380 --> 00:00:13,300
We are at lecture 1.8, Kernel-Based
Parallel Programming,
4
00:00:13,300 --> 00:00:17,260
and we will be discussing basic
matrix-matrix multiplication.
5
00:00:18,450 --> 00:00:21,400
The objective for this lecture is to
prepare
6
00:00:21,400 --> 00:00:25,290
you for the lab assignment of
matrix-matrix multiplication.
7
00:00:25,290 --> 00:00:30,810
And we will first review the computation
for matrix multiplication.
8
00:00:30,810 --> 00:00:36,350
And we will also discuss how we can do
9
00:00:37,370 --> 00:00:42,400
block and thread index to data index
mapping for this particular computation.
10
00:00:42,400 --> 00:00:46,060
And then we will introduce loop control
flow in kernels.
11
00:00:49,098 --> 00:00:53,980
Matrix multiplication is a fairly
frequently
12
00:00:53,980 --> 00:00:59,490
used computation in engineering and
science applications and the
13
00:00:59,490 --> 00:01:05,050
computation is fairly simple.
We have two input matrices A and B.
14
00:01:05,050 --> 00:01:08,535
And then we will generate a product matrix
C.
15
00:01:08,535 --> 00:01:14,610
Each C element is generated as the inner
product of a row of A and
16
00:01:14,610 --> 00:01:19,250
a column of B.
So in this particular picture,
17
00:01:19,250 --> 00:01:24,480
we show that a C element has coordinates
Row
18
00:01:24,480 --> 00:01:29,360
and Col.
And so we would use a row in
19
00:01:29,360 --> 00:01:34,980
A with the index Row, and then we will
have a column
20
00:01:34,980 --> 00:01:39,610
of B with the index Col.
And we will take the
21
00:01:39,610 --> 00:01:44,460
row and the column to do a
vector dot product to
22
00:01:44,460 --> 00:01:49,660
generate the value of the C element.
In
23
00:01:49,660 --> 00:01:55,120
all the discussions that we have today, we
will assume that
24
00:01:55,120 --> 00:02:00,270
we will have an A matrix which is of
dimension m times n.
25
00:02:01,450 --> 00:02:06,050
And the B matrix will be of dimension n
26
00:02:06,050 --> 00:02:07,028
times k.
27
00:02:07,028 --> 00:02:14,330
And then we will generate a C matrix of m
times k.
28
00:02:14,330 --> 00:02:19,460
So n is going to be the number of
elements in
29
00:02:19,460 --> 00:02:24,530
a row of A and also the number of
elements in a column of B.
30
00:02:25,570 --> 00:02:28,450
This dimension, represented by variable n
here, must
31
00:02:28,450 --> 00:02:31,430
match, because we're doing a dot product,
so we must
32
00:02:31,430 --> 00:02:33,960
have the same number of elements of A and
the
33
00:02:33,960 --> 00:02:38,220
same number of elements of B in
that dot product.
34
00:02:38,220 --> 00:02:41,910
So we are going to use these variables
35
00:02:41,910 --> 00:02:45,190
consistently in all the examples in this
lecture.
36
00:02:48,180 --> 00:02:54,760
So, here we show a simple sequential C
code for matrix multiplication.
37
00:02:54,760 --> 00:02:58,530
And the C code will have two levels
of loops.
38
00:02:58,530 --> 00:03:03,722
And these two levels of loops
systematically go
39
00:03:03,722 --> 00:03:08,826
through all the C elements in all the row
positions and
40
00:03:08,826 --> 00:03:13,667
all the column positions.
And for every C element
41
00:03:13,667 --> 00:03:19,166
that it visits, it will generate a dot
product of the
42
00:03:19,166 --> 00:03:25,290
corresponding A row and B column for
that C element.
43
00:03:25,290 --> 00:03:30,400
So here is the C code for generating each
C element.
44
00:03:30,400 --> 00:03:33,600
We'll first initialize the C element to 0,
and then we will
45
00:03:33,600 --> 00:03:38,970
go and systematically visit all the
elements in the row of A and
46
00:03:38,970 --> 00:03:40,490
column of B.
47
00:03:40,490 --> 00:03:43,338
When we visit the row of A, we will
48
00:03:43,338 --> 00:03:48,110
first need to reach the beginning
location of that row.
49
00:03:48,110 --> 00:03:53,540
And the beginning location can be
found by multiplying
50
00:03:53,540 --> 00:03:58,700
variable row by the number of elements in
each row, which is n.
51
00:03:58,700 --> 00:04:04,470
So, row times n moves us into the
beginning location of the row
52
00:04:04,470 --> 00:04:12,000
in the linear row major layout of A.
And then we do the same thing with B.
53
00:04:12,000 --> 00:04:17,370
In B's case, the beginning location of the
column is actually in row 0.
54
00:04:17,370 --> 00:04:23,700
So we can simply use Col
as
55
00:04:23,700 --> 00:04:29,699
the index
of the beginning element of B.
56
00:04:29,699 --> 00:04:35,080
Now, for every iteration of the i
loop, we will need to be able
57
00:04:35,080 --> 00:04:40,780
to calculate the corresponding element
within the row and column.
58
00:04:40,780 --> 00:04:43,710
So for A it's very simple.
59
00:04:43,710 --> 00:04:50,170
All the elements in the same row
are going to be in adjacent locations.
60
00:04:50,170 --> 00:04:54,780
So all we have to do is to just add i to
the beginning index, and that
61
00:04:54,780 --> 00:05:01,130
will give us the
proper element for that iteration.
62
00:05:01,130 --> 00:05:05,747
For B it's a little bit more complex
because in row major layout
63
00:05:05,747 --> 00:05:10,753
all the elements in the column are going
to be in strided locations.
64
00:05:10,753 --> 00:05:15,768
So we need to multiply i by the number of
elements in each row, which is k
65
00:05:15,768 --> 00:05:21,280
for B, and then we add that to the
beginning location.
66
00:05:21,280 --> 00:05:26,270
So these two memory accesses give us the
appropriate elements
67
00:05:26,270 --> 00:05:30,850
of A and B in each iteration.
And then we would do the multiplication.
68
00:05:30,850 --> 00:05:36,240
We accumulate the result into sum.
And once we finish
69
00:05:36,240 --> 00:05:41,740
the for loop, we have the total value for
the C element.
70
00:05:41,740 --> 00:05:46,410
So now we can write the C element value
into the appropriate
71
00:05:46,410 --> 00:05:51,370
C location and this is the location at row
and column.
72
00:05:51,370 --> 00:05:56,176
So again we will just calculate the
linearized address by
73
00:05:56,176 --> 00:06:01,960
multiplying row by k, and then adding
column to the index.
74
00:06:01,960 --> 00:06:07,930
So this gives us the complete C sequential
code for matrix multiplication.
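The sequential computation just described can be sketched like this; the function name matMul and exact variable names are this sketch's own choices, not necessarily those on the slide:

```c
/* C = A * B, where A is m x n, B is n x k, and C is m x k,
 * all stored as row-major linear arrays. */
void matMul(const float *A, const float *B, float *C,
            int m, int n, int k) {
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < k; ++col) {
            float sum = 0.0f;  /* initialize the C element to 0 */
            for (int i = 0; i < n; ++i) {
                /* row * n is the beginning of row `row` of A; row
                 * elements are adjacent, so we just add i.
                 * col is the top of column `col` of B; column
                 * elements are strided k apart, so we add i * k. */
                sum += A[row * n + i] * B[i * k + col];
            }
            C[row * k + col] = sum;  /* linearized address of C[row][col] */
        }
    }
}
```

For example, with 2 by 2 inputs A = {1, 2, 3, 4} and B = {5, 6, 7, 8}, matMul produces C = {19, 22, 43, 50}.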
75
00:06:07,930 --> 00:06:12,070
Notice that this particular calculation is
in a very
76
00:06:12,070 --> 00:06:15,590
simple and very basic form.
77
00:06:15,590 --> 00:06:20,540
And since matrix multiplication is a very
important computation, there are
78
00:06:20,540 --> 00:06:25,510
actually lots and lots of optimized ways
of doing the sequential computation.
79
00:06:25,510 --> 00:06:27,330
We will come back to this point later on.
80
00:06:30,370 --> 00:06:35,560
So here we will begin to design a kernel
81
00:06:35,560 --> 00:06:40,460
function in CUDA to perform matrix
multiplication in parallel.
82
00:06:40,460 --> 00:06:44,110
So we're going to use a very small example
to
83
00:06:44,110 --> 00:06:48,610
give you the intuition behind the design
of the kernel function.
84
00:06:48,610 --> 00:06:52,100
So here we're going to assume that we
85
00:06:52,100 --> 00:06:55,780
are going to be using a two-dimensional
86
00:06:55,780 --> 00:07:00,800
square block to organize the threads to
calculate C.
87
00:07:00,800 --> 00:07:04,230
That is, each thread block is going to
have
88
00:07:04,230 --> 00:07:07,430
an equal number of threads in both the X
and Y dimensions.
89
00:07:07,430 --> 00:07:11,226
And obviously, this is a simplifying
assumption;
90
00:07:11,226 --> 00:07:14,803
you can definitely use a
rectangular
91
00:07:14,803 --> 00:07:17,358
block organization, but here we use a
92
00:07:17,358 --> 00:07:21,890
square organization, just to keep the
slide simple.
93
00:07:21,890 --> 00:07:27,170
And so, assuming that we have a square
organization,
94
00:07:27,170 --> 00:07:33,080
we will assume that we have a compile-time
constant called TILE_WIDTH.
95
00:07:33,080 --> 00:07:34,950
And this TILE_WIDTH is going to give us
the
96
00:07:34,950 --> 00:07:38,233
number of threads in each dimension of the
thread block.
97
00:07:38,233 --> 00:07:47,390
And we have a simple example here
where we calculate
98
00:07:47,390 --> 00:07:55,230
a 4 by 3 C matrix.
99
00:07:55,230 --> 00:07:59,402
So in this case, m is equal to 4 and k is
equal to 3.
100
00:07:59,402 --> 00:08:03,702
We will further assume that n is 4
for both A and B,
101
00:08:03,702 --> 00:08:08,830
which is not visible in this slide,
but will become useful later on.
102
00:08:08,830 --> 00:08:12,720
So for this very simple small example we
103
00:08:12,720 --> 00:08:18,700
have four elements in the Y dimension and
three elements in the X dimension of C.
104
00:08:18,700 --> 00:08:21,080
And we need to make sure that we generate
105
00:08:21,080 --> 00:08:24,830
enough thread blocks to calculate all the
C elements.
106
00:08:24,830 --> 00:08:30,280
If we assume that the tile width is 2 for
this simple example, then we will
107
00:08:30,280 --> 00:08:33,500
go through the same ceiling function
calculation that we
108
00:08:33,500 --> 00:08:36,920
used before in the picture
kernel.
109
00:08:36,920 --> 00:08:38,120
Then we will,
110
00:08:38,120 --> 00:08:41,045
conclude that we need to have two
thread
111
00:08:41,045 --> 00:08:43,895
blocks in the Y dimension as well as two
thread
112
00:08:43,895 --> 00:08:46,670
blocks in the X dimension to be able to
113
00:08:46,670 --> 00:08:50,734
cover the calculation of
all the C elements.
114
00:08:50,734 --> 00:08:56,521
So this gives us the simple example that we
are going to be working with.
115
00:08:56,521 --> 00:08:59,797
And in every thread block, we are going to
have
116
00:08:59,797 --> 00:09:03,409
two threads in the X dimension and two
threads
117
00:09:03,409 --> 00:09:09,625
in the Y dimension so that these threads
are going to be accessing the C
118
00:09:09,625 --> 00:09:15,837
elements that are located in the tile that
this thread block is responsible for.
119
00:09:15,837 --> 00:09:21,981
So now we are ready to discuss the host
code for invoking the matrix
120
00:09:21,981 --> 00:09:28,797
multiplication kernel, and we assume that
we have the matrix
121
00:09:28,797 --> 00:09:34,080
dimension variables m, n and k available
in the host code.
122
00:09:34,080 --> 00:09:40,460
So we will be
using the ceiling function expression
123
00:09:40,460 --> 00:09:48,870
based on k and m to generate enough thread
blocks for covering all the C elements.
124
00:09:48,870 --> 00:09:54,070
So this is very similar to
what we used in the picture kernel.
125
00:09:54,070 --> 00:09:58,130
And for each thread block we are going to
declare that we are going to
126
00:09:58,130 --> 00:10:02,918
have TILE_WIDTH by TILE_WIDTH threads
in the X and Y dimensions.
127
00:10:02,918 --> 00:10:06,691
And because the Z dimension is not
used,
128
00:10:06,691 --> 00:10:10,726
we're going to be just using 1 for that
dimension.
129
00:10:10,726 --> 00:10:14,506
So we will be launching the kernel
130
00:10:14,506 --> 00:10:19,651
with the grid dimension and block
dimension that
131
00:10:19,651 --> 00:10:23,326
we set up in the host code, and then we
will
132
00:10:23,326 --> 00:10:27,720
pass m, n, k and the pointers to A, B and
C.
133
00:10:27,720 --> 00:10:33,630
Just another reminder: the A matrix is
going to be m times n,
134
00:10:34,730 --> 00:10:41,920
the B matrix is going to be n times k, and
C matrix is going to be m times k.
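The grid sizing described above comes from the ceiling function; here is a minimal sketch of that arithmetic in plain C, where numBlocks is an assumed helper name. In the actual host code these values would feed something like dim3 DimGrid(numBlocks(k), numBlocks(m), 1) and dim3 DimBlock(TILE_WIDTH, TILE_WIDTH, 1) before the kernel launch:

```c
#define TILE_WIDTH 2

/* Ceiling of len / TILE_WIDTH in integer arithmetic: the number of
 * blocks of TILE_WIDTH threads needed to cover len elements. */
int numBlocks(int len) {
    return (len + TILE_WIDTH - 1) / TILE_WIDTH;
}
```

For the 4 by 3 example, numBlocks(4) and numBlocks(3) are both 2, giving the 2 by 2 grid of thread blocks from the slide.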
135
00:10:46,940 --> 00:10:49,920
Here we show the matrix multiplication
kernel.
136
00:10:49,920 --> 00:10:55,023
At the beginning, every thread is
going to determine the
137
00:10:55,023 --> 00:11:00,650
row and column index of the C element that
it's calculating.
138
00:11:00,650 --> 00:11:04,120
And then they will all proceed to test
whether the
139
00:11:04,120 --> 00:11:07,810
row index and column index are within the
valid range.
140
00:11:07,810 --> 00:11:12,490
If these indices are within the valid
range, then we will
141
00:11:12,490 --> 00:11:17,890
go through the for loop to access the
elements of the row
142
00:11:17,890 --> 00:11:22,880
of A and the elements of the column of B
to produce the inner product.
143
00:11:22,880 --> 00:11:26,810
So this is pretty much the same as the
sequential code.
144
00:11:26,810 --> 00:11:32,350
And after the for loop, we
have the C element value.
145
00:11:32,350 --> 00:11:38,130
So we write the C element value
into the global memory, and
146
00:11:38,130 --> 00:11:43,220
the calculation is exactly the same as the
sequential C code.
So this is the reason why we took
147
00:11:43,220 --> 00:11:48,237
the time to discuss the row-major layout
and the linearization of arrays.
148
00:11:48,237 --> 00:11:52,917
Keep in mind that in CUDA,
A, B and C
149
00:11:52,917 --> 00:11:57,584
are all going to be dynamically allocated
with cudaMalloc.
150
00:11:57,584 --> 00:12:03,317
So that's why we need to do the
linearization of
151
00:12:03,317 --> 00:12:09,782
these indices when we read from A
and B and write into C.
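The per-thread logic of the kernel just described can be emulated on the host in plain C; mmThread and runGrid are made-up names, and bx, by, tx, ty stand in for blockIdx and threadIdx. This is a sketch of the index mapping and boundary test, not the literal CUDA code from the slide:

```c
#define TILE_WIDTH 2

/* Emulates on the host what one CUDA thread of the kernel does.
 * (bx, by) stand in for blockIdx.x/y and (tx, ty) for threadIdx.x/y. */
void mmThread(int bx, int by, int tx, int ty,
              const float *A, const float *B, float *C,
              int m, int n, int k) {
    int row = by * TILE_WIDTH + ty;   /* C row this thread owns */
    int col = bx * TILE_WIDTH + tx;   /* C column this thread owns */
    if (row < m && col < k) {         /* the boundary test from the slide */
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)
            sum += A[row * n + i] * B[i * k + col];
        C[row * k + col] = sum;
    }
}

/* Runs every thread of every block sequentially, mimicking the grid. */
void runGrid(int gridX, int gridY, const float *A, const float *B,
             float *C, int m, int n, int k) {
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < TILE_WIDTH; ++ty)
                for (int tx = 0; tx < TILE_WIDTH; ++tx)
                    mmThread(bx, by, tx, ty, A, B, C, m, n, k);
}
```

Looping over every block and thread pair on one CPU reproduces, in sequence, exactly the work the grid does in parallel, which makes the mapping easy to check by hand.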
152
00:12:09,782 --> 00:12:12,914
Now we are ready to review just a
153
00:12:12,914 --> 00:12:18,120
little bit more of the dynamic execution behavior
of the kernel.
154
00:12:18,120 --> 00:12:24,281
So as we mentioned, there will be four
thread blocks executing this
155
00:12:24,281 --> 00:12:28,388
kernel for our small 4 by 3 C example.
So
156
00:12:28,388 --> 00:12:33,724
here we show the work that
thread block 0,0 is
157
00:12:33,724 --> 00:12:39,422
going to be doing.
So for thread block 0,0,
158
00:12:39,422 --> 00:12:44,850
the block indices in the x and y dimensions
are both 0.
159
00:12:44,850 --> 00:12:48,200
So this means that when we calculate the
row index and
160
00:12:48,200 --> 00:12:53,790
column index, these values are solely
going to be determined
161
00:12:53,790 --> 00:12:56,940
by the thread index in the x and y
dimension.
162
00:12:56,940 --> 00:13:02,420
So here we show that the row and column
163
00:13:02,420 --> 00:13:07,840
indices will be 0 and 1 for all
four threads.
164
00:13:07,840 --> 00:13:12,162
So let's take a look at the work that
thread 0,0 is going to do.
165
00:13:12,162 --> 00:13:17,220
Thread 0,0 is going to have a row value of
0 and a column value of 0.
166
00:13:17,220 --> 00:13:19,600
So it's going to be accessing
167
00:13:19,600 --> 00:13:23,450
the zeroth row of A and the zeroth column
of B.
168
00:13:23,450 --> 00:13:26,850
We are showing a horizontal arrow in this
direction
169
00:13:26,850 --> 00:13:30,010
here and a vertical arrow down the zeroth
column
170
00:13:30,010 --> 00:13:33,342
of B to indicate all the memory accesses
that
171
00:13:33,342 --> 00:13:37,940
thread 0,0 is going to do in block 0,0.
172
00:13:37,940 --> 00:13:44,740
And then, similarly, thread 1,1 is
going to have a
173
00:13:44,740 --> 00:13:50,320
row value of 1 and column value of 1.
So it's going to be accessing
174
00:13:50,320 --> 00:13:55,800
row 1 of A and column 1
of B.
175
00:13:55,800 --> 00:14:01,365
So we're also showing this with the
horizontal arrow going through the row
176
00:14:01,365 --> 00:14:07,240
1 and a vertical arrow going through column
1 in the picture.
177
00:14:07,240 --> 00:14:09,920
So this picture shows the work
178
00:14:09,920 --> 00:14:16,100
that all the four threads in block 0,0
will be doing when executing that kernel.
179
00:14:16,100 --> 00:14:22,620
And in fact, you're encouraged to
use very small numbers, such
180
00:14:22,620 --> 00:14:25,900
as the ones that we provided here to test
your kernel and make
181
00:14:25,900 --> 00:14:30,080
sure that you can look at all the
numbers and review the
182
00:14:30,080 --> 00:14:32,680
correctness of these numbers before you
start
183
00:14:32,680 --> 00:14:35,310
to use your kernel on large matrices.
184
00:14:38,440 --> 00:14:44,778
So just for a little bit more insight, we
will look at the
185
00:14:44,778 --> 00:14:51,530
work for block 0, 1.
In this case,
186
00:14:51,530 --> 00:14:57,550
the Y dimension block index is still 0,
but the X dimension block index is 1.
187
00:14:57,550 --> 00:15:03,140
So in this case the row indices remain
the same, because the row
188
00:15:03,140 --> 00:15:08,780
indices are calculated based on the y
block index, which is 0.
189
00:15:08,780 --> 00:15:16,210
However, the x
dimension of the block index is now 1.
190
00:15:16,210 --> 00:15:22,610
So we effectively shifted the column values
of the threads to
191
00:15:22,610 --> 00:15:28,814
2, 3, compared to block 0, 0.
So now we see that all
192
00:15:28,814 --> 00:15:32,662
the threads in this thread block are going
to
193
00:15:32,662 --> 00:15:38,080
be processing the upper right corner of
the C matrix.
194
00:15:38,080 --> 00:15:45,115
And so if we look at thread 0, 0 in this
block, it's going to be still
195
00:15:45,115 --> 00:15:52,999
accessing row 0 of A, but it's going to
be accessing column 2 of B.
196
00:15:52,999 --> 00:15:54,069
And then it
197
00:15:54,069 --> 00:15:59,228
will generate the dot product and write into
C 0,2.
198
00:15:59,228 --> 00:16:04,378
And the more interesting
comparison is
199
00:16:04,378 --> 00:16:09,528
that for
200
00:16:09,528 --> 00:16:14,608
thread 1,0 and thread 1,1.
For those two threads,
201
00:16:14,608 --> 00:16:19,249
they are actually going to be
processing column
202
00:16:19,249 --> 00:16:23,010
3 of C, which is outside the valid range.
203
00:16:23,010 --> 00:16:24,970
So, both of those threads are going to
204
00:16:24,970 --> 00:16:29,270
fail the condition test of the if
statement.
205
00:16:29,270 --> 00:16:32,990
So these two threads
will not be doing
206
00:16:32,990 --> 00:16:36,690
any work or any kind of memory
access in the kernel.
207
00:16:37,694 --> 00:16:44,899
So this gives us a good illustration
208
00:16:44,899 --> 00:16:51,961
of the runtime behavior, and I'd like to
encourage you to finish the analysis
209
00:16:51,961 --> 00:16:58,176
of this example for block 1,0 and
block 1,1, okay?
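To finish the analysis the lecture asks for, one can count, for each block, how many threads pass the boundary test; activeThreads is a hypothetical helper written just for this check, not code from the lecture:

```c
#define TILE_WIDTH 2

/* Counts how many of the TILE_WIDTH x TILE_WIDTH threads in block
 * (bx, by) pass the boundary test for an m x k output matrix. */
int activeThreads(int bx, int by, int m, int k) {
    int count = 0;
    for (int ty = 0; ty < TILE_WIDTH; ++ty)
        for (int tx = 0; tx < TILE_WIDTH; ++tx) {
            int row = by * TILE_WIDTH + ty;
            int col = bx * TILE_WIDTH + tx;
            if (row < m && col < k) ++count;
        }
    return count;
}
```

For the 4 by 3 example with tile width 2, blocks whose x block index is 0 have all 4 threads active, while blocks whose x block index is 1 have only 2 threads active, because column 3 is outside the valid range.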
210
00:16:58,176 --> 00:17:04,161
So, at this point, we have finished the
coverage
211
00:17:04,161 --> 00:17:09,348
of the simple matrix multiplication kernel
in CUDA.
212
00:17:09,348 --> 00:17:10,524
And you
213
00:17:10,524 --> 00:17:15,770
are ready for the lab.
And what I'd like to reiterate is that
214
00:17:15,770 --> 00:17:21,140
when you test your kernel you should start
with some very small matrices.
215
00:17:21,140 --> 00:17:26,398
Such as the dimensions that we
have shown in this example.
216
00:17:26,398 --> 00:17:31,728
You can even start with smaller ones, like
2 by 2 matrices, so
217
00:17:31,728 --> 00:17:35,910
that you can use the kind of data values
that you know for sure,
218
00:17:35,910 --> 00:17:39,026
you can just calculate by hand and verify
that
219
00:17:39,026 --> 00:17:42,442
your kernel is doing all the right
things.
220
00:17:42,442 --> 00:17:46,512
Do this before you apply the
kernel to the much
221
00:17:46,512 --> 00:17:52,472
bigger matrices of the data that we
supply for your testing.
222
00:17:52,472 --> 00:17:58,960
So at this point, we are at the
end of week one.
223
00:17:58,960 --> 00:18:01,030
So I'd like to also encourage
224
00:18:01,030 --> 00:18:04,960
you to take the quiz as soon as possible
to make
225
00:18:04,960 --> 00:18:09,420
sure that you solidify all your
understanding of the week one material.
226
00:18:09,420 --> 00:18:11,650
For those of you who'd like to learn more,
I'd
227
00:18:11,650 --> 00:18:15,852
like to encourage you to read the
textbook, section 4.3.
228
00:18:15,852 --> 00:18:17,588
Thank you.