CNN has been successful in various text classification tasks. In [1], the author showed that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks – improving upon the state of the art on 4 out of 7 tasks.

However, when learning to apply CNN on word embeddings, keeping track of the dimensions of the matrices can be confusing. The aim of this short post is simply to keep track of these dimensions and understand how a CNN works for text classification. We will use a one-layer CNN on a 7-word sentence, with word embeddings of dimension 5, as a toy example to aid the understanding of CNN. All examples are from [2].

**Setup**

The figure above is from [2], with *#hash-tags* added to aid discussion. Quoting the original caption here, to be discussed later. *“Figure 1: Illustration of a CNN architecture for sentence classification. We depict three filter region sizes: 2,3,4, each of which has 2 filters. Filters perform convolutions on the sentence matrix and generate (variable-length) feature maps; 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.”*

**#sentence**

The example is “I like this movie very much!”. There are 6 words here, and the exclamation mark is treated like a word (some researchers do this differently and disregard the exclamation mark), so in total there are 7 words in the sentence. The authors chose 5 to be the dimension of the word vectors. We let *s* denote the length of the sentence and *d* denote the dimension of the word vector, hence we now have a sentence matrix of shape *s* x *d*, or 7 x 5.
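To make the 7 x 5 shape concrete, here is a minimal NumPy sketch. The lookup table `embedding` is a hypothetical stand-in filled with random numbers; in practice the vectors would come from pre-trained embeddings.

```python
import numpy as np

# The exclamation mark is counted as a word, as in the example.
sentence = ["I", "like", "this", "movie", "very", "much", "!"]
d = 5  # dimension of each word vector

rng = np.random.default_rng(0)
# Assumption: random vectors stand in for real pre-trained word vectors.
embedding = {tok: rng.normal(size=d) for tok in sentence}

# Stack the looked-up vectors row by row: one row per word.
sentence_matrix = np.stack([embedding[tok] for tok in sentence])
s = len(sentence)

print(sentence_matrix.shape)  # (7, 5), i.e. s x d
```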

**#filters**

One of the desirable properties of CNN is that it preserves 2D spatial orientation in computer vision. Texts, like pictures, have an orientation. Instead of a 2-dimensional structure, texts have a one-dimensional structure in which word order matters. We also recall that each word in the example is replaced by a 5-dimensional word vector, hence we fix one dimension of the filter to match the word vectors (5) and vary the region size, *h*. Region size refers to the number of rows of the sentence matrix, each representing a word, that the filter covers at a time.

In the figure, #filters illustrates the filters themselves, not what the filters have extracted from the sentence matrix; the next section makes this distinction clearer. Here, the authors chose to use 6 filters: 2 complementary filters for each of the region sizes of 2, 3 and 4 words.
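As a rough sketch of the filter shapes described above, the snippet below creates 2 filters for each region size. The random values are only placeholders for weights that would be learned during training, and the variable names are my own.

```python
import numpy as np

d = 5                      # word-vector dimension, fixed for every filter
region_sizes = (2, 3, 4)   # h: number of words each filter covers
filters_per_size = 2

rng = np.random.default_rng(0)
# Each filter is an (h, d) weight matrix; random values stand in for learned weights.
filters = [
    rng.normal(size=(h, d))
    for h in region_sizes
    for _ in range(filters_per_size)
]

print([w.shape for w in filters])
# [(2, 5), (2, 5), (3, 5), (3, 5), (4, 5), (4, 5)]
```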

**#featuremaps**

In this section, we step through how a CNN performs convolutions / filtering. I have filled in some numbers in the sentence matrix and the filter matrix for clarity.

The above illustrates the action of the 2-word filter on the sentence matrix. First, the two-word filter, represented by the 2 x 5 yellow matrix **w**, overlays the word vectors of “I” and “like”. Next, it performs an element-wise product over all of its 2 x 5 elements, then sums them up to obtain one number (0.6 x 0.2 + 0.5 x 0.1 + … + 0.1 x 0.1 = 0.51). 0.51 is recorded as the first element of the output sequence, **o**, for this filter. Then, the filter moves down 1 word, overlays the word vectors of “like” and “this”, and performs the same operation to get 0.53. Therefore, **o** will have the shape (*s*–*h*+1) x 1, in this case (7–2+1) x 1 = 6 x 1.
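The sliding operation can be sketched in a few lines of NumPy. The helper name `convolve` is my own, and the matrices are filled with random numbers rather than the values in the figure, so only the shapes are meant to match.

```python
import numpy as np

def convolve(sentence_matrix, w):
    """Slide filter w down the sentence one word at a time.

    At each position, multiply element-wise with the h rows under the
    filter and sum everything into a single number.
    """
    s = sentence_matrix.shape[0]
    h = w.shape[0]
    return np.array([
        np.sum(sentence_matrix[i:i + h] * w)
        for i in range(s - h + 1)
    ])

rng = np.random.default_rng(0)
sentence_matrix = rng.normal(size=(7, 5))  # s x d, random stand-in values
w = rng.normal(size=(2, 5))                # h x d filter with h = 2

o = convolve(sentence_matrix, w)
print(o.shape)  # (6,), i.e. s - h + 1 = 7 - 2 + 1 elements
```

NumPy reports the shape as `(6,)` rather than 6 x 1 because the result is stored as a flat vector; the number of elements is the same.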

To obtain the feature map, **c**, we add a bias term (a scalar, i.e., shape 1×1) and apply an activation function (e.g. ReLU). This gives us **c**, with the same shape as **o**: (*s*–*h*+1) x 1.
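A small sketch of this step, assuming ReLU as the activation. Only the first two entries of `o` (0.51 and 0.53) come from the example; the remaining entries and the bias value are made up for illustration.

```python
import numpy as np

# First two values are from the worked example; the other four are invented.
o = np.array([0.51, 0.53, -0.20, 0.10, -0.05, 0.30])
b = 0.1  # hypothetical scalar bias, shared by every position of this filter

# Add the bias, then apply ReLU, which clips negative values to zero.
c = np.maximum(o + b, 0.0)

print(c.shape == o.shape)  # True: the feature map keeps the shape of o
```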

**#1max**

Notice that the dimensionality of **c** depends on both *s* and *h*; in other words, it will vary across sentences of different lengths and filters of different region sizes. To tackle this problem, the authors employ the 1-max pooling function and extract the largest number from each **c** vector.
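To illustrate why pooling removes the dependence on *s* and *h*, the sketch below builds feature maps of different lengths (random stand-in values) and reduces each of them to a single number.

```python
import numpy as np

s = 7
rng = np.random.default_rng(0)

# One stand-in feature map per region size; its length is s - h + 1.
feature_maps = [rng.normal(size=s - h + 1) for h in (2, 3, 4)]
print([c.shape for c in feature_maps])  # [(6,), (5,), (4,)]

# 1-max pooling: keep only the largest value of each feature map.
pooled = [float(c.max()) for c in feature_maps]
print(len(pooled))  # 3, one number per feature map regardless of its length
```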

**#concat1max**

After 1-max pooling, we are certain to have a fixed-length vector of 6 elements (= number of filters = number of filters per region size (2) x number of region sizes considered (3)). This fixed-length vector can then be fed into a softmax (fully-connected) layer to perform the classification. The error from the classification is then back-propagated into the following parameters as part of learning:

- The **w** matrices that produced **o**
- The bias term that is added to **o** to produce **c**
- Word vectors (optional, use validation performance to decide)
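Putting the steps together, here is a minimal forward-pass sketch in NumPy. It is not the authors' implementation: all weights are random placeholders, no training happens, and names such as `forward`, `W_out` and `b_out` are my own. The point is only that the pooled feature vector has 6 elements whatever the sentence length.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(sentence_matrix, filters, biases, W_out, b_out):
    s = sentence_matrix.shape[0]
    pooled = []
    for w, b in zip(filters, biases):
        h = w.shape[0]
        # Convolution: one number per filter position, s - h + 1 in total.
        o = np.array([np.sum(sentence_matrix[i:i + h] * w)
                      for i in range(s - h + 1)])
        c = relu(o + b)          # feature map
        pooled.append(c.max())   # 1-max pooling
    features = np.array(pooled)  # concatenated pooled values
    return features, softmax(W_out @ features + b_out)

rng = np.random.default_rng(0)
d = 5
filters = [rng.normal(size=(h, d)) for h in (2, 3, 4) for _ in range(2)]
biases = np.zeros(len(filters))
W_out = rng.normal(size=(2, len(filters)))  # binary classification
b_out = np.zeros(2)

for s in (7, 12):  # two sentence lengths, same feature-vector size
    features, probs = forward(rng.normal(size=(s, d)), filters, biases, W_out, b_out)
    print(s, features.shape, probs.shape)
```

In a real model, the filters, biases and output layer (and optionally the word vectors) would be updated by backpropagation rather than left at their random initial values.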

**Conclusion**

This short post clarifies the workings of the CNN on word embeddings by focussing on the dimensionality of matrices in each intermediate step.

**References**

- Kim Y. Convolutional Neural Networks for Sentence Classification. 2014.
- Zhang Y, Wallace B. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1510.03820. 2015; PMID: 463165

Thanks for sharing this – I found it very helpful – keep writing 🙂

Thanks Alireza, appreciate it! 🙂

Thank you so much for this. 😀

No problem. Thanks Zenon. 🙂

in the paper https://arxiv.org/pdf/1510.03820.pdf (A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification) it’s stated “we set the number of feature maps for this region size to 100.” Does this mean that the filters per region size is 100? Are filter maps equivalent to the number of filters per region size?

Don’t mind my previous question, i think i know the answer.

Hey Zenon,

How did you go?

Thanks for the interesting paper. From my understanding of Section 2.1 CNN Architecture, I agree with your interpretation. In the paper, feature maps are of dimension (sentence_length-region_size+1, 1). The purpose is to learn complementary features from the same region size.

Hope this helps,

Joshua

Hi, I also wonder the same. What does it mean by having feature maps with value 100?

What does this mean “‘feature maps’ refers to the number of feature maps for each filter region size.”?

How can one achieve more than one feature map for each region size? (Like in the example above)

Hi Celina,

I’m assuming that you’re referring to this paper: https://arxiv.org/pdf/1510.03820.pdf

Table 2: “‘feature maps’ refers to the number of feature maps for each filter region size.”

Going back to our toy example in this blog, please refer to the area of the figure where it is labelled #featuremaps. Here, we have 2 feature maps for each region size. And in our toy example, the region sizes are (2,3,4).

Let’s try explaining this in layman’s terms. Imagine 6 people who each independently try to build their own intuition to determine the sentiment of the sentence.

Person A (dark red) and B (light red) are limited to looking at 4 consecutive words at a time.

Person C (dark green) and D (light green) are limited to looking at 3 consecutive words at a time.

Person E (dark yellow) and F (light yellow) are limited to looking at 2 consecutive words at a time.

You can bet that Person A and B will build different intuitions even though they are both looking at 4 words at a time because their preexisting knowledge is different (the parallel here is that the random initialization is different).

So, if we have 100 feature maps for each region size (3,4,5) as in the paper, it means:

100 different people are limited to looking at 3 consecutive words at a time.

100 different people are limited to looking at 4 consecutive words at a time.

100 different people are limited to looking at 5 consecutive words at a time.

A total of 300 different people.

Hope this helps,

Joshua

Very helpful.. thank you

Cheers, thanks Saja.

Is it not the case that software implementations like Keras actually have the transpose of what you show for #sentences where they would be 5×7 and then the kernel is 5x where filter length is 2,3,5 above?

Sorry Brian, I’m not sure about the Keras implementation. But I’ll post your question here so that someone else can pick it up.

Maybe Reddit could help too.

Cheers,

Joshua

Thank you. Please describe the d dimension of “I”, “like”?

Hi Ayesha,

Following the section on #sentence, there, we have 6 words plus an exclamation mark. In the example, the exclamation mark is counted as one word, hence we have 7 words.

Since d = 5 then we have a matrix of 7 * 5.

For your question, if there is just “i like”, then the matrix would be 2 * 5.

In both of the above, we have assumed that the dimension of word vectors is 5.

Hope this helps,

Joshua

Hi Joshua,

Thank you for clarifying details about cnn.

I would like to ask how you mapped 7 words to 5 features.

Can you give some advice?

Hi Kutay,

The high-level concept answer is that the 7 words are looked up in a lookup table of vectors. Every English word has a vector in this lookup table, and they have been pre-trained. Glove vectors and Word2Vec vectors are good examples of these. So in our toy example, each word vector has a length of 5. There are 7 words, so the resulting matrix is 7 x 5.

The low-level execution answer is to point you to a good tutorial. I enjoyed following Machine Learning Mastery by Jason Brownlee; in this article https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ he explains the Python script to achieve what I have described above.

Hope this helps,

Joshua

Hello Sir.!

I plan to work on text classification. Dear sir, can you give me some ideas?

I am still confused about it.

Thanks

Hi Tad,

When I first started, I watched the entire series of CS224D by Stanford on Youtube. I found it a great resource to get started. Hope this helps you just as well!

https://www.youtube.com/watch?v=OQQ-W_63UgQ&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6

Best Regards,

Joshua

Thank you very much for the detailed description. I did not understand the filtering part completely. In the #featuremaps section, how are the numbers in the yellow matrix (2×5) defined? Are they randomly generated?

Yes, you are right, the numbers in the yellow matrix are randomly initialized and then updated through backpropagation.

Best Regards,

Joshua

Thanks for translating this into Vietnamese and citing the blog post.