
Optimizing CUDA Programs

· 21 min read
VisualDust
Ordinary Magician | Half stack developer
Sonder
HPC Engineer

Go Deeper into GPU Architecture

The starting point of all optimization is to "squeeze" more performance out of the hardware through programming.

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SM). GPU hardware parallelism is achieved through the replication of this architectural building block.

Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.
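
To make the hierarchy concrete, here is a minimal sketch of a kernel launch, written with Numba's CUDA bindings in Python purely for illustration (the kernel name and launch configuration are hypothetical): the launch specifies a grid of thread blocks, and the hardware then distributes those blocks across the available SMs.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    # each thread handles one element; cuda.grid(1) is the global thread index
    i = cuda.grid(1)
    if i < data.size:
        data[i] *= factor

n = 1 << 20
d_data = cuda.to_device(np.arange(n, dtype=np.float32))

# the grid: enough 256-thread blocks to cover n elements;
# each block is scheduled onto one SM, where its threads execute in warps
threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
scale[blocks_per_grid, threads_per_block](d_data, 2.0)

result = d_data.copy_to_host()
```

The same launch shape carries over directly to CUDA C++, where it is written as `scale<<<blocks_per_grid, threads_per_block>>>(...)`.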

Key components of a Fermi SM are:

  • CUDA Cores
  • Shared Memory/L1 Cache
  • Register File
  • Load/Store Units
  • Special Function Units
  • Warp Scheduler

Figure: Key components of an SM in the Fermi GPU architecture

CUDA Programming as a Beginner

· 57 min read
VisualDust
Ordinary Magician | Half stack developer
Sonder
HPC Engineer

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It enables developers to utilize the immense computational power of NVIDIA GPUs (Graphics Processing Units) for general-purpose processing tasks beyond graphics rendering.

Key features of CUDA programming include:

  1. Parallelism: CUDA enables developers to exploit parallelism at multiple levels, including thread-level, instruction-level, and data-level parallelism, allowing for efficient computation on GPUs.
  2. CUDA C/C++ Language Extensions: CUDA extends the C/C++ programming languages with additional keywords and constructs to facilitate programming for GPU architectures, making it easier to write parallel code.
  3. CUDA Runtime API: The CUDA Runtime API provides a set of functions for managing GPU devices, memory allocation, data transfer between CPU and GPU, and launching kernel functions (the functions executed on the GPU).
  4. CUDA Libraries: NVIDIA provides a collection of libraries optimized for GPU computing tasks, such as cuBLAS for linear algebra, cuFFT for Fast Fourier Transforms, cuDNN for deep neural networks, and more.
  5. CUDA Toolkit: The CUDA Toolkit includes compilers, debuggers, profilers, and other development tools necessary for CUDA programming.

CUDA programming allows developers to harness the massive parallel processing power of GPUs to accelerate a wide range of computational tasks, including scientific simulations, image and video processing, machine learning, and more.
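
As a small illustration of the workflow these features enable, the sketch below uses CuPy, a Python library built on the CUDA Runtime and cuBLAS, rather than raw CUDA C/C++ (the array sizes are arbitrary): move data to the device, let a GPU library routine do the heavy lifting, and copy the result back to the host.

```python
import numpy as np
import cupy as cp  # requires an NVIDIA GPU with the CUDA Toolkit installed

# host (CPU) data
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

# host -> device transfer (allocates GPU memory under the hood)
d_a = cp.asarray(a)
d_b = cp.asarray(b)

# matrix multiplication executed on the GPU, backed by cuBLAS
d_c = d_a @ d_b

# device -> host transfer
c = cp.asnumpy(d_c)
print(c.shape)  # (1024, 1024)
```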

LLM basics from scratch

· 14 min read
VisualDust
Ordinary Magician | Half stack developer

Abstract

The main purpose of this article is to use basic self-attention blocks to build a simple large language model for learning purposes. Because of limits on model scale and the embedding method, the model built here will not perform particularly well, but that does not stop the code in this article from illustrating the basic concepts behind Transformer-style language models.
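
As a taste of the central building block, here is a minimal single-head causal self-attention module in PyTorch. It is a sketch for orientation only, not the repository's actual implementation, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of causal (masked) self-attention."""
    def __init__(self, embed_dim: int, head_dim: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)
        # lower-triangular mask so each position attends only to earlier positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):  # x: (batch, time, embed_dim)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # (B, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v  # (B, T, head_dim)

# example: a batch of 4 sequences, 8 tokens each, embedding size 32
head = SelfAttentionHead(embed_dim=32, head_dim=16, block_size=8)
print(head(torch.randn(4, 8, 32)).shape)  # torch.Size([4, 8, 16])
```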

What happens on this page:

  • get the full code of a basic Large(?) Language Model (data preparation, model architecture, model training, and prediction)
  • understand the general architecture of the Transformer with a few illustrations
  • understand how auto-regressive training works
  • understand how to load a very large text dataset into limited memory (OpenWebTextCorpus)
  • train the model and observe the training procedure
  • load the trained model into a simple ask-and-answer interactive script

Get full code

Code available at github.com/visualDust/naive-llm-from-scratch

warning

Download the code via git clone before continuing.

Variational AutoEncoders from scratch

· 18 min read
VisualDust
Ordinary Magician | Half stack developer


AutoEncoders

AutoEncoders are a type of artificial neural network used primarily for unsupervised learning tasks, particularly in the field of dimensionality reduction and feature learning. Their general purpose is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction or noise reduction.

The main components of an autoencoder are:

  1. Encoder: This part of the network compresses the input into a latent-space representation of reduced dimension. The encoder is typically composed of a series of layers that gradually decrease in size.

  2. Bottleneck: This is the layer that contains the encoded representation of the input data. It is the heart of the network, which holds the compressed knowledge of the input data. The bottleneck is where the dimensionality reduction takes place.

  3. Decoder: The decoder network performs the reverse operation of the encoder. It takes the encoded data from the bottleneck and reconstructs the input data as closely as possible. This part of the network is typically symmetrical to the encoder, with layers increasing in size.

The objective of an autoencoder is to minimize the difference between the original input and its reconstruction, typically measured by a loss function like mean squared error. By learning to reconstruct the input data, the network learns valuable properties about the data and its structure. AutoEncoders are used in various applications like anomaly detection, image denoising, and as a pre-training step for deep learning models.
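
To make those three components concrete, here is a minimal PyTorch sketch (not the code from the gist linked below; layer sizes are illustrative and assume flattened 28x28 inputs): an encoder that shrinks toward a bottleneck, a mirrored decoder, and a mean-squared-error reconstruction objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # encoder: layers gradually decrease in size down to the bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),  # bottleneck: the compressed representation
        )
        # decoder: a mirror of the encoder, layers increase back to the input size
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)     # encode into the latent space
        return self.decoder(z)  # reconstruct the input from the latent code

model = AutoEncoder()
x = torch.rand(16, 784)             # a batch of flattened 28x28 images
loss = F.mse_loss(model(x), x)      # reconstruction objective
loss.backward()
```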

Figure: structure of a basic autoencoder


Get the full code before continuing

Code available at gist: single-file-linear-ave.py

tip

This is a single-file approach that contains the model architecture, training code, testing code, and inference code. We will discuss each part of the code separately later. If you're in a hurry, run this single-file Python script and open localhost:20202 to see the result.

warning

Please put this file in a new directory: once you run it, it automatically downloads the MNIST dataset to the relative path ./data. You probably don't want the code to download data into an unwanted place.

Cityscapes class-level boundary labeling with improved one-hot

· 10 min read
VisualDust
Ordinary Magician | Half stack developer

Why

I'm working on some semantic-segmentation code where I need to improve accuracy on class boundaries, so I tried using a boundary loss to assist model training. This article records my attempt and the code.


Our purpose is clear: in Cityscapes, each sample comes with an indexed image named gtFine_labelIds in which every pixel stores its class id. What we want is to generate class-level boundaries from gtFine_labelIds so that we can use them to optimize boundary regions for specific classes.
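
One common way to do this, shown here only as a hedged sketch (the function name and parameters are hypothetical, and this is not necessarily the approach used later in the article), is to one-hot encode the label map per class and take the difference between a dilated and an eroded version of each class mask.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def class_boundaries(label_ids: np.ndarray, num_classes: int, width: int = 1) -> np.ndarray:
    """Turn an (H, W) map of class ids into a (num_classes, H, W) stack of
    boundary masks: one-hot encode each class, then keep the pixels where the
    dilated and eroded masks disagree."""
    one_hot = np.stack([label_ids == c for c in range(num_classes)])  # (C, H, W), bool
    structure = np.ones((2 * width + 1, 2 * width + 1), dtype=bool)
    boundaries = np.zeros_like(one_hot)
    for c in range(num_classes):
        dilated = binary_dilation(one_hot[c], structure=structure)
        eroded = binary_erosion(one_hot[c], structure=structure)
        boundaries[c] = dilated ^ eroded  # a thin band around each class region
    return boundaries

# toy example with 3 classes on a 4x4 label map (values are made up)
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2],
                   [2, 2, 2, 2]])
print(class_boundaries(labels, num_classes=3).shape)  # (3, 4, 4)
```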

End-to-end methods for Autonomous Driving as a beginner (LLM approach)

· 22 min read
VisualDust
Ordinary Magician | Half stack developer

I look back at the assisted-driving project I created during my undergraduate studies. As we continued to improve it, it gradually became more bloated and harder to scale. As I kept learning, I found that an end-to-end approach would be the right solution. On this page, I record my review of this project and my gradual understanding, as a beginner, of the end-to-end approach.


On this page, I first review the assisted-driving project I once created and explain how it was structured. Then, as a novice, I introduce the World Model and the advantages of LLMs for autonomous driving. Next, I share a paper on how to integrate an LLM into an autonomous driving system. Finally, I go deeper into Augmented Language Models and share how I got started.

Attention Mechanisms

· 42 min read
VisualDust
Ordinary Magician | Half stack developer


1. What is Attention

Attention mechanisms have been applied successfully in many areas of AI. They are an important development in artificial neural networks' imitation of the human decision-making process.

In humans, Attention is a core property of all perceptual and cognitive operations. Given our limited ability to process competing sources, attention mechanisms select, modulate, and focus on the information most relevant to behavior.

The passage above is taken from the paper ATTENTION, PLEASE ! A SURVEY OF NEURAL ATTENTION MODELS IN DEEP LEARNING by Alana de Santana Correia and Esther Luna Colombini. You may have noticed that only part of your field of vision is truly sharp; to see what lies at the periphery clearly, you usually have to move your eyes and direct your gaze toward it. Or perhaps you have noticed that you seem more sensitive to the bell ringing at the end of class than to the history teacher starting to stress the key points: that is attention. Your environment contains far more information than you can process, and the attention mechanism lets your brain concentrate on the scene at the center of your vision, or on whatever you "should" care about more.

Attention sounds like a fancy term, but in practice it gets used without anyone noticing. For example, in a recurrent neural network, the fact that each step's computation depends on the result of the previous step can be seen as a form of attention. Before the attention mechanism was introduced, one problem had long been troubling: information from far away in the sequence gets weakened, just like a person with a poor memory who cannot recall things from the past.

Figure: attention in a recurrent network over a long sequence, with the selected key parts highlighted in red

As shown in the figure above, in a recurrent neural network that processes sequences, the role of attention is to focus on the key points: even when the text is fairly long, it can still pick out the key points from the middle without losing important information. The parts marked in red in the figure are the key points that were selected.

For text-sequence inputs there is encoder-decoder attention; for image inputs there are all kinds of spatial attention; within the processing of a neural network there is also channel attention; and there is the powerful self-attention, together with generalized designs that make self-attention applicable to all kinds of inputs.

Attention has three major advantages: few parameters, high speed, and good results. Understood at a surface level, the attention mechanism matches its name very well: its core logic is to move from attending to everything to attending to the key parts. In cognitive science, because of the bottleneck in information processing, humans selectively attend to part of all the available information while ignoring the rest. Likewise, when a neural network processes a large amount of input, it quickly focuses on a few key pieces of information to process; this is the attention mechanism.
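
For reference, the most common formalization of this idea in the Transformer literature (stated here as background; it is not derived in this excerpt) is scaled dot-product attention: given queries $Q$, keys $K$, and values $V$, the output is a weighted sum of the values, with weights given by a softmax over query-key similarity,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the keys.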

From Finite Automata to Regular Expressions

· 39 min read
VisualDust
Ordinary Magician | Half stack developer

Introduction: Languages and Their Description

A formal language uses mathematical symbols and rules to describe a language formally. Formal languages are often used to precisely describe and define high-level programming languages, and they play an important role in compiler design. Formal language theory studies languages of symbols, considering only the relationships between symbols, not their meanings.

note

In 1956, N. Chomsky first used Markov models to describe natural language. He analyzed finite-state models, phrase-structure models, and transformational models theoretically from both a linguistic and a mathematical perspective, establishing formal language theory.

Chomsky formally defined a language as ==a set of strings made up of letters from an alphabet==. For any language there is an alphabet; a grammar can be defined over that alphabet according to certain formation rules, and the set of all sentences produced by that grammar is the language generated by the grammar.

Basic Concepts

An alphabet is a non-empty finite set. An element of an alphabet is called a letter of that alphabet; it can also be called a symbol or a character.

success

Definition: let $\Sigma_1$ and $\Sigma_2$ be two alphabets. The product of $\Sigma_1$ and $\Sigma_2$ is

$$\Sigma_1 \Sigma_2 = \{ab \mid a \in \Sigma_1, b \in \Sigma_2\}$$
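
For example (a worked instance added for illustration), if $\Sigma_1 = \{0, 1\}$ and $\Sigma_2 = \{a, b\}$, then

$$\Sigma_1 \Sigma_2 = \{0a, 0b, 1a, 1b\}.$$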

How to Read Scientific Papers

· 4 min read
VisualDust
Ordinary Magician | Half stack developer

How to Read and Comprehend Scientific Research Articles

Scientific articles are how scholars and researchers communicate with each other. Reading scientific articles helps you engage with the researchers' reasoning by asking yourself how they explain their ideas. Books, websites, papers, and scientific magazines are common places to start.

This tutorial will discuss:

  • How to read a scientific article
  • How to find the main points of an article
  • How to take effective notes

How to Write Scientific Papers

· 25 min read
VisualDust
Ordinary Magician | Half stack developer

Writing a Scientific Paper?

Note: this section was written after listening to a talk given by Professor Chen Guanrong (陈关荣).

The Importance of Writing Carefully

Please write carefully.

Figure: the result of not writing carefully.

The Typical Structure of a Scientific Paper