Chenyo's org-static-blog

Welcome!

Take a look if you are bored.

08 Sep 2024

Build a free Telegram sticker tag bot

1. What happened

When I started to use Telegram, I really missed being able to search stickers by text, like I can in WeChat. Sadly, Telegram only lets you search emojis by text, or find stickers using emojis.

After digging around, I discovered the easiest way to add this cool feature was through Telegram bots. The idea? Store a tag-to-sticker map in the bot, then fetch all matching stickers when given a tag in the bot’s inline mode. There are a few sticker tag bots on Telegram already, but they’re either dead or can’t handle Unicode input like Chinese characters. Plus, I’m not super keen on trusting my data to someone else’s database. Moreover, I might want to use a bot for other personal stuff later.

So, I decided to build my own Telegram bot.

2. What do I need

My goal was to create a bot using only free services, including cloud storage for key-value pairs and a hosting platform to keep the bot running.

I stumbled upon Render from an X recommendation which offers 750 hours per month for free deployment (which equals 31 days), so I deployed my bot there once I got the bot running locally. But then I found out Render’s free tier doesn’t offer permanent storage and shuts down services after 15 minutes of inactivity.

A sticker tag bot without memory isn’t much use to anyone, so I went hunting for free cloud storage. With some help from Claude and Perplexity, I discovered Firebase Realtime database, which offers 1GB storage and 10GB throughput per month on its free tier.

Even with cloud storage, a bot that konks out every 15 minutes just won’t cut it - I need my stickers now! So my next quest was finding a way to keep the bot awake, which led me to UptimeRobot. It’s actually a web monitoring tool, but I can use it to ping the bot regularly, tricking it into staying “active” instead of dozing off.

So, to sum it up, building this sticker tag bot required:

  • a Telegram bot from BotFather,
  • a free Render deployment,
  • a Firebase’s free key-value storage, and
  • an UptimeRobot’s free monitoring service.

However, these services do not work together automatically. Gluing them together required additional effort.

3. How to build a bot

The first step in building any bot is asking BotFather for a new bot and keeping the bot token secure. Telegram offers a helpful tutorial that explains the process using Java. Examples in other languages can be found in their Gitlab repo. In my opinion, the most challenging part here is creating a unique bot username that is still available.

The next step involves working with the Telegram bot API in the chosen programming language. This includes learning how to handle messages effectively. For example, I used tgbotapi(Golang).

3.1. How to build a sticker tag bot

A sticker tag bot needs two main functionalities:

  • Handle direct messages to add new sticker tags.
  • Handle inline queries to search for stickers using a given tag.

To implement the first functionality, I created a simple state machine with two states:

  1. The initial state waits for a sticker input and then moves to the tag state.
  2. The tag state waits for any text input to use as the tag for the previous sticker.

To implement the second functionality, one needs to use the InlineQueryResultCachedSticker method.

For local testing, one can use a lightweight local key-value storage to store and search sticker tags. I used BadgerDB(Golang) for example.

I noticed that the generated file ID for the same sticker is different each time, making it hard to check for duplicates when adding new tags. To address this, I added a /delete method to remove tags when needed.

3.2. How to make the bot private

I couldn’t find an official way to make a bot visible only to me. Suggested by Claude, I predefined a list of authorized users. Then I performed a sanity check before handling any messages.

4. How to deploy the bot on Render

Deploying a service on Render with a free account is challenging due to the lack of shell access, disk access, and non-so-live logs. The process of making everything work at this stage was time-consuming and I even contacted Render’s technical support although they only responded “Ok I will close the ticket” after the issue was self-resolved.

Three main steps are required here:

  1. start an HTTP server with several endpoints at the bot, and
  2. configure the web service and environment variables on Render’s dashboard,
  3. configure the Telegram webhook at the bot.

In step 1, starting an HTTP server at 0.0.0.0:<some-port> is necessary. One should also enable GET methods for the root and a health check endpoint to allow Render to scan the service regularly.

In step 2, one needs to fill in the service configuration and environment variables in different boxes. This includes settings such as port, build command, and health check endpoint. The issue I encountered was Render could not scan any port even if I have triple-checked that everything worked fine locally. In the end, I solved this issue by adding the Golang build tag -tags netgo in the build command. Actually this flag was configured by default, but I initially replaced it with a simpler build command.

In step 3, one needs to configure the webhook with the Bot API and to enable the POST method for the webhook at the HTTP server (this can also be handled by the Bot API). The webhook can be https: //<your project>.onrender.com/<your-token> (or another unique URL). This URL informs Telegram where to send and receive all messages for the bot.

5. How to connect to the Firebase

The Firebase Realtime database stores key-value pairs in the JSON format. Connecting the bot with the database requires the following steps:

  1. Create the app and the database on Firebase’s dashboard. Specifically, one needs to store the following 3 values for interaction:
    • The database URL, which looks like https: //<your-app>-default-rtdb.*.firebasedatabase.app.
    • The credential file, which can be downloaded at Project settings->Service accounts->Firebase Admin SDK (and should also be added to Render).
  2. Import the language-specific Firebase API to configure the database in the bot. For example, I use firebase(Golang).
  3. Update the database rules in Firebase dashboard to only allow authorized writes for specific tags, e.g., one name/path to refer to those key-value pairs.

It’s worth noting that connecting to the database on Render may take some time after a fresh start. During this initialization period, the log may display a 502 Bad Gateway error to the database.

6. How to configure UptimeRobot

Before configuring UptimeRobot, an attempt was made to ping the bot from within itself, but this approach did not function for Render.

Using UptimeRobot to maintain the bot’s active status involves two primary steps:

  1. Enable the HEAD method (the sole method available for a free account) for any endpoint on the HTTP server.
  2. Configure an HTTP(S) monitor for that endpoint, which appears as <you-project>.onrender.com/<endpoint>, and establish the monitoring interval to less than 15 minutes.

7. Conclusion

This post isn’t meant to be a step-by-step guide for building a Telegram bot. It skips some steps and doesn’t include screenshots. But don’t worry, most of the missing bits can be figured out using AI language models these days. The rest really depends on each specific situation. The main point here is to show how to set up a free small web service, even when there’s no single platform that does it all.

When I first wrote this, my bot had been up and running for 10 days. It only had 30 minutes of downtime, which I think happened because UptimeRobot couldn’t reach Render’s IP address during that time.

Right now, the repository is private since I plan to add a second functionality to the bot soon.

Tags: tool telegram
07 Sep 2024

Install Doom Emacs with Lisp native compilation in WSL2

Today I installed Doom Emacs for my husband on his WSL2. Although the entire process was guided by Claude, there were some back-and-forth during the interaction. Therefore I would like to record the full commands I have used here in sequence for any potential reference.

1. Assumptions

This installation guide assumes a fresh installation of WSL2 Ubuntu 22.04 on Windows 11 in 2024 September.

2. Install prerequisite packages

According to the Doom Emacs documentation, the following packages are recommended:

  • Git 2.23+: this is already installed by default.
  • Emacs 29.4 with Lisp native compilation: this is finicky and will be elaborated later.
  • ripgrep 14.0+: the documentation says 11.0+ suffices, but doom doctor still complains the latest version (13.0) installed from apt is not advanced, so we need to install it from its Github released package later.
  • GNU find: also installed already.
  • fd: sudo apt install fd-find suffices.

3. Install Emacs 29.4

3.1. Before building

First, let’s install some build dependencies:

sudo apt update
sudo apt upgrade  # update packages
sudo apt install build-essential libjansson4 libjansson-dev \
    libgnutls28-dev libtree-sitter-dev libsqlite3-dev
sudo apt install texinfo  # to generate documentation

build-essnetial should install necessary tools to build C/C++ programs such as gcc, g++, make, gdb and dpkg. The rest packages install pre-compiled libraries.

Besides these packages, there are two important packages to support List native compilation:

sudo apt install ibgccjit0 libgccjit-11-dev  # 11 is my gcc version

After installing them, make sure to export the path in the current session, otherwise the compiler will not realize it.

export LD_LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/11:$LD_LIBRARY_PATH

The last thing to do is install a bunch of X and GTK-3 development libraries for Emacs GUI, and another bunch of image processing libraries.

sudo apt install libx11-dev libtiff-dev libgtk-3-dev libncurses-dev
sudo apt install libtiff5-dev libgif-dev libjpeg-dev libpng-dev libxpm-dev

Without the above packages, one may encounter the following error when configuring the Emacs build:

You seem to be running X, but no X development libraries were found. You should install the relevant development files for X and for the toolkit you want, such as Gtk+ or Motif. Also make sure you have development files for image handling, i.e. tiff, gif, jpeg, png and xpm.

3.2. Build Emacs 29.4 with native-comp

At this moment, we can start to download Emacs source code:

wget https://ftp.gnu.org/gnu/emacs/emacs-29.4.tar.xz
tar xvf emacs-29.4.tar.xz
cd emacs-29.4

Then we can configure the build (i.e., generate Makefile) with the following command.

./configure --with-native-compilation --with-x-toolkit=gtk3 --without-pop
  • --with-native-compilation: with this flag the Emacs source code is compiled to native machine code to achieve better performance.
    • Otherwise it is compiled to bytecode and then interpreted by Emacs virtual machine during runtime.
  • --with-x-toolkit=gtk3: this is recommended by Claude.
  • --without-pop: if we are not using Emacs as the email client, we don’t need to bother configure the protocol.

If everything goes well, one should see the following line in the output. If not, make sure libgccjit has been installed and exported.

Does Emacs have native lisp compiler? yes

Now we can finally start compiling the Emacs:

make -j$(nproc)

If some error occurs, we may want to start again, to do this:

sudo apt install autoconf automake
rm -f Makefile
./autogen.sh  # regenerate the configuration file
# then rebuild
make -j$(nproc)

Finally, install Emacs globally:

sudo make install

To confirm the Emacs indeed used the native Lisp compiler, one can evaluate inside the vanilla Emacs with M-: (M is Alt in WSL2):

(native-comp-available-p) ;; should return t

Congratulations! You have now installed the latest and fastest Emacs on WSL2.

4. Install ripgrep

As mentioned in ripgrep documentation, for Debian/Ubuntu users, one should install the latest ripgrep 14.0+ with the following commands.

curl -LO https://github.com/BurntSushi/ripgrep/releases/download/14.1.0/ripgrep_14.1.0-1_amd64.deb  # check the latest version on its documentation
sudo dpkg -i ripgrep_14.1.0-1_amd64.deb  # dpkg has been installed before

Instead, if one installs it with apt, a 13.0+ version is installed and running doom doctor later returns the warning:

The installed ripgrep binary was not built with support for PCRE lookaheads.

5. Install Doom Emacs

Installing Doom Emacs is straightforward, but before that, one should first remove the default Emacs configuration folder:

rm -rf ~/.emacs.d

Then, clone and install Doom Emacs, it could take a while.

git clone --depth 1 https://github.com/doomemacs/doomemacs ~/.config/emacs
~/.config/emacs/bin/doom install

Don’t forget to export ~/.config/emacs/bin to PATH:

echo 'export PATH="$HOME/.config/emacs/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Now one can run doom doctor to check any missing dependencies, e.g., shellcheck. One common issue is the Nerd font is not installed by default so that some icons are not properly displayed. To fix that, run M-x nerd-icons-install-font inside the Emacs, then update the font cache with:

fc-cache -fv
# fc-list | grep Nerd  # to verify the font is installed

6. Some issues with running Emacs in WSL2

  • The first thing is I cannot reload the configuration with M-x doom/reload as running this command always gives me the following error message so that I need to restart the Emacs every time the configuration is changed.

    %s sync -B -e /bin/bash: line 1: fg: no job control

  • I really dislike the white border that surrounds any application launched by WSL!
Tags: linux emacs tool
06 Sep 2024

Hiked: Stoos

1. What happened?

About three weeks ago, I hiked Stoos and wrote a Redbook post. Unfortunately, the post was only visible to me, apparently violating community guidelines.

I spent most of that day trying to identify the problematic content, repeatedly editing and reposting. Later, I learned this behavior is discouraged on Redbook, e.g., I am not really authorized to edit my own content.

Frustrated, I abandoned my efforts, leaving my Redbook homepage cluttered with articles like “10 Behaviors That Harm Your Redbook Account” and “100 Forbidden Words on Redbook”.

Though brief, this experience instilled in me a tendency towards self-censorship, a feeling I deeply resent. I also noticed the lack of competitive alternatives to this centralized platform where users share authentic life experiences through multimedia.

Well played, Redbook!

2. What now?

Anyway, given that I invested considerable effort in creating post figures, I will post them here instead.

stoos-2.png
Figure 1: The Stoos roadmap
stoos-3.png
Figure 2: The uphill view
stoos-4.png
Figure 3: The top view
stoos-5.png
Figure 4: The ridge roadmap
stoos-6.png
Figure 5: The ridge view
stoos-7.png
Figure 6: My trace
stoos-8.png
Figure 7: Recommended apps
stoos-9.png
Figure 8: My favorite hike shoots
stoos-post.png
Figure 9: The Redbook post
Tags: personal trip stoos
05 Sep 2024

CMU 15-445 notes: Hash Tables

This is a personal note for the CMU 15-445 L7 notes as well as some explanation from Claude.ai.

1. DBMS data structure application

1.1. Design decisions

  • Data layout for efficient access.
  • Concurrent access to data structures.

2. Hash tables

  • Implements an associative array that maps keys to values.
  • On average \(O(1)\) operation complexity with the worst case \(O(n)\); \(O(n)\) storage complexity.
    • Optimization for constant complexity is important in real world.

2.1. Where are hash tables used

  • For tuple indexing. While tuples are stored on pages with NSM or DSM, during the query the DBMS needs to quickly locate the page that stores specific tuples. It can achieve this with separately-stored hash tables, where each key can be a hash of a tuple id, and the value points the location.

3. Hash function

  • Maps a large key space into a smaller domain.
  • Takes in any key as input, and returns a deterministic integer representation.
  • Needs to consider the trade-off between fast execution and collision rate.
    • Does not need to be cryptographically.
    • The state-of-art (Fall 2023) hash function is XXHash3.

4. Hashing scheme

  • Handles key collision after hashing.
  • Needs to consider the trade-off between large table allocation (to avoid collision) and additional instruction execution when a collision occurs.

5. Static hashing scheme

  • The hash table size is fixed; the DBMS has to rebuild a larger hash table (e.g., twice the size) from scratch when it runs out of space.

5.1. Linear probe hashing

  • Insertion: when a collision occurs, linearly search the adjacent slots in a circular buffer until a open one is found.
  • Lookup: search linearly from the first hashed slot until the desired entry or an empty slot is reached, or every slot has been iterated.
    • Requires to store both key and value in the slot.
  • Deletion: simply deleting the entry prevents future lookups as it becomes empty; two solutions:
    • Replace the deleted entry with a dummy entry to tell future lookups to keep scanning.
    • Shift the adjacent entries which were originally shifted, i.e., those who were originally hashed to the same key.
      • Very expensive and rarely implemented in practice.
  • The state-of-art linear probe hashing is Google absl::flat_hash_map.

5.1.1. Non-unique keys

  • The same key may be associated with multiple different values.
  • Separate linked list: each value is a pointer to a linked list of all values, which may overflow to multiple pages.
  • Redundant keys: store the same key multiple times with different values.
    • A more common approach.
    • Linear probing still works, if it is fine to get one value.

5.1.2. Optimization

  • Specialized hash table based on key types and size; e.g., for long string keys, one can store the pointer or the hash of the long string in the hash table.
  • Store metadata in a separate array, e.g., store empty slot in a bitmap to avoid looking up deleted keys.
  • Version control of hash tables: invalidate entries by incrementing the version counter rather than explicitly marking the deletion.

5.2. Cuckoo hashing

  • Maintains multiple hash tables with different hash functions to generate different hashes for the same key using different seeds.
  • Insertion: check each table and choose one with a free slot; if none table has free slot, choose and evict an old entry and find it another table.
    • If a rare cycle happens, rebuild all hash tables with new seeds or with larger tables.
  • \(O(1)\) lookups and deletion (also needs to store keys), but insertion is more expensive.
  • Practical implementation maps a key to different slots in a single hash table.

6. Dynamic hashing schemes

  • Resize the hash table on demand without rebuilding the entire table.

6.1. Chained Hashing

  • Maintains a linked list of buckets for each slot in the hash table; keys hashed to the same slot are inserted into the linked list.
  • Lookup: hash to the key’s bucket and scan for it.
  • Optimization: store bloom filter in the bucket pointer list to tell if a key exist in the linked list.

6.2. Extendible hashing

  • Improve chained hashing to avoid letting chains grow forever.
  • Allow multiple slot locations in the hash table to point to the same chain.
db-extendible-hashing.png
Figure 1: Extendible hashing example

6.3. Linear hashing

  • Maintains a split pointer to keep track of next bucket to split, even if the pointed bucket is not overflowed.
db-linear-hashing.png
Figure 2: Linear hashing example
  • There are always only 2 hash functions: \((key\ mod\ n)\) and \((key\ mod\ 2n)\) where \(n\) is the length of buckets when the split pointer is at the index 0 (i.e., the bucket length at any time is \(n + index(sp)\)).
db-linear-hashing-deletion.png
Figure 3: Linear hashing deletion example
  • Why does \(k\ mod\ 2n < n + sp\) hold?
    • A key is only mod by \(2n\) if the result of \((k\ mod\ n)\) is above the split pointer, i.e., \(0 \leq k\ mod\ n < sp)\).
    • Let \(r = k\ mod\ n\), then \(k = pn + r\) and \(0 \leq r < sp\).
    • Let \(r' = k\ mod\ 2n\), then \(k = q(2n) + r'\).
    • If \(p = 2m\), then we also have \(k = m(2n) + r = q(2n) + r'\), in this case \(0 \leq r = r' < sp\).
    • If \(p = 2m + 1\), then we have \(k = m(2n) + r + n = q(2n) + r'\), in this case \(n \leq r' = n + r < n + sp\).
Tags: study database cmu
29 Aug 2024

My blog search function

1. What happened

Two months ago, I started to use the current blog to keep track of my study and personal notes.

This blog uses bastibe/org-static-blog, which is a convenient Emacs package to publish a static blog from Org-mode files.

When I first started, this blog had no styling at all. With the help of Claude, I gradually decorated it with the theme and fonts of my choice.

During this process, my husband made a feature request: a blog search function.

2. What is not enough with existing search functions?

Most default search tools provided by popular static website frameworks do not return complete search results and their context.

For instance, I am learning database design. Assume I cannot remember what a “slotted page” is but I know I have noted it before. If I enter “slotted” in the search box of a Jekyll-based blog, it will provide me with 2 post links implying these posts contain this keyword. In the best case, it also highlights the first occurrence of “slotted” in each post with its context.

This does not really meet my needs. I would like to see all notes I have made for “slotted” to better refresh my memory, but the current results force me to click each post and perform a local search myself to achieve this goal.

I understand that modern search tools consider scalability and intelligence. However, for a personal blog, we can achieve greater breadth.

3. What does the search function do in this blog?

blog-search.gif
Figure 1: A blog search example

The current search function searches for all occurrences of the given search input and highlights each occurrence in an item surrounded by its context, e.g., 50 characters before and after the input. If multiple occurrences are close to each other, they are displayed once in the same item. Clicking each item opens a new tab and jumps to the closest header before the input.

In this way, I can immediately see all results and their context on one page, and I only click an item when the short context is not sufficient.

4. How is it implemented?

I implemented the search function along with all styling with the help of Cursor. The full logic can be found in search.js.

In short, every time I build the blog, a list of all post links is stored in post-list.json. When the page is loaded, for each post link, the blog caches the full post content and all header indices. When a search input is detected, the search function iterates through each post to find each occurrence and the closest header. It then wraps the occurrence with certain context and prepares the link by appending the header tag to the post link.

5. How can it be improved?

The current search function is simple and sub-optimal. It always re-stores all posts when the page is reloaded and is not lazy at all. The search is also neither fuzzy nor the most efficient.

It just works well so far for a personal blog with about 30 posts.

In addition, after I finished the implementation, my husband made another feature request: can you also highlight all occurrences in the opened link?

Tags: personal blog
Other posts