Lesson #4: How to hide a picture of a cat in Redshift

I didn’t start out trying to put a picture of a cat into Redshift. My motivation isn’t any less strange, though. The product manager of Aginity Team put out a call for “interesting” things to do with Aginity Team. I thought, “what could be more interesting than having Aginity Team print a picture of Aginity’s Enterprise Product Manager and Director of Solution Architecture, George L’Heureux?” Of course, encouraging you to put a picture of George, or of any of your co-workers, into your Redshift database would just be creepy. So, you can use an ASCII picture of a cat instead, hence the title of this article. The desktop software, Aginity Pro, works identically to Aginity Team, so you should be able to follow this tutorial with either version of the software.

The truth is that those of us who wrangle data are often asked to do seemingly strange things with the tools at hand: query the Google Analytics API from Redshift; pull data from a GIS system directly into Snowflake; build an ad hoc ETL system inside Hive while we are waiting on IT to start the “real” project; etc. In this article, we will take some arbitrary string data, compress it according to a well-known algorithm using some advanced SQL windowing functions, store the results in a database, read the compressed data back, and uncompress it using Redshift’s regular expression syntax. The fact that the string data makes a picture of a cat, or of your co-worker, is just for fun. Along the way, we’ll overcome some particular challenges that Redshift has with looping over functions.

Step 1 - Getting your string input

I don’t have any particular advice on an ASCII art converter program; search Google and you will find many of them. I kept my resulting output to 100 columns wide and 51 rows, which worked well for my image. I also stayed away from output that included single quotes. Here is the image I was working with.

/////::::::::::::::::---------------------------------------::::::::::::::::::::::////////////++++++
/:::::::::::::----------------------------------------------------:::::::::::::::::::://////////++++
::::::::::::------------------------------://+ossshho+//--------------::::::::::::::::::::://///////
::::::::-----------------------------:/ohdNNNMNMMMMMMNNNdy+:--------------:::::::::::::::::::://////
::::-----------------------.....--:sdNNNMMMMMMMMMMMMMMMMMMNd+------------------:::::::::::::::::////
---------------------...........-sdNMMMMNMMMMMNNNMMMNNMMMMMMNd+---------------------:::::::::::::://
-----------------.............-odMMMMMNNNmNmmdddmmmmddmmNNNNNMNdo:----------------------::::::::::::
--------------..............-sdNMMNNmmdhhhhssoosyyssssssyhhmmNNMNms:----------------------::::::::::
----------.................:yNMMNNmdhyssooo+++++++++++++ossyyhmNMMNd+------------------------:::::::
------....................-hNMMMmhyysoo++++////////////+++oosyhdmNMMNy--..---------------------:::::
-----.....................+NMMNmhysso+++/////////:::////++++osyhdmNNNNs-.....--------------------:::
---.......................sNMMNdysso+++////////////////+++++oooyhdNMMMm/.......-------------------::
-........................-mMMMNdyso++++//////////////////+++++osyhmNMMNs-.........------------------
-........................:NMMMNdyso+++///::::-::/::::-:::///++osyhmNNMNh:...........----------------
.........................-mMMNNmysoo++//::----.-:------::////++syhmNMMNh:.............--------------
.........................-dNMNNmhysoooo++++++/:-:::///+oooo+osssshmNNNms-...............------------
..........................+NNNNmhyssyyo++///://::::///://////+shyydNNNm+.................-----------
..........................-mNNNmhssys+++sys/+//////+o+++/yhyssoosyhNNNh:..................----------
...........................+NNNmysooosysss+///+o++++so/////++ooo+oymNms......................-------
...........................:hmNms++++++//////++o+++oso+/::///+++oosdNmo.......................------
...........................:oyNms++++////////++oo+/oso+/::::///++oodNd+.......................------
............................/ymdooo+//::::::/+oo+/:/oso+:-----:/+syhmh/.......................------
.............................smdsso+/:---://++++/:-:+ooo+/:---::/oyhdy:.........................----
.............................-ydyso//:::://:+o++++++ssss+////////oyhdo-.........................----
.............................-sdhyo++//+oossyyysso++syyyyyssso++oshmms-..........................---
..............................omdsoo+/+yyhhhhysso+/+oyyhdhddhs+ssydmmh/.........................----
.............................-ommhsss/+shyhhy+/:::::/:/osysyhsyshdmmmd+.........................-.--
..............................+mmdhyyyoos+/oys+/-...:/+ysoyysyhhddmmmd+-.........................---
..............................-ymmdhhhhyy+//+sssoooossyoooshyhddmmNmmy/-.........................---
.............................../mNmdhhhhyo/:/+++++++oo++oyyhhddmmNNmd+-........................-..--
...............................-hNNmddddhs/:/+o++++++++o+shdmmmNNNmmy:..........................-..-
..............................:ymmNNmmmmdyo//+osssssso+/ohdmNNNNNNmy/...........................----
...........................-+ymNhomNNNNmmdhsoosssoo+syssydmNNNNNNmh+............................----
.....................-:/oydmNNNNy-+dmNNNmmmdhhyyysooyhhddmNNNNNNNNmdyo:-.........................---
................-:/oydmNNNNNNNNms..-+dmNNNmmmmdddddhdmmmmmNNNmhNNNNNNNmdhs+:-.....................--
.........--/+oyhdmNNNNNNNNNNNNNms..``-sdmmNNNNmNmmmNNNNmNNNmy/+mmNNNNNNNNNNmdyo+/:-..............---
--..-:/oyhdmNNNNNNNNNNNNNNNNNNmms..````-ohdmmNNNNNNNNmNNmy+:..smmNNNNNNNNNNNNNNNmmdyo/:-.........---
-/+shdNNNNNNNNNNNNNNNNNNNNNNNNmms..```  `.:osyhhddddhyy+:.```-hmmmmmNNNNNNNNNNNNNNNNNNmdyo/:--------
dmNNNNNNNNNNNNNNNNNmNNNNNNNNNNmmy.`````    `.:/+//:-.````````-dmmmmmmmmmmNNNNNNNNNNNNNNNNNNmhy+:----
NNNNNNNNNNNNNNNNNNmNNNNNNNNNNmmmh.```````  `.-::``   `````  `/mmmmmmmmmmmNNmNNNNNNNNNNNNNNNNNNNmhs/-
NNNNNNNNNNNNNNNNNNNmNNNNNNNNNmmmd.` ```````shddy:  `````    `ommmmmmmmmmmmNNNNNmNNNNNNNNNNNNNNNNNNms
NNNNNNNNNNNNNNNNNNNNNNNNNNmmmmmmm-``   ```.ymmhyh:````      `ymmmNmmmmmmmNNNNNNNmNNNNNNNNNNNNNNNNNNm
NNNNNNNNNNNNNNNNNNNNNNNNmNNmmmmmm:``   ``.`:Ndyhd/``       `.mmmmNmmmmmmNNNNNNNNmNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNmmmmmmmmm+`````````+Ndhdh``       ``/mmmmmmNmmmNNmNNNNNNmmNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNmNNNNNNmmNmmmmms`````````ymhddh-``  ` ````smmmmmmNmmNNNNNNNNNNNmNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNmNNmmmmmh.``````.omdhmdhd:````````.dmmNmmmNmmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmNNmNmmd.``````/hmhhmmymh.```````:mmmNmmNNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmNNNmNmm:`````.smdhhdmhmd:```````smmmmmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNmmNNmmNm+`````:dNdhhdddmd/```.``.hmmNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNmNNNmmms.```.+mmhsyyddmd+`..```:mmNmmNNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNmh..`..hmmhhyhmmmds`..```ommNmNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

You can use whatever techniques you like for getting data into Redshift. The COPY command is very handy. Here is what I did:

drop table my_text;
create table my_text (seq int IDENTITY, line varchar(255));

copy my_text
from 's3://your_s3_bucket/txt/my_picture_ascii.txt'
credentials 'aws_access_key_id=XXXXXXXXX;aws_secret_access_key=ZZzzZZZzzZZZZ'
format as FIXEDWIDTH 'line:100'
;

However, note the sentences in the CREATE TABLE documentation that say:

With a COPY operation, the data is loaded in parallel and distributed to the node slices.
To be sure that the identity values are unique, Amazon Redshift skips a number of values
when creating the identity values.  As a result, identity values are unique and sequential,
but not consecutive, and the order might not match the order in the source files.

What I found is that with a small file like this, although there are gaps in the seq numbers, as long as I order by seq, my lines are correct. If you run into problems, then you can add a sequence number to your text file to keep things straight.
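
If you want a quick sanity check before moving on, count the rows and confirm that ordering by seq reproduces the file. A minimal sketch (the column aliases are just illustrative):

select count(*) as line_count, min(seq) as first_seq, max(seq) as last_seq
from my_text
;

For the 51-line file above, line_count should be 51 even though last_seq may be much larger because of the identity gaps.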

Step 2 - Encoding the Data

We don’t want to just insert our text picture into a table and then pull it back out. Where’s the challenge in that? We need to encode it somehow. I looked at the DEFLATE algorithm, but decided not to tackle encoding that in SQL today. Our data is plain ASCII with a lot of repeats. After a bit of searching, Run Length Encoding seemed like a good choice for this exercise.

Run Length Encoding

For Run Length Encoding, we are going to take a string, such as the first line of our ASCII picture:

/////::::::::::::::::---------------------------------------::::::::::::::::::::::////////////++++++

and encode all of the repeats. I found several implementations online, and conventions vary as to whether the character or the count comes first. For our purposes, we’ll put the character first. So the encoding for this line is:

/5:16-39:22/12+6

because the string has 5 slashes, followed by 16 colons, followed by 39 dashes, and so on.
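
As a quick check of the encoding, expanding those chunks with Redshift’s REPEAT() function (which we will lean on again when decoding in Step 3) and the || concatenation operator reproduces the original line. This is just a sketch to illustrate the round trip:

select repeat('/', 5) || repeat(':', 16) || repeat('-', 39) ||
       repeat(':', 22) || repeat('/', 12) || repeat('+', 6) as decoded_line
;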

Iterating in Redshift

The Redshift SUBSTRING() function will be useful here:

select seq, line, substring(line, 1, 1) as letter
from my_text
order by seq
;
../../_images/cat_result_1.png

That gives us the first letter of each line. What we really want is each letter of each line turned into rows of our query result. For example, Hive has functions that can pull a string apart into an array and can “explode” that array into rows of a query result. Redshift doesn’t have that, so we will have to handle that ourselves.

In order to explode our letters from the string into rows, we need to iterate over each letter. We want a row for each letter of the original string. We can use the SQL cross join against a “numbers” table to achieve this effect.
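
To make the idea concrete before we build a real numbers table, here is a toy version with a hard-coded three-row set of numbers (a sketch for illustration only):

select n.num, substring('cat', n.num, 1) as letter
from (select 1 as num union all select 2 union all select 3) as n
order by n.num
;

Each letter of 'cat' lands on its own row. Now we just need enough numbers to cover a 100-character line.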

Note

Sidebar numbers

Needing a set of numbers is so common in SQL that many database systems have a function to generate numbers as needed. For example, PostgreSQL 9 has generate_series(). Unfortunately, Redshift and many other MPPs lack this feature; fortunately in MPPs, we have lots of things we can count in order to generate numbers.

In Redshift, the stl_plan_info table has rows for every query you have run. If you have been using Redshift for a while, then this will be sufficient. Our string is only 100 characters wide, so we only need the numbers from 1 through 100.

drop table if exists numbers;
create temp table numbers as
select num from
(select cast(row_number() over (partition by 1) as int) as num from stl_plan_info
) inner_query
where num <= 100  -- 1 through 100 is all we need for a 100-character line
;
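
If your cluster is new and stl_plan_info doesn’t have 100 rows yet, one version-independent alternative (a sketch; any small hard-coded sets will do) is to cross join two ten-row sets of digits:

drop table if exists numbers;
create temp table numbers as
select (tens.d * 10) + ones.d + 1 as num
from (select 0 as d union all select 1 union all select 2 union all select 3 union all select 4
      union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as ones
cross join
     (select 0 as d union all select 1 union all select 2 union all select 3 union all select 4
      union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as tens
;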

To achieve this iteration, we can modify our query like this:

select seq, num, line, substring(line, num, 1) as letter
from my_text
cross join numbers
where num <= length(line)
order by seq, num
;
../../_images/cat_result_2.png

For each row in the my_text table, the cross join will give us a copy for each row of our numbers table. Our substring now includes num from the numbers table. We don’t need copies for all of the rows in the numbers table, so we limit it to only the length of the text. Now we have our string pivoted so that each letter is on a separate row.

Where are we heading? We want to count the 5 slashes, then the 16 colons, and so on down the result set.

Grouping with windowing functions

Redshift supports the standard SQL windowing functions. My first thought was to use count() or row_number() grouped over our letters. However, letters that we have already seen can reappear later in a line; for example, the slash characters that begin the first line also end it. That recurrence foils any attempt to use grouping functions alone. First, we need to mark each change, preserving the character position where the change occurs. We use the LAG() function to bring in the previous letter for comparison, and then some CASE logic to determine that a change has happened.

select seq, num, line, substring(line, num, 1) as letter,
   lag(letter) over(partition by seq order by num) as previous_letter,
   case when num = 1 then 1
        when letter <> previous_letter then num
   else 0 end as group_start
from my_text
cross join numbers
where num <= length(line)
order by seq, num
;
../../_images/cat_result_3.png

Next, we pair each group_start with the following group’s start using the LEAD() function, and we filter out all of the rows that don’t start a new group. Now we have one row per group, and each of those rows has enough information to calculate the start and length of the substring for that group, which we’ll call a “chunk”.

select seq, group_start,
       nvl(lead(group_start) over(partition by seq order by group_start), length(line)+1) as next_group_start,
       substring(line, group_start, next_group_start - group_start) as chunk,
       substring(chunk, 1, 1) || length(chunk) as encoded_chunk
from
(select seq, num, line, substring(line, num, 1) as letter,
       lag(letter) over(partition by seq order by num) as previous_letter,
       case when num = 1 then 1
            when letter <> previous_letter then num
       else 0 end as group_start
    from my_text
    cross join numbers
    where num <= length(line)
    order by seq, num
 )
where group_start <> 0
order by seq, group_start
;
../../_images/cat_result_4.png

Rolling up with the listagg function

Now we have each chunk “encoded” with the character and count. We only need to bring all of the encoded chunks up to the same row so that the encodings for the entire line are together. The Redshift LISTAGG() function is what we want here.
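
If you haven’t used LISTAGG() before, here is a toy example on hard-coded chunks (a sketch; with no delimiter argument, the chunks are concatenated with nothing in between):

select listagg(chunk) within group (order by ord) as line
from (select 1 as ord, '/5' as chunk
      union all select 2, ':16'
      union all select 3, '-39') as chunks
;

That returns /5:16-39. Applied to our encoded chunks, grouped by seq, we get one encoded row per line: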

select seq as id, listagg(encoded_chunk) within group (order by group_start) as value
from (
    select seq, group_start,
           nvl(lead(group_start) over(partition by seq order by group_start), length(line)+1) as next_group_start,
           substring(line, group_start, next_group_start - group_start) as chunk,
           substring(chunk, 1, 1) || length(chunk) as encoded_chunk
    from
    (select seq, num, line, substring(line, num, 1) as letter,
           lag(letter) over(partition by seq order by num) as previous_letter,
           case when num = 1 then 1
                when letter <> previous_letter then num
           else 0 end as group_start
        from my_text
        cross join numbers
        where num <= length(line)
        order by seq, num
     )
    where group_start <> 0
    order by seq, group_start
)
group by seq
order by seq
;
../../_images/cat_result_5.png

We have now encoded our string lines into an RLE format. Before we decode, let’s store that in a table:

create table dev.public.rle_data as
select * from ...
;

Step 3 - Decoding the Data

Now our picture is “hidden” in a Redshift table, with each line compressed into RLE format. We used iteration over the substring() function plus some windowing functions to get there. To pull the data out, we are going to use iteration over Redshift regular expression functions.

Here is our first row of data:

/5:16-39:22/12+6

We have a target character, followed by a number that tells how many times to repeat the character. We do have a REPEAT() function in Redshift.

select repeat('/', 5);

So, if we can identify the components, we can build the string for that repetition. We know from the previous section that if we can get those repetition strings into columns, we can use the LISTAGG() function to pull them together into the same row, reconstituting the original string.
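
To see the pieces in action on a single chunk before we scale up, here is a sketch using the literal chunk '/5':

select repeat(substring('/5', 1, 1), cast(substring('/5', 2) as int)) as expanded
;

The first character is the character to repeat, and everything after it is the count, so this returns /////.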

Iterating using Regular Expression functions

Redshift supports several regular expression functions. Understanding regular expressions is one of the most important skills a programmer can have, and with modern database systems that includes SQL programmers. MPP SQL systems vary widely in their support for regular expressions. In Redshift, the REGEXP_SUBSTR() function is straightforward but limited.

select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, 1)
from rle_data a
order by id
;
../../_images/cat_result_6.png

With this function we are substringing the value column, which contains the RLE-encoded string, using the pattern \\D\\d{1,}, starting from the beginning of the string and extracting the first occurrence of the pattern, which is /5. Developing regular expression patterns begins with thinking clearly about what you are trying to achieve. In this case, we set up our RLE to be “a character followed by some digits”. Another way to state that is “a non-digit followed by at least 1 digit”, which is exactly what the pattern \\D\\d{1,} matches.
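
You can test the pattern against a literal before running it over the whole table. A quick sketch with our sample line, asking for the second occurrence:

select regexp_substr('/5:16-39:22/12+6', '\\D\\d{1,}', 1, 2) as second_occurrence
;

This returns :16, the second character-and-count chunk.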

With Hive, we could “explode” directly into an array of the multiple occurrences of the regular expression. With Redshift, we’ll reuse the earlier technique: a numbers table drives iteration over the occurrence parameter of REGEXP_SUBSTR(). The only thing we need to know is when to stop iterating, which can be different for each string. Fortunately, the REGEXP_COUNT() function tells us exactly that.
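
Again as a sketch against the literal sample line, REGEXP_COUNT() confirms there are six chunks to expand:

select regexp_count('/5:16-39:22/12+6', '\\D\\d{1,}') as occurrence_count
;

Putting REGEXP_SUBSTR() and REGEXP_COUNT() together over the table gives us every occurrence: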

select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
from rle_data a
cross join numbers n
where num <= REGEXP_COUNT(value, '\\D\\d{1,}')
order by id, num
;
../../_images/cat_result_7.png

For each pattern occurrence, we want to expand it using the REPEAT() function shown above. For that function, we need the first character of the occurrence and the number of times to repeat, which is everything in the pattern occurrence after the first character. Let’s look at all of that data on a single row:

select row_number() over (partition by 1) as key,
       id as line_number, num as pat_number, value as rle_line,
       REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)  as pattern_occurrence,
       substring(pattern_occurrence, 1, 1) as rle_char,
       cast(substring(pattern_occurrence, 2) as int) as char_count,
       repeat(rle_char, char_count) as expanded_pattern
from
 (select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
  from rle_data a
  cross join numbers n
  where num <= REGEXP_COUNT(value, '\\D\\d{1,}')
  order by id, num
 ) as rle_data
order by id, num
;
../../_images/cat_result_8.png

Rolling up

Once again, we can use the LISTAGG() function to roll this data up to a single row.

select line_number, rle_line, listagg(expanded_pattern) within group (order by line_number, pat_number) as full_line
from
    (select row_number() over (partition by 1) as key,
           id as line_number, num as pat_number, value as rle_line,
           REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)  as pattern_occurrence,
           substring(pattern_occurrence, 1, 1) as rle_char,
           cast(substring(pattern_occurrence, 2) as int) as char_count,
           repeat(rle_char, char_count) as expanded_pattern
    from
     (select *, REGEXP_SUBSTR(value, '\\D\\d{1,}', 1, num)
      from rle_data a
      cross join numbers n
      where num <= REGEXP_COUNT(value, '\\D\\d{1,}')
      order by id, num
     ) as rle_data
    order by id, num
    ) rle_detail
group by line_number, rle_line
order by line_number
;
../../_images/cat_result_9.png

Now the full_line column has your complete picture, decoded from RLE format.
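
If you want to confirm that the round trip is lossless, save the decoding query above into a temp table (decoded_lines is a hypothetical name here) and compare every decoded line against the original. A sketch:

select count(*) as mismatches
from my_text t
join decoded_lines d on d.line_number = t.seq
where d.full_line <> t.line
;

A count of zero means the picture survived its trip through RLE intact.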

Conclusion

You will probably never need to hide an ASCII picture of a cat (or a co-worker) inside your Redshift database. However, you will certainly need to iterate over subsets of data, use windowing functions to group data, and use regular expressions to manipulate strings.

As modern data analysts and data engineers, we need to expand our toolbox to include all types of string manipulations. The next time you are asked to do something out of the ordinary in Redshift, Snowflake, or any other cloud database, you will be ready.