Web Scraping With JavaScript Using Axios and JSDom

It’s been almost a year since I created a website whose primary purpose was to consume a Japanese idiom API and give users a new idiom (and its pronunciation) each time they visited in a new browser window. When I originally imagined a concept for this website, I wanted to offer users an idiom, its pronunciation, and its definition. (As a language learner, idioms are pretty challenging. By nature, they are more figurative than literal, so only having access to the characters and pronunciation for a word doesn’t help much to understand its meaning.)

However, the API I used did not offer definitions. My first instinct was to find a Japanese dictionary API, pass the idiom received from the Japanese idiom API to it as an argument, and then fetch its definition. Unfortunately, I could not find a Japanese-to-Japanese dictionary API, so I temporarily gave up on this goal.

Now, months later, after learning more about web parsing, I decided to return to my original goal of having definitions for idioms.

I began by adding an Express backend to my application and creating an endpoint specific to idioms. I wanted there to be one specific location for my client side JavaScript to source information. I started by adding in the code for the idiom API I was currently using.

async function getIdiomAndPronunciation () {
  const response = await fetch('https://corsservice.appspot.com/yojijukugo/api/');
  const data = await response.json();
  const { pronunciation, idiom } = data;
  const idiomAndPronunciation = [ { pronunciation }, { idiom } ];
  return idiomAndPronunciation;
}
This asynchronous function returns an array containing both the Japanese characters for an idiom (jukugo) and its reading (yomi). I could use the value of the idiom when trying to retrieve its definition.

Then, I found a website that used a consistent pattern for displaying definitions and decided to fetch my data from there. Available definitions could be found with URLs that looked like this: https://dictionary.goo.ne.jp/word/WORD_TO_DEFINE.

I created another asynchronous function whose job would be to get the definition of a word. I used axios to make a GET request for definitions, where WORD_TO_DEFINE would be a variable representing the idiom. I stored the data for the response in a separate variable.

const response = await axios.get(`https://dictionary.goo.ne.jp/word/${idiom}/`);
const data = response.data;

Next, I decided to use JSDOM to create a DOM object that I could parse through using JavaScript.

const html = new JSDOM(data);

I took a peek at the page for a definition on my selected website and opened up my developer tools. I found the area where the definition was stored and took note of the different HTML classes it was contained in. I wanted to retrieve content that matched all of the various classes where definitions were stored, so I decided to use querySelectorAll. As querySelectorAll returns a NodeList, I took the text content from its first element.

const unparsedDefinition = html.window.document.querySelectorAll('.content-box, .contents_area, .meaning_area, .p10, .contents, .text')[0].textContent;

This definition came padded with whitespace, so I used trim to remove it.

const definition = unparsedDefinition.trim();

I ran into an interesting issue at this point because I was using Japanese characters to retrieve the definition. These characters had to be escaped before they could be used to retrieve data. I created a simple function to encode these characters properly.

const escapeCharacters = (word) => {
  const result = encodeURI(word);
  return result;
};
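For instance, encoding a four-character idiom (this particular idiom is just an example) turns each kanji into three percent-escaped UTF-8 bytes:

```javascript
// Same helper as above, repeated so the snippet runs on its own.
// encodeURI percent-encodes non-ASCII characters but leaves reserved
// URL characters such as "/" intact; encodeURIComponent is the stricter
// choice when encoding a single path segment.
const escapeCharacters = (word) => encodeURI(word);

console.log(escapeCharacters('一石二鳥')); // '%E4%B8%80%E7%9F%B3%E4%BA%8C%E9%B3%A5'
```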

I then passed the escaped characters into my GET request.

const word = escapeCharacters(idiom);
const response = await axios.get(`https://dictionary.goo.ne.jp/word/${word}/`);

In the case that no definition was available for the idiom, an error would occur. To avoid errors, I created an error message to present to users when there was no definition available. I created an empty result array, wrapped my code inside of a try/catch block, pushed the definition to the result array when tries were successful, pushed an apology message when there were errors, and returned whatever was inside of the result array.

async function getDefinition (idiom) {
  const word = escapeCharacters(idiom);
  const result = [];
  try {
    const response = await axios.get(`https://dictionary.goo.ne.jp/word/${word}/`);
    const data = response.data;
    const html = new JSDOM(data);
    const unparsedDefinition = html.window.document.querySelectorAll('.content-box, .contents_area, .meaning_area, .p10, .contents, .text')[0].textContent;
    const definition = unparsedDefinition.trim();
    result.push(definition);
  } catch (error) {
    result.push('Sorry, definition unavailable');
  }
  return result[0];
}

I created a third function that would get the appropriate definition for the idiom at hand. The idiom API I used changes the idiom/pronunciation it provides with each request, so it was pivotal that I only request an idiom once. Otherwise, I was likely to wind up with two idioms and a potentially mismatched definition.

async function getIdiomWithMeaning() {
  const idiomAndPronunciation = await getIdiomAndPronunciation();
  const { pronunciation } = idiomAndPronunciation[0];
  const { idiom } = idiomAndPronunciation[1];
  const definition = await getDefinition(idiom);
  const idiomWithMeaningData = [
    { pronunciation },
    { idiom },
    { definition },
  ];
  return idiomWithMeaningData;
}

I then served this definition (alongside the idiom itself and its pronunciation) up as JSON data available to my client side JavaScript.

async function displayIdiomWithMeaning(req, res) {
  const data = await getIdiomWithMeaning();
  res.json(data);
}

app.get('/idiom', displayIdiomWithMeaning);

Instead of making multiple network requests on the client side, I made only one request to my own server:

const idiomWithDefinition = await fetch('http://localhost:3000/idiom');

Migrating from MongoDB to PostgreSQL (Case Insensitive Sorting)

After creating a new database in PostgreSQL, I needed to connect my Node/Express application to my database. I decided to use node-postgres to do this. I followed the recommended setup for node-postgres and created a new db folder with an index.js file setting up my PostgreSQL database. After quickly testing to ensure that I was connected to the database, I began the process of updating my routes to reflect the new database.

The process of querying the PostgreSQL database instead of the MongoDB database was mostly straightforward (though I did learn that parameterized queries are preferred over passing information directly to queries, so as to avoid SQL injection attacks). However, I encountered an interesting problem when sorting my queries.

Sorting happens by Unicode code point, so a query sorting alphanumeric characters is not case insensitive: every uppercase letter comes before every lowercase letter, which means 'Z' sorts before 'a.'
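The same behavior shows up in JavaScript's default Array.prototype.sort, which compares UTF-16 code units, so it can be demonstrated without a database (sample names are invented):

```javascript
const names = ['banana', 'Apple', 'cherry', 'Zebra'];

// Default sort compares code units, so all uppercase letters sort first.
const caseSensitive = [...names].sort();

// Lowercasing both sides in the comparator mirrors ORDER BY lower(name).
const caseInsensitive = [...names].sort(
  (a, b) => a.toLowerCase().localeCompare(b.toLowerCase())
);

console.log(caseSensitive);   // ['Apple', 'Zebra', 'banana', 'cherry']
console.log(caseInsensitive); // ['Apple', 'banana', 'cherry', 'Zebra']
```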

Previously, in MongoDB, I solved this issue by adjusting the collation for the query. I used the English ("en") locale and gave it a strength of "1" for a primary level of comparison that ignores case.

I looked for a way to adjust collation strength similarly in PostgreSQL, but the collation documentation did not seem to indicate this was possible.

I changed my strategy and decided that I could make a case insensitive query if everything in my query was the same case. If I made all of the data in my query either lower case or upper case, I would no longer have an issue with data appearing out of order. However, I wanted to be sure that I did not actually alter any data within the table.

To do this, I used the lower() function. To query the names of bookmark lists that a user with a user ID of 1 had saved in alphabetical order, I used the following query:

SELECT * FROM lists WHERE user_id = 1 ORDER BY lower(name)

This did not alter any data within the table and the query returned the properly capitalized data in alphabetical order.

I ran into another issue with case insensitive sorting later on when I adjusted the sorting options for videos. Users can choose to sort videos in several ways: by date bookmarked, by date added to TikTok, by creator name, and by video description. When I used the same lower() function I previously used, this generated a series of different errors.

By default, videos appear either by date bookmarked or by date added to a bookmark list (depending on whether a user is viewing all videos or only the videos that appear in a particular list they have created). The lower() function does not work on timestamps, so this generated the first error.

To fix this error, I decided to create a new “pipeline” to handle video sorting. First, I used my existing getSortOption inside of a newly created function called checkCaseSensitivity. I have been trying harder to create functions that serve singular purposes, so checkCaseSensitivity was a pretty simple, self-explanatory function.

const checkCaseSensitivity = () => {
  const startsWithDate = getSortOption(app.locals.sortVideoOption).startsWith('date')
  startsWithDate ? getVideosCaseInsensitive() : getVideosCaseSensitive()
}
All of the video sorting options without case sensitivity began with “date,” so I created a constant with a Boolean value. Either the sorting option started with “date” (and case sensitivity did not apply) or it did not start with “date” (and case sensitivity did apply). If the statement was true (i.e. if there was no case sensitivity to worry about), another function would be called. If the statement was false, a different function would be called. (I used a ternary operator to put all of this on one line!)

I created two functions to correspond to the functions I referenced in my checkCaseSensitivity() function: getVideosCaseInsensitive() and getVideosCaseSensitive(). The only real difference between each function was the text that they would use to query the database. Therefore, I decided to create a separate, reusable getVideos() function that would accept the text for the query as an argument.

In the ORDER BY clause for getVideosCaseInsensitive(), I called the getSortOption() function I created and passed whatever sorting option the user had chosen to it as an argument. Because the case insensitive query would only be used for dates, there was no need to use the lower() function here.

ORDER BY ${getSortOption(app.locals.sortVideoOption)}

When creating the getVideosCaseSensitive() function, I encountered a new problem. I tried to create an ORDER BY clause that looked like the following:

ORDER BY lower(${getSortOption(app.locals.sortVideoOption)})

However, the lower() function could not be called when there was an ascending or descending option in the text for the query. If I removed the ascending/descending option, the query would work and would be case insensitive, but this only worked for ascending queries.

After doing a bit of digging, I found that the column and the desired sorting direction could be separated so that the lower() function could be used on the column and the direction could be included outside of the lower() function. I decided to create two arguments to pass to the getVideosCaseSensitive() function: column and direction. That would make my ORDER BY clause look like the following:

ORDER BY lower(${column}) ${direction}
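Because the column and direction are interpolated into the query text rather than passed as parameters, it is worth validating them against a whitelist before building the clause. A minimal sketch (the column names and helper name here are hypothetical):

```javascript
// Hypothetical whitelist guard: only known columns and directions are
// allowed into the interpolated ORDER BY clause, which keeps SQL
// injection out of the sort option.
const SORTABLE_COLUMNS = new Set(['description', 'author_name', 'date_added', 'date_bookmarked']);
const DIRECTIONS = new Set(['ASC', 'DESC']);

function buildOrderBy(column, direction) {
  if (!SORTABLE_COLUMNS.has(column) || !DIRECTIONS.has(direction.toUpperCase())) {
    throw new Error(`Unsupported sort option: ${column} ${direction}`);
  }
  return `ORDER BY lower(${column}) ${direction.toUpperCase()}`;
}

console.log(buildOrderBy('author_name', 'desc')); // ORDER BY lower(author_name) DESC
```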

I returned to my checkCaseSensitivity() function to pass a column and direction argument when I called the getVideosCaseSensitive() function. When a user chooses a sorting option, the data is passed as a string formatted as 'column direction.' I split this string on the space to turn it into an array of separate items. The first item (0) would be the column, and the second item (1) would be the direction.

const sortOptionArray = app.locals.sortVideoOption.split(' ')
const sortColumn = sortOptionArray[0]
const sortDirection = sortOptionArray[1]

Unfortunately, I had to get rid of my ternary operator to define these variables. If a user has not selected a sort option and the default sort option is in place, trying to perform the split() operation on an undefined variable generates an error. It made the most sense to return to a regular conditional statement and define these variables only when the user had selected an option that required case insensitive sorting.
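An alternative sketch that avoids calling .split() on undefined by supplying a default first (the default string and helper name are hypothetical placeholders, not the app's actual values):

```javascript
// If the user has not picked a sort option yet, fall back to a default
// string so .split() is never called on undefined.
// 'date_bookmarked DESC' is a hypothetical placeholder default.
function getSortParts(sortVideoOption) {
  const option = sortVideoOption === undefined ? 'date_bookmarked DESC' : sortVideoOption;
  const [column, direction] = option.split(' ');
  return { column, direction };
}

console.log(getSortParts('author_name ASC')); // { column: 'author_name', direction: 'ASC' }
console.log(getSortParts(undefined));         // { column: 'date_bookmarked', direction: 'DESC' }
```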

However, fortunately, case insensitive sorting was functional again!

Migrating from MongoDB to PostgreSQL (Database Structure)

As I continued to restructure my database to accommodate user authentication, I ran into several issues. I solved some of these problems using embedded documents and aggregation. However, as I continued to learn more about databases, I came to an important realization: I had misunderstood the relationships within my data and incorrectly identified several of them as one-to-many.

After learning more about databases, I sat down (with a pencil and paper) and made some new notes. One user could have many favorite videos and many lists. At first glance, users had a one-to-many relationship with both videos and lists. Although each user would have their own unique set of lists, multiple users could add the same video. In my current database structure, many users could add the same video and this would add a new embedded document to their user document, despite being the exact same information as any other video with the same TikTok video ID. Instead of users and videos having a one-to-many relationship, users and videos should have a many-to-many relationship.

Next, I re-examined the relationship between videos and lists. I erroneously thought this was also a one-to-many relationship, but it was actually another many-to-many relationship. One video could appear in multiple lists, and one list could have multiple videos.

Although MongoDB could handle these relationships using document references instead of embedded documents, it made more sense to me to turn to a relational database for data that included multiple many-to-many relationships. I began learning about SQL and decided to migrate from MongoDB to PostgreSQL for this project.

After installing PostgreSQL and creating a new database for my project, I decided to begin the migration by recreating the documents I made in MongoDB as tables in PostgreSQL.

MongoDB Users Schema:

username: {
  type: String,
  required: true,
  minlength: 4,
  maxlength: 100,
},
password: {
  type: String,
  required: true,
  minlength: 6,
  maxlength: 100,
},
email: {
  type: String,
  required: true,
  unique: true,
  minlength: 4,
  maxlength: 256,
},
createdAt: {
  type: Date,
  default: Date.now,
},
resetPasswordToken: {
  type: String,
},
resetPasswordTokenExpires: {
  type: Date,
}

PostgreSQL Users Table:

user_id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
username VARCHAR(100) CHECK (LENGTH(username) > 3) NOT NULL,
password VARCHAR(100) CHECK (LENGTH(password) > 5) NOT NULL,
email VARCHAR(256) UNIQUE CHECK (LENGTH(email) > 3) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
reset_password_token TEXT,
reset_password_token_expiration TIMESTAMP

There were several key differences between MongoDB, a NoSQL database, and PostgreSQL, a SQL database, that I had to learn to create the users table. First, PostgreSQL does not automatically generate an id the way that MongoDB does. Instead, I created my own user ID column using an integer data type and assigned it as a primary key generated always as identity. This automatically assigns a unique, increasing id number to each new user inserted into the table, which can be referenced in other tables as a foreign key.

MongoDB Videos Schema:

url: {
  type: String,
  match: /tiktok.com\//i,
},
id: {
  type: String,
},
title: {
  type: String,
},
authorUrl: {
  type: String,
},
authorName: {
  type: String,
},
dateAdded: {
  type: Date,
},
dateBookmarked: {
  type: Date,
  default: Date.now
}

PostgreSQL Videos Table:

video_id CHAR(19) PRIMARY KEY NOT NULL,
url VARCHAR(800) CHECK (url ~* 'tiktok.com\/') NOT NULL,
title VARCHAR(256),
author_url VARCHAR(256) NOT NULL,
author_name VARCHAR(256) NOT NULL,
date_added TIMESTAMP NOT NULL

I handled video ids differently in my videos table than I did in my videos schema. In my MongoDB videos schema, each video document had two ids: the id that MongoDB assigned it and the id for the TikTok video itself. There was no need for two ids, as the id TikTok assigns each video is already unique. Instead of asking PostgreSQL to assign a unique, increasing integer for each video, I decided to use the id TikTok created when inserting new videos. Aside from the difference in ids, I also introduced variable character limits for strings in the PostgreSQL table.

The final big difference between the original schema and the new table was the date bookmarked field. All of the information for one video could be shared between users, with the exception of the date bookmarked field; this value would be unique to each user. However, I did not want to structure this as a one-to-many relationship with potentially multiple rows in a table containing identical information except for that one field. I decided to handle the data for the date bookmarked later, in a reference table.

MongoDB Lists Schema:

name: {
  type: String,
  required: true,
  maxlength: 100
},
videos: [{
  type: Schema.Types.ObjectId,
  ref: 'Video'
}]

PostgreSQL Lists Table:

list_id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
user_id BIGINT NOT NULL,
name VARCHAR(100) NOT NULL,
CONSTRAINT fk_user
  FOREIGN KEY(user_id)
  REFERENCES users(user_id)
  ON DELETE CASCADE
  ON UPDATE CASCADE

I generated an id for lists in the same manner that I generated an id for users. I also introduced relationships in the lists table, just as I had in the MongoDB schema, but I referenced a user id instead of video ids. This represents the one-to-many relationship where one user can have many lists. Each list will therefore reference the one user who created it. I added a foreign key as a constraint and made sure that, if the user it references is updated or deleted, the list will reflect those changes. (For example, if a user decides to delete their account, the lists they have created will also disappear.)

To represent the many-to-many relationship between users and videos and videos and lists, I had to create new tables that linked relevant ids together.

I created this table to link users and videos together:

user_id BIGINT NOT NULL,
video_id CHAR(19) NOT NULL,
date_bookmarked TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_user
  FOREIGN KEY(user_id)
  REFERENCES users(user_id)
  ON DELETE CASCADE
  ON UPDATE CASCADE,
CONSTRAINT fk_video
  FOREIGN KEY(video_id)
  REFERENCES videos(video_id)
  ON DELETE CASCADE
  ON UPDATE CASCADE,
PRIMARY KEY(user_id, video_id)

The table above also includes a column for date bookmarked, as previously discussed.

Then, I created this table to link videos and lists together:

list_id BIGINT NOT NULL,
video_id CHAR(19) NOT NULL,
CONSTRAINT fk_list
  FOREIGN KEY(list_id)
  REFERENCES lists(list_id)
  ON DELETE CASCADE
  ON UPDATE CASCADE,
CONSTRAINT fk_video
  FOREIGN KEY(video_id)
  REFERENCES videos(video_id)
  ON DELETE CASCADE
  ON UPDATE CASCADE,
PRIMARY KEY(list_id, video_id)

Using Aggregation to Work With Nested Arrays in MongoDB

After restructuring my database to accommodate user authentication, I was faced with a new challenge: displaying and manipulating information from arrays.

Before restructuring the database, I had implemented pagination and a way for users to sort videos and have their sorting preference remembered. The variable videoLimitPerPage established the number of videos a user would see on each page before needing to go to another page. I added a page parameter for routes that displayed videos and stored the current page in the variable page. Users could store their personal preference for how videos should be sorted in a variable called sortVideoOption, which would in turn be stored as a local variable in app.locals.sortVideoOption. I implemented these features by creating a function called video_list and accessing its callback in my views to display videos.

video_list: function(callback) {
  Video.find({})
    .limit(videoLimitPerPage)
    .skip(videoLimitPerPage * (page - 1))
    .sort(app.locals.sortVideoOption)
    .exec(callback)
}
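The skip/limit arithmetic can be sanity-checked on its own (15 per page, matching the counts mentioned later):

```javascript
const videoLimitPerPage = 15;

// Page N starts after (N - 1) full pages of results have been skipped.
const skipFor = (page) => videoLimitPerPage * (page - 1);

console.log(skipFor(1)); // 0
console.log(skipFor(3)); // 30
```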

After restructuring my database, I no longer had a video model. Instead, each user had a "videos" array that held video objects. Therefore, I would have to completely rethink how I accessed video data.

Initially, I tried to display my data by finding a specific user and populating it with the objects in the videos array.

User.find({ email : req.user.email }, 'videos')
  .limit(videoLimitPerPage)
  .skip(videoLimitPerPage * (page - 1))
  .sort(app.locals.sortVideoOption)
  .exec(callback)

After editing the way my views handled callback data, I was able to view videos added directly to a specific user account and use the .length array property to get an accurate count of how many video objects were in my videos array. However, my pagination and sorting did not function as intended. After all, I was only retrieving one user, so there wasn’t enough to limit or sort.

After searching for how to access nested arrays in MongoDB, I began to learn about aggregation. Essentially, aggregation means taking data and processing it in some way to get a result. MongoDB has several ways of taking advantage of aggregation, but, in this case, I wanted to use the aggregation pipeline. The aggregation pipeline lets you process data in stages. Every subsequent stage processes the data further.

I began to create my aggregation pipeline by thinking of the criteria I wanted to use. First, I would need to access the correct user. MongoDB advises filtering early on in the aggregation pipeline, so I decided to use a $match filter to find the correct user.

$match: {email: req.user.email}

Once I had the correct user, I would then need to access that user’s videos array. To do so, I used the $unwind operator to deconstruct the videos array. This would allow me to access the individual objects within the videos array.

$unwind: '$videos'

After deconstructing the videos array, I wanted to sort results before anything related to pagination because all of the videos should be sorted together, not on a page-by-page basis. This presented me with an interesting problem. When using the cursor .sort() method, as I had done previously, results would load properly even if the user had not requested a sorting method (i.e. app.locals.sortVideoOption was undefined). However, when using aggregation, results would not load if there was an aggregation $sort stage with no valid means of sorting. I wanted to create a default sort condition that could be used if the user had not requested a specific sorting method.

To create this default sort condition, I created a new function before I began the aggregation pipeline process. I called this function getSortOption and gave it an argument called userInput. I created a conditional statement based on whether userInput existed. If userInput was undefined, I returned an object sorting by date bookmarked as the default sort option. If there was any userInput, I returned that instead. In this case, userInput would be form data submitted in the following format: ‘videos.sortOption,sortDirection.’ I used the .split() method to separate the form data into two separate variables: the field to be sorted and the direction for sorting. I then returned these variables as an object.

function getSortOption(userInput) {
  if (userInput === undefined) {
    const defaultSortOption = {'videos.dateBookmarked' : 1}
    return defaultSortOption
  } else {
    const optionArray = userInput.split(',')
    const sortField = optionArray[0]
    const sortDirection = parseInt(optionArray[1])
    const userSortOption = {[sortField] : sortDirection}
    return userSortOption
  }
}
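Called with and without user input, the function behaves like this (condensed and repeated so the example runs on its own):

```javascript
// Condensed version of getSortOption: default to date bookmarked,
// otherwise split the 'field,direction' form data into an object.
function getSortOption(userInput) {
  if (userInput === undefined) {
    return { 'videos.dateBookmarked': 1 };
  }
  const [sortField, sortDirection] = userInput.split(',');
  return { [sortField]: parseInt(sortDirection) };
}

console.log(getSortOption(undefined));         // { 'videos.dateBookmarked': 1 }
console.log(getSortOption('videos.title,-1')); // { 'videos.title': -1 }
```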

I then called this function in the $sort operator and passed app.locals.sortVideoOption to it as the argument for userInput.

$sort: getSortOption(app.locals.sortVideoOption)

I then took the .skip and .limit cursor methods and changed them into $skip and $limit aggregation operators.

$skip: (videoLimitPerPage * (page - 1))
$limit: videoLimitPerPage

Finally, I used the $group operator to return the output as a single document. The $group operator has to be used with an accumulator operator, so I decided to use the $push operator. That way, I would have an array with all of the appropriately matched, sorted, skipped, and limited results. To include all of the results in the same array, I chose an id of null.

$group: { _id: null, videos: {$push:'$videos'}}

With all of these considerations made, my final aggregation looked like this:

User.aggregate([
  { $match: { email: req.user.email } },
  { $unwind: '$videos' },
  { $sort: getSortOption(app.locals.sortVideoOption) },
  { $skip: (videoLimitPerPage * (page - 1)) },
  { $limit: videoLimitPerPage },
  { $group: { _id: null, videos: { $push: '$videos' } } },
], callback)
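As a sanity check, the behavior of these stages can be imitated on a plain JavaScript array (the sample data below is invented):

```javascript
// Tiny in-memory imitation of the aggregation stages above.
const users = [
  { email: 'a@example.com', videos: [{ title: 'c' }, { title: 'a' }, { title: 'b' }] },
  { email: 'b@example.com', videos: [{ title: 'z' }] },
];

const videoLimitPerPage = 2;
const page = 1;

const result = users
  .filter((u) => u.email === 'a@example.com')      // $match
  .flatMap((u) => u.videos)                        // $unwind
  .sort((x, y) => x.title.localeCompare(y.title))  // $sort
  .slice(videoLimitPerPage * (page - 1))           // $skip
  .slice(0, videoLimitPerPage);                    // $limit

console.log(result); // [{ title: 'a' }, { title: 'b' }]
```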

Previously, I used this (the video_list) function to get a tally of the videos in the array. With the $limit operator in place, the length of the array always appeared to be 15 or less, even when more videos were available, so I needed a separate function to find the number of videos in the array. At first, I recycled my original query and found the length of the videos array in my views, as I had before implementing the aggregation framework.

User.find({ email : req.user.email }, 'videos', callback)

However, I wanted to see if I could find the number of videos in the videos array using the aggregation framework as well. I started off by using the same $match criteria and using $unwind to access the videos array. I used the $group operator next, but instead of using $push to add the videos to my callback, I used $sum to tally up the total number of videos in the array.

User.aggregate([
  { $match: { email: req.user.email } },
  { $unwind: '$videos' },
  { $group: { _id: null, count: { $sum: 1 } } }
], callback)

Creating One-to-Many Relationships with Embedded Documents in MongoDB

After adding user authentication, I had three separate schemas for my application: a user schema, a video schema, and a list schema (that referenced videos created using the video schema). After creating an account and logging in, users could see any video or list that any user had created. However, users should have only been able to see the videos and lists that they themselves added. I had to create some sort of relationship between the user and videos and lists of videos.

MongoDB outlines two separate methods for creating one-to-many relationships: embedded documents and document references. After reviewing both methods, it looked like embedded documents adhered more to best practices in MongoDB. Based on this, I decided to replace my video and list schemas with a single user schema. Each user schema would retrieve all of the information relevant to any one particular user.

I began the process of redesigning my database by adding the information from the video schema into an array of objects on my user schema.

videos: [{
  url: {
    type: String,
    match: /tiktok.com\//i,
  },
  id: { type: String },
  title: { type: String },
  authorUrl: { type: String },
  authorName: { type: String },
  dateAdded: { type: Date },
  dateBookmarked: {
    type: Date,
    default: Date.now
  }
}]

Without the video schema, I would need to change the process by which videos were added to the database. I also took this as an opportunity to clean up my existing code and use an async waterfall function to hand off callback data to the next function.

First, I created a function that followed the URL the user submitted to see if there were any redirects. Next, I created another function that used that redirected URL to check for API data. If the API data could be successfully fetched, the next function I created then checked to confirm that it was not undefined. After that, I used a different function to determine the ID TikTok assigned to the video. My next function checked for duplicate videos within the database.

This duplicate checker was the first meaningful deviation from my old code. Instead of searching through videos as separate documents, I was searching through the videos array of my user model.

Previously, I could’ve found the TikTok ID of a video by searching for this:

Video.findOne( {'id': tiktokId } )

However, I now needed to search users instead of videos, and, after finding the correct user, I needed to search their video array for a matching TikTok ID. I used the $and operator to look for the user currently logged in and a video ID in their videos array that matched the TikTok ID in question.

User.findOne( { $and: [ { email: req.user.email }, { 'videos.id': tiktokId } ] } )
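Conceptually, the 'videos.id' match does the same job as scanning one user's array (the sample data below is invented):

```javascript
const user = {
  email: 'a@example.com',
  videos: [{ id: '1234567890123456789' }, { id: '9876543210987654321' }],
};

// Equivalent of matching 'videos.id' within one user's document.
const isDuplicate = (tiktokId) => user.videos.some((v) => v.id === tiktokId);

console.log(isDuplicate('1234567890123456789')); // true
console.log(isDuplicate('0000000000000000000')); // false
```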

After checking for duplicate videos, I created four functions (get binary ID, get thirty two left bits, get decimal from bits, and get date added) to find the date the video was created on. I created another function to save all of the video data to a new object.

The way the video data was saved to a variable represents another deviation from my previous code. Previously, I created an object and added TikTok data to it. After I finished adding all of the necessary data to the object, I created a variable that represented a new video.

const newVideo = {
  url: redirectedUrl,
  id: tiktokId,
  title: tiktokData.title,
  authorUrl: tiktokData.author_url,
  authorName: tiktokData.author_name,
  dateAdded: dateAdded,
  dateBookmarked: Date.now()
}

const video = new Video(newVideo)

However, in my new code, I could not use the new operator. I was not creating a new user; I was only adding something to a user. To add something to my user’s video array, I used the push method instead in my final function.

user.videos.push(newVideo)

In this final function, I also took the last step of saving this updated information. Previously, I saved the video itself.

video.save((error) => { if (error) { return next(error) } })

This time, I saved the user (with its updated video array) instead.

user.save((error) => { if (error) { return next(error) } })

Now, each new video would be associated with the correct user and would be an embedded document within the user.

Getting TikTok Date Information in Node.js

One of the features many TikTok users, including myself, would like to have is the ability to search for or sort videos according to when they were uploaded. This information is not included in the official TikTok API, so I assumed that it was not available. However, I recently read a very helpful post from Digital Forensics that detailed how to get timestamps from TikTok videos. I decided to implement the steps in Node.js and append this information to videos saved in the database.

After getting the redirected URL and confirming the data retrieved from the TikTok API was valid, I began the process by extracting the video ID from the TikTok URL. So far, I have found two separate patterns for TikTok URLs. URLs coming from mobile browsers look something like “http://m.tiktok.com/v/0000000000000000000” while URLs from desktop browsers look more like this: “https://www.tiktok.com/@user/video/0000000000000000000.” I created two separate regular expressions to match both patterns.

For the mobile URL patterns:

const idRegexPattern = new RegExp(/\d+/)
const tiktokIdArray = redirectedUrl.match(idRegexPattern)
const tiktokId = tiktokIdArray[0]

For the desktop URL patterns:

const videoRegexPattern = new RegExp(/video\/\d+/)
const tiktokVideoPattern = redirectedUrl.match(videoRegexPattern)
const tiktokVideoString = tiktokVideoPattern.toString()
const tiktokIdArray = tiktokVideoString.match(idRegexPattern)
const tiktokId = tiktokIdArray[0]
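Both branches can be folded into a single helper that anchors on '/v/' or '/video/', so stray digits elsewhere in the URL (such as in a username) are not matched. A sketch:

```javascript
// Extract the 19-digit TikTok video ID from either URL shape.
// Returns null when neither pattern matches.
function getTiktokId(url) {
  const match = url.match(/\/(?:v|video)\/(\d+)/);
  return match ? match[1] : null;
}

console.log(getTiktokId('http://m.tiktok.com/v/0000000000000000000'));              // '0000000000000000000'
console.log(getTiktokId('https://www.tiktok.com/@user/video/0000000000000000000')); // '0000000000000000000'
```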

To find the timestamp for the video, you have to convert the video ID to binary. However, as it is a 19-digit integer, JavaScript cannot handle the number with enough accuracy for the binary number to correspond to the correct timestamp. The latest Node.js versions can handle these big integers natively (using BigInt), but I was working with an earlier version. Instead, I installed BigNumber to work with big integers. I then used a loop to calculate the binary representation of the id.

const BigNumber = require('bignumber.js')

const tiktokIdBigIntBinaryArray = []
let currentInteger = BigNumber(tiktokId)
while (currentInteger.gt(0)) {
  const remainderString = currentInteger.mod(2).toString()
  tiktokIdBigIntBinaryArray.unshift(remainderString)
  // idiv performs integer division; plain div() would return a fraction
  currentInteger = currentInteger.idiv(2)
  if (currentInteger.isZero()) {
    // Prepend a leading zero so the 63-bit ID becomes a full 64-bit string
    tiktokIdBigIntBinaryArray.unshift('0')
  }
}

To calculate the binary number, I used a loop that repeatedly divides the ID of the TikTok video by two and adds the remainder to the beginning of a new array. I then took the first 32 digits from the array to prepare them for conversion to decimal.

const tiktokIdBinaryArray = tiktokIdBigIntBinaryArray.map(number => parseInt(number))
const tiktokIdBinaryString = tiktokIdBinaryArray.join('')
const thirtyTwoLeftBits = tiktokIdBinaryString.slice(0,32)

I used another loop to convert the binary string into decimal.

const decimalArray = []
let arrayPlace = 0
let previousValue = 0
while (arrayPlace < 32) {
  const valueTotal = previousValue * 2
  const bitAsInteger = parseInt(thirtyTwoLeftBits[arrayPlace])
  const newTotal = bitAsInteger + valueTotal
  decimalArray[0] = newTotal
  previousValue = newTotal
  arrayPlace += 1
}
const decimal = decimalArray.toString()

From left to right, each value is the previous total doubled, plus the current bit. I looped through the entire 32-bit string to arrive at a 10-digit decimal number. I used an array to hold the running result, and, once the loop was complete, the sole value left in the array was the decimal number to be converted into a date.
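
As an aside, because 32 bits fit comfortably within JavaScript's safe-integer range, the same conversion can be done with a single parseInt call using a radix of 2. This is just a shortcut worth knowing, not the code above:

```javascript
// Equivalent shortcut: parse the 32-bit binary string directly as base 2.
const binaryToDecimal = (bits) => parseInt(bits, 2)

binaryToDecimal('1011') // → 11
binaryToDecimal('1'.repeat(32)) // → 4294967295, the largest 32-bit value
```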

The value in the decimal variable is a Unix timestamp in seconds. Because the Date constructor expects milliseconds, I multiplied it by 1000.

const dateAdded = new Date(decimal * 1000)

I was able to add this information into a new dateAdded field for videos in my database. Now, I will be able to include sorting by the date a video was actually added to TikTok, not just by when a video was added to the bookmark app.
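
For reference, on Node.js versions with native BigInt support (10.4+), the whole pipeline collapses into a couple of lines, since the timestamp is simply the top 32 bits of the 64-bit ID. A sketch, using the video ID from TikTok's own API example:

```javascript
// Shift the 64-bit ID right by 32 bits to keep only the top 32 bits,
// which encode the Unix timestamp in seconds.
const tiktokIdToDate = (id) => {
  const seconds = Number(BigInt(id) >> 32n)
  return new Date(seconds * 1000)
}

tiktokIdToDate('6718335390845095173') // a date in late July 2019
```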

Making Many HTTP Requests Using Recursion

One of the most important functions for my TikTok bookmark app is the ability to import an entire list of liked videos. While testing files of liked videos, I realized that longer lists (with thousands of videos) struggled to import in one attempt. After 150 videos or so, I started receiving error messages about the connection timing out. I fiddled with the asynchronous code I had written (and upgraded the for loop I was using to an ES6 for…of loop in the process), but I continued to see the same error messages after a minute or so of processing HTTP GET requests.

To solve this, I decided to add a delay between requests using setTimeout. However, after running some more tests, I realized that setTimeout was not adding any time between requests, even when I placed it inside of a loop. Because setTimeout is non-blocking, the loop schedules every timer at once, and they all fire at roughly the same moment.

Although I was sad to say goodbye to my new for…of loop, I realized that I would need a recursive function with a counter: a function that ran as many times as there were entries in the array, kept track of its position within the array, and added a delay before calling itself again.

I created a new function called “delay” that accepted the number of times I wanted the function to run as an argument. Within that function, I created a new (anonymous) function using setTimeout that would decrease the number of times the function would run. I would call the function later and pass the length of the array to it as the argument for the number of times it would run. That way, it wouldn’t continue to run indefinitely.

After creating a terminal condition within the delay function, I began to rework the code for adding a single video. Previously, with a for loop, I could automatically iterate through the array of videos and dates. This time, I had to create my own counter and increment it myself. Outside of the delay function, I created a new variable called videoCount with a value of zero. Inside the setTimeout callback, I created a variable that accessed the array at index videoCount, and at the end of the callback I incremented videoCount by 1. This way, I would work through the array, starting at zero, until I reached its end.

My final code looked something like this:

const arrayLength = dateVideoArray.length
let videoCount = 0
function delay(times) {
  if (times < 1) {
    console.log('All available, non-duplicate videos added.')
    return
  }
  setTimeout(() => {
    if (videoCount < arrayLength) {
      const dateVideo = dateVideoArray[videoCount]
      // Additional code for adding a single video
      videoCount += 1
      delay(times - 1)
    }
  }, 500)
}
delay(arrayLength)
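
For comparison, the same pacing can be kept with a for…of loop by awaiting a promise-wrapped setTimeout between iterations. This is a sketch of that alternative, not the code the app ended up using; addAllVideos is a hypothetical wrapper:

```javascript
// sleep resolves after ms milliseconds; awaiting it pauses the loop,
// which a bare setTimeout call inside a loop does not do.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function addAllVideos(dateVideoArray, ms) {
  let added = 0
  for (const dateVideo of dateVideoArray) {
    // ...code for adding a single video would go here...
    added += 1
    await sleep(ms)
  }
  return added
}
```

Each await suspends the function until the timer resolves, so requests end up spaced ms milliseconds apart.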

Using Data Uploaded From Text Files in Node/Express

Longtime users of TikTok have probably “liked” hundreds, if not thousands, of videos on the app. It would be incredibly tedious for a user to select each video, click its share icon, copy the link, and then submit its URL by hand. Thankfully, TikTok allows users to download all of their data, including their list of “Liked” videos. The user receives this information as a text file called “Like List.txt.”

The list is formatted as follows:
Date: YYYY-MM-DD HH:MM:SS
Video Link: https://www.tiktokv.com/share/video/videoidnumber

I want a user to be able to upload this text file. Each entry would create a new video object using the video link and its corresponding date. The video link uses a different URL scheme from the other videos (the host is tiktokv.com as opposed to tiktok.com), so I will continue to use the follow-redirects package to get a URL scheme that can fetch JSON data.

I created a simple form in my template engine for the user to upload their list.

form(method='POST', action='/multi-add', enctype='multipart/form-data')
  label(for='like_list') Upload Your Like List
  input(type='file', name='like_list', required)
  button(type='submit') Upload List

I used the Express middleware multer to handle the file data from the form. I wanted all uploaded files to fit certain criteria:

  1. Any file uploaded must be a text (.txt) file
  2. Any file uploaded must be named “Like List.txt”
  3. Any file uploaded must be smaller than 1,000 kilobytes
  4. Files can only have alphanumeric characters, spaces, line breaks, colons, dashes, forward slashes, and periods.

I used multer to make sure that files fit the first three criteria before being uploaded.

const multer = require('multer')

const upload = multer({
  dest: 'uploads/',
  limits: { fileSize: 1000000 },
  fileFilter: function (req, file, cb) {
    if ((file['originalname'] === 'Like List.txt') &&
        (file['mimetype'] === 'text/plain')) {
      return cb(null, true)
    }
    cb(null, false)
  }
})

After a file is uploaded, I read its data using the built-in fs module and deleted the original file. I then checked whether it contained invalid characters.

const fs = require('fs')

const data = fs.readFileSync(req.file.path, 'UTF-8')
fs.unlinkSync(req.file.path)
// A-Za-z, not A-z: the A-z range would also accept [, \, ], ^, _, and `
const regexCheck = new RegExp(/[^A-Za-z0-9\s:\-/.]/)
if (data.match(regexCheck)) {
  return res.send('Your file contains invalid characters. Please upload your Like List file.')
}

I then created a pattern using regular expressions to find each date and its corresponding video URL within the text file.

/Date: \d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d\s
Video Link: https:\/\/www.tiktokv.com\/share\/video\/\d*\//g

I decided to use this information to create another RegExp object. I wanted to create an array of objects where each object represented one video. Within this video object, there would be two key-value pairs. One would represent the date, and the other would represent the video link. Each video link and date would be used to create a new video in the mongodb database.

const regexSort = new RegExp(/Date: \d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d\sVideo Link: https:\/\/www.tiktokv.com\/share\/video\/\d*\//, 'g')
const dateVideoArray = data.match(regexSort)

I then anticipated several conditions that might arise for a potential user:

  1. I wanted the app to be able to handle an empty Like List.
  2. I wanted the app to be able to handle videos that already existed in the user’s collection.
  3. I wanted the app to be able to handle videos that may have been deleted since the user added them to their like list (i.e. unavailable videos).
  4. I wanted the app to add all videos that did not meet any of the criteria above.

First, I created a conditional statement to make sure the Like List had at least one video that matched the regular expression.

// String.prototype.match returns null (not an empty array) when nothing matches
if (!dateVideoArray || dateVideoArray.length === 0) {
  return res.send('Your Like List is empty.')
}

Next, I decided to create a new regular expression to divide entries in the array into subarrays where one entry corresponded to the date the video was liked and the other entry corresponded to the video URL. I used a for loop to iterate over each entry in the larger array.

const dateMatch = new RegExp(/\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d/)
const videoMatch = new RegExp(/https:\/\/www.tiktokv.com\/share\/video\/\d*\//)
for (let i = 0; i < dateVideoArray.length; i++) {
  const videoArray = dateVideoArray[i].match(videoMatch)
  const dateArray = dateVideoArray[i].match(dateMatch)
  const newVideo = videoArray.toString()
  const newDate = dateArray.toString()
}
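
Folding the two matches into the object shape described earlier (one date key and one video-link key per video) might look like this sketch, where parseEntry is a hypothetical helper:

```javascript
// Hypothetical helper: turn one Like List entry into a { date, videoUrl } object.
const dateMatch = /\d{4}-\d\d-\d\d\s\d\d:\d\d:\d\d/
const videoMatch = /https:\/\/www\.tiktokv\.com\/share\/video\/\d*\//
const parseEntry = (entry) => ({
  date: entry.match(dateMatch)[0],
  videoUrl: entry.match(videoMatch)[0],
})

const entry = 'Date: 2020-01-15 12:30:45 Video Link: https://www.tiktokv.com/share/video/1234567890123456789/'
parseEntry(entry)
// → { date: '2020-01-15 12:30:45', videoUrl: 'https://www.tiktokv.com/share/video/1234567890123456789/' }
```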

I used follow-redirects and fetch to get response data about each video. I then checked to see if a video with the same author and title existed in the database already.

Video.findOne({ 'title': videodetail.title, 'author_name': videodetail.author_name })
.exec(function (err, found_video) {
  if (err) { return next(err) }
  if (found_video) {
    console.log(`A video by ${videodetail.author_name} called ${videodetail.title} already exists.`)
  }

I then checked to see if the response data for the video was undefined.

else if (videodetail.author_name === undefined) {
  console.log('This video is unavailable. It may have been deleted.')
}

Finally, if the video did not meet any of those conditions, I added it to the database.

else {
  video.save(function (err) {
    if (err) { return next(err) }
    console.log(`Video by ${videodetail.author_name} called ${videodetail.title} added!`)
  })
}

Deleting Multiple Documents In One Request With Hidden Form Input

Creating a router and a controller to delete a single video by ID was fairly straightforward. However, this method would give each video its own unique route where a user could then click to delete a video. This sounded pretty tedious, so it was important to me that users could select as many videos as they liked and then click a “delete” icon to remove videos from their collection of favorites.

In the template engine, I created a “cell” for each video in a grid. Each video came with a checkbox whose value was the ID of the video. I wanted a user to be able to select a checkbox or multiple checkboxes and click the delete icon in the menu. At this point, I wanted some sort of alert asking the user to confirm that they wanted to delete the items, and then I wanted the items to be deleted from the database.

It took several steps, in several different files, to accomplish this.

First, I created a controller to handle POST requests for deleting videos. Second, I created a corresponding route for the POST request. Third, I created a form that was empty aside from a submit button. I used Material Icons to restyle the submit button to look like a trash icon. Although the form was connected to the POST route I had previously created, it wasn’t submitting any data for me to use in the controller. I needed to associate the data from the checkboxes the user had (or had not) marked.

To do this, I created a click event in JavaScript that would determine how many videos a user had checked, if any at all, and store these values in an array called checkedVideos. The user would then be asked if they were sure they wanted to delete the selected videos. If the user confirmed, the data from the selected videos would be appended to the form and the form would be submitted.

I decided to use a for loop to iterate through all of the values of the array. Each value represented one checked video, and I appended the value of that checked video (i.e. its unique ID in the database) to the form before submitting it.

for (let i = 0; i < checkedVideos.length; i++) {
  const deletedVideo = document.createElement('input')
  deletedVideo.type = 'hidden'
  deletedVideo.name = 'deleted_video'
  deletedVideo.value = checkedVideos[i].value
  deleteForm.appendChild(deletedVideo)
}
deleteForm.method = 'POST'
deleteForm.action = 'video/delete'
deleteForm.submit()

In the controller, I stored the video IDs in an array (appropriately named videoid) and iterated through that array to delete each video ID stored within the array.

Video.remove({'_id' : {$in: videoid}})
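
A controller along those lines might look like this sketch (the handler name and redirect are assumptions, not the app's actual code). One detail worth noting: when only one checkbox is selected, the parsed form field arrives as a string rather than an array, so it helps to normalize it first. Newer versions of Mongoose also replace remove with deleteMany.

```javascript
// Hypothetical controller for the POST /video/delete route.
function videoDeletePost(req, res, next) {
  // One hidden input parses as a string, several as an array; normalize both.
  const videoid = [].concat(req.body.deleted_video || [])
  Video.remove({ _id: { $in: videoid } }, function (err) {
    if (err) { return next(err) }
    res.redirect('/')
  })
}
```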

Following URLs with Redirect Requests in Node.js with Follow Redirects

While building out my TikTok app, I encountered a problem that made storing and displaying data difficult: the TikTok API stopped accepting requests that used mobile TikTok links. When using the “Copy Link” function in the native TikTok app, the app produces a shortened URL.

For example, TikTok uses this video for their API example: https://www.tiktok.com/@scout2015/video/6718335390845095173. The link to the response data is formatted like this: https://www.tiktok.com/oembed?url=https://www.tiktok.com/@scout2015/video/6718335390845095173.

The mobile link is formatted like this: https://vm.tiktok.com/ZMJAfjEYF/, and TikTok does not support that URL scheme.

When viewing TikTok from their web applications, for both desktop and mobile, the link to the video that the user can send is properly formatted for this URL scheme.

Unfortunately, the ID provided by the native link does not directly provide the information (i.e. author name and web video ID) needed to construct the web application URL. However, when you enter the link from the native application into a browser, it redirects the user to the lengthier, web application URL.

To solve the problem of not being able to retrieve JSON data from mobile URL schemes, I used the follow-redirects npm package. When a user adds a video, the app first makes a GET request to find the destination (if any) that the URL redirects to. This guarantees that a user can add a URL from either the web application or the native mobile app, and the desired JSON data (i.e. author name, author URL, and title) can be fetched and stored in the MongoDB database with its corresponding video.

const { https } = require('follow-redirects')

https.get(video_url, response => {
  // follow-redirects exposes the final URL on response.responseUrl
  const redirectedUrl = response.responseUrl
  fetch(`https://www.tiktok.com/oembed?url=${redirectedUrl}`)
    .then((fetchResponse) => fetchResponse.json())
    .then((data) => {
      videodetail = { video_url: redirectedUrl }
      videodetail.title = data.title
      // ...store the remaining oEmbed fields...
    })
})

Relatedly, the thumbnail image is not stored in the database at all: TikTok occasionally changes its thumbnail URLs, so the latest URL must be retrieved each time. Instead, the JavaScript on the front end of the application performs a separate fetch request:

const url = videoLinks[i].getAttribute('href') // The link in the template engine is the destination of the redirect, not the original mobile URL scheme
fetch(`https://www.tiktok.com/oembed?url=${url}`)
  .then((response) => response.json())
  .then((data) => {
    if (data.thumbnail_url === undefined) {
      videoImages[i].style.backgroundImage = "url('/../images/videounavailable.jpg')"
    } else {
      videoImages[i].style.backgroundImage = `url(${data.thumbnail_url})`
    }
  })