
A LinkedIn Crawler with Meteor.js

Published on June 13, 2014 in JavaScript, MeteorJS

UPDATE, June 18, 2014
Five days after I published this article, I received:

Mr. Demonte,

I’m writing you in regards to an article posted on your website entitled “A Linkedin Crawler with Meteor.js” (http://jb.demonte.fr/blog/linkedin-crawler-meteor-js/). While I believe this crawler was built with good intentions, I am sure you can understand that it bypasses security controls we have put in place to protect the site as well as our members privacy. Given others that read your article may not share your good intentions, we’d like to ask that you remove the post, and discontinue any scraping of profile data on the LinkedIn platform.

Thank you for your cooperation, and if you have any questions, please do not hesitate in contacting me.

Paul Steinau
Sr. Manager, Cyber Threat
Trust & Safety

So, to avoid any problems, the sensitive information is now censored in this article and the project is no longer available on GitHub.

Building web applications with Meteor.js is really easy; as they say, “It just works”.
In this article, I’ll show you how I wrote my LinkedIn crawler.

[image: linkedin-crawler]

This article is a complete rebuild of my previous LinkedIn Crawler.

This time, let’s drop the Phantom.js server and use the power of Meteor.js and Node.js packages.

Note that the goal is not to use the LinkedIn API, but to produce automated activity that looks like a real user’s (visited profiles get “viewed” notifications).

Due to continuous upgrades, the code on GitHub may differ from the one described in this article.

Restriction
In this article, I confine the application to a single user; I do not deal with multiple sessions or users in parallel.

Useful
I assume you’re using Chrome with the developer tools to explore the DOM and the code.

To build this web app, I used several tools to reverse-engineer the LinkedIn website; the must-haves (Chrome’s developer tools, the cURL and HAR exports, Visual Event, jsbeautifier.org) all appear below.

Creating a new meteor application

Once Meteor is updated from 0.6.x to the latest version with:

meteor update

Let’s create our project:

meteor create linkedin
cd linkedin

As previously said, instead of using Phantom.js, I’m going to use HTTP to get the pages and cheerio to manipulate the DOM with jQuery on the server side.

To use these node modules in meteor, I followed this article.

$ npm install -g meteor-npm # one-time operation

The magic happens when updating the packages.json with:

{
  "cheerio"    : "0.16.0"
}

All the packages are installed automatically; it’s really cool 🙂

An important thing I missed in the documentation: to use the HTTP service, you need to add it to your Meteor application.

$ meteor add http

To get a nicer app, I installed Bootstrap 3 with Meteorite:

$ mrt add bootstrap-3

LinkedIn’s exchange analysis

First, I need a logged-in session, so I started with the login page.

In the Chrome console, check the “Preserve log” option in the Network tab.

[image: preserve_log]

Once logged in to LinkedIn, I saved the cURL equivalent of the login submission to get the posted values and, more importantly, the headers.
This command is helpful for reproducing the POST.

[image: copy_as_curl]

I also used the HAR export to get a JSON version.

I captured each LinkedIn HTTP request this way.

Build HTTP requests in meteor

To simplify my HTTP requests, I first coded two little classes to handle the cookie and the options object passed to the HTTP.get / HTTP.post requests.

Cookie class

The idea is to use a single cookie object throughout the whole process, from the login to the profile views.

View the cookie class
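The class itself is censored, but the idea fits in a few lines. Here is a minimal sketch, assuming a plain key/value jar fed by the set-cookie response headers (all names are mine, not the original ones):

// Minimal sketch of a cookie jar: parses "set-cookie" headers and
// serializes the stored pairs back into a "Cookie" request header.
function Cookie() {
  this.jar = {};
}

// setCookie is the array found in res.headers["set-cookie"]
Cookie.prototype.update = function (setCookie) {
  var self = this;
  (setCookie || []).forEach(function (entry) {
    // keep only the "name=value" part, drop Path/Expires/... attributes
    var pair = entry.split(";")[0].split("=");
    self.jar[pair[0].trim()] = pair.slice(1).join("=");
  });
};

// Serialize to the value expected in the "Cookie" request header
Cookie.prototype.toString = function () {
  var self = this;
  return Object.keys(this.jar).map(function (name) {
    return name + "=" + self.jar[name];
  }).join("; ");
};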

Options class

An options object will be created for each HTTP request, each time using the same cookie.

View the Options class
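The original Options class is censored as well; a plausible minimal version just carries browser-like default headers, plus optional form parameters for the POST requests (a sketch, not the original code):

// Sketch: the options object passed to HTTP.get / HTTP.post.
// The Cookie header itself is injected later, by the HTTP helpers.
function Options(params) {
  this.headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept": "text/html,application/json"
  };
  if (params) {
    this.params = params; // form fields for HTTP.post
  }
}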

Step 1: Get login page

At first, we do not have any cookie, so this will be the only request without one.

  res = HTTP.get("https://www.linkedin.com/uas/login", new Options());

The result is an object like:

code-censored

Where:

  • statusCode is the HTTP response code
  • content is a raw string representing the web page / JSON payload
  • headers is an object containing the response headers
  • data is an object when the content is stringified JSON

If the HTTP response includes a set-cookie header, the cookie needs to be updated with this array of settings.
The response payload then has to be extracted, either as a jQuery object when it is an HTML string or as a JSON object when it is JSON.

To simplify this task, I coded some helpers that make the HTTP request, process the cookie update and return the extracted object (DOM or JSON).

View the HTTP helpers
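Those helpers are censored too. Under the assumptions above, get could look like the sketch below; a symmetric post helper would do the same around HTTP.post. Meteor.require is what the npm integration article sets up, so treat that call as an assumption of this sketch:

// Sketch: perform the request, send the current cookie, absorb any
// set-cookie from the response, and return a DOM or a JSON object.
var cheerio = Meteor.require("cheerio");

function get(url, options, cookie) {
  options.headers = options.headers || {};
  options.headers.Cookie = cookie.toString();   // send the session cookie
  var res = HTTP.get(url, options);
  if (res.headers["set-cookie"]) {
    cookie.update(res.headers["set-cookie"]);   // keep the jar up to date
  }
  // Meteor fills res.data when the response is JSON; otherwise load the
  // HTML into cheerio to get the server-side jQuery object
  return res.data || cheerio.load(res.content);
}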

With this, here is my way to get a new page:

  $ = get("https://www.linkedin.com/uas/login", new Options(), cookie);

The result is the jQuery object containing the HTML, and the cookie is updated from the response headers.

Step 2: Login to LinkedIn

I extract the input fields (hidden or not) from the form (div…) container using a small helper.

View the GetInputs helper.
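The helper is censored; with cheerio it boils down to something like this (a sketch, names assumed):

// Sketch: collect every input field of a container (form, div…) into
// a { name: value } object, ready to be merged into the POST params.
function getInputs($, selector) {
  var params = {};
  $(selector).find("input").each(function () {
    var name = $(this).attr("name");
    if (name) {
      params[name] = $(this).val() || "";
    }
  });
  return params;
}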

At this point, I had to take a deeper look at the headers sent when logging in.

There are some headers which can be set directly:

code-censored

And some which seem to be injected into the post request by code:

code-censored

To find where these are generated, I used Visual Event to locate the JavaScript code bound to the submit button; here is that code after prettifying it with jsbeautifier.org:

code-censored

Lucky me! I only had to rewrite it, including the global-scope functions it uses.
In the console, I got the value and the location of window.txt-censored; here is its code:

code-censored

I’m now going to integrate it into my Node code, removing the useless parts, merging and cleaning it.

The post only needs the params object, so I updated the previous LinkedIn function to remove the HTML node access and to directly provide the “checksum” values to this object.

View the LinkedIn checksum equivalent.

The last custom header to add is txt-censored. A search across all files using CTRL + SHIFT + F in the Chrome dev tools reveals it:

code-censored

Not a big challenge to rewrite it:

code-censored

Login final code

The last thing to do before logging in successfully is to inject the user credentials from the application form into the request.

code-censored

At this point, the cookie holds the connected session information, so every subsequent request is already authenticated.

Step 3: Submit a search on LinkedIn

The search process will be in two steps:

  • 1. A classic post to get the first result page
  • 2. X JSON posts to get the following result pages. Depending on your account, the result count is limited, from 100 up to an unknown maximum.

1. Post search

This first post is like the login post:

code-censored

The response payload is stringified and commented out inside a code block on the HTML result page:

code-censored

I coded a little function to uncomment it in order to get the JSON result.

At my first attempt, I got the error Parsing error: [SyntaxError: Unexpected token \], which was due to the UTF-8 encoding. To avoid fixing each special character manually, I searched for \u002d across all files and finally found an unescape function in a fizzy.js library.

View the LinkedIn unescape equivalent.

With another helper to get the JSON object, all I have to do now is:

json = uncomment($("#voltron_srp_main-content"));

View the JSON extractor from an HTML comment.
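Both the unescape equivalent and the extractor are censored; here is a sketch of the combination: grab the comment body, undo the \uXXXX escaping that made JSON.parse choke, then parse (regexes and names are my own guesses):

// Sketch: replace every \uXXXX sequence by its actual character
function unescapeUnicode(str) {
  return str.replace(/\\u([0-9a-fA-F]{4})/g, function (all, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Sketch: extract the JSON payload commented out inside a node
function uncomment(node) {
  var match = node.html().match(/<!--([\s\S]*?)-->/);
  return match ? JSON.parse(unescapeUnicode(match[1])) : null;
}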

When building the tool, I wrote some temporary helpers to simplify my analysis of this JSON, which is really huge.

By manipulating the search to return some predefined results, this one helped me find the JSON path to the results:

// Recursively walks the JSON tree and logs the dot-separated path of
// every string value containing the searched text.
function lookFor(txt, tree, path) {
  var k, value;
  path = path || "";
  txt = txt.toLowerCase();
  for (k in tree) {
    if (tree.hasOwnProperty(k)) {
      value = tree[k];
      if (typeof value === "object") {
        // recurse into sub-objects / arrays, extending the path
        lookFor(txt, value, path + k + ".");
      } else if (typeof value === "string" && value.toLowerCase().indexOf(txt) !== -1) {
        console.log(path + k);
        console.log("  >>" + value);
      }
    }
  }
  if (!path) {
    // only the top-level call has an empty path
    console.log("end");
  }
}
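For example, with a surname I knew appeared in the results (the value here is hypothetical):

lookFor("dupont", json);

prints the dot-separated path of every string containing it.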

The person results are stored in txt-censored, which is an array of objects.

Save profiles (persons)

On server side, I defined a collection to store the person objects:

// define a new collection using the global scope
Persons = new Meteor.Collection("persons");

Because the JSON object is quite huge and I don’t want to write 100 meters of “if (json && json.hasOwnProperty()…)” tests, I added a snippet to test a path in an object.

View the path testing function.
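The snippet itself is censored; the usual shape of such a helper is a short reduce over the path segments, something like this (name and signature assumed):

// Sketch: path(obj, "a.b.c") returns obj.a.b.c, or undefined as soon
// as one segment is missing — no more endless hasOwnProperty chains.
function path(obj, keys) {
  return keys.split(".").reduce(function (node, key) {
    return (node && typeof node === "object" && key in node) ? node[key] : undefined;
  }, obj);
}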

The extractPersons function extracts persons from the JSON result. It inserts each person object into the database and visits their profile if that has not been done yet. The visit notifies them that I viewed their profile; the response itself is ignored because it’s useless.

View the extract and visit profile function.

In this one, you’ll notice two points:

  • I added a sleep to avoid being too aggressive with the LinkedIn website
  • The profile URL is built using some credentials from the profile (LinkedIn’s protection against simple +1 loops over profile-view URLs)
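The function is censored, but its skeleton probably looks like the sketch below. The JSON path and the profile-URL construction stay censored; Meteor._sleepForMs is the undocumented fiber-friendly pause shipped with Meteor at the time, and buildProfileUrl is a hypothetical stand-in:

// Sketch: store each person, then politely visit unseen profiles.
function extractPersons(json, cookie) {
  var persons = path(json, "censored.path.to.results") || [];
  persons.forEach(function (person) {
    if (!Persons.findOne({ id: person.id })) {
      Persons.insert(person);
      status.added++;
      // pause 2 to 5 seconds so the crawl does not look like a flood
      Meteor._sleepForMs(2000 + Math.floor(Math.random() * 3000));
      // visit the profile; the response is ignored on purpose
      get(buildProfileUrl(person), new Options(), cookie);
    }
    status.count++;
  });
}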

2. Post ajax pagination search

The idea is to loop over each pagination result, starting from the current post result:

code-censored

In this loop, I directly use the JSON result from the response.data object.
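The endpoint, parameters and loop bounds are censored; generically the loop is just the sketch below, with a placeholder URL and guessed parameter names (post being the JSON-returning twin of the get helper):

// Sketch: fetch the remaining result pages one by one.
function paginate(baseParams, cookie, pageCount) {
  for (var page = 2; page <= pageCount; page++) {
    var params = _.extend({ page_num: page }, baseParams); // names are guesses
    var json = post("https://www.linkedin.com/censored-endpoint", new Options(params), cookie);
    extractPersons(json, cookie);
  }
}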

Create the interface and display results

I’m going to give an overview of the interface, without presenting the classic Meteor features.

I use 3 templates: the form, a status block and a result table. The result table is powered by DataTables for a nicer presentation and for its filter feature.

Form

[image: form]

The form is quite classic and wired up the usual Meteor way.
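As an illustration, the wiring could be as simple as this sketch (template, field and method names are all hypothetical):

// Sketch: classic Meteor event map posting the form to a server method.
if (Meteor.isClient) {
  Template.searchForm.events({
    "submit form": function (event, template) {
      event.preventDefault();
      Meteor.call("search", {
        login:    template.find("[name=login]").value,
        password: template.find("[name=password]").value,
        keywords: template.find("[name=keywords]").value
      });
    }
  });
}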

Status block

In the pieces of code above, I voluntarily removed some status updates which are displayed in the interface.

[image: status]

On the server side, I use a global status object updated throughout the process:

var status = {
    log: '',
    resultCount: '',
    count: 0,
    added: 0
};
...
status.log = "Connected to LinkedIn...";
...
status.count++;
...
status.added++;
...

View the status publisher.

View the status template.

View the status handler.
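The publisher follows the classic single-document pseudo-collection pattern; roughly like this (a sketch; the periodic refresh strategy is my assumption):

// Sketch: expose the server-side status object as one document of a
// "status" pseudo-collection, re-sent to the client periodically.
if (Meteor.isServer) {
  Meteor.publish("status", function () {
    var self = this;
    self.added("status", "main", status);
    var handle = Meteor.setInterval(function () {
      self.changed("status", "main", status);
    }, 500);
    self.ready();
    self.onStop(function () {
      Meteor.clearInterval(handle);
    });
  });
}

if (Meteor.isClient) {
  Status = new Meteor.Collection("status"); // client-side mirror
  Meteor.subscribe("status");
}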

Profiles display

To show the crawled profiles, I use a datatable.

[image: datatables]

Source inclusion

I simply include its CDN sources in the layout:
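The original tags were stripped from this page; they were presumably something like the following (version and paths are assumptions):

<link rel="stylesheet" href="//cdn.datatables.net/1.10.0/css/jquery.dataTables.min.css">
<script src="//cdn.datatables.net/1.10.0/js/jquery.dataTables.min.js"></script>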


Template
At first, I generated the profile items directly in the template. The original markup was stripped from this page; only the heading text “Crawled profiles ({{count}})” survives.
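Judging from the helpers below, it was presumably a Spacebars template iterating over a persons helper, along these lines (structure and field names are guesses):

<template name="persons">
  <h3>Crawled profiles ({{count}})</h3>
  <table class="table">
    {{#each persons}}
      <tr><td>{{firstName}}</td><td>{{lastName}}</td></tr>
    {{/each}}
  </table>
</template>

The {{count}} and {{persons}} helpers are linked to: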

if (Meteor.isClient) {
  Template.persons.persons = function () {
    return Persons.find();
  };
  Template.persons.count = function () {
    return Persons.find().count();
  };
}

Finally, to make the datatable faster and to keep the row-generating code in one place, I chose to simplify it by removing the row loop.

View the Datatable template.

Note: I removed the auto-publication using:

$ meteor remove autopublish

So, I need to publish my collection:

if (Meteor.isServer) {
  Meteor.publish('persons', function () {
    return Persons.find();
  });
}

I’m using the fnAddData function to add new rows to the datatable, both on start and on new insertions (I also use fnUpdate and fnDeleteRow).

I first define two template handlers:

– The first one initialises the datatable feature without dealing with the data.

View the Datatable initialisation handler.
The first column is hidden and is used to store the Mongo id (_id), so the row associated with a person can be found again thanks to the fnFindCellRowIndexes function.

– The second one handles the changes on the collection after the initialisation:

View the database change observer handler.
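The handler itself is censored; the classic pattern is to observe the cursor and map each event to the matching legacy DataTables call (a sketch; table stands for the dataTable instance created by the initialisation handler, personToRow for the row generator described below):

// Sketch: keep the datatable in sync with the Persons collection.
if (Meteor.isClient) {
  Persons.find().observe({
    added: function (person) {
      table.fnAddData(personToRow(person));
    },
    changed: function (person) {
      // the hidden first column stores _id, so the row can be found back
      var row = table.fnFindCellRowIndexes(person._id, 0)[0];
      table.fnUpdate(personToRow(person), row);
    },
    removed: function (person) {
      table.fnDeleteRow(table.fnFindCellRowIndexes(person._id, 0)[0]);
    }
  });
}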

The column data used in the datatable is generated from the Person object into an array of values.

View the Datatable row data generator.

Once the collection is ready, the datatable is initialised.

View the “on Collection ready” publisher.

View the “on Collection ready” subscriber.
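One way to express that in code is to initialise the table in the subscription’s onReady callback (a sketch; initDataTable stands for the initialisation handler above):

// Sketch: build the datatable only once the first batch of documents
// has arrived on the client.
if (Meteor.isClient) {
  Meteor.subscribe("persons", {
    onReady: function () {
      initDataTable();
    }
  });
}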

Conclusion

After a few days, the crawler works fine; I’ve run searches which saved 4000+ profiles.
In return, I got a nice increase in visits to my profile, and some invitations from interesting profiles (recruiters…).

[image: linkedin-result]

Limitation / Warning

This tool is a learning exercise; it is not intended to be used in any real way, and certainly not to steal data from LinkedIn or with any other evil intention.

Notice that, after I had reached about 2000 viewed profiles within the first day, my account was locked by LinkedIn for 24 hours because of the large number of pages viewed.

[image: 403_linkedin]

If you try this tool, it’s at your own risk; I cannot be held responsible for anything.

 

4 Responses

  1. Pensy

    Very useful ! Thanks ^^

  2. Jeremy Thille

    Too bad (but predictable) LinkedIn made you censor some bits of it 🙁 Awesome stuff, congrats

  3. jbdemonte

    Sure, it was predictable… but they were faster than I was expecting

  4. Kyle Hailey

    That’s very cool. I’ve been wanting something like this for a couple of years to get the results you found at the end of your article. It was also interesting to see that they locked your account and sad to see they ask you to pull the code. 🙁
    thanks for the info though!

