Importing CSV Files into WordPress: Preventing Timeouts

This is part one of a series on how to import CSV files into WordPress.

This is one of those sets of posts that I hesitated to write because I don’t know how applicable it really is, since it deals with how to achieve something at an abstract level without using any real code examples.

And it’s not that I think talking about algorithms or procedures without code is bad (we all use pseudo-code, right?); I’m just not sure whether this is the kind of series that will actually help someone.

But I can’t be the only one who’s encountered these issues, so I’ll give it a shot. And this is going to be spread out over a few posts.

One of the more common tasks you’ll see in web application development is importing data from some kind of file (often a CSV, though other formats are becoming more popular). A challenge with importing data via CSV – especially a large data set – is giving the user visual feedback on how the process is going.

And when it comes to importing CSV files into WordPress, the same rules apply. Importing large data sets and relying on the API to take care of, say, creating posts, associating taxonomies, and adding media can be a time-consuming process.

But there are some strategies that we can employ that will help to ensure:

  1. There won’t be any PHP timeouts
  2. The importer gives visual cues without blocking the page load
  3. The functionality cleans up after itself when it’s done.

Though there are a number of ways to attack this particular problem, here’s how I’ve been doing it, as of late.

Importing CSV Files into WordPress

Obviously, you’d be importing files when you have data in CSV format that needs to be parsed, processed, and stored in the WordPress database.

There are a lot of things to handle during this process, such as error checking, sanitization, and all of that jazz, but all of that is outside the scope of this particular series. Instead, this is more about breaking down the process by which you can efficiently import data, give user feedback, and clean up the residual data when done.

One thing to note is that some imports are quicker than others – that is, perhaps the file that’s being imported is small, the data that has to be added is small, and/or perhaps there are very few API calls and/or queries that have to be made while the import process is running.

But assume you’re given a CSV file that contains hundreds (or even thousands) of rows of data, each of which corresponds to a custom post type with its own set of custom terms for custom taxonomies.

There’s a lot to handle. You want to check…

  • Titles (or another property) to see if the post already exists, and skip it if it does
  • Taxonomies and terms to see if they already exist or need to be created
  • That you’re appending terms rather than overwriting what may already exist on a post that’s already there (there’s a rough sketch of these checks just after this list)
  • …and so on.
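To make those checks a bit more concrete, here’s a rough, hypothetical sketch of what they might look like against the WordPress API. The post type, taxonomy, and row keys are made up for illustration; this isn’t the importer’s actual code.

// Hypothetical per-row checks; 'acme_record', 'acme_category', and the
// $row keys are placeholders used only for this illustration.
function acme_import_row( array $row ) {

    // Check the title to see if the post already exists; reuse it if so.
    $existing = get_page_by_title( $row['title'], OBJECT, 'acme_record' );
    $post_id  = ( null !== $existing )
        ? $existing->ID
        : wp_insert_post( array(
            'post_title'  => $row['title'],
            'post_type'   => 'acme_record',
            'post_status' => 'publish',
        ) );

    // Check whether the term already exists or needs to be created.
    if ( ! term_exists( $row['category'], 'acme_category' ) ) {
        wp_insert_term( $row['category'], 'acme_category' );
    }

    // Append the term (fourth argument true) rather than overwriting
    // whatever terms an existing post may already have.
    wp_set_object_terms( $post_id, $row['category'], 'acme_category', true );
}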

This can be a time-consuming process not just from a programmatic standpoint, but from actually importing the data. The reason this can be a problem is that:

  • The user selects a file to upload
  • They click on the ‘Upload’ button
  • The process kicks off and the page blocks until the process is complete
  • There may or may not be a timeout on the server

This means that the user has very little feedback as to what’s happening, and the page continues to load for minutes while the data is processing or even stops execution because the server terminates the script.

Bummer.

But there are a few things we can do – the ones I mentioned above – to tackle this problem: mitigate any potential PHP timeouts and display progress to the user.

Doing this requires a bit of Ajax, though, so if you’re not familiar with how to handle Ajax in WordPress, I recommend checking out the associated Codex article. Though I won’t be covering it in this article, it’ll be necessary in the next few.

1. Preventing PHP Timeouts

Every web server has some period of time after which it terminates the execution of a PHP script. On localhost, it’s really easy to get lazy and set an extraordinarily high limit, but you don’t always have that control on the servers to which you’re deploying your work.

To address the issue of PHP timeouts, the first thing that I’ve found useful is to split the incoming CSV into a number of smaller files. You can do this a number of ways, but for the sake of this article, I figured 10 rows per file is easy enough.

First, we need to determine how many files we need to write:

  1. Read the contents of the file and `explode` each line into an index of an array.
  2. Count the total number of rows and store it in a variable.
  3. Store the title row (since it’s needed at the top of each file) in its own variable.
  4. Divide the total number of rows by 10 to determine how many files are needed.
  5. Use the modulus operator to determine how many rows will be written to the final file (since the total isn’t likely to be evenly divisible by 10).

Then we need to actually write the files, so:

  1. For each full group of 10 rows, create a file that contains the title row followed by those 10 rows of data from the CSV.
  2. Write the remaining rows into a final file, prepending it with the title row.

You can name the files whatever you like, but I tend to be partial to suffixing each file with _n, where n is the file’s position in the sequence of files that will ultimately be written out.

So an example file may be acme_csv_11.csv.

Here’s a non-production-ready, PHP-based example. Again, this is just for demonstration purposes. I wouldn’t recommend copying, pasting, or even using this exactly as written.
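(The embedded snippet isn’t reproduced here, so below is a rough sketch of the splitting step instead. Variable names like $title_row, $rows_of_ten, and $remaining_row mirror those referenced in the comments below, but this is illustrative only and folds in the fixes discussed there; the file names are placeholders.)

// A sketch only: read the CSV, work out how many files are needed, and
// write the partials. 'acme.csv' and the naming scheme are placeholders.
$csv  = file_get_contents( 'acme.csv' );          // read the whole file as a string
$rows = explode( "\n", trim( $csv ) );            // one line per array index

$title_row     = array_shift( $rows );            // header row for the top of every partial
$total_rows    = count( $rows );
$rows_of_ten   = (int) floor( $total_rows / 10 ); // number of full, 10-row files
$remaining_row = $total_rows % 10;                // rows left over for the final file

$total_files = $rows_of_ten + ( 0 < $remaining_row ? 1 : 0 );
$row_index   = 0;

for ( $file_number = 0; $file_number < $total_files; $file_number++ ) {

    // Title row first, with a trailing newline so the data starts on its own line.
    $csv_data = $title_row . "\n";

    // Ten rows per file, fewer for the last one.
    $rows_in_file = ( $file_number === $rows_of_ten ) ? $remaining_row : 10;
    for ( $i = 0; $i < $rows_in_file; $i++ ) {
        $csv_data .= $rows[ $row_index++ ] . "\n";
    }

    // e.g. acme_csv_1.csv, acme_csv_2.csv, ...
    file_put_contents( 'acme_csv_' . ( $file_number + 1 ) . '.csv', $csv_data );
}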

Having multiple small files improves a number of things:

  • It makes it far less likely for PHP timeouts to happen because the script is executing on a smaller file
  • It makes it easy to compare data from what’s in the file to what’s being imported
  • It makes it possible to display a progress bar (via Ajax) to give the user feedback.

Not bad, right?

More To Come

Over the next few posts, I’ll be walking through the rest of the steps – importing the files, displaying progress, and cleaning up after yourself – but this first post is simply to build a case for why dividing up large files matters and how it can mitigate timeouts.

23 Comments

  1. anuj

    Same boat as you. Frustrated at not having a single CSV import across my dozen projects that I can say is scalable and handles all situations.

    Eager to see the following parts. Will send across our flow for your reference as well.

    • Tom

      Hopefully what I’m working through in this series will help!

  2. Pablo

    That’s really good advice. I’ve done some CSV imports into WordPress, but I always thought my approach wasn’t ready for high volumes of data. Looking forward to your next posts!

    (I’ve been reading your blog for a while, I love the articles and all the research you’re doing, thanks!)

  3. Ihor Vorotnov

    Hey Tom,
    I would use file() instead of file_get_contents() – it returns an array and you don’t need to explode() the string.
    I’m currently figuring out how to safely import a huge (45k rows) json file. Each row (line) should create a user, add some usermeta , a custom post type post (3 posts in different languages), custom taxonomy terms. And, of course, do a lot of checks for term_exists etc, cyrillic title transliteration and sending email for newly created users. I’m definitely interested in that ‘displaying progress’ part :) As for now, I’m going to create a separate php script which will receive a single json object and perform all tasks for one user and its stuff. From my plugin page it will be called via ajax, so it will return info for every single entry. What I’m not sure about is how to iterate over that 45k json objects. Splitting into multiple files sounds like an idea that will work!

    • Tom

      I would use file() instead of file_get_contents() – it returns an array and you don’t need to explode() the string.

      The main reason I use file_get_contents over file is because of how file handles line endings. That is, file will parse end-of-line characters, which can cause problems with files coming from other operating systems (regardless of the one you’re on), whereas file_get_contents reads everything into a string.

      Each row (line) should create a user, add some usermeta , a custom post type post (3 posts in different languages), custom taxonomy terms. And, of course, do a lot of checks for term_exists etc, cyrillic title transliteration and sending email for newly created users.

      Wow – impressive … and time-consuming ;). But done right, at least, PHP and the WordPress API make this pretty straightforward.

      What I’m not sure about is how to iterate over that 45k json objects. Splitting into multiple files sounds like an idea that will work!

      This should be trivial, but for what it’s worth, I’d use Ajax and a progress indicator to show even this step as it’s happening. In my work, right now at least, I haven’t been showing that because the file generation is so fast.

      But at 45k rows, that’s just a “Step 1” of many, many more steps :).

      • Ihor Vorotnov

        Yep, step 1, I agree. That’s a json export file from MongoDB, from an old version of the site which was written in Python and used MongoDB. Currently investigating if it makes sense to use the WP JSON API to handle creation of users and the content related to it. Maybe it will just be faster to tweak endpoints for my needs. The only question is multilingual content. While working on this project I faced all the problems WordPress has with multilingual content. The biggest problem is the requirement to have unique slugs for posts, pages, taxonomies and other objects. The WPML plugin makes it possible to have pages and posts with the same slug in different languages, but not taxonomies. Trying to find any way to hack it without touching WP core.

        Regarding line endings and the file() function – yes, I agree. If you don’t have complete control over the file, you have to rely on the safer function.

  4. Firdaus Zahari

    Interesting post! It is also worth noting that PHP provides native functions that make it easier to work with .csv files:

    http://php.net/manual/en/function.fgetcsv.php
    http://php.net/manual/en/function.fputcsv.php

    • Tom

      Yes – definitely worth mentioning that!

      The main reason that I used a different approach is because a CSV could have other delimited values between commas. So, for example, say I have three columns (one, two, and three) and column two supports multiple values. Then I have:

      Tom, Value 1 | Value 2 | Value 3, McFarlin

      Where the value for column two is a pipe-delimited value that will require additional parsing.
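      For illustration only, that extra parsing might look something like this (a minimal sketch, not code from the importer):

      // Split the row on commas first, then split the pipe-delimited second
      // column into its individual values.
      $line    = 'Tom, Value 1 | Value 2 | Value 3, McFarlin';
      $columns = array_map( 'trim', explode( ',', $line ) );       // 'Tom', 'Value 1 | Value 2 | Value 3', 'McFarlin'
      $values  = array_map( 'trim', explode( '|', $columns[1] ) ); // 'Value 1', 'Value 2', 'Value 3'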

  5. Marshall

    Try the WP CSV plugin by CPK Web Solutions. He does not maintain the plugin anymore, but it has handled anything I have used it for, for both export and import.
    I suggest you do an export on any existing database to get the structure, then make your CSV match that structure.
    This plugin has handled posts, pages, and custom products too.
    And it will handle large files too. You may have to up the file size and timeout in your php.ini file first.

  6. Spot

    Thanks for sharing your ideas on this matter. It helped me a lot in getting my custom WP eCommerce CSV product import routine to work.

    Now I can import 20K-plus records without a timeout, and I’ve added a progress bar too.

    • Tom

      That’s great! Love hearing that.

    • khan

      Hey, can anyone give me direction on how to automatically load each file without hitting a timeout?

      After dividing the CSV into multiple files I got more than 250 files. Now if I use Ajax, how do I iterate file by file? Isn’t it the same thing?

  7. Pete

    Do you have a suggestion as for to save the CSV partials to? Or is any ol’ location fair game?

    • Pete

      Oops, typo. Should have read “Do you have a suggestion as for the location to save the CSV partials to?”

      • Tom

        Anything is fair game, imho, since they are going to be deleted when the process has completed. Overall, it depends on the gig, but I normally just write them to a temporary directory inside of where the plugin resides.

        Some people will criticize this, but I’ve yet to run into any issues in doing this.

  8. Pete

    Using this code, my CSV parser was skipping rows. I ended up finding out that it was because in all of the files except the first, there wasn’t a proper line break between the title row and the data rows. Made the following change to fix it:

    Changed:
    $csv_data = $title_row;

    To:
    $csv_data = $title_row . "\n";

    • Tom

      Thanks for sharing what you did – YMMV based on how the CSV is formatted, so it’s good to have that here :).

      • Pete

        Sure thing! Thanks, by the way, for putting these posts together! I’m working on a project that needed this, and your posts have been super helpful.

        I did notice a couple other things worth mentioning, for the benefit of future readers:

        1) Your code snippet ACTUALLY puts 11 rows into each file, which means that the number of files based on $rows_of_ten isn’t accurate (and the code may save out a few blank CSV partials at the end). To fix, just:

        Change:
        for( $i = 0; $i <= 10; $i++ ) {
        To:
        for( $i = 0; $i < 10; $i++ ) {

        2) With the code changes above, the remainder from our $rows_of_ten is cropped off. The number of rows that we have leftover has been saved in $remaining_row, but it isn't currently being used for anything. The following changes will address that issue:

        Change:
        while( 0 < $rows_of_ten ) {
        To:
        while( 0 <= $rows_of_ten ) {

        Add this before the row loop (i.e. "for( $i = 0; $i < 10; $i++ ) {"):
        // Calculate number of loops for this file
        $loops = ( $rows_of_ten == 0 && $remaining_row != 0 ) ? $remaining_row + 1 : 10 ;

        Change:
        for( $i = 0; $i < 10; $i++ ) {
        To:
        for( $i = 0; $i < $loops; $i++ ) {

        Voila! That should take care of it!

  9. John

    Missing rows depending on what number I chunk by…

    The CSV file I’m parsing has 1038 rows plus a header, so a total of 1039. It seems that when I have it set to 10, even including Pete’s instructions, different rows end up missing, but for some reason the number 43 works fine? Also, when I chunk by 86, no rows are missing either.

    • Tom

      This sounds like you need to use the modulus operator. Since you have 1038 rows, dividing it by 10 is going to leave a remainder.

      So 1038 / 10 will give you the number of rows of 10, then 1038 % 10 will give you the number of remaining rows.

      From there, you should be able to iterate through the list and then process the numbers as you please.
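      As a quick sketch of that arithmetic:

      $total_rows    = 1038;
      $rows_of_ten   = (int) floor( $total_rows / 10 ); // 103 full files of 10 rows
      $remaining_row = $total_rows % 10;                // 8 rows left for the final file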

  10. Chris

    Great series. I am trying to better understand what production-ready means. I’m not looking for code but more the types of questions I should be asking. Using the CSV example, does production worthiness mean separating this code into smaller, testable functions and, as mentioned in the article, adding the error checking and sanitization? Are there other considerations? Any help would be most appreciated.

    • Tom

      I am trying to better understand what production ready means

      There are generally three environments in which code is developed:

      1. Development refers to the local machine on which the code is developed
      2. Staging refers to the place where you, the client, and automated tests can try out the features and projects as they’ve undergone development
      3. Production refers to the live environment in which the code runs — it’s generally hands off and once code passes Staging then it’s deployed to Production normally using some type of deployment tool that’s connected to a source control system.

      So to have something production-ready means that it’s ready to be deployed from development and testing to the main environment.

      does production worthiness mean separating this code into smaller testable functions and as mentioned in the article adding the error checking and sanitization?

      Not necessarily – I mean, writing smaller, testable functions is something that should be done during development, as should error checking and sanitization.

      Hope this helps!
