Recent interviews and presentations

CryptPad and the team have received some attention recently through various channels: a Reddit AMA thread, a podcast, a conference presentation, and a blog interview. Whether you prefer to read, watch, or listen, there is a way for you to get an up-to-date presentation of CryptPad.

Aaron MacSween was invited for an Ask Me Anything (AMA) thread on r/privacy on Reddit. The thread started on November 27th 2020 and lasted for a couple of days. 98 questions were raised, covering everything from CryptPad’s business model and future development plans to the broader state of open-source software and the EU’s recent debates around encryption.

Aaron was also a guest on the We Don’t Stream podcast in an episode titled Imagine Google Docs but without the spyware. This was a general presentation of CryptPad, the origin story and the motivations behind it. Each guest is asked to nominate an NGO to encourage donations. Aaron chose the Electronic Frontier Foundation so please consider supporting them.

Ludovic Dubost was interviewed on Website Planet, answering questions about XWiki and CryptPad. The interview covers the founding of XWiki, the effects of the COVID crisis on the company and CryptPad, as well as the future of the open-source industry.

Ludovic also presented CryptPad at esLibre 2020, a Spanish open-source conference organised this year by King Juan Carlos University. Like many others this year, it was a “virtual” event; the video is available on our PeerTube channel. Ludovic covered the background of the project and gave a demo-tour of CryptPad.

The outage of December 8th, 2020 - a postmortem

On December 8th, 2020 a malfunction in the water-cooling system at a data-center in Roubaix, France caused an unrecoverable error in the physical machine which hosts CryptPad.fr. The service was unavailable for approximately 27 hours while we diagnosed a corrupted OS, provisioned a new server, and migrated user data to the new system.

What happened

Our team works remotely across three different timezones, so for the sake of simplicity I’ll summarize the timeline of the service outage in Central European Time (CET) using a 24-hour clock.

December 8, 05:30 - Server update deployed

I occasionally start my working days very early in the morning, when we have as few users connected as possible. On days when a server restart is necessary, I do it at this time to minimize the number of active users who might be inconvenienced by the momentary service interruption.

During the process of deploying a minor patch to optimize how the service loads and evicts document metadata I noticed that our monthly full-disk backup was running. We run a less intensive incremental backup on a daily basis, but having a regular full-disk backup ensures that restoration does not become increasingly difficult over time.

We’ve been breaking our records for the highest ever number of concurrent users on CryptPad.fr on a regular basis, so I tend to pay close attention to how our server is performing and how small changes in our code affect its performance.

A record number of concurrently connected clients, as reported by our admin panel

My colleagues hadn’t started their shifts yet, but I left a message informing them that the server would probably be under more load than usual. We were considering also deploying an update to our client code, but we usually avoid doing so on backup days to help ensure that things go smoothly.

Me jinxing our server for the day

December 8, 12:42 - CryptPad goes down

CryptPad is developed by a company called XWiki. We host everything using virtual machines provisioned on the same dedicated physical servers as the rest of the company’s infrastructure. Performance metrics and monitoring for our other sites suggest that the host machine did not fail instantly; rather, its performance degraded over a relatively short period of time. Some services failed sooner than others as the host system tried to de-prioritize less critical systems.

The last line in our server’s log was written to the disk at 12:42. I was away from my desk to eat a meal, so I didn’t notice that anything had gone wrong. I returned to find some user reports sent at around 13:00 that the service had been unavailable for some time. We occasionally receive such reports that turn out to be user error (typos in URLs or DNS problems), but in this case it was easy to confirm as a systemic problem since my CryptPad tabs were also disconnected.

My first assumption was that an error in the code I’d deployed earlier in the day had caused the server to crash. I tried to log into our servers from my terminal with no success, then tried to ping the server’s domain name, then its raw IP address, at which point I realized that the machine was completely powered down or otherwise unreachable.

The last time we were surprised with this kind of outage was in November 2017 when a power outage and a generator failure took several data-centers completely offline. That outage lasted 3.5 hours, which seemed very bad at the time, but I was expecting something similar.

December 8, 13:35 - Infra is on it

Since the same physical infrastructure hosts a large number of sites the outage had been noticed by many of our company’s employees almost instantly. We have lots of monitoring in place to send warnings when things are performing poorly (or not at all), but I learned via our company’s internal chat service that at least one of our physical servers had had a critical failure and that Kevin (our resident infra expert) was working on it. CryptPad’s track record for uptime until this point was very good, and most of that was due to Kevin, so I tried to leave him alone so he could focus on diagnosing and possibly fixing the problem. Since there didn’t seem to be anything I could do on that front to help the situation I started to respond to the related GitHub issues and messages in our public chat channel to inform our users what was going on.

At this point I also noticed several messages from my colleagues congratulating me on my five-year anniversary at the company. I’d forgotten the date, and grateful as I was for the wishes, this wasn’t how I’d expected to celebrate the milestone.

December 8, 15:00 - Host machine comes back online

By 15:00 the physical server that had gone offline had been powered back up. This meant that VMware (the software we use to host many virtual machines on one very powerful machine) was running again, though some more work was needed to bring many of its hosted VMs back online. Kevin immediately began running a range of integrity checks to confirm that the hardware was functioning correctly before relaunching services. Some VMs that required fewer resources were able to be re-launched very quickly, but CryptPad requires more storage than most of the wikis we host, and disk checks tend to require more time than other diagnostics.

At 15:40 these disk integrity checks were interrupted when one of the data-center technicians (who I’m sure was also having a bad day) had to take the server back offline to transfer our hardware to a new location in the same building. Access was restored just a few minutes later, but we had to restart our integrity checks.

December 8, 16:20 - First disk integrity check completes

Forty minutes after the manual intervention, the first of three disk checks had completed. VMware was reporting that all systems were operational; however, the VM that usually hosts our API server was failing to boot. Kevin was able to launch a Debian rescue system from a live disk and mount the system for inspection, but there was still no obvious indication of why the system wouldn’t boot. He proceeded to launch checks for the remaining two disks while he continued to search for the cause of the failure.

December 8, 18:30 - Initial failure traced to a cooling malfunction

Throughout the duration of this downtime Kevin had been on and off the phone with the data-center technicians getting updates about what had happened and whether we should expect any further problems. By 18:30 we were informed of the cooling system’s malfunction. While it was somewhat comforting to know that the problem had nothing to do with code we’d written, it was also frustrating to be reminded that there will likely always be physical events like this that we can neither control nor predict.

As Twitter user @RimaSghaier noted, the internet is still very physical.

December 8, 19:30 - File transfer commences

By 19:30, between myself, Kevin, Ludovic (the company’s CEO), and one of Ludovic’s friends who has more experience with the intricacies of bootable filesystems, we’d made no progress diagnosing why the affected VM would not boot outside of the rescue disk environment. We had access to all of the system’s files and all of the integrity checks had passed, but there seemed to be a problem with the root filesystem. We decided that the safest thing to do was to provision a new VM and begin transferring the relevant files. We could interrupt the process if we discovered the reason for the failure, but it was already late in the day and we had no promising leads.

It took only a few minutes to provision a nearly identical VM and we immediately began transferring files via the data-center’s internal network. Unfortunately, there was around 750GB of data to transfer at a variable rate that did not seem very promising.

Until this point I’d been very hopeful that at any minute we would find some trick to get the original server back online. As it became increasingly apparent that this was unlikely and that we’d need to wait for the file transfer to finish, we shifted our focus to damage control.

The API server that hosts our database and Node.js server had been offline, but we actually serve our static assets (HTML, JavaScript, etc.) from a different machine that had stayed online. I’d been distracted by the actual system outage and hadn’t thought to update our front-end to inform all of our users about what was going on, though I had been posting to our Mastodon and Twitter accounts.

I hacked together and deployed some very basic HTML as quickly as I could, explaining what was happening and directing users to our social media for updates. This was deployed by 19:43.

CryptPad's down page

December 8, 20:00 - I try to get some sleep

Finally, after about 7 hours of downtime and a 14.5 hour shift on my part, we left the servers alone to continue their work and decided to get some rest for the following day. We expected the file transfer to take at least 10 more hours to complete, so I set my alarm for the following morning and called it a day.

December 9, 16:14 - Service restored

December 9th was not especially eventful. I spent most of the day idly monitoring the progress of the network file transfer. I was far too distracted to be productive with anything else, and anyway it seemed prudent to save my energy for when the transfer completed.

By about 13:30 the transfer was 90% complete and I began to pre-configure as much as possible on the new system so that we could bring everything back up as quickly as possible. I prepared and reviewed a list of final tasks with Kevin and Yann in the final 30 minutes of the transfer, and we started working as soon as it finished.

We were able to complete the system’s setup in around 20 minutes, including a last-minute configuration fix to restrict the service to our IP addresses before we launched it. This restriction allowed us to access CryptPad as normal before anyone else. We took about ten minutes to test the platform, loading any documents we’d been editing leading up to the crash and confirming that everything was behaving as expected.

Finally, by 16:14, after a bit more than 27 hours of downtime, we lifted the IP address restriction and took down the downtime notice I’d deployed the evening before.

Difficulties and lessons learned

I’ll start by saying in very simple terms that this experience sucked. I know it was very frustrating for our users who couldn’t access their documents while the server was offline. I certainly had a terrible two shifts. It was stressful for everyone on our team, and I suspect it was similarly unpleasant for the data-center technicians as well.

It should be more obvious given the root meaning of the word internet, but we all depend on many systems functioning to maintain our daily routines. The majority of our users only contact us to report bugs. Kevin and I mostly end up chatting when one of us notices irregular server behaviour. We only contact OVH when our servers have problems, and they probably don’t deal too much with their municipal electricity and fuel providers except when their power goes out and they fall back to using generators. We are most aware of the systems that sustain us when they break.

On a positive note, though, I was pleasantly surprised by how understanding people were about the situation. One of our paying users cancelled their subscription, but it seems the outage served to remind many people that there are humans working on this project, and so we’ve actually seen an increase in the rate of donations and subscriptions in the week since. We greatly appreciate everyone’s generosity!

Some users seem to have understandably lost some confidence in our platform, as we’ve seen slightly fewer users at the usual peak hours (2700 concurrent connections instead of 3000). On the other hand, it seems like the downtime page led to a significant increase in our follower count on social media.

Many of our users rely on CryptPad as a persistent home for their documents, and in these cases downtime is very inconvenient. During the outage, however, I learned about this software which randomly redirects users to publicly hosted instances of open-source software platforms. If you use CryptPad as a place to collaborate rather than a place to store documents, then you could try cryptpad.random-redirect.de to find alternatives. If you host a CryptPad instance you could even inquire about adding your server to the list. One of the great things about open-source software is that failures that affect one server or service do not need to have global effects.

Despite the positive aspects of our community’s response to this event, I regret that it took so long to migrate to a new machine. The simple fact is that while we (mostly Kevin and Ludovic) have put a lot of effort into making sure that our hosting infrastructure is reliable, we were unprepared for the task of rapidly migrating our entire database to a new machine. We’re hosting about six times more data now than we were at the start of the year. Until now we’ve had little cause to consider the increasing difficulty of managing this growing dataset, and with everything else that has happened this year there has been little opportunity to do so. This event made it abundantly clear that we’re going to have to find the required time.

What we plan to do

It would be an understatement to say that I have a bit of an idealist stance when it comes to software. This is why I work on open-source, privacy-preserving tech. It’s terrible that modern, web-based software is as fragile as it is. That said, the alternative of emailing static documents to colleagues (or between devices) also makes it difficult to be productive.

It’s a bit ridiculous that a broken cooling system in northern France can mean that our 20,000 daily active users lose the ability to edit or even read their documents for more than a day. More frustrating is the fact that we were very nearly in a good position to mitigate many of the adverse effects of this outage. We’ve been working on some new offline-first functionality in CryptPad over the last few months and, as noted above, we were considering deploying the first phase of these improvements the day of the outage.

Our first offline features were deployed yesterday as a part of our 3.25.0 release. Now, every time you load a document in CryptPad you’re also populating an advanced cache in your browser. For now this only has the effect of reducing the total time to load cached documents, since we still wait for confirmation from the server that this is the most recent version of the content before removing the loading screen.
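To make that behaviour concrete, here is a rough sketch of a “load from the cache, then confirm with the server” flow. It is only an illustration under assumed names (localCache, fetchNewer, applyHistory and so on are hypothetical), not CryptPad’s actual client code.

```typescript
// A hypothetical sketch of a "load from the cache, then confirm with the
// server" flow; none of these identifiers are CryptPad's real APIs.

type Snapshot = { lastKnownHash: string; messages: string[] };

declare const localCache: {
    get(docId: string): Promise<Snapshot | undefined>;
    put(docId: string, snap: Snapshot): Promise<void>;
};
declare function applyHistory(messages: string[]): void;
declare function fetchNewer(docId: string, sinceHash?: string): Promise<Snapshot>;
declare function hideLoadingScreen(): void;

async function loadDocument(docId: string): Promise<void> {
    // 1. Apply whatever encrypted history was stored locally on a previous visit.
    const cached = await localCache.get(docId);
    if (cached) {
        applyHistory(cached.messages);
    }

    // 2. Ask the server only for messages newer than the cached copy.
    const newer = await fetchNewer(docId, cached?.lastKnownHash);
    applyHistory(newer.messages);

    // 3. Only once the server has confirmed that we hold the latest version
    //    do we refresh the cache and remove the loading screen.
    await localCache.put(docId, {
        lastKnownHash: newer.lastKnownHash,
        messages: [...(cached?.messages ?? []), ...newer.messages],
    });
    hideLoadingScreen();
}
```

The key point is step 3: the cached copy makes loading faster, but the loading screen only disappears once the server has confirmed that nothing newer exists.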

Our next step will be to merge a branch of our code which will instead load and display the last known state of any document in your local cache, regardless of whether you’re able to reach our database server. This would have alleviated some of the inconvenience of our outage: users were still able to load the platform’s HTML and JavaScript, which would at least have let them access their cached documents.

The next major feature will be the use of service workers, which let browsers apply very advanced caching policies and load our client-side code even while entirely offline, allowing full access to cached documents under almost any circumstance. We expect to deploy these updates in early January 2021 as a part of our upcoming 4.0.0 release.
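As a rough illustration of what service workers make possible, the sketch below pre-caches a hypothetical list of static assets at install time and answers requests from the cache before falling back to the network. The paths and cache name are made up, and this is generic Service Worker API usage rather than CryptPad’s actual implementation.

```typescript
// sw.ts - a minimal cache-first service worker sketch; the asset list and
// cache name are hypothetical, and this is generic Service Worker API usage.

const CACHE_NAME = 'static-assets-v1';
const ASSETS = ['/', '/index.html', '/main.js', '/style.css'];

self.addEventListener('install', (event: any) => {
    // Pre-cache the client-side code so it can be served while offline.
    event.waitUntil(
        caches.open(CACHE_NAME).then((cache) => cache.addAll(ASSETS))
    );
});

self.addEventListener('fetch', (event: any) => {
    // Answer from the cache first, then fall back to the network, so the
    // application shell loads even when the server is unreachable.
    event.respondWith(
        caches.match(event.request).then((hit) => hit || fetch(event.request))
    );
});
```

With the application shell cached this way, the browser can still render the interface and open locally cached documents even when the API server is completely unreachable.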

One of my favourite academic papers defines the term gray failures, in which well-intentioned attempts to introduce redundancy into online systems can paradoxically increase the likelihood of service degradation or interruptions. In the last few weeks both Amazon and Google (some of the richest companies on the planet, in case you haven’t heard of them) have experienced severe service outages. There are very few easy answers in this area, but we’re going to learn from this situation and work on solutions that would at least have let us recover more quickly.

If the next data-center failure happens in another three years I hope it will only last a small fraction of the time, and that our software will be so resilient you’ll hardly notice. In the meantime our team greatly appreciates all your support!

Looking back, looking forward

It was four years ago that I first started working full-time on CryptPad. At that point fewer than 10 people used the service on a weekly basis. Our development team was included in that count, often multiple times, since we visited from both our office and our homes.

In those early days the platform was much more of a toy than a tool. There was no CryptDrive for storing documents, no login, no markdown rendering, file upload, kanban, or whiteboard. It was the first year of a four-year research project in which we were responsible for building a variety of collaborative editors. We mostly used CryptPad to prototype new technologies before committing to a much more complex integration into the larger project. Nobody insisted that our editors include the extra privacy features we designed, yet among our small team we definitely hoped they would catch on.

We knew that as long as we produced viable editors and passed our project’s yearly reviews we didn’t have to worry about our jobs. It felt like we were supposed to take risks and we certainly did. The stakes were low. Sometimes if we wanted to test the platform together we’d just push our code to our production server. Occasionally we’d edit files directly on the server to cut out additional steps. We did our work as quickly as we could without having to worry about the consequences because nobody was relying on us for their safety.

It was an exciting time.

Two thousand and nineteen

Our situation today is drastically different. Privacy is very much in the public eye, although the news is more often bad than good. In any case, instead of ten weekly visitors CryptPad.fr now supports more than ten thousand.

Many of those that trust us to protect their information have no cause to use our service other than the very reasonable expectation that nobody will access their content without their consent. We’re pleased to be able to offer this peace of mind and we appreciate that we need this demographic and its expectations to become the norm if those with more extreme requirements are to blend in with the crowd. As the saying goes: privacy is a team sport.

As proud as I am of the project’s advancement since our humble beginnings, I still feel as though we’ve been playing this sport defensively in these last 365 days. We began the year with the knowledge that our stable funding was about to dry up and that our efforts to sustain the project via subscriptions and donations were not going to be enough. At the same time, increasingly more of our time was occupied just keeping up with regular issues: answering emails, fixing bugs, and managing a progressively more complex codebase. Meanwhile, we had to consider the effects of every change on those users whose physical safety occasionally depends on their privacy.

Fortunately for us and our community we received an enormous amount of support from Europe’s Next Generation Internet initiative, both in terms of publicity through the presentation of an NGI award and monetary contributions through the NLnet PET grant program. We’ve still had to cope with an endless stream of feature requests and correspondences, but the funding definitely addressed our existential worries for a time.

In the course of our CryptPad Teams project we struggled to balance all the responsibilities of our position and as a result it’s taken somewhat longer to complete the project than we planned. I now have a better appreciation of how much easier a project can appear in its planning compared to its execution. The opportunity to go slightly over budget on a small project has been a welcome learning experience that I hope not to repeat.

Looking forward

At this stage in our project it isn’t enough for our team to try to keep up with tickets on our issue tracker. Reactive decisions won’t make our project sustainable, nor will they effectively serve the community that has helped us get this far. That’s why in 2020 we’re going to focus on project governance and providing a cohesive vision, with the hope of getting more of our stakeholders directly involved in its success.

I spent a large part of this holiday season making small changes to make it easier to correctly configure a CryptPad instance. Starting in January we’re going to continue this effort to support the 300 independent instance administrators with a radical overhaul of our documentation, along with simplified guides for users and more detailed guides for contributors.

Our immediate roadmap will also feature further development of our admin panel to ensure that community instances can be governed by team members lacking advanced technological expertise. Beyond that we’re looking forward to some big improvements to the tools that are most essential to effectively coordinate distributed groups of people, namely our rich text, spreadsheet, and kanban apps.

There’s still a lot of work we can do to improve the social integrations first proposed in our Teams project. We’ll continue to streamline the process of onboarding new team members and add in some even more advanced controls for very sensitive data.

I’ve been hesitant to commit to development time that doesn’t yet have a source of funding, but in the coming year I hope to be able to deliver an improved experience for users of mobile and touch-enabled devices.

How you can help

Privacy should not be a luxury item. CryptPad has been built largely with public money and we’re committed to continuing its development as a public good. Continued monetary contributions via donations enable us to offer our services to users regardless of whether they can contribute themselves.

Along with subscriptions to our platform, our independent revenue helps to finance all the minor tasks that don’t easily fit into the narrative of a successful grant proposal. Every cent of these revenue streams goes back into development, and we do our best to get the most value out of your contributions.

There are, of course, many other ways you can contribute. Any publicity you can generate will free us to spend less time marketing and more time improving the software and its documentation. Sharing our messages on social media with your followers helps a lot, so please follow us on the Fediverse and Twitter. We especially appreciate personal messages that tell the world exactly what it is you love about CryptPad.

We’re also happy to support and publicize offline events promoting the project. If you’re comfortable speaking in public and would like to represent us in your community, feel free to contact us and we’ll see how we can help.

As we produce more documentation we’ll also need help reviewing it and keeping it up to date. Every little bit helps, whether it’s a whole page or a single corrected line. Finally, we welcome any efforts to translate CryptPad into a new language or to help those already working on our existing translations.

Wishes for 2020

I made a deliberate choice in naming the most recent cycle of releases after extinct animals. We are living through a major extinction event and a growing list of crises. More than ever we need a hopeful vision of the future.

I’m personally grateful for the opportunity to offer tools to support these endeavors.

Embrace private spaces.
Connect with those around you.
Organize and build a better future together.

See you in 2020!

Yesterday I made a mess

Normally when I write a blog post it’s because I have exciting news to share. This time it’s not a fun occasion because the only good news I have is that the bad news isn’t permanent.

The bad news is that during some database maintenance yesterday (June 13th) I accidentally removed some of the data from users’ encrypted drives. The good news is that these files were archived, not deleted, and that everything can be recovered.

Before I get into the details of why this happened I’d like to clarify which user data was archived and how to check if your account was one of those affected.

How to tell if you were affected

First off, everything is related to my actions administering the database of CryptPad.fr. Users of other instances have nothing to worry about unless their administrator did exactly the same thing as I did, which is unlikely.

Secondly, the issue is limited to shared folders and non-owned files contained within them. If you don’t use shared folders you won’t be affected.

Thirdly, as far as we can tell you need to have visited CryptPad.fr between May 28th and June 13th in order to have run some incorrect code.

Finally, nothing was archived unless it had been inactive for the preceding 90 days. In the case of shared folders, activity means any change to the content or structure of the folder, such as adding or removing a document, or renaming or moving any of its contents. In the case of pads, even a user with edit rights loading the document without making any changes would count as activity.

To summarise:

Some of your data could have been archived if you visited https://CryptPad.fr between May 28th and June 13th (2019) and have one or more shared folders in your CryptDrive which have not been modified within the last 90 days.

Checking if you were affected

It should be fairly easy to tell if your account was affected by opening your CryptDrive. Affected shared folders will be visible in the tree on the left of your drive because they’ll have lost their titles, as highlighted in red below:

archived shared folders

How we’re going to handle this

As I said, none of the data was deleted, just archived. It’s still on the same server that hosts the rest of our database, it’s just been moved to a different location to make it inaccessible.

I’ve already restored all of the archived files except for 237 cases. Affected users who visited CryptPad.fr between the removal and restoration of the data would have automatically created a new folder in the same location as the old one, and that complicates things for us. Since we don’t know whether they might have put new documents in that folder in the meantime, it’s dangerous for us to overwrite the new data with the old.

It’s going to take us a few days to figure out if we can use some fancier methodology to identify what data we can safely reinstate. In the meantime, we’ve already fixed the underlying issues that caused this data to be miscategorized, and developed some new tooling for safely diagnosing and restoring archived data.

Since we know that those affected by this error visited since our last release day and that they had content older than 90 days, we assume they’re going to come back to the platform. If you do come back and see something resembling the image above, please do let us know by emailing us at contact@cryptpad.fr. We can manually restore any files that haven’t already been restored.

I’m very sorry for any inconvenience this might have caused and I’m grateful that the damage wasn’t worse. I’ll take this as an opportunity to prove my commitment to protecting user data, whether it be from surveillance or from my own mistakes.

Post-mortem

With all the practical details addressed for those who only have the time to make sure their own data is safe, I’ll go further into the specifics for anyone who might be interested.

The pinning race condition (May 28th)

On May 28th we released CryptPad Xenops. It introduced notifications for users through the use of something we’ve been calling “encrypted mailboxes”. Each registered user now has a mailbox through which any other user can send messages, currently for friend requests, and soon for other features.

While we were implementing the function which loads new messages from this mailbox we introduced a bug which caused some other functions to be executed in the wrong order. I personally reviewed the code but didn’t see the bug.

Registered users are able to send instructions to the server not to delete data that is relevant to them. We call this process “pinning” and it’s done every time a user uses the service.

What should have happened is this: the client loads your drive, then loads your shared folders, then pins all of the contained files. Instead, it loaded the drive and then started loading the shared folders and the pinning process in parallel. This caused what’s called a race condition: two operations run concurrently, and depending on timing they can complete in different orders.

Race conditions are especially annoying because sometimes they only occur under certain circumstances, so these bugs tend to slip past basic testing unless you already know what you’re looking for. In our case, losing the race meant that files weren’t pinned and consequently the server didn’t have an accurate notion of which data was worth keeping.
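For anyone who would like to picture the bug more concretely, here is a deliberately simplified sketch of the ordering problem. The function names are made-up stand-ins, not CryptPad’s real client code.

```typescript
// Stand-in declarations so the sketch type-checks; none of these are CryptPad's real APIs.
type Drive = { fileIds: string[] };
declare function loadDrive(): Promise<Drive>;
declare function loadSharedFolders(drive: Drive): Promise<void>; // merges folder contents into the drive
declare function collectFileIds(drive: Drive): string[];
declare function pinAll(fileIds: string[]): Promise<void>;       // tells the server to keep these files

// What the client effectively did: the shared folders were still loading
// when the pin list was built, so their contents could be left out.
async function loadAndPinBuggy(): Promise<void> {
    const drive = await loadDrive();
    loadSharedFolders(drive);            // not awaited: runs in parallel with the next line
    await pinAll(collectFileIds(drive)); // races with the line above
}

// What should have happened: strictly sequential, so the pin list is complete.
async function loadAndPinFixed(): Promise<void> {
    const drive = await loadDrive();
    await loadSharedFolders(drive);
    await pinAll(collectFileIds(drive));
}
```

When the buggy version loses the race, nothing crashes and nothing looks wrong to the user; the server simply never learns that those files are still wanted, which is exactly why the problem slipped past review and testing.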

Running out of space (June 3rd)

Several months ago a user contacted us saying that data had disappeared from their drive. This was quite scary from our perspective: for every user who contacts us about an issue, we can generally assume that there are many more who had the same problem but didn’t say anything.

We spent several days debugging their problem and developing tools which would analyze the history of their drive without exposing any of their encrypted content to us. In the end, it turned out that the files didn’t ever exist in their history, so it wasn’t a matter of us losing that data. Nevertheless, the situation was stressful enough that we turned off all of our scripts for deleting inactive data until we could sort out a more reliable methodology for handling data.

With that regular process not in place, and with increasingly more users visiting our service, our database continued to grow at an accelerating pace. On June 3rd we started receiving automated emails from XWiki’s infrastructure services warning that we were down to 20% of our disk space. We had been meaning to handle this problem for some time, but with 33 emails arriving in our inboxes each day we finally decided to prioritize it.

Replacing the race condition (June 6th)

After the Xenops release we noticed an error that was occurring in our browser consoles fairly regularly and decided to debug it. We tracked it down and fixed it, but since we weren’t looking for the other race condition described above, we managed to change the code in such a way that a functionally identical race condition was still present. We fixed one issue, but pads still weren’t being pinned reliably.

Incorrect data archival (June 13th)

Having proceeded with fixing a variety of other bugs, I turned my attention back to solving our storage issue. Deleting data hadn’t become any less scary than it had always been so I proceeded with caution, implementing an archival system that would move inactive data to what we termed “cold storage” for a set period before removing it permanently.

I implemented some code for iterating over our complete database and used that to create a script for checking the most recent modifications to user data. I read through it a number of times, tested it on my local database and had my colleague review it and test it on his machine. Before using it on our production database I made sure to also write and test a script that would restore archived files in case anything went wrong.
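To give a sense of the shape of such a script, here is a minimal sketch of an “archive instead of delete” pass. The directory names are hypothetical and the activity check is reduced to a file’s modification time; the real criteria also involve the pinning information described earlier.

```typescript
// A minimal sketch of an "archive instead of delete" pass over a datastore.
// Directory names are hypothetical and the mtime-based activity check is a
// simplification of the real criteria.
import * as fs from 'fs';
import * as path from 'path';

const DATA_DIR = './datastore';        // hypothetical live database directory
const ARCHIVE_DIR = './data-archive';  // hypothetical "cold storage" directory
const INACTIVITY_LIMIT_MS = 90 * 24 * 60 * 60 * 1000; // 90 days

const now = Date.now();
fs.mkdirSync(ARCHIVE_DIR, { recursive: true });

for (const name of fs.readdirSync(DATA_DIR)) {
    const src = path.join(DATA_DIR, name);
    const stats = fs.statSync(src);
    if (!stats.isFile()) { continue; }
    if (now - stats.mtimeMs > INACTIVITY_LIMIT_MS) {
        // Move rather than delete, so a companion script can restore the file later.
        fs.renameSync(src, path.join(ARCHIVE_DIR, name));
        console.log(`archived ${name}`);
    }
}
```

The companion restore script is essentially the same loop in reverse, which is why writing and testing it before touching the production database was cheap insurance.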

I think I must have sat in front of my laptop and stared at my screen for between five and ten minutes before I hit enter on the command to run the script. I had the code for the script on another monitor, and I double-checked it before deciding to proceed. I reloaded my drive to make sure everything was still there once it finished running, and it was. After twenty minutes or so of testing everything seemed alright, so I went on with my day.

Later on we finally noticed that there was a problem with one of our user accounts, specifically with a shared folder having disappeared. We stayed at the office late into the evening to figure out what had happened, and ended up tracking the problem to the pinning logic before deciding to follow up on it in the morning.

Final debugging and restoration (June 14th)

With as restful a night as I could manage under the circumstances, I came back to the office this morning with a bit of perspective on the issue. I wrote up a pad which collected all the information we had into one place, identifying the circumstances under which we believed the problem could occur.

I reviewed the script which restored archived files, making sure that it would not overwrite any user data if utilized. My colleague implemented a fix for the race condition which contributed to the pinning issue, which I deployed as soon as I could review it.

After writing a few more scripts I was able to determine the number of shared folders which had been replaced with conflicting entries with the same identifiers (237). Knowing this number allowed me to determine how to handle the issue. If the number was significantly smaller it might have been easier to handle, but the order of magnitude is such that we’ll have to figure out an automated way to deal with the issue or else spend the next few weeks responding to emails and manually recovering files.

With a better grasp on the situation and with some confidence that it wasn’t the database processing scripts which were incorrect, I restored the archived files with the exception of those which conflicted with the production database.

Conclusion

If I’ve learned anything in my time working on CryptPad, it’s that I should appreciate the reasons why the majority of the software industry doesn’t work with encrypted databases as we do. Even on a good day it can be a harder job than it would otherwise be. On a day like today we end up having to reason about what the client-side code would have done under various circumstances and think about what information we can access.

In any case, I’m very happy that we decided to turn off our deletion scripts months ago. Had they still been active, this relatively mild pinning and archival bug would have resulted in data loss.

While we can tell that 237 shared folders were affected, we still have to think about how the absence of that data would be handled by the code for users’ CryptDrives. To further complicate things, we have to think beyond what our code would do and consider what users might have done in reaction to what they saw. If they saw and removed the now-empty shared folders in their drive, they no longer have the encryption keys to decrypt them, even though we’ve now restored the underlying data. Because we’ve spent so much time trying to protect our users’ privacy, we can’t actually ascertain whether they’ve interacted with this part of their drive at all.

On one hand, it makes my life that much more stressful to have to figure out the answers to these problems. On the other, I’m hopeful that by doing this work now I’ll help pave the way for more developers to create services which offer similar protection for their users’ data.

As stated above, if this particular mistake affected you, don’t hesitate to contact us. Otherwise, I can only hope that the way we handle it ensures that you continue to trust us with your data.